Pakkit.net

Systems Thinking

A Zero-Downtime Upgrade Is Quorum Math and a One-Way Door

Upgrading a replicated datastore without downtime comes down to two things: keeping a quorum alive while you take down one node at a time, and knowing exactly where rollback stops being "revert" and becomes "restore from backup."

  • Systems Thinking
  • Databases
  • Reliability
  • Automation

“Zero-downtime upgrade” sounds like a feature you enable. It’s not. It’s a property you earn by respecting two constraints that have nothing to do with the upgrade button and everything to do with how a replicated system stays consistent. Get them right and the cluster never blinks. Get either one wrong and you’ve either taken an outage or walked through a door you can’t walk back out of. I’ve automated this kind of upgrade on a replicated datastore — Cassandra is my go-to example here — and the lesson lives at the level of distributed-systems fundamentals, not any one product.

The first constraint is quorum, and it’s just arithmetic

A replicated store keeps multiple copies of each piece of data so it can survive losing some of them. The number that matters is the quorum: the count of replicas that have to acknowledge a read or write for it to succeed, which is a strict majority of the copies. With a replication factor of three, a quorum is two (half of three, rounded down, plus one). That single fact dictates the entire upgrade procedure.

If I take one node down to upgrade it, the other two copies are still up — two out of three satisfies the quorum, so reads and writes keep succeeding the whole time. The database layer experiences exactly zero downtime. If I take two down at once, I’ve dropped below quorum, and now operations that need two agreeing replicas can’t complete. The upgrade just became an outage.

One node at a time isn’t caution. It’s the largest number you can remove while the math still works.

So “upgrade serially, one node at a time, wait for each to come fully back before touching the next” isn’t a conservative style choice — it’s the direct consequence of the quorum arithmetic. The procedure falls out of the constraint. Once you see it that way, you stop treating the serial rollout as slow and start treating it as correct.
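The arithmetic is small enough to write down. Here is a minimal Python sketch of it; the function names are mine, chosen for illustration, and the model assumes a simple majority quorum over a single replica set.

```python
def quorum(replication_factor: int) -> int:
    """Smallest strict majority of replicas: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def max_replicas_down(replication_factor: int) -> int:
    """How many replicas can be offline while a quorum is still reachable."""
    return replication_factor - quorum(replication_factor)


rf = 3
print(f"RF={rf}: quorum={quorum(rf)}, can lose={max_replicas_down(rf)}")
# RF=3: quorum=2, can lose=1
```

With a replication factor of three the tolerance comes out to exactly one replica, which is where the one-node-at-a-time pace comes from.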

The database staying up doesn’t mean your app does

Here’s the boundary that trips people up: the cluster can hold quorum perfectly and your application can still see errors during the rollout, because of how the client is configured. If your app pins to a single node, or demands a consistency level that can’t be satisfied with one replica missing, then the ten minutes that node spent down are an outage for the app, even though the cluster was fine.

So zero-downtime is a two-sided contract. The server side keeps quorum; the client side has to use a topology-aware connection policy across multiple nodes and a consistency level that tolerates one replica being absent. I call this out explicitly because the two halves have different owners: the upgrade automation can guarantee the cluster half and cannot guarantee the app half. Naming where your guarantee ends is part of doing the job honestly. An upgrade that’s “zero downtime” only on the side you control isn’t zero downtime to the user.
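The client half of the contract can be checked on paper before the rollout. The sketch below models only three Cassandra-style consistency levels (ONE, QUORUM, ALL) and asks whether a request can still be served with one replica offline. The function names are hypothetical, and this is a simplified model rather than any driver's API.

```python
def required_replicas(consistency_level: str, replication_factor: int) -> int:
    """Replicas that must respond for a request at this consistency level."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    if consistency_level not in levels:
        # Only three levels are modelled in this sketch.
        raise ValueError(f"unmodelled consistency level: {consistency_level}")
    return levels[consistency_level]


def survives_rolling_restart(consistency_level: str,
                             replication_factor: int,
                             replicas_down: int = 1) -> bool:
    """True if requests can still succeed with `replicas_down` replicas offline."""
    live = replication_factor - replicas_down
    return live >= required_replicas(consistency_level, replication_factor)


for level in ("ONE", "QUORUM", "ALL"):
    print(level, survives_rolling_restart(level, replication_factor=3))
```

In this model, a client at ALL with a replication factor of three fails the check, and so does QUORUM with a replication factor of two. Both are app-side outages that the cluster-side automation cannot prevent.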

The second constraint is the one-way door

Every risky procedure has a moment where “undo” changes character, and the discipline is knowing exactly where that moment is. In a major datastore upgrade it’s vivid.

After you swap the binaries on a node, the data files on disk get rewritten into the new version’s on-disk format. Before that rewrite, rollback is cheap: revert the node to its pre-upgrade snapshot and you’re back. After the rewrite, the old version literally cannot read the new on-disk format — so rollback is no longer “revert,” it’s “restore the whole thing from backup,” which is slower, riskier, and a genuinely different incident.

That step is a one-way door. The cost of being wrong jumps discontinuously the moment you walk through it. The practical response is to treat the door with ceremony it doesn’t strictly demand minute-to-minute:

  • Take a fresh snapshot immediately before the irreversible step — ideally two layers, a machine-level snapshot and a data-level one, because they fail differently.
  • Sequence the rewrite as its own phase, run only after every node is happily on the new version, so you’re not compounding “is it upgraded?” with “is it rewritten?”
  • Know, out loud, which side of the door you’re on at every point in the runbook. “We can still revert cheaply” and “we are now restore-from-backup only” are different operating postures and the team should know which one is live.
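One way to keep the posture explicit is to make it a value the runbook can print. This is a sketch under assumptions: the phase names, the `may_cross_door` helper, and the choice to require both snapshot layers (the list above only says "ideally") are all mine.

```python
from enum import Enum


class Phase(Enum):
    SNAPSHOT_TAKEN = 1
    BINARIES_UPGRADED = 2
    DATA_REWRITTEN = 3  # the one-way door is crossed on entering this phase


class Rollback(Enum):
    REVERT_TO_SNAPSHOT = "revert to pre-upgrade snapshot"
    RESTORE_FROM_BACKUP = "restore from backup"


def rollback_posture(phase: Phase) -> Rollback:
    """Which kind of rollback is available at this point in the runbook."""
    if phase.value >= Phase.DATA_REWRITTEN.value:
        return Rollback.RESTORE_FROM_BACKUP
    return Rollback.REVERT_TO_SNAPSHOT


def may_cross_door(node_versions: dict[str, str], target_version: str,
                   machine_snapshot: bool, data_snapshot: bool) -> bool:
    """Allow the rewrite only if every node is upgraded and both snapshots exist."""
    all_upgraded = bool(node_versions) and all(
        version == target_version for version in node_versions.values()
    )
    return all_upgraded and machine_snapshot and data_snapshot


print(rollback_posture(Phase.BINARIES_UPGRADED).value)
print(rollback_posture(Phase.DATA_REWRITTEN).value)
```

Printing the posture at the top of every runbook step is cheap, and it gives the team an unambiguous answer to "which side of the door are we on?"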

Most painful production stories I’ve heard are someone discovering after the fact that they’d crossed a one-way door they didn’t know was there. Mapping those doors in advance is most of the safety.

Gate every phase and make the automation check itself

Because the steps have a strict order and different blast radii, I don’t let the automation run as one big sweep. Each phase is gated independently, and the automation asserts its own preconditions before it acts — which is the same panic-button philosophy applied to a multi-step migration:

  • Exactly one phase runs per invocation, so phases can’t accidentally run together or out of order.
  • The phase that restarts a node refuses to start unless the node is currently healthy and a snapshot exists. No snapshot, no restart.
  • After a node restarts, the automation verifies it actually came up on the new version before releasing the next node — it doesn’t assume the upgrade took, it checks. Forward progress is confirmed, not hoped for.
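The three gates above might look roughly like this in code. Treat it as an illustrative sketch, not the actual automation: the phase names, the `Node` fields, and the `do_restart` hook are hypothetical stand-ins for whatever your tooling uses to act on and re-inspect a node.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

PHASES = ("snapshot", "restart", "rewrite")  # hypothetical phase names


class GateError(Exception):
    """Raised when a gate refuses to let the automation proceed."""


@dataclass
class Node:
    name: str
    healthy: bool
    has_snapshot: bool
    version: str


def select_phase(requested: Sequence[str]) -> str:
    """Accept exactly one known phase per invocation."""
    if len(requested) != 1:
        raise GateError(f"expected exactly one phase, got {len(requested)}")
    phase = requested[0]
    if phase not in PHASES:
        raise GateError(f"unknown phase: {phase}")
    return phase


def restart_node(node: Node, target_version: str,
                 do_restart: Callable[[Node], None]) -> None:
    """Restart one node, checking preconditions before and the result after."""
    if not node.healthy:
        raise GateError(f"{node.name}: not healthy, refusing to restart")
    if not node.has_snapshot:
        raise GateError(f"{node.name}: no snapshot, refusing to restart")
    do_restart(node)  # hypothetical hook; expected to refresh the node's state
    if not node.healthy or node.version != target_version:
        raise GateError(f"{node.name}: did not come back healthy on {target_version}")


def rolling_restart(nodes: Sequence[Node], target_version: str,
                    do_restart: Callable[[Node], None]) -> None:
    """Serial rollout: the first failed gate stops everything after it."""
    for node in nodes:
        restart_node(node, target_version, do_restart)
```

Because each gate raises instead of returning a status, a failed check on one node halts the loop before the next node is touched. That keeps the serial rollout from silently taking a second replica down.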

Automation that validates its own preconditions is the difference between a tool you can run on a live cluster and one you can only run while holding your breath.

The temporary state has its own rulebook

While the rollout is in flight, the cluster is straddling two versions, and that mixed-version window is its own operating mode with its own rules. Until every node is on the new version: no repairs, no topology changes (don’t add, remove, or replace nodes), and no schema changes. The system is in a transitional state that’s safe to pass through but not safe to do other risky things during.
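That rulebook is easy to encode as a guard that other tooling consults before acting. A minimal sketch, assuming you can collect each node's reported version into a dictionary; the operation names and the function names are hypothetical labels.

```python
# Operations the rulebook forbids while the cluster straddles two versions.
BLOCKED_WHILE_MIXED = frozenset({
    "repair",
    "add_node",
    "remove_node",
    "replace_node",
    "schema_change",
})


def is_mixed_version(node_versions: dict[str, str]) -> bool:
    """True if the nodes report more than one distinct version."""
    return len(set(node_versions.values())) > 1


def operation_allowed(operation: str, node_versions: dict[str, str]) -> bool:
    """Refuse blocked operations for as long as the cluster is mixed-version."""
    if is_mixed_version(node_versions) and operation in BLOCKED_WHILE_MIXED:
        return False
    return True


in_flight = {"node1": "new", "node2": "old", "node3": "old"}
print(operation_allowed("repair", in_flight))         # False
print(operation_allowed("schema_change", in_flight))  # False
```

The guard answers from the observed versions rather than from a flag someone has to remember to clear, so it lifts on its own once the last node reports the new version.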

That generalizes well beyond databases: any migration creates a temporary in-between state, and the in-between state usually has constraints the start and end states don’t. Writing those down — “here’s what you must not do while we’re half-migrated” — is part of designing the migration, not an afterthought.

Put it together and a zero-downtime upgrade stops being a leap of faith. It’s quorum arithmetic telling you the pace, a clear-eyed map of where the one-way doors are, gated automation that checks itself, and an explicit rulebook for the messy middle. None of it is exotic; all of it is the difference between an upgrade and an incident. The same respect-the-measurement, respect-the-constraint mindset runs through how I approach benchmarking and the rest of the private cloud work. If you’re planning a rolling upgrade and want a second set of eyes on where your one-way doors are, I’m easy to reach.