Infrastructure
Replication Across Failure Domains, Not Just Machines
Three copies of your data sitting in one room are one fire away from zero copies. What protects you isn't the replica count; it's whether the copies live in things that can fail independently.
“We keep three copies” sounds like safety until you ask where the three copies live. If they’re three disks in one server, you’re one server away from zero. If they’re three servers in one rack, one rack. Three racks in one room, one room. The number of copies is the easy, comforting metric; the thing that actually determines what disaster you survive is whether the copies sit in failure domains that can fail independently. I had to work through this for a replicated database and its backups, and the reasoning generalizes to anything you’re trying to keep.
A replica count answers the wrong question
Replication factor protects you against losing a machine. That’s real and useful: a node dies, the others carry on. But it says nothing about losing the thing the machines share: the rack, the power feed, the site, the region, the cloud account. If all your replicas share a failure domain, your replication factor measures resilience to a failure that isn’t the one that takes you down.
Ask not “how many copies do I have,” but “what is the smallest single event that destroys all of them at once.” That event is your real exposure.
So the design question isn’t “how many replicas.” It’s “across which failure domains are they spread.” Three copies across three independent sites survive a site loss; three copies in one site survive a node loss and nothing bigger. Same count, wildly different guarantee.
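One way to make that question mechanical is to write each copy’s location as a path from the largest failure domain down to the smallest, then find the deepest level every copy shares. The sketch below is illustrative only: the level names and the example placements are assumptions, and a real environment may have shared things (a power feed, a cloud account) that do not nest this neatly.

```python
# Illustrative hierarchy, largest failure domain first. Hypothetical names;
# adapt to whatever your environment actually shares.
LEVELS = ("region", "site", "room", "rack", "server", "disk")


def smallest_shared_domain(placements):
    """Return (level, name) of the smallest failure domain that contains
    every copy, or None if the copies share nothing at any level.

    Each placement is a tuple of names ordered like LEVELS.
    """
    shared = None
    # zip(*placements) groups the names level by level across all copies.
    for level, names in zip(LEVELS, zip(*placements)):
        if len(set(names)) != 1:
            break  # copies diverge here, so nothing deeper is shared
        shared = (level, names[0])
    return shared


three_disks_one_server = [
    ("eu", "site-a", "room-1", "rack-1", "srv-1", "disk-0"),
    ("eu", "site-a", "room-1", "rack-1", "srv-1", "disk-1"),
    ("eu", "site-a", "room-1", "rack-1", "srv-1", "disk-2"),
]
three_sites = [
    ("eu", "site-a", "room-1", "rack-1", "srv-1", "disk-0"),
    ("eu", "site-b", "room-1", "rack-1", "srv-1", "disk-0"),
    ("eu", "site-c", "room-1", "rack-1", "srv-1", "disk-0"),
]

print(smallest_shared_domain(three_disks_one_server))  # ('server', 'srv-1')
print(smallest_shared_domain(three_sites))             # ('region', 'eu')
```

Both layouts have a replica count of three. The first is exposed to anything that takes out one server; the second survives a site loss and is exposed only to something that takes out the whole region.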
Your backup has the same problem, harder
This bites hardest with backups, because a backup’s whole job is to survive the loss of the thing it’s backing up. A backup that lives in the same site as the data it protects shares that site’s fate — the flood, fire, or power event that takes the primary takes the backup with it. You didn’t make a backup; you made a second copy in the same blast radius.
The fix is the same principle: the backup copy has to live in a failure domain uncorrelated with the source. Another site, another region, an object store with cross-region replication, off-site storage — anywhere whose bad day isn’t the same as the primary’s bad day. “Off-box” is the floor; “off-site” is what actually buys you disaster recovery. If you can’t answer “which independent failure domain holds the backup,” you don’t have a DR story yet, you have a convenience copy.
“Complete” depends on who’s asking
Here’s a subtlety that surprised me and is worth internalizing. With per-site replication, each site can hold a complete logical copy of the data — every record is present, because the within-site replication already guarantees it. But your backup tooling might judge “completeness” against the whole cluster’s membership, and from one site’s storage it only sees that site’s nodes. So the tool reports “incomplete” even though no data is missing — it’s a node-count label measured against the full ring, not a statement about coverage.
That distinction matters because it changes what “incomplete” should make you do. Sometimes it means “you’re missing data” (act now). Sometimes it means “this store holds a full copy but not every node’s slice in one place” (fine, if your restore plan matches). Knowing which one you’re looking at is the difference between a false alarm and a real gap. Read the definition behind the metric before you trust the metric.
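The two meanings of “complete” are easier to tell apart when they are written as two separate checks. This is a toy model, not the logic of any particular backup tool: the node names, the keys, and the idea of a store as a mapping from node to the set of keys it backed up are all assumptions made for illustration.

```python
def node_complete(store, ring_nodes):
    """Node-count view: does the store hold a backup from every ring node?"""
    return set(store) >= set(ring_nodes)


def data_complete(store, all_keys):
    """Coverage view: do the backups in the store, taken together,
    contain every key in the dataset?"""
    covered = set().union(*store.values())
    return covered >= set(all_keys)


# Hypothetical four-node ring spread over two sites, where replication
# already guarantees each site holds every key.
ring_nodes = ["a1", "a2", "b1", "b2"]
all_keys = {"k1", "k2", "k3", "k4"}

# Site A's backup store only ever sees site A's nodes.
site_a_store = {
    "a1": {"k1", "k2"},
    "a2": {"k3", "k4"},
}

print(node_complete(site_a_store, ring_nodes))  # False
print(data_complete(site_a_store, all_keys))    # True
```

The same store fails the first check and passes the second. A tool that reports only the first result will say “incomplete” here, which is a false alarm as long as the restore plan is per-site.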
Choose the topology from your restore model
The clean way to decide is to start from how you intend to restore, because the backup topology is downstream of that:
- Per-node / per-site restore is your normal recovery path? Then per-site stores are fine — each site holds a complete copy, each node restores from its own site, and you get failure-domain isolation for free (one site’s backups don’t depend on the other’s). The cost is you give up one-command whole-cluster restore and you have to write a per-site runbook.
- You want one-command, whole-cluster, “restore everything from one place”? Then every node’s backup needs to land in one store all of them reach. The cleanest version is replicated object storage that also happens to be off-site, so you get the single-pane restore and survive a site loss in one decision.
- Stuck on shared file storage you already run? You can bridge the gap by replicating the per-site stores into one consolidated copy on a schedule — you keep the storage you have and regain whole-cluster restore, at the cost of a sync job to operate and watch.
There’s no universally right answer; there’s the answer that matches the disaster you actually need to survive and the restore you actually intend to run. Decide those first.
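The three options above amount to a small decision table keyed on the restore model. The sketch below only restates that reasoning in code; the function name, parameter names, and labels are hypothetical and chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    name: str
    tradeoff: str


def choose_topology(want_whole_cluster_restore, must_keep_file_shares):
    """Pick a backup topology from the intended restore model."""
    if not want_whole_cluster_restore:
        # Per-node / per-site restore is the normal recovery path.
        return Topology(
            "per-site stores",
            "no one-command whole-cluster restore; needs a per-site runbook",
        )
    if must_keep_file_shares:
        # Keep the existing shares and consolidate them on a schedule.
        return Topology(
            "per-site stores plus scheduled consolidation",
            "a sync job to operate and watch",
        )
    # Whole-cluster restore with freedom to choose the backend.
    return Topology(
        "one replicated off-site object store",
        "every node must be able to reach the one store",
    )


print(choose_topology(True, False).name)
# one replicated off-site object store
```

The inputs are the point: both are statements about the restore you intend to run, and neither is a statement about storage products.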
If you use shared network storage, get the boring parts right
A quick field note, since network file shares are a common and reasonable backend: they work, with caveats that are easy to miss and annoying to debug later. Identity matters — the process writing backups needs consistent ownership across every node, or permissions drift. File locking over the network is weaker than local, so stagger jobs (a randomized delay) rather than firing every node at the same instant onto a shared index. A “hard” mount blocks rather than silently corrupting when the server is unreachable, which is the safe choice but means a stuck server hangs the backup — so pair it with monitoring that notices a hung or aging run. And none of it changes the headline: a share in the same building as the data is not disaster recovery, no matter how reliable the share is.
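Two of those caveats, staggering and watching for hung or aging runs, are simple enough to sketch. The helper names and thresholds below are hypothetical, and the timestamps are assumed to come from wherever your backup job records its start and last success. The checks are meant to run separately from the backup job itself, since a job blocked on an unreachable share cannot report on its own state.

```python
import random
import time


def start_delay_seconds(max_delay_seconds, rng=random):
    """Pick a random delay so nodes do not all hit the shared index at once."""
    return rng.uniform(0, max_delay_seconds)


def run_is_hung(started_at, max_runtime_seconds, now=None):
    """True if an unfinished run has been going longer than allowed."""
    now = time.time() if now is None else now
    return now - started_at > max_runtime_seconds


def backup_is_aging(last_success_at, max_age_seconds, now=None):
    """True if the newest successful backup is older than allowed."""
    now = time.time() if now is None else now
    return now - last_success_at > max_age_seconds
```

A node would sleep for `start_delay_seconds(600)` before starting its backup, while a separate monitor calls `run_is_hung` and `backup_is_aging` against the recorded timestamps and alerts when either returns True.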
The throughline is one question I now ask of anything I’m trying to keep: what single event kills every copy at once, and am I okay with that event? Replica counts feel like safety; independent failure domains are safety. It’s the same operations-as-design instinct from the private cloud notes and the homelab the-hard-way lessons. If you’re mapping your own blast radius and want a second set of eyes, I’m easy to reach.