Pakkit.net

Infrastructure

Restore in Another Failure Domain, or It Isn't Disaster Recovery

A backup you can only restore in the same place it came from isn't disaster recovery — it's a convenience copy. Real DR means rehearsing a restore into a different failure domain, with everything that requires.

  • Infrastructure
  • Backups
  • Disaster Recovery
  • Reliability

There’s a comfortable lie a lot of backup setups tell: “we have backups, so we’re covered for disaster.” Then you look closely and the backups live in the same place as the thing they’re backing up, restore only into that same environment, and quietly depend on tooling and keys that would be gone in the exact disaster you’re insuring against. That’s not disaster recovery. It’s a convenience copy for the small mistakes. Real DR is the ability to restore into a different failure domain than the one that failed — and the only way to know you can is to rehearse it.

A backup in the same place isn’t insurance against that place

The whole premise of disaster recovery is that a failure domain (a datacenter, a region, a cluster, a site) goes away. If your backups, your restore process, and your dependencies all live inside that same domain, then the event that takes out production takes out your recovery with it. You’ve protected against fat-fingering a table; you have not protected against losing the building.

So the test of a backup strategy isn’t “do we have backups.” It’s “can we reconstruct service somewhere the disaster didn’t reach.” If the answer requires the failed domain to still be partly alive, it’s not DR — it’s a faster way to fix small problems, which is genuinely useful but is not the thing you told yourself you had.
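One way to make that question concrete is to write down everything a restore touches and tag each item with the failure domain it lives in. A minimal sketch, where the Dependency type, the lost_with helper, and the inventory entries (dc-a, dc-b) are all hypothetical names made up for illustration, not part of any real tool:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str
    domain: str  # failure domain where this dependency lives (hypothetical labels)


def lost_with(failed_domain, dependencies):
    """Return the restore dependencies that disappear along with failed_domain."""
    return [d for d in dependencies if d.domain == failed_domain]


# Hypothetical inventory: production runs in dc-a, recovery is meant to happen in dc-b.
inventory = [
    Dependency("backup archive", "dc-a"),
    Dependency("decryption key", "dc-a"),
    Dependency("restore script", "dc-b"),
]

lost = lost_with("dc-a", inventory)
print([d.name for d in lost])  # prints ['backup archive', 'decryption key']
```

If that list is non-empty for the domain production runs in, the restore still depends on the thing that failed, and by the test above it isn’t DR yet.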

So I built a second domain to restore into

To actually test cross-domain restore for a distributed database, I stood up two separate environments in my lab and treated them as two distinct datacenters — one to fail, one to recover into. The point was to stop reasoning about cross-datacenter restore and start doing it: take a backup in one, restore it into the other, and find out what breaks. Because something always breaks, and it’s never the part you worried about.

You don’t have a disaster recovery plan. You have a disaster recovery hypothesis, until the first time you actually restore into a domain that wasn’t there when the backup was made.

Building the second domain wasn’t the interesting part. Restoring across the gap was — because that’s where all the hidden same-domain assumptions surface, the ones that are invisible as long as backup and restore happen in the same place.

The assumptions that only break across the gap

Cross-domain restore flushes out dependencies you never knew you had, because they were always trivially satisfied at home:

  • Keys and secrets. If the backup is encrypted, the decryption key has to exist in the recovery domain. If that key lived only in the domain that just died, your backup is now a perfectly secure brick.
  • Tooling. The restore process itself needs its tools, credentials, and config present on the far side. “Run the restore script” assumes the script, and everything it calls, exists where you’re recovering.
  • Topology and addressing. A distributed system that restores into a differently-shaped environment has to cope with new addresses, new node identities, new network paths. Restores that assume the original topology fail in confusing ways.
  • Bandwidth and time. Moving real data across the gap takes real time. A restore that finishes in minutes within a domain can take hours across one, and “hours” might blow your recovery time objective.

Every one of these is invisible in a same-domain test and obvious in a cross-domain one. That’s the entire reason the rehearsal has to cross the gap to count.
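The four bullets above translate fairly directly into checks you can run in the recovery domain before a rehearsal. This is a rough sketch, not a real tool: the preflight function, its parameters, and the Finding type are assumptions for illustration, and the time check covers only raw transfer time, ignoring the restore and replay work that comes after it.

```python
import os
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Finding:
    check: str   # which assumption broke: keys, tooling, topology, or time
    detail: str


def preflight(key_path, required_tools, config_text, old_domain_markers,
              backup_bytes, link_bytes_per_sec, rto_seconds):
    """Run in the recovery domain. Returns findings; an empty list means no check failed."""
    findings = []

    # Keys and secrets: the decryption key must be readable on this side of the gap.
    if not (Path(key_path).is_file() and os.access(key_path, os.R_OK)):
        findings.append(Finding("keys", f"decryption key not readable at {key_path}"))

    # Tooling: every program the restore calls must be on PATH here.
    for tool in required_tools:
        if shutil.which(tool) is None:
            findings.append(Finding("tooling", f"{tool} not found on PATH"))

    # Topology and addressing: the restore config must not point at the failed domain.
    for marker in old_domain_markers:
        if marker in config_text:
            findings.append(Finding("topology", f"config still references {marker}"))

    # Bandwidth and time: moving the data alone must fit inside the recovery objective.
    transfer_seconds = backup_bytes / link_bytes_per_sec
    if transfer_seconds > rto_seconds:
        findings.append(Finding(
            "time",
            f"transfer alone needs {transfer_seconds:.0f}s, budget is {rto_seconds}s",
        ))

    return findings
```

As a worked example of the last check: 1 TB (10^12 bytes) over a link that sustains 100 MB/s (10^8 bytes per second) is 10,000 seconds of transfer, a little under three hours, before any restore work begins. Against a one-hour objective, that is a finding on its own.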

The restore is the deliverable; the backup just enables it

This is the same drum I keep beating, with a DR twist: a backup is a claim, and a restore is the proof — but for disaster recovery, the proof only counts if it happens somewhere else. A restore you’ve successfully run into the same environment proves you can recover from a deletion. It does not prove you can recover from losing the environment. Those are different guarantees, and conflating them is how organizations discover at the worst possible moment that their “DR” was never tested against an actual D.

So the deliverable isn’t the backup job reporting success. It’s a rehearsed, timed, cross-domain restore that produced working service in a domain the original disaster couldn’t touch — keys present, tooling present, topology handled, inside your time budget.
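To make “timed” and “inside your time budget” checkable, the rehearsal can be wrapped in a small harness that times each step and compares the total to the budget. This is a sketch under assumptions: rehearse and RehearsalReport are hypothetical names, and the steps are whatever callables perform your actual restore.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RehearsalReport:
    budget_seconds: float = 0.0
    timings: dict = field(default_factory=dict)   # step name -> elapsed seconds
    failures: dict = field(default_factory=dict)  # step name -> error text

    @property
    def total_seconds(self):
        return sum(self.timings.values())

    @property
    def passed(self):
        # Passing means every step succeeded AND the whole run fit the budget.
        return not self.failures and self.total_seconds <= self.budget_seconds


def rehearse(steps, budget_seconds, clock=time.monotonic):
    """Run (name, callable) steps in order, timing each one.

    Stops at the first failing step, since later steps depend on earlier ones.
    """
    report = RehearsalReport(budget_seconds=budget_seconds)
    for name, step in steps:
        started = clock()
        try:
            step()
        except Exception as exc:
            report.failures[name] = f"{type(exc).__name__}: {exc}"
        finally:
            report.timings[name] = clock() - started
        if name in report.failures:
            break
    return report
```

A run would look like rehearse([("fetch", fetch_backup), ("decrypt", decrypt), ("restore", restore)], budget_seconds=4 * 3600), with those three callables standing in for your own steps. The report, not the backup job’s exit code, is the thing to file as evidence.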

Make the rehearsal routine

The fix is unglamorous and it’s a practice, not a product: actually run the cross-domain restore, on a schedule, and treat every assumption it breaks as a finding to fix before the real event. The first time you do it, expect it to fail on something small and stupid — a missing key, a tool that wasn’t there, an address hard-coded somewhere. That’s not the rehearsal going badly; that’s the rehearsal doing its entire job, on a day when failing is free.

This sits right next to two things I’ve written: that backups are really a restore problem, and that your backup job needs a smoke alarm so a silent failure doesn’t rot for months. The DR-specific addition is the where: recovery also has to survive the loss of a whole failure domain, and you only know it can once you’ve restored into a domain that wasn’t there when the backup was made. If you want to pressure-test whether your backups are actually DR, I’m easy to reach.