Pakkit.net

Backups Are a Restore Problem

The size of your data is the least interesting thing about backing it up — what matters is whether you can restore, whether you'd notice a silently failing backup, and whether the thing you backed up was even consistent.

Someone asked me whether a particular dataset was “too big to back up properly,” and the number they were worried about was the row count. It’s the wrong number to worry about — and that wrongness is a good doorway into the whole topic, because almost everything people get anxious about with backups is the easy part, and the part they don’t think about is where backups actually fail. A backup is not a copy operation. It’s a restore operation that you happen to run later.

Row count is the wrong metric

“We have X million records” tells you almost nothing about backup difficulty. The things that actually drive backup time, storage, and — crucially — restore time are:

  • On-disk size (total bytes), not how many logical records that is.
  • File count, because enumerating and transferring many small files has its own cost.
  • How many machines hold a piece of the data.
  • Bandwidth to wherever the backup lands.

A “huge” record count can be a trivially small footprint, and a “small” dataset of enormous blobs can be a monster to move. Sizing a backup by row count is like sizing a move by how many items are on the inventory list instead of how much they weigh and how far the truck has to drive. Measure the thing that actually costs you.
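As a rough illustration of measuring the right thing, here is a minimal sketch that totals bytes and file count under a directory and turns them into a back-of-the-envelope transfer estimate. The names `measure_footprint` and `estimate_transfer_seconds`, and the per-file overhead parameter, are my own assumptions for illustration. Note that `os.path.getsize` reports logical file size, which can differ from allocated on-disk blocks for sparse or compressed files.

```python
import os
from dataclasses import dataclass


@dataclass
class Footprint:
    total_bytes: int
    file_count: int


def measure_footprint(root: str) -> Footprint:
    """Walk a directory tree and total the bytes and regular files under it."""
    total_bytes = 0
    file_count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks so targets are not counted twice
            total_bytes += os.path.getsize(path)  # logical size, not allocated blocks
            file_count += 1
    return Footprint(total_bytes, file_count)


def estimate_transfer_seconds(
    footprint: Footprint,
    bandwidth_bytes_per_s: float,
    per_file_overhead_s: float = 0.0,
) -> float:
    """Crude estimate: bulk transfer time plus a fixed cost per file.

    per_file_overhead_s is a hypothetical knob standing in for the
    enumeration and request cost that many small files incur.
    """
    bulk = footprint.total_bytes / bandwidth_bytes_per_s
    return bulk + footprint.file_count * per_file_overhead_s
```

Run it once per machine that holds a piece of the data and the row count drops out of the conversation entirely.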

Restore is the product; the backup is just inventory

Here’s the reframe that changes how you build the whole thing: nobody wants a backup. What people want is a restore. The backup is just the inventory you keep so the restore is possible. Once you hold that, the quality bar moves from “did the backup job report success?” to “can I actually bring the system back, and how long does it take?”

An untested backup is a belief, not a guarantee. The only thing that converts it into a guarantee is restoring it.

So restore drills aren’t optional polish — they’re the test that proves the thing works at all. Restore into a throwaway environment on a schedule. The first time you attempt a restore should not be the day you need it, because that’s the day you discover the backup was missing the schema, or the topology metadata, or one node’s worth of files, and now you’re learning that during an incident.
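One way a drill can check itself is sketched below, assuming a file-based backup: record a checksum manifest at backup time, restore into a scratch directory, and compare. The helpers `build_manifest` and `verify_restore` are hypothetical names, not part of any particular backup tool, and a real drill would also start the service against the restored data.

```python
import hashlib
from pathlib import Path


def _sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): _sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_restore(expected: dict[str, str], restored_root: Path) -> list[str]:
    """Return a list of problems; an empty list means the restore matched."""
    actual = build_manifest(restored_root)
    problems = []
    for rel_path, digest in expected.items():
        if rel_path not in actual:
            problems.append(f"missing: {rel_path}")
        elif actual[rel_path] != digest:
            problems.append(f"mismatch: {rel_path}")
    return problems
```

A missing schema file or a missing node's worth of data shows up here as `missing:` lines during the drill instead of during the incident.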

The dangerous failure is the silent one

The backup failure that actually hurts you isn’t the loud one — it’s the quiet one. A scheduled job that silently stops running looks exactly like a healthy system right up until you need the data that wasn’t captured for the last three weeks. This is the same trap as a stuck automated updater that looks healthy: the absence of a backup makes no noise.

So the monitoring has to be inverted. Don’t alert only on failure; alert on age. “The most recent good backup is older than N hours” catches a silently dead job; “the backup job failed” only catches jobs that fail loudly enough to report. Watch the freshness of the newest restorable backup, and you’ll catch the dead timer, the full disk, and the rotated credential all at once.
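The age check itself is tiny. Here is a minimal sketch, assuming you can obtain the timestamp of the newest backup that is known to be good; how you get that timestamp depends on your tooling and is not shown. The function name `backup_is_stale` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def backup_is_stale(
    last_good_backup: Optional[datetime],
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True when the newest good backup is missing or older than max_age.

    Timestamps are assumed to be timezone-aware. A missing timestamp
    counts as stale, because "no backup found" is exactly the silent
    case this check exists to catch.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    if last_good_backup is None:
        return True
    return now - last_good_backup > max_age
```

The design point is that this check runs somewhere other than the backup job, so a job that never starts still trips the alert.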

A backup is not a repair, and not a consistency fix

This one bites in distributed systems specifically. A restore is only as good as the data was when you captured it. If the system was already inconsistent — replicas out of sync, a repair process that hasn’t run — the backup faithfully preserves that inconsistency. Backups protect against loss; they do not protect against rot. You need the consistency mechanism running alongside backups, not as a substitute for them. Treating a backup as a fix for a health problem just gives you a well-preserved copy of the problem.

Lean on immutability, and put the copy somewhere else

Two design choices make backups cheap and safe:

  • Incremental backups exploit immutability. Many storage engines write data into immutable files that never change after they’re written. That means after the first full backup, you only ever upload the files created since the previous backup, so incrementals become nearly free, which is what makes a frequent cadence affordable.
  • The copy has to live somewhere the original can’t take down with it. A backup sitting on the same machine as the data dies with the machine. Object storage, a remote share, another site — anywhere whose failure is uncorrelated with the thing you’re protecting. A backup that shares fate with its source isn’t a backup, it’s a second copy of a single point of failure.
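To make the immutability point concrete, here is a small sketch of how an incremental upload plan falls out of a set difference. It assumes file names are unique and the files never change once written, so a name already present at the destination never needs re-uploading. `plan_incremental_upload` is a hypothetical helper, and listing the remote side is left to whatever storage client you use.

```python
def plan_incremental_upload(
    local_files: set[str],
    already_uploaded: set[str],
) -> tuple[list[str], list[str]]:
    """Split work into files to upload and remote files no longer present locally.

    Because the files are assumed immutable, membership by name is enough:
    there is no need to compare contents or modification times.
    """
    to_upload = sorted(local_files - already_uploaded)
    # Files the engine has since removed locally (for example after compaction).
    # Whether to keep or expire them remotely is a retention decision, not made here.
    no_longer_local = sorted(already_uploaded - local_files)
    return to_upload, no_longer_local
```

After the first full backup, `to_upload` is usually a short list, which is why a frequent cadence stays cheap.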

Provisioning the backup tool and running the schedule are different jobs

A small but important distinction: installing and configuring the backup tooling is a convergent, do-it-the-same-everywhere task, which makes it a perfect fit for configuration management. Running the recurring backups is a scheduling task, and that belongs on the machines themselves (a node-resident timer), not bolted onto your configuration tool and fired from some control box. I’ll spare the full argument here because it’s really its own topic, but the short version: don’t make your configuration tool moonlight as cron, or you’ve invented a single point of failure for your safety net.

Put it together and “can we back this up?” stops being about size and becomes a checklist: can we restore it (have we tried?), would we notice if it stopped, is the data consistent in the first place, and does the copy survive the original’s bad day. Get those right and the row count never mattered. It’s the same operations-as-design thinking I keep coming back to in the private cloud notes. If you want a second set of eyes on your restore plan — the plan, not the backup — I’m easy to reach.