Your Backup Job Needs a Smoke Alarm

A backup that fails silently is worse than no backup, because it sells you false confidence. The job has to verify that it actually worked and shout when it didn't, and the checks have to live in whatever actually runs the backup.

October 7, 2025

Infrastructure
Backups
Reliability
Operations

I inherited a backup setup that inferred success from a string in its output: if the run printed the right line, it counted as a good backup. A backup that hung halfway, or finished with a partial result, or quietly produced nothing looked exactly the same as a perfect one. That’s not a backup system — it’s a system that produces reassuring log lines. The fix turned into a small thesis I now apply everywhere: a backup job needs a smoke alarm. It has to verify it worked and make noise when it didn’t, and the verification has to live where the job actually runs.

Inferring success is the original sin

The seductive thing about “grep the output for SUCCESS” is that it usually works, which is exactly what makes it dangerous. It’s right often enough to earn your trust and wrong in precisely the cases that matter — the failures. A backup tool can print encouraging progress and still leave you with nothing restorable, because the thing that matters isn’t what it said, it’s what landed.

So the first principle is to stop inferring and start checking outcomes. Did the process exit non-zero? That’s a real signal, not a guess at one. Better still: did the backup actually appear in the destination afterward? The only honest definition of “the backup worked” is “a backup exists that I could restore from,” and the only way to know that is to look, not to read tea leaves in stdout.
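As a minimal sketch of checking outcomes instead of output, the Python below treats a run as good only when the process exits zero and a non-empty artifact exists at the destination. The function name `backup_succeeded`, the command, and the single-file artifact are all assumptions for illustration. A real setup might check an object store or a snapshot list instead.

```python
import subprocess
from pathlib import Path


def backup_succeeded(command: list[str], artifact: Path) -> bool:
    """Return True only if the command exited 0 AND the artifact landed."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # A non-zero exit is a real signal, whatever stdout happened to say.
        return False
    # The outcome check: a non-empty file exists where the backup should be.
    return artifact.is_file() and artifact.stat().st_size > 0
```

Note that nothing here reads stdout. A tool that prints an encouraging line and writes nothing still comes back as a failure.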

A backup that can’t tell the difference between “done” and “looked done” isn’t protecting you. It’s flattering you.

Four checks turn a hope into a job

Hardening that setup came down to four things, in order, and they’re a decent template for any unattended job that has to actually succeed:

Preflight. Before starting, confirm the preconditions: enough free space in the staging area, the destination reachable, credentials present. Fail early with a clear message instead of dying halfway with a cryptic one. Most “backup failed at 3am” pages are really “precondition wasn’t met and nobody checked.”
Retries. Transient network and storage blips shouldn’t fail the whole run. A few bounded retries with a delay absorb the flakiness that’s normal in any real environment.
Verify. After the backup, confirm it landed before declaring success — check the artifact is present and well-formed, don’t just trust the exit of the copy. This is the step that catches the silent partial.
Alert. When it does fail, send the failure somewhere a human will actually see it: a log aggregator, a chat webhook, the system journal. A failure nobody is told about is operationally identical to no failure detection at all.
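The first two checks can be sketched in a few lines of Python. This is an illustration under assumptions, not the setup described here: `preflight`, `with_retries`, `PreflightError`, the credentials-file check, and the choice to retry only on `OSError` are all hypothetical. The destination-reachability check is left out because it depends entirely on the transport in use.

```python
import shutil
import time
from pathlib import Path
from typing import Callable


class PreflightError(RuntimeError):
    """Raised when a precondition is not met, before any work starts."""


def preflight(staging: Path, min_free_bytes: int, credentials: Path) -> None:
    """Fail early, with a clear message, if the run cannot possibly succeed."""
    if not staging.is_dir():
        raise PreflightError(f"staging directory missing: {staging}")
    free = shutil.disk_usage(staging).free
    if free < min_free_bytes:
        raise PreflightError(
            f"only {free} bytes free in {staging}, need {min_free_bytes}"
        )
    if not credentials.is_file():
        raise PreflightError(f"credentials file missing: {credentials}")


def with_retries(action: Callable[[], None], attempts: int = 3,
                 delay_seconds: float = 30.0) -> None:
    """Run action; on OSError, retry up to `attempts` times in total."""
    for attempt in range(1, attempts + 1):
        try:
            action()
            return
        except OSError:
            if attempt == attempts:
                raise  # Bounded: give up and let the caller alarm.
            time.sleep(delay_seconds)
```

The retry loop is deliberately bounded and deliberately narrow. Retrying forever, or retrying errors that are not transient, only delays the alarm.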

Each is small. Together they convert “I think backups are happening” into “I’d know within a day if they stopped,” which is the entire game.

Put the checks where the job actually runs

Here’s the subtle bit that’s easy to get wrong. The routine backups didn’t run from my configuration tool — they fired from a scheduled timer on each machine, on their own, long after the provisioning run was over. So if I’d put the preflight and verification in the provisioning playbook, they’d have validated the install and then never run again. The thing that actually executes the backup every night is the timer, so the safety has to live in what the timer invokes, not in the tool that set the timer up.

So both the scheduled path and any on-demand path got routed through one hardened runner script that does preflight, backup-with-retries, and verify, and exits non-zero on any failure. A non-zero exit fails the scheduled unit, and a failed unit triggers the alert. Scheduled and manual runs behave identically because they’re literally the same code. The principle: put the guarantee in the component that does the work, not the one that deployed it. Otherwise you’ve tested the setup and left the recurring reality unguarded.
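The shape of such a runner, reduced to its control flow, might look like the Python sketch below. The name `run_backup` and the three callables are assumptions for illustration; the actual runner script isn't shown in this post. The only point is that every step funnels into one exit code.

```python
import logging
from typing import Callable

log = logging.getLogger("backup-runner")


def run_backup(preflight: Callable[[], None],
               backup: Callable[[], None],
               verify: Callable[[], None]) -> int:
    """Run the steps in order and return a process exit code."""
    steps = (("preflight", preflight), ("backup", backup), ("verify", verify))
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Say which step failed, then stop: later steps must not run.
            log.error("backup failed at %s: %s", name, exc)
            return 1
    log.info("backup verified")
    return 0
```

Both the timer and a manual invocation would then end in something like `sys.exit(run_backup(...))`, so the scheduler sees the failure through the exit status rather than through anything the script printed.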

Real environments teach you what the lab couldn’t

Two refinements only showed up once this ran on an actual multi-node cluster, and they’re a good reminder that single-machine testing has blind spots:

“Already done” is success, not failure. If a node already had today’s backup (the timer beat me to it), the tool errored that the backup already existed. Retrying that is pointless, because the backup will still be there on every attempt. The runner had to learn that “already there” is an idempotent success, not a thing to retry and then alarm on.
Some checks are cluster-wide and false-fail per node. A full verify that reasons about the whole cluster can report failure right after a single node finishes, because its peers haven’t caught up yet. The per-node path needed a per-node check — “did this node’s backup land” — with the heavier cluster-wide verification run separately. The right check is scoped to what you actually did.
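Both refinements are small in code. The Python sketch below is hypothetical throughout: the error text in `ALREADY_EXISTS_MARKER`, the `classify` helper, and the `<root>/<node>/<day>.tar.gz` layout are assumptions, since the real tool's messages and storage layout aren't given here. If the tool offers a dedicated exit code for "already exists", that would be a sturdier signal than matching its error text.

```python
from enum import Enum
from pathlib import Path


class Outcome(Enum):
    SUCCESS = "success"
    RETRY = "retry"


# Hypothetical wording; the real tool's message will differ.
ALREADY_EXISTS_MARKER = "backup already exists"


def classify(returncode: int, stderr: str) -> Outcome:
    """Decide whether a finished attempt is done or worth retrying."""
    if returncode == 0:
        return Outcome.SUCCESS
    if ALREADY_EXISTS_MARKER in stderr.lower():
        # Today's backup is already there: idempotent success, never retry.
        return Outcome.SUCCESS
    return Outcome.RETRY


def node_backup_landed(backup_root: Path, node: str, day: str) -> bool:
    """Per-node check: look only at this node's artifact for this day."""
    artifact = backup_root / node / f"{day}.tar.gz"
    return artifact.is_file() and artifact.stat().st_size > 0
```

The per-node check never asks about peers, so it cannot false-fail while they are still catching up. The cluster-wide verification stays a separate job on its own schedule.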

Neither bug was visible on one node in a clean lab. They surfaced the moment real topology and real timing got involved — which is its own argument for testing against something shaped like production.

The unmonitored backup is the one that’s already failing

Strip it all down and the real lesson is about confidence. The whole point of a backup is to let you sleep, and a silent-failing backup gives you the sleep without the safety. That is the worst possible trade, because you’ve stopped worrying about the exact thing that’s now broken. It’s the old line about backups really being a restore problem, viewed from the other end: it’s not enough that a backup can be restored, you also have to know that backups are still happening.

The thing protecting your production data is itself production. Give it preflight, retries, verification, and an alarm, and put all of that where the job actually runs, not in the script that installed it. A backup you don’t monitor may already have silently stopped happening, and you’ll find out on the worst possible day. If you’ve been burned by a backup that lied about being fine, I’d like to hear how you caught it.