Pakkit.net

Infrastructure

Eventual Consistency Has Homework

A replicated database had every replica quietly drifting apart because the anti-entropy repair that reconciles them had never once run. "Eventually consistent" isn't a property you get for free — it's a promise that depends on maintenance somebody has to schedule.

  • Infrastructure
  • Databases
  • Distributed Systems
  • Reliability

Doing a health review of a replicated, eventually-consistent database, I checked one number that stopped me: the fraction of data that had been repaired across replicas was zero. Not low — zero. The background reconciliation process that keeps replicas in agreement had never run, on any table, since the cluster was built. The cluster looked healthy, served reads and writes fine, and was quietly drifting out of consistency the entire time. It was a clean lesson in something easy to forget: “eventually consistent” is a promise with homework attached, and if nobody does the homework, the promise lapses.

“Eventually” is doing a lot of work in that phrase

Eventually-consistent systems accept writes on different replicas and reconcile them over time. The marketing-friendly word is “eventually”: replicas converge, given time. What’s easy to miss is that convergence is not free background magic; in many such systems it depends on an explicit anti-entropy process that compares replicas and repairs the differences. Normal reads and writes don’t fully do this on their own. Replicas can miss updates (a node was down, a write was dropped, a hint expired), and without something actively reconciling them, those divergences accumulate. “Eventually” quietly means “once the repair runs,” and the repair is a thing you have to schedule.
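To make "compares replicas and repairs the differences" concrete, here is a minimal sketch of one anti-entropy pass. It assumes a toy model that is not taken from any particular database: each replica is an in-memory dict of key to (timestamp, value), and conflicts resolve by last-write-wins. The name anti_entropy_repair is hypothetical, and real systems compare hashes of ranges rather than every row.

```python
from typing import Dict, List, Tuple

# Hypothetical model: each replica maps key -> (write_timestamp, value).
Replica = Dict[str, Tuple[int, str]]


def anti_entropy_repair(replicas: List[Replica]) -> int:
    """Bring every replica up to the newest version of each key.

    Returns the number of (replica, key) entries that had to be fixed.
    """
    # Pass 1: find the newest version of every key seen on any replica.
    newest: Replica = {}
    for replica in replicas:
        for key, versioned in replica.items():
            if key not in newest or versioned[0] > newest[key][0]:
                newest[key] = versioned

    # Pass 2: overwrite missing or stale entries with the newest version.
    repaired = 0
    for replica in replicas:
        for key, versioned in newest.items():
            if replica.get(key) != versioned:
                replica[key] = versioned
                repaired += 1
    return repaired


# Three replicas that have drifted: b holds a stale value, c missed a write.
a = {"user:1": (2, "alice@new"), "user:2": (1, "bob")}
b = {"user:1": (1, "alice@old"), "user:2": (1, "bob")}
c = {"user:1": (2, "alice@new")}

fixed = anti_entropy_repair([a, b, c])  # fixes b's stale row and c's missing row
```

Nothing in the sketch runs by itself. Until something calls anti_entropy_repair, b keeps serving the old address and c keeps having no record of user:2.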

Eventual consistency isn’t “it fixes itself.” It’s “it fixes itself when you run the thing that fixes it.”

Healthy-looking is not the same as consistent

The unsettling part was how fine everything looked. Throughput was normal, nodes were up, queries returned. Nothing about day-to-day operation screamed “your replicas disagree,” because a single query just reads from whichever replica answers, and any one replica will return a plausible-looking answer. The divergence only bites in specific ways: a read that hits a stale replica returns old data, a node failure loses writes that were never replicated elsewhere, or deleted data resurrects because the deletion never propagated everywhere. These failures are intermittent, hard to reproduce, and exactly the kind of bug that is hardest to debug after the fact. The absence of obvious symptoms is not evidence of consistency; it’s just the absence of the particular query that would reveal the drift.
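The resurrection case is the least intuitive of the three, so here is a small sketch of how it can happen. It assumes a store that records deletes as tombstones and eventually discards them, which is common in this family of databases but is an assumption here, since the cluster's software isn't named. All function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical model: key -> (write_timestamp, value).
# A value of None is a tombstone, i.e. a recorded delete.
Replica = Dict[str, Tuple[int, Optional[str]]]


def merge(x: Replica, y: Replica) -> None:
    """Two-way last-write-wins merge, applied in place to both replicas."""
    for key in set(x) | set(y):
        versions = [r[key] for r in (x, y) if key in r]
        winner = max(versions, key=lambda v: v[0])
        x[key] = winner
        y[key] = winner


def purge_tombstones(replica: Replica) -> None:
    """Forget deletes, as a store might once a tombstone's retention expires."""
    for key in [k for k, (_, value) in replica.items() if value is None]:
        del replica[key]


def read(replica: Replica, key: str) -> Optional[str]:
    """Return the stored value, or None if the key is absent or deleted."""
    entry = replica.get(key)
    return None if entry is None else entry[1]


def scenario(repair_before_purge: bool) -> Optional[str]:
    deleted_here: Replica = {"order:7": (2, None)}        # saw the delete
    missed_delete: Replica = {"order:7": (1, "pending")}  # never saw it
    if repair_before_purge:
        merge(deleted_here, missed_delete)  # the delete reaches both replicas
    purge_tombstones(deleted_here)
    purge_tombstones(missed_delete)
    merge(deleted_here, missed_delete)      # a later repair or read-time merge
    return read(deleted_here, "order:7")


repaired_in_time = scenario(repair_before_purge=True)    # row stays deleted
repaired_too_late = scenario(repair_before_purge=False)  # old row comes back
```

In the second scenario the only record of the delete is discarded before any other replica hears about it, so the surviving old row looks like the newest data and wins the merge.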

Maintenance you can’t see is the easiest to skip

Why does this happen? Because the homework is invisible. Anti-entropy repair has no user-facing symptom when it’s not running — nothing turns red, no error fires. Contrast that with a backup that fails (you might get an alert) or a disk that fills (the system stops). Repair just… doesn’t happen, silently, and the cost accrues quietly until a node dies or a stale read surfaces at the worst time. Invisible maintenance is the first thing to fall off when a system is stood up under time pressure and then “works,” because nothing forces the issue. The systems that stay healthy are the ones where someone made the invisible maintenance visible and scheduled.

Schedule it, and verify it’s actually running

The fix is conceptually trivial and operationally easy to neglect: schedule the reconciliation on a regular cycle, and — critically — verify it’s actually completing, because “we set up repair” and “repair is running to completion” are different claims. A repair job that’s scheduled but silently failing leaves you exactly as drifted as no repair at all, with the added bonus of false confidence. So the health check isn’t “is repair configured?” but “when did every replica last successfully reconcile?” Make that a monitored number, the same way you’d monitor disk or backups, so a stalled repair surfaces instead of hiding.
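A sketch of that health check, under stated assumptions: something records the completion time of each replica's last successful repair, and a replica that has never completed one is recorded as None. The function name, the node names, and the seven-day threshold are all hypothetical; pick a threshold that fits your own repair cycle.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_repairs(
    last_success: Dict[str, Optional[datetime]],
    max_age: timedelta,
    now: datetime,
) -> List[str]:
    """Return replicas whose last successful repair is missing or too old."""
    stale = []
    for replica, finished_at in last_success.items():
        # None means "never completed", which is the worst case, not a skip.
        if finished_at is None or now - finished_at > max_age:
            stale.append(replica)
    return sorted(stale)


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
last_success = {
    "node-a": now - timedelta(days=2),   # repaired recently
    "node-b": now - timedelta(days=12),  # scheduled, but stalled
    "node-c": None,                      # has never completed a repair
}

alerts = stale_repairs(last_success, max_age=timedelta(days=7), now=now)
```

The check keys off successful completion, not off whether a job is configured or was started, so a repair that is scheduled but failing shows up the same way as one that was never set up.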

The general principle: name the upkeep your guarantees require

The transferable lesson reaches past any one database. Lots of systems offer a guarantee — consistency, durability, freshness, correctness — that silently depends on a recurring process: anti-entropy repair, cache invalidation, compaction, re-indexing, certificate renewal, log rotation, backup verification. The guarantee holds if the upkeep runs and lapses quietly if it doesn’t, because the failure mode is usually invisible until something else goes wrong. So when you adopt a system that promises you something, ask the uncomfortable question: what has to keep running for this promise to stay true, and is it actually running? It’s the same spirit as your backup job needing a smoke alarm and treating your monitoring as production too — the quiet, scheduled work is exactly the work that disappears. If you’ve found a guarantee silently lapsing because its homework wasn’t getting done, I’d like to hear which one.