Failover Hides the First Failure
Automatic failover is supposed to save you, and it does — but it also silently absorbs the first outage, so you cruise along on your last healthy path with no idea you're one failure from down. Monitor each path, not just the service.
- Infrastructure
- Reliability
- Monitoring
- Resilience
Automatic failover is one of those features that works so well it becomes dangerous. You point a client at a primary and a backup, the client quietly retries the backup when the primary stops answering, and the service stays up. Great — that’s the whole point. The trap is that the same mechanism that keeps you running also hides that anything broke. You can lose your primary and never notice, because failover did its job and nothing paged. Now you’re running on your last good path, one more failure from a real outage, and feeling fine about it.
Redundancy is also a silencer
We build redundancy to survive a failure. What’s easy to miss is that surviving a failure and noticing a failure are different goals, and redundancy actively works against the second one. A client that fails over from a primary to a backup experiences exactly what you designed: no user-visible error. The service-level view stays green. The very success of the failover is what keeps the failure off your radar.
Failover converts an outage into a secret. The service is up; you just don’t know how close to the edge you’re running.
So the dashboard that watches “is the service responding?” will happily report all-clear while half your redundancy is gone. You haven’t avoided the outage — you’ve deferred it and removed the warning.
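To make the silence concrete, here is a minimal Python sketch of a failover call. The names `call_with_failover` and `FailoverResult`, and the two stand-in callables, are hypothetical and not taken from any particular client library. The only difference from a silent failover is that the result records which path answered and what the primary raised, so the caller has something to log or count.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FailoverResult:
    value: str                    # the response the user sees
    served_by: str                # "primary" or "backup"
    primary_error: Optional[str]  # set only when failover fired


def call_with_failover(primary: Callable[[], str],
                       backup: Callable[[], str]) -> FailoverResult:
    """Try the primary, fall back to the backup, and report which one answered."""
    try:
        return FailoverResult(primary(), "primary", None)
    except Exception as exc:  # assumption: a real client would catch narrower errors
        # A silent failover would just return backup() here and discard `exc`.
        return FailoverResult(backup(), "backup", repr(exc))


# Hypothetical stand-ins for the two paths.
def dead_primary() -> str:
    raise ConnectionError("primary unreachable")


def healthy_backup() -> str:
    return "ok"


result = call_with_failover(dead_primary, healthy_backup)
print(result.value, result.served_by, result.primary_error)
```

From the user's side `result.value` is all that matters, and it is fine. The operator's evidence lives entirely in `served_by` and `primary_error`, which is why dropping them turns the failure into a secret.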
“Up” and “healthy” are different questions
The fix starts with separating two questions you probably conflated:
- Is the service responding? — the user’s question, and the one failover is designed to keep answering “yes.”
- Is every path that’s supposed to be carrying it actually healthy? — the operator’s question, and the one that quietly went unanswered.
When those collapse into a single green light, you’ve built a system that tells you it’s fine right up until the moment both paths are gone. The first failure spent your safety margin, and nothing recorded the withdrawal.
Monitor the components, not just the composite
If a service is fronted by a primary and a failover, the health of each is its own signal, and a primary going down should alert even when the backup seamlessly covers it. The composite “service is up” check is necessary but not sufficient: it reports the total and hides the fact that one of the terms is missing. What you want is to know the instant your redundancy degrades from “two healthy paths” to “one,” because that’s the moment your margin for error dropped to nothing while your user impact stayed at zero.
Concretely, that means:
- Probe each member directly, not only the front door. A failover target you never test independently might already be dead — and then your “redundancy” is fiction.
- Alert on degraded-but-up, not just down. “Primary unreachable, serving from backup” is a real page, even at full service availability.
- Watch the failover happening. A spike in retries or a flip to the secondary is a signal worth surfacing, because it’s often the only evidence the primary went away.
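The first two points can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a full health checker: `probe_tcp` and `classify` are hypothetical names, and a bare TCP connect is a deliberately crude probe (a real check would likely exercise the protocol the member actually serves). The point is that each member is probed on its own, and that the "degraded" state is reported separately from "down".

```python
import socket
from typing import Dict


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to this specific member succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(path_health: Dict[str, bool]) -> str:
    """Collapse per-path results into 'healthy', 'degraded', or 'down'."""
    healthy = sum(1 for ok in path_health.values() if ok)
    if healthy == 0:
        return "down"
    if healthy < len(path_health):
        return "degraded"  # service is up, but redundancy is spent: page on this
    return "healthy"


# Hypothetical probe results; in practice each value would come from
# probe_tcp(member_host, member_port) run against that member directly.
print(classify({"primary": True, "backup": True}))
print(classify({"primary": False, "backup": True}))
print(classify({"primary": False, "backup": False}))
```

A front-door check can only ever distinguish "down" from everything else. Keeping "degraded" as its own state is what lets the alert fire while users are still unaffected.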
The failover target you never tested is a guess
There’s a meaner version of this. Redundancy you don’t exercise tends to quietly rot: the backup that was healthy at setup drifts, its config goes stale, its certificate expires. You find out only when the primary finally fails and the “backup” doesn’t catch. By then both paths are down and the seamless save you were counting on never comes. The only way to know your failover works is to have seen it work, on purpose, recently. This is the same discipline as treating a backup as a restore you haven’t tested yet, and as replicating across real failure domains: redundancy you’ve never triggered is a hope, not a safety net.
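One way to keep "recently" honest is to record when each path last carried traffic successfully, whether from a deliberate drill or a real failover, and to flag anything older than a chosen threshold. The sketch below assumes you already have those timestamps from somewhere; `stale_paths`, the 30-day threshold, and the example dates are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def stale_paths(last_exercised: Dict[str, datetime],
                now: datetime,
                max_age: timedelta) -> List[str]:
    """Return the paths whose last verified use is older than max_age."""
    return sorted(name for name, seen in last_exercised.items()
                  if now - seen > max_age)


# Hypothetical timestamps for illustration only.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_exercised = {
    "primary": now - timedelta(days=2),   # served real traffic recently
    "backup": now - timedelta(days=90),   # not exercised since setup
}
print(stale_paths(last_exercised, now, max_age=timedelta(days=30)))
```

A path that shows up in this list is not known to be broken. It is just unverified, which is the state the backup is usually in on the day you need it.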
Count your healthy paths, out loud
The habit I try to keep now is to always know how many independent healthy paths a critical service is actually running on, and to alert the moment that number drops, not the moment it hits zero. Failover is still wonderful; I want it. I’ve just stopped trusting it to tell me it fired. The save is invisible by design, so I make the degradation visible by choice. If you’ve been burned by discovering a dead primary only when the backup also fell over, I’d like to hear the story.