Pakkit.net

A Ping Is Not a Health Check

A load balancer happily reported a service healthy because it could ping the host — while the actual service was broken. A health check that doesn't exercise the real thing is theater, and the most common reason teams fall back to ping is that the real check rotted.

  • Infrastructure
  • Monitoring
  • Reliability
  • Load Balancing

I went looking for why a service kept failing despite its load balancer insisting everything was healthy, and found the gap in one line of config: the “health check” was an ICMP ping. The host answered pings, so the balancer marked it up — while the actual service on that host was dead. The ping was true and useless. It’s a small misconfiguration with a big lesson behind it: a health check that doesn’t exercise the real service isn’t checking health, it’s checking electricity.

Ping proves the lights are on, nothing more

A successful ping tells you a host is powered on, reachable, and running a network stack that answers ICMP echo requests. It tells you nothing about whether the service you care about is accepting requests, authenticating correctly, or returning valid responses. A box can answer pings flawlessly while the daemon you actually depend on has crashed, wedged, or started rejecting everything. “Up” at the ICMP layer and “working” at the service layer are completely different claims, and conflating them means your load balancer will cheerfully route traffic into a black hole.

Pinging a host to check a service is like rattling the front door to confirm the kitchen is cooking. Different floor entirely.

This generalizes past ICMP. A TCP-connect check that only confirms the port is open is the same mistake one layer up: the listener accepted a socket, which doesn’t mean the application behind it is healthy. Any check that stops short of the real behavior is a proxy you’re hoping correlates with health — and the day it stops correlating is the day it lets a broken service take traffic.
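To make the TCP-connect gap concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function names, and the toy protocol where the client sends `STATUS` and expects `OK`. The "wedged" service is simulated by a socket that listens but never accepts or answers, which is roughly what a hung daemon looks like from the outside.

```python
import socket


def tcp_connect_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Proxy check: green as soon as the kernel completes the TCP handshake."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def request_reply_check(host: str, port: int, timeout: float = 1.0) -> bool:
    """Application-level check: send a request and require the expected reply.

    The STATUS/OK exchange is a made-up protocol, not any real service's API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"STATUS\n")
            return sock.recv(64).strip() == b"OK"
    except OSError:
        # Covers refused connections, resets, and the recv() timeout.
        return False


def start_wedged_listener():
    """Listen on an ephemeral local port but never accept or reply.

    Returns (listening_socket, port). The kernel still completes handshakes
    for queued connections, so the port looks open from the outside.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(5)
    return srv, srv.getsockname()[1]


if __name__ == "__main__":
    listener, port = start_wedged_listener()
    print("tcp connect:  ", tcp_connect_check("127.0.0.1", port))
    print("request/reply:", request_reply_check("127.0.0.1", port))
    listener.close()
```

Against the wedged listener, the connect-only check should report healthy while the request/reply check times out and reports unhealthy. Same host, same port, opposite answers.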

Check the actual function, not a proxy for it

A real health check exercises the thing that matters: send a representative request and confirm a correct response. For a request/response service, make a real request. For something that processes a specific protocol, speak that protocol and verify the reply. The check should fail for the same reasons your users would experience failure — that’s the entire point. If the service can be broken in a way your health check can’t detect, your health check has a blind spot exactly where it hurts.

This costs more than a ping. It needs a meaningful request, valid-enough credentials, and logic to judge the response. That cost is the price of a check that means something, and it’s far cheaper than routing live traffic to a dead backend because a ping said it was fine.
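Those three ingredients (a meaningful request, credentials, and logic to judge the response) fit in a few dozen lines. The sketch below assumes an HTTP service, and every specific in it is hypothetical: the `/health-sample` path, the bearer token, and the `{"result": "ok"}` response shape. Substitute whatever a representative request looks like for your service.

```python
import http.client
import json


def check_service(host, port, path="/health-sample", auth_token=None, timeout=2.0):
    """Return (healthy, reason). Green only if a real request gets a valid answer.

    The path, the bearer-token header, and the expected JSON shape are all
    illustrative assumptions, not any particular service's API.
    """
    headers = {"Accept": "application/json"}
    if auth_token is not None:
        headers["Authorization"] = f"Bearer {auth_token}"

    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path, headers=headers)
        resp = conn.getresponse()
        status, body = resp.status, resp.read()
    except (OSError, http.client.HTTPException) as exc:
        # Refused, reset, timed out, or a malformed HTTP response.
        return False, f"request failed: {exc!r}"
    finally:
        conn.close()

    # Judge the response the way a user would experience it.
    if status != 200:
        return False, f"unexpected status {status}"
    try:
        payload = json.loads(body)
    except ValueError:
        return False, "body is not valid JSON"
    if not isinstance(payload, dict) or payload.get("result") != "ok":
        return False, "JSON did not contain result == 'ok'"
    return True, "ok"
```

Returning a reason alongside the boolean is a deliberate choice here: when the check goes red, the reason tells you whether the service or the probe is the likely culprit.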

The real check rots, so people fall back to ping

Here’s the part I found most instructive: nobody chose a meaningless health check on purpose. The service originally had a real probe — a script that spoke the actual protocol and validated the response. But the platform underneath it upgraded its scripting runtime to a new major version, the old probe script wasn’t compatible, and it started failing on every run. Faced with a check that was now always red, the pragmatic fix was to point things at a plain ping, which is always green. The real check didn’t get fixed; it got bypassed.

That’s how most fake health checks are born — not from ignorance, but from a real check that broke and a deadline that didn’t care. The ping is the path of least resistance when the meaningful check rots.

Your probe lives in the environment it’s probing

The deeper trap: that probe script broke because it ran inside the very system it was checking, and that system changed under it. A health check embedded in an appliance, a sidecar, or a host is subject to that host’s upgrades, runtime versions, and dependencies. When the host moves to a new language runtime, your probe moves with it whether it’s ready or not. So a health check isn’t write-once — it’s code with a lifecycle, coupled to its environment, and it needs the same maintenance and testing as anything else you ship. A probe you wrote three years ago and never touched is a probe that may already be silently broken or silently downgraded to a ping.
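One way to treat a probe as maintained code is to give it its own test: run it against deliberately broken stand-ins and confirm it goes red. The Python sketch below is an assumption-laden illustration, not the original probe. `real_probe`, `connect_only_probe`, the stub server, and the three cases are all hypothetical. The idea is that a harness like this can run in CI or after every platform upgrade.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def real_probe(host, port, timeout=1.0):
    """Hypothetical probe: green only for HTTP 200 with body b'ok'."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        return resp.status == 200 and resp.read().strip() == b"ok"
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


def connect_only_probe(host, port, timeout=1.0):
    """What a downgraded check looks like: green if the port accepts a socket."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def start_stub(status, body):
    """Start a throwaway local server that always answers (status, body)."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(status)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep test output quiet

    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(
        target=server.serve_forever, kwargs={"poll_interval": 0.05}, daemon=True
    ).start()
    return server


# Case name -> (status, body, should a correct probe be green?)
CASES = {
    "healthy": (200, b"ok", True),
    "server_error": (500, b"boom", False),
    "wrong_body": (200, b"<html>login page</html>", False),
}


def blind_spots(probe):
    """Return the names of the cases where the probe gave the wrong answer."""
    missed = []
    for name, (status, body, expected) in CASES.items():
        server = start_stub(status, body)
        try:
            if probe("127.0.0.1", server.server_port) != expected:
                missed.append(name)
        finally:
            server.shutdown()
            server.server_close()
    return missed


if __name__ == "__main__":
    print("real probe blind spots:        ", blind_spots(real_probe))
    print("connect-only probe blind spots:", blind_spots(connect_only_probe))
```

The real probe should come back with no blind spots, while the connect-only probe should miss both broken cases. If someone quietly swaps the real check for the cheap one, a test like this is where it would show up.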

Audit what your green actually means

The takeaway is a question worth asking of every health check you own: if the service were broken, would this check actually go red? If the answer is “well, it pings…” then your green light is decoration. Make checks exercise the real behavior, treat them as maintained code rather than set-and-forget config, and be suspicious of any check that has never once gone red. It’s the same theme as failover quietly hiding the first failure and treating your monitoring as production too: a check you can’t trust is worse than no check, because its green light makes you stop looking. If you’ve found a health check that was lying to you, I’d like to hear about it.