Pakkit.net

Engineering Practice

Log Why It Failed, Not Just That It Failed

A legacy system I dug into got one thing exactly right: every failure logged a structured reason (a severity, a category, and a specific cause) plus a tag pointing at the exact code that emitted it. That turns a log from a wall of "denied" into something you can build dashboards on and triage from.

  • Engineering Practice
  • Logging
  • Observability
  • Debugging

Most of the legacy system I was reverse-engineering was a cautionary tale, but it did one thing better than a lot of modern code: it never just logged that something failed. Every failure carried a structured reason (a severity, a category, and a specific cause) plus a short tag identifying the exact place in the code that emitted the line. That single discipline is the difference between a log you grep in despair and a log you can build dashboards on, triage from, and trust. “It failed” is almost useless. “It failed, here’s the kind of failure, here’s the specific cause, and here’s where” is operable.

“Failed” is not a diagnosis

A log line that says a request was denied, or an operation failed, tells you the outcome and nothing else. Was it denied because the input was malformed? Because a dependency was down? Because the user genuinely isn’t allowed? Each of those needs a completely different response — fix the caller, page the on-call, do nothing — and a bare “failed” forces you to go reconstruct which one it was, usually under pressure, often after the context is gone. The failure already knew why it happened at the moment it happened. Throwing that away and writing “failed” is discarding the most valuable information you’ll ever have about the event.

The code knows exactly why it’s failing at the instant it fails. A log that records only “failed” is deleting the answer on the way out the door.

Structure the reason so machines can read it

The system I studied wrote failure reasons as a small structured string — effectively severity, category, specific-cause. That structure is what makes the log more than prose. Because the category is a known token, you can count failures by category and watch for a spike. Because severity is separate, you can split “working as intended” rejections from “something is broken” errors at a glance. You get to ask questions like “are backend-error failures climbing?” without writing a fragile regex against free-form messages. Structured failure reasons turn your logs into a dataset you can aggregate, alert on, and trend — not just a transcript you read after something’s already gone wrong.
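As a minimal sketch of that idea in Python: the slash separator, the field values, and the names FailureReason and count_by_category are my own assumptions for illustration, not the legacy system's actual format. The point is only that a reason with known fields can be parsed and counted without a regex over free-form text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReason:
    """Hypothetical three-part failure reason: severity, category, cause."""

    severity: str  # assumed values such as "reject" or "error"
    category: str  # a known token, e.g. "bad-input" or "backend-error"
    cause: str     # the specific cause within that category

    def render(self) -> str:
        # Assumed wire format: severity/category/cause
        return f"{self.severity}/{self.category}/{self.cause}"

    @classmethod
    def parse(cls, text: str) -> "FailureReason":
        # Split at most twice so the cause itself may contain a slash.
        severity, category, cause = text.split("/", 2)
        return cls(severity, category, cause)


def count_by_category(reason_strings):
    """Count failures per category from rendered reason strings."""
    return Counter(FailureReason.parse(s).category for s in reason_strings)


# Made-up sample reasons, not real log data.
sample_reasons = [
    "reject/bad-input/missing-field",
    "error/backend-error/lookup-timeout",
    "error/backend-error/connection-refused",
]
category_counts = count_by_category(sample_reasons)
```

Asking "are backend-error failures climbing?" then becomes a comparison of category_counts["backend-error"] across time windows.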

Distinguish “said no” from “couldn’t decide”

The single most useful distinction in that taxonomy was between a clean negative and an error. A clean negative means the system worked perfectly and the answer is no — the input was invalid, the user isn’t permitted, the thing genuinely shouldn’t proceed. An error means the system couldn’t determine the answer — a dependency timed out, a lookup was unreachable, something broke. These look identical from the outside (both are “it didn’t go through”) and they could not be more different operationally:

  • A rising tide of clean negatives is usually normal, or a problem with whoever’s calling you — not your incident.
  • A rising tide of errors means your system is degrading — a backend is down, a dependency is flaky — and it’s your page.

Conflating them is how teams either panic over normal rejections or sleep through a real outage. Separate them in the log and the difference becomes a filter, not a forensic exercise.
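One way to keep the two apart, sketched in Python under my own assumptions: Outcome, LookupUnavailable, check_access, and the 5% threshold are all hypothetical names and numbers chosen for illustration. The check returns a distinct outcome when it could not reach an answer, and the paging decision looks only at that outcome.

```python
from enum import Enum


class Outcome(Enum):
    ALLOWED = "allowed"
    DENIED = "denied"  # clean negative: the system worked and the answer is no
    ERROR = "error"    # the system could not determine the answer


class LookupUnavailable(Exception):
    """Hypothetical error raised when a dependency cannot be reached."""


def check_access(user, lookup):
    """Ask `lookup` whether `user` is permitted, keeping 'no' apart from 'unknown'."""
    try:
        permitted = lookup(user)
    except LookupUnavailable:
        return Outcome.ERROR
    return Outcome.ALLOWED if permitted else Outcome.DENIED


def should_page(outcomes, error_threshold=0.05):
    """Page only on the error rate; clean negatives never count toward it."""
    if not outcomes:
        return False
    errors = sum(1 for outcome in outcomes if outcome is Outcome.ERROR)
    return errors / len(outcomes) > error_threshold
```

With this split, a flood of DENIED results leaves should_page false, while the same volume of ERROR results trips it.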

Point at the code that emitted it

The other detail I loved: each failure line carried a compact tag identifying the exact emit site — which file, which exit point. So a confusing log line wasn’t a mystery; you could jump straight from “this tag” to “this branch of the code” and see precisely what condition produced it. This is criminally underused. We write the same generic message from five different code paths and then wonder which one fired. A tiny, unique marker per failure site collapses “where did this come from?” from a search into a lookup. It costs almost nothing to add and pays off every single time someone reads the log in anger.
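A rough Python sketch of per-site tags, using the standard logging module. The tag scheme (PQ01, PQ02, PQ03), the line layout, and the function names are invented for this example; the legacy system's real tags are not shown here.

```python
import logging

logger = logging.getLogger("failure-demo")


def emit_failure(site: str, reason: str) -> str:
    """Log one failure line that carries both the reason and its emit-site tag."""
    line = f"failure reason={reason} site={site}"
    logger.warning(line)
    return line


def parse_quantity(raw: str):
    """Parse a positive integer; each exit point has its own hypothetical tag."""
    if not raw.strip():
        emit_failure("PQ01", "reject/bad-input/empty-quantity")
        return None
    try:
        value = int(raw)
    except ValueError:
        emit_failure("PQ02", "reject/bad-input/not-a-number")
        return None
    if value <= 0:
        emit_failure("PQ03", "reject/bad-input/non-positive")
        return None
    return value
```

Three branches produce three different tags, so a line ending in site=PQ02 can be searched for directly in the source and lands on exactly one branch.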

Two audiences, two messages

One more thing it got right: the failure reason in the logs was separate from the message returned to the end user. The user got a clean, appropriate message; the operator got the detailed diagnostic. Those are different audiences with different needs, and cramming both into one string serves neither — the user sees internal jargon, or the operator gets a sanitized message that hides the cause. Keep the human-facing message and the diagnostic reason as two distinct things.
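Here is a small Python sketch of keeping the two apart. The Failure class, its field names, and the response shape are assumptions for illustration only. The user-facing message and the diagnostic reason are separate fields, and each audience gets its own rendering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """Hypothetical failure record with one field per audience."""

    user_message: str  # safe to show the end user
    reason: str        # operator diagnostic: severity/category/cause
    site: str          # emit-site tag

    def log_line(self) -> str:
        # For operators: full diagnostic, never returned to the caller.
        return f"failure reason={self.reason} site={self.site}"

    def response_body(self) -> dict:
        # For the end user: no internal reason and no site tag.
        return {"ok": False, "message": self.user_message}


failure = Failure(
    user_message="We couldn't complete your request. Please try again later.",
    reason="error/backend-error/lookup-timeout",
    site="AC07",
)
```

Because response_body never reads reason or site, internal detail cannot leak into the user's view by accident, and the log line is never softened for the user's sake.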

Make failures first-class data

The habit worth stealing: treat a failure as a structured event you’re recording for future-you to query, not a string you’re emitting to feel thorough. Give it a severity, a category, a specific cause, and a pointer to where it happened, and keep it separate from whatever you show the user. It’s the logging side of why a job running isn’t a job working, and a close cousin of tagging telemetry at the source so it’s queryable later. The few extra seconds at write time buy you the ability to actually operate the system. If you’ve got a failure-logging scheme that made triage easy, I’d like to hear how you structured it.