Engineering Practice
Diagnose Read-Only Before You Touch Anything
You can learn almost everything about a broken system without changing a single thing — and building the whole picture from read-only commands first is what keeps a diagnosis from becoming a second outage.
- Operations
- Debugging
- Reliability
When something’s broken, the urge is to do something — restart it, change a setting, reseat a thing, flip a flag — because doing something feels like progress. I’ve trained myself out of that, and the rule I replaced it with is simple: gather the entire picture using read-only commands first, and don’t change anything until I can explain what’s actually happening. Almost every system will tell you what’s wrong if you ask without touching, and the asking is free, reversible, and safe in a way that doing is not.
Most of a diagnosis is observation, not action
Modern systems are extraordinarily introspectable. Storage, hypervisors, databases, networks — they all expose deep read-only state: health summaries, performance counters, queue depths, error logs, congestion metrics, membership status. You can usually reconstruct a remarkably complete story of a failure entirely from things that only read. The investigation and the intervention are separate phases, and the first one carries no risk.
I ran a whole storage-latency root-cause once without changing a single byte: read the cluster health, read the per-device congestion counters, read the rebuild status, read the network stats. By the time I understood it, I knew exactly which one device was the problem and why — and I hadn’t touched anything that could make it worse. The fix came after the understanding, gated and deliberate, not as a panicked first move.
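To make the "which one device" step concrete, here is a minimal sketch of how per-device counters gathered with read-only queries might be compared. The device names, the readings, and the 3x threshold are all invented for illustration; they are not from the incident described above, and a real threshold would depend on the system.

```python
from statistics import median

def find_outliers(samples, factor=3.0):
    """Return devices whose median reading exceeds `factor` times the fleet median.

    `samples` maps a device name to a list of readings collected with
    read-only queries over the symptom window. Nothing here touches the system.
    """
    per_device = {name: median(values) for name, values in samples.items() if values}
    if not per_device:
        return []
    fleet = median(per_device.values())
    return sorted(name for name, value in per_device.items() if value > factor * fleet)

# Hypothetical latency readings in milliseconds, made up for this example.
readings = {
    "disk-a": [2.1, 2.4, 2.0, 2.3],
    "disk-b": [2.2, 2.0, 2.5, 2.1],
    "disk-c": [48.0, 52.5, 47.1, 60.2],
    "disk-d": [1.9, 2.2, 2.4, 2.0],
}
suspects = find_outliers(readings)
print(suspects)  # ['disk-c']
```

Comparing medians across the whole window, rather than looking at the latest sample, is one simple way to find the pattern instead of the most recent red line.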
Changing things destroys evidence
The deeper reason to look before you touch isn’t just safety. It’s that acting erases the very state you need to diagnose. Restart the service and you’ve thrown away the in-memory condition that was the clue. Clear the counters or flush the queue and the pattern’s gone. Now you’re debugging a system that no longer exhibits the bug, with the evidence wiped by your own first move.
The restart that “fixes” it also deletes the reason it broke. You’ve traded an answer for a reprieve, and the problem comes back without the breadcrumbs.
Read-only investigation is non-destructive in both senses: it can’t make the outage worse, and it can’t destroy the evidence trail. You capture the failed state fully before you start altering it, so that whatever you do next is informed instead of hopeful.
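One way to capture the failed state before altering it is to run a fixed list of read-only probes and write their output to disk with a timestamp. The sketch below assumes the probes are ordinary commands you already trust to be read-only; the function name `capture_snapshot` and the file layout are hypothetical, and nothing in the code can verify that a command is actually safe.

```python
import json
import subprocess
import sys
import time
from pathlib import Path

def capture_snapshot(probes, out_dir, timeout=10):
    """Run each probe and save its output before anything is changed.

    `probes` maps a label to an argv list. The caller is responsible for
    listing only commands that read state; this function does not check.
    """
    snapshot = {"captured_at": time.strftime("%Y-%m-%dT%H:%M:%S%z"), "probes": {}}
    for label, argv in probes.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
            snapshot["probes"][label] = {
                "argv": argv,
                "returncode": result.returncode,
                "stdout": result.stdout,
                "stderr": result.stderr,
            }
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A probe that fails is still evidence; record it and keep going.
            snapshot["probes"][label] = {"argv": argv, "error": repr(exc)}
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"snapshot-{int(time.time())}.json"
    path.write_text(json.dumps(snapshot, indent=2))
    return path
```

A call might look like `capture_snapshot({"uptime": ["uptime"], "disk": ["df", "-h"]}, "evidence")`, with the probe list swapped for whatever status commands your own system provides. Taking the same snapshot again after a change gives you a before and after to compare.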
“Do something” is how one outage becomes two
The real danger of acting first is that a wrong action on a system that’s already unhealthy can turn one problem into two. The classic version is making a remote change that severs your own access — now you have the original fault plus no way to reach the box. (I’ve written about the change that locks you out specifically, because it’s such a common own-goal.) But it generalizes: any unconsidered write to a degraded system risks adding a second failure on top of the first.
Read-only commands can’t do that. The worst case of running a status query is that it tells you nothing new. The worst case of a hasty change is a bigger incident than the one you started with. That asymmetry is the entire argument: when the downside of looking is zero and the downside of touching is unbounded, look a lot before you touch once.
The read/write split should be visible in your tools
This habit is worth baking into tooling, not just discipline. I like operational tools that physically separate the read-only verbs from the mutating ones — a clearly safe “inspect” surface you can run freely, and a separate, guarded “change” surface that asks for confirmation. When the safe operations are obviously safe, you reach for them first by default, and the dangerous ones require you to mean it.
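As a rough illustration of that separation, here is a small sketch of a tool that keeps read-only and mutating operations in separate registries and refuses to run a mutating one without explicit confirmation. The `OpsTool` class, the `status` and `restart` operations, and the in-memory `state` dict are all hypothetical stand-ins, not any real tool's API.

```python
class OpsTool:
    """Keeps read-only and mutating operations in separate registries."""

    def __init__(self):
        self._inspect = {}
        self._change = {}

    def inspect(self, func):
        """Register a read-only operation; it runs without confirmation."""
        self._inspect[func.__name__] = func
        return func

    def change(self, func):
        """Register a mutating operation; it runs only with confirm=True."""
        self._change[func.__name__] = func
        return func

    def run(self, name, *args, confirm=False):
        if name in self._inspect:
            return self._inspect[name](*args)
        if name in self._change:
            if not confirm:
                raise PermissionError(f"{name!r} mutates state; pass confirm=True to run it")
            return self._change[name](*args)
        raise KeyError(name)

tool = OpsTool()
state = {"service": "degraded"}  # stand-in for a real system

@tool.inspect
def status():
    return dict(state)  # return a copy so callers cannot mutate the original

@tool.change
def restart():
    state["service"] = "running"
    return "restarted"
```

Here `tool.run("status")` works freely, `tool.run("restart")` raises `PermissionError`, and `tool.run("restart", confirm=True)` goes through. The split is only as trustworthy as the registration: a function registered with `inspect` that quietly writes would defeat it.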
That’s the same instinct behind building ops tools that are safe by construction and giving automation a dry-run mode before an execute mode: make the non-destructive path the easy, obvious one, so investigation is frictionless and intervention is deliberate. Tools that blur the line train you to touch first; tools that draw it train you to look first.
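A dry-run mode can be as simple as making "describe" the default and "do" the opt-in. The sketch below is one possible shape, with a hypothetical `Step` type and `run_plan` function and an in-memory `config` dict standing in for a real system.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    description: str
    action: Callable[[], None]

def run_plan(steps, execute=False):
    """Describe every step; perform them only when execute=True."""
    log = []
    for step in steps:
        if execute:
            step.action()
            log.append(f"DONE    {step.description}")
        else:
            log.append(f"DRY-RUN {step.description}")
    return log

config = {"pool_size": 10}  # stand-in for real configuration
plan = [Step("set pool_size to 20", lambda: config.update(pool_size=20))]

preview = run_plan(plan)  # default is the safe path: nothing is changed
print(preview)            # ['DRY-RUN set pool_size to 20']
```

Because `execute` defaults to `False`, forgetting the flag produces a preview instead of a change, which is the failure mode you want.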
The loop I actually run
When something breaks, before I change anything:
- Read the health and status of the system and its components — what does it say about itself?
- Pull the counters and logs that cover the symptom window — find the pattern, not just the latest red line.
- Form a specific explanation I can state out loud: this component, doing this, because of that.
- Only then plan a change — gated, reversible where possible, smallest first — and predict what it should do before I run it.
- Verify against the read-only state again afterward, instead of trusting that “I did the thing” means “the thing worked.” “It ran” is not “it worked.”
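The last two steps of that loop, predicting the result and then verifying it from read-only state, can be sketched as a small wrapper around a single change. Everything here is illustrative: `gated_change`, the `system` dict, and its keys are hypothetical stand-ins for real status queries and a real change.

```python
def gated_change(read_state, apply_change, expected):
    """Apply one change and check the outcome against a prediction.

    read_state:   callable returning a dict of observed state (must not mutate)
    apply_change: callable performing the single planned change
    expected:     dict of keys and values predicted to hold after the change
    """
    before = read_state()
    apply_change()
    after = read_state()
    mismatches = {
        key: {"expected": want, "observed": after.get(key)}
        for key, want in expected.items()
        if after.get(key) != want
    }
    return {"before": before, "after": after, "ok": not mismatches, "mismatches": mismatches}

# Stand-in system: the change runs, but one predicted effect does not appear.
system = {"replicas_healthy": 2, "rebuild": "stalled"}
report = gated_change(
    read_state=lambda: dict(system),
    apply_change=lambda: system.update(rebuild="running"),
    expected={"rebuild": "running", "replicas_healthy": 3},
)
print(report["ok"])          # False
print(report["mismatches"])  # {'replicas_healthy': {'expected': 3, 'observed': 2}}
```

In this example the change itself succeeds, yet the report is not OK because the replica count never moved. Writing the prediction down before running the change is what makes that gap visible.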
None of this is slower in the way that matters. It feels slower because you’re not mashing buttons in the first thirty seconds, but it’s dramatically faster to a correct fix and far less likely to spawn a sequel outage. The calm part — reading first — is where the speed actually comes from. If you’ve got a system doing something baffling and want a second set of eyes before anyone touches it, I’m easy to reach.