Pakkit.net
Engineering Practice

Writing the Runbook Is the Test

Writing a step-by-step install or operations guide is the fastest way to find the holes in your own understanding — if you can't write the rollback step, you don't actually have a rollback, you have a hope.

I used to think writing the runbook was the chore you did after the real work — paperwork to satisfy whoever needs paperwork. I’ve flipped on that completely. Writing the step-by-step install or operations guide is one of the best tests of whether you actually understand the thing you built. The act of writing it down, in the order someone else would execute it, surfaces every gap you’d been papering over with “and then it just works.” If you can’t write a step cleanly, you don’t understand that step yet — and better to discover that at the keyboard than at 2am.

You can’t write what you don’t understand

There’s nowhere to hide in a procedure. Prose lets you wave at hard parts — “configure the storage backend appropriately.” A runbook step can’t: what command, what values, what does success look like? The moment you try to write the exact instruction, you find the place where your mental model is fuzzy, because you physically cannot type the step.

That’s the gift. The friction of writing precisely is a detector for the spots you only thought you understood. Every “wait, what exactly happens here?” is a real gap you’d otherwise have hit live, with stakes. Writing the runbook front-loads that discovery into a calm moment instead of a crisis. The struggle to write a step clearly is the learning; if it’s hard to write, that’s the part to go understand better, not gloss over.

If you can’t write the step, you don’t understand the step. The blank cursor is telling you where your knowledge actually ends.

The rollback step is where the bluffing dies

The single most clarifying section of any runbook is rollback. It’s easy to write the happy path — do this, do that, it works. Writing “and here’s how to undo it when it goes wrong” forces you to confront whether you actually can. More than once, drafting the rollback step is where I’ve discovered the honest answer was “I hadn’t thought about that,” which means there was no rollback — just optimism.

So I now anchor any risky procedure on a concrete, tested rollback before the procedure runs at all. For anything touching real machines that usually means: take a snapshot first, and name it explicitly as the rollback anchor. “Revert the snapshot” is a real, exercised undo. “Carefully reverse the fourteen steps you just did, under pressure, hoping you remember them all” is not a plan — it’s a prayer with extra steps. If you can’t write the rollback, you don’t have one yet, and that’s the most important thing the runbook just taught you.
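As a minimal sketch of "the undo has to exist before the do", the snapshot-then-revert idea can be expressed as a context manager. Everything here is hypothetical: `InMemoryTarget` stands in for a real machine, and `snapshot` / `revert` stand in for whatever your hypervisor or storage layer actually provides.

```python
from contextlib import contextmanager


class InMemoryTarget:
    """Hypothetical stand-in for a real machine: its state is just a dict."""

    def __init__(self, state):
        self.state = dict(state)
        self._snapshots = {}

    def snapshot(self, name):
        # Copy the current state under an explicit name.
        self._snapshots[name] = dict(self.state)

    def revert(self, name):
        # Restore the state captured under that name.
        self.state = dict(self._snapshots[name])


@contextmanager
def rollback_anchor(target, name):
    """Take a named snapshot first; revert to it if the body raises."""
    target.snapshot(name)  # the undo exists before the do
    try:
        yield target
    except Exception:
        target.revert(name)  # one exercised undo, not fourteen reversed steps
        raise


def risky_upgrade(target):
    # Assumed example procedure that fails partway through.
    target.state["version"] = "2.0"
    target.state["config"] = "migrated"
    raise RuntimeError("health check failed after upgrade")


host = InMemoryTarget({"version": "1.0", "config": "original"})
try:
    with rollback_anchor(host, "pre-upgrade"):
        risky_upgrade(host)
except RuntimeError as exc:
    print(f"rolled back: {exc}")

print(host.state)
```

The point of the shape is that the procedure cannot start until the snapshot call has succeeded, and the failure path is a single revert rather than a remembered sequence.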

A good runbook has a shape, and the shape is safety

Write enough of these and a structure emerges. It’s not bureaucratic box-ticking; it’s the accumulated scar tissue of operations done wrong. The pattern I reach for now:

  • Snapshot / backup first, explicitly flagged as the rollback anchor. The undo has to exist before the do.
  • Preview before you commit. Use a dry-run or check mode that shows what would change without changing it. Read it, confirm it’s what you meant, then run for real. (Build the preview mode before you build the engine that makes the changes.)
  • Validate on one target before the fleet. Run against a single host, confirm it’s healthy, then roll out. Blast radius is a choice you make in the sequencing.
  • Make it idempotent. Running the procedure twice should be safe and boring, not a second, conflicting mutation. People will re-run it.
  • Stop on any error. A loud “if anything looks wrong, halt here, roll back the affected target, and escalate” beats plowing ahead and compounding the damage.
  • Name the escalation path. When it goes sideways, say who gets called and with what information. (Keep that roster role-based and current; names and roles drift.)

That sequence isn’t ceremony. Each item is a specific way operations have gone wrong, encoded so they don’t go wrong again.
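The checklist above can be sketched in a few lines of code. This is an illustration under stated assumptions, not a real deployment tool: `Host`, `DESIRED`, and the `healthy` flag are all hypothetical, and a real health check would probe the machine rather than read a field.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical target: a name, a settings dict, and a canned health flag."""
    name: str
    settings: dict = field(default_factory=dict)
    healthy: bool = True


DESIRED = {"max_connections": 200}  # assumed example change


def plan(host):
    """Preview: return the changes that would be made, without making them."""
    return {k: v for k, v in DESIRED.items() if host.settings.get(k) != v}


def apply(host):
    """Idempotent apply: only touches keys that differ, so a re-run is a no-op."""
    changes = plan(host)
    host.settings.update(changes)
    return changes


def rollout(hosts, dry_run=True):
    """Work through hosts in order; the first one acts as the single-target check."""
    log = []
    for host in hosts:
        changes = plan(host)
        if dry_run:
            log.append((host.name, "would change", changes))
            continue
        snapshot = dict(host.settings)  # rollback anchor for this host
        apply(host)
        if not host.healthy:
            host.settings = snapshot  # roll back the affected target
            log.append((host.name, "rolled back, halting", changes))
            break  # stop here and escalate rather than plow ahead
        log.append((host.name, "changed", changes))
    return log


fleet = [
    Host("web-1", {"max_connections": 100}),
    Host("web-2", {"max_connections": 100}, healthy=False),
    Host("web-3", {"max_connections": 100}),
]

preview = rollout(fleet, dry_run=True)   # read this before committing
result = rollout(fleet, dry_run=False)   # halts at web-2, never touches web-3
rerun = apply(fleet[0])                  # second run on web-1 changes nothing

for entry in result:
    print(entry)
```

Because `plan` is shared by the preview and the real run, what the dry-run reports is what the apply step will do, which is the main reason to write the preview first.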

Write it while you do it, not after

The timing matters as much as the writing. Write the runbook as you perform the procedure the first time, not from memory a week later. Doing it live, you capture the real commands, the actual output, the surprises, the “oh, you also have to do X” steps that you’ll have completely forgotten by the time you sit down to “document it properly.”

Memory smooths over the bumps — and the bumps are the valuable part. The detour you had to take, the precondition that wasn’t obvious, the error you hit and fixed: those are exactly what the next person needs and exactly what you’ll lose if you wait. Writing in the moment also keeps you honest, because you’re describing what genuinely happens, not the idealized version your memory will happily invent. The runbook written live is a record; the one written from memory is historical fiction.
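One way to write the runbook while doing the work is to let the session record itself. The sketch below is an assumption about how that could look, not a prescribed tool: `run_and_record` is a hypothetical helper, and the two commands are placeholders that invoke the Python interpreter so the example runs anywhere.

```python
import subprocess
import sys


def run_and_record(cmd, transcript):
    """Run a command and append the exact command, exit code, and output."""
    done = subprocess.run(cmd, capture_output=True, text=True)
    transcript.append({
        "command": cmd,
        "exit_code": done.returncode,
        "output": (done.stdout + done.stderr).strip(),
    })
    return done.returncode == 0


transcript = []

# Placeholder steps: one that succeeds and one that hits a surprise.
run_and_record([sys.executable, "-c", "print('service is up')"], transcript)
run_and_record(
    [sys.executable, "-c", "import sys; sys.exit('missing precondition')"],
    transcript,
)

# Render the raw record as draft runbook steps, bumps included.
for number, entry in enumerate(transcript, start=1):
    status = "ok" if entry["exit_code"] == 0 else f"FAILED ({entry['exit_code']})"
    print(f"Step {number}: {' '.join(entry['command'])}")
    print(f"  result: {status}")
    print(f"  output: {entry['output']}")
```

The failed step stays in the transcript on purpose: the error and its fix are the part the next person needs most.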

The runbook is a thinking tool first

So I’ve stopped treating the runbook as the deliverable you grind out afterward. It’s a thinking tool I use during the work to find the holes in my own understanding while the stakes are still low. The document that comes out is a nice bonus — genuinely useful to the next person — but the real value is what writing it did to my own grasp of the system.

If you want to know whether you truly understand something you built, try to write the runbook for it. The places you stumble are the places you don’t actually understand yet, and finding them at the keyboard is enormously cheaper than finding them in production. The chore turns out to be the test — and documentation turns out to be infrastructure. If you’ve had a runbook expose a hole you didn’t know was there, I’d love to hear about it.