Pakkit.net
Infrastructure

Some Failures Only Clear on a Cold Boot

A warm reboot doesn't reset everything. When hardware gets wedged in a stuck state, a driver reload or a soft restart can leave the fault in place — and a full power-drain is the one move that actually clears it.

  • Infrastructure
  • Hardware
  • Homelab
  • Debugging

A storage controller in one of my homelab servers came up with all its drives missing. The card itself looked healthy: it enumerated on the bus, the driver loaded, and it answered management commands. But every disk behind it was simply not there. I tried increasingly aggressive software fixes: rescan, reload the driver, soft reboot. Nothing. The drives stayed gone. What finally brought them back was the bluntest tool available: a full cold power cycle, with power genuinely off, standby drained, then back on. The fault was a wedged hardware state that a warm restart physically cannot reset, and reaching for the cold boot sooner would have saved me hours.

A warm reboot doesn’t reset everything

The mental model most of us carry is that rebooting is rebooting: restart the OS and everything starts fresh. That isn’t true. A warm reboot restarts the operating system, but a lot of hardware underneath keeps its state across that kind of restart. Controllers, expanders, and peripherals can stay powered and stay in whatever confused condition they were in. You restarted the software; the wedged silicon never noticed.

In my case the controller talked to the drives through an intermediary chip that had gotten stuck. It was answering just enough to look alive, but it wasn’t actually presenting the disks. A soft reboot left that chip powered and stuck the whole time. The OS came back clean and the hardware came back exactly as broken as before, because nothing had actually cut power to the stuck part.

“It’s healthy in software” can be a trap

What made this genuinely confusing is that every software-level check said the hardware was fine. The controller was on the bus. The driver was bound and current. It responded to commands. By every diagnostic the operating system could run, there was nothing wrong with the card. The problem lived below the layer the OS could see — in the link between the controller and the drives — and that layer doesn’t show up green or red in your usual tools.

“The driver is healthy” and “the device is doing its job” are different claims. A component can be perfectly alive at the layer you can see and completely stuck at the layer you can’t.

This is the tell for a whole category of hardware faults: software says everything’s fine, but the actual function is broken. When the diagnostics are clean and the thing still doesn’t work, stop trusting the diagnostics’ scope and suspect a layer beneath them.
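
To make that tell concrete, here is a minimal Python sketch of the check I now run in my head. It is only an illustration: the `ControllerCheck` fields and the result labels are names I made up, and you would fill in the observations from whatever tools your own controller provides.

```python
from dataclasses import dataclass


@dataclass
class ControllerCheck:
    """Observations about one controller (hypothetical field names)."""
    on_bus: bool            # the card enumerates on the bus
    driver_bound: bool      # the driver is loaded and attached
    answers_commands: bool  # management commands get a response
    expected_disks: int     # how many disks should sit behind the card
    visible_disks: int      # how many disks the OS actually sees


def diagnose(check: ControllerCheck) -> str:
    """Separate 'healthy in software' from 'doing its job'."""
    software_healthy = (
        check.on_bus and check.driver_bound and check.answers_commands
    )
    function_ok = check.visible_disks >= check.expected_disks

    if software_healthy and function_ok:
        return "ok"
    if software_healthy:
        # Clean diagnostics, broken function: look below the OS.
        return "suspect-below-os"
    return "software-visible-fault"
```

My wedged controller would have been `ControllerCheck(True, True, True, expected_disks=8, visible_disks=0)`, where the disk count is just an example number. The point of the sketch is that the verdict depends on the function check, not only on the three green software checks.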

A cold cycle resets things a reboot can’t

The reason a cold power cycle works where a reboot doesn’t is physical, not magical. Fully removing power — and draining the residual standby power, not just hitting reset — forces stuck components to actually reinitialize from nothing. The wedged intermediary chip, deprived of power entirely, had no choice but to come back up clean and rediscover the drives properly. A reset signal it could ignore; an absence of power it could not.

The general principle: the deeper and more “stuck” a hardware fault feels, the more likely it needs a true power drain rather than a soft restart. Reset buttons and driver reloads operate within the running power envelope. A cold cycle changes the power envelope itself, which is the only way to reach state that survives a warm restart.
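
One way to picture this is as an ordered list of restarts, each reaching more state than the one before it. The Python sketch below is a simplified model under my own assumptions: the state categories and the mapping are illustrative labels, not a description of any particular controller.

```python
from enum import IntEnum


class Restart(IntEnum):
    """Restart types, ordered from gentlest to most disruptive."""
    RESCAN = 1
    DRIVER_RELOAD = 2
    WARM_REBOOT = 3
    COLD_POWER_CYCLE = 4


# Weakest restart assumed to clear each kind of stuck state.
# This mapping is an illustrative simplification.
WEAKEST_RESTART_THAT_CLEARS = {
    "stale device list": Restart.RESCAN,
    "driver state": Restart.DRIVER_RELOAD,
    "operating system state": Restart.WARM_REBOOT,
    "hardware state that stays powered": Restart.COLD_POWER_CYCLE,
}


def clears(restart: Restart, stuck_state: str) -> bool:
    """True if this restart is strong enough to reach the stuck state."""
    return restart >= WEAKEST_RESTART_THAT_CLEARS[stuck_state]
```

In this model, `clears(Restart.WARM_REBOOT, "hardware state that stays powered")` is false, which is the whole story of my missing drives: every step below the cold power cycle stopped short of the layer that was stuck.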

Move the cold boot earlier in the ladder

My actual mistake wasn’t technical — it was ordering. I worked up the escalation ladder from gentlest to most disruptive: rescan, then driver reload, then soft reboot, and only then the cold power cycle. That’s a sensible instinct for a system in service, because the gentle options are less disruptive. But for the specific signature of “device wedged below the OS, healthy in software,” the gentle options were never going to work, and trying them in order just burned time.

So I rewrote my own runbook: for an all-disks-missing-behind-a-healthy-controller symptom, the cold power cycle isn’t the last resort, it’s an early move. The lesson generalizes. When the fault pattern matches “stuck hardware state,” skip the soft options that can’t reach it and go to the one that can. Escalation ladders are good defaults, but a default applied to the wrong failure class is just procrastination with extra steps.
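
The revised runbook boils down to a single branch. This Python sketch is a hypothetical rendering of it, not a tool I actually run; the function name and the two boolean inputs are assumptions made for illustration.

```python
# The default escalation ladder, gentlest step first.
DEFAULT_LADDER = ["rescan", "driver reload", "warm reboot", "cold power cycle"]


def plan(software_healthy: bool, all_disks_missing: bool) -> list[str]:
    """Pick the recovery steps to try, in order, for a given symptom."""
    if software_healthy and all_disks_missing:
        # Stuck-hardware signature: the soft options cannot reach the
        # fault, so skip them and go to the step that can.
        return ["cold power cycle"]
    # Any other symptom keeps the default gentle-first ordering.
    return list(DEFAULT_LADDER)
```

The default ladder stays in place for everything else. Only the one signature that the gentle steps cannot fix gets the shortcut.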

What this leaves you with

Two durable takeaways, beyond the obvious “try turning it fully off”:

  • Know which restart you’re actually performing. Rescan, driver reload, warm reboot, and cold power cycle reset progressively more state. They are not interchangeable, and reaching for a weaker one against a fault that needs a stronger one wastes time you don’t have during an outage.
  • Watch for recurrence. A fault that needed a cold boot to clear is a fault that can come back, and it is often a sign of a marginal component working its way toward failing for good. Clearing it is a reprieve, not a cure, so the cold boot that fixed it is also your cue to start watching that part and planning its replacement.
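
For the second takeaway, even a tiny log is enough to notice a pattern. The Python sketch below is one assumed way to keep it: the class name, the component label, and the rule of "watch on the first cold-boot fix, replace on any repeat" are my own choices for illustration.

```python
from collections import Counter


class RecurrenceLog:
    """Counts how often each component needed a cold boot to recover."""

    def __init__(self) -> None:
        self._clears = Counter()

    def record_cold_boot_fix(self, component: str) -> str:
        """Log one cold-boot fix and return a suggested next action."""
        self._clears[component] += 1
        # Assumed rule: first occurrence means watch, a repeat means replace.
        return "watch" if self._clears[component] == 1 else "replace"
```

Calling `record_cold_boot_fix("storage-controller")` once returns `"watch"`, and calling it again for the same part returns `"replace"`. A stricter or looser threshold is a judgment call that depends on how much you trust the part.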

This is exactly the kind of thing a homelab teaches that a managed platform never will — the messy operational parts where hardware fails in ways your dashboards don’t anticipate. And it’s a reminder that “it powered on” is not “it’s working”: verify the function, not just the lights. If you’ve got a box that’s stuck in a way no reboot fixes, I’m happy to compare war stories.