Systems Thinking
Choose Storage by How It Fails, Not by Its Spec Sheet
A feature list like "all-flash" tells you how storage behaves on a good day. What actually matters is how it behaves degraded, under load, and on the worst device you bought — and those don't fit on the spec sheet.
I run two small three-node storage clusters with opposite personalities. One is all-flash, which sounds obviously faster. The other is spinning disks, which sounds obviously slower. For a stretch, the spinning-disk cluster was out-writing the all-flash one — by a lot. The spec sheets predicted the opposite. The reason they were wrong is the whole point of this post: a spec sheet describes the good day, and storage is defined by its bad days.
The spec sheet describes the happy path
“All-flash,” “10-gigabit,” “NVMe cache” — these are real and they matter, but they’re all descriptions of how the system performs when everything is healthy, idle, and within its comfort zone. That’s the least interesting moment in a storage system’s life. You don’t lie awake about the happy path.
What you actually live with is the other stuff: how it behaves while rebuilding after a failure, how it degrades under sustained load, what happens when one device is weaker than the rest, and whether the redundancy you paid for is real. None of that is on the box. So buying storage by the feature list is optimizing for the one scenario that was never going to be the problem.
“Fast” hardware lost to “slow” hardware
The all-flash cluster was slow not because flash is slow, but because it was built on consumer SSDs and was busy rebuilding after a hardware hiccup. Its write path funneled through a single saturated consumer cache device, and degraded-state writes crawled. Meanwhile the spinning-disk cluster was healthy, idle, fully redundant, and behaving exactly as a well-built rotational cluster should — unspectacular but steady.
The all-flash cluster spent on flash and cheaped out on drive grade. The HDD cluster spent on enterprise drives and stayed on spinning disk. The all-flash cluster is slow by accident; the HDD cluster is slow by design, and “slow by design” is the one that surprises you less.
The headline number (“all-flash”) was the least predictive thing about either cluster. Drive grade, redundancy model, and current health told the whole story, and not one of those is what you shop for when you shop by feature.
Slow-by-design beats slow-by-accident
There’s a real distinction hiding in that comparison. The healthy HDD cluster is slow in a way I can predict, plan around, and explain: rotational latency is a known floor, three-copy replication has a known cost, and none of it changes unexpectedly. The all-flash cluster’s slowness was a surprise — an emergent product of cheap parts plus a degraded state — and surprises are what hurt in operations.
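To make the “known floor” concrete, here is a minimal sketch of the arithmetic. It assumes only that average rotational latency is half of one revolution. The spindle speeds and the 8.5 ms average seek time are illustrative assumptions, not measurements from either cluster.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in ms: on average the platter turns half a revolution."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    seconds_per_revolution = 60.0 / rpm
    return 0.5 * seconds_per_revolution * 1000.0


def random_iops_ceiling(rpm: float, avg_seek_ms: float) -> float:
    """Rough per-spindle ceiling on random IOPS: one I/O per (seek + rotation)."""
    service_time_ms = avg_seek_ms + avg_rotational_latency_ms(rpm)
    return 1000.0 / service_time_ms


if __name__ == "__main__":
    # Hypothetical spindle speeds; substitute the figures for your own drives.
    for rpm in (5400, 7200, 10000):
        latency = avg_rotational_latency_ms(rpm)
        # 8.5 ms is an assumed average seek time, used only for illustration.
        ceiling = random_iops_ceiling(rpm, avg_seek_ms=8.5)
        print(f"{rpm} RPM: {latency:.2f} ms rotation, ~{ceiling:.0f} random IOPS per spindle")
```

A 7200 RPM drive works out to roughly 4.17 ms of average rotational latency no matter how busy the cluster is. That is what makes it a floor you can plan around, not a surprise.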
When I evaluate storage now, “how does this fail, and is the failure boring?” outranks “how fast is this at its best?” A system whose worst case is understood and gradual is easier to run than one whose best case is dazzling and whose worst case is a mystery. Predictable-and-modest beats brilliant-and-volatile when you’re the one carrying the pager.
Redundancy is a design choice with a price tag
The two clusters also made different bets on safety, and the bet shows up in latency. The healthy HDD cluster keeps three copies of everything and acknowledges writes against a quorum of them. That is safer, and it pays for the safety in write latency on every single operation. The all-flash cluster keeps fewer copies. Neither is “right”; they’re different points on the durability-versus-latency curve.
That tradeoff is invisible on a spec sheet that just says “redundant.” Redundancy isn’t a checkbox, it’s a dial, and where you set it determines both how much failure you survive and how much every write costs. You can’t choose storage well without deciding, on purpose, how much latency you’re willing to trade for how many copies.
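One way to see the dial is a toy model, sketched here under stated assumptions. A write goes to every copy in parallel and is acknowledged once a required number of copies have confirmed, so its latency is the k-th fastest replica. The log-normal replica latencies, the function names, and the copy and ack settings are all hypothetical; this is not how either of my clusters is implemented.

```python
import random
import statistics


def ack_latency(replica_latencies_ms, acks_required):
    """Latency of one write: the moment the acks_required-th fastest replica confirms."""
    if not 1 <= acks_required <= len(replica_latencies_ms):
        raise ValueError("acks_required must be between 1 and the number of replicas")
    return sorted(replica_latencies_ms)[acks_required - 1]


def simulate(copies, acks_required, writes=10_000, seed=0):
    """Return (median, p99) write latency in ms under an assumed latency distribution."""
    rng = random.Random(seed)
    samples = []
    for _ in range(writes):
        # Assumption: each replica's latency is an independent log-normal draw.
        latencies = [rng.lognormvariate(1.0, 0.5) for _ in range(copies)]
        samples.append(ack_latency(latencies, acks_required))
    samples.sort()
    p99 = samples[int(0.99 * (len(samples) - 1))]
    return statistics.median(samples), p99


if __name__ == "__main__":
    # Hypothetical settings: (copies kept, acks required before the write returns).
    for copies, acks in ((2, 1), (3, 2), (3, 3)):
        median, p99 = simulate(copies, acks)
        print(f"copies={copies} acks={acks}: median={median:.2f} ms, p99={p99:.2f} ms")
```

In this model, requiring more acknowledgements can only make each write as slow or slower, because you wait for a slower replica. The numbers it prints are artifacts of the assumed distribution. The shape of the tradeoff is the point.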
Match the failure profile to the workload
The practical upshot is that “best storage” is a meaningless ranking. There’s only “best for this workload’s tolerance for this kind of failure.” A workload that needs predictable, modest latency and strong durability is well served by the boring, healthy, over-replicated cluster. A workload that needs raw write speed wants flash — but enterprise flash, because the failure mode of consumer flash under sustained writes will eventually become your problem.
So the evaluation questions I actually ask are about failure, not features:
- How does it behave mid-rebuild, not just at idle?
- What’s the floor under sustained load, and what sets that floor?
- What’s the weakest device I’m putting in it, and what happens when it saturates?
- How many copies, acknowledged how, and what does that cost every write?
- When something breaks, is the degradation gradual and legible, or sudden and opaque?
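Several of these questions can be probed with a very small script. It times synchronous writes and looks at the tail, not the average. This is a rough sketch, not a substitute for a proper benchmark tool. The function names, file name, block size, and write count are hypothetical choices, and how `fsync` behaves depends on the filesystem and any caching layers between you and the disks.

```python
import os
import tempfile
import time


def probe_sync_writes(directory, writes=200, block_size=4096):
    """Time `writes` fsync'd writes of `block_size` bytes; return sorted latencies in ms."""
    payload = os.urandom(block_size)
    path = os.path.join(directory, "latency_probe.tmp")  # hypothetical scratch file name
    latencies_ms = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush this write to stable storage
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return sorted(latencies_ms)


def percentile(sorted_values, fraction):
    """Simple index-based percentile on pre-sorted data; fraction is in [0, 1]."""
    if not sorted_values:
        raise ValueError("no samples")
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


if __name__ == "__main__":
    # A temp directory keeps the demo harmless; point this at the storage under test instead.
    with tempfile.TemporaryDirectory() as scratch:
        results = probe_sync_writes(scratch, writes=50)
        print(f"p50={percentile(results, 0.50):.2f} ms  "
              f"p99={percentile(results, 0.99):.2f} ms  max={results[-1]:.2f} ms")
```

Run it once against a healthy cluster and once mid-rebuild, with a large enough write count to outlast any cache. The gap between the two runs, especially at p99 and max, says more about the system than the idle number does.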
Buy the bad day
The throughline: pick storage by imagining its worst Tuesday, not its launch-day benchmark. The spec sheet is marketing for the happy path; the failure profile is what you’ll actually operate. This is the same instinct as treating a migration as a risk assessment rather than a feature comparison, and why I’d rather build the rehearsal than extrapolate from a clean number. You learn a system’s real character by running it through a bad day — which is most of why I keep a homelab around to break things in. If you’re weighing a storage decision and want to pressure-test the failure modes, I’m happy to talk it through.