The Slowest Replica Sets Your Write Latency
In synchronously replicated storage a write isn't finished until every copy acknowledges it, so the slowest device in the cluster sets write latency for everyone — and reads will happily hide that from you.
- Infrastructure
- Storage
- Performance
- Homelab
- Reliability
I spent an afternoon chasing write latency on a small hyperconverged storage cluster in my homelab. Reads were coming back in under a millisecond. Writes were sitting around thirty-five milliseconds, bad enough that everything running on it felt like wading through wet sand. The instinct is to hunt for one bad setting. The real answer was simpler and more structural: a replicated write has to be acknowledged by every copy, so the slowest copy decides how fast the write finishes. One weak device in one node was taxing the entire cluster.
A mirrored write waits for everyone
Synchronous replication is the whole point of these systems: every block you write exists on more than one node, so losing a node doesn’t lose data. The cost is baked into the same mechanism. When the client writes, the cluster doesn’t acknowledge it until the data is committed to each replica — usually to a fast write cache or journal on every node holding a copy. The write is only “done” when the last of those acknowledgements comes back.
That means the latency you see isn’t the average across nodes. It’s the maximum across the replicas that hold a copy. The healthiest, fastest node in the cluster can’t help you here; the write is gated by whichever copy is slowest to commit. Replication turns your write path into a “wait for the slowest of N” operation, and that’s a very different performance profile from a single disk’s.
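To make the "max, not average" point concrete, here is a minimal Python sketch. The node names and mean commit times are made up for illustration (two healthy nodes and one slow one), and the exponential distribution is just a convenient stand-in, not a measurement of any real device.

```python
import random
import statistics


def write_latency_ms(replica_commit_ms):
    """A synchronous write completes when the slowest replica acknowledges."""
    return max(replica_commit_ms)


def simulate(n_writes=10_000, seed=0):
    """Compare the latency a client sees with a naive per-replica average."""
    rng = random.Random(seed)
    # Hypothetical mean commit latency per node, in milliseconds.
    node_means_ms = {"node-a": 0.5, "node-b": 0.6, "node-c": 30.0}
    observed = []  # what the client actually waits for
    averaged = []  # what averaging across replicas would suggest
    for _ in range(n_writes):
        commits = [rng.expovariate(1.0 / m) for m in node_means_ms.values()]
        observed.append(write_latency_ms(commits))
        averaged.append(statistics.mean(commits))
    return statistics.mean(observed), statistics.mean(averaged)


seen_ms, naive_ms = simulate()
print(f"mean write latency (max of replicas): {seen_ms:.1f} ms")
print(f"mean of per-replica averages:         {naive_ms:.1f} ms")
```

The averaged figure comes out far lower than what the client experiences, which is why a cluster-wide average can look acceptable while every write is waiting on one node.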
Reads will lie to you about cluster health
Here’s the asymmetry that made the diagnosis confusing. A read only needs one copy. The cluster can serve it from whichever replica answers first, often out of memory or a read cache, and it never has to wait on a congested node. So reads stayed fast while writes were terrible — sub-millisecond next to thirty-five.
Fast reads sitting beside slow writes isn’t a contradiction. It’s a fingerprint: it points straight at the replicated write path and away from “the whole array is dying.”
Once I stopped treating “the storage is slow” as one symptom and split it into “reads are fine, writes are not,” the search space collapsed. The problem had to live somewhere that writes touch and reads don’t — the per-node write commit, not the data path as a whole.
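The split can be written down as a tiny triage helper. This is a sketch of the reasoning, not a diagnostic tool: the function name, the 5 ms cutoff, and the suggested next steps are all my own assumptions, and the example inputs are simply the rough numbers from this cluster.

```python
def triage(read_ms, write_ms, slow_ms=5.0):
    """Classify a latency symptom by which path is slow.

    slow_ms is an arbitrary illustrative cutoff, not a recommended value.
    """
    reads_slow = read_ms >= slow_ms
    writes_slow = write_ms >= slow_ms
    if writes_slow and not reads_slow:
        return "write commit path: check per-node write cache / journal devices"
    if reads_slow and writes_slow:
        return "shared data path: check network, capacity disks, overall load"
    if reads_slow:
        return "read path: check read cache hit rate and capacity disks"
    return "healthy by this measure"


# Roughly what I was seeing: sub-millisecond reads next to ~35 ms writes.
print(triage(read_ms=0.8, write_ms=35.0))
```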
One weak device taxes the whole cluster
When I looked per node, the capacity disks were clean across the board. The congestion was concentrated on a single node’s write-cache device: a consumer-grade SSD doing a job that wanted an enterprise one. Consumer SSDs tend to fall off a cliff under sustained writes once their own internal cache fills, and they cope poorly with deep queues. The write path journals to that cache device on every mirrored write, so when it saturated, it didn’t just slow that node down. It set the floor for every write in the cluster, because every write had to wait for that copy too.
That’s the part worth internalizing: in a replicated cluster there is no such thing as one node’s performance problem. The slow node’s latency becomes everyone’s latency on writes. You don’t get to average it away.
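"Looking per node" boils down to ranking devices by their own latency instead of trusting a cluster-wide number. Here is a minimal sketch, assuming you have already exported per-device write latency samples from whatever monitoring your platform provides. The node names, device roles, and sample values below are invented for illustration.

```python
from statistics import median

# Hypothetical write latency samples in milliseconds, keyed by (node, device role).
samples_ms = {
    ("node-a", "cache"): [0.4, 0.5, 0.6],
    ("node-a", "capacity"): [1.1, 1.3, 1.2],
    ("node-b", "cache"): [0.5, 0.4, 0.7],
    ("node-b", "capacity"): [1.0, 1.2, 1.4],
    ("node-c", "cache"): [28.0, 35.0, 41.0],
    ("node-c", "capacity"): [1.2, 1.1, 1.3],
}


def rank_devices(samples):
    """Return (node, device) keys sorted from slowest to fastest median latency."""
    return sorted(samples, key=lambda key: median(samples[key]), reverse=True)


def slowest_device(samples):
    """The device every replicated write ends up waiting on."""
    return rank_devices(samples)[0]


node, device = slowest_device(samples_ms)
print(f"slowest: {node} {device} (median {median(samples_ms[(node, device)])} ms)")
```

Median is used here only to keep the example short; a high percentile is usually a better fit for tail-latency problems.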
Rebuilds are when the weak link shows
The latency was especially ugly that day because the cluster was mid-rebuild — re-replicating data back onto a node after a hardware hiccup. Rebuild traffic is write traffic, and it was piling onto the same saturated cache device that was already the bottleneck. Normal VM writes were now queued behind a flood of rebuild writes, all funneling through the worst device in the cluster.
Degraded state is exactly when a marginal component stops being marginal. At idle, a mediocre cache disk is merely unimpressive. Under a rebuild, it’s the thing gating the entire system, and the rebuild itself crawls because it’s bottlenecked on the same disk it’s trying to heal. A cluster’s true floor is the latency it shows while it’s recovering, not the latency it shows on a calm afternoon.
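A rough way to see why a rebuild turns a mediocre device into a wall is the textbook M/M/1 queue, where mean time in the system is 1 / (service rate - arrival rate). The sketch below uses that formula with made-up rates. It is an approximation for intuition only and not a model of any particular SSD or storage stack.

```python
def mm1_latency_ms(arrival_per_s, service_per_s):
    """Mean time in an M/M/1 queue, in milliseconds.

    Returns infinity when arrivals meet or exceed what the device can service.
    """
    if arrival_per_s >= service_per_s:
        return float("inf")
    return 1000.0 / (service_per_s - arrival_per_s)


# Hypothetical cache device that can commit 1000 writes per second.
SERVICE_RATE = 1000

normal_load = 500     # everyday VM writes per second (assumed)
rebuild_extra = 450   # additional rebuild writes per second (assumed)

print(mm1_latency_ms(normal_load, SERVICE_RATE))                  # calm afternoon
print(mm1_latency_ms(normal_load + rebuild_extra, SERVICE_RATE))  # mid-rebuild
```

With these numbers the device goes from 50% to 95% utilized and mean latency grows tenfold, which is the nonlinear shape that makes degraded-state measurements so much worse than idle ones.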
The cache tier is the lever, not the config
The temptation after a session like this is to go tuning — flags, queue depths, replication policy. None of that was the fix. The write path is only ever as fast as the device it journals to, so the highest-value change in one of these clusters is almost always the write-cache device: an enterprise, write-intensive drive with power-loss protection instead of a consumer part. Space-saving features like dedup and compression aren’t free here either; they add write amplification on the exact path that’s already the bottleneck.
The general shape of the lesson: find the one device every write has to wait on, and spend there. Tuning around a fundamentally slow commit device is rearranging deck chairs.
How I reason about replicated storage now
- Split reads from writes before anything else. Fast reads with slow writes points at the replicated commit path, not the whole system.
- Find the slowest replica. Per-node, per-device — the cluster’s write latency is the max across copies, so go find the max.
- Treat degraded state as the real benchmark. Measure latency during a rebuild, because that’s when a weak device dictates the floor.
- Spend on the commit device. The write-cache/journal tier is the lever; everything else is downstream of how fast a write can be acknowledged.
This is the kind of failure mode you only really feel once you run your own storage, which is most of why I keep a private cloud and homelab around to break. It rhymes with two other things I’ve written: that replication is about surviving failure domains, and that the messy operational parts are the ones worth learning. If you’ve got a storage cluster doing something inexplicable, I’m happy to compare notes.