Systems Thinking

A Null Result Is Still a Result

I ran a benchmark to prove a tuning change made things faster, and it came back inconclusive — twice. The honest finding wasn't a speedup; it was that the test couldn't show one, because the workload never exercised the thing I'd tuned. Reporting that is more valuable than torturing the data into a story.

June 5, 2026

Systems Thinking
Performance
Benchmarking
Honesty

I set out to prove, with a number, that a set of kernel and OS tuning changes made a database faster. I built the lab, ran a clean campaign, and the result was… inconclusive. So I cleaned it up and ran it again — still inconclusive, for a different reason. The temptation in that moment is enormous: massage the numbers, pick the run that looks good, ship a “12% faster” headline. The honest path was to report that the benchmark couldn’t show what I wanted, and explain why. That null result turned out to be more useful than the win I was chasing.

Inconclusive is data, not failure

There’s a strong pull to treat a benchmark that doesn’t show your hoped-for result as a failed benchmark — something to redo until it “works.” But “we tuned it and saw no measurable difference” is a real finding. It might mean the change doesn’t help in this scenario, or that the test can’t detect the help, or that something else is dominating. All three are worth knowing, and all three are buried if you keep re-rolling until you get a number you like. The discipline is to let the result be what it is and then ask why — not to keep shaking the dice.

“No measurable difference” is an answer. Refusing to accept it is how a benchmark becomes a search for a number you already decided on.

A benchmark only measures what the workload exercises

The why, in my case, was the most transferable part. The tuning I was testing only matters under specific conditions — memory pressure, very large numbers of open file handles, sustained saturation. My benchmark created none of those: it was a modest, cache-resident workload on a small box. So the tuned settings had nothing to bite on. No amount of rerunning would surface a benefit, because the workload never touched the thing I’d changed. Worse, I couldn’t even manufacture the right conditions — the test machines didn’t have enough disk to stage a dataset large enough to create the pressure the tuning addresses.

That’s a general law of measurement: a test can only reveal effects in the dimensions it actually exercises. If you tune for behavior under load and then test without load, you’ve built an instrument that’s blind to the exact thing you’re measuring. Before trusting any benchmark — including one that does show a result — ask whether the workload genuinely stresses the mechanism you changed. If it doesn’t, both a null result and a positive one are meaningless.

Confounded comparisons measure the confounder

The first run had a second lesson. The two sides of my comparison had drifted into unequal starting states — one carried more accumulated data than the other — and the “difference” I saw just tracked that imbalance, not the tuning. When the apparent effect flips direction depending on which side happens to hold more data, you’re not measuring your variable; you’re measuring the thing you forgot to hold constant. A comparison is only as good as the equality of everything you didn’t change. If you can’t establish that, the numbers aren’t a result — they’re noise with a decimal point.

Justify the change on the right axis

Here’s where the null result became genuinely useful. The tuning was still worth doing — not for speed, but for correctness and risk. The settings I’d changed prevent failure modes under production-scale load: stalls, resource exhaustion, documented anti-patterns. My benchmark couldn’t show a speedup because speed was never the point; the point was not falling over under conditions the lab couldn’t reproduce. Once I stopped trying to sell it as a performance win and started framing it as insurance against a known failure mode, the case was solid and didn’t depend on a number I couldn’t honestly produce.

The lesson: when you can’t prove the benefit you expected, check whether you were measuring the right benefit. Sometimes the real justification is reliability, not performance, and forcing it into a speedup story is both dishonest and weaker than the truth.

Report what you actually found

So I wrote it up as inconclusive, explained exactly why the lab couldn’t show the effect, noted that the change was safe and non-regressive, and made the correctness-not-speed case for doing it anyway. That’s less satisfying than a triumphant graph and far more trustworthy — and it tells the next person precisely what would be needed to actually measure it. Honest measurement is the foundation everything else stands on; it’s why you profile before you tune, why the load generator is part of the benchmark, and why the same input giving a different result is a signal worth chasing. If you’ve reported an honest null result instead of a flattering one, I respect it — tell me about it.