Engineering Practice

Profile Before You Tune

Tuning without a baseline isn't optimization; it's superstition. The only way to know a change helped is a controlled experiment that holds everything constant except the one variable you're testing.

August 26, 2025

Performance tuning attracts folklore. “Set this kernel parameter.” “Bump that buffer.” “Everyone knows this flag makes it faster.” Some of it is even true — but you can’t know which parts, on your system, with your workload, unless you measure. Tuning without a baseline isn’t optimization; it’s superstition with a config file. The discipline that turns it into engineering is the controlled experiment: change exactly one thing, hold everything else still, and measure before and after.

A number with nothing to compare it to means nothing

The first temptation is to apply a tuning, run a load test, see a big throughput number, and declare victory. But a number on its own is meaningless. Is it good? Compared to what? Without a baseline — the same workload on the untuned system — you have a number and a feeling, not a result. The feeling is usually “it seems faster,” which is the least reliable instrument in the building.

So the very first run is always the stock, untuned configuration. That’s not a formality you do before the real work; it is the real work. The baseline is the ruler. Every tuned result only means something as a delta from it, and a delta is the only thing worth reporting.

“It’s faster” is a vibe. “It’s faster than the baseline, same workload, one variable changed” is a result. Only one of those is worth acting on.
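
Here’s a minimal sketch of what reporting a delta can look like in Python. The helper names and the throughput figures are made up for illustration; they are not measurements from any real system.

```python
def relative_delta(baseline: float, tuned: float) -> float:
    """Change of the tuned result relative to the baseline, as a fraction."""
    if baseline <= 0:
        raise ValueError("baseline must be a positive measurement")
    return (tuned - baseline) / baseline


def report(metric: str, baseline: float, tuned: float) -> str:
    """Format a result so the baseline always travels with the tuned number."""
    delta = relative_delta(baseline, tuned)
    return f"{metric}: baseline={baseline:g} tuned={tuned:g} delta={delta:+.1%}"


# Illustrative numbers only (hypothetical, not from a real run).
print(report("throughput_ops_s", 1000.0, 1300.0))
```

The point of the format is that the tuned number never appears without the baseline beside it.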

Change one variable, or you’ve learned nothing

The cardinal rule of an apples-to-apples comparison: the only thing that differs between two runs is the thing you’re testing. Same dataset size, same operation count or duration, same concurrency profile, same schema, same client, same hardware. If you change the tuning and the dataset and the thread count between runs, a difference in the result tells you nothing — you can’t attribute it to any one cause.

This sounds obvious and is constantly violated, usually by accident. You tune the server, but you also happen to run the second test at a different scale, or after the cache warmed up, or with the client doing other work. Now your “30% faster” might be the tuning, or might be the warm cache, and you’ll never know. Discipline here is mostly about not changing things you didn’t mean to.
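
One cheap guard against accidental changes is to describe each run as data and refuse to compare two runs that differ in more than the variable under test. This is a sketch under assumptions: the keys (`dataset_rows`, `threads`, `duration_s`, `buffer_mb`) are hypothetical placeholders for whatever you actually control, and it only catches differences you remembered to write down.

```python
def changed_keys(baseline: dict, tuned: dict) -> list:
    """Names of every setting whose value differs between the two runs."""
    missing = object()  # sentinel so an absent key counts as a difference
    keys = set(baseline) | set(tuned)
    return sorted(k for k in keys if baseline.get(k, missing) != tuned.get(k, missing))


def check_single_variable(baseline: dict, tuned: dict, variable: str) -> None:
    """Raise unless `variable` is the only thing that differs between the runs."""
    diff = changed_keys(baseline, tuned)
    if diff != [variable]:
        raise ValueError(f"expected only {variable!r} to change, got {diff}")


# Hypothetical run descriptions; the keys are placeholders.
baseline_run = {"dataset_rows": 1_000_000, "threads": 16, "duration_s": 300, "buffer_mb": 64}
tuned_run = dict(baseline_run, buffer_mb=256)
check_single_variable(baseline_run, tuned_run, "buffer_mb")  # passes silently
```

It won’t notice a warm cache or a busy client, but it does stop the “different scale, different thread count” class of mistake before the test runs.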

Reset the state between runs, on purpose

The subtle killer of controlled experiments is leftover state. The first run populates data, warms caches, fragments storage, and leaves the system in a different condition than it started in. Run the second test on top of that and it’s not a clean comparison: the second run inherited the first run’s mess.

The fix is to reset to a known-clean state between runs. A snapshot you revert to, a teardown-and-rebuild, a wipe of the working set: whatever puts every run on the same starting line. I lean on snapshots for this: capture a pristine state once, and revert to it before each iteration so every run begins identically. Without that, you’re not measuring the tuning; you’re measuring the tuning plus the accumulated residue of every test before it.
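
The shape of that loop can be sketched as follows. Both `reset` and `measure` are hypothetical callables you would supply: how a snapshot is reverted and how load is generated depend entirely on your environment, so nothing here is tied to a particular tool.

```python
from typing import Callable, Dict, List


def run_experiment(
    reset: Callable[[], None],
    configs: Dict[str, dict],
    measure: Callable[[dict], float],
    repeats: int = 3,
) -> Dict[str, List[float]]:
    """Run every configuration `repeats` times, resetting state before each run."""
    results: Dict[str, List[float]] = {name: [] for name in configs}
    for _ in range(repeats):
        for name, config in configs.items():
            reset()  # e.g. revert to the pristine snapshot (assumed helper)
            results[name].append(measure(config))
    return results
```

The reset sits inside the innermost loop on purpose, so no run can inherit anything from the one before it, including an earlier run of the same configuration.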

Measure the things that actually characterize behavior

“Faster” isn’t one number. A useful comparison captures the dimensions that tell you how the system behaves under load, not just a single headline:

Throughput — operations per second at a given concurrency.
Latency distribution — not just the mean, but the tail (95th, 99th percentile). A change that improves the average while wrecking the tail is often a regression for real users, and the mean will hide it.
Resource behavior — CPU, memory, and for managed-runtime systems, garbage-collection frequency and pauses. A throughput win bought with brutal GC pauses isn’t the win it looks like.
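
To make the point about the mean hiding the tail concrete, here is a small sketch that summarizes latency samples using the nearest-rank percentile definition (one of several common definitions; I picked it for simplicity). The samples are synthetic and invented to show the effect.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Mean plus the percentiles worth comparing between runs."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }


# Synthetic samples, invented to show the effect (not measured data).
before = [10.0] * 100
after = [5.0] * 98 + [200.0] * 2  # most requests faster, a few far slower

print(summarize(before))
print(summarize(after))
```

In this made-up case the mean and median both improve while the 99th percentile gets twenty times worse, which is exactly the regression a mean-only report would miss.
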

Capture the same set for every run so the comparison is real. And probe the system at more than one point — ramp the load until throughput plateaus, because the most meaningful latency comparisons live right around that knee, not down in the easy range where everything looks fine.
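
One rough way to locate that knee from a load ramp is to look for the first step where adding concurrency stops buying throughput. The 5% threshold below is an arbitrary assumption of mine, and the ramp data is invented; treat this as a starting heuristic, not a definition of saturation.

```python
def find_knee(points, min_gain=0.05):
    """Return the concurrency level after which throughput stops growing.

    `points` is a list of (concurrency, throughput) pairs sorted by concurrency.
    The knee is the first level whose next step improves throughput by less
    than `min_gain` (a fraction). Returns None if no plateau was reached.
    """
    for (c_prev, t_prev), (_, t_next) in zip(points, points[1:]):
        if t_prev > 0 and (t_next - t_prev) / t_prev < min_gain:
            return c_prev
    return None


# Invented ramp: throughput scales up, then flattens.
ramp = [(1, 100), (2, 195), (4, 380), (8, 700), (16, 720), (32, 715)]
print(find_knee(ramp))
```

If the function returns None, the ramp never reached the plateau and the load needs to go higher before the latency comparison means much.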

Record enough that the run is reproducible

A benchmark you can’t reproduce is an anecdote. Six months later, “we found tuning X helped” is worthless if nobody can tell what “helped” meant or rerun the test. So I write down, with every result: the exact workload (op count or duration, concurrency, dataset), the software and OS versions on both client and server, which run was baseline and which was tuned, and the raw output, not just my summary.
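
A record like that can be as simple as one JSON document per run. The sketch below is one possible shape, not a standard: the `RunRecord` fields and helper names are my own assumptions, and `client_versions` only covers the machine running the script, so server versions would have to be collected separately.

```python
import json
import platform
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunRecord:
    label: str                       # "baseline" or "tuned"
    changed_variable: Optional[str]  # None for the baseline run
    workload: dict                   # op count or duration, concurrency, dataset
    versions: dict                   # software and OS versions, client and server
    raw_output: str                  # the tool's unedited output, not a summary


def client_versions() -> dict:
    """Versions of the machine running this script (the client side only)."""
    return {"os": platform.platform(), "python": platform.python_version()}


def to_json(record: RunRecord) -> str:
    """Serialize a run record with stable key order so records diff cleanly."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)
```

Writing the raw output into the same record as the workload and versions keeps the summary from drifting away from the evidence it was based on.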

That record is what lets a result survive. It turns “I think this flag helped once” into “here’s the experiment, here are the numbers, here’s how to rerun it,” which is the difference between folklore and knowledge. (The record is only as good as the measurement behind it, so also confirm that the load generator itself isn’t the bottleneck: a saturated measurement tool is just measuring itself.)

The mindset, not the tool

This isn’t really about databases or kernels or whatever you happen to be tuning. It’s the scientific method wearing ops clothes: form a hypothesis (“this change improves throughput”), build a controlled test (baseline, one variable, clean state, repeated), measure honestly (the distribution, not just the mean), and let the numbers — not the folklore, not the vibe — decide.

Do that and tuning becomes genuinely satisfying, because you know what worked and by how much, and you can defend it. Skip it and you’re just rearranging config flags and hoping, which feels like progress and usually isn’t. Profile first, then tune. If you’ve got a tuning result that survived a real controlled test — or one that embarrassingly didn’t — I’d love to hear it.