Engineering Practice

Stop Extrapolating, Build the Rehearsal

When the runtime of a big operation is the thing you're most unsure about, an estimate scaled up from a tiny sample is a guess wearing a number's clothes — build a production-scale rehearsal instead, and watch the real bottleneck show up somewhere you didn't expect.

June 3, 2026

Engineering Practice
Performance
Testing
Databases

I had a long-running data operation to plan — a one-pass rewrite of millions of records — and the only thing anyone could say about how long it would take was a number extrapolated from a tiny development dataset. “It’s about this fast on a thousand rows, so times it by the row count.” That kind of estimate feels like data because it has a number in it, but it’s a guess, and the guess was the riskiest part of the whole plan. So I stopped extrapolating and built a rehearsal at production scale. The rehearsal didn’t just give me a real number — it told me the thing I’d been optimizing wasn’t the bottleneck at all.

A linear extrapolation hides every nonlinear cost

The reason “measure small, multiply” goes wrong is that the interesting costs aren’t linear. At a thousand rows everything fits in memory, nothing contends, and no background process has woken up yet. At ten million rows you’ve crossed thresholds the small sample never touched: caches stop covering the working set, coordination overhead piles up, and maintenance processes kick in and start competing with your job. The small run measured a different system — one without any of the pressure that actually governs the big one.

A timing extrapolated across three orders of magnitude isn’t a measurement. It’s a hypothesis with a decimal point.

So the move is to build the dataset at, or near, the real scale and run the actual operation against it. Turn the guess into an observation. That sounds obvious until you notice how often plans ride on the multiply-it-up number because building the rehearsal felt like too much work.

Don’t let the harness become the thing you’re measuring

Here’s the subtle part that nearly tripped me. To rehearse the operation, I first had to create millions of realistic records — and the realistic records had to go through the same transformation the real ones would. The naive approach was to generate them by running each one through the exact production code path. But that path was slow (it was the whole reason I was measuring), so generating the test data that way would have been a multi-hour event in itself — and worse, it would have loaded the system in a way that polluted the very timing I was trying to capture.

The fix was to reproduce the transformation a cheaper way for data generation only — same output, byte-for-byte, but produced through a fast local path instead of the slow production one. That kept the setup quick and, more importantly, kept the measurement clean: the system under test wasn’t also busy being the test-data factory.

But reproducing a transformation “equivalently” is exactly the kind of claim that’s quietly wrong. So I gated it: before generating a single bulk record, the harness took a sample, ran it through both the fast local path and the real production path, and aborted the whole run if they ever disagreed. Trust, but verify — and verify before you’ve spent hours generating data on a faulty assumption.

A test harness is also code, and it can be wrong in ways that make your results confidently meaningless. Prove the harness matches reality before you believe the numbers it produces.

The bottleneck is rarely the code you optimized

Then the payoff. I’d spent real effort speeding up the core transformation — and at scale, that effort turned out to barely move the wall-clock of the big operation. The limiter wasn’t the transformation’s CPU cost at all. It was the deliberate throttle I’d put on the operation to protect the live system, plus the coordination and maintenance overhead that throttle implied. The fast version and the slow version of the core function finished the bulk job in nearly the same time, because neither was the constraint.

That’s not a disappointment, it’s the whole reason you rehearse. The optimization did matter — just for a different workload (the live, latency-sensitive read path), not for the throughput of the bulk job. Measuring at scale told me where my effort actually paid off and where it didn’t, which is information you simply cannot get from a microbenchmark of one function. The same lesson runs through why I argue your load generator is part of the experiment and why recomputing in the hot path is worth profiling before assuming.

What I do now before quoting a duration

When someone needs a number for how long a large operation will take, the routine is:

Build the dataset at production scale, not a convenient fraction of it.
Generate the test data the cheap way, and prove that shortcut is faithful before trusting it.
Run the real operation, with the real safety throttles in place, because the throttle is often the actual constraint.
Report the measured number and what limited it, not just the time. “Five hours, bounded by the write throttle” is far more useful than “five hours,” because it tells the next person which knob changes the answer.

Extrapolation is fine for an order-of-magnitude sanity check. It is not fine as the basis for a plan when the duration is the scary unknown. The rehearsal costs you a day and saves you from confidently scheduling around a number that was never real. If you’re sizing a big migration and want a second opinion on the method, I’m easy to reach.