Pakkit.net
Engineering Practice

Migrating Live Data Without Touching the Callers

When external callers depend on a function by name and you can't change their queries, you migrate by changing what the function does behind a stable signature — in reversible steps, with one explicit point of no return.

The hardest data migrations aren’t the ones with the most rows. They’re the ones where you don’t control the callers. I had to retire an in-place data transformation — a function that other teams invoked by name, directly against a live database, with query strings I couldn’t change and couldn’t get changed on any useful timeline. You can’t take a maintenance window on a system with writers you don’t own. So the whole migration had to be invisible from the outside: same function names, same signatures, different behavior underneath. Here’s the shape of doing that safely.

The constraint dictates the design

The moment your interface has callers you don’t control, “just change the API” is off the table, and that single fact drives everything. The function names and signatures become a frozen contract. What you can change is the implementation behind them, in place. Most databases and runtimes let you redefine a function without renaming it — and that redefinition, not a schema migration, becomes your primary tool.

When you can’t change the callers, the migration has to happen entirely behind a stable name. The signature is a promise; the body is yours.

This reverses the usual advice to introduce a new, versioned function and migrate callers over. That advice assumes you have the callers. With uncontrolled external consumers, a parallel function nobody adopts is just dead code. Meet them where they are.

Detect, don’t flag

A migration that converts data needs to tell converted rows from unconverted ones. The tempting move is to add a marker column or a sentinel value — but that’s a schema change, and schema changes are their own coordination problem. The better move, when it’s available, is to detect state from something intrinsic to the data.

In my case every transformed value, once the transformation was reversed, started with a known fixed prefix. So the “smart” version of the function could look at any value and decide: does this have the shape and the marker of a transformed value? Convert it. Does it not? Pass it through untouched. No new column, no sentinel, no schema migration: the data told me its own state. When you can derive “is this done yet?” from the value itself, you’ve removed an entire class of coordination.
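A minimal sketch of that kind of intrinsic detection, in Python. Everything concrete here is an assumption for illustration: base64 stands in for the old transformation, `rec:` stands in for the known prefix, and `smart_read` is a hypothetical name for the function behind the stable signature.

```python
import base64
import binascii
from typing import Optional

# Hypothetical marker: assume every legacy value, once decoded, starts with this.
MARKER = b"rec:"


def encode_legacy(plain: bytes) -> bytes:
    """Stand-in for the old transformation (base64 is an assumption)."""
    return base64.b64encode(plain)


def try_decode_legacy(value: bytes) -> Optional[bytes]:
    """Return the decoded payload if the value looks transformed, else None."""
    try:
        decoded = base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return None  # wrong shape: cannot be a transformed value
    # Right shape is not enough; the decoded payload must carry the marker.
    return decoded if decoded.startswith(MARKER) else None


def smart_read(value: bytes) -> bytes:
    """Convert transformed values, pass everything else through untouched."""
    decoded = try_decode_legacy(value)
    return decoded if decoded is not None else value
```

The check is only as good as the marker: a value that is already in the new format but happens to decode to something starting with the prefix would be misread, so it is worth estimating how likely that collision is in the real data before relying on it.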

Make every step a reversible swap — until one commit point

I broke the migration into phases, and the discipline that made it safe was being ruthless about which phases were reversible. Each redefinition of the function was a clean swap: if a step misbehaved, I could redefine it back to the previous body and be exactly where I started. That keeps almost the entire migration in “undo is cheap” territory.
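The swap-and-revert discipline can be sketched in a few lines of Python. This is an illustration under assumptions, not the mechanism used in the migration: `StableFunction` is a hypothetical wrapper, and in a database the same role would be played by the engine's own redefinition statement (PostgreSQL's `CREATE OR REPLACE FUNCTION`, for example).

```python
from typing import Any, Callable, List


class StableFunction:
    """A callable whose name and signature stay fixed while its body is swapped."""

    def __init__(self, body: Callable[..., Any]) -> None:
        # Keep every previous body so a bad step can be undone exactly.
        self._history: List[Callable[..., Any]] = [body]

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Callers always hit the most recent body.
        return self._history[-1](*args, **kwargs)

    def swap(self, new_body: Callable[..., Any]) -> None:
        """Install a new body behind the same name."""
        self._history.append(new_body)

    def rollback(self) -> None:
        """Restore the previous body; refuse if there is nothing to restore."""
        if len(self._history) == 1:
            raise RuntimeError("no earlier body to roll back to")
        self._history.pop()


# Hypothetical bodies: the original behavior and a replacement under test.
def legacy_body(value: str) -> str:
    return value.upper()


def candidate_body(value: str) -> str:
    return value.strip().upper()


transform = StableFunction(legacy_body)
transform.swap(candidate_body)   # step forward
transform.rollback()             # step misbehaved: back to the previous body
```

The rollback is only cheap because the swap changes code and not data. Once a step starts changing stored values, popping the history no longer restores the earlier state.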

The exception — and you must know exactly where it is — was the single step that started letting new data land in the new format. Before that step, the data was uniform and reversible. After it, the data was mixed-state, which meant the read-side function now had to stay smart enough to handle both forever, and the only true rollback was restoring from a snapshot (which, on a live system with uncontrolled writers, means losing whatever they wrote since the snapshot). That’s the one-way door: one step where the cost of being wrong jumps, gated behind explicit sign-off and a tested read path. Everything before it, I could experiment with freely. Knowing precisely which step is the door is most of the safety.

There’s an ordering subtlety worth calling out, because it generalizes: the step that makes reads tolerant of both formats must be live and fully settled across the whole system before the step that starts producing the new format. Flip that order and a reader hits a value it doesn’t understand. Sequence the tolerance before the change that requires it.
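One way to enforce that ordering is a gate that refuses to enable new-format writes until every node reports a tolerant reader. The sketch below is hypothetical throughout: the version number, the node names, and the idea that nodes report a reader version are all assumptions made for the example.

```python
from typing import Dict

# Hypothetical: the first reader version that accepts both formats.
TOLERANT_READER_VERSION = 2


def safe_to_write_new_format(reader_versions: Dict[str, int]) -> bool:
    """True only when every known node runs a reader that accepts both formats."""
    if not reader_versions:
        # No reports is not the same as all clear.
        return False
    return all(v >= TOLERANT_READER_VERSION for v in reader_versions.values())


# One node still runs the old reader, so the write-side change must wait.
fleet = {"node-a": 2, "node-b": 1, "node-c": 2}
if safe_to_write_new_format(fleet):
    print("tolerant readers everywhere: new-format writes may be enabled")
else:
    print("hold: at least one reader cannot parse the new format yet")
```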

The backfill has to assume a live writer

Lazy conversion — only converting rows as they happen to be touched — never finishes; cold data stays in the old format forever. So you need an active backfill that walks the whole dataset. And because the system is live, that backfill cannot assume it’s the only one writing. The pattern that makes it safe:

  • Conditional writes. Update a row only if it still holds the value you read. If a real writer changed it in the meantime, your write is rejected and you skip it — you never clobber a live update. (Most databases expose this as a compare-and-set or conditional update.)
  • Idempotent. Re-running over an already-converted row is a no-op, so a retry or a resume never double-applies.
  • Resumable. Walk the keyspace in ranges and checkpoint your position, so an interruption resumes from the last range instead of starting over.
  • Throttled. Rewriting millions of rows is I/O the live system feels: background compaction, index maintenance, replication. Throttle to a rate the production workload doesn’t notice, and accept that this makes the backfill take hours. (That throttle, not your conversion code, sets the wall-clock time. The rig around the function matters more than the function itself, in the same way a load generator is part of the experiment it drives.)
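The four properties above can be sketched together using Python's built-in `sqlite3` module. The table `kv`, the `old:`/`new:` prefixes, the `checkpoint` table, and the `on_read` hook are all hypothetical; the hook exists only so a test can simulate a live writer between the read and the write.

```python
import sqlite3
import time
from typing import Callable, List, Optional, Tuple


def convert(value: str) -> str:
    """Hypothetical conversion. Idempotent: converted values come back unchanged."""
    return "new:" + value[4:] if value.startswith("old:") else value


def backfill(
    conn: sqlite3.Connection,
    batch_size: int = 2,
    pause_s: float = 0.0,
    on_read: Optional[Callable[[List[Tuple[str, str]]], None]] = None,
) -> Tuple[int, int]:
    """Walk kv in key order; return (rows converted, rows skipped after losing a race)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), last_key TEXT)"
    )
    row = conn.execute("SELECT last_key FROM checkpoint WHERE id = 1").fetchone()
    last_key = row[0] if row else ""  # resumable; assumes keys are non-empty text
    converted = skipped = 0
    while True:
        rows = conn.execute(
            "SELECT key, value FROM kv WHERE key > ? ORDER BY key LIMIT ?",
            (last_key, batch_size),
        ).fetchall()
        if not rows:
            break
        if on_read:
            on_read(rows)  # test hook: a live writer may act here
        for key, old in rows:
            new = convert(old)
            if new == old:
                continue  # already converted: no-op
            # Conditional write: only applies if the row still holds what we read.
            cur = conn.execute(
                "UPDATE kv SET value = ? WHERE key = ? AND value = ?",
                (new, key, old),
            )
            if cur.rowcount == 1:
                converted += 1
            else:
                skipped += 1  # a real writer got there first; leave its value alone
        last_key = rows[-1][0]
        conn.execute(
            "INSERT OR REPLACE INTO checkpoint (id, last_key) VALUES (1, ?)",
            (last_key,),
        )
        conn.commit()        # progress survives an interruption
        time.sleep(pause_s)  # throttle between batches
    return converted, skipped
```

A skipped row keeps whatever the live writer stored, which may still be in the old format. In this sketch, catching those means clearing the checkpoint and running another pass, which is safe because the conversion is idempotent.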

Watch for the thing you can’t update in place

The hazard that nearly bit me: some of the data I needed to convert was part of the primary key. You generally can’t update a primary key in place. Changing it means deleting the old row and inserting a new one under the new key, and the new key can route somewhere else entirely, so this is a real data remodel, not a quiet field update. That’s a categorically harder operation than rewriting a regular column, and it’s worth finding before you plan, because it can change the whole approach (sometimes the right call is to leave key-embedded data in its old format indefinitely, if the access pattern still works).
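A toy illustration of why re-keying is a different kind of operation. The sharded store, the byte-sum router, and the key names are all invented for this sketch; real systems hash keys differently, but the consequence is the same: the new key can live on a different shard, so the change is an insert plus a delete.

```python
from typing import Dict, List, Optional

NUM_SHARDS = 4
# Toy store: each shard is a dict; the key alone decides which shard holds a row.
shards: List[Dict[str, str]] = [{} for _ in range(NUM_SHARDS)]


def route(key: str) -> int:
    """Toy router (byte sum modulo shard count). Real systems use a proper hash."""
    return sum(key.encode()) % NUM_SHARDS


def put(key: str, value: str) -> None:
    shards[route(key)][key] = value


def get(key: str) -> Optional[str]:
    return shards[route(key)].get(key)


def rekey(old_key: str, new_key: str) -> bool:
    """Move a row to a new key. Not atomic: two separate writes, possibly on two shards."""
    if old_key == new_key:
        return False
    value = get(old_key)
    if value is None:
        return False
    # Insert first, delete second: an interruption leaves a duplicate, not a loss.
    put(new_key, value)
    del shards[route(old_key)][old_key]
    return True


put("old:7", "payload")
moved = rekey("old:7", "new:7")
# The row now lives under a different key and, in this toy router, a different shard.
print(moved, route("old:7"), route("new:7"))
```

Between the insert and the delete, both keys exist, so readers and the backfill itself have to tolerate a transient duplicate. That is part of why this needs its own plan and not a line in the regular backfill.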

The general lesson: inventory which parts of your data are cheap to change in place and which are structurally fixed, because they need completely different plans, and discovering a fixed one mid-migration is how a “simple backfill” becomes an outage.

Put together, an invisible live migration is: freeze the contract, detect state intrinsically, keep every step a reversible swap until one well-marked commit point, and run a backfill that assumes a live writer is fighting you for every row. None of it is fast, and that’s correct — the speed you want here is the speed of being able to stop. If you’re planning a migration against callers you can’t pause, I’m happy to compare notes.