Automation
Pay the Expensive Setup Cost Once
A tool that rebuilt the same slow SSH tunnel on every run taught me to make expensive setup persistent and shared — reused across calls, health-checked before it's trusted, and rebuilt only when it's actually dead.
- Automation
- SSH
- Performance
- Tooling
I had a command-line tool that felt sluggish, and for a while I blamed the work it was doing. I was wrong. The work was fast. The slow part was the setup it built, used once, and threw away — every single invocation. The fix wasn’t a faster setup. It was a setup I stopped repeating.
The cost was in the rebuild, not the work
The tool reached a network my machine couldn’t route to directly, so each run opened an SSH SOCKS tunnel through a jump host: authenticate, wait up to twenty seconds for the local port to come up, do one quick operation, tear the tunnel down. Run a handful of commands back to back and you pay that twenty-second tax every time, even though the actual operations took a second each.
That’s the trap with per-invocation setup: it’s invisible in a single run and infuriating across many. The tunnel wasn’t the feature; it was overhead I was re-buying on every call.
Make the expensive thing persistent and shared
The reframe: the tunnel should outlive the command that needs it. Spawn it detached so it survives the process exiting, and let the next invocation — any tool, not just the one that created it — find and reuse the live one. A tiny registry does the bookkeeping: a small on-disk index mapping “the thing I need” (a given jump host) to “where the live thing is” (its local port and process id).
First call pays the setup. Every call after it does zero setup and starts working immediately. Same tunnel, many consumers, one bill.
Never trust a cached resource blindly
Here’s where naive caching bites you: a persistent thing can die, and a stale pointer to a dead resource is worse than no cache at all. So reuse is earned, not assumed. Before handing back a pooled tunnel, it has to pass a quick liveness check:
- the process id in the registry is still alive,
- that process is actually the tunnel and not some unrelated program that inherited the id,
- the local port actually accepts a connection.
If any check fails, the entry is reaped and a fresh tunnel takes its place. The checks are cheap; the wrong assumption is expensive.
A cache without a liveness check isn’t an optimization, it’s a latent outage with good intentions.
Heal automatically, and keep it alive
A half-dead tunnel — process up, forwarding silently broken — is the nastiest state, because it passes a lazy check and then fails the real work. Two habits keep it honest: keepalives so a healthy tunnel doesn’t quietly rot behind a flaky link, and automatic restart so a dead one repairs itself without a human in the loop. The caller shouldn’t have to know whether it got a fresh tunnel or a reused one. It should just get a working one.
Shared state needs a lock
The moment a resource is shared, concurrency stops being hypothetical. Two runs firing at once — say a parallel batch job — will race to create the tunnel and can scribble over each other in the registry. A file lock around the create-or-reuse step serializes that, so the index can’t be corrupted by a photo-finish.
It’s tempting to wave this off because the shared state is “just a little JSON file.” A little JSON file is exactly the thing you’ll find corrupted at the worst possible moment. If it’s shared and mutable, it gets a lock.
Keep an escape hatch
Pooling is the right default, not a law. Sometimes you want the old throwaway behavior — a clean, isolated tunnel for debugging, or a one-off you don’t want touching the shared pool. A simple “fresh” flag that bypasses reuse keeps the optimization from turning into a cage. Optimize the common path; leave the door open for the uncommon one.
The shape shows up everywhere
None of this is specific to SSH tunnels. The same pattern — pay once, verify before reuse, heal automatically, lock the shared bit — is what database connection pools, reused auth tokens, warm caches, and persistent browser sessions in a test suite are all quietly doing. Once you’ve named it, you start spotting per-invocation setup costs all over your tooling, and most of them are removable.
The payoff isn’t just speed; it’s that fast tools get used and slow ones get avoided. If you’ve got a sluggish bit of automation and suspect the cost is in the setup it keeps rebuilding, I’m happy to compare notes — it pairs with how I think about automation safety more broadly.