Pakkit.net
First-Boot Automation Races the Network

The hardest part of "configure itself on first boot" isn't the configuration — it's that the machine wakes up faster than its network does, so the automation has to wait for the network to be truly ready and then prove it before doing anything.

  • Automation
  • Networking
  • Provisioning
  • Reliability

I built a pipeline to stand up fresh machines that configure themselves on first boot — come up, get on the network, pull a playbook, run it, done. The configuration logic was the easy part. The part that ate the time was a race I should have seen coming: a machine boots fast, and its network comes up slow, and if the automation fires before the network is genuinely ready, it fails in confusing, intermittent ways. First-boot automation isn’t really a configuration problem. It’s a timing problem dressed up as one.

“Booted” and “on the network” are different events

We talk about a machine “coming up” as one event. It’s at least two, and they don’t finish together. The OS reaches a usable state quickly — processes running, disks mounted, your first-boot job eager to go. The network takes longer: an address has to be assigned or negotiated, routes established, DNS configured. For a stretch right after boot, the machine is up but not connected, and that window is exactly when a naive first-boot script charges ahead and faceplants.

The failure is maddening because it’s intermittent. Sometimes the network wins the race and everything works; sometimes the script wins and it fails. Same image, same config, different outcome depending on which finished first this time. That non-determinism is the tell that you’re looking at a startup-ordering race, not a logic bug. (It’s a cousin of “same input, different result”: when identical setups disagree, look for the hidden timing variable.)

A machine is ready to run your automation long before the network is ready to carry it. The gap between those two moments is where first-boot scripts go to die.

Order it after the network, with the init system

The first line of defense is to not start until the init system says the network is actually up. Modern init systems expose a “network is online” target precisely for this, and a first-boot job should declare that it runs after it. Don’t start “at boot” — start “after the network is online.”
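As a minimal sketch, assuming systemd is the init system (the post doesn’t name one), a unit along these lines expresses that ordering. The unit name, description, and script path are hypothetical placeholders.

```ini
# /etc/systemd/system/first-boot.service  (hypothetical name and path)
[Unit]
Description=First-boot configuration (illustrative)
# After= only orders this unit; Wants= is what pulls the target into the boot.
Wants=network-online.target
After=network-online.target

[Service]
Type=oneshot
# Hypothetical script; replace with the real first-boot entry point.
ExecStart=/usr/local/sbin/first-boot

[Install]
WantedBy=multi-user.target
```

On systemd, how long network-online.target actually waits depends on a wait-online service (for example systemd-networkd-wait-online.service or NetworkManager-wait-online.service) being enabled, so it is worth checking which one your image uses.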

That ordering is necessary and, by itself, not sufficient — which is the trap. “Network online” usually means an interface has an address and a route, but that’s a weaker promise than “the specific host I need to reach is reachable and answering right now.” Slow address assignment, DNS that isn’t warmed up yet, a route that exists but isn’t carrying traffic — plenty can still be in flux after the init system declares victory. Ordering gets you to the right starting line; it doesn’t guarantee the track is clear.

Then prove the path before you use it

So the script itself has to verify reachability before it acts — not assume the ordering was enough. The pattern that made first boot reliable was an explicit readiness gate: before fetching anything, poll the source it depends on until it’s genuinely usable, with backoff and a ceiling:

  • Resolve the name. Can I turn the source’s hostname into an address yet? DNS is frequently the last thing to come up, and the most common thing to be briefly broken right after boot.
  • Connect to it. Can I actually open a connection to the port I need — not just ping, but reach the service that has to answer?
  • Back off and retry. If either fails, wait and try again, up to a sane limit, rather than failing on the first miss. Transient unreadiness is the expected state for the first few seconds, not an error.

Only once the path is proven does the real work — fetch the playbook, run it — begin. That readiness gate is the difference between “works on a fast network, flakes on a slow one” and “works.” You’re not assuming the network is ready; you’re confirming it, which is the same instinct as verifying a live value instead of trusting that you set it.
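The gate described above can be sketched roughly as follows. This is an illustration, not the exact script from the pipeline: the function names, the default attempt count, and the delays are all assumptions, and the hostname in the trailing comment is a placeholder.

```python
import socket
import time


def check_path(host, port, timeout=5.0):
    """Return None if host resolves and accepts a TCP connection, else a reason."""
    try:
        # Step 1: resolve the name. DNS is often the last thing to settle.
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"resolve failed: {exc}"
    try:
        # Step 2: open a real connection to the service port, not just a ping.
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return f"connect failed: {exc}"


def wait_until_ready(probe, attempts=8, base_delay=1.0, max_delay=30.0,
                     sleep=time.sleep):
    """Call probe() until it returns None, backing off between misses.

    Returns True when the probe passes, False once `attempts` is exhausted.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        reason = probe()
        if reason is None:
            return True
        # A miss in the first seconds is expected, so report it and retry.
        print(f"not ready (attempt {attempt}/{attempts}): {reason}")
        if attempt < attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return False


# Example gate (placeholder hostname):
#   if not wait_until_ready(lambda: check_path("playbooks.example.internal", 443)):
#       raise SystemExit("source never became reachable")
```

Keeping the probe separate from the retry loop means the same loop can gate on something stricter later, such as an HTTP health check, without touching the backoff logic.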

Make it run exactly once, and leave a trail

Two more properties separate a toy from something you’ll trust on real machines:

  • Run once, idempotently. First-boot means first boot. A guard marker, written on success and checked on start, keeps the job from re-running on every subsequent reboot. Have the job disable itself once it completes. Otherwise “first-boot config” quietly becomes “every-boot config,” which is a surprise nobody enjoys.
  • Log where you can find it after the fact. When a machine comes up wrong, you weren’t watching — it happened unattended at 2am. A first-boot job has to write a durable log of what it did, what it waited on, and where it failed, or debugging it is pure archaeology. The whole event is invisible unless the job narrates it.
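Both properties fit in a small wrapper. The sketch below is one possible shape, assuming a marker file and a plain log file; the function name, the logger name, and the return values are illustrative choices rather than anything the original pipeline is known to use.

```python
import logging
from pathlib import Path


def run_first_boot(job, marker: Path, log_file: Path) -> str:
    """Run job() at most once; return 'skipped', 'done', or 'failed'."""
    log_file.parent.mkdir(parents=True, exist_ok=True)
    handler = logging.FileHandler(log_file)  # appends, so history survives reboots
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    log = logging.getLogger("first-boot")
    log.setLevel(logging.INFO)
    log.addHandler(handler)
    try:
        if marker.exists():
            # Guard checked on start: an earlier boot already succeeded.
            log.info("marker %s present, skipping", marker)
            return "skipped"
        log.info("starting first-boot job")
        try:
            job()
        except Exception:
            # Record the traceback; leave no marker so the next boot retries.
            log.exception("first-boot job failed; marker not written")
            return "failed"
        # Marker written only after success.
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.write_text("done\n")
        log.info("first-boot job succeeded, wrote marker %s", marker)
        return "done"
    finally:
        log.removeHandler(handler)
        handler.close()
```

Writing the marker only after the job succeeds means a failed first boot is retried on the next one instead of being silently skipped. Disabling the service itself after success is left out of this sketch.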

The general shape: wait, verify, then act

Strip away the provisioning specifics and this is a pattern that shows up anywhere automation runs in a freshly-created, not-yet-settled environment: a new container, a scaled-up instance, a cold-started function. The environment is present before it’s ready, and the cheap-but-wrong move is to assume presence implies readiness.

The reliable shape is always the same three beats: order yourself after the dependency and wait for it, verify it’s actually usable before you lean on it, and only then do the work, all while running once and logging enough to debug unattended. Build that and first boot becomes boring, which on a machine that configures itself with nobody watching is the highest praise there is. If you’ve fought your own boot-time-versus-network race and found a clean way through, I’d love to hear it.