When the Failure Happens Before Your Code Runs
A CI job that dies before it executes a line of your code isn't your bug — it's infrastructure, and the fix is telling transient failures apart from real ones instead of chasing your own tail.
- Automation
- CI/CD
- Reliability
- Docker
A pipeline went red and the trace pointed at a job that never ran a single line of the thing it was supposed to test. It died during setup — pulling its own container image — with an authentication error against the registry. The instinct in that moment is to start debugging your code, your config, your YAML. Don’t. A failure that happens before your code runs is almost never your code. Learning to recognize that category quickly is worth more than any single fix.
First, locate the failure in the lifecycle
Every CI job has a setup phase before your script runs: provision a runner, pull the image, mount the workspace, wire up services. When the job dies in that phase, the error has a different owner than when your test suite fails. The trace usually says so if you read to the end — “failed to pull image,” “authentication required,” “could not prepare executor.” Those are infrastructure words, not application words.
A build that fails before your script starts is reporting on the platform, not on your change. Read the trace far enough to know which one you’re looking at.
That single distinction saves hours. I’ve watched people rewrite working code because a runner couldn’t authenticate to an image registry — fixing a bug that was never in the repo. The cheapest debugging move is to ask “did my code even run?” before asking “what’s wrong with my code?”
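That first question can be sketched as a small triage helper. This is a minimal illustration, not a real CI integration: the marker phrases are the three quoted above, the function name `failure_owner` is hypothetical, and a real list would need tuning to your CI system’s exact wording.

```python
# Phrases quoted earlier in this post; extend them for your own CI system.
SETUP_PHASE_MARKERS = (
    "failed to pull image",
    "authentication required",
    "could not prepare executor",
)


def failure_owner(trace: str) -> str:
    """Guess who owns a failure from the text of the job trace.

    Returns "infrastructure" when a setup-phase marker appears anywhere in
    the trace, otherwise "unclassified" (which means: keep reading).
    """
    lowered = trace.lower()
    if any(marker in lowered for marker in SETUP_PHASE_MARKERS):
        return "infrastructure"
    return "unclassified"
```

A match only says the failure looks like a setup-phase one. It does not say whether the failure is transient, which is the next question.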
Transient or real? Look at the pattern, not the instance
The next question is whether the failure is a one-off blip or a real, persistent problem, and you answer it by looking at the history, not the single red run. I pulled the last several runs of the failing job on the same branch. Same image, same config — succeeding sometimes, failing others. That pattern is the signature of a transient problem: an intermittent auth/availability hiccup on a shared dependency, not a broken credential (which would fail every time) and not a code bug (which wouldn’t depend on the weather).
- Fails every time, same spot → a real, deterministic problem. Fix the cause.
- Fails sometimes, same spot, identical inputs → transient infrastructure. Make it survivable.
- Started failing right after your change → now it might actually be you.
Confirming “intermittent” changed the whole response. You don’t root-cause a coin flip the same way you root-cause a logic error.
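The three cases above can be written down as a rough decision rule. The sketch below assumes you already have the pass/fail results of recent runs of the same job (oldest first, identical image and config); the function name, the labels it returns, and the `first_run_with_change` parameter are all hypothetical.

```python
from typing import Optional, Sequence


def classify_history(
    passed: Sequence[bool],
    first_run_with_change: Optional[int] = None,
) -> str:
    """Classify recent runs of one job, oldest first.

    passed[i] is True when run i succeeded. first_run_with_change is the
    index of the first run that included your change, if there is one.
    """
    if not passed:
        return "no-data"
    if all(passed):
        return "healthy"
    if first_run_with_change is not None:
        before = passed[:first_run_with_change]
        after = passed[first_run_with_change:]
        # Green before the change and red ever since: look at the change first.
        if before and all(before) and after and not any(after):
            return "suspect-your-change"
    if not any(passed):
        return "deterministic"
    # A mix of passes and failures on identical inputs.
    return "transient"
```

With only a handful of runs this is a heuristic, not a proof. A history like `[True, False, True, True, False]` comes back as `"transient"`, which matches the pattern described above.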
Make pipelines self-heal — but know it’s a mitigation
For a transient infra failure, the durable repo-side move is to let the job retry itself on that specific class of failure. Many CI systems can scope a retry to “system failure” (the setup-phase kind) without blindly retrying real test failures; GitLab CI, for example, accepts failure types such as runner_system_failure under retry:when. That scoping is exactly the line you want, because you do not want to auto-retry a legitimately failing test until it flukes green.
But I want to be honest about what that retry is: a mitigation, not a cure. It stops an intermittent blip from being pipeline-fatal; it does not fix the flaky dependency underneath. I treated it as exactly that — ship the retry so people stop getting paged, and file the real work against whoever owns the shaky infrastructure. Conflating “I made the symptom stop” with “I fixed it” is how flaky systems become permanent. The retry buys calm; the root cause still owes you a fix.
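To make the scoping concrete, here is a minimal sketch of a retry that only covers one class of failure. It is plain Python rather than any CI system’s configuration, and `SetupPhaseError` and `run_with_scoped_retry` are hypothetical names invented for the illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SetupPhaseError(Exception):
    """Hypothetical marker for failures that happen before the script runs."""


def run_with_scoped_retry(
    job: Callable[[], T],
    max_retries: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Run job, retrying only when it raises SetupPhaseError.

    Any other exception (a failing test, for instance) propagates on the
    first attempt, so a real failure can never be retried into a green run.
    """
    attempt = 0
    while True:
        try:
            return job()
        except SetupPhaseError:
            if attempt >= max_retries:
                raise  # Retries exhausted: surface the infrastructure failure.
            attempt += 1
            time.sleep(delay_seconds)
```

The design choice is the narrow `except` clause: the retry budget applies to the setup-phase class only, and once it is exhausted the original error is re-raised so the underlying problem stays visible.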
Remove the fragile dependency instead of retrying around it
The better fix, when you can reach it, is to stop depending on the fragile thing at all. In my case the flaky piece was an image source whose required authentication kept hiccuping. The fix that actually held was repointing at a more reliable, anonymously readable mirror of the same public images. No auth handshake to fail means no auth failure to retry around.
That’s the general pattern: a retry tolerates an unreliable dependency, but removing the unreliable dependency is strictly better when it’s an option. Ask whether you even need the brittle path before you build cleverness to survive it. The most robust dependency is the one that has the fewest ways to say no — the same instinct behind deploying with the access you already have.
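As a rough illustration of repointing, the sketch below rewrites image references from one source prefix to another. Both prefixes are made-up placeholders (the actual registries are not named here), and in practice the change is usually a one-line edit to the image reference in your pipeline config rather than code.

```python
# Hypothetical placeholders, not real registries.
FRAGILE_PREFIX = "auth-registry.example/public/"
MIRROR_PREFIX = "mirror.example/"


def repoint_image(image: str) -> str:
    """Swap the fragile source prefix for the mirror prefix.

    References that do not start with the fragile prefix are returned
    unchanged, so the function is safe to apply to every image in a config.
    """
    if image.startswith(FRAGILE_PREFIX):
        return MIRROR_PREFIX + image[len(FRAGILE_PREFIX):]
    return image
```

This only makes sense when the mirror serves the same public images, as it did in the case described above.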
The container-in-container tax nobody warns you about
Once the image pulls finally worked, a second layer of latent failures surfaced: problems in later steps that had been hiding behind the pull failure because those steps never got a chance to run. Building containers inside CI (docker-in-docker) on a shared runner pool is genuinely fiddly, and the gotchas are worth naming because they’re universal:
- TLS/port mismatches between the daemon and client — pick plaintext-local or full-TLS deliberately, don’t half-configure it.
- The daemon can’t bind its socket when the host’s socket is already mounted in; run it listening on a TCP port instead.
- Readiness races — the daemon advertises “up” before it actually accepts connections, so poll until it really answers instead of trusting the start signal.
- Heterogeneous runners — if your job tag matches two different runner types, it’ll land on whichever and fail nondeterministically. Pin to the purpose-built pool.
- Bind mounts cross a boundary — a volume mount inside the nested daemon reads from the daemon’s filesystem, not your job’s, so copy files in over the API instead.
None of those are your application either. They’re the cost of the platform, and the lesson is the same as the headline: when CI breaks, figure out which layer broke before you touch your code.
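Of the gotchas above, the readiness race is the easiest to show in code. The sketch below polls a TCP port until a connection is actually accepted instead of trusting the start signal. The function name and default timings are assumptions, and a successful connect is only a proxy for readiness: a stricter check would send a real client request to the daemon.

```python
import socket
import time


def wait_until_accepting(
    host: str,
    port: int,
    timeout_seconds: float = 30.0,
    interval_seconds: float = 0.5,
) -> bool:
    """Poll host:port until a TCP connection succeeds or the timeout expires.

    Returns True as soon as one connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_seconds):
                return True
        except OSError:
            # Refused, unreachable, or timed out: wait briefly and try again.
            time.sleep(interval_seconds)
    return False
```

In a job you would call something like `wait_until_accepting("docker", 2375)` before the first build command, where the hostname is whatever your setup exposes the nested daemon as and 2375 is the conventional plaintext Docker port (2376 for TLS). Both values are assumptions about your configuration.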
The throughline across all of it: read the trace to find where in the lifecycle the failure lives, judge transient-vs-real from the pattern, mitigate blips but fix causes, and prefer removing a fragile dependency to engineering around it. It pairs with how I think about automation needing a panic button: the goal isn’t a pipeline that never fails, it’s one whose failures you can read at a glance. If you’ve got a CI ghost story of your own, I’m easy to reach.