Pakkit.net
Config Management Is Not a Scheduler

Provisioning, scheduling, and orchestration are three different jobs, and the moment you make your configuration tool moonlight as cron you’ve invented a single point of failure for work that was supposed to be reliable.

  • Systems Thinking
  • Automation
  • Operations
  • Reliability

I was setting up recurring maintenance work across a small fleet, the kind of routine job that needs to run on every machine on a schedule, and caught myself reaching for the obvious shortcut: fire the configuration-management tool from a cron job on a control box and let it run the task everywhere. It works in a demo. It’s also the wrong shape, and noticing why clarified something I now apply everywhere: provisioning, scheduling, and orchestration are three different jobs, and most tools are good at one of them, not all three.

Three jobs that look similar and aren’t

It’s easy to lump these together because they all “make things happen on servers.” But they have different shapes:

  • Provisioning / configuration management converges a machine to a desired state, idempotently. Run it once or ten times, you end up in the same place. It answers “is this box configured the way I declared?”
  • Scheduling triggers work on a recurring cadence. It answers “has the thing run recently, and will it run again on time?” Its whole job is time.
  • Orchestration coordinates a sequence across multiple machines — fan out, gather results, handle partial failure. It answers “did this multi-step, multi-host operation complete as a unit?”

A configuration-management tool is built for the first. It is not a clock, and it is not a workflow engine. When you press it into those roles, the seams show.
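To make "converges idempotently" concrete, here is a minimal Python sketch of what a config tool does for a single file. The function name ensure_file and the file path are hypothetical, chosen only for illustration; real tools do far more (ownership, modes, templating), but the shape is the same: compare against the declared state, change only what differs, and report whether anything changed.

```python
import tempfile
from pathlib import Path


def ensure_file(path: Path, content: str) -> bool:
    """Converge `path` to `content`. Return True only if something changed."""
    if path.exists() and path.read_text() == content:
        return False  # already in the declared state: do nothing
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return True


def demo() -> list[bool]:
    """Apply the same desired state three times in a scratch directory."""
    with tempfile.TemporaryDirectory() as scratch:
        target = Path(scratch) / "etc" / "maintenance.conf"  # hypothetical file
        return [ensure_file(target, "retention_days=7\n") for _ in range(3)]


print(demo())  # first run changes the file, later runs are no-ops
```

Notice what the function does not contain: no notion of time, and no notion of other hosts. It only answers "is this box in the state I declared?"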

Why “cron fires the config tool” is a trap

Stand up that shortcut and look at what you’ve actually built. The schedule now lives on one box — the control node, or worse, somebody’s laptop. The recurring job that was supposed to be reliable across the whole fleet now depends on:

  • that one box being awake and healthy at trigger time,
  • its network path to every target staying open,
  • whatever credentials and jump hosts it needs all being valid,
  • and a human remembering that this critical schedule lives on a machine nobody thinks of as production.

You took a job that should run independently on each machine and gave it a single point of failure that isn’t even labeled as one.

That’s the core problem: you’ve coupled the reliability of recurring work to the availability of an orchestration path that has nothing to do with the work itself. The day the control box is down for patching, the backups (or whatever the job was) silently don’t run — and silent non-execution is the exact failure mode that hurts most, because nothing alarms.
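A back-of-the-envelope calculation shows how much that coupling costs. The numbers below are made up for illustration, not measurements, and the model assumes each dependency fails independently, which is a simplification.

```python
from math import prod


def chain_availability(parts: list[float]) -> float:
    """Chance that every part is healthy at the same instant (independence assumed)."""
    return prod(parts)


# Illustrative availabilities only, not measured values.
node = 0.999                        # the target machine itself
central_path = [0.99, 0.995, 0.99]  # control box, network path, credentials/jump host

node_resident = chain_availability([node])
centrally_triggered = chain_availability([node, *central_path])

print(f"node-resident:       {node_resident:.4f}")
print(f"centrally triggered: {centrally_triggered:.4f}")
```

With these assumed figures the centrally triggered job succeeds roughly 97.4% of the time against 99.9% for the node-resident one. The arithmetic also understates the problem: when the control box is down, every node misses its run at once, so the failures are correlated rather than spread out.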

Let recurring work live where it runs

The better shape is to put the schedule on the machine that does the work. A node-resident timer means each machine is responsible for its own cadence. systemd timers are the modern, observable choice: they log to the journal, and systemctl list-timers shows when each last ran and will next run. No central trigger, no fan-out at trigger time, no single point of failure. If one node is down, only that node’s job is affected, and with Persistent=true set on a calendar timer it catches up on its own when it comes back.
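As a sketch of what that looks like on disk, the Python below renders a oneshot service and the timer that triggers it. The job name nightly-maintenance, the script path, and the 15-minute jitter are assumptions for illustration; the directives themselves (OnCalendar=, Persistent=, RandomizedDelaySec=, Type=oneshot) are standard systemd options.

```python
def render_units(name: str, exec_start: str, on_calendar: str = "daily") -> dict[str, str]:
    """Return {filename: contents} for a oneshot service and the timer that triggers it."""
    service = "\n".join([
        "[Unit]",
        f"Description={name} (scheduled job)",
        "",
        "[Service]",
        "Type=oneshot",
        f"ExecStart={exec_start}",
        "",
    ])
    timer = "\n".join([
        "[Unit]",
        f"Description=Schedule for {name}",
        "",
        "[Timer]",
        f"OnCalendar={on_calendar}",
        "Persistent=true",           # run a missed job once the machine is back up
        "RandomizedDelaySec=15min",  # jitter so the fleet does not fire all at once
        "",
        "[Install]",
        "WantedBy=timers.target",
        "",
    ])
    # A timer named X.timer activates X.service by default.
    return {f"{name}.service": service, f"{name}.timer": timer}


# Hypothetical job name and script path, for illustration only.
units = render_units("nightly-maintenance", "/usr/local/bin/nightly-maintenance")
print(units["nightly-maintenance.timer"])
```

Once the two files are in /etc/systemd/system and the timer is enabled with systemctl enable --now nightly-maintenance.timer, the node owns its own cadence, and systemctl list-timers on that node shows the last and next run.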

This is the same instinct as keeping a service’s health logic inside the service: put the responsibility where the work is, so the work doesn’t depend on a chain of external things being simultaneously healthy at one instant in time.

So use each tool for its actual job

The clean division, the one I now reach for by default:

  • Use configuration management to install and configure the job — lay down the script, the timer unit, the credentials, identically and idempotently on every machine. This is exactly what it’s great at, and it’s how the schedule itself gets deployed consistently.
  • Let a node-resident scheduler run the cadence. The config tool’s output includes the timer; the timer, living on the box, owns the recurrence.
  • Reach for orchestration only when the operation genuinely spans hosts as a unit — a coordinated rollout, a step that must gather state from everywhere before proceeding. Then a real orchestration path earns its keep. For “run the same independent job everywhere on a schedule,” it’s overkill that adds a coordinator you didn’t need.

There’s a tidy heuristic hiding in here: ad-hoc, run-it-now work is fine to drive centrally — that’s a human deciding to act once. Routine work should be autonomous on the node. “Do this now” and “do this every day forever” are different requests, and the second one should never depend on the first one’s tooling being available.

The meta-lesson is one I keep relearning: when something feels slightly awkward to wire up — like bending a config tool into a scheduler — the awkwardness is usually the tool telling you it’s the wrong tool. The fix isn’t more glue; it’s matching the job to the thing built for it. Provision with the provisioner, schedule with the scheduler, orchestrate only when you must. If you’ve untangled your own “everything runs off one cron box” setup, I’d like to hear how it went.