Infrastructure

Find Out Who's Using It Before You Touch It

A database cluster everyone called a non-prod test rig turned out to have a live external client writing to it around the clock. The label was wrong, and only the connection data told the truth. Before maintenance on anything shared, verify who's actually connected — don't trust the name on the box.

December 16, 2025

Infrastructure
Databases
Operations
Investigation

A maintenance plan landed on my desk for a database cluster that everyone — the ticket, the docs, the team — described as a non-production rehearsal rig, safe to snapshot and disrupt. Something made me check who was actually connected to it before I touched it. The answer: a live external application, owned by a different team, writing to it heavily and continuously, with sessions that had been up for weeks. The “test cluster” was serving a real workload. The label was wrong, and the only thing that told the truth was the connection data. That gap — between what a system is called and what it’s doing — is why “find out who’s using it before you touch it” is a rule I now apply to anything shared.

The name on the box is a claim, not a fact

Systems get labeled once and then drift. A cluster stood up for testing quietly becomes a dependency. An environment named “dev” ends up with something real pointing at it. The label reflects intent at creation time, not current reality, and nobody updates it when a new consumer shows up — because the new consumer usually doesn’t tell anyone. So “it’s just the test rig” is a hypothesis about the present based on a decision in the past. Treat it as one.

“Nobody uses that” is a hypothesis. The connection table is the experiment.

Ask the system, not the documentation

The thing about live consumers is that they leave evidence — you just have to look at the system instead of the wiki. Established connections, client/session stats, request counters, uptime: these are ground truth about who’s talking to a service right now. In my case, read-only inspection showed a specific external host holding persistent connection pools, a request count in the tens of thousands, and a write-heavy traffic pattern that had been running for weeks. None of that was in any document. All of it was one query away on the box itself.

The general technique, for almost anything shared:

A database? Look at active sessions and client stats — who’s connected, from where, doing how much.
A service or port? Look at established connections and recent access logs.
A shared file or mount? Look at open handles and recent access times.
A scheduled job or queue? Look at who’s enqueuing and consuming.

You’re not asking permission or reading intent — you’re observing behavior. The system always knows who’s using it, even when no human does.

”How much” and “how long” matter as much as “who”

It wasn’t just that a client existed — it was that the client was doing real, sustained work. A single idle connection might be a forgotten health check you can ignore. Tens of thousands of writes over a multi-week uptime is a production dependency you absolutely cannot disrupt without coordination. So the investigation isn’t just “is anything connected?” but “what is it doing, how much, and for how long?” Volume and duration tell you whether you found a ghost or a heartbeat. A heavy, long-lived writer is the strongest possible signal that the “non-prod” label is fiction.

Trace the consumer back to an owner

Finding the connection is step one; finding the human responsible is step two, and it’s what makes the discovery actionable. An unfamiliar client address means cross-referencing — reverse DNS, internal inventories, code search for the address or the credential — until you can name the team that owns it. The goal is to turn “some host is writing to this” into “this platform, owned by that team, depends on this — loop them in before any maintenance.” A dependency you can’t attribute is a dependency you can’t safely coordinate around, so the detective work of mapping connection to owner is part of the job, not a nicety.

Verify, then re-scope the work

What did the discovery change? Everything. A maintenance that was framed as “disrupt a test rig freely” became “coordinate a window with a live consumer, or — better — stand up a genuinely isolated copy to rehearse on first.” The work didn’t get cancelled; it got re-scoped to reality, which is exactly what you want to happen before you act, not after an outage. This is the operational cousin of trusting the running system over the stale config and of building the rehearsal instead of extrapolating: the map is not the territory, and a five-minute look at who’s actually connected can save you from taking down something you didn’t know was load-bearing. Before you touch anything shared, ask it who’s using it. If you’ve ever discovered a “dead” system was very much alive, I’d like to hear the story.