Pakkit.net

Engineering Practice

Query the Data Without Standing Up the System

When you only need to run queries against a system's data, you usually don't need to run the system — a search is a wire protocol plus a matching engine, and the matching engine often ships as a library you can point straight at a file.

I had an LDAP data export and a fistful of search filters I needed to run against it. The reflex move — the one I started to make — was to stand up a directory server, load the data in, point a client at it, and run my searches. That’s a lot of moving parts to answer a question that’s really just “which of these entries match this filter.” It turns out you almost never need the server. You need the part of the server that evaluates the query, and that part is usually sitting in a library you can call directly.

The reflex is to boot the whole thing

When you want to query a system’s data, the obvious path is to run the system. Need to test LDAP filters? Stand up a directory. Want to poke at some rows? Spin up the database. It feels correct because that’s how the data is normally served — through the running service.

But “how it’s normally served” and “what’s required to answer my question” are two different things, and conflating them is how a five-minute task becomes an afternoon of provisioning.

A search is two parts, and you only need one

An LDAP search — like a lot of queries — is really two separable things:

  • A wire protocol: open a socket, bind, send a request, get a response back. This is the part that needs a running server.
  • A matching engine: the logic that decides which entries actually satisfy the filter, with all the correct semantics — case sensitivity, substring rules, ordering, the lot.

Only the first part needs the server. The second part is pure evaluation: given these entries and this filter, which ones match? If you can feed entries to the matching engine straight from a file, you’ve skipped the entire protocol — no daemon, no socket, no port, no temporary instance quietly spun up behind the scenes.

For LDAP, the tool that does this is ldifsearch: it parses an LDIF file into entries in memory and evaluates standard LDAP search filters against them, in process. You get the results a real server would give you for the same filter, because it's the same matching logic with the server taken away.
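To make the shape of this concrete, here's a toy version of the idea in Python, using only the standard library. This is not ldifsearch and it is not standards-compliant: it assumes unfolded `attr: value` LDIF lines with no base64 values, compares everything case-insensitively, and understands only `&`, `|`, `!`, equality, presence, and `*` substrings. The sample entries and the function names (`parse_ldif`, `parse_filter`, `matches`, `search`) are made up for illustration. For real work, use the real tool.

```python
import re

# Hypothetical sample data; in practice this would be read from an export file.
LDIF = """\
dn: uid=alice,ou=people,dc=example,dc=com
objectClass: inetOrgPerson
uid: alice
mail: alice@example.com

dn: uid=bob,ou=people,dc=example,dc=com
objectClass: inetOrgPerson
uid: bob
"""

def parse_ldif(text):
    """Toy parser: blank-line-separated records of 'attr: value' lines.
    Assumption: no line folding, no base64 values, no change records."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        entry = {}
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue
            attr, _, value = line.partition(": ")
            entry.setdefault(attr.lower(), []).append(value)
        if entry:
            entries.append(entry)
    return entries

def parse_filter(s, i=0):
    """Parse one parenthesized filter starting at s[i]; return (node, next_index)."""
    if s[i] != "(":
        raise ValueError(f"expected '(' at position {i}")
    i += 1
    if s[i] in "&|!":
        op, children = s[i], []
        i += 1
        while s[i] == "(":
            child, i = parse_filter(s, i)
            children.append(child)
        if s[i] != ")":
            raise ValueError(f"expected ')' at position {i}")
        return (op, children), i + 1
    end = s.index(")", i)
    attr, _, value = s[i:end].partition("=")
    return ("=", attr.strip().lower(), value), end + 1

def matches(node, entry):
    """Pure evaluation: does this entry satisfy this parsed filter?"""
    if node[0] == "&":
        return all(matches(c, entry) for c in node[1])
    if node[0] == "|":
        return any(matches(c, entry) for c in node[1])
    if node[0] == "!":
        return not matches(node[1][0], entry)
    _, attr, pattern = node
    # '*' is the only wildcard; a bare '*' acts as a presence test.
    # Case-insensitive comparison is a simplifying assumption.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return any(re.fullmatch(regex, v, re.IGNORECASE) for v in entry.get(attr, []))

def search(ldif_text, filter_text):
    """Return the DNs of entries that match; no socket, no daemon."""
    node, _ = parse_filter(filter_text.strip())
    return [e["dn"][0] for e in parse_ldif(ldif_text) if matches(node, e)]

print(search(LDIF, "(&(objectClass=inetOrgPerson)(mail=*))"))
```

Nothing in there opens a connection. The whole "search" is a parse step and a function that returns true or false per entry, which is exactly the part a proper offline tool gives you with the real matching rules filled in.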

The tool you need is often hiding in a library

Here’s the part that saved me. I’d assumed ldifsearch was locked inside a big commercial directory product I’d have to go chase down. It isn’t — it ships in the open-source UnboundID LDAP SDK, free and on GitHub. The commercial product is built on that SDK and just bundles the same tool. I didn’t need the heavyweight package, a license, or anyone’s permission; I needed a JAR and a JRE.

Before you provision infrastructure to answer a question, find out which part of the system actually answers it. That part is often a library, not a platform.

And it’s rarely the only option. OpenDJ ships its own offline ldifsearch. In Python, ldap3’s mock strategy evaluates standard filters in memory in a few lines. The matching logic has been extracted and reused in several places, precisely because “evaluate a filter against some entries” is a useful thing to do without a directory in the loop.

“Spin up a throwaway server” is the heavier option, not the only one

To be fair: running a real instance is sometimes the right call. If you also want to test binds, access controls, or replication behavior, you need the protocol and the policy layer, so a containerized server with the data loaded is a legitimate choice. That’s testing the system.

But that’s a heavier tool for a different job. When the question is purely “do these records match this filter,” the offline matcher is faster, has fewer moving parts, and — this is underrated — coexists peacefully with whatever’s already running, because it never touches the network. It’s also the only option that works on a locked-down or air-gapped box where you can’t open a port anyway.

The general habit

This generalizes well past LDAP. A lot of “I need to run the system to check the data” problems are actually “I need the system’s evaluation logic, and that logic is available standalone”:

  • Querying structured data? A parsing library plus a query expression usually beats loading it into a running engine.
  • Validating documents against a schema? The validator is a library; you don’t need the service that normally enforces it.
  • Checking how a config or rule would be interpreted? The interpreter is often importable on its own.
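As a small sketch of that last bullet: suppose a service reads its settings with Python's standard `configparser`. You can hand the same parser the file and ask how it resolves defaults, interpolation, and booleans, without starting the service. The section names, option names, and the `interpret` helper below are hypothetical, and this only tells you the truth if the service really does use `configparser` with default settings.

```python
import configparser

# Hypothetical config text; in practice, read the real file from disk.
CONFIG_TEXT = """\
[DEFAULT]
timeout = 30

[paths]
root = /srv/app
logs = %(root)s/logs

[worker]
enabled = yes
"""

def interpret(text):
    """Ask the parser itself how it reads the config."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        # %(root)s is expanded by the parser's default interpolation.
        "logs": parser.get("paths", "logs"),
        # 'yes' is one of the strings getboolean() treats as true.
        "enabled": parser.getboolean("worker", "enabled"),
        # Not set in [worker], so the [DEFAULT] value applies.
        "timeout": parser.getint("worker", "timeout"),
    }

print(interpret(CONFIG_TEXT))
```

The other two bullets follow the same pattern: import the piece that does the evaluating, feed it the data, read the answer.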

The move is to separate the transport and lifecycle of a system from its evaluation logic, and notice that you usually only need the second one. It’s cheaper, it’s more portable, and it keeps you from standing up infrastructure whose only job was to answer one question and then get torn down.

If you’ve got a favorite “I thought I needed the whole server and I really didn’t” story, I’d enjoy hearing it — collecting those is half of how I learn which heavy tools have a light alternative hiding inside them.