Skip to main content

Reachability Analysis for Agent Action Spaces: Eval Coverage for the Branches You Never Tested

· 12 min read
Tian Pan
Software Engineer

The first time anyone on your team learned that the agent could call revoke_api_key was the morning a well-meaning user typed "this token feels old, can you rotate it for me?" The tool had been registered six months earlier as part of a batch import from the auth team's MCP server. It had passed schema validation, appeared in the catalog enumeration, and then sat. No eval ever invoked it. No production trace ever touched it. Then one prompt, one planner decision, and the incident channel learned the tool existed.

This is the failure mode that hides inside every agent with a non-trivial tool catalog. Forty registered functions and a planner that can compose them produce a reachable graph of plans whose long tail you have never observed. The assumption that "we tested the common paths" papers over the fact that the dangerous branch is, almost by definition, the one you never saw.

The good news is this problem is not new. Compilers have been wrestling with the analogous question — which paths through this control-flow graph are actually reachable, and which are dead — for fifty years. The discipline transfers, with adaptations, to agent action spaces. The bad news is most teams have not made the import yet, so their tool catalogs grow faster than their eval coverage and the gap between "tools we have" and "tools we have validated" silently widens.

Your Tool Catalog Is a Control-Flow Graph

Start with the underlying object. A tool catalog is not a list. It is a graph whose nodes are tools, whose edges are the planner's permissible compositions, and whose entry points are the prompts a user can plausibly send. The action space is the set of finite walks on this graph from any entry point to any terminal state.

The size of the action space is not the size of the catalog. With forty tools and an average plan depth of four, the upper bound on distinct plans is in the millions. Most of that bound is unreachable in practice — the planner's policy, the prompt distribution, and tool-output dependencies prune most edges — but "most" is doing a lot of work. The reachable subgraph is what your evals need to cover, and it is invariably much larger than what your evals actually cover.

In compiler design, reachability analysis answers the question: starting from the program entry, which basic blocks can be visited by some execution? The classical algorithm is mark-and-sweep — initially mark every block unreachable, then traverse from the entry, marking everything you touch. Anything left unmarked is dead and can be deleted. The point is not that dead code is wrong, but that it is unaudited: the compiler cannot reason about its effects, the test suite cannot exercise it, and any execution that reaches it does so by a path the analysis did not predict.

The same algorithm, applied to your agent, gives you something useful. Treat your production traces and eval traces as the set of observed walks on the action graph. Mark every (tool, predecessor, successor) triple that any trace ever exercised. The unmarked subgraph is your unaudited surface.

What "Unreachable" Actually Means in Agent Land

The compiler analogy needs one important correction. In a compiled program, unreachable code is provably never executed; the analysis is sound because the language semantics are deterministic. In an agent, the planner is a stochastic policy whose output distribution shifts every time you change a prompt, swap a model, add a tool, or update a system message. A branch that has never been observed is not unreachable. It is unobserved. The distinction matters because the only thing standing between an unobserved branch and a production invocation is one user prompt that nudges the planner toward it.

So the right model is not "reachable vs. dead" but "exercised vs. unexercised." Every unexercised branch carries a probability of being reached and a blast radius if it is. The product of those two is your risk score for the branch, and the prioritization problem for evals becomes obvious: spend coverage budget where the product is largest, not where the branches are easiest to write.

The classical software-testing literature has been making this argument for years under the name of risk-based coverage. The core finding — that risk-weighted testing delivers more defect detection per test-hour than uniform coverage — is even more true for agents, because the action space is so much larger and the cost of writing an eval per branch is so much higher. You cannot test everything. You can test the things whose failure would matter.

Building the Action Graph from Traces

Most teams do not have an explicit action graph. They have a registered tool catalog, an OpenTelemetry span stream, and an eval suite that grew organically. The first job is to derive the graph the team is implicitly running.

The construction is mechanical. From the tool catalog, enumerate the nodes. From the planner's published affordances — which tools can follow which, which require argument shapes that depend on prior outputs, which the policy layer forbids in certain combinations — derive the edges. From the trace stream, project every observed plan onto the graph and increment a visit count on each edge it traverses. From the eval suite, do the same projection and tag the edges that are covered by any eval.

Now the graph has annotations. Every edge has a production frequency, an eval-coverage flag, and a static blast-radius label that the tool team set when they registered the tool. Edges without traces are unobserved. Edges without evals but with traces are observed-but-unprotected — covered only by whatever the tool author tested, which for many tools is whatever made the integration tests pass that quarter.

The most useful artifact at this point is not a metric. It is a heat map. Lay out the graph, color the high-blast-radius edges red, the low-blast-radius edges green, and the unobserved-and-unprotected ones with a hatching pattern. The red-and-hatched cells are your eval backlog, sorted by something better than the order tools happened to be registered.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates