Skip to main content

Reachability Analysis for Agent Action Spaces: Eval Coverage for the Branches You Never Tested

· 12 min read
Tian Pan
Software Engineer

The first time anyone on your team learned that the agent could call revoke_api_key was the morning a well-meaning user typed "this token feels old, can you rotate it for me?" The tool had been registered six months earlier as part of a batch import from the auth team's MCP server. It had passed schema validation, appeared in the catalog enumeration, and then sat. No eval ever invoked it. No production trace ever touched it. Then one prompt, one planner decision, and the incident channel learned the tool existed.

This is the failure mode that hides inside every agent with a non-trivial tool catalog. Forty registered functions and a planner that can compose them produce a reachable graph of plans whose long tail you have never observed. The assumption that "we tested the common paths" papers over the fact that the dangerous branch is, almost by definition, the one you never saw.

The good news is this problem is not new. Compilers have been wrestling with the analogous question — which paths through this control-flow graph are actually reachable, and which are dead — for fifty years. The discipline transfers, with adaptations, to agent action spaces. The bad news is most teams have not made the import yet, so their tool catalogs grow faster than their eval coverage and the gap between "tools we have" and "tools we have validated" silently widens.

Your Tool Catalog Is a Control-Flow Graph

Start with the underlying object. A tool catalog is not a list. It is a graph whose nodes are tools, whose edges are the planner's permissible compositions, and whose entry points are the prompts a user can plausibly send. The action space is the set of finite walks on this graph from any entry point to any terminal state.

The size of the action space is not the size of the catalog. With forty tools and an average plan depth of four, the upper bound on distinct plans is in the millions. Most of that bound is unreachable in practice — the planner's policy, the prompt distribution, and tool-output dependencies prune most edges — but "most" is doing a lot of work. The reachable subgraph is what your evals need to cover, and it is invariably much larger than what your evals actually cover.

In compiler design, reachability analysis answers the question: starting from the program entry, which basic blocks can be visited by some execution? The classical algorithm is mark-and-sweep — initially mark every block unreachable, then traverse from the entry, marking everything you touch. Anything left unmarked is dead and can be deleted. The point is not that dead code is wrong, but that it is unaudited: the compiler cannot reason about its effects, the test suite cannot exercise it, and any execution that reaches it does so by a path the analysis did not predict.

The same algorithm, applied to your agent, gives you something useful. Treat your production traces and eval traces as the set of observed walks on the action graph. Mark every (tool, predecessor, successor) triple that any trace ever exercised. The unmarked subgraph is your unaudited surface.

What "Unreachable" Actually Means in Agent Land

The compiler analogy needs one important correction. In a compiled program, unreachable code is provably never executed; the analysis is sound because the language semantics are deterministic. In an agent, the planner is a stochastic policy whose output distribution shifts every time you change a prompt, swap a model, add a tool, or update a system message. A branch that has never been observed is not unreachable. It is unobserved. The distinction matters because the only thing standing between an unobserved branch and a production invocation is one user prompt that nudges the planner toward it.

So the right model is not "reachable vs. dead" but "exercised vs. unexercised." Every unexercised branch carries a probability of being reached and a blast radius if it is. The product of those two is your risk score for the branch, and the prioritization problem for evals becomes obvious: spend coverage budget where the product is largest, not where the branches are easiest to write.

The classical software-testing literature has been making this argument for years under the name of risk-based coverage. The core finding — that risk-weighted testing delivers more defect detection per test-hour than uniform coverage — is even more true for agents, because the action space is so much larger and the cost of writing an eval per branch is so much higher. You cannot test everything. You can test the things whose failure would matter.

Building the Action Graph from Traces

Most teams do not have an explicit action graph. They have a registered tool catalog, an OpenTelemetry span stream, and an eval suite that grew organically. The first job is to derive the graph the team is implicitly running.

The construction is mechanical. From the tool catalog, enumerate the nodes. From the planner's published affordances — which tools can follow which, which require argument shapes that depend on prior outputs, which the policy layer forbids in certain combinations — derive the edges. From the trace stream, project every observed plan onto the graph and increment a visit count on each edge it traverses. From the eval suite, do the same projection and tag the edges that are covered by any eval.

Now the graph has annotations. Every edge has a production frequency, an eval-coverage flag, and a static blast-radius label that the tool team set when they registered the tool. Edges without traces are unobserved. Edges without evals but with traces are observed-but-unprotected — covered only by whatever the tool author tested, which for many tools is whatever made the integration tests pass that quarter.

The most useful artifact at this point is not a metric. It is a heat map. Lay out the graph, color the high-blast-radius edges red, the low-blast-radius edges green, and the unobserved-and-unprotected ones with a hatching pattern. The red-and-hatched cells are your eval backlog, sorted by something better than the order tools happened to be registered.

Risk-Weighted Eval Prioritization

The risk score that makes this concrete is straightforward: risk = blast_radius × probability_of_invocation. The two factors decompose naturally.

Blast radius is a property of the tool itself and is best assigned at tool registration, not at eval time. A tool that sends a webhook has a different blast radius than one that reads a configuration value, which has a different blast radius than one that writes to a payments ledger. The taxonomy can be coarse — read, write, external_side_effect, irreversible works for most catalogs — but it has to be set by someone who knows what the tool does, not inferred from the function signature. Tool authors are the right owners; the policy layer is the right enforcement point.

Probability of invocation is harder. The honest answer is that for unobserved branches, the team does not know. A workable proxy is structural similarity to observed branches: if seven plans have routed through (search → summarize → email) and a hypothetical plan would route through (search → summarize → post_to_slack), the probability is high enough that the slack branch deserves an eval. For branches with no nearby observed structure, default to a non-zero floor — say, a synthetic eval to flush out any obvious failure mode — and re-rank when production evidence accumulates.

The prioritization rule that comes out of this is: write evals first for branches whose blast radius is external_side_effect or irreversible and whose structural-similarity probability is above the floor. Everything else goes in the backlog, ordered by the same product. This sounds obvious until you compare it to what most teams actually do, which is "write evals for the branches the PM brought up in the last review."

Eval-Set Construction: Pushing the Planner Down Untraveled Branches

Risk-weighted prioritization tells you which branches to cover. The harder question is how to construct evals that actually exercise them.

Two techniques carry most of the weight. The first is synthetic adversarial prompting: hand-write or model-generate prompts engineered to push the planner toward a specific target tool or composition. This is the agent analogue of fuzzing — the goal is not to verify a normal user flow, but to confirm that when a prompt does steer toward a high-blast-radius branch, the right policy gates fire and the right tool arguments validate.

The second is planner-perturbation evals. Hold the prompt fixed and permute the tool catalog: hide a tool, rename one, swap two, add a near-duplicate with a slightly different signature. Does the planner still pick correctly? Does the policy layer still catch invocations of the renamed tool? Does the agent gracefully fall back when the expected tool is missing? These evals stress the assumption that the planner's behavior is robust to catalog drift, which is exactly what happens every time the auth team adds a new MCP server.

A third technique, less appreciated, is to invert the eval direction. Instead of asking "given this prompt, did the agent pick the right tool," ask "given this tool, what is the cheapest prompt that would invoke it." If the cheapest prompt is one a user might plausibly type, the branch deserves an eval; if the cheapest prompt is a contrived adversarial string, the branch is at least bounded. This is uncomfortable to do because it requires admitting that some branches are reachable from prompts the team had not imagined, but it is the only way to estimate the probability factor honestly.

The Catalog Accumulation Trap

There is an organizational pattern that makes all of this worse. Tool catalogs grow by accumulation. A team builds the agent with twelve tools, ships it, and the next quarter the auth team contributes three, the analytics team contributes four, the support team contributes seven. The integration is technically clean — each tool has a schema, a description, and a happy-path test. The catalog is now thirty tools. Nobody has run reachability analysis since the original twelve.

The accumulation trap shows up in three predictable ways. First, tools that nobody calls accumulate without being audited because the lack of invocations is mistaken for a lack of risk. Second, the planner's behavior changes whenever a new tool's description is written in a way that makes it a competitive match for prompts that previously routed elsewhere — a phenomenon sometimes called description drift, and it causes silent re-routing of production traffic. Third, the policy layer falls behind because each tool team writes the gate for their own tool but nobody owns the gate for compositions of tools across teams.

The fix is the same fix that compiler-driven projects converged on decades ago: the analysis has to run on every change, not once. Every tool registration triggers a recomputation of the action graph and a diff against the prior graph. New edges with high blast radius and no eval coverage block the registration. The build fails. Someone writes the eval. The catalog grows in a way the team can keep up with, instead of in a way that hides risk under the guise of integration progress.

The Architectural Conclusion: Action Spaces Are Public APIs

The deepest reason this matters is that an agent's action space is a public API. Every reachable plan is a contract with whoever sends the prompt: under these conditions, the agent will invoke this sequence of tools with these argument shapes. The contract was not written by anyone, but it is binding the moment the system serves traffic. "We never tested that branch" has the same evidentiary weight as "we never tested that REST endpoint" — which is to say, none, when the endpoint is in production and the customer's data just got mutated.

This reframing has practical consequences. API contracts get versioned; action-space changes should be versioned. API contracts get deprecation policies; tools whose registration is being removed should go through a deprecation window where the planner's policy is updated to stop selecting them before the tool is unregistered, not after. API contracts get backwards-compatibility tests; eval suites should include regression coverage for previously-validated branches whenever the catalog changes. The standard API-engineering hygiene we already know how to do applies, and the absence of it is the reason action-space coverage feels like an open problem.

The work for the next year, for most teams running non-trivial agents, is closing the loop between catalog registration and eval coverage. The mechanism is unglamorous: an action graph that is kept in version control, a coverage delta on every catalog change, a risk-weighted backlog that drives eval-writing priorities, and a policy layer that defaults to deny for branches the analysis has not seen. None of this requires a research breakthrough. It requires acknowledging that the action space is the surface area of the product and treating it accordingly.

The day a tool you registered six months ago fires for the first time in production should not be the day your incident channel learns the tool exists. With reachability analysis applied seriously, it will not be — because either the eval will have caught the branch first, or the policy layer will have refused to let the planner pick it without explicit approval. Until then, the gap between your tool catalog and your eval coverage is the size of the next incident.

References:Let's stay in touch and Follow me for more thoughts and updates