
Agent Branch Coverage: Your Eval Hits the Happy Path, Not the Planner's If-Else

8 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a 240-case eval suite against their support agent. Green across the board for six months. Then they swapped a single sentence in the planner prompt — a tone tweak — and the next day production saw a 3× spike in human-handoff requests. The eval hadn't moved. The handoff branch had simply started firing on borderline cases that used to resolve in-line, and not a single eval case sat in that borderline zone. The branch existed in the prompt. It existed in production. It did not exist in the eval.

This is the failure mode I want to name: agent branch coverage. Code-coverage tooling has been a testing staple for forty years, but agentic systems have a runtime control flow — planner branches that pick a tool, condition the response, escalate to a human, refuse to act, retry with a different strategy — and the eval suite touches only the cases the team thought to write. When eighty percent of the planner's decision branches have never executed under test, a green eval is a smoke test wearing a regression-test costume.

The Hidden Control Flow Inside Your Prompt

A modern agent prompt is a program. It's written in English, but it has branches. Read any production system prompt and you'll find them: "if the user asks about billing, hand off." "If the request is ambiguous, ask one clarifying question before acting." "If the tool returns more than 50 rows, summarize before continuing." "If you cannot verify the user's identity, refuse politely and end the turn."

Each of those is an if-else. The model is the interpreter. The eval suite is supposed to be your test harness. And the eval suite, in almost every team I've seen, was assembled by someone writing down the examples that came to mind — not by walking the prompt's branch graph and asking "which of these decisions has a test?"

The DeepEval team has been formalizing this with what they call DAG metrics — decision-tree-based scoring where each evaluation step branches on prior verdicts, mirroring the structure of the planner itself. Braintrust splits agent evaluation into trajectory checks (did the agent take the right path?) and single-step checks (was each individual decision correct?). Anthropic's "Demystifying Evals" piece argues that agent failures show up as compounding errors across turns and silent tool failures far more than as wrong final answers. All three converge on the same point: the unit of evaluation is the decision, not the response. And if you're not measuring which decisions were made, you can't tell which ones you've missed.

What MC/DC Got Right Forty Years Ago

Safety-critical software has lived with this problem since the 1970s. The aviation standard DO-178C and the automotive standard ISO 26262 don't just require statement coverage — they require modified condition/decision coverage (MC/DC), which insists that every condition in every decision be exercised both ways, and shown to independently affect the outcome. A 16-condition decision needs 17 well-chosen tests under MC/DC. The naive "test every combination" approach would need 65,536. The MC/DC discipline was invented because human intuition is terrible at picking which inputs matter; you need a structural argument, not a vibe.

Agentic systems deserve at least the entry-level version of this. Not full MC/DC — that math doesn't apply cleanly to a probabilistic interpreter — but the underlying idea: enumerate the branches, and require evidence that each one fires under test. When a branch is added in a prompt revision, treat it like a new code path: it gets a test before merge.

The high bar isn't "is your model correct." Probabilistic systems don't admit a binary answer to that. The low bar is far more useful: "has any case in the eval ever caused this branch to execute?" Zero coverage on a branch is a strict lower bound on confidence. You don't know it's wrong, but you know you've never looked.
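
Concretely, the low bar is a set difference. Here's a minimal sketch in Python, assuming each eval trace already records which planner rule fired (the next section covers one way to get that signal); the branch IDs, the eval_traces.jsonl file, and the trace fields are placeholders, not a real schema.

```python
# Minimal sketch of the zero-coverage check. Branch IDs, file name, and
# trace fields are placeholders; the only real requirement is that each
# eval trace records which planner rules it exercised.
import json
import sys

# Branches enumerated by hand from the planner prompt during review.
DECLARED_BRANCHES = {
    "BR-ESCALATE-BILLING",
    "BR-CLARIFY-AMBIGUOUS",
    "BR-REFUSE-IDENTITY",
}

def uncovered_branches(traces: list[dict]) -> set[str]:
    """Declared branches that no eval case has ever caused to fire."""
    observed = {b for t in traces for b in t.get("branches", [])}
    return DECLARED_BRANCHES - observed

if __name__ == "__main__":
    traces = [json.loads(line) for line in open("eval_traces.jsonl")]
    missing = uncovered_branches(traces)
    if missing:
        print("Never exercised under eval:", ", ".join(sorted(missing)))
        sys.exit(1)  # treat an unexercised branch like untested code: block the merge
```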

Instrument the Planner, Not Just the Output

The mechanical step is small and almost always missing. Have the planner emit, alongside its action, a branch ID that identifies the rule it just applied. "BR-ESCALATE-BILLING." "BR-CLARIFY-AMBIGUOUS." "BR-REFUSE-IDENTITY." Bake the IDs into the prompt with a structured-output requirement — most modern model APIs will happily return a JSON field called decision_path next to the user-visible response.
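
A minimal sketch of what that can look like, with call_model standing in for whatever client you already use; the rule list, the field names, and the branch IDs are illustrative, not a fixed schema.

```python
# Sketch of per-decision attribution: the planner returns a branch ID
# alongside the user-visible reply. `call_model` is a stand-in for your
# own model client; the rules and IDs below are examples only.
import json
from dataclasses import dataclass

PLANNER_RULES = """\
Decide what to do, then answer as JSON with two fields:
  "decision_path": the ID of the rule you applied
  "response": the user-visible reply

Rules:
- BR-ESCALATE-BILLING: if the user asks about billing, hand off to a human.
- BR-CLARIFY-AMBIGUOUS: if the request is ambiguous, ask one clarifying question.
- BR-REFUSE-IDENTITY: if you cannot verify the user's identity, refuse politely.
- BR-DEFAULT: otherwise, answer directly.
"""

@dataclass
class PlannerDecision:
    decision_path: str   # e.g. "BR-CLARIFY-AMBIGUOUS"
    response: str        # what the user actually sees

def plan(user_message: str, call_model) -> PlannerDecision:
    raw = call_model(system=PLANNER_RULES, user=user_message)
    data = json.loads(raw)  # most modern APIs can enforce JSON output
    return PlannerDecision(data["decision_path"], data["response"])
```

Whatever framework the agent runs on, the only requirement is that decision_path lands in the trace next to the response.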

Now your trace logs carry per-decision attribution. The eval runner can compute, after a run, the set of branch IDs each case touched. Aggregate across the suite, and you have a coverage matrix — which branches the eval exercises, and which it does not.
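
Aggregation is the easy part once the IDs are in the traces. A sketch, again assuming one JSON line per eval case with a case_id and the list of branches it fired:

```python
# Sketch of the coverage matrix: which eval cases touched which branch.
# Assumes eval_traces.jsonl holds one JSON object per case, e.g.
# {"case_id": "billing-017", "branches": ["BR-ESCALATE-BILLING"]}
import json
from collections import defaultdict

def coverage_matrix(trace_path: str) -> tuple[dict[str, set[str]], int]:
    """Map each branch ID to the eval cases that exercised it."""
    matrix: dict[str, set[str]] = defaultdict(set)
    total_cases = 0
    with open(trace_path) as f:
        for line in f:
            trace = json.loads(line)
            total_cases += 1
            for branch in trace["branches"]:
                matrix[branch].add(trace["case_id"])
    return matrix, total_cases

matrix, total = coverage_matrix("eval_traces.jsonl")
for branch, cases in sorted(matrix.items(), key=lambda kv: len(kv[1])):
    print(f"{branch:28s} {len(cases):4d} / {total} cases")
```

Any branch you declared in the prompt that never shows up as a key here is exactly the zero-coverage set from the check above.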

This is where the work gets interesting. The first time a team I advised generated this matrix, they discovered:

  • Three branches in the prompt had never fired in any of the 240 eval cases.
  • One branch fired in 192 of them — a sign of an over-eager fallback that was masking real failures upstream.
  • Two branches that the team thought were redundant turned out to be the only paths that satisfied particular user patterns.