
The Happy Path Is the Only Path Your Agent Eval Ever Tested

· 10 min read
Tian Pan
Software Engineer

Look at where most agent eval sets come from. Someone builds the agent, demos it to the team, the demo works, and the demo script becomes the eval suite. The cases that pass review are the cases someone already watched pass. The eval set is, almost by construction, a recording of the happy path — the one tool sequence that worked the day the screenshot was taken.

So when the dashboard says the agent scores 94%, what it actually says is: it passes the cases we imagined. It says nothing about the case where the search API returns a 429 in the middle of a multi-step plan, where the user contradicts a constraint they stated two turns ago, or where retrieval comes back empty and the agent has to decide between guessing and admitting it doesn't know. Those cases aren't failing your eval. They were never in it.

This is golden-path bias, and it is the default shape of an agent eval suite unless you fight it deliberately. The fix is not more cases. It is different cases — chosen by failure mode, harvested from production, and stress-tested with deliberate faults.

Why eval sets drift toward the demo script

The bias isn't laziness. It's the path of least resistance built into how evals get written.

When you sit down to invent test cases, you write what you can picture. You picture the user asking a clear question, the agent calling the right tool, the tool returning clean data, and the agent composing a correct answer. That's a coherent story, so it's easy to write down. The case where the user asks something half-formed, gets a partial answer, reformulates, and the agent has to track the correction across three turns — that's harder to picture, so it doesn't get written.

Production data has the same skew, but for a different reason. Historical transcripts are biased toward the happy path because users who had a terrible experience often just leave. They don't file a ticket. They don't rate the conversation. They abandon it, and the trace sits in your logs looking like any other short session. If you sample transcripts uniformly, you oversample the interactions that went fine, because the ones that went badly are underrepresented by survivorship: nothing in the log marks them as failures, and the users who hit them stop generating traffic.
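One way to push back on that skew is to sample by outcome bucket instead of uniformly. The sketch below is a minimal illustration, not a prescribed method: the session fields (`turns`, `goal_completed`, `rating`) and the thresholds in `outcome_bucket` are assumptions invented for the example, and a real pipeline would need its own proxy for "this session probably went badly."

```python
import random
from collections import defaultdict


def outcome_bucket(session: dict) -> str:
    """Assign a crude outcome label. Field names and thresholds are hypothetical."""
    rating = session.get("rating")
    if rating is not None and rating <= 2:
        return "rated_bad"
    if session["turns"] <= 2 and not session["goal_completed"]:
        # Short and unfinished: the quiet abandonment that leaves no ticket.
        return "likely_abandoned"
    if not session["goal_completed"]:
        return "incomplete"
    return "completed"


def stratified_sample(sessions, per_bucket, seed=0):
    """Draw up to `per_bucket` sessions from each bucket, not uniformly overall."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for session in sessions:
        buckets[outcome_bucket(session)].append(session)

    sample = []
    for name in sorted(buckets):  # sorted so the result is reproducible
        members = buckets[name]
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample
```

With 90 completed sessions and 10 likely-abandoned ones, a uniform draw of 10 would usually contain about one bad session. Drawing five per bucket yields five of each, so the rare bucket gets reviewed instead of drowned out.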

And once an eval suite exists, it ossifies. It becomes the regression gate. Every change has to pass it, so nobody wants to add hard cases that would drop the green number. The suite that was supposed to measure quality becomes a thing you protect. New failure modes discovered in production get patched in the prompt and never make it back into the eval, because adding them would make the score look worse.

The result is a suite that exercises one tool sequence under one set of conditions, run thousands of times, reported as coverage.

"It passes" means "it passes the cases we imagined"

The honest translation of a passing eval is narrow. A 94% pass rate on a golden-path suite tells you the agent is good at the situations someone already thought to write down. It is silent on everything else.

This matters because real users do not stay on the happy path. They digress. They change topic mid-conversation. They contradict themselves. They paste in malformed input, ask three questions at once, and abandon the original goal halfway through. Berkeley's MAST taxonomy of multi-agent failures (Cemri et al., 2025) catalogued 14 distinct system-level failure modes, among them task derailment, information withholding between steps, and premature termination, and a striking number of them are invisible at the level of a single clean request. They only appear when something goes sideways.

The gap between "passes the eval" and "works in production" is exactly the set of situations the eval never represented. You can close that gap, but not by running the happy-path suite more times. Volume on the wrong distribution is just a more confident wrong answer. A suite of 500 golden-path cases is not more trustworthy than 50 — it is the same blind spot, measured to more decimal places.
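The "more decimal places" point can be made concrete with a little arithmetic. The sketch below assumes the cases are independent pass/fail trials and uses the plain binomial standard error; both are simplifying assumptions for illustration, and the function name is made up for this example.

```python
import math


def pass_rate_std_error(pass_rate: float, n_cases: int) -> float:
    """Binomial standard error of an observed pass rate over n independent cases."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)


for n in (50, 500):
    se = pass_rate_std_error(0.94, n)
    print(f"n={n}: 0.94 +/- {se:.3f}")
```

Going from 50 to 500 cases shrinks the standard error from roughly 0.034 to roughly 0.011. That is a tighter estimate of the pass rate on the distribution the suite samples from. It says nothing about inputs the suite never draws, which is where the production failures live.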

The first move is to stop counting cases and start counting failure modes.

Coverage means failure modes, not case count

Reframe the question. Instead of "how many test cases do we have," ask "which ways can this agent fail, and which of those does the suite actually exercise."

Make the list explicit. For a typical tool-using agent it includes things like:

  • Tool errors — a tool returns a 429, a 500, a timeout, or a malformed payload instead of clean data.
  • Empty results — retrieval or search comes back with nothing, and the agent must choose between hedging, asking a clarifying question, or hallucinating.
  • Contradiction — the user states a constraint, then states an incompatible one a few turns later.
  • Ambiguity — the request genuinely supports multiple readings and the right move is to ask, not to guess.
  • Out-of-scope requests — the user asks for something the agent should decline rather than attempt.
  • Multi-step derailment — step three fails and the agent has to replan instead of barreling ahead with a stale plan.
  • Long-context drift — an instruction given early gets ignored once the conversation is long.
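
Counting by failure mode can be as simple as tagging each case and tallying the tags. The following is a minimal sketch under stated assumptions: the mode identifiers mirror the list above, while `EvalCase`, `coverage_by_failure_mode`, and `uncovered_modes` are hypothetical names for this example, not part of any eval framework.

```python
from dataclasses import dataclass

# Identifiers mirror the failure modes listed above; the spellings are arbitrary.
FAILURE_MODES = (
    "tool_error",
    "empty_result",
    "contradiction",
    "ambiguity",
    "out_of_scope",
    "multi_step_derailment",
    "long_context_drift",
)


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    # Empty set means a golden-path case that exercises no failure mode.
    failure_modes: frozenset = frozenset()


def coverage_by_failure_mode(cases):
    """Count how many cases exercise each known failure mode."""
    counts = {mode: 0 for mode in FAILURE_MODES}
    for case in cases:
        for mode in case.failure_modes:
            if mode not in counts:
                raise ValueError(f"{case.case_id}: unknown failure mode {mode!r}")
            counts[mode] += 1
    return counts


def uncovered_modes(cases):
    """Return the failure modes that no case in the suite exercises."""
    counts = coverage_by_failure_mode(cases)
    return [mode for mode in FAILURE_MODES if counts[mode] == 0]


suite = [
    EvalCase("demo-lookup"),
    EvalCase("demo-booking"),
    EvalCase("search-429-mid-plan", frozenset({"tool_error", "multi_step_derailment"})),
]
print(uncovered_modes(suite))
```

The number to report from a suite like this is the length of the uncovered list, not the length of the suite. Two of the three cases above add nothing to coverage, however many times they pass.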