Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You
An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with incorrect parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent had been routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.
This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
Why Agent Evaluation Is Structurally Different
Evaluating a standard LLM call is relatively straightforward. You give it input, it produces output, you score the output. Non-determinism is a nuisance but not a fundamental obstacle.
Agents break this model in multiple ways simultaneously. An agent executing a research task might take 12 steps: retrieving context, deciding what to search next, synthesizing partial results, calling external APIs, and looping back when tools return unexpected results. Each step is a decision point. A wrong turn at step 3 can cascade silently through steps 4 through 12 and still produce an output that looks reasonable.
Traditional software testing assumes deterministic behavior. Traditional ML evaluation assumes fixed input-output pairs. Agentic systems break both assumptions at once. The agent's behavior changes based on what tools return, what it retrieved, and which paths it decided to take — none of which you specified in advance.
This creates two distinct evaluation problems. First, you need to evaluate the final outcome: did the agent accomplish the task? Second, you need to evaluate the trajectory: did it accomplish the task the right way, using the right tools, at an acceptable cost, without violating any policies? A pure outcome metric passes agents that got lucky. A pure trajectory metric penalizes valid alternative approaches. You need both, weighted differently depending on what's at stake.
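The outcome/trajectory split can be made concrete as a weighted blend. A minimal sketch, assuming both dimensions have already been scored on a 0-1 scale; the function name and the 60/40 default are illustrative, not a standard:

```python
def combined_score(outcome_score: float, trajectory_score: float,
                   outcome_weight: float = 0.6) -> float:
    """Blend outcome and trajectory scores (both on a 0-1 scale).

    The weight is a policy choice: high-stakes workflows might weight
    trajectory compliance more heavily; exploratory tasks, the outcome.
    """
    return outcome_weight * outcome_score + (1 - outcome_weight) * trajectory_score

# An agent that got the right answer via a sloppy path scores
# below one that got there cleanly.
lucky = combined_score(outcome_score=1.0, trajectory_score=0.5)
clean = combined_score(outcome_score=1.0, trajectory_score=1.0)
```

The point is not the particular formula but that both signals enter the score, so an agent can no longer pass on luck alone.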
The Grader Spectrum: Matching Tools to the Question
Not all agent behaviors can be evaluated the same way, and trying to use one grader type for everything leads to evaluations that are either too rigid or too inconsistent to trust.
Code-based graders are the bedrock. They evaluate binary conditions: did the agent call the correct tool? Did it pass the required parameters? Did the final state of the database reflect the operation? Did the output contain the required fields? Code-based graders are fast, cheap, deterministic, and debuggable. If the grader says it failed, you know exactly why. Their weakness is brittleness — they penalize valid alternative approaches and struggle with anything requiring interpretation.
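As a sketch of what a code-based grader looks like in practice: the function below checks that a specific tool was called with the required parameters. The transcript format (a list of step dicts with "type", "name", and "params" keys) is a hypothetical shape; adapt it to whatever your tracing layer actually emits:

```python
def grade_tool_call(transcript: list[dict], expected_tool: str,
                    required_params: set[str]) -> dict:
    """Deterministic grader: did the agent call the right tool with
    the required parameters? Returns a verdict plus an exact reason,
    which is what makes code-based graders debuggable."""
    calls = [s for s in transcript
             if s.get("type") == "tool_call" and s.get("name") == expected_tool]
    if not calls:
        return {"passed": False, "reason": f"{expected_tool} was never called"}
    missing = required_params - set(calls[0].get("params", {}))
    if missing:
        return {"passed": False, "reason": f"missing params: {sorted(missing)}"}
    return {"passed": True, "reason": "ok"}

transcript = [
    {"type": "tool_call", "name": "search_orders",
     "params": {"customer_id": "c_123", "status": "open"}},
]
verdict = grade_tool_call(transcript, "search_orders", {"customer_id", "status"})
```

When this grader fails, the reason string tells you exactly which check tripped, with no interpretation required.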
LLM-as-judge graders cover the territory code can't reach: coherence, empathy in customer interactions, whether a research summary actually addresses the original question, whether a generated plan is logically sound. They're flexible and capture nuance, but they introduce non-determinism of their own. An LLM judge can disagree with itself across runs, and it requires careful calibration against human expert judgment before you can trust its scores. A structured rubric with explicit criteria per dimension outperforms open-ended "is this good?" prompts by a significant margin.
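A structured rubric can be encoded directly into the judge prompt rather than asking an open-ended question. The sketch below builds such a prompt; the criteria names and wording are illustrative, and the actual LLM call is left out since it depends on your provider:

```python
# Hypothetical rubric: one explicit criterion per quality dimension.
RUBRIC = {
    "addresses_question": "Does the answer respond to the original question directly?",
    "grounded": "Is every claim supported by the retrieved sources?",
    "coherence": "Is the response logically organized, with no contradictions?",
}

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a rubric-based judge prompt: score each criterion
    independently on a fixed scale, returned as JSON."""
    criteria = "\n".join(
        f"- {name}: {desc} Score 1-5." for name, desc in RUBRIC.items()
    )
    return (
        "You are grading an agent's answer. Score each criterion "
        "independently and return JSON with one integer score per "
        f"criterion name.\n\nCriteria:\n{criteria}\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
```

Scoring each dimension separately also makes judge disagreement visible: you can track variance per criterion across runs instead of watching a single opaque score wobble.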
Human graders remain the gold standard for anything high-stakes or genuinely ambiguous. They're expensive and slow, but they're the only way to calibrate the other two. Without periodic human review of agent traces, you can't know whether your automated graders are actually measuring what you think they're measuring. The rule of thumb from practitioners who've built these systems at scale: never take eval scores at face value until someone has actually dug through a sample of the transcripts.
A practical evaluation system layers all three. Use code-based graders as the first pass — fast, cheap, and catches the obvious failures. Use LLM-as-judge for subjective quality dimensions. Bring in human review periodically to calibrate and catch systematic blind spots. The layers compensate for each other's weaknesses in the same way that overlapping security controls do.
Grading the Math, Not Just the Essay
The most important shift in agent evaluation thinking is moving from outcome-only grading to trajectory grading. Think of it this way: when you evaluate a student's test, you want to see their work, not just their final answer. An answer of "42" might be correct or it might be a lucky guess; the work shows you which.
A transcript (also called a trace) is the complete record of what the agent did: every tool call, every intermediate result, every reasoning step, every decision point. Evaluating trajectories means asking: did the agent call the right tool at the right time? Did it handle errors gracefully when a tool returned unexpected results? Did it use the minimum number of steps, or did it waste tokens on redundant retrievals? Did it stay within policy boundaries throughout?
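Several of these trajectory questions reduce to simple code over the transcript. The checks below cover two of them, redundant retrievals and an error left unhandled at the end of the trace; they assume a hypothetical transcript format of step dicts with "type", "name", and "params" keys:

```python
def trajectory_checks(transcript: list[dict]) -> dict:
    """Trajectory-level checks over a transcript. The step-dict shape
    is an assumption; adapt the field names to your tracing setup."""
    calls = [s for s in transcript if s["type"] == "tool_call"]
    # Redundant retrievals: the same tool called twice with identical params.
    seen, redundant = set(), 0
    for c in calls:
        key = (c["name"], tuple(sorted(c.get("params", {}).items())))
        if key in seen:
            redundant += 1
        seen.add(key)
    # Graceful error handling: a tool_error as the final step means the
    # agent gave up without a retry or fallback.
    unhandled = bool(transcript) and transcript[-1]["type"] == "tool_error"
    return {"step_count": len(transcript),
            "redundant_calls": redundant,
            "unhandled_error": unhandled}
```

Checks like these run in microseconds per trace, so they can gate every eval run before any LLM judge gets involved.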
For different agent architectures, trajectory evaluation looks different. A routing agent needs RouteAccuracy graders — did it send the query to the right downstream handler? A prompt-chaining pipeline needs StepByStepAccuracy — which stage in the chain failed? An orchestrator-worker system needs SubtaskCoverage graders — did the orchestrator decompose the task fully, and did each worker's output get properly synthesized?
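A RouteAccuracy-style grader can be as small as an exact-match comparison between the handler the agent chose and the expected one. A minimal sketch (the handler labels are made up; the grader names follow the ones above, but the function shape is an assumption):

```python
def route_accuracy(routed: list[str], expected: list[str]) -> float:
    """Fraction of queries sent to the correct downstream handler."""
    if len(routed) != len(expected):
        raise ValueError("routed and expected must align one-to-one")
    return sum(r == e for r, e in zip(routed, expected)) / len(expected)

# 4 of 5 queries routed correctly.
score = route_accuracy(
    ["billing", "tech", "billing", "refunds", "tech"],
    ["billing", "tech", "billing", "refunds", "billing"],
)
```

StepByStepAccuracy and SubtaskCoverage graders follow the same pattern: compare the observed stage outputs or subtask set against a reference, and report which position diverged rather than just a pass/fail.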
Rigid trajectory grading has its own failure mode: penalizing the agent for taking a valid alternative path to the correct outcome. Rules-based trajectory evaluation consistently underestimates success when it rejects valid trajectories that differ from the golden trajectory in the test set. The fix is to grade outcomes at the trajectory level — verify that key checkpoints were hit and that the final state is correct — rather than requiring exact step-by-step reproduction of a reference trace.
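Checkpoint grading can be implemented as an ordered scan: each checkpoint predicate must match somewhere in the trace, in order, but extra or rearranged steps in between are allowed. A sketch, with hypothetical checkpoint predicates over the same kind of step-dict transcript:

```python
from typing import Callable

def checkpoints_hit(transcript: list[dict],
                    checkpoints: list[Callable[[dict], bool]]) -> bool:
    """Grade a trajectory by key checkpoints, in order, instead of
    requiring an exact match against a golden reference trace."""
    i = 0
    for check in checkpoints:
        # Advance until this checkpoint is satisfied by some step.
        while i < len(transcript) and not check(transcript[i]):
            i += 1
        if i == len(transcript):
            return False  # checkpoint never hit
        i += 1
    return True

# Checkpoints: the account was fetched, then the final state is correct.
# Any detours between them are acceptable.
checkpoints = [
    lambda s: s.get("name") == "fetch_account",
    lambda s: s.get("type") == "final_state" and s.get("state_ok"),
]
```

An agent that adds an extra search step still passes; an agent that skips the account fetch fails, which is exactly the distinction exact-trace matching gets wrong.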
Handling Non-Determinism: pass@k vs pass^k
When you run the same agent task twice, you may get two different results. This isn't a bug — it's a fundamental property of probabilistic systems. But it makes evaluation harder than in deterministic software.
Two metrics handle this differently and serve different purposes.
pass@k measures the probability that at least one of k attempts succeeds. If you run the agent three times and it succeeds on any attempt, it passes. This is appropriate for capability evaluation — you're asking "can this agent do this at all?" A per-trial success rate of 75% yields a pass@3 of roughly 98%, which tells you the capability exists.
pass^k measures the probability that all k attempts succeed. This is the reliability metric, and it's far more demanding. That same 75% per-trial rate yields only 42% pass^3. If your agent is running in production where it gets one shot per user interaction, pass^k is what matters. An agent that fails 25% of the time is not acceptable for high-stakes workflows, even if it "usually" works.
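Assuming independent trials, both metrics follow directly from the per-trial success rate p, which makes the gap between them easy to compute:

```python
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k independent trials succeeds): capability."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """P(all k independent trials succeed): reliability."""
    return p ** k

# With a 75% per-trial success rate:
pass_at_k(0.75, 3)   # = 0.984375: the capability clearly exists
pass_hat_k(0.75, 3)  # = 0.421875: fewer than half of triples are all-clean
```

In practice you estimate p empirically by running n trials per task; the formulas then show how fast reliability decays as k grows, even when capability looks saturated.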
Most evaluation infrastructure defaults to single-trial runs. This produces optimistic numbers that don't reflect production reliability. For anything where consistency matters — automated workflows, high-stakes decisions, user-facing interactions — run multiple trials per task and report both metrics explicitly.
