
Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with the wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.

Why Agent Evaluation Is Structurally Different

Evaluating a standard LLM call is relatively straightforward. You give it input, it produces output, you score the output. Non-determinism is a nuisance but not a fundamental obstacle.

Agents break this model in multiple ways simultaneously. An agent executing a research task might take 12 steps: retrieving context, deciding what to search next, synthesizing partial results, calling external APIs, and looping back when tools return unexpected results. Each step is a decision point. A wrong turn at step 3 can cascade silently through steps 4 through 12 and still produce an output that looks reasonable.

Traditional software testing assumes deterministic behavior. Traditional ML evaluation assumes fixed input-output pairs. Agentic systems break both assumptions at once. The agent's behavior changes based on what tools return, what it retrieved, and which paths it decided to take — none of which you specified in advance.

This creates two distinct evaluation problems. First, you need to evaluate the final outcome: did the agent accomplish the task? Second, you need to evaluate the trajectory: did it accomplish the task the right way, using the right tools, at an acceptable cost, without violating any policies? A pure outcome metric passes agents that got lucky. A pure trajectory metric penalizes valid alternative approaches. You need both, weighted differently depending on what's at stake.

The Grader Spectrum: Matching Tools to the Question

Not all agent behaviors can be evaluated the same way, and trying to use one grader type for everything leads to evaluations that are either too rigid or too inconsistent to trust.

Code-based graders are the bedrock. They evaluate binary conditions: did the agent call the correct tool? Did it pass the required parameters? Did the final state of the database reflect the operation? Did the output contain the required fields? Code-based graders are fast, cheap, deterministic, and debuggable. If the grader says it failed, you know exactly why. Their weakness is brittleness — they penalize valid alternative approaches and struggle with anything requiring interpretation.
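To make this concrete, here is a minimal sketch of a code-based grader. The transcript shape and the field names (tool_calls, name, args, final_output) are illustrative assumptions, not any particular framework's format.

```python
# Minimal sketch of a code-based grader for a single agent transcript.
# The transcript shape (a dict with tool calls and a final output) is an
# illustrative assumption, not any particular framework's format.

def grade_transcript(transcript: dict) -> dict:
    results = {}
    tool_calls = transcript.get("tool_calls", [])

    # Did the agent call the required tool at all?
    results["called_lookup_tool"] = any(c["name"] == "customer_lookup" for c in tool_calls)

    # Did it pass the required parameter on that call?
    results["passed_customer_id"] = any(
        c["name"] == "customer_lookup" and "customer_id" in c.get("args", {})
        for c in tool_calls
    )

    # Does the final output contain the required fields?
    final = transcript.get("final_output", {})
    results["has_required_fields"] = all(k in final for k in ("status", "summary"))

    results["passed"] = all(results.values())
    return results
```

If the grader says a transcript failed, the per-check booleans tell you exactly which condition broke, which is the debuggability this layer buys you.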

LLM-as-judge graders cover the territory code can't reach: coherence, empathy in customer interactions, whether a research summary actually addresses the original question, whether a generated plan is logically sound. They're flexible and capture nuance, but they introduce non-determinism of their own. An LLM judge can disagree with itself across runs, and it requires careful calibration against human expert judgment before you can trust its scores. A structured rubric with explicit criteria per dimension outperforms open-ended "is this good?" prompts by a significant margin.
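A sketch of what a structured rubric can look like in code, assuming a generic call_llm function that stands in for whatever model client you use; the dimension names are placeholders:

```python
import json

# Sketch of a structured LLM-as-judge rubric. `call_llm` is a stand-in for
# whatever model client you use and is assumed to return the judge's raw text.
RUBRIC_HEADER = (
    "You are grading an agent's research summary against the user's question.\n"
    "Score each dimension from 1 (poor) to 5 (excellent) and return JSON only, e.g.\n"
    '{"addresses_question": 3, "factual_grounding": 4, "coherence": 5}\n'
)

def judge_summary(question: str, summary: str, call_llm) -> dict:
    prompt = f"{RUBRIC_HEADER}\nQuestion:\n{question}\n\nAgent summary:\n{summary}\n"
    raw = call_llm(prompt)      # the judge is itself non-deterministic; consider averaging runs
    scores = json.loads(raw)    # in practice, validate the schema and retry on parse errors
    scores["passed"] = all(v >= 4 for v in scores.values())
    return scores
```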

Human graders remain the gold standard for anything high-stakes or genuinely ambiguous. They're expensive and slow, but they're the only way to calibrate the other two. Without periodic human review of agent traces, you can't know whether your automated graders are actually measuring what you think they're measuring. The rule of thumb from practitioners who've built these systems at scale: never take eval scores at face value until someone has actually dug through a sample of the transcripts.

A practical evaluation system layers all three. Use code-based graders as the first pass — they're fast, cheap, and catch the obvious failures. Use LLM-as-judge for subjective quality dimensions. Bring in human review periodically to calibrate and catch systematic blind spots. The layers compensate for each other's weaknesses in the same way that overlapping security controls do.
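Wired together, the layering might look like the sketch below, reusing the two graders sketched above and a review_rate you would tune to your risk tolerance:

```python
import random

def evaluate(transcript: dict, question: str, summary: str, call_llm,
             human_queue: list, review_rate: float = 0.05) -> dict:
    # Layer 1: code-based checks run on every transcript; cheap, so fail fast here.
    code_result = grade_transcript(transcript)
    if not code_result["passed"]:
        return {"verdict": "fail", "layer": "code", "detail": code_result}

    # Layer 2: LLM-as-judge for the subjective quality dimensions.
    judge_result = judge_summary(question, summary, call_llm)

    # Layer 3: sample a fraction of transcripts for human calibration,
    # regardless of whether the automated layers passed them.
    if random.random() < review_rate:
        human_queue.append({"transcript": transcript, "auto_scores": judge_result})

    verdict = "pass" if judge_result["passed"] else "fail"
    return {"verdict": verdict, "layer": "llm_judge", "detail": judge_result}
```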

Grading the Math, Not Just the Essay

The most important shift in agent evaluation thinking is moving from outcome-only grading to trajectory grading. Think of it this way: when you evaluate a student's test, you want to see their work, not just their final answer. An answer of "42" might be correct or it might be a lucky guess; the work shows you which.

A transcript (also called a trace) is the complete record of what the agent did: every tool call, every intermediate result, every reasoning step, every decision point. Evaluating trajectories means asking: did the agent call the right tool at the right time? Did it handle errors gracefully when a tool returned unexpected results? Did it use the minimum number of steps, or did it waste tokens on redundant retrievals? Did it stay within policy boundaries throughout?
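A transcript does not need to be anything exotic. A minimal representation, with illustrative field names, might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str             # which tool the agent invoked
    args: dict            # the parameters it passed
    result: str           # what the tool returned, including error messages
    error: bool = False   # did the call fail?

@dataclass
class Transcript:
    task_id: str
    steps: list[ToolCall] = field(default_factory=list)  # every tool call, in order
    reasoning: list[str] = field(default_factory=list)   # intermediate reasoning, if captured
    final_output: str = ""
    total_tokens: int = 0                                 # cost accounting for the whole run
```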

For different agent architectures, trajectory evaluation looks different. A routing agent needs RouteAccuracy graders — did it send the query to the right downstream handler? A prompt-chaining pipeline needs StepByStepAccuracy — which stage in the chain failed? An orchestrator-worker system needs SubtaskCoverage graders — did the orchestrator decompose the task fully, and did each worker's output get properly synthesized?
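For a routing agent, the grader can be a straight comparison against labeled expected routes. The structure below is an illustrative assumption, not a standard API:

```python
def route_accuracy(transcripts: list[dict], expected_routes: dict[str, str]) -> float:
    """Fraction of queries the router sent to the correct downstream handler.

    Each transcript is assumed to carry the task id and the route the agent
    actually chose; expected_routes maps task id -> correct handler.
    """
    if not transcripts:
        return 0.0
    correct = sum(
        1 for t in transcripts if t["chosen_route"] == expected_routes[t["task_id"]]
    )
    return correct / len(transcripts)
```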

Rigid trajectory grading has its own failure mode: penalizing the agent for taking a valid alternative path to the correct outcome. Rules-based trajectory evaluation consistently underestimates success when it rejects valid trajectories that differ from the golden trajectory in the test set. The fix is to grade outcomes at the trajectory level — verify that key checkpoints were hit and that the final state is correct — rather than requiring exact step-by-step reproduction of a reference trace.
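One way to express "grade the checkpoints, not the exact path" in code, with hypothetical checkpoint predicates:

```python
# Checkpoint-style trajectory grading: each checkpoint is a predicate over the
# whole transcript, so the agent is free to reach it by any valid path.
CHECKPOINTS = {
    "retrieved_account_record": lambda t: any(
        c["name"] == "account_lookup" and not c.get("error") for c in t["tool_calls"]
    ),
    "no_policy_violations": lambda t: not t.get("policy_flags"),
    "final_state_correct": lambda t: t["final_state"].get("refund_issued") is True,
}

def grade_trajectory(transcript: dict) -> dict:
    results = {name: check(transcript) for name, check in CHECKPOINTS.items()}
    results["passed"] = all(results.values())
    return results
```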

Handling Non-Determinism: pass@k vs pass^k

When you run the same agent task twice, you may get two different results. This isn't a bug — it's a fundamental property of probabilistic systems. But it makes evaluation harder than in deterministic software.

Two metrics handle this differently and serve different purposes.

pass@k measures the probability that at least one of k attempts succeeds. If you run the agent three times and it succeeds on any attempt, it passes. This is appropriate for capability evaluation — you're asking "can this agent do this at all?" A per-trial success rate of 75% yields a pass@3 of roughly 98%, which tells you the capability exists but says nothing about dependability.

pass^k measures the probability that all k attempts succeed. This is the reliability metric, and it's far more demanding. That same 75% per-trial rate yields only 42% pass^3. If your agent is running in production where it gets one shot per user interaction, pass^k is what matters. An agent that fails 25% of the time is not acceptable for high-stakes workflows, even if it "usually" works.
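The arithmetic behind both metrics is worth writing down once. It assumes independent attempts, which real agent runs only approximate:

```python
def pass_at_k(per_trial: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - per_trial) ** k

def pass_hat_k(per_trial: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return per_trial ** k

# A 75% per-trial success rate looks very different through each lens:
print(round(pass_at_k(0.75, 3), 3))   # 0.984 -- "the capability exists"
print(round(pass_hat_k(0.75, 3), 3))  # 0.422 -- "don't bet a one-shot workflow on it"
```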

Most evaluation infrastructure defaults to single-trial runs. This produces optimistic numbers that don't reflect production reliability. For anything where consistency matters — automated workflows, high-stakes decisions, user-facing interactions — run multiple trials per task and report both metrics explicitly.

Building an Evaluation System That Scales

The evaluation system itself needs to be designed like production infrastructure, not an afterthought.

Start with 20-50 tasks drawn from real failures. Don't design evals from first principles based on what you think will fail. Look at what has actually failed — in production logs, in user feedback, in manual testing — and build tasks around those scenarios. Early evals built on real failures have higher signal than comprehensive coverage of hypothetical edge cases.

Write unambiguous task specifications. A task specification is bad if two domain experts, working independently, would reach different verdicts on whether the agent passed. Include reference solutions that prove the task is solvable. If you can't solve it yourself, you can't reliably grade whether the agent solved it.
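A specification that two experts would grade the same way pins down the input, the success criteria, and a reference solution. The fields below are an illustrative shape, not a standard schema:

```python
task = {
    "id": "refund-duplicate-charge-017",
    "prompt": "Customer 4481 was charged twice for order 9921. Refund the duplicate charge.",
    "success_criteria": [
        "exactly one refund is issued, for the duplicate charge amount only",
        "the refund references order 9921",
        "the customer-facing message states the refund amount and timeline",
    ],
    # Reference solution proving the task is solvable with the tools provided.
    "reference_solution": [
        {"tool": "order_lookup", "args": {"order_id": "9921"}},
        {"tool": "issue_refund", "args": {"order_id": "9921", "amount": 42.50}},
    ],
    "grader": "checkpoint",  # which grading strategy applies to this task
}
```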

Separate your offline and online evaluation strategies. Offline evaluations — run against curated datasets during development — catch regressions before deployment and let you compare agent versions systematically. Online evaluations — scoring real user interactions asynchronously in production — reveal actual user behavior and the distribution of real-world inputs, which never perfectly matches your test set. Both are necessary. Roughly 52% of teams run offline evals; only 37% monitor production with online evals. The gap is where silent regressions live.

Grade outputs and outcomes, not paths. Specify what correct looks like, not exactly how to get there. An evaluation that fails the agent for retrieving a document in a different order than the golden trace is measuring the wrong thing. An evaluation that checks whether the final database state reflects the correct transaction is measuring the right thing.
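Checking the final state instead of the path might look like this, assuming a SQLite store and a hypothetical refunds table for illustration:

```python
import sqlite3

def refund_applied_correctly(db_path: str, order_id: str, expected_amount: float) -> bool:
    """Outcome grader: ignore how the agent got here, check what the data says now."""
    conn = sqlite3.connect(db_path)
    try:
        count, total = conn.execute(
            "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM refunds WHERE order_id = ?",
            (order_id,),
        ).fetchone()
        # Exactly one refund, for the expected amount, within a cent of tolerance.
        return count == 1 and abs(total - expected_amount) < 0.01
    finally:
        conn.close()
```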

Monitor for saturation. When an eval suite hits 100% pass rate, it stops providing signal. You're measuring what the agent has already solved, not where it struggles. Refresh the suite or graduate it to a regression suite and build new capability evals targeting the next performance level.

The Pitfalls That Kill Evaluation Programs

Several failure modes appear consistently in evaluation systems that practitioners built, ran for a while, and eventually stopped trusting.

Ambiguous task specifications are the most common. When the spec is underspecified, the agent fails tasks it should have passed — the problem isn't the agent, it's the test. Teams lose trust in the eval suite and stop using it.

Unintended bypasses happen when agents find ways to satisfy the grader without actually solving the problem. An agent that learns to produce outputs that pattern-match to "correct" without doing the underlying work is worthless in production. Graders need to check actual outcomes, not surface features of the output.

Over-specification penalizes valid alternatives. If your grader checks for a specific tool call sequence and the agent solves the problem correctly via a different sequence, you get a false failure. This is especially common when teams port unit-testing habits into agent evaluation — the impulse to assert exact behavior makes sense for deterministic code and actively misleads in probabilistic systems.

Grading bugs are embarrassingly common and systematically distort results. A grader that fails "96.12" when the correct answer is "96.124991" is producing noise, not signal. Floating-point comparisons, whitespace sensitivity, and ordering assumptions are frequent offenders. Read the failing transcripts. If the failures don't look fair, the grader is the problem.
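Bugs like that numeric comparison are cheap to avoid. A sketch of a tolerant comparison, with the tolerance set per task rather than hard-coded:

```python
import math

def numbers_match(agent_answer: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """Compare a numeric answer with tolerance instead of exact string equality.

    "96.12" vs 96.124991 fails an exact match but passes here, because the
    answer is within 0.1% of the expected value.
    """
    try:
        value = float(agent_answer.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return False
    return math.isclose(value, expected, rel_tol=rel_tol)
```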

Where Evaluation Programs Succeed or Fail Organizationally

Technical soundness isn't enough. The evaluation programs that actually improve agents over time share a few organizational properties.

Ownership is explicit. Someone is responsible for the eval suite — maintaining tasks, calibrating graders, adding coverage for new capabilities, removing saturated evals. Eval suites without owners decay. Tasks go stale. Graders drift out of calibration. The suite keeps running and producing numbers that no longer mean what they used to.

Domain experts contribute tasks. The engineers who understand what the agent is supposed to do in context write better evaluation tasks than infrastructure engineers who understand evaluation tooling. The best programs treat eval contribution as a responsibility of the product and domain teams, with infrastructure teams providing the scaffolding.

Teams with strong evaluation infrastructure adopt model upgrades significantly faster — days instead of weeks — because they can verify that a new model doesn't regress existing behavior. The eval suite is the mechanism by which research capability translates into production confidence. Without it, every change is a guess.

Evaluation is not a phase you complete before shipping. It's the ongoing feedback mechanism that tells you whether the agent you deployed three months ago is still behaving the way it did when you deployed it. Real user behavior, new edge cases, and model drift all change agent performance in ways that no amount of pre-deployment testing can fully anticipate. The agents that keep working are the ones with owners who are paying attention.
