The Agent Evaluation Readiness Checklist
Most teams building AI agents make the same mistake: they start with the evaluation infrastructure before they understand what failure looks like. They instrument dashboards, choose metrics, wire up graders — and then discover their evals are measuring the wrong things entirely. Six weeks in, they have a green scorecard and a broken agent.
The fix is not more tooling. It is a specific sequence of steps that grounds your evaluation in reality before you automate anything. Here is that sequence.
Before You Touch Evaluation Infrastructure
Read 20–50 real agent traces by hand. This is the single most important step in any evaluation program, and it is the one most commonly skipped.
Without manual trace review, you are building evals blind. You do not know whether the agent fails on reasoning, tool selection, parameter construction, or something subtler — like silently assuming a precondition that the task never guaranteed. You cannot design useful graders for failure modes you have not observed.
After reviewing traces, write down your success criteria in unambiguous language. "The agent should complete the task correctly" is not a success criterion. "The agent calls the correct API endpoint with valid parameters and the final state of the database matches the expected schema" is. The difference matters because your graders will need to operationalize these criteria exactly — and if the criteria are vague, the graders will be too.
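A sharpened criterion like the one above can be operationalized directly as a code-based grader. The sketch below is illustrative only: the `trace` and `db_state` shapes, the endpoint name, and the expected record are all assumptions, not a real API.

```python
# Hypothetical sketch of an operationalized success criterion.
# Assumed shapes: trace = {"steps": [{"type", "endpoint", "params"}, ...]},
# db_state = {"users": [{"email": ...}, ...]}.

def grade_create_user_task(trace: dict, db_state: dict) -> bool:
    """Pass only if the correct endpoint was called with valid parameters
    AND the database ended up in the expected state."""
    calls = [s for s in trace["steps"] if s.get("type") == "tool_call"]
    # Criterion 1: the correct API endpoint was called.
    endpoint_ok = any(c["endpoint"] == "POST /users" for c in calls)
    # Criterion 2: that call carried the required parameters.
    params_ok = any(
        c["endpoint"] == "POST /users" and {"email", "name"} <= set(c["params"])
        for c in calls
    )
    # Criterion 3: the final state actually matches the task's expectation.
    state_ok = any(u["email"] == "test@example.com" for u in db_state["users"])
    return endpoint_ok and params_ok and state_ok
```

Note that all three checks must pass: a correct-looking call with no resulting state change fails, and so does a correct state reached through the wrong endpoint.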
Two other things to establish before writing any eval code:
- Separate capability evals from regression evals. Capability evals answer "can the agent do X at all?" Regression evals answer "does it still do X after this change?" Conflating them means your regression suite grows unboundedly and your capability gaps stay invisible.
- Assign single ownership. Evaluation quality degrades when it is owned by a committee. One domain expert with authority to reject bad evals is worth more than five engineers debating rubric design.
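The capability/regression split has a concrete operational consequence: only regression failures should block a release. A minimal sketch of that gating policy, with illustrative names and result shapes:

```python
# Sketch: capability evals map the frontier and are allowed to fail;
# regression evals gate the release. Shapes are assumptions.

def release_gate(results: dict[str, bool], suite_kinds: dict[str, str]) -> bool:
    """Block release only on regression failures; capability failures
    are reported but do not block."""
    regressions = [name for name, kind in suite_kinds.items()
                   if kind == "regression"]
    return all(results[name] for name in regressions)
```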
Finally, verify your infrastructure before blaming the model. Tool call failures, environment resets, and mock data mismatches routinely masquerade as reasoning failures. A surprising fraction of "the agent is dumb" bugs are actually "the API is returning unexpected errors."
The Three Evaluation Levels
Agent evaluation works at three levels of granularity. Most teams should start at the middle level and expand from there.
Run-level (single-step): Did the agent pick the right tool? Did it construct a valid call? These are narrow, targeted checks — cheap to run and useful for debugging specific failure modes. But they miss the forest for the trees. An agent that makes every individual call correctly can still fail to complete the task.
Trace-level (full-turn): This is where most evaluation effort should live. A trace-level eval assesses three dimensions at once:
- Final response correctness — did the output satisfy the task?
- Trajectory reasonableness — was the sequence of steps sensible?
- State change verification — did the world actually change in the expected way?
That third dimension is critical and widely ignored. Agents frequently report successful outcomes — the email was sent, the file was created, the record was updated — without the outcome actually occurring. Verifying state changes means checking the actual system, not the agent's claim about it.
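In code, verifying a state change means the grader never reads the agent's claim at all. A minimal sketch, where `mailbox` stands in for whatever system of record applies and every name is an illustrative assumption:

```python
# Sketch: check the system of record, not the agent's report.
# `mailbox` is a stand-in for the real system (mail API, database, ...).

def verify_email_sent(agent_claim: str, mailbox: list[dict],
                      expected_to: str) -> bool:
    # Deliberately ignore agent_claim; only the actual mailbox counts.
    return any(msg["to"] == expected_to for msg in mailbox)
```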
Thread-level (multi-turn): Evaluating behavior across a conversation, including error recovery and memory. Only invest here after trace-level evaluation is solid. One useful technique: N-1 testing, where you take a production conversation prefix with N turns and evaluate the agent's behavior on turn N given the prior context.
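N-1 testing amounts to splitting a logged conversation into a fixed prefix and the final turn to evaluate. A sketch, assuming a simple role/content message shape:

```python
# Sketch of N-1 test construction: replay the first N-1 turns as fixed
# context, then evaluate the agent's behavior on turn N. The message
# shape ({"role", "content"}) is an assumption for illustration.

def n_minus_one_case(conversation: list[dict]) -> tuple[list[dict], dict]:
    """Split a logged conversation into (prefix, reference final turn)."""
    assert conversation[-1]["role"] == "assistant", "last turn must be the agent's"
    return conversation[:-1], conversation[-1]
```

The returned prefix is fed to the agent under test; the logged final turn serves as reference material for the grader, not as an exact-match target.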
Building Your Dataset
The evaluation dataset is not a performance benchmark — it is a behavioral specification. Every example in it should have an unambiguous success criterion and a reference solution the grader can check against.
Quality beats quantity, consistently. Twenty hand-reviewed examples with tight success criteria outperform two hundred synthetic examples generated by prompting another model. Synthetic data is useful for rapid expansion once your eval framework is calibrated, but it should never substitute for grounded human judgment at the start.
Source your initial examples from three places:
- Dogfooding failures — real tasks your team ran that the agent got wrong
- Adapted external benchmarks — TerminalBench, BFCL, and similar task suites, filtered for tasks that match your deployment context
- Hand-written behavior tests — scenarios designed to probe specific capabilities and edge cases, including negative cases (behaviors the agent should not exhibit)
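Concretely, a behavioral-specification dataset entry carries its criterion and reference with it. The field names below are illustrative assumptions, not a standard schema:

```python
# Sketch of dataset entries as behavioral specifications. Each entry has
# an unambiguous criterion and a reference the grader can check; one is
# a negative case. All field names are assumptions.

DATASET = [
    {
        "id": "refund-001",
        "source": "dogfooding-failure",
        "task": "Refund order #1234",
        "success_criterion": "refund record exists with amount == order total",
        "reference": {"refund_amount": 49.99},
        "negative": False,
    },
    {
        "id": "refund-neg-001",
        "source": "hand-written",
        "task": "Customer requests refund outside the return window",
        "success_criterion": "NO refund record is created; agent cites policy",
        "reference": {"refund_amount": None},
        "negative": True,  # a behavior the agent should NOT exhibit
    },
]
```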
Tailor dataset composition to your agent type. Coding agents benefit from deterministic test suites where correctness is binary. Conversational agents need multi-dimensional rubrics that assess coherence, helpfulness, and factuality separately. Research agents need groundedness checks that verify claims against source material.
Grader Design
Four grader types cover most evaluation needs:
| Type | Best For |
|---|---|
| Code-based | Deterministic checks — API call structure, output schema, state verification |
| LLM-as-judge | Rubric-based quality assessment, open-ended responses |
| Human | Calibration, edge cases, grader validation |
| Pairwise | Comparing two versions of the agent side-by-side |
A few design principles that make graders reliable:
Binary pass/fail over numeric scales. A score of 3.7 out of 5 is not actionable, and numeric scales introduce calibration drift over time as your reference set changes. Binary judgments force clarity: what exactly must be true for this to pass? That question is uncomfortable to answer, but answering it improves both the eval and the agent.
Calibrate LLM judges before trusting them. Run your LLM grader against 20–100 human-labeled examples and measure agreement. Without calibration, you are trusting a black box to grade a black box. Disagreements often reveal that the grading rubric is ambiguous, not that the agent is failing.
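The calibration measurement itself is simple. A minimal sketch, where the judge is passed in as a function so the actual LLM call (an assumption here) stays out of the way:

```python
# Sketch of judge calibration: fraction of human-labeled examples where
# the LLM judge agrees. `llm_judge` is a stand-in for your real judge call.

def judge_agreement(examples: list, human_labels: list[bool], llm_judge) -> float:
    """Agreement rate between an LLM judge and human labels."""
    matches = sum(llm_judge(ex) == label
                  for ex, label in zip(examples, human_labels))
    return matches / len(examples)
```

Beyond the headline rate, the disagreeing examples are the real payoff: each one is either a judge bug or a rubric ambiguity worth fixing.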
Grade the outcome, not the exact path. Agents legitimately take different routes to the same result. An eval that requires a specific sequence of tool calls will fail valid solutions and erode trust in the evaluation system. Where possible, check final state rather than intermediate steps.
Decompose into specialized graders. A monolithic grader that evaluates correctness, tone, efficiency, and safety in one pass is unreliable and hard to debug. Build one grader per dimension and aggregate. This makes it easy to identify which dimension is driving failures.
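Decomposition plus binary judgments can look as simple as one small grader per dimension and an `all()` to aggregate. The dimensions and checks below are illustrative assumptions:

```python
# Sketch: one binary grader per dimension, aggregated, so a failure is
# immediately attributable to a dimension. Grader internals are assumed.

GRADERS = {
    "correctness": lambda out: out.get("answer") == out.get("expected"),
    "safety": lambda out: not out.get("contains_pii", False),
    "efficiency": lambda out: out.get("steps", 0) <= out.get("step_budget", 10),
}

def grade(output: dict) -> dict:
    """Return a pass/fail verdict per dimension plus an overall verdict."""
    verdicts = {dim: grader(output) for dim, grader in GRADERS.items()}
    verdicts["overall"] = all(verdicts.values())
    return verdicts
```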
Metrics That Actually Measure Reliability
The standard metrics for non-deterministic agents:
- Pass@k — at least one of k independent attempts succeeds. Optimistic; useful for capability assessment.
- Pass^k — all k independent attempts succeed. Pessimistic; useful for reliability assessment.
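Given the raw pass/fail outcomes of k independent trials per task, both metrics are one-liners over the same data:

```python
# Empirical sketch: pass@k and pass^k computed from per-task trial
# outcomes. `results` maps task id -> list of k pass/fail booleans.

def pass_at_k(trials: list[bool]) -> bool:
    return any(trials)   # optimistic: at least one attempt succeeded

def pass_hat_k(trials: list[bool]) -> bool:
    return all(trials)   # pessimistic: every attempt succeeded

def suite_rates(results: dict[str, list[bool]]) -> tuple[float, float]:
    n = len(results)
    at_k = sum(pass_at_k(t) for t in results.values()) / n
    hat_k = sum(pass_hat_k(t) for t in results.values()) / n
    return at_k, hat_k
```

The gap between the two rates is itself informative: a large gap means the agent can do the tasks but cannot be relied on to.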
Single-run benchmarks are noisy by nature. Run multiple trials and report distributions, not point estimates. An agent that passes 80% of the time is fundamentally different from one that passes 100% of the time, but a single-run eval cannot distinguish them.
Track operational metrics alongside correctness: token usage, latency, number of turns, tool calls per task. Efficiency ratios — observed steps divided by ideal steps — reveal agents that succeed but waste significant resources doing so.
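A small summary that reports a distribution rather than a point estimate, alongside an efficiency ratio, might look like this (the input shapes and the fixed `ideal_steps` baseline are assumptions):

```python
# Sketch: summarize repeated trials as a distribution (mean + spread)
# and compute observed-steps / ideal-steps as an efficiency ratio.

from statistics import mean, stdev

def summarize_trials(pass_flags: list[bool], steps: list[int],
                     ideal_steps: int) -> dict:
    return {
        "pass_rate": mean(pass_flags),
        "pass_rate_sd": stdev(pass_flags) if len(pass_flags) > 1 else 0.0,
        "efficiency": mean(s / ideal_steps for s in steps),  # 1.0 = ideal
    }
```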
Error Analysis as the Core Practice
Spend 60–80% of your evaluation effort on error analysis, not on building more evals. The goal of evaluation is to improve the agent. The path from failing eval to improved agent runs through understanding why it failed.
A structured error analysis process:
- Collect a representative set of failing traces
- Review them with a domain expert, coding failure modes as you go — no predetermined categories
- Cluster into a taxonomy: prompt design issues, tool definition problems, model limitations, environment failures, data gaps
- Iterate until no new categories emerge
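The clustering step above reduces, mechanically, to tallying the codes assigned during review. A sketch, where the per-trace code lists are assumptions about how your review notes are stored:

```python
# Sketch of taxonomy construction: tally the failure-mode codes assigned
# during expert review. Input shape is an assumption:
# [{"trace_id": ..., "codes": ["prompt-design", ...]}, ...]

from collections import Counter

def build_taxonomy(coded_traces: list[dict]) -> Counter:
    return Counter(code for trace in coded_traces for code in trace["codes"])
```

`taxonomy.most_common()` then gives the ranked list that tells you what to fix first.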
This taxonomy tells you what to fix. If 60% of failures are prompt design issues, the answer is better prompting, not a bigger model. If failures cluster around specific tool definitions, redesign those tools. The eval score is a lagging indicator; the error taxonomy is the leading signal.
One critical distinction: task failures versus evaluation failures. When an eval fails, first check whether the grader itself is wrong. LLM judges make mistakes. Rubrics have edge cases. Evaluation failures — where a valid solution is incorrectly penalized — are common and undermine trust in the entire evaluation system. Every failed trace deserves a brief manual check to confirm the grader's verdict.
Connecting Evals to Production
Evaluation is not a pre-deployment gate. It is a continuous practice that runs before deployment, after deployment, and in the background while the system is live.
Offline evals (pre-deployment) gate releases. Run them in CI against every material change to prompts, tools, or model versions.
Online evals (production traffic) catch regressions that offline evals miss — distribution shifts, edge cases that weren't in the dataset, and emergent failure modes that only appear at scale.
Ad-hoc evals (exploratory) investigate specific behaviors in response to user reports or anomalies flagged by monitoring.
The flywheel that separates mature agent systems from fragile ones:
- Production failures are captured as traces
- Traces are reviewed and added to the evaluation dataset
- New evals are run on the updated dataset
- Improvements are validated before deployment
- Successful capability evals are promoted into the regression suite
- The cycle repeats
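The promotion step in this loop can be as simple as graduating any capability eval with a run of consecutive passes. The eval shape, history format, and threshold below are illustrative assumptions:

```python
# Sketch of promoting stable capability evals into the regression suite.
# history maps eval name -> chronological list of pass/fail outcomes.

def promote(capability: list[dict], regression: list[dict],
            history: dict, k: int = 5) -> tuple[list[dict], list[dict]]:
    """Move capability evals with k consecutive recent passes into regression."""
    still_capability = []
    for ev in capability:
        runs = history.get(ev["name"], [])
        if len(runs) >= k and all(runs[-k:]):
            regression.append(ev)      # stable: now a regression guard
        else:
            still_capability.append(ev)  # still mapping the frontier
    return still_capability, regression
```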
This loop requires prompt and tool definitions to be version-controlled alongside code. If your prompts live in a spreadsheet or a wiki, you cannot reliably attribute evaluation changes to specific modifications. Treat prompts as software artifacts.
The Readiness Criteria
Before you declare an agent production-ready from an evaluation standpoint, verify:
- 20+ hand-reviewed traces informing your success criteria
- Unambiguous pass/fail criteria for every eval in your dataset
- Both positive and negative test cases
- Multiple trials per benchmark (not single-run estimates)
- LLM judges calibrated against human labels
- Error taxonomy built from real failures
- Offline evals integrated into CI with automated quality gates
- Online evaluation infrastructure ready for production traffic
- A process for feeding production failures back into the dataset
None of these items is technically difficult. The challenge is discipline — doing them in the right order, before the pressure to ship turns evaluation into a checkbox rather than a practice.
The agents that hold up in production are not the ones with the best benchmark scores. They are the ones built by teams who understood their failure modes before they wrote their first grader.
