The Evaluation Gap Nobody Wants to Talk About
Here’s an uncomfortable question I keep asking in meetings without ever getting a satisfying answer: How do you know your AI agent is actually good?
Not “is it running?” — we can monitor uptime. Not “is it fast?” — we can track latency. I mean: is it doing a good job? Is it making decisions that help users? Is it reliable enough to trust with real business processes? And how do you measure improvement over time?
After spending the last year building evaluation frameworks for AI agents at my company, I’ve concluded that the evaluation gap is the single biggest blocker to enterprise adoption. Companies want to deploy agents. They have the technology. They have the use cases. What they don’t have is a rigorous way to answer: “Should we trust this system?”
Why Traditional ML Metrics Don’t Work
In traditional ML, evaluation is well-understood:
- Classification: accuracy, precision, recall, F1
- Regression: RMSE, MAE, R-squared
- Ranking: NDCG, MAP
- Generation: BLEU, ROUGE (flawed but standardized)
For AI agents, none of these map cleanly. An agent isn’t classifying inputs or generating translations. It’s:
- Interpreting an ambiguous goal (understanding what the user actually wants)
- Planning a sequence of actions (choosing the right tools in the right order)
- Executing those actions (interacting with real systems successfully)
- Synthesizing results (presenting useful output to the user)
- Handling failures gracefully (recovering when things go wrong)
Each of these stages needs its own evaluation criteria, and they interact in complex ways. An agent that plans perfectly but executes poorly is useless. An agent that executes well but interprets the goal wrong is dangerous.
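To make those stages concrete, here’s a minimal sketch of how a per-run trace could be represented so each stage can be scored on its own. The stage names and fields are my own illustration, not a standard schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    INTERPRET = "interpret"    # understanding what the user actually wants
    PLAN = "plan"              # choosing tools and ordering them
    EXECUTE = "execute"        # calling tools / external systems
    SYNTHESIZE = "synthesize"  # composing the final answer
    RECOVER = "recover"        # handling failures along the way


@dataclass
class StageResult:
    stage: Stage
    succeeded: bool
    notes: str = ""


@dataclass
class AgentRunTrace:
    run_id: str
    user_goal: str
    stages: list[StageResult] = field(default_factory=list)

    def failed_stages(self) -> list[Stage]:
        """Stages that went wrong in this run, useful for error attribution."""
        return [s.stage for s in self.stages if not s.succeeded]
```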
The Five Dimensions of Agent Quality
Based on our work, I’ve landed on five evaluation dimensions. No single metric captures agent quality — you need all five:
1. Task Completion Rate (TCR)
- Definition: % of user requests where the agent fully achieves the stated goal
- Challenge: “fully achieves” is subjective. We use a 5-point rubric scored by both automated evaluation (LLM-as-judge) and human reviewers
- Benchmark: Our production agents score 72% TCR with automated eval, 64% with human eval. That 8-point gap is itself a metric worth tracking.
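For concreteness, here’s a minimal sketch of how TCR and the judge-vs-human gap could be computed from rubric scores. The 5-point threshold and the example scores are illustrative assumptions, not our production setup.

```python
def task_completion_rate(rubric_scores: list[int], threshold: int = 4) -> float:
    """Share of runs whose rubric score meets the 'fully achieved' bar.

    Assumes a 1-5 rubric where scores >= threshold count as full completion.
    """
    if not rubric_scores:
        return 0.0
    completed = sum(1 for s in rubric_scores if s >= threshold)
    return completed / len(rubric_scores)


# Hypothetical scores for the same runs from an LLM judge and human reviewers.
judge_scores = [5, 4, 3, 5, 2, 4, 5, 4]
human_scores = [5, 3, 4, 4, 2, 4, 4, 3]

judge_tcr = task_completion_rate(judge_scores)
human_tcr = task_completion_rate(human_scores)

# The judge-human gap is itself worth tracking as a calibration signal.
print(f"judge TCR={judge_tcr:.0%}, human TCR={human_tcr:.0%}, gap={judge_tcr - human_tcr:+.0%}")
```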
2. Factual Grounding Score (FGS)
- Definition: % of factual claims in agent output that are supported by retrieved evidence
- Challenge: Requires claim decomposition (breaking output into individual factual claims) and evidence matching (finding supporting documents for each claim)
- Our approach: We use a separate “verifier” model to decompose claims and check grounding. It adds latency but catches hallucinations before they reach users in high-stakes workflows.
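A minimal sketch of the FGS arithmetic, assuming claim decomposition and evidence matching have already been done upstream (the data shapes here are illustrative, not the verifier’s actual output format):

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supporting_doc_ids: list[str]  # evidence matched to this claim by the verifier


def factual_grounding_score(claims: list[Claim]) -> float:
    """Fraction of factual claims backed by at least one retrieved document."""
    if not claims:
        return 1.0  # no factual claims means nothing to hallucinate
    grounded = sum(1 for c in claims if c.supporting_doc_ids)
    return grounded / len(claims)


claims = [
    Claim("The refund was issued on 2024-03-02", ["ticket_123"]),
    Claim("The customer is on the Enterprise plan", []),  # unsupported claim
]
print(f"FGS = {factual_grounding_score(claims):.0%}")  # FGS = 50%
```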
3. Efficiency Ratio (ER)
- Definition: Ratio of actual steps taken to optimal steps for the same task
- Challenge: You don’t always know the optimal path. We estimate it by having human experts complete the same tasks and counting their steps.
- Why it matters: An agent that takes 15 steps to do what a human does in 3 is burning money and creating failure opportunities at every step.
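The ratio itself is simple arithmetic; the hard part is sourcing the expert baseline. A sketch with illustrative numbers:

```python
def efficiency_ratio(agent_steps: int, expert_steps: int) -> float:
    """Actual steps divided by the expert baseline; 1.0 is ideal, higher is worse."""
    if expert_steps <= 0:
        raise ValueError("expert baseline must be at least one step")
    return agent_steps / expert_steps


# An agent that takes 15 steps where an expert takes 3 has ER = 5.0,
# meaning five times the cost and five times the failure opportunities.
print(efficiency_ratio(agent_steps=15, expert_steps=3))
```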
4. Safety Score (SS)
- Definition: % of agent runs that avoid harmful, inappropriate, or policy-violating actions
- Challenge: You can’t just measure this on successful runs. You need adversarial testing — deliberately trying to make the agent misbehave.
- Our approach: We maintain a “red team” dataset of adversarial prompts, and we run every agent change through it before deployment.
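Here’s a rough sketch of how a red-team gate could be wired into a release check. Both run_agent and violates_policy are placeholders standing in for whatever agent harness and policy classifier you use; they are not real APIs.

```python
from typing import Callable


def safety_score(
    adversarial_prompts: list[str],
    run_agent: Callable[[str], str],
    violates_policy: Callable[[str], bool],
) -> float:
    """Share of adversarial runs where the agent's output stays within policy."""
    if not adversarial_prompts:
        raise ValueError("red-team set is empty; the score would be meaningless")
    safe = sum(1 for p in adversarial_prompts if not violates_policy(run_agent(p)))
    return safe / len(adversarial_prompts)


def gate_release(score: float, minimum: float = 0.99) -> None:
    """Block deployment if the safety score falls below the agreed release bar."""
    if score < minimum:
        raise RuntimeError(f"safety score {score:.1%} below release bar {minimum:.1%}")
```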
5. User Satisfaction Proxy (USP)
- Definition: Composite of explicit feedback (thumbs up/down), implicit signals (task abandonment, follow-up questions, escalation to human), and downstream outcomes
- Challenge: Users don’t always provide feedback. We rely heavily on implicit signals, which are noisy.
- Key insight: The strongest predictor of user satisfaction isn’t task completion — it’s whether the user had to ask the agent to clarify or retry. First-attempt resolution rate is the metric that correlates most strongly with satisfaction in our data.
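A rough sketch of the first-attempt resolution calculation, using illustrative signal names rather than our actual event schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InteractionSignals:
    thumbs_up: Optional[bool]  # explicit feedback, often missing
    abandoned: bool            # user walked away mid-task
    escalated: bool            # handed off to a human
    retries: int               # times the user had to clarify or re-ask


def first_attempt_resolution_rate(interactions: list[InteractionSignals]) -> float:
    """Share of interactions resolved without any retry, abandonment, or escalation."""
    if not interactions:
        return 0.0
    resolved_first_try = sum(
        1 for i in interactions
        if i.retries == 0 and not i.abandoned and not i.escalated
    )
    return resolved_first_try / len(interactions)
```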
The Evaluation Infrastructure Problem
Measuring these five dimensions requires significant infrastructure:
- Annotation pipeline: Human reviewers scoring agent interactions (expensive but necessary for calibration)
- Automated evaluation models: LLM-as-judge systems running on every agent interaction (scalable but biased)
- A/B testing framework: Statistically rigorous comparison of agent versions (requires proper experiment design and sufficient traffic)
- Longitudinal tracking: Performance trends over time, across model updates, prompt changes, and data refreshes
Most teams I’ve talked to have invested in maybe one of these. Building all four is a multi-quarter effort that competes with feature development for resources.
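As one concrete piece of that stack, here’s a minimal sketch of comparing two agent versions with a two-proportion z-test on task completion. It assumes independent runs and enough traffic for the normal approximation; real experiment design needs more care than this.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in completion rates between two variants."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical traffic split: variant B looks better, but is the difference significant?
p_value = two_proportion_z_test(successes_a=720, n_a=1000, successes_b=755, n_b=1000)
print(f"p = {p_value:.3f}")
```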
The Standard We Need
What’s missing from the industry is a standardized agent evaluation framework — something analogous to the HELM benchmark for language models, but designed for agent-specific capabilities.
It would need to:
- Define standard evaluation tasks across common agent use cases (customer support, data analysis, workflow automation)
- Specify metrics and measurement methodology for each dimension
- Provide reference implementations for automated evaluation
- Establish baselines that enterprises can use for vendor comparison
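To make that more tangible, here’s a rough sketch of what a standardized task specification could contain. The schema is entirely hypothetical, a thought experiment rather than anything drawn from an existing benchmark.

```python
from dataclasses import dataclass, field


@dataclass
class EvalTaskSpec:
    """One standardized evaluation task, shareable across vendors and agents."""
    task_id: str
    use_case: str                  # e.g. "customer_support", "data_analysis"
    user_goal: str                 # the prompt or scenario given to the agent
    reference_outcome: str         # what a correct resolution looks like
    metrics: list[str] = field(default_factory=lambda: ["TCR", "FGS", "ER", "SS", "USP"])
    max_expert_steps: int = 3      # baseline for the efficiency ratio


example = EvalTaskSpec(
    task_id="support-refund-001",
    use_case="customer_support",
    user_goal="Process a refund for a duplicate charge and confirm it to the customer",
    reference_outcome="Refund issued, confirmation sent, ticket closed",
)
```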
Until we have this, every enterprise is building bespoke evaluation frameworks from scratch, reinventing the wheel poorly, and making adoption decisions based on vibes rather than evidence.
Is anyone else working on agent evaluation frameworks? What metrics are you using? I’d love to compare notes.