How Do You Evaluate an AI Agent? The Metrics Problem That's Holding Back Enterprise Adoption

The Evaluation Gap Nobody Wants to Talk About

Here’s an uncomfortable question I keep asking in meetings without ever getting a satisfying answer: How do you know your AI agent is actually good?

Not “is it running?” — we can monitor uptime. Not “is it fast?” — we can track latency. I mean: is it doing a good job? Is it making decisions that help users? Is it reliable enough to trust with real business processes? And how do you measure improvement over time?

After spending the last year building evaluation frameworks for AI agents at my company, I’ve concluded that the evaluation gap is the single biggest blocker to enterprise adoption. Companies want to deploy agents. They have the technology. They have the use cases. What they don’t have is a rigorous way to answer: “Should we trust this system?”

Why Traditional ML Metrics Don’t Work

In traditional ML, evaluation is well-understood:

  • Classification: accuracy, precision, recall, F1
  • Regression: RMSE, MAE, R-squared
  • Ranking: NDCG, MAP
  • Generation: BLEU, ROUGE (flawed but standardized)

For AI agents, none of these map cleanly. An agent isn’t classifying inputs or generating translations. It’s:

  1. Interpreting an ambiguous goal (understanding what the user actually wants)
  2. Planning a sequence of actions (choosing the right tools in the right order)
  3. Executing those actions (interacting with real systems successfully)
  4. Synthesizing results (presenting useful output to the user)
  5. Handling failures gracefully (recovering when things go wrong)

Each of these stages needs its own evaluation criteria, and they interact in complex ways. An agent that plans perfectly but executes poorly is useless. An agent that executes well but interprets the goal wrong is dangerous.

The Five Dimensions of Agent Quality

Based on our work, I’ve landed on five evaluation dimensions. No single metric captures agent quality — you need all five:

1. Task Completion Rate (TCR)

  • Definition: % of user requests where the agent fully achieves the stated goal
  • Challenge: “fully achieves” is subjective. We use a 5-point rubric scored by both automated evaluation (LLM-as-judge) and human reviewers
  • Benchmark: Our production agents score 72% TCR with automated eval, 64% with human eval. That 8-point gap is itself a metric worth tracking.
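
For what it’s worth, the roll-up from rubric scores to TCR is simple. Here’s a minimal sketch where the 4-out-of-5 threshold for “fully achieves” and the example scores are purely illustrative, not our production values:

```python
def task_completion_rate(rubric_scores: list[int], threshold: int = 4) -> float:
    """Share of interactions whose 1-5 rubric score clears the 'fully achieves' bar.

    The threshold of 4 is an arbitrary choice; calibrate it against human review.
    """
    return sum(score >= threshold for score in rubric_scores) / len(rubric_scores)

# Hypothetical scores for the same ten interactions from both channels.
auto_scores  = [5, 4, 2, 5, 3, 4, 4, 1, 5, 4]
human_scores = [4, 4, 2, 5, 3, 3, 4, 1, 4, 4]

auto_tcr, human_tcr = task_completion_rate(auto_scores), task_completion_rate(human_scores)
# The judge-vs-human gap is its own signal: if it widens, the LLM judge has
# drifted from the human rubric.
print(f"TCR automated: {auto_tcr:.0%} | human: {human_tcr:.0%} | gap: {auto_tcr - human_tcr:+.0%}")
```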

2. Factual Grounding Score (FGS)

  • Definition: % of factual claims in agent output that are supported by retrieved evidence
  • Challenge: Requires claim decomposition (breaking output into individual factual claims) and evidence matching (finding supporting documents for each claim)
  • Our approach: We use a separate “verifier” model to decompose claims and check grounding. It adds latency but catches hallucinations before they reach users in high-stakes workflows.
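
A minimal sketch of that two-stage check. The `VerifierClient` protocol, the prompts, and the YES/NO parsing are placeholders for whatever verifier model you run, not a real SDK:

```python
from typing import Protocol

class VerifierClient(Protocol):
    def complete(self, prompt: str) -> str: ...

def decompose_claims(verifier: VerifierClient, agent_output: str) -> list[str]:
    # Stage 1: break the output into atomic factual claims, one per line.
    prompt = (
        "List every standalone factual claim in the text below, one per line.\n\n"
        f"TEXT:\n{agent_output}"
    )
    return [line.strip() for line in verifier.complete(prompt).splitlines() if line.strip()]

def is_grounded(verifier: VerifierClient, claim: str, evidence: list[str]) -> bool:
    # Stage 2: does any retrieved document support the claim?
    prompt = (
        "Does the evidence support the claim? Answer YES or NO.\n\n"
        f"CLAIM: {claim}\n\nEVIDENCE:\n" + "\n---\n".join(evidence)
    )
    return verifier.complete(prompt).strip().upper().startswith("YES")

def factual_grounding_score(verifier: VerifierClient, agent_output: str, evidence: list[str]) -> float:
    claims = decompose_claims(verifier, agent_output)
    if not claims:
        return 1.0  # nothing factual to check
    return sum(is_grounded(verifier, c, evidence) for c in claims) / len(claims)
```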

3. Efficiency Ratio (ER)

  • Definition: Ratio of actual steps taken to optimal steps for the same task
  • Challenge: You don’t always know the optimal path. We estimate it by having human experts complete the same tasks and counting their steps.
  • Why it matters: An agent that takes 15 steps to do what a human does in 3 is burning money and creating failure opportunities at every step.
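
The arithmetic itself is a one-liner; the only real decision is how to aggregate per-task ratios. A toy sketch with made-up step counts (the median is my choice here, to keep one runaway trajectory from dominating the score):

```python
from statistics import median

def efficiency_ratio(agent_steps: list[int], expert_steps: list[int]) -> float:
    """Median of per-task (agent steps / expert steps); 1.0 means human parity."""
    return median(a / e for a, e in zip(agent_steps, expert_steps))

# The same four tasks, counted for the agent and for the human-expert baseline.
print(efficiency_ratio(agent_steps=[6, 15, 4, 9], expert_steps=[3, 3, 4, 5]))  # 1.9
```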

4. Safety Score (SS)

  • Definition: % of agent runs that avoid harmful, inappropriate, or policy-violating actions
  • Challenge: You can’t just measure this on successful runs. You need adversarial testing — deliberately trying to make the agent misbehave.
  • Our approach: We maintain a “red team” dataset of adversarial prompts, and we run every agent change through it before deployment.
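
A sketch of what that red-team gate can look like. `run_agent` and `violates_policy` are placeholders for your agent entry point and whatever safety check you trust:

```python
from typing import Callable

def safety_score(
    adversarial_prompts: list[str],
    run_agent: Callable[[str], str],
    violates_policy: Callable[[str, str], bool],
) -> float:
    """Share of adversarial runs with no harmful or policy-violating behavior.

    `run_agent` returns the full transcript for a prompt; `violates_policy`
    is whatever check you rely on (rules, classifier, LLM judge). Both are
    placeholders here.
    """
    safe = sum(
        not violates_policy(prompt, run_agent(prompt))
        for prompt in adversarial_prompts
    )
    return safe / len(adversarial_prompts)

# Gate releases on this: fail the pipeline when the score drops below the
# previous release's baseline rather than an arbitrary absolute number.
```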

5. User Satisfaction Proxy (USP)

  • Definition: Composite of explicit feedback (thumbs up/down), implicit signals (task abandonment, follow-up questions, escalation to human), and downstream outcomes
  • Challenge: Users don’t always provide feedback. We rely heavily on implicit signals, which are noisy.
  • Key insight: The strongest predictor of user satisfaction isn’t task completion — it’s whether the user had to ask the agent to clarify or retry. First-attempt resolution rate is the metric that correlates most strongly with satisfaction in our data.
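
Here’s a rough sketch of how those signals can be combined. The field names, the 0.3/0.7 weights, and the first-attempt definition are illustrative assumptions, not calibrated values:

```python
from dataclasses import dataclass

@dataclass
class InteractionSignals:
    thumbs_up: bool | None   # explicit feedback; None means the user gave none
    abandoned: bool          # user left before the task finished
    escalated: bool          # handed off to a human
    retries: int             # clarify/retry turns the user had to send

def first_attempt_resolved(s: InteractionSignals) -> bool:
    # The signal that correlates most strongly with satisfaction in our data:
    # the task finished with no retry, no clarification loop, no escalation.
    return not s.abandoned and not s.escalated and s.retries == 0

def usp(interactions: list[InteractionSignals]) -> float:
    """Crude composite proxy; the 0.3/0.7 weights are illustrative, not calibrated."""
    total = 0.0
    for s in interactions:
        explicit = {True: 1.0, False: 0.0, None: 0.5}[s.thumbs_up]
        implicit = 1.0 if first_attempt_resolved(s) else 0.0
        total += 0.3 * explicit + 0.7 * implicit
    return total / len(interactions)
```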

The Evaluation Infrastructure Problem

Measuring these five dimensions requires significant infrastructure:

  • Annotation pipeline: Human reviewers scoring agent interactions (expensive but necessary for calibration)
  • Automated evaluation models: LLM-as-judge systems running on every agent interaction (scalable but biased)
  • A/B testing framework: Statistically rigorous comparison of agent versions (requires proper experiment design and sufficient traffic)
  • Longitudinal tracking: Performance trends over time, across model updates, prompt changes, and data refreshes

Most teams I’ve talked to have invested in maybe one of these. Building all four is a multi-quarter effort that competes with feature development for resources.

The Standard We Need

What’s missing from the industry is a standardized agent evaluation framework — something analogous to the HELM benchmark for language models, but designed for agent-specific capabilities.

It would need to:

  • Define standard evaluation tasks across common agent use cases (customer support, data analysis, workflow automation)
  • Specify metrics and measurement methodology for each dimension
  • Provide reference implementations for automated evaluation
  • Establish baselines that enterprises can use for vendor comparison

Until we have this, every enterprise is building bespoke evaluation frameworks from scratch, reinventing the wheel poorly, and making adoption decisions based on vibes rather than evidence.

Is anyone else working on agent evaluation frameworks? What metrics are you using? I’d love to compare notes.

Rachel, this framework is excellent. But I want to add a dimension that’s missing from a product/business perspective: business outcome metrics.

Technical Metrics vs. Business Metrics

Your five dimensions measure agent quality from a technical standpoint. But when I’m in a board meeting justifying our agent investment, the questions aren’t about grounding scores or efficiency ratios. They’re about:

  • Revenue impact: Did the agent-assisted sales process close more deals?
  • Cost reduction: Did the agent reduce support ticket volume by the amount we projected?
  • Time savings: Did the agent reduce the time-to-resolution for customer issues?
  • Retention impact: Did customers who interacted with the agent have higher NPS scores?
  • Operational leverage: Can we handle 3x the volume without 3x the headcount?

These are the metrics that determine whether the agent program gets more investment or gets killed.

The Attribution Problem

The challenge is attribution. When the AI agent helps resolve a support ticket, did the resolution happen because of the agent, despite the agent, or would it have happened anyway with a human rep handling it?

We’ve been running controlled experiments:

  1. Agent-assisted group: Users interact with the AI agent first, with human fallback
  2. Human-only group: Users go directly to human support
  3. Comparison metrics: Resolution time, resolution rate, customer satisfaction, cost per ticket

Early results from our experiment (N=2,400 tickets over 6 weeks):

| Metric | Agent-Assisted | Human-Only | Delta |
| --- | --- | --- | --- |
| Median resolution time | 4.2 min | 18.7 min | -77% |
| First-contact resolution | 61% | 74% | -13 pts |
| Customer satisfaction | 3.8/5 | 4.2/5 | -9.5% |
| Cost per ticket | $2.40 | $11.80 | -80% |
| Escalation rate | 39% | 0% | +39 pts |

The numbers tell a mixed story. The agent is dramatically cheaper and faster, but it resolves fewer issues on first contact, has lower customer satisfaction, and 39% of interactions still need human escalation.
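
One caveat before reading too much into any single row: check that the deltas clear statistical noise. A minimal two-proportion z-test for the first-contact resolution gap, assuming (since the split isn’t stated above) that the 2,400 tickets divided evenly between the two arms:

```python
from math import erfc, sqrt

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in proportions; returns (z, p_value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, erfc(abs(z) / sqrt(2))  # erfc(|z| / sqrt(2)) is the two-sided p-value

# 61% vs 74% first-contact resolution, assuming an even 1,200 / 1,200 split.
z, p = two_proportion_z(successes_a=732, n_a=1200, successes_b=888, n_b=1200)
print(f"z = {z:.1f}, p = {p:.1e}")  # a 13-point gap at this sample size is not noise
```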

The business question isn’t “is the agent good?” It’s “is this tradeoff acceptable?” And the answer depends on the use case, the customer segment, and the company’s strategy.

What I’d Add to Your Framework

A sixth dimension: Business Value Score (BVS)

  • Definition: Net business impact of agent deployment, measured as the delta between agent-assisted and baseline outcomes on key business metrics
  • Measurement: Requires controlled experiments (A/B tests at the business level, not just the model level)
  • Key insight: BVS can be positive even when technical metrics are mediocre. A 60% TCR agent that handles 80% of volume at 20% of the cost might be a better business outcome than a 90% TCR human team.
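
To make that last point concrete, here’s the back-of-the-envelope version with hypothetical numbers ($12 human cost per ticket, $2.40 agent cost, 80% of volume routed to the agent first, agent failures escalating to humans):

```python
# Hypothetical blended cost per RESOLVED ticket: mediocre agent absorbing most
# of the volume vs. an all-human team. All numbers are illustrative.

tickets = 10_000
human_cost, agent_cost = 12.00, 2.40   # cost per ticket handled
human_tcr, agent_tcr = 0.90, 0.60      # task completion rates

agent_attempted = 0.80 * tickets                    # agent takes 80% of volume first
escalated = agent_attempted * (1 - agent_tcr)       # agent failures go to humans
human_handled = 0.20 * tickets + escalated

blended_cost = agent_attempted * agent_cost + human_handled * human_cost
blended_resolved = agent_attempted * agent_tcr + human_handled * human_tcr

print(f"Agent-first: ${blended_cost / blended_resolved:.2f} per resolution")
print(f"Human-only:  ${(tickets * human_cost) / (tickets * human_tcr):.2f} per resolution")
```

In this toy example the 60% TCR agent wins on cost per resolution (roughly $8.60 vs. $13.30) even though it resolves fewer tickets on its own.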

The evaluation conversation needs to include business stakeholders, not just data scientists and engineers. Otherwise we optimize for metrics that don’t matter to the people writing the checks.

Rachel, I want to push on the technical benchmarking side because I think there’s a practical approach that most teams can implement without building the full evaluation infrastructure you described.

The Pragmatic Technical Benchmark

Your five-dimension framework is comprehensive, but most engineering teams I work with need something they can implement in a week, not a quarter. Here’s the minimal viable evaluation stack I recommend:

Level 1: Automated regression tests (Day 1)

Before you build any evaluation infrastructure, build a regression test suite:

  • Curate 50-100 representative agent interactions (real production data, anonymized)
  • For each interaction, define the expected outcome (not the exact output — the expected behavior)
  • Run every agent change through this suite before deployment
  • Track pass/fail rate over time

This catches the obvious regressions. It doesn’t measure quality, but it prevents things from getting worse. Think of it as unit tests for agent behavior.
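
A minimal pytest version of that suite. `run_agent`, the `golden_cases.json` schema, and the specific assertions are placeholders you’d adapt to your own harness:

```python
# Behavioral regression suite: golden cases pin down expected BEHAVIOR
# (tools called, facts mentioned), not exact output strings.

import json
import pathlib

import pytest

from my_agent import run_agent  # your agent entry point (hypothetical module)

CASES = json.loads(pathlib.Path("golden_cases.json").read_text())
# Each case: {"id": str, "input": str, "expected_tools": [str], "must_mention": [str]}

@pytest.mark.parametrize("case", CASES, ids=lambda c: c["id"])
def test_agent_behavior(case):
    result = run_agent(case["input"])  # assumed to return {"tools_used": [...], "answer": str}
    # Extra tool calls are tolerated, but every expected tool must appear.
    assert set(case["expected_tools"]) <= set(result["tools_used"])
    for fragment in case["must_mention"]:
        assert fragment.lower() in result["answer"].lower()
```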

Level 2: Automated scoring with LLM-as-judge (Week 1)

Set up an automated scoring pipeline using a separate LLM to evaluate agent outputs:
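
Something like the sketch below, where `call_judge` is a thin wrapper around whichever model you pick as the judge (deliberately not tied to any provider’s SDK here) and the rubric wording is illustrative:

```python
import re
from typing import Callable

RUBRIC = """Score the agent's response to the user's request on a 1-5 scale:
5 = fully achieved the goal, 4 = achieved with minor issues,
3 = partially achieved, 2 = attempted but failed, 1 = off-task or harmful.
Reply with the score on the first line, then one sentence of justification."""

def judge_interaction(call_judge: Callable[[str], str], user_request: str, agent_output: str) -> int:
    prompt = f"{RUBRIC}\n\nUSER REQUEST:\n{user_request}\n\nAGENT RESPONSE:\n{agent_output}"
    reply = call_judge(prompt)
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge reply had no score: {reply!r}")
    return int(match.group())
```

Keeping the one-sentence justification alongside the score makes judge/human disagreements much faster to debug.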

The key decision: which model evaluates which model? We use a different model family for evaluation than for the agent itself. If your agent runs on Claude, evaluate with GPT-4. If your agent runs on GPT-4, evaluate with Claude. This reduces systematic bias (both models have blind spots, but different ones).

Level 3: Cost-normalized performance (Week 2)

Rachel’s Efficiency Ratio is the right idea, but I’d frame it differently for engineering teams: cost per successful outcome.

This single metric captures both quality and efficiency. An agent that completes tasks well but burns 10x the tokens is visible in this metric. An agent that’s cheap but fails half the time is also visible.

Track this metric per agent, per use case, per week. It’s the single best proxy for “is this agent getting better or worse?”
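
A sketch of that roll-up, with illustrative column names; the only subtlety is guarding against a zero-success bucket in the denominator:

```python
import pandas as pd

# One row per agent run, with whatever cost accounting you have (token spend
# plus tool costs). Column names are illustrative.
runs = pd.DataFrame({
    "agent":    ["support-bot"] * 6,
    "use_case": ["refund", "refund", "refund", "billing", "billing", "billing"],
    "week":     ["2024-W20"] * 6,
    "cost_usd": [0.04, 0.11, 0.03, 0.09, 0.02, 0.05],
    "success":  [True, False, True, True, True, False],
})

summary = runs.groupby(["agent", "use_case", "week"]).agg(
    total_cost=("cost_usd", "sum"),
    successes=("success", "sum"),
)
# Avoid divide-by-zero when an agent fails every run in a bucket.
summary["cost_per_success"] = summary["total_cost"] / summary["successes"].clip(lower=1)
print(summary)
```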

Level 3.5: Latent failure detection

One thing missing from your framework: detecting failures that nobody reported. We run a daily job that:

  1. Samples 5% of agent interactions from the previous day
  2. Runs them through the LLM-as-judge evaluation
  3. Flags any interaction scored below 3/5 that wasn’t already flagged by the user
  4. Routes flagged interactions to a human reviewer

In our experience, this catches 2-3x more failures than user-reported feedback alone. Most users don’t bother reporting bad agent behavior — they just leave.
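
Roughly, the daily job looks like this; `llm_judge_score` and `send_to_review_queue` are placeholders for your own judge pipeline and review tooling:

```python
import random

SAMPLE_RATE = 0.05      # 5% of yesterday's traffic
SCORE_THRESHOLD = 3     # on the 1-5 judge scale

def detect_latent_failures(interactions, llm_judge_score, send_to_review_queue):
    """Flag low-scoring interactions that no user bothered to report."""
    k = max(1, int(len(interactions) * SAMPLE_RATE))
    flagged = 0
    for interaction in random.sample(interactions, k):
        if interaction["user_flagged"]:
            continue  # the user already told us; nothing latent about it
        if llm_judge_score(interaction) < SCORE_THRESHOLD:
            send_to_review_queue(interaction)
            flagged += 1
    return flagged
```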

Where I Agree With Rachel

The industry absolutely needs a standardized benchmark. But I’d push for something more practical than an academic framework:

  • Open-source evaluation harness with pluggable scoring models
  • Standard task sets per domain (support, sales, coding, data analysis)
  • Leaderboard for agent frameworks and models on standard tasks
  • Shared annotation guidelines so different teams’ human evaluations are comparable

The closest thing we have today is probably the Berkeley Function-Calling Leaderboard and SWE-bench for coding agents. We need equivalents for every major agent use case.

Rachel and David, both your frameworks are valuable, but I want to add the team adoption dimension that determines whether any of these metrics actually get used.

The Metrics Nobody Tracks: Team Adoption

In my experience leading teams that deploy agents, the biggest evaluation gap isn’t technical metrics or business metrics — it’s whether the team actually trusts and uses the evaluation system.

Here’s what I’ve seen happen at three different companies:

Company A: Built a comprehensive 8-dimension evaluation framework. It was so complex that nobody ran it except the ML team. Product managers didn’t understand the scores. Engineers found it too slow to run in CI. After 3 months, the evaluation system was abandoned and teams went back to “try it and see if it feels right.”

Company B: Implemented a simple thumbs up/down feedback button. Got a 4% response rate. Of those responses, 80% were from power users who didn’t represent the average user. The data was biased and the team knew it, but kept reporting it because it was the only metric they had.

Company C: Required human review of every agent interaction before launch, then switched to sampling. The human reviewers developed “evaluation fatigue” — their scoring became less consistent over time, and inter-rater reliability dropped from 0.85 to 0.62 within two months.

What Actually Works: The Adoption-First Evaluation Stack

Based on these experiences, here’s what I recommend for teams getting started:

1. Pick ONE metric that everyone understands.

Not five dimensions. Not a composite score. One metric that product, engineering, and leadership all understand and care about. For most teams, that’s either:

  • First-attempt resolution rate (Rachel’s insight — I agree this is the best single metric for most use cases)
  • Cost per resolved interaction (David’s framing — most compelling for leadership)

You can add more dimensions later. But start with one that drives alignment across the organization.

2. Build evaluation into the developer workflow, not alongside it.

If evaluation requires running a separate tool, going to a separate dashboard, or waiting for a separate pipeline, developers won’t do it. Embed evaluation into the existing workflow:

  • Agent changes should automatically trigger evaluation in the CI pipeline
  • Results should appear in the pull request as comments
  • Regressions should block deployment (with a clear override path for false positives)
  • Dashboards should be in the same tools teams already use (Grafana, Datadog, etc.)

3. Invest in evaluator calibration.

If you’re using human evaluation (and you should, at least for calibrating automated systems), invest in evaluator training and calibration:

  • Written rubric with examples for each score level
  • Monthly calibration sessions where evaluators score the same interactions and discuss disagreements
  • Track inter-rater reliability as a metric of your evaluation system, not just of your agents (a minimal agreement sketch follows this list)
  • Rotate evaluators to prevent fatigue and bias
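
For that inter-rater reliability number, one common choice is Cohen’s kappa. A minimal unweighted version, with made-up scores (for ordinal rubric scores a weighted kappa is usually better, but this keeps the sketch short):

```python
def cohens_kappa(scores_a: list[int], scores_b: list[int]) -> float:
    """Chance-corrected agreement between two reviewers scoring the same items."""
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    labels = set(scores_a) | set(scores_b)
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    expected = sum(
        (scores_a.count(label) / n) * (scores_b.count(label) / n) for label in labels
    )
    return (observed - expected) / (1 - expected)

# Two reviewers scoring the same ten interactions on the 1-5 rubric.
reviewer_1 = [5, 4, 4, 2, 3, 5, 1, 4, 3, 4]
reviewer_2 = [5, 4, 3, 2, 3, 5, 2, 4, 3, 5]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.2f}")  # ~0.59
```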

4. Set realistic improvement targets.

I’ve seen teams set goals like “achieve 95% task completion rate by Q2” when their current rate is 64%. That’s not a goal — it’s a fantasy. Set targets based on:

  • Current performance as baseline
  • Incremental improvement (5-10% per quarter is aggressive but achievable)
  • Comparison to human performance (not perfection — actual human performance, which is often lower than people assume)

The best evaluation framework in the world is useless if the team doesn’t use it, doesn’t trust it, or doesn’t act on the results. Start simple, build trust, and add complexity when the team is ready for it.