Choosing Eval Metrics Is a Product Decision, Not a Technical One
A team building an LLM-based literature screening tool celebrated 96% accuracy on their test set. Their model was, by any standard engineering metric, performing excellently. There was one problem: it found zero true positives. It had learned to classify everything as irrelevant and still scored near-perfect accuracy, because relevant papers were rare in the dataset. The failure wasn't in the model — it was in the metric.
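A minimal sketch of the trap, with invented numbers standing in for the team's dataset: on a class-imbalanced screening set, a model that labels everything irrelevant still posts high accuracy while its recall is zero.

```python
# Illustrative numbers only: a screening set where 4 of 100 papers are relevant.
y_true = [1] * 4 + [0] * 96   # 1 = relevant paper, 0 = irrelevant
y_pred = [0] * 100            # model labels every paper "irrelevant"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy: {accuracy:.2f}")  # 0.96 -- looks excellent on the dashboard
print(f"recall:   {recall:.2f}")    # 0.00 -- zero relevant papers found
```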
This failure mode is not exotic. It plays out silently across AI teams every week, in codebases where engineers select evaluation metrics the way they'd select a sorting algorithm: as a technical choice with a right answer. The framing is wrong. Metric selection is a product decision. It encodes which failure modes you're willing to tolerate, which users you're optimizing for, and what "good" actually means for your specific context. Getting this wrong produces eval suites that look rigorous and measure the wrong thing.
Metrics Encode Tolerance for Failure, Not Just Performance
Every eval metric is a bet on what matters. BLEU rewards n-gram overlap with a reference translation. LLM-as-judge rewards outputs that a language model scores highly. Human preference rewards outputs that annotators prefer. Task completion rate rewards agents that finish tasks without intervention. Each of these bets reflects a choice about what failures are acceptable.
Consider the tradeoffs:
- BLEU and ROUGE penalize legitimate paraphrases and reward lexical similarity even when meaning diverges. If your product surfaces hallucinated but fluently written text, BLEU will not catch it. The metric was designed for machine translation — it measures surface form, not correctness or user value (see the sketch after this list).
- LLM-as-judge achieves Pearson correlation coefficients up to 0.85 with human judgments on certain tasks. But it systematically favors verbose, formally structured outputs regardless of whether they're correct. It also has position bias — responses presented earlier or later in a comparison receive systematically different scores, depending on the model used as judge.
- Human preference scoring captures what users actually want but introduces its own distortions. Annotators optimize for what reads well, not what is accurate. One study comparing automated metrics to human evaluators found machines rated logically structured outputs highly while humans described the same outputs as "cold, robotic, lacking authenticity."
- Task completion rate measures whether agents finish tasks autonomously. Research shows 68% of production agents execute only 10 or fewer steps before requiring human intervention. But "task completion" aggregated across a test suite hides where agents fail — systematic failures on specific input types vanish into averages.
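To make the BLEU/ROUGE point from the first item concrete, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1 (not the library implementation); the sentences are invented. A correct paraphrase scores far below a fluent answer that copies the reference wording while inverting its meaning.

```python
# Simplified unigram-overlap F1, ROUGE-1-like; sentences invented for illustration.
def unigram_f1(reference: str, candidate: str) -> float:
    ref, cand = reference.lower().split(), candidate.lower().split()
    overlap = sum(min(ref.count(w), cand.count(w)) for w in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the drug reduced mortality by ten percent in the trial"
paraphrase = "deaths fell by a tenth among patients receiving the treatment"   # correct, low overlap
hallucination = "the drug increased mortality by ten percent in the trial"     # wrong, high overlap

print(unigram_f1(reference, paraphrase))     # low score despite being correct
print(unigram_f1(reference, hallucination))  # high score despite being wrong
```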
None of these metrics is objectively better than the others. Each is appropriate for specific failure modes and actively misleading for others. Choosing which failure modes to measure — and which to ignore — is not a technical judgment. It's a product judgment.
Why Engineering-Driven Selection Systematically Misfires
When engineers own metric selection without product input, the selection criteria default to what's measurable, automatable, and optimizable. These criteria are orthogonal to user value.
The canonical failure pattern is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. This is not theoretical in AI. Consider:
- Model labs began treating leaderboard rankings as targets once they became industry reputation signals. Within months, labs were selectively showcasing only their strongest model variants, cherry-picking benchmark subsets, and reporting scores that reflected gaming rather than capability. One frontier model reported 50% on a major benchmark; independent testing found 29.4%.
- Google's YouTube recommendation system optimized for "hours watched" as a proxy for user satisfaction. The metric was easy to measure, correlated with engagement, and completely trackable. It also drove recommendations toward conspiracy content, because that content kept users watching longest, regardless of quality.
- The literature screening team hit 96% accuracy while missing every relevant paper. Accuracy was the metric their engineers knew how to optimize. Recall — the metric that mattered for the use case — wasn't on the dashboard.
In each case, engineers selected metrics that were technically tractable. In each case, the metric diverged from what the product actually needed. The problem isn't that engineers are bad at their jobs. It's that metric selection requires knowing what failure modes are catastrophic, which users are most affected, and what "good enough" means for a specific context. That knowledge lives in product, not engineering.
The Failure Mode Map
Different failure modes require different metrics. Before selecting any metric, teams should be explicit about which failure modes they're measuring — and which they're deliberately accepting.
Hallucination and factual errors are not captured by fluency or coherence metrics. A perfectly fluent hallucinated answer scores identically to a correct one on ROUGE. TruthfulQA, dedicated hallucination benchmarks, and RAG-based factuality checks exist for this. If your product is in a domain where errors are costly — legal, medical, financial — ignoring hallucination metrics is a conscious decision to accept those failures unchecked.
Demographic performance disparities are invisible in aggregate accuracy. A model with 92% accuracy overall may have 78% accuracy for one demographic group and 96% for another. If your product serves diverse users, aggregate accuracy is not a safety metric — it's an averaging function that hides discrimination.
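A sketch of the slicing that makes the disparity visible; the group names, counts, and outcomes below are invented, not drawn from a real system.

```python
# Illustrative only: aggregate accuracy hides per-group disparity.
from collections import defaultdict

# Each record is (group, prediction correct?).
results = [("group_a", True)] * 96 + [("group_a", False)] * 4 \
        + [("group_b", True)] * 78 + [("group_b", False)] * 22

overall = sum(ok for _, ok in results) / len(results)
print(f"aggregate accuracy: {overall:.2f}")   # one reassuring number

by_group = defaultdict(list)
for group, ok in results:
    by_group[group].append(ok)

for group, outcomes in sorted(by_group.items()):
    print(f"{group}: {sum(outcomes) / len(outcomes):.2f}")   # disparity shows up only here
```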
Agentic failure modes don't aggregate cleanly. Research on multi-step agent failures identifies four recurring patterns: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. A single task completion rate score collapses all of these into one number. Teams that care about why agents fail need per-failure-type instrumentation.
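A sketch of that instrumentation, using a hypothetical run log with failure-type labels taken from the four patterns above: instead of one completion rate, failures are counted per type.

```python
# Illustrative only: report agent failures per type instead of one aggregate rate.
# The run records and "failure_type" labels are hypothetical.
from collections import Counter

runs = [
    {"task": "refund_lookup", "completed": True,  "failure_type": None},
    {"task": "invoice_fetch", "completed": False, "failure_type": "premature_action"},
    {"task": "policy_answer", "completed": False, "failure_type": "distractor_pollution"},
    {"task": "order_update",  "completed": False, "failure_type": "over_helpfulness"},
    {"task": "batch_export",  "completed": False, "failure_type": "fragile_execution"},
]

completion_rate = sum(r["completed"] for r in runs) / len(runs)
print(f"task completion rate: {completion_rate:.0%}")   # one number, no diagnosis

failure_counts = Counter(r["failure_type"] for r in runs if not r["completed"])
for failure_type, count in failure_counts.most_common():
    print(f"{failure_type}: {count}")                    # where and why the agent fails
```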
Domain expert agreement exposes the limits of automated evaluation. LLM-as-judge agreement with subject matter experts ranges from 60% to 68% in specialized domains like dietetics and mental health. If your product is knowledge-intensive, automated metrics will over-report quality. The acceptable gap between automated scores and expert judgment is a product decision: how much misalignment can ship without creating liability, eroding trust, or harming users?
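One way to quantify that gap before trusting a judge: score the same items with the judge and with domain experts, and report simple agreement. The labels below are invented.

```python
# Illustrative only: measure how often an LLM judge agrees with domain experts
# on the same items before relying on the judge's scores.
judge_labels  = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
expert_labels = ["pass", "fail", "fail", "pass", "fail", "fail", "pass", "fail"]

agreement = sum(j == e for j, e in zip(judge_labels, expert_labels)) / len(expert_labels)
print(f"judge-expert agreement: {agreement:.2f}")  # well below 1.0; the acceptable gap is a product call
```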
None of these answers are written in the eval framework documentation. They require stakeholder input.
Co-Designing Metrics Before Writing Eval Examples
The practical alternative to engineering-driven selection is a structured co-design process that runs before eval examples are written. The sequence matters: metric selection must precede example collection, because examples are always implicitly optimized for whatever you're measuring.
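One way to make the outcome of that co-design concrete is a short, reviewable record that product and engineering sign off on before any examples are collected. Every field and value below is hypothetical, included only to show the shape such a record might take.

```python
# Hypothetical "metric decision record" agreed before any eval examples are written;
# all field names and values are illustrative, not a prescribed schema.
metric_decision = {
    "product": "literature screening",
    "primary_metric": "recall on relevant papers",
    "secondary_metrics": ["precision", "per-domain recall"],
    "failure_modes_measured": ["missed relevant papers", "hallucinated abstracts"],
    "failure_modes_accepted": ["extra irrelevant papers passed to human review"],
    "ship_threshold": {"recall": 0.95},
    "owner": "product",   # who decides when the threshold changes
}
```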
