The Eval Overcrowding Problem: Why Your Bigger Test Suite Is Catching Fewer Regressions
Your AI eval suite has 800 test cases. You add 200 more. Your model now scores 94% on evals and you ship with confidence. Three days later, a user finds a regression that none of your 1,000 tests caught.
This isn't bad luck — it's structural. The regression exists precisely because of how you grew your test suite, not despite it. The instinct to add more evals when something breaks is correct in theory and counterproductive in practice. More tests do not automatically mean better coverage of what matters. They mean better coverage of what's easy to test, which is a different thing entirely.
How Eval Suites Drift Away from Production Reality
The failure mode starts with incentives. When an AI feature ships and a bug surfaces, the immediate response is to add a test case covering that specific input. When a new capability is added, you write evals for the happy path and a handful of edge cases. Over six months, the suite grows to hundreds of cases — each added for a locally sensible reason, but collectively drifting away from the distribution of failures that actually matter in production.
Three forces drive this drift:
Automation gravity. Tests that can be scored automatically — exact-match comparisons, code execution results, schema validation — are cheap to run and maintain. Tests that require human judgment about quality, tone, or usefulness are expensive. Over time, teams unconsciously optimize for what they can automate. The eval suite fills up with cases that are easy to grade, not cases that are hard to get right.
Edge case accumulation. Engineers are trained to think adversarially. They find corner cases, boundary conditions, and failure modes, then write tests for them. These cases are real but rare. They stack up in the eval suite while the most common user journeys — more predictable, less exciting to test — receive fewer cases per failure mode.
Coverage theater. Eval suites are often reported as a count. A team with 2,000 evals sounds more rigorous than a team with 200. This creates pressure to grow the suite as a proxy for rigor, independent of whether the new cases are actually predictive of production failures.
The result is an eval suite that looks better by every internal measure as you add cases — more tests, broader nominal coverage, higher headline pass rates — while its ability to predict whether a release will cause user-visible regressions stays flat or degrades.
The Expected Defect Detection Value of Each Test Case
Every eval item has an implicit expected defect detection value (EDDV) — roughly, the probability that running this test will catch a failure that would have reached production otherwise. Most teams never calculate this, but the structure becomes clear when you ask: of the last ten bugs your users reported, how many would have been caught by your current eval suite?
For most teams, the answer is two or three out of ten. The rest were failures in the gaps between tests — not because the tests were wrong, but because the test distribution did not match the production input distribution.
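A quick retrospective version of that question can be scripted. The sketch below is a minimal illustration, assuming you can label each recent user-reported bug with whether any case in the current suite would have flagged it; the record structure is hypothetical, not a real tracker schema.

```python
from dataclasses import dataclass

@dataclass
class ProductionBug:
    bug_id: str
    summary: str
    caught_by_existing_eval: bool  # would any case in the current suite have flagged this?

def suite_eddv_estimate(recent_bugs: list[ProductionBug]) -> float:
    """Crude suite-level EDDV estimate: the fraction of recent production
    failures that the current eval suite would have caught before release."""
    if not recent_bugs:
        return 0.0
    caught = sum(1 for bug in recent_bugs if bug.caught_by_existing_eval)
    return caught / len(recent_bugs)

# Ten recent user-reported bugs, three of which map to an existing eval case.
bugs = [ProductionBug(f"BUG-{i}", "user-reported failure", i < 3) for i in range(10)]
print(suite_eddv_estimate(bugs))  # 0.3
```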
The relationship between suite size and EDDV follows a familiar curve. The first 50 well-designed test cases provide enormous signal. Each covers a genuinely distinct failure mode. The next 50 add less. By the time you're adding test cases in the hundreds, each new case often duplicates coverage of a failure mode already represented, covers an edge case too rare to matter, or targets an adversarial input that no real user would generate.
The aggregate effect is that total suite EDDV grows sublinearly while maintenance cost grows linearly. A 1,000-case suite is not ten times more valuable than a 100-case suite. It is probably two to three times more valuable, and five times more expensive to maintain, run, and interpret.
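The sublinear growth falls out of even a toy model. The sketch below assumes, purely for illustration, a fixed set of failure modes that case authors sample with a heavy skew toward the obvious ones; the number of distinct modes covered then flattens as the suite grows.

```python
import random

def expected_modes_covered(n_cases: int, mode_weights: list[float], trials: int = 500) -> float:
    """Monte Carlo estimate of how many distinct failure modes a suite of
    n_cases covers, when case authors sample failure modes with skewed weights
    (the common, easy-to-imagine modes get tested again and again)."""
    modes = range(len(mode_weights))
    total = 0
    for _ in range(trials):
        picked = random.choices(modes, weights=mode_weights, k=n_cases)
        total += len(set(picked))
    return total / trials

# 50 failure modes with a heavy skew toward the obvious ones.
weights = [1 / (rank + 1) for rank in range(50)]
for n in (50, 100, 500, 1000):
    print(n, round(expected_modes_covered(n, weights), 1))
```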
The saturation timeline for AI benchmarks illustrates this. MNIST and Switchboard took decades to saturate. GLUE saturated within two years of release. Benchmarks explicitly designed to resist saturation — built from difficult expert-level questions — have been saturating within 15 months as frontier models improve. Each benchmark follows the same arc: strong signal early, noise and gaming as the suite ages and becomes a target.
Goodhart's Law Is Not a Metaphor, It's a Mechanism
When your eval suite becomes the primary quality gate for shipping, it becomes a target. When a metric becomes a target, it ceases to be a good measure of the underlying quality you care about.
This plays out in four distinct ways in AI systems:
Score gaming without capability gain. Models can improve benchmark scores through prompt formatting, few-shot examples, temperature tuning, and ensemble voting — none of which change the underlying capability. A team that optimizes model configuration against its eval suite gets score improvements that don't translate to user satisfaction.
Data contamination. Evaluation datasets overlap with pretraining data at rates teams rarely measure. When a model has encountered eval-similar examples during training, its performance on those evals reflects memorization, not generalization. This is a spectrum, not a binary — even partial contamination systematically inflates benchmark scores relative to production performance.
Structural overfitting. When a model is fine-tuned on outputs that score well on your eval suite, it optimizes for your eval distribution, which is a subset of the actual user input distribution. The fine-tuned model performs better on evals and potentially worse on the long tail of real queries that evals don't cover.
Reviewer anchoring. Human evaluators who review model outputs calibrate over time against the outputs they've seen. After several months, they become better at recognizing outputs that look good to them, which correlates with outputs that score well on the existing suite. New failure modes that don't resemble past failures are rated more charitably than they should be.
The practical consequence: a gap between lab benchmark scores and real-world deployment performance, averaging around 37%, is common enough to treat as a baseline expectation. Your eval suite improving is necessary but not sufficient evidence that production quality improved.
The Forced-Ranking Methodology for Pruning Eval Suites
The fix is not to stop adding tests. It's to treat your eval suite as a portfolio with limited capacity and force every case to earn its slot.
Start from production failures, not from coverage. When building or auditing an eval suite, the first question is not "what inputs should we test?" but "what has actually failed in production?" Real user reports, monitoring alerts, and A/B test degradations are a ground-truth signal about the failure modes that matter. An eval case derived from a real production failure has higher EDDV than a synthetic edge case by construction.
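A sketch of that intake step might look like the following, assuming a simple internal incident record (the field names are hypothetical); the point is that provenance travels with the case, so production-derived cases can be weighted above synthetic ones.

```python
from dataclasses import dataclass

@dataclass
class ProductionIncident:
    incident_id: str
    user_input: str         # the input that triggered the failure
    expected_behavior: str  # what reviewers agreed the output should have been

@dataclass
class EvalCase:
    case_id: str
    prompt: str
    expected_behavior: str
    provenance: str  # "production" vs. "synthetic"

def case_from_incident(incident: ProductionIncident) -> EvalCase:
    """Derive an eval case directly from a real production failure,
    keeping its provenance explicit."""
    return EvalCase(
        case_id=f"prod-{incident.incident_id}",
        prompt=incident.user_input,
        expected_behavior=incident.expected_behavior,
        provenance="production",
    )
```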
Validate the metric-outcome correlation before scaling. For each metric you use to score evals, verify that improvement on that metric correlates with improvement on production KPIs — user satisfaction, task completion, cost per inference. This correlation check is the step most teams skip. A metric that doesn't correlate with production outcomes cannot catch production regressions, regardless of how many cases use it. The calibration target is low false positive and false negative rates between eval scores and real production quality signals.
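A minimal sketch of the calibration check, assuming you keep per-release eval scores alongside a production quality signal for the same releases; the thresholds and numbers below are illustrative.

```python
from statistics import correlation  # Python 3.10+; Pearson correlation

def metric_calibration(eval_scores: list[float],
                       prod_kpis: list[float],
                       eval_pass: float,
                       prod_ok: float) -> dict:
    """Check whether an eval metric tracks a production KPI across releases:
    report the correlation plus the false positive / false negative rates of
    the eval gate relative to the production signal."""
    releases = list(zip(eval_scores, prod_kpis))
    false_pos = sum(1 for e, p in releases if e >= eval_pass and p < prod_ok)  # eval said ship, prod degraded
    false_neg = sum(1 for e, p in releases if e < eval_pass and p >= prod_ok)  # eval blocked a fine release
    return {
        "correlation": correlation(eval_scores, prod_kpis),
        "false_positive_rate": false_pos / len(releases),
        "false_negative_rate": false_neg / len(releases),
    }

# Eval pass threshold 0.9, production "healthy" threshold 0.75 (illustrative numbers).
print(metric_calibration([0.91, 0.94, 0.89, 0.96], [0.80, 0.70, 0.78, 0.82], 0.9, 0.75))
```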
Force-rank by discriminative power, not by recency. Rank your eval cases on two criteria: how often they discriminate between model versions (cases that every version passes are providing no signal) and whether they would have caught recent production failures. Cases that meet neither criterion should be retired. This is uncomfortable because it means deleting tests, which feels like reducing coverage. It isn't: it's trading coverage theater for predictive power.
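One way to compute that ranking, assuming a pass/fail history per case across recent model versions and a record of which cases correspond to recent production failures (both structures are hypothetical):

```python
def rank_cases(pass_history: dict[str, list[bool]],
               caught_recent_failure: set[str]) -> list[tuple[str, float]]:
    """Rank eval cases by discriminative power: how often a case's outcome
    flips across model versions, with a bonus if the case maps to a real
    recent production failure. Cases scoring 0 are retirement candidates."""
    ranked = []
    for case_id, outcomes in pass_history.items():
        flips = sum(1 for a, b in zip(outcomes, outcomes[1:]) if a != b)
        discrimination = flips / max(len(outcomes) - 1, 1)
        score = discrimination + (1.0 if case_id in caught_recent_failure else 0.0)
        ranked.append((case_id, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

history = {
    "case-a": [True, True, True, True],    # every version passes: no signal
    "case-b": [True, False, True, False],  # discriminates between versions
    "case-c": [True, True, False, True],
}
print(rank_cases(history, caught_recent_failure={"case-c"}))
```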
Cap suite size and enforce it. Set an explicit maximum — say, 200 cases — and require that adding a new case means retiring an old one. This forces the team to make explicit tradeoffs about which cases are most predictive, rather than treating the suite as an ever-growing append-only log. When suite size is unconstrained, the average EDDV of cases decreases monotonically. When it's capped, additions must justify themselves against the existing cases.
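A sketch of how the cap can be enforced mechanically, reusing the 200-case figure above as the limit:

```python
class CappedEvalSuite:
    """Fixed-capacity eval suite: adding a case above the cap requires
    naming, and retiring, the case it displaces."""

    def __init__(self, max_cases: int = 200):
        self.max_cases = max_cases
        self.cases: dict[str, dict] = {}

    def add(self, case_id: str, case: dict, retire_id: str | None = None) -> None:
        if len(self.cases) >= self.max_cases:
            if retire_id is None or retire_id not in self.cases:
                raise ValueError(
                    "Suite is at capacity: name an existing case to retire "
                    "before adding a new one."
                )
            del self.cases[retire_id]
        self.cases[case_id] = case
```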
Distinguish saturation from passing. A test case a model always passes is not evidence that the model has mastered the capability. It may mean the case has saturated — it no longer discriminates between good and bad model versions. Read agent transcripts regularly. When a case stops being a site of meaningful variance, retire it or replace it with a harder version of the same capability test.
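A saturation check can reuse the same kind of pass/fail history; the window size below is an arbitrary illustrative choice.

```python
def saturated_cases(pass_history: dict[str, list[bool]], window: int = 6) -> list[str]:
    """Flag cases whose outcome has been identical across the last `window`
    model versions: they are no longer a site of meaningful variance and are
    candidates for retirement or replacement with a harder variant."""
    flagged = []
    for case_id, outcomes in pass_history.items():
        recent = outcomes[-window:]
        if len(recent) >= window and len(set(recent)) == 1:
            flagged.append(case_id)
    return flagged
```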
The Eval Suite Is a Living System, Not a Ledger
The deeper problem is that most teams treat eval suites as a ledger where test cases accumulate. Each test case added is a commitment. Removing a case feels like reducing rigor. This is backwards.
An eval suite that grows without pruning is a system with no negative feedback. The signal-to-noise ratio decreases over time. Cases that were once discriminative saturate. New cases are added faster than old ones are retired. Eventually, the suite requires more maintenance than a small team can provide, but the overhead of removing cases — justifying why safety is being reduced — prevents the necessary pruning.
The forcing function is this: if you cannot answer "which production failure would this test have caught?" for a given eval case, the case should be in a staging pool, not the active suite. It may be promoted if a production failure occurs that it would have detected. Until then, it is speculation about failure modes, not evidence about them.
Treating eval design as a continuous audit — not a build-and-forget asset — changes the economics. The active suite stays small, fast to run, and predictively calibrated. The staging pool holds experimental cases pending validation. Cases graduate from staging into the active suite when they prove discriminative power; they retire from the active suite when they saturate.
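A sketch of one audit pass under those rules, with all structures and thresholds illustrative:

```python
def audit_cycle(active_history: dict[str, list[bool]],
                staging_caught_failure: dict[str, bool],
                window: int = 6) -> tuple[set[str], set[str]]:
    """One audit pass over the living suite:
    - retire active cases whose outcomes haven't varied in the last `window` versions
    - promote staging cases that a real production failure has since validated."""
    retire = {
        case_id
        for case_id, outcomes in active_history.items()
        if len(outcomes) >= window and len(set(outcomes[-window:])) == 1
    }
    promote = {case_id for case_id, validated in staging_caught_failure.items() if validated}
    return promote, retire

promote, retire = audit_cycle(
    {"case-a": [True] * 8, "case-b": [True, False, True, False, True, True, False, True]},
    {"staged-x": True, "staged-y": False},
)
print("promote:", promote, "retire:", retire)  # promote: {'staged-x'} retire: {'case-a'}
```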
This is more work than appending cases. It's considerably less work than debugging a production regression that a 1,000-case suite confidently failed to predict.
