The Eval Overcrowding Problem: Why Your Bigger Test Suite Is Catching Fewer Regressions
Your AI eval suite has 800 test cases. You add 200 more. Your model now scores 94% on evals and you ship with confidence. Three days later, a user finds a regression that none of your 1,000 tests caught.
This isn't bad luck — it's structural. The regression exists precisely because of how you grew your test suite, not despite it. The instinct to add more evals when something breaks is correct in theory and counterproductive in practice. More tests do not automatically mean better coverage of what matters. They mean better coverage of what's easy to test, which is a different thing entirely.
