Eval Coverage as a Production Metric: Is Your Test Suite Actually Testing What Users Do?
Most AI teams treat a passing eval suite as a signal that their system is working. It isn't—not by itself. A suite that reliably scores 87% is doing exactly one thing: telling you the system handles 87% of the cases the suite happens to contain, whatever slice of real usage those cases represent. If that suite was hand-curated six months ago, built from the examples the team thought of at the time, and never updated against live traffic, it's measuring the wrong thing with increasing confidence.
This is the eval coverage problem. It's not about whether your evaluators are accurate—it's about whether the distribution of queries in your test set matches the distribution of queries your users are actually sending. When those two distributions diverge, you get a result that's far worse than a failing eval: a passing eval sitting on top of a silently degrading product.
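To make that divergence concrete, here is a minimal sketch of one way a team might quantify it: cluster a sample of production queries, project the eval suite onto the same clusters, and compare the two histograms. Everything in it is an illustrative assumption rather than a prescribed implementation: the TF-IDF representation (an embedding model would usually work better), the KMeans cluster count, the 1% traffic threshold, and the `coverage_report` helper with its `prod_queries` / `eval_queries` inputs are all placeholders.

```python
"""Sketch: compare the topic distribution of an eval suite against live traffic."""
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def coverage_report(prod_queries, eval_queries, n_clusters=20, seed=0):
    # Put all queries in a shared vector space. TF-IDF is a cheap stand-in;
    # sentence embeddings would capture paraphrases far better.
    vectorizer = TfidfVectorizer(stop_words="english")
    prod_vecs = vectorizer.fit_transform(prod_queries)
    eval_vecs = vectorizer.transform(eval_queries)

    # Cluster *production* traffic: the clusters define what users actually do.
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    prod_labels = km.fit_predict(prod_vecs)
    eval_labels = km.predict(eval_vecs)

    # Per-cluster share of production traffic vs. share of eval cases.
    prod_dist = np.bincount(prod_labels, minlength=n_clusters) / len(prod_queries)
    eval_dist = np.bincount(eval_labels, minlength=n_clusters) / len(eval_queries)

    # Jensen-Shannon distance between the two distributions:
    # 0 means the suite mirrors traffic; values near 1 mean it tests
    # something users rarely ask for.
    divergence = jensenshannon(prod_dist, eval_dist)

    # Clusters that carry real traffic (>1%, an arbitrary threshold here)
    # but contain no eval cases at all.
    uncovered = [c for c in range(n_clusters)
                 if prod_dist[c] > 0.01 and eval_dist[c] == 0]

    return {
        "js_distance": float(divergence),
        "uncovered_traffic_share": float(prod_dist[uncovered].sum()),
        "uncovered_clusters": uncovered,
    }
```

A rising Jensen-Shannon distance, or a growing share of traffic landing in clusters with zero eval cases, is the quantitative version of the problem above: the suite can keep scoring 87% while whole regions of real usage go untested.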
