
Eval Coverage as a Production Metric: Is Your Test Suite Actually Testing What Users Do?

· 9 min read
Tian Pan
Software Engineer

Most AI teams treat a passing eval suite as a signal that their system is working. It isn't—not by itself. A suite that reliably scores 87% is doing exactly one thing: telling you the system performs well on the 87% of cases your suite happens to cover. If that suite was hand-curated six months ago, built from the examples the team thought of, and never updated against live traffic, it's measuring the wrong thing with increasing confidence.

This is the eval coverage problem. It's not about whether your evaluators are accurate—it's about whether the distribution of queries in your test set matches the distribution of queries your users are actually sending. When those two distributions diverge, you get a result that's far worse than a failing eval: a passing eval sitting on top of a silently degrading product.

Why Coverage Is a Distribution Problem, Not a Count Problem

The naive interpretation of eval coverage is headcount: do you have enough test cases? But coverage measured by count is almost meaningless. You could have ten thousand eval cases and still have critical blind spots if those cases cluster in the same semantic neighborhood.

The real question is: does your eval set's query distribution approximate your production query distribution?

Consider an AI customer support agent. When you built the eval suite at launch, users were mostly asking about account setup and basic navigation. Eight months later, they're asking about billing discrepancies, integration errors, and edge cases in a new product tier you shipped in Q3. Your eval set wasn't updated. The model handles the new query types adequately, but not well—and you have no eval signal for those queries because they don't exist in your suite. The aggregate score looks fine because the majority of eval cases (the old ones) still pass.

This is the coverage gap: the difference between where your eval weight is concentrated and where your actual user traffic lives.

The Cluster Analysis Approach to Measuring Coverage Gaps

The most direct way to measure coverage gaps is to embed and cluster both your production queries and your eval queries, then compare the resulting distributions.

The mechanics are straightforward:

  1. Sample production traffic over a meaningful window—typically 30 days, stratified to avoid over-representing weekend vs. weekday patterns or promotional spikes that don't reflect steady-state behavior.
  2. Embed both sets using the same embedding model. Consistency matters here; mixing embedding models introduces noise that obscures real distributional differences.
  3. Cluster the production queries using k-means or HDBSCAN (the latter handles density variation better for query sets that have a long tail of unusual inputs). Choose a granularity that's meaningful—typically 20–80 clusters for a mid-size production system.
  4. Project your eval queries onto those clusters and measure what fraction of each cluster your eval set covers, weighted by how much traffic each cluster receives.
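The four steps above can be sketched in a few lines with scikit-learn. The random vectors below are stand-ins for real embeddings, which in practice would come from the same embedding model applied to both query sets; the cluster count and gap threshold are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in embeddings; replace with real ones from a single embedding model.
rng = np.random.default_rng(0)
prod_emb = rng.normal(size=(1000, 32))   # sampled production queries
eval_emb = rng.normal(size=(200, 32))    # current eval suite queries

# Cluster the production queries (k chosen per step 3's guidance).
k = 20
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(prod_emb)

# Project eval queries onto the production clusters.
eval_labels = km.predict(eval_emb)

# Per-cluster traffic weight vs. eval density.
prod_weight = np.bincount(km.labels_, minlength=k) / len(prod_emb)
eval_density = np.bincount(eval_labels, minlength=k) / len(eval_emb)

# Clusters carrying real traffic but holding no eval cases are the blind spots.
gaps = [c for c in range(k) if prod_weight[c] > 0.02 and eval_density[c] == 0]
```

The same two vectors, `prod_weight` and `eval_density`, feed directly into the heatmap and divergence computations described below.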

What you get out of this is a coverage heatmap. Some clusters will be densely covered—you have many eval cases from that semantic region. Others will be sparsely covered or completely empty. Cross-referencing cluster size (how much production traffic it represents) against eval density (how many eval cases you have in that region) tells you exactly where your blind spots are.

A cluster that represents 15% of your production traffic and contains zero eval cases is not a gap—it's a liability.

Jensen-Shannon Divergence as a Coverage Score

Cluster heatmaps are useful for diagnosis, but you need a scalar metric to track over time and integrate into dashboards. Jensen-Shannon divergence (JSD) between the production distribution and the eval distribution works well here.

JSD measures how much two probability distributions differ. Applied to eval coverage, you compute the probability mass of each cluster under the production distribution and under the eval distribution, then compute JSD between those two vectors. The result is a number between 0 and 1:

  • JSD near 0: Your eval distribution closely approximates your production distribution. Coverage is good.
  • JSD near 0.3–0.5: Meaningful divergence. Your evals are weighted toward a different region of query space than your users are.
  • JSD above 0.5: Severe mismatch. Your eval suite is largely measuring behavior on query types that don't represent your actual traffic.
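Computed with a base-2 logarithm, JSD is bounded in [0, 1], which is what makes the thresholds above meaningful. A minimal implementation over per-cluster probability vectors (the example values are hypothetical):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence with base-2 log, so the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float) + eps   # eps guards against log(0)
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Per-cluster probability mass under each distribution (illustrative numbers).
prod = [0.40, 0.30, 0.20, 0.10]    # production traffic per cluster
evals = [0.45, 0.35, 0.20, 0.00]   # eval cases per cluster; last cluster uncovered

score = jsd(prod, evals)
```

Note that SciPy's `scipy.spatial.distance.jensenshannon` returns the JS *distance* (the square root of the divergence), so square its output if you use it in place of the function above.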

Track this metric over time alongside your aggregate eval score. A rising JSD is a leading indicator of a widening coverage gap—you'll catch it before users start complaining. A stable JSD with a dropping aggregate eval score means your coverage is fine but something else changed: the model, the prompt, or the world.

The key insight is that JSD gives you an answer to the question that most eval dashboards never surface: is our eval suite still asking the right questions?

Traffic-Stratified Sampling for Eval Refresh

Knowing you have a coverage gap is half the work. Closing it requires building a principled process for refreshing your eval set from production traffic.

Random sampling from production logs is a start, but it tends to over-represent your high-frequency query types—which are probably already well-covered—and under-represent the tail where your blind spots actually are. Instead, use cluster-stratified sampling:

  • For clusters with low eval coverage: sample aggressively, even if the cluster is small. A cluster with 3% of traffic and 0% eval coverage needs representation more urgently than a cluster with 40% of traffic and 35% eval coverage.
  • For clusters with good coverage: sample lightly to detect drift, but don't over-invest.
  • For very small clusters (under 0.5% of traffic): consider grouping them into an "edge case" pool and sampling from that pool as a unit, rather than treating each micro-cluster as its own category.
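One way to turn those three rules into a concrete sampling plan is to allocate an annotation budget proportionally to each cluster's coverage deficit (traffic share minus eval share), pooling sub-threshold clusters into an edge-case bucket. This is a sketch under those assumptions; the budget split is one reasonable policy, not the only one.

```python
import numpy as np

def refresh_allocation(prod_weight, eval_density, budget=200, floor=0.005):
    """Split a sampling budget across clusters, prioritizing under-covered ones.
    Clusters below `floor` traffic share are pooled into one edge-case bucket."""
    prod_weight = np.asarray(prod_weight, dtype=float)
    eval_density = np.asarray(eval_density, dtype=float)
    # Coverage deficit: traffic share not yet reflected in the eval set.
    deficit = np.maximum(prod_weight - eval_density, 0.0)
    small = prod_weight < floor
    per_cluster = np.where(small, 0.0, deficit)
    pooled = deficit[small].sum()            # edge-case pool, sampled as a unit
    total = per_cluster.sum() + pooled
    if total == 0:
        return np.zeros(len(prod_weight), dtype=int), 0
    alloc = np.round(budget * per_cluster / total).astype(int)
    pool_alloc = int(round(budget * pooled / total))
    return alloc, pool_alloc

prod = [0.40, 0.30, 0.20, 0.06, 0.004]
evals = [0.35, 0.30, 0.25, 0.00, 0.10]
alloc, pool = refresh_allocation(prod, evals)
```

In this example the fourth cluster (6% of traffic, zero eval coverage) receives the largest share of the budget, even though it is far smaller than the first cluster, exactly as the first rule above prescribes.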

Once sampled, these production queries need labels. This is where human review is unavoidable for the initial round—automated judges can score responses, but only a human can decide whether an uncovered query type should be added to the permanent eval set and what the correct or acceptable response looks like. After the first annotation pass, you can often automate subsequent labeling for similar queries using few-shot examples from the initial batch.

A practical cadence for most teams: run the JSD coverage score weekly, perform cluster analysis monthly, and do a full eval refresh quarterly or whenever JSD exceeds your threshold (a reasonable starting threshold is 0.25).

The Compounding Failure Case

The coverage gap is especially dangerous for multi-step agents and long-context systems because failure modes compound across turns.

If your eval suite covers the first-turn query distribution reasonably well but doesn't reflect the multi-turn sequences real users pursue, you're missing an entire dimension of failure. A 20-step workflow where each step has a 95% success rate succeeds end-to-end only 36% of the time. Evaluating each step in isolation will show a healthy 95%—but users are experiencing 36% success on their actual goals.
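The compounding arithmetic is worth seeing plainly:

```python
# Per-step success rate looks healthy; end-to-end success tells another story.
per_step = 0.95
steps = 20
end_to_end = per_step ** steps   # ~0.358, i.e. roughly 36%
```

Step-level evals report `per_step`; users experience `end_to_end`.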

The distribution problem here isn't just which queries appear but which sequences appear. Production traffic has a joint distribution over query sequences, and your eval suite needs to reflect that structure, not just the marginal distribution of individual queries.

For agentic systems, extend the cluster analysis to include session-level features: number of turns, tool call patterns, context length at each step, and whether the session ended in success or abandonment. Clustering on these richer representations surfaces coverage gaps that single-turn query analysis will miss entirely.

What a Coverage-Aware Eval Pipeline Looks Like

A mature eval pipeline treats coverage as a first-class metric, not an afterthought. In practice, this means:

  • Coverage score in the eval dashboard: JSD between production and eval distributions, tracked alongside accuracy, latency, and cost. When JSD rises above threshold, it triggers a refresh workflow.
  • Cluster coverage annotations: for each cluster, know its production weight, its current eval density, and when the most recent eval cases in that cluster were added. Old cases in rapidly-evolving clusters degrade in validity.
  • Production-to-eval feedback pipeline: a lightweight process that samples production traffic on a schedule, runs it through the cluster classifier, routes it to the appropriate annotation workflow, and merges approved cases into the eval set.
  • Eval case provenance: track whether each eval case was hand-authored, generated synthetically, or sampled from production, and when. This lets you audit the composition of your suite and understand why coverage gaps exist.

None of this is exotic infrastructure. Embedding production queries, clustering, and computing JSD are operations that take minutes to run and can be automated entirely. The main investment is cultural: you have to treat the eval set as a living artifact that needs maintenance, not a static fixture that gets checked in once and forgotten.

The Green Dashboard That Lies

A passing eval suite on a stale eval set is in some ways worse than no eval suite at all—it creates false confidence that suppresses the investigation a failing suite would have prompted. Teams that track only aggregate eval scores can ship regressions for weeks before user complaints or support ticket volume signal that something is wrong.

The fix is to instrument coverage explicitly. Your eval score should always be qualified by your coverage score. "87% passing" means something very different when JSD is 0.05 versus when JSD is 0.45. One is a meaningful signal about system quality. The other is a meaningful signal about a test suite that has drifted away from what your users are actually doing.

Treat eval coverage as a production metric. Track it. Alert on it. Refresh your eval sets when it degrades. The investment is small; the alternative—shipping quiet regressions that a better-calibrated eval suite would have caught—is not.
