
Eval Set Rot: Why Your Score Trends Up While Users Trend Down

10 min read
Tian Pan
Software Engineer

The eval score has been trending up for two quarters. The dashboard is green, the regression suite has not flagged a real failure since March, and the team has gotten faster at shipping prompt changes because the eval gives crisp pass/fail answers. Meanwhile, user-reported quality is sliding. NPS is down four points, the support queue is full of failure modes nobody has labels for, and the head of product has started asking why the evals look great if customers are angry.

The eval set is not lying. It is answering the question it was built to answer, six months ago, against the traffic distribution that existed in launch week. The product has shifted. The user base has shifted. The long-tail use cases the team did not anticipate at launch now make up a third of traffic. The eval set is still measuring the world that existed in week one, and the team is averaging today's model against yesterday's product.

This is eval set rot. It is one of the quietest failure modes in modern AI engineering, and it gets worse as the eval set gets bigger, because the people maintaining it confuse "more cases" with "better coverage."

The eval set is a sample, and the sampling policy is invisible

The first thing engineers do when they build an eval set is treat it as a fixed object. A folder of JSON files, a test runner, a pass rate. The CI pipeline gates merges on the pass rate. The team adds new cases when bugs are reported. The eval set grows. Everyone feels good about it.

What nobody is tracking is the implicit sampling policy behind the set. An eval case got into the set because someone, at some point, decided it was important. The policy at launch was probably "the top use cases we expected from research interviews." A month later it became "use cases plus a handful of bugs we want to regress against." Six months later it became "everything anyone has ever hand-labeled, weighted by whichever team complained loudest."

That is not a sampling policy. That is an accumulation. The distinction matters because an eval score is only meaningful relative to a sample, and an unprincipled sample produces an unprincipled score. The team is reporting a number that has no defensible relationship to the live distribution, and they have lost the ability to answer the question "if our model passes the eval, will users be happier?"

A useful eval set has a written, versioned sampling policy. Something a new engineer can read in five minutes that says: this set is N cases, drawn from production traces over a defined window, stratified by the following intent categories with these target weights, with the following exclusion rules, refreshed on the following cadence. If you cannot write that paragraph for your current eval set, you do not have an eval set. You have a folder.
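If writing that paragraph feels abstract, it helps to see it as an artifact. Here is a minimal sketch of a versioned sampling policy as a Python dataclass; every name, category, and weight in it is illustrative, not a recommendation:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SamplingPolicy:
    """Versioned description of how the eval set is drawn.

    A new engineer should be able to read this object and know exactly
    what the pass rate is a pass rate *of*.
    """
    version: str                      # bump on every refresh
    n_cases: int                      # target size of the set
    trace_window_days: int            # production window the sample is drawn from
    strata_weights: dict = field(default_factory=dict)  # intent category -> target share
    exclusion_rules: tuple = ()       # human-readable filters applied before sampling
    refresh_cadence_days: int = 90    # how often the set must be re-drawn

# Illustrative values only; the taxonomy and weights are assumptions, not advice.
POLICY = SamplingPolicy(
    version="2025-q3.1",
    n_cases=1200,
    trace_window_days=30,
    strata_weights={
        "billing_questions": 0.20,
        "doc_summarization": 0.35,
        "multi_turn_troubleshooting": 0.30,
        "long_tail_other": 0.15,      # floor so tail cases never vanish from the set
    },
    exclusion_rules=(
        "drop traces flagged as internal testing",
        "drop traces with PII that cannot be scrubbed",
    ),
    refresh_cadence_days=90,
)
```

The point is not the format. The point is that the policy lives in version control next to the cases it governs, and a change to the weights gets reviewed like any other change.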

Two failure modes hide inside aggregate pass rate

The number that gets reported in standups is the aggregate pass rate. That number masks two distinct failure modes that demand different fixes.

The first is calibration drift. The cases in the set were a reasonable proxy for live traffic at launch, but live traffic has moved. The set still passes because the cases themselves have not changed and the model has not regressed against them, but the set is now answering a question the team no longer cares about. The fix is to resample the set against the current production distribution.

The second is easy-case accumulation. New cases were added over time, mostly from bug reports. Bug reports skew toward the cases that were obvious enough that someone could write a clear expected output. Hard cases — the open-ended generation, the multi-turn ambiguity, the ones where reasonable judges disagree — get added more slowly because they are harder to label. Over months, the relative weight of the easy cases creeps up. The pass rate goes up not because the model got better but because the eval got easier. The fix is a deletion discipline and per-cohort weighting, not more cases.

Most teams notice neither. They see a number trending up and assume the system is improving. The architectural realization is that the aggregate pass rate is a lossy summary of two separate dynamics, and a healthy eval program reports both.
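One concrete way to report both views is to put the raw pass rate next to a pass rate re-weighted by each cohort's current share of live traffic; when the two diverge, either the sample has drifted or easy cases have piled up. A minimal sketch, assuming each eval case is tagged with a cohort and the traffic shares come from production analytics (the cohort names and numbers are made up for illustration):

```python
from collections import defaultdict

def pass_rates(results, traffic_share):
    """results: list of (cohort, passed) tuples from an eval run.
    traffic_share: cohort -> fraction of live traffic (sums to 1.0).
    Returns (raw_rate, traffic_weighted_rate)."""
    by_cohort = defaultdict(lambda: [0, 0])  # cohort -> [passed, total]
    for cohort, passed in results:
        by_cohort[cohort][0] += int(passed)
        by_cohort[cohort][1] += 1

    total = sum(t for _, t in by_cohort.values())
    raw = sum(p for p, _ in by_cohort.values()) / total

    # Re-weight each cohort's pass rate by what users actually send today.
    weighted = sum(
        traffic_share.get(c, 0.0) * (p / t) for c, (p, t) in by_cohort.items()
    )
    return raw, weighted

# Toy illustration: an easy cohort dominates the eval set but not live traffic.
results = ([("easy_faq", True)] * 80
           + [("open_ended", False)] * 10
           + [("open_ended", True)] * 10)
raw, weighted = pass_rates(results, {"easy_faq": 0.3, "open_ended": 0.7})
print(raw, weighted)   # raw looks healthy (0.90); the traffic-weighted view does not (0.65)
```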

The shadow set: a parallel instrument for distribution shift

The discipline that fixes calibration drift is to run two eval sets in parallel.

The gold set is the artifact the team trusts. It gates releases. It is curated, hand-labeled, and refreshed on a quarterly cadence. Its job is regression detection: did this model change break a behavior we have explicitly committed to keeping?

The shadow set is built fresh every cycle from a stratified sample of recent production traffic. It is not yet trusted enough to gate releases. Its job is distribution-shift detection: does this model change behave differently on cases that look like what users are actually sending today?

The interesting metric is not the pass rate of either set in isolation. It is the disagreement between them. When the gold and shadow sets agree, the gold set is still a faithful proxy for production. When they disagree by more than a threshold, the gold set has gone stale and is up for refresh. That disagreement signal is what tells the team when to invest in regenerating the gold set, instead of refreshing on a fixed calendar that is either too aggressive or not aggressive enough.
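A minimal sketch of that disagreement check, assuming the gold and shadow sets share an intent taxonomy and an eval runner that returns per-case pass/fail; the threshold and helper names here are assumptions to tune, not a prescription:

```python
def per_category_pass_rate(results):
    """results: list of (category, passed). Returns category -> pass rate."""
    agg = {}
    for cat, passed in results:
        p, t = agg.get(cat, (0, 0))
        agg[cat] = (p + int(passed), t + 1)
    return {cat: p / t for cat, (p, t) in agg.items()}

def gold_shadow_disagreement(gold_results, shadow_results):
    """Max absolute gap in pass rate between gold and shadow,
    taken over categories present in both sets."""
    gold = per_category_pass_rate(gold_results)
    shadow = per_category_pass_rate(shadow_results)
    shared = gold.keys() & shadow.keys()
    return max((abs(gold[c] - shadow[c]) for c in shared), default=0.0)

REFRESH_THRESHOLD = 0.10  # illustrative; set it against your own tolerance for staleness

def gold_set_is_stale(gold_results, shadow_results):
    return gold_shadow_disagreement(gold_results, shadow_results) > REFRESH_THRESHOLD
```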

Promotion from shadow to gold needs its own discipline. The shadow set will surface noise. A spike of cases representing a one-week marketing campaign is not a stable new pattern. The promotion rule should require that a shadow case category has been present, at meaningful weight, across multiple sampling cycles before it earns a place in the gold set. Otherwise the gold set inherits whatever was loud last month, and the disagreement signal collapses.
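The promotion rule itself can be a few lines over the sampling history. A sketch, assuming each shadow refresh records the traffic share of every category it sampled; the cycle count and weight floor are illustrative defaults:

```python
def eligible_for_promotion(category, sampling_history,
                           min_cycles=3, min_weight=0.02):
    """sampling_history: list of dicts, one per shadow refresh,
    mapping category -> share of that cycle's sampled traffic.
    A category earns a gold-set slot only if it has held at least
    `min_weight` of traffic in each of the last `min_cycles` refreshes."""
    recent = sampling_history[-min_cycles:]
    if len(recent) < min_cycles:
        return False  # not enough history yet; keep watching
    return all(cycle.get(category, 0.0) >= min_weight for cycle in recent)

# A one-week marketing spike shows up in a single cycle and stays out;
# a pattern that persists across three cycles gets promoted.
history = [{"pdf_upload": 0.05},
           {"pdf_upload": 0.06},
           {"pdf_upload": 0.07, "promo_spike": 0.20}]
print(eligible_for_promotion("pdf_upload", history))    # True
print(eligible_for_promotion("promo_spike", history))   # False
```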

Measure drift directly, not by waiting for complaints

Most teams notice eval rot through the customer-complaint channel: support tickets pile up, an executive asks why, the team goes hunting. By that point the product has already paid for the rot. There are direct measurements that catch it earlier.

The simplest is a coverage metric: how well does the eval set's distribution match the production distribution today? A few practical implementations:

  • Intent-category KL divergence: classify production traces and eval cases into the same intent taxonomy, then compute the KL divergence between the two distributions (a minimal sketch follows this list). When it crosses a threshold, the eval set has drifted.
  • Embedding-space coverage: embed both production queries and eval cases, then measure how much of the production embedding cloud is within a small radius of any eval case. Falling coverage is the signal.
  • Per-cohort pass-rate vs. traffic share: track each user cohort's pass rate weighted by its share of live traffic. When a small but high-value cohort grows in traffic share but is barely represented in the eval set, the aggregate pass rate becomes meaningless for that cohort even when it looks fine in aggregate.
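The first of these fits in a few lines once both sides are classified into the same taxonomy. A minimal sketch, assuming intent labels already exist for production traces and eval cases, with smoothing so a category missing from one side does not blow the divergence up to infinity (the threshold is a starting point to tune, not a recommendation):

```python
import math
from collections import Counter

def category_distribution(labels, taxonomy, smoothing=1e-3):
    """labels: list of intent labels. Returns a smoothed probability per category."""
    counts = Counter(labels)
    total = len(labels) + smoothing * len(taxonomy)
    return {cat: (counts.get(cat, 0) + smoothing) / total for cat in taxonomy}

def kl_divergence(p, q):
    """KL(P || Q) over a shared taxonomy; both inputs are dicts of probabilities."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)

def eval_set_has_drifted(prod_labels, eval_labels, taxonomy, threshold=0.15):
    prod = category_distribution(prod_labels, taxonomy)
    eval_dist = category_distribution(eval_labels, taxonomy)
    # Production is the reference: how badly does the eval set mis-cover what users send?
    return kl_divergence(prod, eval_dist) > threshold
```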

None of these metrics are perfect. KL divergence is only as sharp as the intent taxonomy it is computed over, and a coarse taxonomy is blind to shifts that happen within a category. Embedding coverage is sensitive to the embedding model's biases. Per-cohort weighting depends on a cohort taxonomy that itself drifts. The point is not to find one canonical metric. The point is to watch something that points at the gap between the eval set and live traffic, instead of waiting for the gap to manifest as user pain.

Stratification, deletion, and the policy you can defend

The discipline that has to land in a team that has been letting the eval set drift looks something like this:

  • Stratified resampling. Each refresh draws fresh cases from production traces, stratified across feature surfaces, user cohorts, and intent categories (see the sketch after this list). The strata weights are written down and reviewed. They are not the empirical traffic share: high-value cohorts often deserve over-representation, and tail use cases need a floor or they vanish from the set entirely.
  • A deletion discipline. Cases that no longer represent any production cohort get retired. This is the part teams resist most because it feels like throwing away work, but a 2,000-case set where 600 cases describe a feature that was deprecated last year is dragging the score around for no reason. A useful rule: if no production trace has matched a case's pattern in two refresh cycles, the case is a candidate for retirement.
  • Per-cohort slices. The aggregate pass rate is reported alongside slices for each meaningful cohort. The dashboard shows divergence between cohorts immediately, and a regression in a small high-value segment cannot hide behind a healthy aggregate.
  • Weighting that is explicit, not emergent. When a new case is added, the team decides whether it goes into a new cohort, expands an existing one, or replaces an older case. Otherwise the relative weight of every cohort drifts every time someone runs git add.
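The stratified resampling step, with explicit weights and a floor for tail strata, is also only a handful of lines once traces carry a cohort label. A sketch, assuming the weights come from the written policy rather than raw traffic share; everything here is illustrative:

```python
import random

def stratified_resample(traces, strata_weights, n_cases, seed=0):
    """traces: list of (cohort, trace) from the current production window.
    strata_weights: cohort -> target share from the written sampling policy.
    Returns a fresh candidate eval set of roughly n_cases traces."""
    rng = random.Random(seed)
    by_cohort = {}
    for cohort, trace in traces:
        by_cohort.setdefault(cohort, []).append(trace)

    sample = []
    for cohort, weight in strata_weights.items():
        want = max(1, round(weight * n_cases))   # floor of 1 so tail strata never vanish
        pool = by_cohort.get(cohort, [])
        take = min(want, len(pool))
        sample.extend(rng.sample(pool, take))
    rng.shuffle(sample)
    return sample
```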

Industry reporting puts the rate of degradation at significant levels — one 2025 LLMOps survey noted that models left unchanged for six months saw error rates jump roughly 35% on new data, even before accounting for changes in the upstream model APIs. The number is less interesting than the shape: degradation is gradual, invisible in aggregate, and compounds with the size of the gap between the eval policy and the live distribution.

The architectural realization

An eval set is a measurement instrument. Like any instrument, it has a calibration drift problem. A team that does not have a process for re-calibrating its instrument is not measuring what it thinks it is measuring, and the longer the instrument runs uncalibrated, the more confidently wrong the readings become.

This reframes most of the practical decisions:

  • "How many cases should our eval set have?" is the wrong question. The right question is "what sampling policy does our set implement, and does that policy still match the distribution we care about?"
  • "Should we add this bug report to the eval set?" becomes "does this case represent a stable cohort, or is it a one-off that will distort the policy?"
  • "Why is the eval green when users are unhappy?" becomes "what is the disagreement between our gold set and a fresh shadow set, and which direction is the gap?"

The teams that get this right treat the eval set the way infrastructure teams treat their monitoring: a piece of production-grade software with an SLO, a release process, and a deprecation policy. The teams that get it wrong treat the eval set like a folder of test fixtures, accumulate cases until the score is dominated by whatever was easy to label, and discover six months later that the green dashboard was the most expensive lie in the building.

The forward move is small and concrete. Pick a refresh cadence — quarterly is a defensible default, monthly if your traffic is changing fast. Build a shadow set. Track the disagreement. When the disagreement crosses your threshold, refresh the gold set, and write down what changed and why. The discipline is not glamorous. It is the difference between an eval program that earns trust as the product grows and one that quietly stops measuring anything at all.
