
Snapshot Eval Decay: When Green CI Stops Meaning Your Product Still Works

11 min read
Tian Pan
Software Engineer

Six months of green CI is hiding the fact that roughly forty percent of your eval set no longer represents what users actually do with your product. The suite still runs. The judge still scores. The dashboards still glow. But the cases were written against a query distribution, a corpus, a tool surface, and a regulatory text that have all moved underneath them — and a green run now means "yesterday's product still works on yesterday's reality," which is not the question you are paying CI to answer.

This is snapshot eval decay, and it is the slowest, most expensive failure mode in AI evaluation. Slow because the suite never fails — staleness shows up as inability to discriminate between models, not as red builds. Expensive because by the time someone notices that a model swap which the evals approved caused a production regression, the team has already accumulated a year of "we ship when evals pass" muscle memory built on top of an asset that quietly stopped working.

Static test suites have a familiar failure mode in conventional software: the suite passes, but the product breaks in a way nobody tested for. Snapshot eval decay is worse because the cases themselves haven't gone bad — they still test something correctly. It's the underlying world that moved. The cases ask, "given a customer support email from Q2, does the model file the right ticket?" The model still files the right ticket. But customers stopped writing emails like that. They started pasting Slack threads. They started uploading screenshots. They started using slang the eval set has never seen. The signal the suite is sending — "we are safe to ship" — was true once. It is now an artifact of who curated which examples on which week.

The Four Layers That Drift

It helps to think about eval decay as four overlapping distributions that all shift on different timescales, and the eval set anchors to whichever one was easiest to capture on the day it was written.

The first is the query distribution — what users actually ask. This moves fastest, because it tracks marketing campaigns, viral content, seasonal patterns, and new user cohorts. A coding assistant that launched serving senior backend engineers will, within a quarter, be serving juniors who phrase questions differently and bootcamp graduates who paste in error messages without context. The eval set, frozen at launch, never sees that.

The second is the corpus — the documents the system retrieves over. Knowledge bases get edited. Policy text gets revised. Product catalogs grow. The eval case that asked "what's our return policy?" was graded against the policy text as it stood when the case was written. The policy has been edited eleven times since. The case still passes, because the judge prompt still expects the old language. The model is wrong in production.
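One way to catch this layer, sketched below as a minimal example rather than a prescription: store a hash of the source document next to each eval case at the moment the case is written, then periodically flag cases whose reference text has since changed. The `EvalCase` fields and the `load_live_document` callback are assumptions made for the sketch, not part of any particular eval framework.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    source_doc_id: str      # e.g. "returns-policy"
    source_doc_sha256: str  # hash of the document text the case was graded against

def doc_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def find_stale_cases(cases: list[EvalCase], load_live_document) -> list[str]:
    """Flag cases whose underlying document no longer matches the snapshot
    that the case, and its judge prompt, were written against."""
    stale = []
    for case in cases:
        live_text = load_live_document(case.source_doc_id)
        if doc_hash(live_text) != case.source_doc_sha256:
            stale.append(case.case_id)
    return stale
```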

The third is the upstream tool outputs — the data shape coming back from APIs the agent calls. Backend teams add fields, deprecate enums, change pagination, switch from synchronous responses to job IDs that resolve later. The eval case mocks the old shape. The agent passes the eval. The agent breaks in production because the new shape includes a nullable field the prompt never accounts for.
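A cheap guard against this one, again a sketch that assumes each eval case keeps the mocked tool response it was written against: diff the mock's shape against a fresh response from the live API and surface fields that appeared, disappeared, or came back null. The function below is illustrative, not tied to any specific agent framework.

```python
def diff_tool_shape(mocked: dict, live: dict, path: str = "") -> list[str]:
    """Compare a mocked tool response against a live one and report shape drift:
    fields the live API added, fields the mock relies on that vanished, and
    values that are now null."""
    problems = []
    for key in live.keys() - mocked.keys():
        problems.append(f"new field the mock never covers: {path}{key}")
    for key in mocked.keys() - live.keys():
        problems.append(f"field the eval mock relies on is gone: {path}{key}")
    for key in mocked.keys() & live.keys():
        if isinstance(mocked[key], dict) and isinstance(live[key], dict):
            problems.extend(diff_tool_shape(mocked[key], live[key], f"{path}{key}."))
        elif live[key] is None and mocked[key] is not None:
            problems.append(f"field is now nullable in the live response: {path}{key}")
    return problems
```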

The fourth is the rule set — regulatory text, compliance language, brand guidelines, internal policies the model is supposed to follow. This moves on a slow cadence but with sharp edges: a GDPR amendment, a new disclosure requirement, an updated style guide. The eval set encodes the old rules. The judge enforces the old rules. The model, even if perfectly compliant against the eval, ships output that legal flags in week one.

Most teams build their eval set against one of these — usually the query distribution at launch — and let the others drift uninspected. The set becomes increasingly an evaluation of an old product against an old context, dressed up in the visual language of confidence.

How Decay Hides Inside a Green Suite

The trap is that none of this looks like failure. Failure in software has a visual identity — a red mark, a stack trace, a flag in a Slack channel. Eval decay looks like success. The cases still execute. The judge still emits scores. The aggregate number on the dashboard still trends slightly upward over time, because the team keeps adding cases for new features and the new cases are easy.

What is silently disappearing is discriminative power. A useful eval set must distinguish "model A is better than model B for this product" from "model A is worse." That requires cases hard enough to actually elicit different behavior between candidate models. When the world moves, two things happen at once. New failure modes appear in production that the eval set has zero coverage for — the model gets worse in ways CI cannot see. And the old cases often become trivially easy: the underlying model has improved on the kind of input that was hard two years ago, and now every candidate aces them. The score stays high. The information content collapses.

You can detect this if you measure for it. The diagnostic is whether your eval set still produces meaningful variance across the model candidates you are choosing between. If you swap the production model for last year's frontier model on the eval set and the score drops by less than a percent, the suite has lost the ability to tell good from bad. The signal is gone, and the green build is no longer evidence of anything.
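That check is small enough to automate. A minimal sketch, assuming a `run_eval` function that scores a model against the suite and returns per-case scores; the model identifiers and the one-percent threshold are placeholders:

```python
from statistics import mean

def discrimination_check(run_eval, eval_cases,
                         production_model: str = "prod-current",
                         baseline_model: str = "frontier-last-year",
                         min_gap: float = 0.01) -> float:
    """If the suite cannot separate the current production model from a
    year-old baseline, it has stopped carrying ship/no-ship information."""
    prod_scores = run_eval(production_model, eval_cases)
    base_scores = run_eval(baseline_model, eval_cases)
    gap = mean(prod_scores) - mean(base_scores)
    if gap < min_gap:
        print(f"WARNING: score gap is {gap:.3f}; "
              "the suite may have lost its discriminative power")
    return gap
```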

The harder diagnostic is whether your eval set still produces meaningful variance against your own production logs. Sample a few thousand recent production interactions, replay them through the eval pipeline, and compare the score distribution to your curated set. If production is significantly harder, your curated set is calibrating against a phantom of the product. If production is significantly easier, the eval set has drifted into testing edge cases nobody actually hits.
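A sketch of that second diagnostic, assuming recent production interactions can be sampled and pushed through the same scoring pipeline as the curated set; the two-sample KS test is one reasonable way to compare the distributions, not the only one:

```python
import random
from statistics import mean
from scipy.stats import ks_2samp

def compare_to_production(score_fn, curated_cases, production_logs, sample_size=2000):
    """Score a sample of real production traffic with the same pipeline used for
    the curated eval set and compare the two score distributions."""
    sample = random.sample(production_logs, min(sample_size, len(production_logs)))
    curated_scores = [score_fn(case) for case in curated_cases]
    prod_scores = [score_fn(case) for case in sample]
    stat, p_value = ks_2samp(curated_scores, prod_scores)
    print(f"curated mean={mean(curated_scores):.3f}  production mean={mean(prod_scores):.3f}")
    print(f"KS statistic={stat:.3f}  p={p_value:.4f}")
    # Production much harder: the curated set is calibrated against a phantom product.
    # Production much easier: the curated set has drifted into edge cases nobody hits.
    return mean(prod_scores) - mean(curated_scores)
```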
