Skip to main content

The Eval That Quietly Went Stale: When Your Test Suite Measures a World That No Longer Exists

· 9 min read
Tian Pan
Software Engineer

Your eval suite passed. All 240 cases green, same as last week. You ship. Two days later support tickets spike, and when you read the transcripts you find a failure mode your suite has no opinion about at all — not a case that flipped from pass to fail, but a question your users started asking that your suite never thought to ask.

This is the quiet failure of evals. We treat a green run as a statement about the present: "the system works." It is actually a statement about a past — the moment the eval cases were written. An eval authored six months ago encodes three things as they were that day: the product's scope, the model's failure modes, and the way real users phrase their requests. All three move. The feature grew a new surface. The model got upgraded twice. The input distribution drifted as users learned what the product could do. The suite did not move with them, so a green run increasingly certifies a world that no longer exists.

Nobody notices, because nothing breaks. A stale eval does not throw an error. It keeps passing, confidently, while measuring less and less of what matters.

An eval is a snapshot, and snapshots age

It helps to be precise about what an eval case actually is. A case bundles an input, an expected behavior or a grading rubric, and an implicit claim: this input is representative of something users do, and this expected behavior is what "correct" means for it. Both halves of that claim have a shelf life.

The input ages because user behavior is not stationary. The queries a system handles in month three rarely match the distribution from launch week. Users discover capabilities, abandon ones that disappoint them, and adopt new phrasings as your product's vocabulary leaks into how they think. Industry terms shift, slang shifts, and the things people consider worth asking an AI shift faster than anything. An eval case frozen at launch keeps testing the launch-week user, who has since stopped existing.

The expected behavior ages too, and more dangerously, because it ages silently. When you upgrade the model, a rubric that once discriminated "acceptable" from "excellent" can collapse — the new model clears the old bar so easily that the case no longer separates anything. When the product changes, the "correct" answer changes with it: a case that expects the agent to refuse a request becomes wrong the day that request becomes a supported feature. The case still runs. It still reports a result. It is just grading against a spec that has been quietly deprecated.

So a green eval suite is not one claim. It is hundreds of small claims about a world, each made on a different day, none of them automatically revisited. The suite's overall freshness is only as good as its oldest unexamined case.

Three ways a suite rots

Staleness is not one phenomenon. It arrives through at least three distinct channels, and they call for different responses.

Scope drift. The product surface area grew, but the eval suite did not. You added a new tool, a new integration, an entire new workflow — and the suite still covers only the original three. Coverage as a percentage of real functionality silently fell from 90% to 55%. Every case still passes. The suite is not wrong; it is just small relative to what it now needs to defend. This is the most common rot and the easiest to miss, because the existing cases give no signal at all that they have stopped being enough.

Distribution drift. The product surface is stable, but the inputs are not. Your cases were sampled from launch-week traffic, and launch-week traffic is gone. The new edge cases — the ambiguous phrasings, the multi-intent requests, the questions that assume features that did not exist six months ago — are absent from the suite because they were absent from the world when the suite was built. The suite tests a representative sample of a population that has moved.

Spec drift. The most insidious one. The product behavior changed on purpose, but nobody propagated the change into the eval rubrics. Now the suite encodes the old contract. Here a passing case is the bug: it is enforcing behavior the team deliberately retired. And a failing case might be correct — the system did the new right thing, and the eval is punishing it for not doing the old right thing. The signal has inverted, and the suite still looks like it is working.

The reason these matter individually is that they look identical on the dashboard — green, green, green — but each one is fixed differently. Scope drift needs new cases. Distribution drift needs resampled cases. Spec drift needs rewritten cases, and until they are rewritten the suite is actively lying.

A red run is ambiguous: regression or stale case?

Most teams have a story for when the suite goes red: something broke, find it, fix it. That story is only half right, and the missing half is where stale suites do their damage.

When a case flips from pass to fail, there are two possibilities, and they are opposites. Either the system regressed — it used to do this correctly and now it does not — or the eval went stale — the system's new behavior is actually correct, and the case is still grading against the old expectation. One demands a code fix. The other demands an eval fix. Treat them the same and you will either revert a real improvement to make the suite happy, or "fix the eval" to silence a genuine regression. Both are common, and both feel like progress in the moment.

Telling them apart is a judgment call, but it is a structured one. Ask, in order: Did the product spec change in a way that touches this case? If yes, suspect the eval, not the system. Did the model change? If yes and only borderline cases moved, suspect rubric collapse. Is the new output something you would happily show a user? If yes, the eval is probably stale; if it makes you wince, it is probably a regression. The output of this triage is not just a fix — it is a verdict that should be recorded on the case, because a case that has been ruled "stale" once is telling you something about its whole cohort.

This is why a red run is healthier than a green one. A failing case at least forces the question. A passing stale case never gets triaged at all, which is exactly how it survives.

Treat freshness as a property of the suite

The fix is not heroics. It is to stop treating the eval suite as a fixed asset and start treating it as code with a half-life — something that decays on a schedule and must be maintained on one.

Start by dating every case. Each case gets a created date and a last_validated date — the last time a human looked at it and confirmed the input is still representative and the expected behavior still matches the current spec. Validation is not running the case; an automated run tells you the case passes, not that the case still means anything. Once cases are dated, "freshness" becomes a real metric you can put on the dashboard next to pass rate: median case age, percentage of cases validated in the last quarter, oldest unvalidated case. A suite that is 95% green and has a median case age of eleven months is not a healthy suite. It is a confident one.

Then set an explicit expiry policy. Pick a revalidation interval appropriate to how fast your product moves — many fast-moving AI products land somewhere around a quarter. When a case crosses that line it does not get deleted; it gets flagged for human revalidation. A reviewer either renews last_validated, rewrites the case to match current reality, or retires it because the behavior it guarded no longer exists. The point is that nothing in the suite gets to keep voting indefinitely on the strength of a decision made long ago.

Feed the suite from production, continuously. The antidote to distribution drift is to keep sampling. Mine production traces — especially the failures, the escalations, the low-confidence turns — and promote them into eval cases. A pipeline where production failures become tomorrow's regression tests is the only reliable defense against a suite that slowly stops resembling reality. It also means scope drift self-corrects: when a new feature ships and starts generating traffic, that traffic starts generating candidate cases.

Finally, gate model upgrades on a freshness check, not just a pass rate. Before adopting a new model, do not only ask "does the suite still pass?" Ask "does the suite still discriminate?" Run the upgrade and look at how many cases were borderline before and trivial after. A wave of cases that the new model aces effortlessly is not a victory lap — it is a notice that those cases have stopped measuring anything, and the suite needs harder ones.

Run all four and the suite has a metabolism: cases enter from production, get validated on a clock, get rewritten when the spec moves, and get retired when they stop discriminating. Cases age, but the suite does not.

Stop trusting green

The instinct to trust a green eval run is the same instinct that makes stale evals dangerous. A green run is not evidence that the system is correct. It is evidence that the system still satisfies a set of claims about the world, each of which was true on the day it was written.

The discipline is to keep asking when that day was. Date your cases. Put case age on the dashboard next to pass rate. Triage every red run as regression or stale? before touching code. Rotate cases on an expiry, refill from production, and re-check discrimination on every model upgrade. None of this is exotic. It is the same maintenance hygiene you already apply to dependencies and infrastructure — applied, finally, to the thing you have been trusting to tell you whether any of it works.

An eval suite that is never revalidated does not protect you. It just makes the moment you discover it stopped protecting you arrive later, and more expensively. The worst eval failure is not the suite that goes red. It is the one that stays green while quietly measuring nothing at all.

References:Let's stay in touch and Follow me for more thoughts and updates