
The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse

Tian Pan
Software Engineer

The most expensive failure mode in an LLM product is not a bad release. It is six consecutive good releases — by every internal scoreboard — while user trust quietly bleeds out. The offline eval score climbs every Friday demo. The CSAT line in the weekly business review goes flat, then dips, then nobody knows when it started dipping because nobody was triangulating the two charts. By the time a postmortem names it, the team has spent two quarters tuning a prompt against a dataset that stopped resembling reality somewhere around month three.

This is the eval-set-as-simulator drift, and it is the cleanest example I know of an old machine-learning lesson being rediscovered at full retail price by a generation of LLM teams who skipped the reading list. An eval suite is not a fixture. It is a simulator, and a simulator that is never re-calibrated against the system it claims to predict will eventually predict a different system.

How a Sample Becomes a Fixture

Every eval set begins life as a sample of production traffic. Someone exported a thousand recent conversations, filtered for the failure modes they cared about, hand-labeled the ones worth scoring, and committed the result to the repo. On day one, the eval is a faithful — if narrow — picture of what users actually ask and what good answers actually look like.

Then the eval gets a name, a CI job, and a scoreboard. The number it produces becomes the gate that prompt changes ship through. The team that owns the prompt now has a strong incentive — and many small daily incentives — to learn its specific edge cases. They notice that example #47 fails because of a particular phrasing, and they patch the prompt to handle that phrasing. They notice that the LLM-as-judge in the suite has a soft spot for bullet points, and the next prompt revision uses more bullet points. None of this is dishonest. Each individual change is a reasonable response to a real signal. But cumulatively, the team is now optimizing the system for the eval, not for users — the strong version of Goodhart's law, the one that says when you optimize a proxy hard enough, the thing you actually care about often gets worse, not just decoupled.

Meanwhile, three forces are quietly moving production away from the frozen sample:

  • User behavior shift. Customers learn what the agent is good at and what it isn't. The query distribution at month one is dominated by exploratory questions; by month six it is dominated by the long tail of edge cases the easy questions don't cover. The eval, sampled at month one, is now testing the wrong distribution.
  • Model upgrade shift. Every model swap moves the joint distribution of what users send and how the system responds. A model that handles ambiguous prompts more gracefully invites more ambiguous prompts. The new model is then scored on an eval sampled from traffic that the previous model shaped.
  • Concept shift. The right answer to "what is our refund policy?" changes the day the policy changes. Eval items whose ground-truth answers have decayed are now actively misleading the score — the model that learned the new policy fails an eval written for the old one. This is the classical concept drift problem from a decade of recommender-systems literature, with no LLM-specific exemption.
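
Concept shift is the one force that can be checked mechanically, provided each eval item records when it was labeled and which policy or topic it depends on. The sketch below is a minimal illustration under that assumption; the `EvalItem` fields, the `topic_changed_on` mapping, and the `stale_items` helper are hypothetical names, not part of any particular eval framework.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EvalItem:
    # Hypothetical schema: assumes each item is tagged with the topic
    # its ground truth depends on and the date it was hand-labeled.
    item_id: str
    topic: str
    labeled_on: date


def stale_items(items, topic_changed_on):
    """Return items whose label predates the latest change to their topic.

    topic_changed_on maps a topic name to the date its source of truth
    (for example, the refund policy) last changed. Topics missing from
    the mapping are treated as unchanged.
    """
    return [
        item
        for item in items
        if item.topic in topic_changed_on
        and item.labeled_on < topic_changed_on[item.topic]
    ]


items = [
    EvalItem("q-001", "refund_policy", date(2024, 1, 10)),
    EvalItem("q-002", "refund_policy", date(2024, 5, 2)),
    EvalItem("q-003", "shipping", date(2024, 1, 12)),
]
# Assumed for illustration: the refund policy changed on 2024-04-01.
flagged = stale_items(items, {"refund_policy": date(2024, 4, 1)})
print([item.item_id for item in flagged])
```

Flagged items are candidates for relabeling or retirement, not automatic deletions: a label can survive a policy change untouched, and only a human review can say which ones did.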

By month four, the eval is a snapshot of a moving target. By month six, the snapshot and the target are statistically distinct populations. By month eight, the offline-vs-online correlation that justified the entire test suite is no longer there, and nobody has bothered to measure it.
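
Measuring that correlation does not require much machinery. The following is a rough sketch, assuming you have one offline score and one online metric (CSAT here) per release; it computes a Spearman rank correlation in plain Python. The function names and the sample numbers are made up for illustration.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared_rank
        i = j + 1
    return result


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant series has no defined correlation
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    return pearson(ranks(xs), ranks(ys))


# Illustrative numbers only: six releases, offline score vs. CSAT.
offline_scores = [0.71, 0.74, 0.76, 0.79, 0.81, 0.84]
csat = [4.3, 4.3, 4.2, 4.1, 4.0, 3.9]
rho = spearman(offline_scores, csat)
print(f"offline-vs-online rank correlation: {rho:.2f}")
```

With only a handful of releases the estimate is noisy, so treat the sign and the trend over time as the signal, not the exact value. A correlation that drifts from positive toward zero or below is the warning that the suite has stopped predicting production.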

The Three Failure Modes Stacked Together

Each of those three forces is bad alone. Stacked, they produce the specific pathology that shows up in the postmortem: a system whose offline metrics are a strict lie. Not noisy — anti-correlated. The releases scoring highest on the suite are the ones tuned hardest to its edge cases, and tuning hard to a stale fixture is exactly how you ship a regression to current users.

You can see the same shape in benchmark land. Public benchmarks saturate not because models got perfect, but because labs learned the test. MMLU stopped distinguishing frontier models years before the models stopped improving. The same dynamic plays out internally: every team eventually ships its own MMLU, and every team eventually saturates it without noticing — except that public benchmarks at least get retired loudly, while internal eval suites just rot in place.

The teams that catch this early share one habit: they instrument the gap, not just the score. They keep a number that says how different the offline eval distribution is from this week's production distribution (a divergence metric, however crude) and they put it next to the score on the same dashboard. The gap is the load-bearing signal; the score is downstream of it. A score that ticks up while the gap also ticks up means nothing, and the dashboard should make that visible.
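
One crude version of that gap number, sketched under the assumption that every eval item and every sampled production query already carries a coarse category label (intent, topic, or similar): compare the two label histograms with Jensen-Shannon divergence. The function names and labels below are hypothetical, and any other divergence over any other featurization would serve the same purpose.

```python
import math
from collections import Counter


def distribution(labels, categories):
    """Relative frequency of each category, in a fixed category order."""
    if not labels:
        raise ValueError("cannot build a distribution from zero labels")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def eval_prod_gap(eval_labels, prod_labels):
    """Gap between the eval set's and production's category mix."""
    categories = sorted(set(eval_labels) | set(prod_labels))
    return js_divergence(
        distribution(eval_labels, categories),
        distribution(prod_labels, categories),
    )


# Illustrative labels only: the eval is heavy on exploratory questions,
# while this week's production sample is mostly edge cases.
eval_labels = ["explore"] * 70 + ["billing"] * 20 + ["edge_case"] * 10
prod_labels = ["explore"] * 20 + ["billing"] * 25 + ["edge_case"] * 55
print(f"eval-vs-production gap: {eval_prod_gap(eval_labels, prod_labels):.3f}")
```

Jensen-Shannon is used here because it is symmetric and stays finite when a category appears on only one side, which is exactly what happens when production grows a category the eval never sampled. The absolute value matters less than its week-over-week direction plotted beside the score.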

What "Simulator Discipline" Actually Looks Like
