
Your Eval Set Is a Frozen Photograph of Traffic Your Users Already Left

· 10 min read
Tian Pan
Software Engineer

You shipped a model upgrade. The eval suite went from 87% to 91%. The release notes wrote themselves, leadership clapped, and then the dashboards that actually matter — user satisfaction, escalation rate, thumbs-down ratio — did nothing. Flat. Maybe slightly worse.

This is one of the most disorienting failure modes in AI engineering, because nothing is broken. The eval ran correctly. The numbers are real. The model genuinely improved on the 600 examples you tested it against. The problem is that those 600 examples are a photograph of traffic from the week you built the suite, and your users have spent the months since then walking out of frame.

An eval set is not a measurement of quality. It is a measurement of quality on a specific distribution of inputs. When you quote a benchmark number without quoting the distribution it was computed over, you are doing the equivalent of reporting a temperature without saying where the thermometer was. The number is precise and the number is useless, and the gap between those two facts is where weeks of engineering effort quietly disappear.

Production traffic is a river, not a lake

When you assemble an eval set, you sample from whatever your users were doing at that moment. You curate it, label it, freeze it, and check it into the repo. From that point on it is a static artifact. It does not change.

Your users do not extend the same courtesy. Production traffic is a moving river, and it moves in at least three distinct ways.

It moves because users discover new use cases. You built a support agent for password resets and billing questions. Six weeks in, people are pasting entire error logs into it and asking it to diagnose integrations. Nobody asked permission. The traffic mix shifted under you, and your eval set still thinks the job is password resets.

It moves because users abandon old patterns. The query type that made up 30% of your eval set might be 4% of live traffic now, because you shipped a UI change that handles it without the model, or because a competitor trained users to phrase things differently. Your eval is still spending 30% of its weight grading a question almost nobody asks.

It moves because users probe the edges adversarially. Not always maliciously — often just curiously. They find the phrasings that confuse the model and, because confusion is interesting, they keep doing it. The hard cases in production compound over time. The hard cases in your eval set were fixed in amber the day you wrote them.

A benchmark gain, then, measures progress on a distribution your users have already left. It is a real gain. It is just a gain in a place nobody is standing anymore.
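One way to make the gap concrete is to re-score the same per-category eval results under the live traffic mix instead of the mix frozen into the suite. The sketch below is illustrative only: the category names, accuracies, and traffic shares are hypothetical numbers, and it assumes you can tag both eval examples and production requests with the same coarse categories.

```python
def reweighted_accuracy(per_category_accuracy, traffic_share):
    """Average per-category accuracy, weighted by a given traffic mix.

    Raises KeyError if the mix contains a category the eval never scored,
    which is itself a useful signal that coverage is missing.
    """
    total = sum(traffic_share.values())
    return sum(
        per_category_accuracy[category] * share / total
        for category, share in traffic_share.items()
    )


# Hypothetical numbers, chosen only to illustrate the effect.
accuracy = {"password_reset": 0.97, "billing": 0.93, "log_diagnosis": 0.60}

# Mix frozen into the eval set on the day it was built (assumed).
eval_mix = {"password_reset": 0.30, "billing": 0.60, "log_diagnosis": 0.10}

# Mix observed in production today (assumed).
live_mix = {"password_reset": 0.04, "billing": 0.46, "log_diagnosis": 0.50}

score_on_eval_mix = reweighted_accuracy(accuracy, eval_mix)
score_on_live_mix = reweighted_accuracy(accuracy, live_mix)

print(f"score under eval mix: {score_on_eval_mix:.3f}")
print(f"score under live mix: {score_on_live_mix:.3f}")
```

With these made-up inputs the headline number stays near 91% under the frozen mix and falls to roughly 77% under the live mix, even though no per-category accuracy changed. Reweighting only corrects for shifted proportions; it cannot say anything about categories the eval never sampled.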

The suite goes green while satisfaction drops

The most dangerous version of this is silent. There is no alarm, no failed CI check, no red number. The eval suite is passing. It passes more emphatically than it used to. That green checkmark is doing active harm, because it is the thing leadership looks at to decide whether the AI feature is healthy.

Consider the arithmetic. Suppose your eval set was 85% representative of production traffic when you built it. Each month, drift (new use cases, retired patterns, shifted phrasing) erodes that by about five percentage points. After two quarters, maybe 55% of your eval still reflects what users actually send. Your 91% score is now 91% on a little over half of reality, and a coin flip on the rest.
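The arithmetic can be sketched in a few lines. This is a toy model, not a measurement: it assumes erosion is linear at five points per month (which is what takes 85% to 55% over six months), and it assumes the eval score holds exactly on the share of traffic the eval still covers. The function names are hypothetical.

```python
def coverage_after(initial_coverage, monthly_erosion, months):
    """Share of production traffic the eval still represents (linear decay)."""
    return max(0.0, initial_coverage - monthly_erosion * months)


def true_score_bounds(eval_score, coverage):
    """Worst and best case for real-world accuracy.

    Assumes the eval score is exact on the covered share and that nothing
    at all is known about the uncovered share.
    """
    covered_contribution = eval_score * coverage
    uncovered_share = 1.0 - coverage
    return covered_contribution, covered_contribution + uncovered_share


coverage = coverage_after(initial_coverage=0.85, monthly_erosion=0.05, months=6)
low, high = true_score_bounds(eval_score=0.91, coverage=coverage)

# If the uncovered share really were a coin flip (50% correct):
coin_flip_estimate = 0.91 * coverage + 0.5 * (1.0 - coverage)

print(f"coverage: {coverage:.2f}")
print(f"true score somewhere in [{low:.3f}, {high:.3f}]")
print(f"coin-flip estimate: {coin_flip_estimate:.3f}")
```

Under these assumptions the 91% headline is compatible with a real-world score anywhere from about 50% to about 95%, and the coin-flip reading lands near 73%. The width of that interval is the cost of the stale coverage.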

Meanwhile the half of production your eval no longer covers is exactly the half that is growing, because growth is what made it diverge in the first place. The eval is brightest precisely where the action is thinnest. Teams have reported models that looked fine on static suites while quietly accumulating subtly degraded answers, slow retrieval decay, and edge-case hallucinations — all of it invisible in the aggregate because the aggregate was computed over stale inputs.

You do not get a warning for this. You get a series of normal-looking weekly eval reports and a slow, unattributable decline in the metrics that come from actual humans. By the time someone connects the two, the eval suite has lost credibility, and rebuilding trust in a number is much harder than rebuilding the number.
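A warning of this kind can be approximated by comparing the category mix of the eval set against the category mix of recent production traffic. The sketch below uses total variation distance as the comparison; that choice, the 0.2 threshold, and the category names are all assumptions for illustration, and any real threshold would need tuning against your own traffic.

```python
def total_variation_distance(p, q):
    """Half the L1 distance between two categorical distributions.

    0.0 means identical mixes; 1.0 means no overlap at all. Categories
    missing from one side are treated as having zero share there.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def drift_alert(eval_mix, live_mix, threshold=0.2):
    """Return (should_alert, distance). The threshold is an arbitrary default."""
    distance = total_variation_distance(eval_mix, live_mix)
    return distance > threshold, distance


# Hypothetical mixes: the eval has no examples for two live categories.
eval_mix = {"password_reset": 0.30, "billing": 0.70}
live_mix = {
    "password_reset": 0.04,
    "billing": 0.46,
    "log_diagnosis": 0.30,
    "other_new": 0.20,
}

should_alert, distance = drift_alert(eval_mix, live_mix)
print(f"distance={distance:.2f} alert={should_alert}")
```

Run on a schedule next to the weekly eval report, a check like this turns a silent divergence into a number someone has to look at. It only sees drift along whatever categories you define, so shifted phrasing within a category would still pass unnoticed.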

The survivorship trap: your eval over-represents solved problems

There is a second, subtler distortion, and it is a form of survivorship bias.

When you build an eval set, you tend to populate it with queries you can confidently label. You need a ground-truth answer to score against, so you gravitate toward questions with clean, checkable answers — which are disproportionately the questions the current model already handles well. The genuinely hard, ambiguous, where-even-is-the-answer queries are harder to label, so they get under-sampled or quietly dropped.

The result is an eval set skewed toward solved problems. It is the machine-learning equivalent of studying returning bombers to decide where to add armor: you are looking at the planes that made it back. The queries that "made it back" into your eval are the ones that were tractable enough to label. The ones that crashed — the messy, novel, genuinely confusing inputs — never entered the dataset, so the eval cannot see them.
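One way to see the planes that did not make it back is to keep a record of every sampled candidate, including the ones dropped during labeling, and compare survival rates by category. This is a minimal sketch that assumes such a record exists; the `(category, was_labeled)` tuple shape and the sample data are hypothetical.

```python
from collections import defaultdict


def survival_rate_by_category(candidates):
    """Fraction of sampled queries per category that ended up labeled.

    `candidates` is an iterable of (category, was_labeled) pairs covering
    everything that was sampled, not just what entered the eval set.
    """
    sampled = defaultdict(int)
    kept = defaultdict(int)
    for category, was_labeled in candidates:
        sampled[category] += 1
        if was_labeled:
            kept[category] += 1
    return {category: kept[category] / sampled[category] for category in sampled}


# Hypothetical labeling log: easy billing queries mostly survive,
# messy log-diagnosis queries are mostly dropped as unlabelable.
candidates = (
    [("billing", True)] * 9
    + [("billing", False)] * 1
    + [("log_diagnosis", True)] * 2
    + [("log_diagnosis", False)] * 8
)

rates = survival_rate_by_category(candidates)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%} of sampled queries were labeled")
```

A category with a low survival rate is one where the eval is systematically blind, regardless of how well the model scores on the examples that did get labeled.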
