
The Eval Set That Got Easier While You Weren't Looking

9 min read
Tian Pan
Software Engineer

You wrote the eval set eighteen months ago. Back then it was a useful instrument: the cheap model scored 71%, the better one scored 84%, and when a regression slipped in the number dropped and somebody noticed. The suite earned its place in CI. You stopped thinking about it.

Run it today and every candidate model scores 96, 97, 98. The new release scores the same as the old one. The model you suspect is worse scores the same as the model you suspect is better. The number still renders green in the dashboard, the check still passes, and it tells you exactly nothing. Your eval set didn't break. It got easier — because the models got better underneath it — and nobody was watching the moment it stopped discriminating.

This is eval saturation, and it is not a failure mode you might hit. It is the guaranteed end state of any static suite given a long enough timeline. A test that everything passes has stopped being a test.

Saturation Is the Expected Outcome, Not the Surprise

It helps to stop treating saturation as a bug and start treating it as physics. The public benchmarks make the trajectory obvious. HumanEval and MBPP — once the default coding yardsticks — now sit around 99% and 94% Pass@1 for frontier models. MMLU and MMLU-Pro are functionally saturated above 88%, where the gap between the top several models is smaller than the noise in the measurement. SuperGLUE hit its ceiling within roughly a year of release. A systematic study of benchmark plateaus found that benchmark age and scale predict saturation, while the safeguards people assume will help — private test sets, closed-ended answer formats — show limited protective effect.

The mechanism is simple. A benchmark discriminates by spreading scores across a range. When the population being measured improves, scores pile up against the maximum. The range compresses. Past a certain point you cannot statistically distinguish the best model from the second-best, because both are scoring 97% and the difference is within sampling error. This is a ceiling effect, and it destroys the one property you actually wanted: the ability to tell good from great.
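To make the ceiling effect concrete, here is a minimal sketch of the arithmetic. It assumes a suite of 300 independent pass/fail cases (the size is an illustrative assumption, as are the function names) and uses a rough unpaired two-proportion comparison. A paired test on the same cases would be tighter, so treat this as an approximation rather than the one correct method.

```python
import math

def standard_error(p: float, n: int) -> float:
    """Standard error of a pass rate p measured on n independent cases."""
    return math.sqrt(p * (1 - p) / n)

def distinguishable(p_a: float, p_b: float, n: int, z: float = 1.96) -> bool:
    """Rough check: is the gap between two pass rates larger than
    z standard errors of their difference? (Unpaired approximation.)"""
    se_diff = math.sqrt(standard_error(p_a, n) ** 2 + standard_error(p_b, n) ** 2)
    return abs(p_a - p_b) > z * se_diff

N_CASES = 300  # assumed suite size, for illustration only

# Unsaturated suite: a 13-point gap stands well clear of the noise.
print(distinguishable(0.71, 0.84, N_CASES))  # True

# Saturated suite: a 1-point gap is inside the noise.
print(distinguishable(0.96, 0.97, N_CASES))  # False
```

At 97% on 300 cases the standard error of a single score is already about one percentage point, so a one-point difference between two models cannot be separated from sampling error under these assumptions.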

Your internal eval set obeys the same physics. It is smaller and more specific than MMLU, which means it saturates faster, not slower — you wrote a few hundred cases reflecting the failure modes you knew about in 2024, and the model has since absorbed most of them. The difference is that nobody publishes a leaderboard for your suite, so nothing forces you to notice. The public benchmarks at least get retired loudly. Yours just quietly goes flat in a CI job that keeps passing.

A Benchmark Everything Passes Has Stopped Measuring Anything

The dangerous part of a saturated eval set is not that it is useless. It is that it is confidently useless. It still produces a number. The number is still high. The pipeline still goes green. Every signal in your release process says the model is fine, and that signal is now disconnected from reality.

Concretely, here is what a saturated suite can no longer do:

  • Rank candidates. Two models score 97% and 97%. You ship a coin flip and call it a decision.
  • Catch regressions. A change that breaks 1% of behavior moves a 97% suite to 96%, a one-point drop that is indistinguishable from run-to-run variance. The regression ships.
  • Justify a model upgrade. The new model costs more. Your eval says it performs identically. You either overpay or skip a real improvement, and you have no evidence either way.
  • Defend a rollback. Something feels worse in production, but the eval set that should adjudicate the dispute reports no change.

The score didn't lie about any single case. Each case it tests, it tests correctly. It lies by omission — by being composed entirely of problems the model has outgrown, so the aggregate it reports averages over a region of the capability space where nothing interesting happens anymore. You are measuring with a ruler whose smallest gradation is now larger than every difference you care about.

There is a quieter cost too. A green eval suite is an alibi. When the suite passes, the team stops looking, because looking is expensive and the dashboard says you are fine. A saturated suite buys false confidence with real attention. It is worse than no eval at all, because no eval at least keeps you nervous.

The Discrimination You Need Lives at the Hard Tail

Where did the signal go? It moved. The cases that separate a good model from a great one are not the ones in your suite — they are the ones at the hard tail you never added.

Think of your eval set as a difficulty distribution. When you wrote it, the bulk of the cases sat in a band the models found genuinely challenging, with a thin tail of hard ones. Eighteen months of model progress slid the entire distribution toward the easy end. The challenging band is now trivial. The thin hard tail is the only part still doing work, and it is too thin to produce a stable score. Most of your 300 cases contribute zero bits of information per run, because every model gets them right. Your effective eval set is the twelve hard cases nobody can fully ace, buried under 288 cases that no longer vary.
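One way to find the effective eval set is to score each case by how much its outcome varies across models. The sketch below uses the binary entropy of a case's pass rate as a stand-in for "bits of information"; that proxy, the case IDs, and the function names are all assumptions made for illustration.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a pass/fail outcome with pass probability p."""
    if p in (0.0, 1.0):
        return 0.0  # every model agrees, so the case carries no signal
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def case_information(results: dict[str, list[bool]]) -> dict[str, float]:
    """Map each case ID to the entropy of its pass rate across models."""
    return {
        case_id: binary_entropy(sum(outcomes) / len(outcomes))
        for case_id, outcomes in results.items()
    }

# Hypothetical results: one pass/fail outcome per model, per case.
results = {
    "easy-001": [True, True, True, True],
    "easy-002": [True, True, True, True],
    "hard-001": [True, False, True, False],
    "hard-002": [False, False, True, False],
}

info = case_information(results)
discriminating = [case_id for case_id, bits in info.items() if bits > 0]
print(discriminating)  # ['hard-001', 'hard-002']
```

Counting the discriminating cases gives a rough size for the part of the suite that still does work.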

This reframes the fix. You do not need a bigger eval set. A thousand easy cases saturate exactly as fast as a hundred. You need a harder one — and specifically you need to keep feeding the hard tail so the difficulty distribution stays centered on the frontier of current model capability. The instrument has to track the thing it measures. A thermometer calibrated for room temperature tells you nothing useful inside an oven, and the model has moved into the oven.

The hard tail also happens to be where your product actually lives or dies. Nobody churns because the agent failed an easy case; the model nails those. They churn on the ambiguous instruction, the adversarial input, the multi-step task where step four depends on a misread of step two. Those are exactly the cases a saturated suite has stopped containing.

Treat the Eval Set as a Living Artifact

The mental shift that fixes this is to stop thinking of the eval set as a fixed asset you built once and start thinking of it as a living artifact with a maintenance burden — code that rots if you don't tend it. Three practices keep it alive.

Mine production for new cases. Your richest source of fresh hard cases is already arriving for free: every thumbs-down, every silent abandonment, every user who rephrased and retried is a labeled failure handed to you by reality. The discipline is to capture each one with the context that produced it — the input, the model version, the retrieved documents, the actual output — and convert it into a permanent eval case. A practical starting point is to seed the suite with 50 real production failures and add a case every time you fix a bug, so each incident permanently raises the floor. Production failures are the highest-quality eval data you will ever get, because by definition they are the cases your existing suite did not catch.
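As a sketch of what "capture each one with the context that produced it" could look like, the record below holds the four pieces of context named above plus a failure signal. The field names, the `EvalCase` type, and the JSONL storage format are hypothetical choices, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvalCase:
    """One production failure, frozen with the context that produced it."""
    case_id: str
    user_input: str
    model_version: str
    retrieved_documents: tuple[str, ...]
    actual_output: str
    failure_signal: str          # e.g. "thumbs_down", "abandoned", "rephrased_retry"
    expected_behavior: str = ""  # left blank until a reviewer fills it in

def to_jsonl_line(case: EvalCase) -> str:
    """Serialize a case as one JSON line, ready to append to a suite file."""
    return json.dumps(asdict(case), sort_keys=True)

# Hypothetical example record.
case = EvalCase(
    case_id="prod-0001",
    user_input="Cancel my order but keep the gift card",
    model_version="model-2024-06",
    retrieved_documents=("refund-policy.md",),
    actual_output="I have cancelled your order and your gift card.",
    failure_signal="thumbs_down",
)
print(to_jsonl_line(case))
```

Keeping the model version and retrieved documents on the record is what lets you replay the case later against a different model with the same context.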

Add adversarial and synthetic hard cases deliberately. Waiting for production to surface every weakness is too slow and too risky. Stratify by difficulty and actively author cases at the hard end: ambiguous instructions, contradictory context, prompt-injection attempts in user-supplied content, long-horizon tasks. Some teams generate candidate cases at evaluation time so the suite cannot be quietly memorized between runs. The goal is to keep a usable score gradient — a spread wide enough that a 1% regression is visible above the noise.
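Stratifying by difficulty can be as simple as tagging each case and reporting a pass rate per tag next to the aggregate. This is a minimal sketch; the stratum labels "easy" and "hard" and the function name are illustrative assumptions.

```python
from collections import defaultdict

def pass_rate_by_stratum(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute the pass rate for each difficulty stratum.

    `results` is a list of (stratum_label, passed) pairs, one per case.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passed, total]
    for stratum, passed in results:
        totals[stratum][0] += int(passed)
        totals[stratum][1] += 1
    return {stratum: p / t for stratum, (p, t) in totals.items()}

# Hypothetical run: the aggregate hides what the hard stratum shows.
run = [
    ("easy", True), ("easy", True),
    ("hard", True), ("hard", False), ("hard", False), ("hard", False),
]
print(pass_rate_by_stratum(run))  # {'easy': 1.0, 'hard': 0.25}
```

If the hard stratum is also pinned near 100%, that is a signal the hard end needs new cases.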

Retire cases the model has solved. This is the step teams skip, and it is the one that keeps the headline number meaningful. When a case has been answered correctly by every model for six months, it has stopped contributing signal and is now pure ballast inflating your aggregate. Move solved cases into a cheap, separate regression tier — still run them, still catch a catastrophic backslide — but pull them out of the headline score so that number reflects only contested ground. An eval set should get smaller and harder at its core even as the regression archive grows.
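The retirement rule can be sketched as a simple partition. The code assumes `history` already holds only the outcomes from the last six months, across every model you ran; the `min_runs` guard and all names here are assumptions added for illustration.

```python
def split_tiers(
    history: dict[str, list[bool]], min_runs: int = 10
) -> tuple[list[str], list[str]]:
    """Partition cases into (headline, regression) tiers.

    A case is retired to the regression tier only if it has at least
    `min_runs` recorded outcomes in the window and every one passed.
    """
    headline, regression = [], []
    for case_id, outcomes in history.items():
        solved = len(outcomes) >= min_runs and all(outcomes)
        (regression if solved else headline).append(case_id)
    return headline, regression

def headline_score(latest: dict[str, bool], headline: list[str]) -> float:
    """Pass rate over the contested (headline) cases only."""
    if not headline:
        raise ValueError("no contested cases left; the suite needs new hard cases")
    return sum(latest[case_id] for case_id in headline) / len(headline)

# Hypothetical six-month history and latest run.
history = {
    "a": [True] * 12,            # solved: retire to the regression tier
    "b": [True] * 11 + [False],  # still contested
    "c": [True] * 3,             # too few runs to call it solved
}
headline, regression = split_tiers(history)
latest = {"a": True, "b": False, "c": True}
print(headline, regression)              # ['b', 'c'] ['a']
print(headline_score(latest, headline))  # 0.5
```

The retired cases still run in the regression tier; they just stop padding the number you look at.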

The test of whether your suite is alive is blunt: when a new model drops, does your eval set give it a different score than it gave the previous model? If yes, the instrument still works. If every model scores the same, you are not measuring the model. You are measuring the age of your test cases.

What to Do This Week

Pull the last several runs of your eval suite across the model versions you have tested. Plot the scores. If the line is flat near the top, you do not have a passing eval suite — you have a broken instrument that happens to be reporting green.
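If you would rather check than eyeball the plot, a rough heuristic is to flag the suite when every score sits above some ceiling and the spread between best and worst is within sampling noise. The 0.95 ceiling, the 1.96 multiplier, the suite size of 300, and the function name are assumptions for this sketch, not established thresholds.

```python
import math

def looks_saturated(
    scores: dict[str, float], n_cases: int, ceiling: float = 0.95, z: float = 1.96
) -> bool:
    """Heuristic: all scores are above `ceiling`, and the best-to-worst gap
    is no larger than z standard errors of a difference between two runs."""
    values = list(scores.values())
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    noise = z * math.sqrt(2 * mean * (1 - mean) / n_cases)
    return min(values) >= ceiling and spread <= noise

# Hypothetical score histories on a 300-case suite.
today = {"model-v1": 0.96, "model-v2": 0.97, "model-v3": 0.98}
eighteen_months_ago = {"cheap": 0.71, "better": 0.84}

print(looks_saturated(today, n_cases=300))                # True
print(looks_saturated(eighteen_months_ago, n_cases=300))  # False
```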

Then do three things. First, find your hard tail, the cases where models still disagree or fail, and confirm there are enough of them to move the aggregate. Second, open a pipeline from production failures into the suite, with full context attached to every case. Third, set a recurring calendar reminder (quarterly is reasonable) to audit difficulty, retire solved cases, and inject new hard ones.

The model will keep getting better. That is the one thing you can count on. An eval set that does not get harder in response is not a safety net slowly wearing out — it is a safety net that has already failed, still hanging there, still looking intact, right up until the moment something falls through it.
