
The Eval Ceiling: When Your Golden Test Cases Stop Discriminating

10 min read
Tian Pan
Software Engineer

A year ago, your eval suite did its job beautifully. Candidate models came back with scores spread between 60 and 80, and the ranking told you something. The new fine-tune beat the baseline by six points; the cheaper model lost three. Decisions flowed from the numbers. Today, every candidate scores 95 or 96 or 97 on the same suite, and the spread has collapsed into noise. Your team is still running the eval, still reading the report, still using it to green-light migrations — but the report has stopped containing information.

This is not benchmark contamination. It is not world-drift decay. It is a measurement-instrument problem: your test cases were calibrated for a difficulty level that the models have since moved past. The ruler hasn't broken; the things you're measuring have outgrown it. And the team that doesn't notice keeps making model decisions with a tool whose discriminating range no longer overlaps the candidates being compared.

Saturation Has a Recognizable Shape

The public benchmark literature has been screaming about this problem for two years, and the lesson has not propagated to internal eval suites. MMLU saturated above 88% with frontier models clustering inside a two-point band. GSM8K crossed 99% on top contenders, making it useless for top-tier comparisons. BIG-Bench Hard, designed in 2022 to be hard for frontier models, hit 94% within twelve months of release. Humanity's Last Exam was created in early 2025 explicitly because the existing instruments had run out of headroom.

Your internal eval suite has the same shape — the same compression at the top — but you don't see it as starkly, because there are fewer candidates and the band still looks like signal. Stop and ask what fraction of cases every candidate now scores above 95 on. When the answer is "most of them," you have an instrument whose information per question has fallen to near zero.
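That check is cheap to run against whatever score matrix your harness already produces. The sketch below is a minimal version; the names and the shape of the `scores` dictionary are assumptions for illustration, not any particular harness's API.

```python
def saturated_fraction(scores: dict[str, list[float]], threshold: float = 0.95) -> float:
    """Fraction of eval items on which *every* candidate clears the threshold.

    `scores` maps candidate name -> per-item scores in [0, 1], all the same length.
    Items in that fraction contribute essentially no ranking information.
    """
    runs = list(scores.values())
    n_items = len(runs[0])
    saturated = sum(
        1 for i in range(n_items)
        if all(run[i] >= threshold for run in runs)
    )
    return saturated / n_items

# Hypothetical per-item scores for two candidates on a five-item suite.
scores = {
    "baseline":  [1.0, 0.98, 1.0, 0.97, 0.6],
    "candidate": [1.0, 0.99, 1.0, 0.96, 0.7],
}
print(f"{saturated_fraction(scores):.0%} of items no longer discriminate")  # 80%
```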

The classical psychometrics framing is the cleanest way to think about this. In item response theory, every test item has two parameters that matter: difficulty and discrimination. A highly discriminating item has a steep success curve — slightly more capable test-takers pass it, slightly less capable ones fail it, and the slope at the inflection point is sharp. A poorly discriminating item is flat: every candidate above some low threshold passes, so the item tells you nothing about which of those candidates is actually better. An eval suite is an item pool, and an item pool's useful life depends on whether its difficulty distribution still straddles the ability range you're trying to compare.
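To make "steep" and "flat" concrete: the standard two-parameter logistic (2PL) model gives a candidate of ability θ a pass probability of 1 / (1 + e^(−a(θ − b))), where a is the item's discrimination (the slope) and b its difficulty (the location). A minimal sketch, with made-up parameter values:

```python
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL item characteristic curve: probability that a candidate of ability
    theta passes an item with discrimination a (slope) and difficulty b (location)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A small ability edge (theta = 0.3 vs. 0.0) at the item's difficulty:
print(p_correct(0.3, a=4.0, b=0.0))  # ~0.77 -- steep item: the edge shows up in the score
print(p_correct(0.3, a=0.5, b=0.0))  # ~0.54 -- flat item: barely separates the candidates
```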

When the ability range moves up and the difficulty distribution stays put, every item migrates from the steep middle of the success curve to the flat top. The suite still produces a number. The number is just not a measurement anymore.
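The same model says exactly how much measurement is left. A 2PL item's Fisher information is a² · P · (1 − P), which peaks when candidate ability sits near the item's difficulty and collapses toward zero once every candidate is far above it. Another small illustration with assumed numbers:

```python
import math

def item_information(theta: float, a: float, b: float) -> float:
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P).
    Maximal when theta is near the item's difficulty b, near zero far above it."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Same item, candidates at the item's difficulty vs. far above it:
print(item_information(theta=0.0, a=2.0, b=0.0))  # 1.0    -- item still discriminates
print(item_information(theta=3.0, a=2.0, b=0.0))  # ~0.01  -- item is flat; almost no signal
```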

The Failure Mode Is Quiet and Confident

The team rarely notices the ceiling because every part of the ritual still works. Eval runs complete. Reports generate. Numbers get filed in the model-comparison spreadsheet. Reviewers cite "passed eval suite, 96.2% vs baseline 95.8%" as evidence that a migration is safe. The decision-making texture feels indistinguishable from the era when the suite was discriminating.

The failure shows up downstream and weeks late. A model migration green-lit by a 0.4-point eval improvement turns into a customer-support spike four weeks after rollout. A vendor swap that the suite blessed produces complaints about quality regressions on the long-tail cases nobody tested. A "cheaper model with comparable eval scores" decision burns through budget on retries because the cheaper model fails on the production traffic the eval suite happened not to sample.

The damage is not that you picked the wrong model — sometimes you got lucky and the saturated eval and the real ranking happened to agree. The damage is that you made the decision with no information and didn't know it. You'd have made a better choice flipping a coin and looking at one production trace. The eval gave you confidence proportional to the comprehensiveness of the suite, not the discriminating power of any case in it.

The closely related Goodhart's-law failure compounds the false confidence: vendors and internal teams alike have been training against publicly known benchmarks for years, so the sub-population of cases your private suite borrowed from public sources is the most-saturated sub-population. The cases you copied from MMLU into your "general knowledge" eval section have probably been seen, in some form, during the candidate model's pretraining. The eval is not just saturated — it is contaminated in a direction that compresses scores even further.

What a Working Discipline Looks Like
