Benchmark Contamination: Why That 90% MMLU Score Doesn't Mean What You Think
When GPT-4-class models began scoring 88% on MMLU, it felt like a watershed moment. MMLU — the Massive Multitask Language Understanding benchmark — tests 57 academic subjects from elementary math to professional law. An 88% accuracy across that breadth looked like strong evidence of genuine broad intelligence. Then researchers created MMLU-CF, a contamination-free variant that swapped out any questions with suspicious proximity to known training corpora. GPT-4o dropped from 88.0% on the original MMLU to 73.4% — a 14.6 percentage point gap.
That gap isn't a small rounding error. It's the difference between "reliably correct on complex academic questions" and "reliably correct when you've seen the question before." For teams making model selection decisions based on leaderboard scores, it means buying a capability that doesn't fully exist.
How Evaluation Data Gets Into Training Corpora
The mechanism isn't a secret conspiracy. It's structural and nearly inevitable given how LLM pretraining works.
Modern frontier models train on web-scale datasets — Common Crawl snapshots, GitHub, arXiv, StackExchange, Reddit threads. The popular benchmark datasets like MMLU, HumanEval, and GSM8K are publicly available on GitHub and Hugging Face, routinely discussed in academic papers that land on arXiv, and referenced in StackExchange answers. When you vacuum up the internet, you get the benchmarks too.
The contamination takes several forms. Direct inclusion is the most obvious: a GitHub repo with HumanEval problems sits in the training data. But there's also paraphrasing contamination, where synthetic datasets derived from benchmark questions end up in training (CodeAlpaca contained approximately 12.8% rephrased HumanEval samples). There are discussion forums where people post benchmark questions and solutions. There are study guides built from MMLU content. The test sets diffuse through the web over time.
For LLaMA 2, researchers found over 16% of MMLU examples were contaminated, with 11% "seriously contaminated" — meaning more than 80% of tokens from those examples had appeared in pretraining data. For HumanEval, contamination rates across major open-weight pretraining sets range from 8% to 18%.
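The "seriously contaminated" criterion above is a token-overlap measurement against the pretraining corpus. A minimal stdlib sketch of that kind of check follows; the function names, the n-gram size of 8, and the set-based corpus index are illustrative choices, not the exact methodology of the LLaMA 2 analysis.

```python
def ngrams(tokens, n=8):
    """All contiguous n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(example_tokens, corpus_ngrams, n=8):
    """Fraction of the example's n-grams that also occur in a
    precomputed index of pretraining-corpus n-grams. A fraction
    above ~0.8 would mirror the 'seriously contaminated' threshold."""
    example_grams = ngrams(example_tokens, n)
    if not example_grams:
        return 0.0
    return sum(g in corpus_ngrams for g in example_grams) / len(example_grams)
```

At web scale the corpus index would be a Bloom filter or suffix-array lookup rather than an in-memory set, but the measurement is the same.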
The harder question is whether this constitutes actual performance inflation, or whether the model is genuinely learning something from those contaminated examples that transfers. The MMLU-CF results answer this directly: the gap is real and substantial.
Why Standard Detection Methods Keep Failing
The industry's first-pass response to contamination concerns was n-gram matching: check if the test inputs appear verbatim in the training corpus. This works for direct inclusion. It completely fails for everything else.
A 13-billion-parameter Llama model demonstrated this vividly by achieving GPT-4-level performance on MMLU, GSM8K, and HumanEval after training on paraphrased versions of the test sets. The paraphrases passed n-gram decontamination filters trivially. The model had effectively memorized the answer distribution without storing the literal text.
More recent detection approaches use semantic similarity (sentence-BERT embeddings), perplexity analysis (contaminated examples show anomalously low perplexity relative to uncontaminated neighbors), and membership inference attacks. These catch more contamination, but research in 2025 surfaced a new problem: contamination introduced during supervised fine-tuning is detectable at first, yet subsequent post-training on clean data can conceal the evidence. You can train away the fingerprints.
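To make the semantic-similarity idea concrete, here is a toy sketch of flagging near-duplicate corpus documents for a test question. Real pipelines use sentence-BERT embeddings; the bag-of-words cosine below is a stdlib stand-in, and the 0.6 threshold is an arbitrary illustrative value.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def flag_near_duplicates(test_question, corpus_docs, threshold=0.6):
    """Flag corpus documents semantically close to a test question.
    A real detector would swap these bag-of-words vectors for dense
    sentence embeddings; the threshold here is illustrative."""
    query = Counter(test_question.lower().split())
    return [doc for doc in corpus_docs
            if cosine(query, Counter(doc.lower().split())) >= threshold]
```

Unlike n-gram matching, this still fires when the wording shifts, which is exactly the paraphrasing case that verbatim filters miss.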
This is the adversarial dynamic that makes benchmark scores fundamentally unreliable for closed-source models: the people creating the benchmarks can't audit the training data, and the people training the models have strong incentives to score well.
The Benchmark-to-Reality Performance Gap
The clearest evidence that contamination matters for real tasks comes from code generation benchmarks. On HumanEval — which tests function-level code generation on isolated, self-contained problems — frontier models score in the 84–89% range. On real-world class-level code generation benchmarks (requiring models to understand project context, inter-class dependencies, and existing patterns), performance drops to 25–34%.
That's not a small methodological quibble. HumanEval was designed to be easy to evaluate: single functions with clear docstrings and unit tests. It succeeded at being easy to evaluate and failed at predicting whether a model would be useful for actual software development.
GSM8K math reasoning shows similar patterns. Inference-time decontamination techniques — where you rephrase test questions at evaluation time to reduce pattern match likelihood — reduce accuracy by up to 22.9% on GSM8K and 19.0% on MATH. Contamination isn't just inflating absolute numbers; it's artificially narrowing the gap between smaller and larger models, making cheaper models look more capable than they are on tasks that matter.
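The inference-time decontamination loop is simple to express. In the sketch below, `rephrase_fn` stands in for the LLM-based paraphraser those techniques actually use, and the item schema is a made-up example; the point is only that the model never sees the verbatim benchmark question.

```python
def decontaminated_accuracy(model_fn, rephrase_fn, items):
    """Inference-time decontamination: rephrase each test question
    before the model sees it, so verbatim memorization of the
    benchmark can't produce a match. In practice rephrase_fn is an
    LLM paraphraser; items are {"q": ..., "a": ...} dicts."""
    hits = sum(model_fn(rephrase_fn(item["q"])) == item["a"] for item in items)
    return hits / len(items)
```

A model whose score drops sharply under this transformation was, to that extent, matching surface patterns rather than solving problems.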
What Contamination-Resistant Evaluation Actually Looks Like
There are four approaches that work at different points in the evaluation lifecycle.
Temporal grounding is the simplest in principle: use evaluation data that provably postdates the model's training cutoff. LiveBench and LiveCodeBench do this continuously, pulling questions from recent competitions, news events, and newly published problems. The advantage is that contamination is structurally impossible for data that didn't exist during training. The disadvantage is that you need to refresh the benchmark constantly to maintain that property.
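The filtering step of temporal grounding is trivial once items carry creation dates. A minimal sketch, assuming each eval item records a `created` date (the field name is illustrative):

```python
from datetime import date

def temporally_clean(items, training_cutoff):
    """Keep only eval items authored after the model's training
    cutoff: content that did not exist during pretraining cannot
    have leaked into it."""
    return [item for item in items if item["created"] > training_cutoff]
```

The hard part, as the paragraph above notes, isn't the filter; it's continuously sourcing fresh items so the clean set never runs dry.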
Process-level evaluation shifts what you're measuring. Instead of checking only whether the final answer is correct, you evaluate the reasoning trace — the intermediate steps the model produces. Two models with identical accuracy on a math problem set can exhibit sharply different reasoning: one is doing valid algebraic manipulation while the other is pattern-matching to common answer forms. Models that genuinely learned the underlying procedure show better generalization to problem variants. Models relying on memorized answer patterns fall apart when you change the surface form slightly.
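A toy illustration of checking a trace rather than just the answer: for arithmetic, you can at least verify that each claimed step is internally consistent and that the last step actually produces the final answer. The `"<expr> = <value>"` step format is an assumption for this sketch; real process-level evaluation uses learned verifiers over free-form reasoning.

```python
def trace_is_coherent(steps, final_answer):
    """Check an arithmetic reasoning trace step by step. Each step is
    an '<expr> = <value>' string; every claimed value must match the
    evaluated expression, and the last value must equal the final
    answer. Correct-answer-plus-incoherent-trace is the suspicious
    pattern this catches."""
    value = None
    for step in steps:
        expr, _, claimed = step.partition("=")
        value = eval(expr, {"__builtins__": {}})  # trusted toy input only
        if value != float(claimed):
            return False
    return value == float(final_answer)
```

Two models can both output 17; only one of them can show work that actually arrives at 17.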
Behavioral probing goes further. Rather than testing on a held-out slice of the benchmark distribution, you construct new test cases that probe the same capability through a different surface. If a model claims to have learned "solving quadratic equations," you probe whether it can solve them presented as word problems, as code, as geometric problems, and as novel notation systems. Genuine learning generalizes across surface variations; memorization doesn't.
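Probe generation can be as simple as re-instantiating the same underlying skill with fresh parameters. A hypothetical sketch for the quadratic-equations example, where probes are built by expanding `(x - r1)(x - r2)` so every run uses coefficients a memorized answer key has never seen:

```python
import math
import random

def make_quadratic_probes(k=5, seed=0):
    """Variant probes for 'solve x^2 + bx + c = 0': same skill,
    fresh integer roots each time."""
    rng = random.Random(seed)
    probes = []
    for _ in range(k):
        r1, r2 = rng.randint(-9, 9), rng.randint(-9, 9)
        probes.append({"b": -(r1 + r2), "c": r1 * r2,
                       "roots": sorted({r1, r2})})
    return probes

def probe_accuracy(solver, probes):
    """Fraction of probes where the solver returns the right root set."""
    hits = sum(sorted(set(solver(p["b"], p["c"]))) == p["roots"]
               for p in probes)
    return hits / len(probes)
```

A solver that genuinely implements the quadratic formula passes every probe; a lookup table of past answers does not.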
Private held-out sets are often the right answer for practitioners who aren't benchmark researchers — just build your own evaluation set from your actual task distribution, keep it off the internet, and never share it with vendors as fine-tuning examples. This eliminates the entire contamination pathway for your use case.
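The harness for a private held-out set is deliberately boring. A minimal sketch, assuming a simple `{"input": ..., "expected": ...}` item schema of your own choosing:

```python
def evaluate(model_fn, eval_set):
    """Accuracy of a model on a private held-out set: items drawn
    from your own task distribution, never published and never
    shared with vendors as fine-tuning examples."""
    hits = sum(model_fn(item["input"]) == item["expected"]
               for item in eval_set)
    return hits / len(eval_set)
```

Everything contamination-related here lives in how the set is sourced and guarded, not in the scoring code.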
A Practical Evaluation Framework for Teams
If you're selecting or comparing models for a specific deployment, benchmark leaderboard scores should be treated as weak prior information at best. Here's what actually gives you useful signal:
First, construct your evaluation set from your own data. Take 50–200 examples from your actual task distribution, have humans define the correct outputs, and test models against that. This is contamination-proof by design, and it directly measures the capability you care about rather than a proxy.
Second, test under realistic prompting conditions. Many benchmark results use elaborate multi-shot prompting with chain-of-thought, optimized exemplars, and ensemble decoding. In production you'll use simple zero-shot or few-shot prompts. Performance often degrades substantially when prompting moves from benchmark-optimal to production-realistic.
Third, if you're comparing models on standard benchmarks, prefer contamination-aware benchmarks where researchers have made a methodological commitment to contamination resistance — LiveBench, MMLU-CF, or task-specific variants with temporal cutoffs. Treat raw MMLU or HumanEval scores as marketing material.
Fourth, test behavioral generalization rather than accuracy on the standard distribution. Take your use case and create 5–10 "variant probes" that test the same capability differently. A model that aces your examples but fails the variants was probably pattern-matching, not reasoning.
Fifth, for reasoning tasks specifically, look at trace quality. Does the model's reasoning trace lead coherently to the answer, or does the answer appear alongside disconnected reasoning fragments? Correct-answer-with-bad-reasoning is a contamination signal. It's also a reliability signal: those models tend to produce confident wrong answers more frequently when the question is slightly out of distribution.
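The variant-probe step above reduces to one number you can track per model. A hypothetical sketch (names and schema invented for illustration): compare accuracy on your standard examples against accuracy on the variant probes, and treat a large positive gap as a pattern-matching flag.

```python
def accuracy(model_fn, items):
    """Exact-match accuracy over {"input": ..., "expected": ...} items."""
    return sum(model_fn(x["input"]) == x["expected"] for x in items) / len(items)

def generalization_gap(model_fn, standard_set, variant_set):
    """Accuracy on standard examples minus accuracy on variant probes
    of the same capability. A large positive gap suggests memorization
    or contamination rather than genuine skill."""
    return accuracy(model_fn, standard_set) - accuracy(model_fn, variant_set)
```

A model that truly learned the task should score near zero gap; a model that memorized your examples collapses on the variants.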
The Broader Measurement Problem
Benchmark contamination is one symptom of a broader issue: the evaluation infrastructure for AI systems is structurally misaligned with the incentives of the actors being evaluated. Benchmark creators publish their data publicly to allow reproducible research. Model trainers have access to that published data. Evaluators then use the same public data to assess model quality. The conflict of interest is baked in.
This isn't fully solvable — you can't have a benchmark that's simultaneously public enough to allow third-party auditing and private enough to prevent contamination. But you can build evaluation pipelines that acknowledge this limitation rather than ignoring it.
The 73.4% vs. 88.0% gap on MMLU isn't a bug to be patched. It's a quantification of how much benchmark scores overstate genuine capability when you can't verify training data cleanliness. For models where you have no training data visibility — which is most commercial APIs — assume a similar gap exists until you have evidence from your own evaluation showing otherwise.
Build your evals. Test on your distribution. Treat leaderboard scores as a starting filter, not a decision criterion.
