
Benchmark Contamination: Why That 90% MMLU Score Doesn't Mean What You Think

Tian Pan · Software Engineer · 8 min read

When GPT-4 scored 86.4% on MMLU, it felt like a watershed moment. MMLU, the Massive Multitask Language Understanding benchmark, tests 57 academic subjects from elementary math to professional law. Accuracy in the high 80s across that breadth looked like strong evidence of genuine broad intelligence. Then researchers created MMLU-CF, a contamination-free variant that swaps out questions with suspicious proximity to known training corpora. GPT-4o, which scores 88% on the original MMLU, dropped to 73.4% on MMLU-CF: a 14.6 percentage point gap.

That gap isn't a small rounding error. It's the difference between "reliably correct on complex academic questions" and "reliably correct when you've seen the question before." For teams making model selection decisions based on leaderboard scores, it means buying a capability that doesn't fully exist.

How Evaluation Data Gets Into Training Corpora

The mechanism isn't a secret conspiracy. It's structural and nearly inevitable given how LLM pretraining works.

Modern frontier models train on web-scale datasets — Common Crawl snapshots, GitHub, arXiv, StackExchange, Reddit threads. Popular benchmark datasets like MMLU, HumanEval, and GSM8K are publicly available on GitHub and Hugging Face, routinely discussed in academic papers that land on arXiv, and referenced in StackExchange answers. When you vacuum up the internet, you get the benchmarks too.

The contamination takes several forms. Direct inclusion is the most obvious: a GitHub repo with HumanEval problems sits in the training data. But there's also paraphrasing contamination, where synthetic datasets derived from benchmark questions end up in training (CodeAlpaca contained approximately 12.8% rephrased HumanEval samples). There are discussion forums where people post benchmark questions and solutions. There are study guides built from MMLU content. The test sets diffuse through the web over time.

For LLaMA 2, researchers found over 16% of MMLU examples were contaminated, with 11% "seriously contaminated" — meaning more than 80% of tokens from those examples had appeared in pretraining data. For HumanEval, contamination rates across major open-weight pretraining sets range from 8% to 18%.
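To make the "seriously contaminated" criterion concrete, here's a minimal sketch in that spirit. It assumes pre-tokenized examples and a precomputed set of n-grams from the pretraining corpus; the 10-token window is an illustrative choice, not the exact parameter from the LLaMA 2 analysis.

```python
# Token-overlap contamination check, in the spirit of the LLaMA 2
# analysis: a token counts as contaminated if it falls inside an
# n-gram that also appears in the pretraining corpus, and an example
# is "seriously contaminated" when more than 80% of its tokens are
# flagged. Window size and threshold here are illustrative.

def contaminated_fraction(tokens: list[str],
                          corpus_ngrams: set[tuple],
                          n: int = 10) -> float:
    """Fraction of tokens covered by some n-gram seen in pretraining."""
    flagged = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):  # mark every token in the match
                flagged[j] = True
    return sum(flagged) / max(len(tokens), 1)

def is_seriously_contaminated(tokens, corpus_ngrams, threshold=0.8):
    return contaminated_fraction(tokens, corpus_ngrams) > threshold
```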

The harder question is whether this constitutes actual performance inflation, or whether the model is genuinely learning something from those contaminated examples that transfers. The MMLU-CF results answer this directly: the gap is real and substantial.

Why Standard Detection Methods Keep Failing

The industry's first-pass response to contamination concerns was n-gram matching: check if the test inputs appear verbatim in the training corpus. This works for direct inclusion. It completely fails for everything else.
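As a baseline, the filter itself is only a few lines. The sketch below drops any training document that shares a 13-token window with a benchmark test set (13 is the window size GPT-3's decontamination used; the exact value varies by lab). Tokenization and normalization are deliberately crude here.

```python
import re

# Verbatim n-gram decontamination: flag a training document if it
# shares any 13-gram with a benchmark test set. This catches direct
# inclusion and nothing else; paraphrases sail straight through.

def ngrams(text: str, n: int = 13) -> set[tuple]:
    tokens = re.findall(r"\w+", text.lower())  # crude normalization
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_test_index(test_examples: list[str], n: int = 13) -> set[tuple]:
    index: set[tuple] = set()
    for example in test_examples:
        index |= ngrams(example, n)
    return index

def is_contaminated(train_doc: str, test_index: set[tuple],
                    n: int = 13) -> bool:
    return not ngrams(train_doc, n).isdisjoint(test_index)
```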

A 13-billion-parameter Llama model demonstrated this vividly, achieving GPT-4-level performance on MMLU, GSM8K, and HumanEval after training on paraphrased versions of the test sets. The paraphrases passed n-gram decontamination filters trivially. The model had effectively memorized the answer distribution without storing the literal text.

More recent detection approaches use semantic similarity (sentence-BERT embeddings), perplexity analysis (contaminated examples show anomalously low perplexity relative to uncontaminated neighbors), and membership inference attacks. These catch more contamination, but research in 2025 surfaced a harder problem: even when contamination introduced during supervised fine-tuning is initially detectable, subsequent post-training on clean data can conceal the evidence. You can train away the fingerprints.
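Here's a minimal sketch of the embedding-based approach using the sentence-transformers library: flag any test question whose nearest training passage is semantically close, even when no n-gram matches. The model name and similarity threshold are illustrative choices, not a standard.

```python
from sentence_transformers import SentenceTransformer, util

# Embedding-based detection: catches paraphrase contamination that
# verbatim n-gram matching misses. Threshold is an illustrative choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

def flag_paraphrases(test_questions: list[str],
                     train_passages: list[str],
                     threshold: float = 0.85):
    q_emb = model.encode(test_questions, convert_to_tensor=True)
    t_emb = model.encode(train_passages, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, t_emb)  # shape: [num_test, num_train]
    best = sims.max(dim=1)             # nearest training passage per question
    return [
        (q, train_passages[int(i)], float(s))
        for q, s, i in zip(test_questions, best.values, best.indices)
        if float(s) >= threshold
    ]
```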

This is the adversarial dynamic that makes benchmark scores fundamentally unreliable for closed-source models: the people creating the benchmarks can't audit the training data, and the people training the models have strong incentives to score well.

The Benchmark-to-Reality Performance Gap

The clearest evidence that contamination matters for real tasks comes from code generation benchmarks. On HumanEval — which tests function-level code generation on isolated, self-contained problems — frontier models score in the 84–89% range. On real-world class-level code generation benchmarks (requiring models to understand project context, inter-class dependencies, and existing patterns), performance drops to 25–34%.

That's not a small methodological quibble. HumanEval was designed to be easy to evaluate: single functions with clear docstrings and unit tests. It succeeded at being easy to evaluate and failed at predicting whether a model would be useful for actual software development.

GSM8K math reasoning shows similar patterns. Inference-time decontamination techniques, where test questions are rephrased at evaluation time so that memorized surface patterns no longer match, reduce accuracy by up to 22.9% on GSM8K and 19.0% on MATH. Contamination isn't just inflating absolute numbers; it's artificially narrowing the gap between smaller and larger models, making cheaper models look more capable than they are on tasks that matter.
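A sketch of that comparison, with `model_fn` and `paraphrase_fn` as placeholders (in practice the rephrasing is typically done by a separate LLM under answer-preserving constraints): run the same exact-match evaluation twice and report the drop.

```python
# Inference-time decontamination comparison: score the model on the
# original questions and on rephrased versions, then report the gap.
# `model_fn` and `paraphrase_fn` are placeholders for illustration.

def exact_match_accuracy(model_fn, questions, answers) -> float:
    correct = sum(model_fn(q).strip() == a.strip()
                  for q, a in zip(questions, answers))
    return correct / len(questions)

def contamination_gap(model_fn, paraphrase_fn, questions, answers):
    original = exact_match_accuracy(model_fn, questions, answers)
    rephrased = exact_match_accuracy(
        model_fn, [paraphrase_fn(q) for q in questions], answers)
    return original, rephrased, original - rephrased
```

Run against a contaminated model, the gap between the two numbers is the inflation you were paying for.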
