What AI Benchmarks Actually Measure (And Why You Shouldn't Trust the Leaderboard)
When GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B all score 88–93% on MMLU, what does that number actually tell you about which model to deploy? The uncomfortable answer: almost nothing. The benchmark that once separated capable models from mediocre ones has saturated. Every frontier model aces it, yet they behave very differently in production. The gap between benchmark performance and real-world utility has never been wider, and understanding why is now essential for any engineer building on top of LLMs.
Benchmarks feel rigorous because they produce numbers. A number looks like measurement, and measurement looks like truth. But the legitimacy of a benchmark score depends entirely on the validity of what it's measuring—and that validity breaks down in ways that are rarely surfaced on leaderboards.
Goodhart's Law Has Already Reached MMLU
In 1975, economist Charles Goodhart articulated what became known as Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. AI benchmarks are now a textbook example.
MMLU (Massive Multitask Language Understanding) became the de facto standard for evaluating broad model knowledge. Labs cited it in press releases. Investors asked about it. Gradually, optimizing for MMLU performance became a competitive goal in itself. Models got better at the benchmark—not necessarily because they became more capable, but because training pipelines increasingly overlapped with benchmark content.
By mid-2024, the discrimination power of MMLU had effectively collapsed. Top models cluster in a narrow band where score differences are smaller than the measurement noise introduced by prompt formatting. This isn't speculation—changing how answer choices are presented (from (A) to [A], for instance) produces accuracy swings of ~5% on identical models. A 5% formatting artifact is larger than the gap between many competing models on the leaderboard.
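The formatting sensitivity described above is easy to probe yourself. The sketch below renders the same multiple-choice item under two answer-label conventions and scores a model under each; `query_model` is a hypothetical stand-in for whatever model API you use, not a real library call.

```python
# Sketch: measuring prompt-format sensitivity on an MCQ benchmark.
# `query_model` is a hypothetical callable: prompt string -> answer string.

def format_mcq(question, choices, style="paren"):
    """Render the same question under two answer-label conventions."""
    labels = ["A", "B", "C", "D"]
    if style == "paren":
        rendered = [f"({l}) {c}" for l, c in zip(labels, choices)]
    elif style == "bracket":
        rendered = [f"[{l}] {c}" for l, c in zip(labels, choices)]
    else:
        raise ValueError(f"unknown style: {style}")
    return question + "\n" + "\n".join(rendered) + "\nAnswer:"

def accuracy_under_style(items, query_model, style):
    """Score identical items under one formatting convention.

    items: iterable of (question, choices, correct_label) tuples.
    """
    correct = 0
    for question, choices, answer in items:
        prediction = query_model(format_mcq(question, choices, style))
        correct += (prediction.strip().strip("()[]") == answer)
    return correct / len(items)
```

Running `accuracy_under_style` once per style on the same item set gives you the formatting swing directly; if it exceeds the gap between two models you are comparing, the leaderboard ordering is noise.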
MMLU-Pro was designed to fix this, scaling to 12,000 harder questions with 10 answer choices instead of 4. It bought two years. By late 2025, frontier models were clustering in the 88–90% range on MMLU-Pro as well.
This is a pattern, not a fluke. Benchmarks get saturated. The field moves on to harder benchmarks, which in turn get saturated faster because labs now optimize harder. Static benchmarks have a structural half-life problem.
Data Contamination Is Underreported and Underappreciated
When a benchmark question appears in a model's training data, the model isn't demonstrating understanding—it's demonstrating memorization. This is data contamination, and it's far more pervasive than public benchmark reports suggest.
A 2024 analysis of 30 models found that only 9 reported any measurement of train-test overlap. That means the other 21 published benchmark scores without disclosing whether the test data had leaked into training. Studies on mathematical reasoning benchmarks found widespread contamination across 31 analyzed models.
The incentive structure explains this gap. Benchmark scores are used as marketing. Voluntarily disclosing that your benchmark performance might be inflated by contamination is not a compelling announcement. There are no enforcement mechanisms. Researchers who want clean comparisons must essentially take lab numbers on faith.
Even when contamination isn't intentional, it happens. Modern LLMs train on enormous web-scraped datasets. Popular benchmarks are widely discussed, cited, and reproduced online. The overlap isn't always deliberate—it's often unavoidable with current data pipelines. The honest response is rigorous contamination reporting; the common response is silence.
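The minimal version of the contamination reporting described above is a verbatim n-gram overlap check between benchmark items and a sample of the training corpus. This is a toy sketch of the idea; a real audit would normalize text and use corpus-scale indexing rather than in-memory sets.

```python
# Sketch of a simple train-test overlap check: flag benchmark items whose
# word n-grams appear verbatim in sampled training documents.

def ngrams(text, n=8):
    """All verbatim lowercase word n-grams in a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(benchmark_items, training_docs, n=8):
    """Return benchmark items sharing any verbatim n-gram with training text."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [item for item in benchmark_items if ngrams(item, n) & train_grams]
```

Even this crude check is more disclosure than most published benchmark scores come with.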
Multiple-Choice Format Introduces Its Own Distortions
A significant portion of LLM evaluation relies on multiple-choice questions. This format is operationally convenient: scoring is automatic, questions are reproducible, and results are easy to compare. It also introduces subtle distortions that practitioners rarely account for.
Multiple-choice evaluation rewards option elimination over knowledge. A model that knows almost nothing about a topic can still perform above chance by recognizing implausible distractors. At scale, this inflates scores in ways that don't transfer to open-ended tasks.
A more damaging case comes from BBQ, a widely cited benchmark for measuring model bias. In one evaluation, models achieved perfect bias scores simply by refusing to answer ambiguous questions: technically correct under the rubric, but useless for assessing actual bias behavior. The benchmark was measuring refusal rate, not bias. The metric looked good; the capability being assessed was not.
This is a version of the deeper problem: multiple-choice scores are proxies for the capability you actually care about, and the distance between the proxy and the capability varies enormously across domains, model families, and deployment contexts.
Third-Party Frameworks Aren't a Free Lunch
The natural response to distrusting individual lab-reported numbers is to rely on third-party evaluation frameworks. HELM, BIG-bench, and others were built to provide independent, standardized comparisons. They help, but they also introduce their own failure modes.
BIG-bench, which grew through community contributions, accumulated 204 tasks. The breadth was a feature—it covered a wide surface area of capabilities. But it also created serious practical problems: bugs in contributed evaluations, inconsistent quality, and a computational burden that made comprehensive evaluation prohibitively expensive. More fundamentally, because tasks came from many different contributors with different standards, the aggregated score doesn't have a coherent interpretation.
HELM took the opposite approach: expert curation, top-down design. But this created a different problem—the framework couldn't accommodate model-specific design choices. Models with particular input formats or prompting conventions couldn't be evaluated fairly under a rigid unified schema. Iteration cycles stretched into months. When a model updated, the evaluation lagged far behind.
Neither approach solved the fundamental challenge: designing an evaluation that is simultaneously standardized (so comparisons are valid), comprehensive (so coverage is meaningful), and practically feasible (so it can run at the speed of model development).
What Human Evaluations Actually Cost
When automated benchmarks fall short, human evaluation fills the gap. A/B testing with human raters, red-teaming, preference labeling—these feel more direct because they involve human judgment about real outputs. The tradeoff is steep.
Human evaluation introduces evaluator variance. Different raters have different quality thresholds, different domain knowledge, different biases about what counts as helpful versus harmful. A "harmless" judgment from one rater may be a "harmful" judgment from another. Aggregating across evaluators smooths some of this noise, but it doesn't eliminate the fundamental subjectivity.
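Evaluator variance can at least be measured before aggregating. Cohen's kappa is a standard chance-corrected agreement statistic between two raters; the sketch below computes it from paired label lists (a real pipeline would also handle more than two raters, e.g. with Fleiss' kappa).

```python
# Sketch: Cohen's kappa between two raters' labels, correcting raw
# agreement for the agreement expected by chance alone.

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return (observed - expected) / (1.0 - expected)
```

A kappa near zero on "harmful vs. harmless" labels tells you the raters are not measuring the same thing, and no amount of averaging will fix that.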
The tension between helpfulness and harmlessness is particularly sharp. A refusal always scores as harmless: the model produced nothing harmful because it produced nothing at all. This creates a systematic evaluator bias toward over-refusal in red-teaming contexts. The model that refuses everything scores well on safety benchmarks. Whether it's actually useful is a different question.
For high-stakes domains—national security risks, biosecurity, critical infrastructure—red-teaming evaluations require domain experts with genuinely specialized knowledge. Those experts are rare, expensive, and may face legal ambiguity around reproducing controlled information during testing. The evaluation problem becomes a security clearance problem.
The Shift to Custom Evaluation Infrastructure
The practical response to these limitations, especially in production environments, is building custom evaluation infrastructure rather than relying on public benchmarks. This is now the dominant approach at organizations that have moved beyond toy deployments.
Custom evals are built from production data and actual failure modes. If you're building a legal document assistant, your evaluation set should come from legal documents—real edge cases your users have encountered, failure modes you've diagnosed in production, not multiple-choice bar exam questions. The gap between what you care about and what a generic benchmark measures is too wide.
The workflow that works in practice is a loop:
- Offline evals: Run curated test suites against candidate model changes before deployment. Catch regressions. Test the specific capabilities your application depends on.
- Shadow and canary testing: Route a fraction of real traffic to the new model, compare outputs against the baseline, measure against metrics that reflect actual user outcomes.
- Production monitoring: Track live performance over time. Detect drift. Capture outliers that reveal capability gaps the eval suite didn't cover.
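Two steps of the loop above can be sketched concretely. In this sketch, `run_model`, the grading function, and the candidate/baseline callables are hypothetical stand-ins for your own stack, not any particular framework's API.

```python
# Sketch: an offline regression gate plus canary traffic routing.
import random

def offline_gate(suite, run_model, grade, baseline_score, tolerance=0.01):
    """Run a curated suite against a candidate model before deployment.

    suite: list of {"input": ..., "expected": ...} cases.
    Returns (passed, score): candidate must not regress past tolerance.
    """
    score = sum(grade(run_model(case["input"]), case["expected"])
                for case in suite) / len(suite)
    return score >= baseline_score - tolerance, score

def canary_route(request, candidate, baseline, fraction=0.05, rng=random):
    """Send a small fraction of live traffic to the candidate model."""
    model = candidate if rng.random() < fraction else baseline
    return model(request)
```

The design choice that matters is the `grade` function: it encodes what "works for your application" means, which is exactly the judgment a public leaderboard can't make for you.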
This isn't a replacement for benchmarks—it's a complement. Public benchmarks are still useful for initial model selection and rough capability comparison. Custom evals are what actually tell you whether a model works for your application.
The practical gap is significant: organizations that systematically evaluate against their specific use cases see production failure rates drop substantially compared to teams relying on public benchmark scores alone.
How to Read Benchmark Claims Skeptically
Until evaluation science matures, the right posture is systematic skepticism rather than benchmark rejection.
Ask about methodology before citing a score. How was prompting handled? Did they use few-shot examples? Chain-of-thought? Different choices here can shift results by 10 percentage points or more. A benchmark score without methodology documentation is not a benchmark score—it's a number.
Check what the benchmark actually measures. MMLU measures multiple-choice knowledge recall. HumanEval measures algorithm implementation on competitive-style problems. SWE-bench measures performance on real GitHub issues. These are different capabilities. High performance on one says little about the others.
Look for contamination disclosures. If a lab claims state-of-the-art performance on a benchmark without any disclosure of train-test overlap analysis, treat the number cautiously. The absence of disclosure isn't proof of contamination, but it removes confidence.
Weight task-specific benchmarks more heavily than general ones. SWE-bench tells you more about coding capability than MMLU does. GPQA tells you more about scientific reasoning. General-purpose benchmarks are useful for coverage; domain-specific benchmarks are better for capability assessment.
Treat leaderboard rankings as a starting point, not an answer. The top three models on any leaderboard may perform differently on your workload. Use public benchmarks to narrow the selection to a shortlist, then evaluate on your data.
The Harder Truth
The fundamental issue is that evaluation science hasn't kept pace with model development. Models improved rapidly; the infrastructure for rigorously measuring what they can do improved more slowly. Governance frameworks, procurement decisions, and deployment choices are now being made based on benchmarks that have known validity problems—because they're the benchmarks that exist.
This isn't an argument for nihilism about benchmarks. Some benchmarks are better constructed than others. Some are more resistant to contamination. Dynamic and adversarial benchmarks—where questions are generated fresh at evaluation time—are harder to game than static ones. Expert-grounded evaluations that require real domain knowledge are more meaningful than proxy tasks.
But any practitioner who deploys a model because it leads a public leaderboard, without any task-specific evaluation on their own data, is outsourcing a critical engineering judgment to a process with known flaws. The number on the leaderboard is a signal—not a measurement.
What evaluations actually measure is a function of how they were designed, what data they include, how they're scored, and whether the capability they proxy matches the capability you care about. Getting clarity on all four of those questions, for every benchmark you rely on, is the unglamorous work that separates robust AI engineering from benchmark-chasing.
