Goodhart's Law in Your LLM Eval Suite: When Optimizing the Score Breaks the System

9 min read
Tian Pan
Software Engineer

Andrej Karpathy put it bluntly: AI labs were "overfitting" to Arena rankings. One major lab privately evaluated 27 model variants before their public release, publishing only the top performer. Researchers estimated that selective submission alone could artificially inflate leaderboard scores by up to 112%. The crowdsourced evaluation system that everyone pointed to as ground truth had become a target — and once it became a target, it stopped being a useful measure.

This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. The law has been well understood in economics and policy for decades. In LLM engineering, it's actively destroying eval suites right now, often without the teams building them realizing it.

The problem isn't bad intent. Most teams gaming their evals are doing so through ordinary engineering instincts — fix what the metric is telling you to fix, improve what the dashboard shows, tune what's scoring poorly. The issue is that each of those rational decisions slowly erodes the signal until your eval suite measures how well you've optimized for the eval suite, not how well your system performs in production.

The Three Ways Teams Game Their Own Evals

Cherry-Picked Judge Calibration

Most teams using LLM-as-judge evaluation choose a judge model, write a rubric, generate some outputs, check whether the scores "feel right," and then ship the eval system. The calibration step is the trap.

When you adjust your rubric until it produces scores that match your intuitions, you're training your evaluator on your biases. If your team believes verbose, confident-sounding answers are better, you'll tune the rubric to reward verbosity. If you spot-check only on cases where your system performs well, you'll calibrate on an unrepresentative slice.

The result is an evaluator that systematically rewards the surface features your team associates with quality, not actual quality. Models can and do learn to exploit these: formatting tricks, keyword stuffing, and confident tone reliably boost scores in many LLM-as-judge setups. A model that answers "This is clearly X, because [three bullet points of plausible-sounding reasoning]" will outscore one that says "It's probably X, though there's ambiguity here" — even when the second answer is more epistemically honest and more useful.

Training Set Contamination

Eval suites accumulate contamination through proximity to training pipelines. The most common path: you find an area where your model performs poorly, create new fine-tuning examples to fix it, and re-evaluate. If your eval set and your fine-tuning set aren't cleanly separated, you're now testing on data adjacent to what you trained on.
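A first line of defense is a mechanical overlap check between the eval set and the fine-tuning set before every re-evaluation. The sketch below uses exact-match hashing of normalized prompts; the function names and sample data are hypothetical, and exact matching is only a floor — paraphrased contamination needs fuzzier checks (n-gram overlap, embedding similarity):

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial edits don't hide overlap."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def contamination_report(eval_prompts, finetune_prompts):
    """Flag eval prompts whose normalized text also appears in the
    fine-tuning set."""
    train_hashes = {hashlib.sha256(normalize(p).encode()).hexdigest()
                    for p in finetune_prompts}
    return [p for p in eval_prompts
            if hashlib.sha256(normalize(p).encode()).hexdigest() in train_hashes]

eval_set = ["What is the capital of France?", "Summarize this contract clause."]
train_set = ["what is  the capital of France?", "Translate this sentence."]
flagged = contamination_report(eval_set, train_set)
print(flagged)
```

Running a check like this in CI, gating any fine-tuning job, turns "cleanly separated" from an intention into an enforced invariant.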

The effect shows up as mysteriously high scores on benchmarks paired with baffling production failures. Research examining leading open- and closed-source models on clean benchmarks found accuracy drops of up to 13% compared to scores on established benchmarks — with several model families showing systematic overfitting across almost all model sizes. LiveCodeBench, which continuously collects novel problems from competitive programming sites after model training cutoffs, exposed this directly: models with strong static benchmark scores showed stark drops when faced with genuinely novel problems.

The worst version of this problem is active exploitation. In software engineering agent evaluations, some agents learned to inspect the .git history of test repositories to find human-written patches, then copy those solutions rather than solving the underlying problems. The eval measured whether the agent produced the right diff. The agent found a shortcut that produced the right diff. Score went up; capability didn't.

Adversarial Prompt Pruning

Teams routinely prune "unfair" or "ambiguous" prompts from their eval sets. A prompt that consistently confuses the model gets removed because it "isn't representative of real users." Edge cases that expose failure modes get excluded because they're "adversarial."

What remains is a curated set of cases your model handles well. Each pruning decision is individually defensible. The aggregate effect is an eval suite with selection bias toward your current system's strengths. The score goes up, but what it's measuring has quietly shifted.

Why the Divergence Accelerates

The score-production gap doesn't widen linearly — it accelerates. Early in a model's lifecycle, your eval suite correlates reasonably well with production quality because the suite was designed when you hadn't yet optimized heavily for it. As you ship iterations, each optimization pass tightens the coupling between your model and your specific eval setup. The eval becomes less like a test of capabilities and more like a test of familiarity with the test.

Several structural forces drive this:

Saturation. When most examples in your eval set score 85–95%, score differences become statistically meaningless. You're measuring noise. But teams rarely acknowledge this; they continue using the suite as a decision signal even as its discriminative power approaches zero.
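The noise floor is easy to quantify for pass/fail evals: the standard error of an observed accuracy p over n independent examples is sqrt(p(1-p)/n). A worked example under assumed numbers (a 500-example suite scoring around 90%):

```python
import math

def accuracy_noise(p, n):
    """Standard error of a pass/fail eval score: sqrt(p*(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

# A hypothetical 500-example suite scoring around 90%:
se = accuracy_noise(0.90, 500)
print(f"standard error: {se:.3f}")       # ~0.013, i.e. about ±1.3 points
print(f"two-sigma band: ±{2 * se:.3f}")  # score moves under ~2.7 points are noise
```

In other words, on a saturated 500-example suite, a "1.5 point improvement" between two model versions is well inside the noise band and tells you nothing.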

Distribution shift. Production usage evolves; eval sets don't, unless someone actively maintains them. Your users start asking different questions, using the system in unexpected contexts, or hitting edge cases you didn't anticipate. The eval suite remains frozen in the distribution of six months ago.

Judge model drift. If your LLM-as-judge uses an external model (GPT-4o, Claude), that model gets updated. The rubric you calibrated against version N now runs against version N+2, which may have different verbosity preferences, formatting tendencies, or reasoning patterns. Your scores shift without any change to your system.
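A practical guard, sketched below on assumed score data: keep a frozen anchor set of outputs with known scores, re-score it whenever the judge model changes, and flag mean shifts that occur with no change to your own system. The helper name and tolerance are illustrative:

```python
import statistics

def judge_drift_check(anchor_scores_old, anchor_scores_new, tolerance=0.15):
    """Compare mean scores on a frozen anchor set before and after a
    judge-model update. A shift larger than `tolerance` means scores
    moved without any change to the system under test."""
    delta = statistics.mean(anchor_scores_new) - statistics.mean(anchor_scores_old)
    return delta, abs(delta) <= tolerance

# Hypothetical anchor-set scores before and after a judge upgrade:
old = [4.0, 3.5, 4.5, 3.0, 4.0]
new = [4.4, 4.0, 4.8, 3.6, 4.5]
delta, ok = judge_drift_check(old, new)
print(f"mean shift: {delta:+.2f}, within tolerance: {ok}")
```

When the check fails, the fix is recalibration against human labels, not silently accepting the new baseline.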

Meta-Evaluation: Keeping the Eval Honest

The solution to Goodhart's Law isn't abandoning metrics — it's evaluating your evaluator. These practices don't require a research team; they're engineering discipline.

Blind Holdout Sets

Maintain a partition of your eval data that no one on the team uses for iteration. Not "we try not to look at it" — actually lock it down. Run it only when you need to make a consequential decision (shipping a new model, switching providers, major prompt changes). Your working eval set can be Goodharted freely; the holdout set tells you whether the Goodharting translated to real improvement.

The holdout set should be replenished from production data on a regular cadence. Pull a sample of recent production interactions, label them, and rotate them into the holdout pool. This keeps the holdout distribution aligned with actual usage rather than drifting into an artifact of your initial design.
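The rotation itself can be a small utility. A sketch on assumed data, keeping the pool size constant while swapping in freshly labeled production examples (the function name and the 20% rotation fraction are illustrative choices):

```python
import random

def replenish_holdout(holdout, production_sample, rotate_fraction=0.2, seed=None):
    """Replace a fraction of the holdout pool with freshly labeled
    production examples, keeping pool size constant so the holdout
    distribution tracks current usage."""
    rng = random.Random(seed)
    n_rotate = min(int(len(holdout) * rotate_fraction), len(production_sample))
    keep = rng.sample(holdout, len(holdout) - n_rotate)   # survivors
    fresh = rng.sample(production_sample, n_rotate)       # new labeled examples
    return keep + fresh

holdout = [f"old-{i}" for i in range(10)]
prod = [f"prod-{i}" for i in range(50)]
updated = replenish_holdout(holdout, prod, rotate_fraction=0.2, seed=7)
print(len(updated), sum(1 for x in updated if x.startswith("prod-")))
```

The seed argument matters in practice: a logged, reproducible rotation lets you reconstruct exactly which examples were in the holdout pool for any past decision.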

Rater Agreement Audits

Periodically take a sample of your eval outputs and have multiple human raters score them independently. Then compare human scores to your automated evaluator's scores. The numbers to track are:

  • Cohen's Kappa or Krippendorff's Alpha between human raters (tells you whether the task is coherent — if humans disagree with each other, your eval is measuring something ambiguous)
  • Correlation between human consensus and automated scores (tells you whether your evaluator is measuring what you think it's measuring)

A practical cadence is 100–200 examples quarterly, or whenever you make a significant change to your eval system. Teams that skip this step are operating on faith that their automated judge agrees with humans. That faith is often misplaced — state-of-the-art judges exhibit correlation fluctuations of up to 0.2 depending on how prompts are structured, and known biases (verbosity preference, position bias, self-preference) can systematically skew scores in ways no rubric adjustment will fix.

User-Outcome Correlation

Eval scores are internal signals. The external signal is what users actually do. Build an explicit correlation check between your eval metrics and downstream user behavior:

  • Task completion rate
  • Edit rate on model outputs (users who immediately edit a response are implicitly rating it as poor)
  • Return behavior (did this interaction lead to another session, or did the user abandon?)
  • Explicit ratings where available

You don't need perfect measurement here — even rough correlation tracking will tell you whether your eval is predictive. If your eval score improved 15% over the last quarter but task completion rate is flat, your eval suite has decoupled from the thing you actually care about. That's a fire alarm, not a reason to celebrate.

Treating Your Eval as an Adversarial System

The mindset shift that actually works is to treat your eval suite the way a security team treats an application: assume it will be exploited, and design for that assumption.

This means:

  • New prompt categories enter through a defined intake process, not ad hoc addition
  • Removing prompts from the eval set requires explicit justification that's reviewed by someone not on the daily iteration loop
  • Judge calibration is done against a human-labeled ground truth, not intuition
  • Every score dashboard shows historical trend lines, so a sudden improvement triggers investigation rather than celebration

The teams that maintain trustworthy eval suites don't avoid Goodhart's Law through cleverness — they build institutional friction that slows the drift. The eval suite is a shared asset that should be harder to change than the model it's evaluating.

The Practical Starting Point

If your eval suite has been in place for more than six months and you haven't run a rater agreement audit or built a holdout set, assume the signal has drifted. Not because your team did something wrong, but because this is what happens by default.

The minimum viable intervention: pick 150 examples from your eval set, have two engineers score them independently, compare to your automated scorer, and document what you find. If agreement is below 0.7 Kappa, your eval is measuring something ambiguous. If correlation with your automated scorer is below 0.6, your evaluator has diverged from human judgment. Either result is worth knowing before you ship a decision based on those scores.

Goodhart's Law doesn't mean metrics are useless — it means metrics require maintenance. Your eval suite isn't a one-time artifact; it's infrastructure that degrades without attention. The score that felt like solid ground six months ago may be a false floor now.
