The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All
A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.
Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.
What follows is the catalog I've been accumulating — five smells that I've seen turn eval suites into liabilities, with the refactoring pattern for each. The goal isn't to throw out your evals. It's to restore their signal, the way a test refactor restores signal in production code.
Smell 1: Training-Eval Data Leakage
The smell: your eval scores are suspiciously high, and they don't correlate with how users actually experience the feature. Specifically, scores are high on tasks your model has likely seen, and low on tasks it probably hasn't.
This is the most widely documented LLM eval failure, and it has at least three mechanisms that teams conflate. First, classic benchmark contamination: the model was trained on a corpus that included the test set, so evaluation measures memorization instead of capability. Second, subtle operational leakage: your eval set was generated using the same model you're evaluating (a "golden dataset" produced by GPT-4 that you then use to test GPT-4 variants), and the distribution quietly encodes the evaluator's preferences. Third, preference leakage in LLM-as-judge setups, where the judge model has been fine-tuned on data produced by — or closely related to — the model being judged, so the judge systematically over-rates certain styles.
Recent work has shown that even mitigation strategies can fail their own premise. Code refactoring to disguise benchmarks, synthetic perturbations, paraphrased rewrites — none are both effective and faithful to the original evaluation intent across all settings. The practical lesson isn't that contamination is unfixable. It's that any eval set over six months old, especially one curated from web-like sources or generated with a frontier model, should be treated as contaminated until proven otherwise.
The refactor: split your eval suite by provenance. Tag each case with its origin — synthetic-from-production, hand-written, public benchmark, model-generated. Track scores per bucket rather than in aggregate. When the gap between buckets is large (public benchmarks scoring 15+ points above fresh production-derived cases), you have a contamination signal you can act on, not a mystery. Add a "fresh cases" cohort sourced from the last 30 days of production and weight regressions there more heavily than on the legacy public bucket. Over time, retire public benchmark cases from the gating suite entirely and keep them only as historical reference.
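The provenance-bucket tracking above can be sketched in a few lines. This is a minimal illustration, not a real harness: the case schema, the bucket names, and the 15-point gap threshold are assumptions taken from the text.

```python
# Sketch of per-provenance score tracking. Case schema, bucket names, and
# the 15-point contamination gap are illustrative assumptions from the text.
from collections import defaultdict

def scores_by_provenance(results):
    """results: iterable of dicts like {"provenance": ..., "passed": bool}."""
    buckets = defaultdict(lambda: [0, 0])  # provenance -> [passed, total]
    for r in results:
        b = buckets[r["provenance"]]
        b[0] += r["passed"]
        b[1] += 1
    return {p: passed / total for p, (passed, total) in buckets.items()}

def contamination_signal(scores, gap=0.15):
    """Flag when public-benchmark cases outscore fresh production-derived ones."""
    public = scores.get("public_benchmark")
    fresh = scores.get("synthetic_from_production")
    if public is None or fresh is None:
        return False
    return public - fresh >= gap

results = [
    {"provenance": "public_benchmark", "passed": True},
    {"provenance": "public_benchmark", "passed": True},
    {"provenance": "synthetic_from_production", "passed": True},
    {"provenance": "synthetic_from_production", "passed": False},
]
scores = scores_by_provenance(results)
print(scores, contamination_signal(scores))
```

The point is the shape of the data, not the arithmetic: once every case carries a provenance tag, the contamination check is a one-line comparison instead of an audit.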
Smell 2: Brittle Exact-Match Assertions
The smell: your test is red, but the output is correct. "The answer is 42" fails against the expected "42". A JSON key order swap breaks the assertion. A model that starts wrapping numbers in units ("42 ms" instead of "42") torches your pass rate, and nobody can tell whether the regression is real.
This one ports directly from the traditional test smell literature, where "Assertion Roulette" and "Fragile Test" have been catalogued for years. In LLM evals it takes an especially toxic form because natural language has no canonical surface form. The model produces one of thousands of valid paraphrases; your assertion accepts one. Every model upgrade shuffles the distribution of phrasings, and your eval suite punishes the model for drifting toward a different valid form.
Exact match still has a place. Structured outputs where format matters (a SQL query, a function call, a machine-readable enum) should be checked with exact or near-exact equality — that's not a smell, that's the correct tool. The smell is applying exact match to free-form outputs where semantic equivalence is what you actually care about.
The refactor: move free-form assertions up the semantic stack. For short factual answers, use extract-and-normalize assertions (parse the number, strip units, compare). For longer outputs, prefer property-based checks ("answer contains the correct entity", "answer does not contradict source", "answer is no more than two sentences") over full-string equality. For truly open-ended text, use rubric-graded judge evaluation with explicit criteria, not BLEU-style lexical overlap. When you do need to assert against reference text, use semantic similarity with a quality-controlled threshold rather than exact match — and pin the embedding model used for the comparison so the threshold stays meaningful across runs.
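For the short-factual-answer case, the extract-and-normalize pattern is small enough to show in full. A minimal sketch, assuming numeric answers; the regex and helper names are mine, not from any particular eval framework:

```python
import re

def normalize_numeric(text):
    """Extract the first number from free-form text, ignoring prose and units."""
    m = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(m.group()) if m else None

def numbers_match(output, expected):
    """Compare outputs by extracted value, not surface form."""
    return normalize_numeric(output) == normalize_numeric(expected)

# All three pass here; exact string match would fail the first two.
assert numbers_match("The answer is 42", "42")
assert numbers_match("42 ms", "42")
assert numbers_match("42", "42")
```

The same idea extends to dates, enums, and JSON: parse both sides into a canonical value, then assert on the value.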
Smell 3: Eval Rot
The smell: your eval suite passes, your users complain, and the two sets of signals no longer overlap. Production is failing in ways your suite was never designed to see, because your suite was designed around failure modes from two model generations ago.
In traditional tests this is sometimes called "test rot" or "test relevance decay," but LLM eval rot is faster and meaner. Model capabilities shift every few months. Product surfaces expand. User behavior adapts to the model's quirks, which changes the input distribution. An eval suite written in early 2024 around GPT-4's refusal patterns is mostly measuring an artifact that no longer exists — and worse, it's punishing newer models for not reproducing the old behavior. The suite isn't measuring quality; it's measuring backwards-compatibility with a prior model's idiosyncrasies.
A specific sub-smell here is what I'd call "capability pinning": assertions written to demand a specific reasoning path ("response must mention intermediate step X") rather than a correct outcome. When a more capable model reaches the right answer via a different route, the test fails, and an eager engineer "fixes" the model by prompting it back into the old path. Capability pinning turns your eval suite into a regression-enforcement mechanism, actively resisting model improvements.
The refactor: enforce an eval freshness SLO. Every case gets a last-reviewed date. Cases older than a threshold (90 days is a reasonable starting point, shorter for rapidly evolving features) are flagged for review — not auto-deleted, but required to be re-examined against current production failure modes before they can gate a release. Pair this with a continuous ingestion pipeline: sample production traces, annotate the interesting failures, convert the failures into new eval cases. Teams that treat their eval set as a living index of current failure modes rather than a fossil of past ones don't get rot.
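The freshness SLO is mechanically trivial, which is part of the argument for it. A sketch, assuming the 90-day threshold from the text and a `last_reviewed` field on each case (both are illustrative choices):

```python
from datetime import date, timedelta

FRESHNESS_SLO = timedelta(days=90)  # starting threshold suggested in the text

def flag_stale_cases(cases, today):
    """Return cases past the freshness SLO. Flagged cases must be re-reviewed
    against current production failure modes before they can gate a release."""
    return [c for c in cases if today - c["last_reviewed"] > FRESHNESS_SLO]

cases = [
    {"id": "sum-001", "last_reviewed": date(2025, 1, 10)},
    {"id": "sum-002", "last_reviewed": date(2024, 6, 1)},
]
stale = flag_stale_cases(cases, today=date(2025, 3, 1))
print([c["id"] for c in stale])  # only sum-002 is past the 90-day SLO
```

Run it in CI as a warning, not a failure: the goal is a visible review queue, not auto-deletion.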
Concretely, audit your current assertions for capability pinning. For every assertion that names a specific intermediate step, ask: would this test pass if the model reached the correct final answer via a different route? If the answer is no and that route is plausibly better, rewrite the assertion to check the outcome, not the path.
Smell 4: Judge-Model Collusion
The smell: your LLM-as-judge eval gives glowing scores to outputs that human reviewers find mediocre. Scores for your production model are consistently higher when the judge is the same model family, and drop when you swap in a different family's judge. Your judge has opinions that look suspiciously like preferences the production model was trained on.
The mechanism is well-studied. Judge models prefer outputs with lower perplexity under their own distribution — which means outputs generated by similar models, or fine-tuned against similar preference signals, receive systematically higher scores. This is "self-preference bias," and it survives even careful prompt engineering of the judge. Add in position bias (judges favor earlier or later options based on model family and prompt structure), verbosity bias (longer, more fluent outputs get higher scores regardless of substance), and alignment bias (judges favor outputs matching their own policy preferences), and you have an eval that's measuring stylistic alignment with the judge rather than quality.
The deeper failure mode is feedback-loop collusion: the judge model and the production model share blind spots. A hallucination both models would produce won't be flagged. A factual gap both models share becomes invisible. Your eval can never see below its own epistemic floor.
The refactor: treat the judge as a component that needs its own validation, not as ground truth. First, build a human-labeled calibration set — a few hundred production-representative cases with human scores — and measure judge-human agreement. A judge with less than moderate agreement with humans on your task is not a judge; it's a random signal generator. Second, use judge model diversity: rotate among at least two model families, and flag any case where they disagree strongly for human review. Third, prefer narrow, specific judge prompts ("does this output contain a factual claim contradicted by the source?" scored binary) over vague ones ("rate quality from 1-10"). Narrow prompts degrade more gracefully under bias.
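The calibration step is concrete enough to sketch. For a binary rubric, Cohen's kappa is a standard chance-corrected agreement measure; the 0.4 "moderate agreement" cutoff below is a conventional rule of thumb, and the labels are invented for illustration:

```python
# Judge-calibration check: Cohen's kappa between human labels and an
# LLM judge on a binary rubric (1 = pass, 0 = fail).
def cohens_kappa(human, judge):
    n = len(human)
    po = sum(h == j for h, j in zip(human, judge)) / n   # observed agreement
    p_h, p_j = sum(human) / n, sum(judge) / n
    pe = p_h * p_j + (1 - p_h) * (1 - p_j)               # chance agreement
    return (po - pe) / (1 - pe)

human = [1, 1, 0, 0, 1, 0, 1, 1]  # hypothetical calibration labels
judge = [1, 1, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")

# Conventional rule of thumb: below ~0.4 is less than moderate agreement,
# and a judge at that level is generating noise, not signal.
if kappa < 0.4:
    raise RuntimeError("Judge-human agreement too low to gate releases")
```

In practice you'd compute this over a few hundred cases, per cluster, and re-run it every time the judge prompt or judge model changes.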
Finally, for any benchmark where your own production model might serve as the judge or share ancestry with the judge, disclose it in your internal dashboards. Collusion becomes much easier to catch when it's labeled rather than hidden.
Smell 5: Vanity Aggregates
The smell: your eval dashboard shows one big number. That number is green. Nobody can tell you what it means, and when pressed, the number moves for reasons nobody can explain.
This is the LLM-eval version of the traditional "Uninformative Test Name" and "Mystery Guest" smells, elevated by the fact that LLM eval suites often span wildly different task types. A single aggregate score that mixes factuality, instruction-following, safety, tone, and latency into one number is fundamentally a PR metric, not an engineering signal. When the number moves, you can't tell whether your factuality got better, your safety got worse, your tone shifted, or your latency tanked. When the number is flat, you can't tell whether nothing changed or whether equal and opposite movements canceled out.
The sub-smell to watch for is "averaging across severity." A suite that averages a question-the-model-shouldn't-answer-under-any-circumstances case together with a minor tone-preference case is implicitly claiming they matter equally. They don't. Under an average, a regression on the critical case gets diluted by improvements on the trivial one, and the aggregate moves in the wrong direction while the actual safety posture degrades.
The refactor: separate your eval suite into signal clusters — factuality, instruction-following, safety, format compliance, latency, cost — and report them independently. Never roll up across clusters into a single scalar. Within each cluster, stratify by severity: a "must-not-fail" bucket that blocks releases on any regression, a "core quality" bucket that triggers review, a "polish" bucket that's informational. A release that improves polish by 3 points and regresses safety by 1 point should not ship, and no sane dashboard would let it look like progress. A stratified dashboard prevents the arithmetic from lying; an aggregate one invites it.
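The gating logic above can be made explicit in code, which also makes it auditable. A sketch: the cluster and severity names follow the article's buckets, and the decision strings are my own invention:

```python
# Severity-stratified release gate. Cluster/severity names follow the text;
# everything else is an illustrative assumption, not a real CI integration.
def release_decision(deltas):
    """deltas: {(cluster, severity): score change in points vs. baseline}."""
    blockers, reviews = [], []
    for (cluster, severity), delta in deltas.items():
        if severity == "must_not_fail" and delta < 0:
            blockers.append(cluster)   # any regression blocks the release
        elif severity == "core_quality" and delta < 0:
            reviews.append(cluster)    # regression triggers human review
        # "polish" deltas are informational only
    if blockers:
        return f"BLOCKED: regression in {', '.join(blockers)}"
    if reviews:
        return f"REVIEW: regression in {', '.join(reviews)}"
    return "SHIP"

# The scenario from the text: +3 polish never averages away -1 safety.
decision = release_decision({
    ("safety", "must_not_fail"): -1,
    ("tone", "polish"): +3,
})
print(decision)  # BLOCKED: regression in safety
```

Note that no averaging happens anywhere: each severity tier has its own rule, so the arithmetic cannot trade a safety regression for a polish gain.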
The Meta-Smell: Evals Without Error Analysis
Underneath all five specific smells is a structural one: running evals without doing error analysis on the failures. If a suite produces pass/fail numbers but no one sits down to read the actual failing outputs, it will accumulate every smell above, because the smells are only visible when you look at specifics. The contaminated case looks fine in aggregate; you have to eyeball it to notice the model is reciting. The brittle assertion looks like a regression; you have to read the diff to see the output was correct. The rotted case passes until you notice it's testing behavior that doesn't matter anymore. Error analysis — the unsexy practice of reading thirty failing cases by hand every week — is the activity that keeps the catalog above from forming in the first place.
Teams that skip error analysis tend to treat evals as a CI check. Teams that do error analysis treat evals as a lens, and keep sharpening the lens as the thing it's looking at changes shape. The second kind of team ships AI systems that hold up; the first kind ships systems that look green until they don't.
What To Do Monday Morning
You don't need to rewrite your eval suite. You need to instrument it. Tag cases by provenance so contamination is visible. Audit a random 20 assertions for brittleness and convert the worst offenders to property checks. Put a last-reviewed date on every case and retire anything older than your freshness SLO that nobody can justify. Measure your judge against human labels, and if the agreement is poor, either fix the judge or stop using it. Break your dashboard into clusters and refuse to ship on any aggregate that hides severity. And sit down once a week with thirty failing traces and actually read them.
The suites that survive model upgrades, product pivots, and capability jumps are the ones whose maintainers treat them like living infrastructure. The ones that decay into vanity dashboards are the ones where nobody has opened a failing case in three months. The difference is entirely a matter of practice — which means the fix is entirely within reach.
Sources
- https://aclanthology.org/2025.acl-long.901/
- https://aclanthology.org/2025.emnlp-main.511/
- https://leak-llm.github.io/
- https://howiehwong.github.io/preference_leakage.pdf
- https://arxiv.org/abs/2410.21819
- https://llm-judge-bias.github.io/
- https://aclanthology.org/2025.ijcnlp-long.18.pdf
- https://testsmells.org/
- https://link.springer.com/article/10.1007/s11219-024-09663-7
- https://arxiv.org/html/2411.13768v3
- https://latitude.so/blog/why-ai-agents-break-in-production
- https://hamel.dev/blog/posts/evals-faq/
- https://eugeneyan.com/writing/eval-process/
- https://newsletter.ruder.io/p/the-evolving-landscape-of-llm-evaluation
- https://langfuse.com/blog/2025-08-29-error-analysis-to-evaluate-llm-applications
