
The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

12 min read
Tian Pan
Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.

What follows is the catalog I've been accumulating — five smells that I've seen turn eval suites into liabilities, with the refactoring pattern for each. The goal isn't to throw out your evals. It's to restore their signal, the way a test refactor restores signal in production code.

Smell 1: Training-Eval Data Leakage

The smell: your eval scores are suspiciously high, and they don't correlate with how users actually experience the feature. Specifically, scores are high on tasks your model has likely seen, and low on tasks it probably hasn't.

This is the most widely documented LLM eval failure, and it has at least three mechanisms that teams conflate. First, classic benchmark contamination: the model was trained on a corpus that included the test set, so evaluation measures memorization instead of capability. Second, subtle operational leakage: your eval set was generated using the same model you're evaluating (a "golden dataset" produced by GPT-4 that you then use to test GPT-4 variants), and the distribution quietly encodes the evaluator's preferences. Third, preference leakage in LLM-as-judge setups, where the judge model has been fine-tuned on data produced by — or closely related to — the model being judged, so the judge systematically over-rates certain styles.

Recent work has shown that even the standard mitigation strategies can undercut their own purpose. Code refactoring to disguise benchmarks, synthetic perturbations, paraphrased rewrites: none is both effective and faithful to the original evaluation intent across all settings. The practical lesson isn't that contamination is unfixable. It's that any eval set over six months old, especially one curated from web-like sources or generated with a frontier model, should be treated as contaminated until proven otherwise.

The refactor: split your eval suite by provenance. Tag each case with its origin — synthetic-from-production, hand-written, public benchmark, model-generated. Track scores per bucket rather than in aggregate. When the gap between buckets is large (public benchmarks scoring 15+ points above fresh production-derived cases), you have a contamination signal you can act on, not a mystery. Add a "fresh cases" cohort sourced from the last 30 days of production and weight regressions there more heavily than on the legacy public bucket. Over time, retire public benchmark cases from the gating suite entirely and keep them only as historical reference.
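Here's a minimal sketch of what provenance bucketing can look like in practice. The `Provenance` labels and the 0.15 gap threshold mirror the numbers above but are illustrative choices, not a prescription; adapt them to your own grading scale.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean

class Provenance(Enum):
    PRODUCTION = "synthetic-from-production"
    HAND_WRITTEN = "hand-written"
    PUBLIC_BENCHMARK = "public-benchmark"
    MODEL_GENERATED = "model-generated"

@dataclass
class EvalCase:
    case_id: str
    provenance: Provenance
    score: float  # 0.0-1.0 from whatever grader you use

def scores_by_bucket(cases: list[EvalCase]) -> dict[Provenance, float]:
    # Track scores per provenance bucket instead of one aggregate.
    buckets: dict[Provenance, list[float]] = {}
    for case in cases:
        buckets.setdefault(case.provenance, []).append(case.score)
    return {p: mean(s) for p, s in buckets.items()}

def contamination_signal(by_bucket: dict[Provenance, float],
                         gap_threshold: float = 0.15) -> bool:
    # A public-benchmark bucket scoring well above fresh
    # production-derived cases is the contamination signal
    # described above. Flag it rather than averaging it away.
    public = by_bucket.get(Provenance.PUBLIC_BENCHMARK)
    fresh = by_bucket.get(Provenance.PRODUCTION)
    if public is None or fresh is None:
        return False
    return (public - fresh) > gap_threshold
```

Once the buckets exist, the weighting and retirement policies fall out naturally: gate releases on the PRODUCTION bucket, and let PUBLIC_BENCHMARK decay into a non-blocking historical dashboard.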

Smell 2: Brittle Exact-Match Assertions

The smell: your test is red, but the output is correct. "The answer is 42" fails against the expected "42". A JSON key order swap breaks the assertion. A model that starts wrapping numbers in units ("42 ms" instead of "42") torches your pass rate, and nobody can tell whether the regression is real.

This one ports directly from the traditional test smell literature, where "Assertion Roulette" and "Fragile Test" have been catalogued for years. In LLM evals it takes an especially toxic form because natural language has no canonical surface form. The model produces one of thousands of valid paraphrases; your assertion accepts one. Every model upgrade shuffles the distribution of phrasings, and your eval suite punishes the model for drifting toward a different valid form.

Exact match still has a place. Structured outputs where format matters (a SQL query, a function call, a machine-readable enum) should be checked with exact or near-exact equality — that's not a smell, that's the correct tool. The smell is applying exact match to free-form outputs where semantic equivalence is what you actually care about.
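For the structured case, the fix for the key-order failure above is canonicalization before comparison, not a looser assertion. A minimal sketch, assuming JSON outputs:

```python
import json

def assert_json_equal(model_output: str, expected: str) -> bool:
    # Parse both sides and compare the parsed structures, so key
    # order and whitespace differences can't fail the check while
    # any real structural difference still does.
    try:
        got = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # malformed output is a genuine failure
    return got == json.loads(expected)
```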

The refactor: move free-form assertions up the semantic stack. For short factual answers, use extract-and-normalize assertions (parse the number, strip units, compare). For longer outputs, prefer property-based checks ("answer contains the correct entity", "answer does not contradict source", "answer is no more than two sentences") over full-string equality. For truly open-ended text, use rubric-graded judge evaluation with explicit criteria, not BLEU-style lexical overlap. When you do need to assert against reference text, use semantic similarity with a quality-controlled threshold rather than exact match — and pin the embedding model used for the comparison so the threshold stays meaningful across runs.
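A sketch of the first two rungs of that stack follows. The regex, the tolerance, and the crude period-based sentence heuristic are all illustrative choices, not a library API:

```python
import re

def extract_number(text: str) -> float | None:
    # Pull the first numeric token out of a free-form answer:
    # "The answer is 42", "42 ms", and "42" all normalize to 42.0.
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None

def numeric_match(model_output: str, expected: str,
                  rel_tol: float = 1e-6) -> bool:
    # Extract-and-normalize assertion for short factual answers.
    got, want = extract_number(model_output), extract_number(expected)
    if got is None or want is None:
        return False
    return abs(got - want) <= rel_tol * max(abs(want), 1.0)

def entity_and_length_check(answer: str, required_entity: str,
                            max_sentences: int = 2) -> bool:
    # Property-based checks assert what actually matters (the right
    # entity appears, the answer stays short) instead of demanding
    # one canonical phrasing. Sentence counting here is a crude
    # proxy; swap in a real splitter if you need one.
    has_entity = required_entity.lower() in answer.lower()
    short_enough = answer.count(".") <= max_sentences
    return has_entity and short_enough
```

With this in place, "The answer is 42", "42 ms", and "42" all pass the same assertion, and a real regression (a wrong number, a missing entity) still fails.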

Smell 3: Eval Rot

The smell: your eval suite passes, your users complain, and the two sets of signals no longer overlap. Production is failing in ways your suite was never designed to see, because your suite was designed around failure modes from two model generations ago.

In traditional tests this is sometimes called "test rot" or "test relevance decay," but LLM eval rot is faster and meaner. Model capabilities shift every few months. Product surfaces expand. User behavior adapts to the model's quirks, which changes the input distribution. An eval suite written in early 2024 around GPT-4's refusal patterns is mostly measuring an artifact that no longer exists — and worse, it's punishing newer models for not reproducing the old behavior. The suite isn't measuring quality; it's measuring backwards-compatibility with a prior model's idiosyncrasies.

A specific sub-smell here is what I'd call "capability pinning": assertions written to demand a specific reasoning path ("response must mention intermediate step X") rather than a correct outcome. When a more capable model reaches the right answer via a different route, the test fails, and an eager engineer "fixes" the model by prompting it back into the old path. Capability pinning turns your eval suite into a regression-enforcement mechanism, actively resisting model improvements.
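The contrast is easy to see in code. A deliberately toy illustration, where both assertions are hypothetical and `test_outcome` reuses the `extract_number` helper from the Smell 2 sketch:

```python
# Capability pinning: passes only if the model takes one specific
# reasoning route, so a more capable model that skips
# "intermediate step X" fails despite getting the right answer.
def test_pinned_path(response: str) -> bool:
    return ("intermediate step X" in response
            and "42" in response)

# Outcome-based: assert the result, not the route.
def test_outcome(response: str) -> bool:
    return extract_number(response) == 42.0
```

If you genuinely need to verify a reasoning path (say, for a compliance workflow), keep that check in a separate, clearly labeled suite so it can't masquerade as a correctness gate.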
