Your Eval Harness Is a Museum: How Production Failures Should Write Tomorrow's Tests

9 min read
Tian Pan
Software Engineer

Most AI teams build their eval suite once — carefully, thoughtfully, during the sprint before launch. They write cases for the edge scenarios they can imagine, document the expected outputs, get sign-off, and ship. Six months later, the suite still passes. The model has quietly gotten worse on the actual traffic hitting production, but the eval harness was authored before any of that traffic existed. It's still grading the answers to questions the author asked, not the questions users are asking.

That's the museum problem: an eval suite curated at one point in time accumulates relics. It proves the system handles the cases someone anticipated, not the cases that actually break it.

The fix isn't to write more test cases upfront. It's to build a feedback loop that converts every production failure into a permanent regression test automatically — so the suite's coverage grows in direct proportion to the complexity of real usage.

The Frozen Harness and Why It Fails Silently

Static eval harnesses suffer from a version of survivorship bias. You write cases for failures you've already thought about, which means the harness is good at catching regressions on known failure modes. The unknown failure modes — the ones that emerge when thousands of users find the edges of your system — never make it into the suite unless someone manually adds them after the fact.

In practice, manual addition almost never keeps pace. When a bad output surfaces in production, the instinct is to fix the model or the prompt and move on. The regression test gets added sporadically, when an engineer has time, or after the third occurrence makes it feel urgent enough. By the time the case is formally documented, the root cause may already be patched — and the discipline to formalize it erodes under the next feature deadline.

The result is a suite that grows stale in proportion to how much your product grows. More users means more edge cases. More edge cases means more failure modes. If the eval harness isn't consuming those failures, its coverage is diverging from reality.

There's also a subtler problem: public benchmarks leak, and internal benchmarks accumulate benchmark-specific optimizations. When teams benchmark-tune during fine-tuning or prompt engineering, they're implicitly teaching the model to pass those specific cases rather than generalize to the underlying capability. A self-reinforcing harness built from production failures is much harder to overfit, because the cases are always arriving from novel user behavior rather than a fixed test author's imagination.

The Four-Step Feedback Loop

The mechanics of a self-improving eval suite come down to four operations that need to happen automatically, in sequence: capture, dedup, grade, and promote.

Capture means intercepting production failures with enough context to reconstruct the test case. A failure here isn't necessarily an error — it's any output that falls below a quality threshold as measured by an automated scorer (latency, format violations, user-reported thumbs-down, low LLM-judge scores). The capturing system needs to store the full input, the output, the trace of what the system did, and the failure signal.
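As a rough sketch, a captured failure could be stored as a record like the one below. The field names are hypothetical rather than drawn from any specific platform; the point is that the case can be replayed later without hunting down missing context.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CapturedFailure:
    # Illustrative schema for one flagged production case.
    request_id: str
    input_payload: dict[str, Any]      # the full user input, verbatim
    output: str                        # what the system actually returned
    trace: list[dict[str, Any]]        # tool calls, retrieval hits, intermediate steps
    failure_signal: str                # e.g. "thumbs_down", "format_violation", "low_judge_score"
    score: float                       # the automated quality score that flagged it
    metadata: dict[str, Any] = field(default_factory=dict)  # model version, latency, severity, ...
```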

Modern LLMOps platforms do this via async scoring: they log the request, evaluate it asynchronously against a quality rubric, and flag cases below threshold — with no impact on response latency.
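A minimal sketch of that flag-below-threshold pattern, reusing the CapturedFailure record above; judge_score here is a stand-in for whatever async scorer you actually run (an LLM judge, a format validator, a latency check):

```python
import asyncio

QUALITY_THRESHOLD = 0.7                          # illustrative cutoff; tune per product
flagged_queue: asyncio.Queue = asyncio.Queue()   # stand-in for a real sink (table, topic, queue)

async def score_and_flag(record: CapturedFailure, judge_score) -> None:
    # Runs off the request path, so user-facing latency is unaffected.
    record.score = await judge_score(record.input_payload, record.output)
    if record.score < QUALITY_THRESHOLD:
        record.failure_signal = "low_judge_score"
        await flagged_queue.put(record)          # hand off to the dedup step
```

The handler that serves the request just schedules this as a background task (for example with asyncio.create_task) and returns; the judge never sits between the user and the response.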

Dedup is where most teams skip a step and pay for it with eval bloat. When production is busy, the same failure pattern may surface dozens of times per day. If every flagged case gets promoted to the eval suite unfiltered, the suite fills with semantic duplicates — 30 variations of the same "user asks about refund policy in informal language" edge case. Semantic deduplication solves this by embedding flagged failures and clustering them: only cases sufficiently distant from existing cases in the embedding space get promoted. The threshold is tunable — you can run tight dedup early when the suite is small and loosen it as coverage matures and new cases become genuinely harder to find.
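A rough sketch of that novelty check, assuming a placeholder embed() function for whatever embedding model you already run:

```python
import numpy as np

DEDUP_THRESHOLD = 0.15  # minimum cosine distance from every existing case; loosen as the suite matures

def is_novel(candidate_vec: np.ndarray, existing_vecs: list[np.ndarray]) -> bool:
    """Promote a flagged case only if it is far enough from everything already in the suite."""
    for vec in existing_vecs:
        cosine_sim = float(np.dot(candidate_vec, vec)
                           / (np.linalg.norm(candidate_vec) * np.linalg.norm(vec)))
        if 1.0 - cosine_sim < DEDUP_THRESHOLD:
            return False  # too close to a known case: a semantic duplicate
    return True

# Usage (embed() and existing_case_vecs are placeholders for your own stack):
# if is_novel(embed(record.input_payload["text"]), existing_case_vecs):
#     candidates_for_grading.append(record)
```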

Grade determines not just whether a new case should be added, but how much weight it should receive in the eval run. Cases that expose novel failure modes get higher weight. Cases that represent a common but well-covered pattern get lower weight. The grading signal can come from difficulty estimation (did the model pass this case in N attempts?), from uniqueness in embedding space, or from the severity of the production impact that surfaced it. This is where the eval suite starts to behave like an active learning system: it preferentially allocates attention to cases that provide the most diagnostic signal per compute dollar.
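One way to fold those signals into a single weight; the coefficients are illustrative, not prescriptive:

```python
def grade_weight(pass_rate: float, min_distance: float, severity: float) -> float:
    """Weight a candidate eval case by how much diagnostic signal it carries.

    pass_rate    -- fraction of N replay attempts the current model passes (difficulty estimate)
    min_distance -- cosine distance to the nearest existing case (from the dedup step)
    severity     -- normalized production impact of the original failure, in [0, 1]
    """
    difficulty = 1.0 - pass_rate
    return 0.5 * difficulty + 0.3 * min_distance + 0.2 * severity
```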

Promote is the final step: the deduped, graded case gets added to the harness with a source tag (production failure, timestamp, severity), a pass/fail label, and — if a ground-truth output can be determined — the expected response. For cases where ground truth is ambiguous, an LLM judge fills the role. Research shows that strong LLM judges align with human raters at around 85% for both pairwise comparisons and single-output quality scoring, which is sufficient for most eval tasks even when the judge isn't perfect.
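Putting the pieces together, a promoted case might look like the record below. The schema is a sketch; adapt the fields to whatever format your harness consumes.

```python
from datetime import datetime, timezone

def promote(record: CapturedFailure, weight: float, expected: str | None) -> dict:
    """Turn a graded production failure into a permanent eval case."""
    return {
        "id": record.request_id,
        "input": record.input_payload,
        "expected": expected,                          # None -> fall back to an LLM judge
        "grading": "exact_match" if expected else "llm_judge",
        "weight": weight,
        "source": {
            "origin": "production_failure",
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "severity": record.metadata.get("severity", "unknown"),
            "failure_signal": record.failure_signal,
        },
    }
```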

The Paradox: Coverage Grows Fastest When Production Is Messiest

The most counterintuitive property of a production-driven eval suite is that its growth rate is highest exactly when you want it to be: during periods of high traffic, feature launches, and model upgrades — the moments when your system is under the most novel stress.
