
Deleting an Eval Case Is a Decision, Not Cleanup

Tian Pan
Software Engineer

Every eval suite eventually gets pruned. Someone notices the suite takes nine minutes to run, costs $40 a pass, and is full of cases nobody remembers writing. They open a PR titled "clean up stale eval cases," delete forty entries that "don't seem relevant anymore," and the CI run drops to four minutes. The PR gets a thumbs-up. Nobody objects, because deleting tests looks like maintenance.

It is not maintenance. Every eval case is a guarantee the team made to itself: this failure mode will not recur silently. Deleting the case retires the guarantee. The pass rate does not change, the dashboard stays green, and the only thing that disappears is the team's memory that the guarantee ever existed. Six months later a model migration reintroduces exactly the regression a deleted case was guarding, the postmortem rediscovers a lesson the team already paid for once, and someone writes "we should add a test for this" — the test that was deleted in the cleanup PR.

The asymmetry is the whole problem. Adding an eval case is visible work — it shows up in review, it has an author, it usually cites an incident. Deleting one is invisible work, because a smaller suite that still passes looks strictly better than a larger one. So the suite grows under scrutiny and shrinks without it. That is exactly backwards. The delete is the higher-stakes operation, and it is the one nobody reviews.

An eval suite is a ledger of promises

A traditional unit test suite has a forgiving property: most of its cases are derivable. If you delete the test for a pure function, a competent engineer can look at the function and write an equivalent test from scratch. The test encodes logic that is still visible in the code.

An eval suite does not work that way. A large fraction of its cases are not derivable from anything — they are harvested from reality. The case exists because a specific user typed a specific input, the model produced a specific wrong answer, and someone decided that class of failure was worth guarding. That provenance is the case's entire value. The input is an arbitrary string. The expected behavior is a judgment call someone made under the pressure of a real incident. Delete the case and there is nothing in the codebase from which to reconstruct it. You have not removed a test; you have removed a fact about the world that you can no longer observe.

This is why the standard advice to "treat eval datasets like code, version them, prune obsolete cases" is half a recommendation. Versioning the dataset preserves the content of deleted cases — you can git log your way back to the JSON. What versioning does not preserve is the signal. A future engineer does not grep ten thousand lines of dataset history wondering whether a failure mode was once tested. They look at the current suite, see no case for the bug they just shipped, and conclude the team never thought about it. The deleted guarantee is functionally gone even though the bytes are technically recoverable.

So the suite is better understood not as a dataset but as a ledger. Each row is a promise: we have decided this behavior matters and we are committed to detecting its regression. A cleanup PR that deletes forty rows breaks forty promises and leaves no record that any of them was ever made.
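One way to make the ledger framing concrete is to store the promise next to the test. The sketch below is an assumption about how such a row might look, not a description of any particular eval framework: the `EvalCase` fields, the `missing_provenance` helper, and the sample values (including the `INC-EXAMPLE-1` reference) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalCase:
    """One ledger row: the test plus the promise behind it (hypothetical schema)."""
    case_id: str
    input_text: str    # the harvested input, often an arbitrary string
    expected: str      # the judgment call made when the incident happened
    guarantee: str     # the failure mode this case promises to catch
    incident_ref: str  # pointer to the originating incident (illustrative)
    added_by: str
    added_on: date


def missing_provenance(cases: list[EvalCase]) -> list[str]:
    """Return the ids of cases whose promise was never written down."""
    return [
        c.case_id
        for c in cases
        if not c.guarantee.strip() or not c.incident_ref.strip()
    ]


# Illustrative data only; none of these cases come from a real suite.
ledger = [
    EvalCase(
        "refund-001",
        "cancel my order and refund to a different card",
        "declines and routes to support",
        "never promises a refund to an unverified card",
        "INC-EXAMPLE-1",
        "alice",
        date(2024, 1, 15),
    ),
    EvalCase(
        "misc-017",
        "what is your return policy??",
        "cites the policy page",
        "",
        "",
        "unknown",
        date(2023, 6, 2),
    ),
]

print(missing_provenance(ledger))
```

A case that shows up in `missing_provenance` is the kind most likely to be deleted as "not relevant anymore," because nothing on the row says what it was guarding.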

Why suites get pruned for the wrong reasons

The pressure to prune is real and the reasons are usually legitimate on their face:

  • Cost. Every case is an LLM call, often several if you use an LLM judge. A suite of a few thousand cases run on every PR is a genuine line item.
  • Latency. A nine-minute eval gate trains engineers to merge and look away, which defeats the gate.
  • Noise. Cases drift. The product changed, the case's expected output is now wrong, and it fails for reasons unrelated to model quality. A flaky case erodes trust in the whole suite.

Notice what these three have in common: they are all arguments about the cost of a case. None of them is an argument about the value of the guarantee the case encodes. And the engineer doing the pruning is almost always optimizing the thing they can see — the CI time, the API bill — against a thing they cannot see, which is the probability that this exact failure mode comes back.

That information gap falls along an organizational seam. The person pruning the suite for speed is, structurally, not the person who will be paged when the guarantee turns out to have mattered. The pruner is doing infra hygiene this sprint. The cost of a bad delete lands on a future on-call engineer, in a different quarter, possibly on a different team, who has no way to know the failure was ever anticipated. Cleanup feels free precisely because its bill is sent to someone else.

A genuinely stale case — one whose expected output contradicts current product behavior — should be removed. The point is not that suites must only grow. The point is that "this case is noisy" and "this guarantee no longer matters" are different claims, and the cleanup PR collapses them into one.
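Keeping those two claims apart can be as simple as asking them as two separate questions. The following is a minimal sketch under that assumption; the `Disposition` names and the `triage` function are hypothetical, and a real review would answer each question with evidence rather than a boolean.

```python
from enum import Enum


class Disposition(Enum):
    KEEP = "keep"
    REPAIR = "repair"  # the case is noisy, but the guarantee still matters
    RETIRE = "retire"  # the guarantee itself is being withdrawn


def triage(case_is_noisy: bool, guarantee_still_matters: bool) -> Disposition:
    """Answer the two questions separately instead of collapsing them."""
    if not guarantee_still_matters:
        return Disposition.RETIRE
    if case_is_noisy:
        # Fix the expected output or the grader; do not drop the promise.
        return Disposition.REPAIR
    return Disposition.KEEP


print(triage(case_is_noisy=True, guarantee_still_matters=True).value)
```

Under this split, noise alone never produces `RETIRE`. A noisy case whose guarantee still matters gets repaired, and only an explicit decision about the guarantee retires it.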

Borrow the deprecation lifecycle from APIs

Software already has a mature pattern for retiring something that other parties depend on, and nobody treats it as cleanup: API deprecation. You do not delete an endpoint because traffic looks low. You mark it deprecated with a reason and a date, you keep it serving while consumers migrate, and when you finally remove it you return 410 Gone — a response that explicitly says this used to exist and was intentionally retired — rather than 404 Not Found, which says this never existed. The whole lifecycle is designed so that removal is a deliberate, auditable, multi-step act rather than a delete key.
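Applied to an eval suite, that lifecycle might look like the sketch below. It is an illustration under assumptions, not an existing tool: the `Ledger` class, its method names, and the `Status` values are hypothetical, with `GONE` and `NOT_FOUND` chosen to mirror the 410 and 404 distinction described above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Status(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"  # still runs; removal is scheduled
    GONE = "gone"              # like 410: existed, intentionally retired
    NOT_FOUND = "not_found"    # like 404: never existed


@dataclass
class LedgerEntry:
    case_id: str
    guarantee: str
    status: Status = Status.ACTIVE
    reason: Optional[str] = None
    remove_after: Optional[date] = None
    retired_by: Optional[str] = None


class Ledger:
    def __init__(self) -> None:
        self._entries: dict[str, LedgerEntry] = {}

    def add(self, case_id: str, guarantee: str) -> None:
        self._entries[case_id] = LedgerEntry(case_id, guarantee)

    def deprecate(self, case_id: str, reason: str, remove_after: date) -> None:
        """Mark a case for removal; it keeps running until retired."""
        if not reason.strip():
            raise ValueError("deprecation requires a reason")
        entry = self._entries[case_id]
        entry.status = Status.DEPRECATED
        entry.reason = reason
        entry.remove_after = remove_after

    def retire(self, case_id: str, retired_by: str, today: date) -> None:
        """Stop running the case but keep a tombstone instead of deleting it."""
        entry = self._entries[case_id]
        if entry.status is not Status.DEPRECATED or entry.remove_after is None:
            raise ValueError("deprecate a case before retiring it")
        if today < entry.remove_after:
            raise ValueError("removal date has not been reached")
        entry.status = Status.GONE
        entry.retired_by = retired_by

    def lookup(self, case_id: str) -> Status:
        entry = self._entries.get(case_id)
        return entry.status if entry else Status.NOT_FOUND

    def runnable(self) -> list[str]:
        """Ids that should still execute in CI."""
        live = (Status.ACTIVE, Status.DEPRECATED)
        return [e.case_id for e in self._entries.values() if e.status in live]


ledger = Ledger()
ledger.add("refund-001", "never promises a refund to an unverified card")
ledger.deprecate("refund-001", "refund flow removed from product", date(2025, 3, 1))
ledger.retire("refund-001", retired_by="bob", today=date(2025, 3, 2))

print(ledger.lookup("refund-001").value)
print(ledger.lookup("never-written").value)
```

The design choice that matters is that `retire` never removes the row. A future engineer who looks up the case gets `GONE` along with a reason and a name, rather than the `NOT_FOUND` that suggests nobody ever thought about the failure.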
