3 posts tagged with "regression-testing"

Deleting an Eval Case Is a Decision, Not Cleanup

May 16, 2026 · 10 min read

Software Engineer

Every eval suite eventually gets pruned. Someone notices the suite takes nine minutes to run, costs $40 a pass, and is full of cases nobody remembers writing. They open a PR titled "clean up stale eval cases," delete forty entries that "don't seem relevant anymore," and the CI run drops to four minutes. The PR gets a thumbs-up. Nobody objects, because deleting tests looks like maintenance.

It is not maintenance. Every eval case is a guarantee the team made to itself: this failure mode will not recur silently. Deleting the case retires the guarantee. The pass rate does not change, the dashboard stays green, and the only thing that disappears is the team's memory that the guarantee ever existed. Six months later a model migration reintroduces exactly the regression a deleted case was guarding, the postmortem rediscovers a lesson the team already paid for once, and someone writes "we should add a test for this" — the test that was deleted in the cleanup PR.

The Snapshot Trace Test: Production Traces as Your Regression Suite

May 9, 2026 · 10 min read

Tian Pan

Software Engineer

The eval set most teams run as their regression suite was hand-curated by an engineer in week three of the project, frozen by week six because nobody wanted to touch it before launch, and is now being used in month nine to gate deploys. The product has shifted twice. The user base has tripled. The cases the LLM actually sees in production overlap with that frozen suite by maybe forty percent. When the suite passes, nobody trusts it; when it fails, nobody knows whether the failure is real or whether the case is just stale. The team writes a doc proposing a "v2 eval set" and never gets around to it.

Meanwhile, every request the system has handled in production has been recorded in a tracing backend. Every prompt, every tool call, every intermediate output, every refusal, every retry — all of it sitting in object storage, time-indexed and span-tagged, ready to be replayed. The highest-fidelity test corpus the team will ever have is already on disk. They built an eval suite from scratch instead of reading from it.
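As a rough illustration of reading the regression suite from recorded traffic, the sketch below replays stored prompts through a candidate system and reports which traces diverge from what was recorded. Everything in it is an assumption made for illustration: the `Trace` schema, the `replay` helper, and the exact-match comparison are hypothetical, and a real tracing backend would supply its own span format and would likely need a fuzzier comparison than string equality.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Trace:
    """One recorded production request (hypothetical schema, not a real backend's)."""
    trace_id: str
    prompt: str
    recorded_output: str


def replay(
    traces: Iterable[Trace],
    candidate: Callable[[str], str],
    same: Callable[[str, str], bool] = lambda a, b: a.strip() == b.strip(),
) -> list[str]:
    """Return ids of traces whose replayed output differs from the recording."""
    return [
        t.trace_id
        for t in traces
        if not same(t.recorded_output, candidate(t.prompt))
    ]


# Toy data standing in for traces read from the tracing backend.
traces = [
    Trace("t1", "2+2", "4"),
    Trace("t2", "capital of France", "Paris"),
]


def candidate(prompt: str) -> str:
    """Stand-in for the system under test; a real one would call the model."""
    return {"2+2": "4", "capital of France": "Lyon"}[prompt]


diverged = replay(traces, candidate)
print(diverged)  # ['t2']
```

The `same` parameter is left pluggable on purpose: for free-form model output, exact match is usually too strict, and a rubric or judge function could be passed in its place.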

Multimodal Eval Drift: Why Your Image and Audio Paths Regress While Text Stays Green

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The dashboard says quality is up two points this release. The text-eval suite ran clean. Your model provider shipped a new checkpoint that beats the prior one on every public benchmark you track. You roll forward. A week later the support team flags a quiet but persistent uptick in tickets about uploaded screenshots — users say the model is "reading the wrong numbers from the chart" or "missing a row in the table." Audio transcription complaints follow a few days later, mostly from non-American English speakers. None of it shows up in your eval pipeline. The release looks healthy. It isn't.

This is multimodal eval drift, and almost every team that bolted vision and audio onto a text-first stack is shipping it. The eval discipline that worked for text — gold sets, LLM-as-judge, drift dashboards, an aggregate score that gates the release — extends to multimodal in name only. The failure rates per modality are not commensurable, the rubrics that catch text errors don't catch image errors, and the labeling pipeline that produced your text gold set is calibrated to a workload that ships every six months, not to a multimodal regression that arrives with every checkpoint update.

The right mental model is that multimodality is not a flag on the same model; it is a different product surface with a different failure distribution, and an eval discipline that ignores that distinction ships silent regressions with every model release.
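A small worked example may make the aggregate-score problem concrete. The numbers below are invented for illustration, as are the traffic weights, the tolerance, and the helper names `aggregate` and `regressed_modalities`. Under those assumptions, a text-heavy weighting lets the aggregate rise by two points while image and audio each drop by ten.

```python
def aggregate(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-modality scores (weights are assumed traffic shares)."""
    total = sum(weights.values())
    return sum(scores[m] * weights[m] for m in weights) / total


def regressed_modalities(
    prev: dict[str, float], curr: dict[str, float], tolerance: float = 0.01
) -> list[str]:
    """Modalities whose score dropped by more than `tolerance` between releases."""
    return sorted(m for m in prev if prev[m] - curr[m] > tolerance)


# Hypothetical numbers: text dominates traffic, so it dominates the aggregate.
weights = {"text": 0.8, "image": 0.1, "audio": 0.1}
prev = {"text": 0.80, "image": 0.80, "audio": 0.80}
curr = {"text": 0.85, "image": 0.70, "audio": 0.70}

print(round(aggregate(prev, weights), 2))  # 0.8
print(round(aggregate(curr, weights), 2))  # 0.82
print(regressed_modalities(prev, curr))    # ['audio', 'image']
```

Gating the release on `regressed_modalities` being empty, rather than on the aggregate alone, is one way to keep a text improvement from masking an image or audio regression.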
