
2 posts tagged with "regression-testing"


The Snapshot Trace Test: Production Traces as Your Regression Suite

· 10 min read
Tian Pan
Software Engineer

The eval set most teams run as their regression suite was hand-curated by an engineer in week three of the project, frozen by week six because nobody wanted to touch it before launch, and is now being used in month nine to gate deploys. The product has shifted twice. The user base has tripled. The cases the LLM actually sees in production overlap with that frozen suite by maybe forty percent. When the suite passes, nobody trusts it; when it fails, nobody knows whether the failure is real or whether the case is just stale. The team writes a doc proposing a "v2 eval set" and never gets around to it.

Meanwhile, every request the system has handled in production has been recorded in a tracing backend. Every prompt, every tool call, every intermediate output, every refusal, every retry — all of it sitting in object storage, time-indexed and span-tagged, ready to be replayed. The highest-fidelity test corpus the team will ever have is already on disk. They built an eval suite from scratch instead of reading from it.
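To make the idea concrete, here is a minimal sketch of replaying recorded traces as regression cases. Everything in it is an assumption made for illustration: the JSON-lines schema (`trace_id`, `prompt`, `output`), the `TraceCase` type, and the `load_cases` and `replay` helpers are hypothetical, and a real tracing backend will expose its own export format. The model call and the comparison function are passed in, so the sketch makes no claim about which client or judge a team would use.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TraceCase:
    """One recorded production request (hypothetical schema)."""
    trace_id: str
    prompt: str
    recorded_output: str


def load_cases(jsonl_lines: Iterable[str]) -> list[TraceCase]:
    """Parse one JSON object per line into replayable cases, skipping blank lines."""
    cases = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        cases.append(TraceCase(record["trace_id"], record["prompt"], record["output"]))
    return cases


def replay(
    cases: Iterable[TraceCase],
    call_model: Callable[[str], str],
    matches: Callable[[str, str], bool],
) -> list[str]:
    """Re-run each recorded prompt; return trace ids whose new output no longer matches."""
    regressions = []
    for case in cases:
        new_output = call_model(case.prompt)
        if not matches(case.recorded_output, new_output):
            regressions.append(case.trace_id)
    return regressions


# Usage with a stand-in model. Exact string match is the simplest possible
# comparison; a real suite would likely need a semantic or rubric-based check.
sample_lines = [
    '{"trace_id": "t1", "prompt": "2+2", "output": "4"}',
    '{"trace_id": "t2", "prompt": "capital of France", "output": "Paris"}',
]


def fake_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Lyon"}[prompt]


def exact_match(old: str, new: str) -> bool:
    return old.strip() == new.strip()


regressions = replay(load_cases(sample_lines), fake_model, exact_match)
print(regressions)
```

Keeping `call_model` and `matches` as parameters is a deliberate choice in this sketch: the corpus comes from production, while the model under test and the definition of "still correct" can change independently.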

Multimodal Eval Drift: Why Your Image and Audio Paths Regress While Text Stays Green

· 11 min read
Tian Pan
Software Engineer

The dashboard says quality is up two points this release. The text-eval suite ran clean. Your model provider shipped a new checkpoint that beats the prior one on every public benchmark you track. You roll forward. A week later the support team flags a quiet but persistent uptick in tickets about uploaded screenshots — users say the model is "reading the wrong numbers from the chart" or "missing a row in the table." Audio transcription complaints follow a few days later, mostly from non-American English speakers. None of it shows up in your eval pipeline. The release looks healthy. It isn't.

This is multimodal eval drift, and almost every team that bolted vision and audio onto a text-first stack is shipping it. The eval discipline that worked for text — gold sets, LLM-as-judge, drift dashboards, an aggregate score that gates the release — extends to multimodal in name only. The failure rates per modality are not commensurable, the rubrics that catch text errors don't catch image errors, and the labeling pipeline that produced your text gold set is calibrated to a workload that ships every six months, not to a multimodal regression that arrives with every checkpoint update.

The right mental model is that multimodality is not a flag on the same model; it is a different product surface with a different failure distribution, and an eval discipline that ignores that distinction ships silent regressions with every model release.
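A small sketch shows how an aggregate score can rise while one modality regresses, and what a per-modality gate might look like instead. The function names (`pass_rates`, `aggregate`, `gate`), the 2-point `max_drop` threshold, and the sample counts are all assumptions chosen for illustration, not figures from any real release.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (modality, passed) pairs -> {modality: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for modality, passed in results:
        totals[modality] += 1
        passes[modality] += int(passed)
    return {m: passes[m] / totals[m] for m in totals}


def aggregate(results):
    """Single pooled pass rate, the kind of number an aggregate gate looks at."""
    results = list(results)
    return sum(int(passed) for _, passed in results) / len(results)


def gate(baseline, candidate, max_drop=0.02):
    """Return modalities whose pass rate fell by more than max_drop (hypothetical threshold)."""
    base = pass_rates(baseline)
    cand = pass_rates(candidate)
    return sorted(m for m in base if base[m] - cand.get(m, 0.0) > max_drop)


# Illustrative data: text dominates the eval set, so it dominates the pooled score.
baseline = (
    [("text", True)] * 90 + [("text", False)] * 10
    + [("image", True)] * 8 + [("image", False)] * 2
)
candidate = (
    [("text", True)] * 94 + [("text", False)] * 6
    + [("image", True)] * 6 + [("image", False)] * 4
)

print(aggregate(baseline), aggregate(candidate))  # pooled score goes up
print(gate(baseline, candidate))                  # the image path is still flagged
```

In this made-up data the pooled pass rate improves because text has ten times as many cases as image, while the image pass rate falls from 0.8 to 0.6. Gating each modality against its own baseline surfaces the regression that the pooled number hides.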