The Trace Replay Your New Model Cannot Trust
The standard playbook for an LLM upgrade has the comforting shape of a unit test. Capture last week's production traces against the incumbent model. Replay them against the candidate. Diff the outputs. If the disagreement rate is below some threshold — say 3% — ship it. The diff is small, the dashboard is green, the migration looks safe. A week later, the on-call channel fills with reports that the new model is forgetting context across turns, calling tools with arguments that no longer parse, and confidently citing documents that have been deleted from the corpus.
The replay didn't lie, exactly. It measured a real thing. It just measured the candidate's behavior in contexts the new model will never actually see in production, and the green number is an estimate over a distribution that doesn't exist anywhere except in the replay harness.
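
To make the playbook concrete, here is a minimal sketch of the replay-and-diff gate described above. The `Trace` shape, the function names, and the exact-string comparison are all assumptions made for illustration, not a description of any particular harness; a real diff would likely use a softer notion of agreement. The candidate is fed each recorded context verbatim, which is exactly why the resulting number only describes behavior inside the harness.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Trace:
    """One recorded production request (hypothetical shape)."""
    context: str           # prompt plus history, exactly as the incumbent saw it
    incumbent_output: str  # what the incumbent model returned at the time


def disagreement_rate(
    traces: Sequence[Trace],
    candidate: Callable[[str], str],
) -> float:
    """Fraction of traces where the candidate's output differs from the recording.

    Assumption: "differs" means exact string inequality after stripping
    surrounding whitespace.
    """
    if not traces:
        raise ValueError("no traces to replay")
    mismatches = sum(
        1
        for t in traces
        if candidate(t.context).strip() != t.incumbent_output.strip()
    )
    return mismatches / len(traces)


def replay_gate(
    traces: Sequence[Trace],
    candidate: Callable[[str], str],
    threshold: float = 0.03,  # the "say 3%" threshold from the text
) -> bool:
    """Return True ("ship it") when disagreement is below the threshold."""
    return disagreement_rate(traces, candidate) < threshold


if __name__ == "__main__":
    # Toy data: 100 recorded traces, and a stand-in candidate that
    # disagrees with the recording on two of them.
    recorded = [Trace(context=str(i), incumbent_output="ok") for i in range(100)]

    def toy_candidate(context: str) -> str:
        return "changed" if int(context) < 2 else "ok"

    print(disagreement_rate(recorded, toy_candidate))
    print(replay_gate(recorded, toy_candidate))
```

Nothing in this gate checks whether the recorded `context` is one the candidate would have reached on its own, so a passing result says only that the candidate agrees with the incumbent when handed the incumbent's history.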
