
The Trace Replay Your New Model Cannot Trust

Tian Pan
Software Engineer

The standard playbook for an LLM upgrade has the comforting shape of a unit test. Capture last week's production traces against the incumbent model. Replay them against the candidate. Diff the outputs. If the disagreement rate is below some threshold — say 3% — ship it. The diff is small, the dashboard is green, the migration looks safe. A week later, the on-call channel fills with reports that the new model is forgetting context across turns, calling tools with arguments that no longer parse, and confidently citing documents that have been deleted from the corpus.

The replay didn't lie, exactly. It measured a real thing. It just measured behavior in a context the production system never actually presents, and the green number is a confidence claim about a distribution that doesn't exist anywhere except in the replay harness.

The replay's hidden assumption is that a trace is a sufficient record of a turn. In a stateless function call — given inputs (x, y), return f(x, y) — that assumption holds. In a production LLM system it holds for almost nothing. The trace captures the request, the response, and maybe the tool calls in between. It does not capture the KV cache that was warm for that turn because the previous fifty turns from the same session lived in the prefix. It does not capture the prompt cache that hit because three other users in the same batch sent overlapping prefixes that minute. It does not capture the agent's memory store, which an earlier session wrote at 03:00 UTC and an aggressive TTL deleted at 09:00. It does not capture the tool side effect that mutated the database between the trace's creation and your replay job's run. It does not capture the third-party API whose response yesterday was "200 OK with payload X" and whose response now is "404 — that record was deleted." The trace is a snapshot of behavior in a context, and the replay reconstructs only the behavior.

The model behavior depends on state the replay cannot reconstruct

Strip the replay down to its assumptions and you get a list of things that have to be true for the green number to mean anything. Every one of them is unstable.

The KV cache, in most modern serving stacks, is shared across requests with overlapping prefixes. Two production turns that look identical at the API level (same system prompt, same user message) can produce different latencies and, because cache hits and evictions change how the prefix is batched and computed, occasionally different outputs. The replay runs each turn cold, against a freshly initialized cache, on a candidate model whose cache layout is by construction different from the production model's. The replay's latency numbers are not the production system's latency numbers, and any metric you derive from them (including refusal rate, where timeouts surface as refusals) carries the bias of a cold-start system measured against a warm-state one.

The agent's memory store is the second invisible dependency. Trace replay assumes that the memory the agent reads on turn 7 is the memory the agent had on turn 7 in production. In practice, the production agent's turn 7 read whatever was written by turns 1 through 6 of that session, plus possibly an earlier turn of a different session under the same user, plus the output of a background summarization job that ran between turn 4 and turn 5. The replay can either (a) replay the memory writes too, in which case it is no longer a unit test but a session simulator with all the state-management bugs that implies, or (b) freeze the memory at trace-capture time, in which case it measures behavior against a memory snapshot that is now older than the production memory store's eviction policy would have allowed.

Tool side effects are the third. A trace from yesterday says the agent called create_invoice(amount=400, customer=42) and the tool returned {"invoice_id": "INV-99"}. The replay runs the same call today, against the same tool, and the tool either creates a duplicate invoice (now the test environment is polluted), or 404s because the customer ID was test data that's since been cleaned up, or returns a different invoice ID that the next turn's prompt no longer expects. To preserve the trace's downstream context, you have to either mock the tool — which means you're no longer testing tool behavior, only the model's adherence to a frozen tool stub — or run the tool for real and accept that "real" now means "in a sandbox that diverges from production starting at turn 2."
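The mocking option can be sketched as a stub that serves recorded responses and fails loudly when the candidate calls a tool with arguments the trace never saw. This is a minimal illustration rather than a real harness: `RecordedToolStub` and `ToolArgumentDrift` are hypothetical names, and the recorded call is the create_invoice example above.

```python
import json


class ToolArgumentDrift(Exception):
    """Raised when the replayed model calls a tool with unrecorded arguments."""


class RecordedToolStub:
    """Serves tool responses captured in a trace instead of running the tool."""

    def __init__(self, recorded_calls):
        # recorded_calls: iterable of (tool_name, args_dict, response) tuples.
        self._responses = {
            self._key(name, args): response
            for name, args, response in recorded_calls
        }

    @staticmethod
    def _key(name, args):
        # Sort keys so argument order does not affect the lookup.
        return name + ":" + json.dumps(args, sort_keys=True)

    def call(self, name, args):
        key = self._key(name, args)
        if key not in self._responses:
            # The stub cannot know what the real tool would have returned,
            # so it surfaces the divergence instead of inventing a response.
            raise ToolArgumentDrift(f"no recorded response for {key}")
        return self._responses[key]


# Hypothetical trace containing the single recorded call from the example.
stub = RecordedToolStub(
    [("create_invoice", {"amount": 400, "customer": 42}, {"invoice_id": "INV-99"})]
)
```

A stub like this keeps later turns consistent with the trace, but every `ToolArgumentDrift` marks a turn where the replay has nothing to say about how the real tool would have behaved.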

The upstream services that the agent called — search APIs, knowledge retrieval, third-party SaaS — are the fourth. The world has moved between the trace and the replay. The Wikipedia article changed. The Stripe API version is one minor version later. The internal search index reindexed and demoted half the documents the agent cited. The replay measures the new model against a snapshot of an upstream world that no longer exists.

The replay's confidence number measures the wrong variance

Even granting that the replay is a useful smoke test, the confidence number it produces describes a narrower distribution than the one the team reads it as describing. The replay's variance is the variance of "outputs given identical inputs across two models in identical state." The production system's variance is the variance of "outputs given the actual input distribution, in the actual state distribution, with the actual upstream-service noise and the actual prompt-cache hit-rate distribution." The second adds the variation across states to everything the first contains, so in practice it is larger, often by an order of magnitude on the failure modes the team most cares about.
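One way to make the gap precise is the law of total variance. The sketch assumes the output metric Y depends on a hidden serving state S (cache contents, memory, upstream responses) that the replay pins to a single value s_0. These symbols are introduced here for illustration only.

```latex
\operatorname{Var}(Y)
  = \underbrace{\mathbb{E}_{S}\!\left[\operatorname{Var}(Y \mid S)\right]}_{\text{within-state}}
  + \underbrace{\operatorname{Var}_{S}\!\left(\mathbb{E}[Y \mid S]\right)}_{\text{between-state}}
```

Replay estimates only Var(Y | S = s_0), a single instance of the within-state term. The between-state term, which is where state-dependent failures live, never enters the replay's number.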

This matters because the threshold most teams pick for shipping a replay-validated migration — "agreement rate above some bar" or "downstream task accuracy within some delta" — assumes the replay's variance is a usable estimate of the production variance. It isn't. The cases where the production system's behavior depends on state the replay cannot reproduce are systematically the cases where the new model's behavior will differ from the old model's in production. They are also systematically the cases the replay scores as "agreement" because both models, given the same frozen context, produce reasonable outputs. The replay tells you the models agree on the easy cases. It cannot tell you what they will do when the context is fragmented, the memory is partially evicted, the tool call returns an unexpected error, or the prompt cache misses on a prefix the model was implicitly relying on.

There is a sharper way to state the problem. Trace replay is a measurement of output reproducibility under a context that the candidate model has been handed for free. Production is a measurement of output quality under a context that the candidate model has to assemble for itself from a noisy, mutable, partially observable world. These are different tasks. A model that is good at the first can be bad at the second. The data the replay produces is silent on which one you've got.

Shadow traffic is not replay with a different label

The instinct of teams who have been bitten by replay's blind spots is to escalate to shadow traffic: send a copy of every live request to the candidate model, compare outputs to the incumbent's, ship when the disagreement rate is low enough. Shadow traffic fixes the worst class of replay bug — it runs the candidate against live state, with the actual upstream services, the actual memory store, the actual tool side effects, the actual prompt cache contention. It is genuinely a stronger signal.

It is not, however, a complete signal. Shadow traffic has its own assumptions that fail quietly. The candidate model is being fed inputs that the incumbent model's behavior shaped. The session's turn 7 input depends on turn 6's output, and in production that turn 6 was produced by the incumbent. The shadow candidate sees a turn 7 input that reflects the incumbent's reasoning, not its own. If the candidate would have produced a different turn 6, the turn 7 input it sees in shadow is not the input it would actually see if it were live. This is the off-policy evaluation problem the reinforcement-learning literature has been wrestling with for two decades, and the LLM migration story almost never names it.
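The dependence on the incumbent's earlier turns can be made concrete by marking which shadow comparisons are still on-policy for the candidate. The sketch below is illustrative: the function names are hypothetical, and exact string equality stands in for whatever semantic comparator a real harness would use.

```python
def first_divergence(incumbent_outputs, candidate_outputs, same=lambda a, b: a == b):
    """Return the index of the first turn where the candidate disagrees, or None."""
    for turn, (old, new) in enumerate(zip(incumbent_outputs, candidate_outputs)):
        if not same(old, new):
            return turn
    return None


def split_by_policy(incumbent_outputs, candidate_outputs):
    """Split turn indices into (on_policy, off_policy) for the candidate.

    Turn i's input was built from the incumbent's outputs on turns < i, so it
    matches what the candidate would have seen live only if the candidate
    agreed with the incumbent on all of those earlier turns.
    """
    n = min(len(incumbent_outputs), len(candidate_outputs))
    divergence = first_divergence(incumbent_outputs, candidate_outputs)
    # The divergent turn itself is still on-policy: its input was shaped by
    # history both models agreed on. Everything after it is not.
    cutoff = n if divergence is None else divergence + 1
    return list(range(cutoff)), list(range(cutoff, n))
```

Reporting shadow agreement separately for the two groups shows how much of the headline number comes from turns whose inputs the candidate would never have produced for itself.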

Shadow traffic is also expensive in a way that replay isn't. Replay runs offline against captured data. Shadow runs in real time against live load, which means you're paying double inference cost for the duration of the shadow window. Teams who price shadow as "we'll just leave it on for two weeks" discover after the fact that two weeks of doubled inference is a meaningful line item, and the window gets cut short, and the candidate ships before the long-tail state-dependent failure modes have had a chance to surface.

The honest framing is that replay and shadow are two different validation primitives, with two different cost profiles and two different blind spots. Replay is cheap, fast, and structurally biased toward overconfidence on state-dependent behavior. Shadow is expensive, slower to accumulate signal, and structurally biased toward overconfidence on counterfactual session dynamics. Neither one is a release gate on its own.

What a state-aware migration discipline looks like

The teams that have survived a few migrations have converged on five shifts in discipline that the replay-then-ship script does not contain.

The first is treating replay as a smoke test, not a release gate. A green replay means the candidate model has not regressed on the easy cases — the same answer to the same question in the same context. That's worth knowing. It is not the same as "safe to ship," and the team should not let leadership read it that way. The replay's role is to catch the trivial regressions early, so the more expensive validation primitives are not wasted on candidates that fail the easy cases.

The second is auditing the replay set for state dependence before drawing any conclusions from it. Every trace in the replay corpus carries implicit dependencies — on memory writes from earlier sessions, on tool side effects, on upstream service state, on cache configuration. A state-dependence audit walks the corpus and flags which traces the replay cannot faithfully reproduce. Some traces will be clean: stateless single-turn requests where the input fully determines the output. Some will be dirty: multi-turn sessions with memory dependencies and tool side effects. The replay's headline number should report agreement on the clean slice and agreement on the dirty slice separately, with the understanding that the dirty slice's number is closer to noise than signal.
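A state-dependence audit of this kind can be sketched in a few lines. The `Trace` fields and the rule in `is_state_dependent` are assumptions made for illustration. A real corpus would need its own signals for memory reads, side-effecting tools, and upstream calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    turns: int                  # number of turns in the session
    read_memory: bool           # did the agent read from its memory store?
    side_effect_tool_calls: int # calls to tools that mutate external state
    upstream_calls: int         # calls to search, retrieval, third-party APIs
    agreed: bool                # did the candidate match the incumbent on replay?


def is_state_dependent(trace: Trace) -> bool:
    """Flag traces whose context the replay cannot faithfully rebuild."""
    return (
        trace.turns > 1
        or trace.read_memory
        or trace.side_effect_tool_calls > 0
        or trace.upstream_calls > 0
    )


def sliced_agreement(traces):
    """Report agreement separately for the clean and dirty slices."""
    slices = {"clean": [], "dirty": []}
    for trace in traces:
        name = "dirty" if is_state_dependent(trace) else "clean"
        slices[name].append(trace.agreed)
    return {
        name: {
            "n": len(results),
            # None signals an empty slice rather than a misleading 0% or 100%.
            "agreement": sum(results) / len(results) if results else None,
        }
        for name, results in slices.items()
    }
```

The rule here is deliberately conservative: anything multi-turn or touching external state lands in the dirty slice, which tends to shrink the clean slice and makes the headline number harder to over-read.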

The third is shadow traffic sized for signal, not for ritual. A shadow window of 1% of traffic for two weeks produces less actionable signal than a shadow window of 10% of traffic for three days, because the second collects roughly twice as many requests (30 percent-days of traffic against 14) and the failure modes you most need to catch are tail behaviors that need a large absolute sample to surface. The decision about shadow sizing should be driven by what failure rate you need to detect with what statistical confidence, not by what fraction "feels safe." When the math says the window has to be three months long to detect the failure rate you care about at the budget you can afford, that is a signal that shadow traffic is not the right primitive for this migration.
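The sizing arithmetic can be sketched as follows, assuming requests fail independently at a fixed per-request rate. Under that assumption, the number of requests needed to observe at least one failure with a given confidence is ln(1 - confidence) / ln(1 - rate). This is a lower bound: telling the candidate's rate apart from the incumbent's takes more samples. The function names and the traffic figures in the usage note are hypothetical.

```python
import math


def requests_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n with P(at least one failure in n requests) >= confidence."""
    if not (0 < failure_rate < 1 and 0 < confidence < 1):
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


def shadow_requests(daily_requests: int, shadow_fraction: float, days: float) -> float:
    """Total requests a shadow window collects."""
    return daily_requests * shadow_fraction * days


def days_needed(failure_rate, confidence, daily_requests, shadow_fraction) -> float:
    """Days of shadowing required to reach the sample size above."""
    per_day = daily_requests * shadow_fraction
    return requests_needed(failure_rate, confidence) / per_day
```

For a hypothetical service handling 1,000,000 requests a day, `shadow_requests(1_000_000, 0.10, 3)` is larger than `shadow_requests(1_000_000, 0.01, 14)`, which is the comparison made in the paragraph above. `days_needed` turns a target failure rate into a window length, so the "three months" case shows up before the shadow is switched on.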

The fourth is a canary discipline that takes the replay and shadow numbers as priors, not as posteriors. The canary is the first time the candidate model is producing outputs the production system acts on, and it is the first time the candidate's behavior shapes the input distribution the candidate will see on the next turn. The canary's first 1% deserves more rollback sensitivity than its later phases, because the state-dependent failure modes the prior validation could not see are most likely to surface there. A canary that ramps from 1% to 10% in twenty-four hours has not learned anything the prior validation didn't already say. A canary that holds at 1% for a week and instruments the long-tail outcome metrics — refusal rate by task class, downstream user-action rate, escalation rate — is doing the work the replay couldn't.
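A per-class rollback check for the held canary might look like the sketch below. It assumes a pooled two-proportion z-test with a normal approximation, which is rough at small canary volumes, and a Bonferroni split of the significance level across task classes. Both are illustrative choices, and the function names are hypothetical.

```python
import math


def one_sided_p_value(base_bad, base_n, canary_bad, canary_n):
    """P-value for 'the canary's bad-outcome rate is higher than baseline'."""
    pooled = (base_bad + canary_bad) / (base_n + canary_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / canary_n))
    if se == 0:
        return 1.0  # both arms are all-good or all-bad: no evidence either way
    z = (canary_bad / canary_n - base_bad / base_n) / se
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


def classes_to_roll_back(counts_by_class, alpha=0.05):
    """Return task classes whose canary rate is significantly worse.

    counts_by_class maps a class name to
    (base_bad, base_n, canary_bad, canary_n).
    """
    if not counts_by_class:
        return []
    # Split alpha across classes so that checking many classes does not
    # trigger rollbacks by chance alone.
    per_class_alpha = alpha / len(counts_by_class)
    return sorted(
        name
        for name, counts in counts_by_class.items()
        if one_sided_p_value(*counts) < per_class_alpha
    )
```

Checking by task class matters because a regression confined to one class can vanish in the aggregate rate. The same function works for refusal rate, escalation rate, or any other count-based outcome metric.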

The fifth is an honest confidence interval. The team that says "the candidate's agreement rate was 97.2% on the replay set" is reporting a number the harness produced. The team that says "the candidate's agreement rate was 97.2% on the replay set, the replay set has 34% state-dependent traces whose results we are deweighting, the shadow window has a 4-day signal accumulation gap on tail failure modes, and we estimate the probability of an undetected regression at first canary phase as roughly one in fifteen" is reporting a confidence interval that includes the variance the validation could not observe. The second team will be unpopular in pre-launch reviews and well-respected after the first migration that doesn't surprise them.
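Even the sampling-error part of that interval is often left off the headline number. A minimal sketch of a Wilson score interval for an agreement rate follows. The counts in the usage note are hypothetical, chosen only to match the 97.2% figure above.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

With a hypothetical replay set of 1,000 traces and 972 agreements, `wilson_interval(972, 1000)` gives the range the 97.2% point estimate sits in. That range covers sampling error only. It says nothing about the state-dependent variance the replay could not observe, which is why the deweighted trace share and the estimated probability of an undetected regression need to be reported alongside it.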

What replay can and cannot tell you

The architectural realization underneath all of this is unromantic. A trace is a record of behavior in a context. Replay reconstructs the behavior. It does not reconstruct the context. The team that migrates on the strength of a replay agreement number has shipped a confidence claim about a distribution that does not exist outside the replay harness, and the production system will discover the distribution it actually lives in by paying for the discovery with user-visible incidents.

There is no validation primitive that gives you a state-faithful preview of a new model's production behavior at zero risk. Replay buys you cheap and biased. Shadow buys you expensive and counterfactually skewed. Canary buys you real and small-scale. The migration that treats all three as different points on a cost-versus-signal curve, and budgets honest confidence intervals around each, will not avoid all migration incidents. It will at least know which class of failure each new model is most likely to surprise it with — and that is the actual job the team thought it was buying when it stood up the replay harness in the first place.
