Shadow Replay Punishes the Model That Would Have Changed the Conversation
A team I worked with last quarter shipped a new model into shadow replay and watched the win rate sit at 47 percent against the incumbent. Same prompts, same retrieval, a model the vendor's own evals had ranked clearly higher. The shadow harness took last week's production traffic, pumped it through the candidate, fed both responses to an LLM judge, and declared the upgrade roughly a coin flip. The team almost reverted on the spot.
The problem was not the model. The problem was that every user message in the replay had already been conditioned on the old model's previous turn. The candidate wrote a better answer at turn one, the user in the log replied to a different answer that no longer existed, and from turn two onward the judge was scoring a conversation that was not happening. A genuinely better model that changes what the user does next has no ground truth to be scored against. The replay quietly rewards staying on the old rails.
This is not a new problem. The recommender systems community has been beating its head against it for fifteen years under the name off-policy evaluation. LLM teams are now rediscovering it without the literature, and they are drawing the wrong conclusions from their shadow dashboards because of it.
The Silent Assumption in Every Replay Harness
The pitch for shadow replay is seductive. You already have the traffic. You already have the latencies, the token costs, the user reactions. Why run a live experiment, pay the cost of routing real users to a new model, and wait two weeks for significance when you can just rerun history at 3am for the price of inference?
The unstated assumption is that the logged user behavior is a property of the request, not of the conversation. That assumption holds in exactly one case: when each request is independent and the model's output does not influence what arrives next. A spam classifier, a single-shot translation, a content moderation call on a fixed corpus — these can be replayed honestly because the next input is unaffected by the current output. The label was true regardless of what the model said.
Anything that talks back breaks this. A coding assistant whose first suggestion shapes which file the user opens next. A support agent whose clarifying question determines whether turn two is a follow-up or a frustrated repeat. A search refinement loop where the user's next query reflects the previous result set. In every one of these, the logged turn two is a function of the old model's turn one. Replace turn one and turn two becomes a relic. You are scoring the candidate against a counterfactual that the data cannot contain.
This is what makes the LLM judge result misleading. The judge is comparing the candidate's response to a logged user reply that no longer makes sense. If the candidate asked a sharper clarifying question, the logged "yes, do option B" is just noise. The judge sees a mismatch and scores against the candidate. Score one for the incumbent on a turn the incumbent never actually got right.
Multi-Turn Is Where the Math Stops Working
Single-turn replay leaks. Multi-turn replay falls apart.
A reasonable single-turn case looks like this: a user types a question, the model answers, the user gives a thumbs-up or files a follow-up ticket. If the candidate produces a different answer, you do not know whether the user would have given the same thumbs-up, but at least the next request is from a different conversation entirely. The bias is bounded by how often the response shapes the next question, and for transactional traffic that bound is small.
Multi-turn destroys that bound. Once the candidate's turn one differs from the log, every subsequent turn in the trace is sampled from a distribution that does not exist. The user in the log was responding to a different model. Their turn two cannot be replayed against the candidate without one of two unattractive choices: either feed the candidate the old assistant turn (which never happened in this run, so you are measuring something incoherent), or let the candidate produce its own turn two and re-prompt the logged user message (which now no longer fits, because turn one diverged).
The covariate shift problem in imitation learning describes exactly this dynamic. Small errors at the start of a trajectory compound, the state distribution drifts away from the training distribution, and behavior in the new region is unmeasured. Replay evaluation is the offline-eval mirror image: even with a perfect candidate, the moment you leave the logged trajectory you have no labels to score against. Researchers working on agent rollouts now spend serious effort caching execution prefixes and branching at "behaviorally salient" decision points just to make multi-turn evaluation tractable. If you are running multi-turn replay without that machinery, you are not doing offline evaluation — you are doing a vibes check against irrelevant data.
The honest truth: any multi-turn replay where the candidate differs at turn one is a one-turn evaluation with extra noise. Treat it as such.
What Replay Actually Measures Well
- https://scc-comets.com/continuous-evaluation-in-production-shadow-testing-large-language-models
- https://www.codeant.ai/blogs/llm-shadow-traffic-ab-testing
- https://www.confident-ai.com/blog/multi-turn-llm-evaluation-in-2026
- https://www.langchain.com/articles/llm-evaluation-framework
- https://eugeneyan.com/writing/counterfactual-evaluation/
- https://arxiv.org/pdf/2007.12986
- https://arxiv.org/pdf/2409.19824
- https://arxiv.org/html/2604.21480
- https://arxiv.org/html/2411.13768v3
- https://lakefs.io/blog/llm-observability-tools/
