Skip to main content

2 posts tagged with "shadow-testing"

View all tags

Shadow Replay Punishes the Model That Would Have Changed the Conversation

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a new model into shadow replay and watched the win rate sit at 47 percent against the incumbent. Same prompts, same retrieval, a model the vendor's own evals had ranked clearly higher. The shadow harness took last week's production traffic, pumped it through the candidate, fed both responses to an LLM judge, and declared the upgrade roughly a coin flip. The team almost reverted on the spot.

The problem was not the model. The problem was that every user message in the replay had already been conditioned on the old model's previous turn. The candidate wrote a better answer at turn one, the user in the log replied to a different answer that no longer existed, and from turn two onward the judge was scoring a conversation that was not happening. A genuinely better model that changes what the user does next has no ground truth to be scored against. The replay quietly rewards staying on the old rails.

The Model Migration Playbook: How to Swap Foundation Models Without Breaking Production

· 13 min read
Tian Pan
Software Engineer

Every team that has shipped an LLM-powered product has faced the same moment: a new foundation model drops with better benchmarks, lower costs, or both — and someone asks, "Can we just swap it in?" The answer is always yes in staging and frequently catastrophic in production.

The gap between "runs on the new model" and "behaves correctly on the new model" is where production incidents live. Model migrations fail not because the new model is worse, but because the migration process assumes behavioral equivalence where none exists. Prompt formatting conventions differ between providers. System prompt interpretation varies across model families. Edge cases that the old model handled gracefully — through learned quirks you never documented — surface as regressions that your eval suite wasn't designed to catch.