The Shadow Deploy That Proved Nothing: When Parallel Calls Miss the Conversation

· 9 min read
Tian Pan
Software Engineer

A shadow deployment is the validation step everyone agrees is responsible. You mirror live traffic into a candidate model, log its output, and never show the result to the user. The dashboards line up, the candidate's responses look as good as the incumbent's on aggregate quality metrics, the team gets a green signal that the new model is "production-equivalent," and you promote it to a small slice of real traffic. Within a day, user-facing metrics collapse on a class of queries the shadow run had rated as matched.

The team's first instinct is to blame the rollout: maybe a feature flag misfired, maybe a router misrouted, maybe the new model is silently degraded in production in a way it wasn't in shadow. None of those is true. The shadow worked exactly as designed. What the team measured was the candidate model's output in isolation: a string against a string. What got promoted was a candidate model whose output reshapes the next user message, the next turn, the abandonment decision, and the path through the rest of the session. The shadow measured the model. Production measures the conversation. Those are not the same unit.
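To make the gap concrete, here is a minimal sketch of one mirrored turn. It is an illustration under stated assumptions, not the setup the team in this story used: `shadow_turn`, `ShadowLog`, and the `Model` type are hypothetical names, and a "model" is reduced to a function from conversation history to a reply.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Assumption for illustration: a model maps the conversation so far to a reply.
Model = Callable[[List[str]], str]


@dataclass
class ShadowLog:
    """Hypothetical log of (served reply, shadowed reply) pairs, one per turn."""
    pairs: List[Tuple[str, str]] = field(default_factory=list)


def shadow_turn(
    history: List[str],
    user_msg: str,
    incumbent: Model,
    candidate: Model,
    log: ShadowLog,
) -> List[str]:
    """Serve the incumbent's reply and log the candidate's reply without showing it."""
    context = history + [user_msg]
    served = incumbent(context)    # this is what the user sees
    shadowed = candidate(context)  # same input, output is only logged
    log.pairs.append((served, shadowed))
    # Only the served reply enters the history, so the candidate's output
    # never influences the next user message or any later turn.
    return context + [served]


if __name__ == "__main__":
    # Toy models that tag replies with the length of the context they saw.
    incumbent = lambda h: f"A{len(h)}"
    candidate = lambda h: f"B{len(h)}"

    log = ShadowLog()
    history: List[str] = []
    for msg in ["hi", "tell me more"]:
        history = shadow_turn(history, msg, incumbent, candidate, log)

    print(history)    # the conversation the user actually had
    print(log.pairs)  # per-turn string-against-string comparisons
```

In this sketch every candidate reply is generated from a history that contains only incumbent replies. The log can compare outputs turn by turn, but it holds no evidence about the conversation the candidate would have produced had its own replies been shown.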