Counterfactual Logging: Log Enough Today to Replay Yesterday's Traffic Against Next Year's Model
Every LLM team eventually gets the same email from a director: "Anthropic shipped a new Sonnet. Run our traffic against it and tell me by Friday whether we should switch." The team opens the production trace store, pulls last month's requests, queues them against the new model — and three hours in, somebody asks why the diff scores look insane on tool-using turns. The answer: nobody captured the tool responses in their original form. The traces logged the model's reply faithfully and stored a one-line summary of what each tool returned. Replaying those requests doesn't replay what the old model actually saw; it replays a heavily compressed projection of it. The migration evaluation isn't measuring the new model. It's measuring the new model talking to a different reality.
This is the failure mode I want to talk about. Most production LLM logs are output-shaped: they answer "what did the model say?" reasonably well, and answer "what did the model see?" only sketchily. That asymmetry is invisible until the day you need to replay history against a new model — at which point it becomes the entire story, because the gap between what was logged and what was sent is exactly the gap between a real evaluation and a fake one.
Call it counterfactual logging: capture today the inputs you'd need to ask "what would that other model have done with this exact request?" tomorrow. The bar isn't "we logged the request." The bar is "we can re-execute the request against a different model and trust the result is meaningful."
