Counterfactual Logging: Log Enough Today to Replay Yesterday's Traffic Against Next Year's Model
Every LLM team eventually gets the same email from a director: "Anthropic shipped a new Sonnet. Run our traffic against it and tell me by Friday whether we should switch." The team opens the production trace store, pulls last month's requests, queues them against the new model — and three hours in, somebody asks why the diff scores look insane on tool-using turns. The answer: nobody captured the tool responses in their original form. The traces logged the model's reply faithfully and stored a one-line summary of what each tool returned. Replaying those requests doesn't replay what the old model actually saw; it replays a heavily compressed projection of it. The migration evaluation isn't measuring the new model. It's measuring the new model talking to a different reality.
This is the failure mode I want to talk about. Most production LLM logs are output-shaped: they answer "what did the model say?" reasonably well, and answer "what did the model see?" only sketchily. That asymmetry is invisible until the day you need to replay history against a new model — at which point it becomes the entire story, because the gap between what was logged and what was sent is exactly the gap between a real evaluation and a fake one.
Call it counterfactual logging: capture today the inputs you'd need to ask "what would that other model have done with this exact request?" tomorrow. The bar isn't "we logged the request." The bar is "we can re-execute the request against a different model and trust the result is meaningful."
The Asymmetry Between What You Logged and What the Model Saw
The default observability stance for LLM apps treats the model call like an HTTP request: log the inputs, log the outputs, maybe log timing. That works for postmortems where you want to know what the model produced. It collapses the moment you want to re-run the request, because production model calls aren't really one-shot inputs. They are the visible surface of a deeper composition.
A real production call is a stack of inputs the model sees, and most of them are constructed at request time and then thrown away:
- A prompt template at a specific version, with a specific revision of the system instructions, role definitions, formatting rules, and few-shot examples. Templates change weekly. The version that was live an hour ago may not exist in your repo any more.
- Variables interpolated into that template from user state, retrieved memory, A/B assignment flags, locale, and so on. The template is the recipe; the rendered prompt is the dish, and the dish is what the model actually ate.
- Retrieved chunks from a vector store or BM25 index, plus the user-permission filters that selected which chunks were eligible. The chunks have content, and that content has a history — yesterday's chunk #427 isn't today's chunk #427 if the document was edited.
- Tool call results, often massaged before being stitched back into the conversation. A search_orders call might have returned a 12-row table that the harness converted to a four-line summary because it was over a token budget.
- Sampling parameters (temperature, top_p, stop sequences, max_tokens, JSON mode flags, structured output schemas), which are usually defaulted in code and never serialized into the trace.
- The model identifier and provider snapshot, including the dated suffix for hosted models (vendors quietly version their model strings) and the tokenizer revision, which is itself a hidden input that decides what "the prompt" even means.
The OpenTelemetry GenAI semantic conventions spell out an attribute namespace for most of these — gen_ai.request.model, gen_ai.request.temperature, gen_ai.prompt.*, gen_ai.completion.* — and this is genuinely a step forward. The conventions exist; vendors are starting to honor them; you can pick them up off the shelf. But conventions describe the slot, not what you put in it. If your instrumentation logs a JSON-serialized template variable bag and the template itself was hot-reloaded from a different commit, the slot is filled and the replay is still meaningless. The discipline is in capturing the fully resolved state, not the per-component pieces that happened to be available at attribute-write time.
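As a rough illustration of "fully resolved state", the sketch below merges silently applied defaults into the explicit parameters before anything is written to the span, and records the exact prompt text alongside a hash of the template that produced it. The gen_ai.request.* keys follow the conventions named above; SDK_DEFAULTS, resolved_attributes, and the app.prompt.* keys are assumptions made up for this example, not part of any standard or SDK.

```python
import hashlib

# Hypothetical stand-in for the defaults an SDK would apply silently.
SDK_DEFAULTS = {"temperature": 1.0, "top_p": 1.0, "max_tokens": 1024}


def resolved_attributes(model, rendered_prompt, explicit_params, template_source):
    """Build flat span attributes from the fully resolved call state."""
    # Defaults go in first, so values the caller never passed are still serialized.
    params = {**SDK_DEFAULTS, **explicit_params}
    return {
        "gen_ai.request.model": model,
        "gen_ai.request.temperature": params["temperature"],
        "gen_ai.request.top_p": params["top_p"],
        "gen_ai.request.max_tokens": params["max_tokens"],
        # Assumed custom keys: the exact text sent to the model, plus a
        # content hash of the template source that was live at render time.
        "app.prompt.rendered": rendered_prompt,
        "app.prompt.template_sha256": hashlib.sha256(
            template_source.encode("utf-8")
        ).hexdigest(),
    }


attrs = resolved_attributes(
    model="example-model",            # placeholder model string
    rendered_prompt="Hello Ada",
    explicit_params={"temperature": 0.2},
    template_source="Hello {name}",
)
```

With the OpenTelemetry Python API, a dict like this could be handed to a span via set_attributes. The point of the sketch is only the ordering: resolve first, then log.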
Replayability Has a Specific Schema
The thing that makes a trace replayable isn't a tool. It's a schema decision: every span that represents a model call is paired with a snapshot of the inputs sufficient to reconstitute the call against any model that exposes the same interface.
A working schema, written down once and enforced at the SDK layer rather than the application layer, looks roughly like this:
- The prompt template ID and version hash — not the friendly name, but a content hash of the template at render time. Names get reused; hashes don't.
- The fully rendered prompt, byte-for-byte, as the model received it. This is the load-bearing field. Tools that store the template plus the variables look like they are saving you space, and they are, right up until the template gets edited or the renderer logic changes and you can no longer reproduce the rendering deterministically two months later.
- The tool call results in their original form, before any harness-side summarization or truncation. If your tool returned 50 KB of JSON, log the 50 KB. If your harness then squeezed it into a 2 KB summary before re-feeding the model, log that too — but the original is non-negotiable, because the next model you evaluate may have a longer context window and want the raw payload.
- The retrieval set as content-addressed pointers — chunk IDs plus a content hash for each chunk plus the embedding model version plus the index revision. This is the part where you stop logging text and start logging immutable references. The SafJan post on vector index versioning makes the same case from a compliance angle: a vector index without a versioned content hash is a cache without a coherence model.
- The full sampling configuration, including the defaults that your SDK applied silently. Nothing makes a replay drift like calling the model at temperature 1.0 today because the replay script never passed an explicit value, when last month's request ran at 0.2 because the framework applied a different default.
- The model identifier with full version pinning, including the dated suffix the vendor happens to expose, plus the tokenizer revision if your stack pins them separately. "We use Claude" is not a pin. "claude-sonnet-4-6 with tokenizer rev 2026-02-14" is.
- The harness state — agent step number, parent span, retry count, branch identifier — that explains which call this was within a multi-step run. An agent's third planner call is not the same problem as its first, and you can't reconstruct that from request bodies alone.
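One way the fields above might be written down as a record type is sketched here. Every class, field, and function name is hypothetical and chosen for this example; the only technique taken from the text is content addressing, shown by hashing a chunk when it is logged and re-hashing it at replay time to detect drift.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Tuple


def sha256_hex(text: str) -> str:
    """Content hash used for templates and retrieved chunks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ChunkRef:
    chunk_id: str
    content_sha256: str      # hash of the chunk text at retrieval time
    embedding_model: str
    index_revision: str


@dataclass(frozen=True)
class ToolResult:
    tool_name: str
    raw: str                 # original payload, before any summarization
    as_fed_to_model: str     # what the harness actually stitched back in


@dataclass(frozen=True)
class ReplayRecord:
    template_id: str
    template_sha256: str
    rendered_prompt: str                 # byte-for-byte text the model received
    tool_results: Tuple[ToolResult, ...]
    retrieval: Tuple[ChunkRef, ...]
    sampling: dict                       # includes silently applied defaults
    model_id: str
    tokenizer_revision: Optional[str]
    harness: dict                        # step number, parent span, retry count, branch


def chunk_drifted(ref: ChunkRef, current_text: str) -> bool:
    """True if the chunk stored under this ID no longer matches what was logged."""
    return sha256_hex(current_text) != ref.content_sha256


ref = ChunkRef("427", sha256_hex("refund window is 30 days"), "embed-v1", "idx-7")
print(chunk_drifted(ref, "refund window is 30 days"))   # unchanged document
print(chunk_drifted(ref, "refund window is 14 days"))   # edited since logging
```

A drifted chunk does not have to block a replay, but the evaluation should know about it: it means the text has to be recovered from an immutable store rather than re-fetched from the live index.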
Notice what isn't on the list: the model's response. The response is interesting for analysis, but it is not part of the input you need to replay. Conflating the two is the most common reason teams think they have replayable logs and don't.
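The separation can be made concrete with a small sketch. It assumes a trace stored as a dict with the input snapshot under "input" and the original reply under "output". Those key names, the request layout, and the replay and compare functions are all illustrative rather than any particular vendor's API. The replay path reads only the input; the logged output appears only afterwards, as the baseline for comparison.

```python
def replay(snapshot, call_model, target_model):
    """Re-execute a logged request against a different model.

    Only the input snapshot is read here; the logged response is never consulted.
    """
    request = {
        "model": target_model,
        "prompt": snapshot["rendered_prompt"],
        **snapshot["sampling"],          # the exact sampling config that was logged
    }
    return call_model(request)


def compare(trace, call_model, target_model):
    """Replay the input, then set the new output beside the original one."""
    candidate = replay(trace["input"], call_model, target_model)
    return {
        "original": trace["output"],
        "candidate": candidate,
        "changed": candidate != trace["output"],
    }


# Usage with a fake client standing in for a real provider call.
seen_requests = []

def fake_client(request):
    seen_requests.append(request)
    return "reply from candidate"

trace = {
    "input": {"rendered_prompt": "Summarize order 1001.", "sampling": {"temperature": 0.2}},
    "output": "reply from original",
}
result = compare(trace, fake_client, "candidate-model")
```

Keeping replay ignorant of the stored response makes the guarantee easy to audit: if the function cannot see the old output, it cannot accidentally feed it to the new model.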
- https://uptrace.dev/blog/opentelemetry-ai-systems
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://opentelemetry.io/blog/2025/ai-agent-observability/
- https://opentelemetry.io/blog/2024/llm-observability/
- https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
- https://safjan.com/version-your-vectors-index-versioning-as-the-missing-layer-in-rag/
- https://apxml.com/courses/optimizing-rag-for-production/chapter-1-production-rag-foundations/rag-versioning-experiment-tracking
- https://www.codeant.ai/blogs/llm-shadow-traffic-ab-testing
- https://www.statsig.com/perspectives/shadow-testing-ai-model-evaluation
- https://arxiv.org/html/2505.17716v1
- https://optyxstack.com/security-compliance/llm-logging-without-pii-observability-patterns
- https://portkey.ai/blog/the-complete-guide-to-llm-observability/
- https://www.truefoundry.com/blog/observability-in-llm-workflows
- https://promptive.dev/
- https://docs.langchain.com/langsmith/prompt-engineering-concepts
