Counterfactual Logging: Log Enough Today to Replay Yesterday's Traffic Against Next Year's Model
Every LLM team eventually gets the same email from a director: "Anthropic shipped a new Sonnet. Run our traffic against it and tell me by Friday whether we should switch." The team opens the production trace store, pulls last month's requests, queues them against the new model — and three hours in, somebody asks why the diff scores look insane on tool-using turns. The answer: nobody captured the tool responses in their original form. The traces logged the model's reply faithfully and stored a one-line summary of what each tool returned. Replaying those requests doesn't replay what the old model actually saw; it replays a heavily compressed projection of it. The migration evaluation isn't measuring the new model. It's measuring the new model talking to a different reality.
This is the failure mode I want to talk about. Most production LLM logs are output-shaped: they answer "what did the model say?" reasonably well, and answer "what did the model see?" only sketchily. That asymmetry is invisible until the day you need to replay history against a new model — at which point it becomes the entire story, because the gap between what was logged and what was sent is exactly the gap between a real evaluation and a fake one.
Call it counterfactual logging: capture today the inputs you'd need to ask "what would that other model have done with this exact request?" tomorrow. The bar isn't "we logged the request." The bar is "we can re-execute the request against a different model and trust the result is meaningful."
The Asymmetry Between What You Logged and What the Model Saw
The default observability stance for LLM apps treats the model call like an HTTP request: log the inputs, log the outputs, maybe log timing. That works for postmortems where you want to know what the model produced. It collapses the moment you want to re-run the request, because production model calls aren't really one-shot inputs. They are the visible surface of a deeper composition.
A real production call is a stack of inputs the model sees, and most of them are constructed at request time and then thrown away:
- A prompt template at a specific version, with a specific revision of the system instructions, role definitions, formatting rules, and few-shot examples. Templates change weekly. The version that was live an hour ago may not exist in your repo any more.
- Variables interpolated into that template from user state, retrieved memory, A/B assignment flags, locale, and so on. The template is the recipe; the rendered prompt is the dish, and the dish is what the model actually ate.
- Retrieved chunks from a vector store or BM25 index, plus the user-permission filters that selected which chunks were eligible. The chunks have content, and that content has a history — yesterday's chunk #427 isn't today's chunk #427 if the document was edited.
- Tool call results, often massaged before being stitched back into the conversation. A search_orders call might have returned a 12-row table that the harness converted to a four-line summary because it was over a token budget.
- Sampling parameters — temperature, top_p, stop sequences, max_tokens, JSON mode flags, structured output schemas — which are usually defaulted in code and never serialized into the trace.
- The model identifier and provider snapshot, including the dated suffix for hosted models (vendors quietly version their model strings) and the tokenizer revision, which is itself a hidden input that decides what "the prompt" even means.
The OpenTelemetry GenAI semantic conventions spell out an attribute namespace for most of these — gen_ai.request.model, gen_ai.request.temperature, gen_ai.prompt.*, gen_ai.completion.* — and this is genuinely a step forward. The conventions exist; vendors are starting to honor them; you can pick them up off the shelf. But conventions describe the slot, not what you put in it. If your instrumentation logs a JSON-serialized template variable bag and the template itself was hot-reloaded from a different commit, the slot is filled and the replay is still meaningless. The discipline is in capturing the fully resolved state, not the per-component pieces that happened to be available at attribute-write time.
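To make that concrete, here is a minimal sketch of filling those slots with resolved state, assuming the OpenTelemetry Python SDK. The gen_ai.* attribute names come from the conventions above; the app.* names and the span name are this sketch's own stand-ins for the fields the conventions don't mandate.

```python
# Minimal sketch: record both the conventional gen_ai.* slots and the
# fully resolved state (template hash, rendered prompt, full params).
import hashlib
import json

from opentelemetry import trace

tracer = trace.get_tracer("llm-gateway")  # instrumentation name is illustrative

def log_model_call(template_src: str, rendered_prompt: str, params: dict, model: str):
    with tracer.start_as_current_span("gen_ai.chat") as span:
        # Conventional slots: model identifier and sampling parameters.
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", params.get("temperature", 1.0))
        span.set_attribute("gen_ai.request.max_tokens", params.get("max_tokens", 1024))
        # Resolved state: a content hash of the template that was actually
        # live, plus the byte-for-byte prompt the model received.
        span.set_attribute("app.prompt.template_sha256",
                           hashlib.sha256(template_src.encode()).hexdigest())
        span.set_attribute("app.prompt.rendered", rendered_prompt)
        # Every parameter, including defaults the SDK would apply silently.
        span.set_attribute("app.request.params_json", json.dumps(params, sort_keys=True))
```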
Replayability Has a Specific Schema
The thing that makes a trace replayable isn't a tool. It's a schema decision: every span that represents a model call is paired with a snapshot of the inputs sufficient to reconstitute the call against any model that exposes the same interface.
A working schema, written down once and enforced at the SDK layer rather than the application layer, looks roughly like this:
- The prompt template ID and version hash — not the friendly name, but a content hash of the template at render time. Names get reused; hashes don't.
- The fully rendered prompt, byte-for-byte, as the model received it. This is the load-bearing field. Tools that store the template plus the variables look like they are saving you space, and they are, right up until the template gets edited, the renderer logic changes, and you can't reproduce the rendering deterministically two months later.
- The tool call results in their original form, before any harness-side summarization or truncation. If your tool returned 50 KB of JSON, log the 50 KB. If your harness then squeezed it into a 2 KB summary before re-feeding the model, log that too — but the original is non-negotiable, because the next model you evaluate may have a longer context window and want the raw payload.
- The retrieval set as content-addressed pointers — chunk IDs plus a content hash for each chunk plus the embedding model version plus the index revision. This is the part where you stop logging text and start logging immutable references. The SafJan post on vector index versioning makes the same case from a compliance angle: a vector index without a versioned content hash is a cache without a coherence model.
- The full sampling configuration, including the defaults that your SDK applied silently. Nothing makes a replay drift like the model being called at temperature 1.0 today because you forgot to pass an explicit zero, when last month's request used 0.2 because the framework had a different default.
- The model identifier with full version pinning, including the dated suffix the vendor happens to expose, plus the tokenizer revision if your stack pins them separately. "We use Claude" is not a pin. "claude-sonnet-4-6 with tokenizer rev 2026-02-14" is.
- The harness state — agent step number, parent span, retry count, branch identifier — that explains which call this was within a multi-step run. An agent's third planner call is not the same problem as its first, and you can't reconstruct that from request bodies alone.
Notice what isn't on the list: the model's response. The response is interesting for analysis, but it is not part of the input you need to replay. Conflating the two is the most common reason teams think they have replayable logs and don't.
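As a concrete anchor, here is roughly what that record looks like as a Python dataclass. The field names are this post's sketch, not a standard; the point is what is present, and that the response is deliberately absent.

```python
# Illustrative shape of a replayable call record. Field names are this
# sketch's own. The model's response is intentionally not a field:
# it is an output for analysis, not a replay input.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ReplayableCall:
    # Prompt identity: content hash, not the friendly template name.
    template_id: str
    template_sha256: str
    rendered_prompt: str                 # byte-for-byte, as the model saw it
    # Tool results in original form; harness-side summaries kept separately.
    tool_results_raw: list[str] = field(default_factory=list)
    tool_results_fed_to_model: list[str] = field(default_factory=list)
    # Retrieval set as immutable references:
    # each entry holds chunk_id, content_sha256, embedding_model, index_rev.
    retrieved_chunks: list[dict] = field(default_factory=list)
    # Full sampling config, including silently applied defaults.
    sampling: dict = field(default_factory=dict)
    # Model pin and tokenizer revision.
    model_id: str = ""
    tokenizer_rev: str = ""
    # Harness state: which call this was within a multi-step run.
    agent_step: int = 0
    parent_span_id: str = ""
    retry_count: int = 0
    branch_id: str = ""
```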
The Storage Argument Gets Pushed Back Every Quarter
Replayable logs are roughly 10x larger than output-only logs. The first time someone runs the numbers on an annualized observability bill, this becomes a fight. The fight has the same shape every time: the platform team wants to keep them; the FinOps reviewer points out that 90% of the volume is full retrieval payloads on requests nobody will ever reread; the compromise is "we'll sample" or "we'll truncate fields over 4 KB."
The compromise loses. Sampling at 1% means the migration evaluation runs against 1% of historical traffic, which means you can't slice by tenant, by query type, by language, or by anything else that mattered enough for someone to file a ticket. Truncating long fields means the requests that were interesting — the ones with big retrieved contexts, big tool outputs, big histories — are precisely the ones you can't replay. The storage savings are real, but they are paid for by the migration that doesn't happen because the data isn't there to support it.
The argument that wins is the one framed as insurance against the next migration, not as a feature for today. The cost of replayable logs is paid in cold storage, where bytes are cheap. The cost of not having them is paid in engineering quarters spent re-instrumenting and waiting for new traffic to accumulate before any decision can be made. A team that cannot answer "would the new model have done better last week" without instrumenting and waiting two weeks has just told its leadership that it cannot evaluate models in less than two weeks. Once that's established, vendor switches stall, GPT-5 sits unevaluated, and someone notices that competitors who did instrument upstream are shipping faster.
There are real tactics for keeping the bill in line:
- Tier the retention: full-fidelity replayable logs at high resolution for 30 days, then sampled and content-addressed (chunk hashes only, dereferenced from a separate cold archive) for the long tail.
- Deduplicate aggressively: prompt templates, system instructions, retrieval chunks, and tool schemas are extremely repetitive across requests. Hash-and-store-once is straightforward, and it usually cuts the volume by another order of magnitude.
- Compress at write time: zstd on JSON payloads is rarely a bad idea. Many traces have entropy distributions friendly to dictionary compression; train a dictionary on a sample of your traffic and the ratios get even better.
- Hot/cold split: keep the index of what is replayable (span IDs, request IDs, content hashes, tags) hot and queryable. Hydrate the actual replay payloads from cold storage on demand. You do not need to grep terabytes of prompt text in real time; you need to find 5,000 requests by tag and pull their bodies in batch.
None of these are exotic. They are normal data-engineering moves. They are also the difference between a replayable log store that costs three months of engineering to defend at every budget review and one that costs roughly the same as your APM bill.
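For the dedup and compression moves in particular, the machinery is small. A sketch, assuming the zstandard package and a hypothetical blob directory standing in for the cold store:

```python
# Content-addressed dedup plus zstd compression at write time.
# The directory layout and manifest shape are this sketch's own.
import hashlib
from pathlib import Path

import zstandard

BLOB_DIR = Path("blobs")                 # cold store of deduplicated payloads
BLOB_DIR.mkdir(exist_ok=True)
compressor = zstandard.ZstdCompressor(level=10)

def put_blob(payload: bytes) -> str:
    """Store a payload once, keyed by its content hash (hash-and-store-once)."""
    digest = hashlib.sha256(payload).hexdigest()
    path = BLOB_DIR / f"{digest}.zst"
    if not path.exists():
        path.write_bytes(compressor.compress(payload))
    return digest

def index_row(span_id: str, template: str, chunks: list[str]) -> dict:
    """Hot index row: small and queryable, pointing at cold blobs by hash."""
    return {
        "span_id": span_id,
        "template_sha256": put_blob(template.encode()),
        "chunk_sha256s": [put_blob(c.encode()) for c in chunks],
    }
```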
The Privacy Footprint Gets Heavier and You Have to Plan For It
Output-only logs already raise compliance concerns; replayable logs raise them further. By design, you are now retaining the full text of user inputs, the rendered system prompts that may contain internal information, the contents of retrieved documents (which may include data the user has access to today and won't tomorrow), and tool call results that may include emails, addresses, internal IDs, and other regulated fields.
The mistake is to treat this as a separate problem solved by sprinkling redaction over the pipeline. The right framing is that replayable logs are a regulated dataset from the moment of capture, and the discipline that makes them safe is the same discipline that makes them useful: a schema that knows what it holds.
A few patterns that hold up under audit:
- Sensitivity classification at the field level, written into the schema and enforced at write time. Every span attribute and event field carries a class — pii, confidential, internal, public. Storage tier, encryption posture, retention period, and access controls all derive from the class. This is the same shape as data-classification regimes that already exist in mature data platforms, applied to a new dataset.
- Deterministic, irreversible tokenization for fields that need to survive replay but not be readable (a sketch follows after this list). Hash the user's email with a per-tenant salt; the replay still slots a stable token into the rendered prompt, the analyst can't read the email, and the tokenization survives a DSR because the original was never stored.
- Per-tenant retention windows that match the data subject's actual rights, not the longest convenient interval. If you serve the EU, your replay window for personal data is bounded by what your DPIA covers and what you committed to in your privacy notice. The replay tool should refuse to load anything past the boundary.
- Access logging on the replay path, separate from access logging on read-for-analysis. Replaying a year-old request against a new model is a privileged operation, not a routine query, and it should leave an audit trail comparable to a database export.
- A documented data flow from the capture point to the replay path. The compliance failure mode I keep seeing is that the replay tool was added by a platform engineer in a sprint, and the data classification team finds out about it nine months later when an auditor asks. Make the data flow part of the design review, not part of the apology.
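The tokenization pattern above is a few lines in practice. A sketch using HMAC-SHA256 with a per-tenant key; key management and rotation are deliberately out of scope here:

```python
# Deterministic, irreversible tokenization with a per-tenant key.
# HMAC-SHA256 yields a stable token per (tenant, value) pair without
# ever storing the original value.
import hashlib
import hmac

def tokenize(tenant_key: bytes, value: str) -> str:
    """Same value + same tenant key -> same token; not reversible."""
    digest = hmac.new(tenant_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"tok_{digest[:16]}"

# The rendered prompt stores the token in place of the raw field, so a
# later replay still sees a stable placeholder, e.g.:
# "Customer tok_3f9a1c2b4d5e6f70 asked about their last order"
```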
The honest framing is that replayable logs are more valuable and more dangerous than output-only logs, and the answer is to invest in both halves of that statement instead of pretending only one half is true.
Replay as a First-Class Engineering Surface
The reason this matters now, more than it did two years ago, is that the cadence of model releases has compressed. A team that ships against frontier models is being asked to evaluate a new candidate roughly every quarter, and the people asking are no longer the model team — they are product leads, finance, and security, all asking different versions of "is the new one better, cheaper, safer for our traffic?" The replay path is the answer to all three questions, and the path only exists if the logs were designed for it.
Built well, the replay surface is something a non-platform engineer can use without filing a ticket: a CLI or a notebook that takes a query like "all traces from cohort X over the last 30 days where the planner took more than 2 steps" and a target like "Claude Sonnet 4.7 at temperature 0.3," fans the requests out, and produces a side-by-side diff with cost, latency, and quality scores. Built badly, it's a script someone wrote once for one migration and that nobody else can run because half the inputs were captured ad hoc.
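The core of that fan-out is small once the logs carry the right fields. A sketch in which every helper is hypothetical: the trace-store query and the provider client are injected stand-ins for whatever you already run.

```python
# Replay fan-out sketch. `records` comes from whatever query surface
# sits over the hot index; `call_model` is an injected provider client.
def replay(records, call_model, target_model: str, temperature: float) -> list[dict]:
    rows = []
    for rec in records:
        new_output, cost_usd, latency_ms = call_model(
            model=target_model,
            prompt=rec["rendered_prompt"],     # byte-for-byte historical input
            tools=rec["tool_results_raw"],     # original payloads, not harness summaries
            temperature=temperature,
        )
        rows.append({
            "span_id": rec["span_id"],
            "old_output": rec["old_output"],   # joined in from the analysis store for the diff
            "new_output": new_output,
            "cost_usd": cost_usd,
            "latency_ms": latency_ms,
        })
    return rows

# Hypothetical usage, with the query and client supplied by your stack:
# replay(load_traces("cohort X, planner > 2 steps, last 30d"),
#        provider_client, "claude-sonnet-4-7", temperature=0.3)
```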
The architectural distinction worth making is between two things that are easy to confuse: we logged what the model said (cheap, useful for incident review, what most teams have) and we logged what the model saw (expensive, useful for the migration that pays the storage bill back ten times over, what a few teams have). Choose deliberately. The second one is what lets you treat model migrations like a normal engineering decision instead of a multi-quarter project, and the only way to have it on the day you need it is to have decided to capture it the day before.
- https://uptrace.dev/blog/opentelemetry-ai-systems
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://opentelemetry.io/blog/2025/ai-agent-observability/
- https://opentelemetry.io/blog/2024/llm-observability/
- https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
- https://safjan.com/version-your-vectors-index-versioning-as-the-missing-layer-in-rag/
- https://apxml.com/courses/optimizing-rag-for-production/chapter-1-production-rag-foundations/rag-versioning-experiment-tracking
- https://www.codeant.ai/blogs/llm-shadow-traffic-ab-testing
- https://www.statsig.com/perspectives/shadow-testing-ai-model-evaluation
- https://arxiv.org/html/2505.17716v1
- https://optyxstack.com/security-compliance/llm-logging-without-pii-observability-patterns
- https://portkey.ai/blog/the-complete-guide-to-llm-observability/
- https://www.truefoundry.com/blog/observability-in-llm-workflows
- https://promptive.dev/
- https://docs.langchain.com/langsmith/prompt-engineering-concepts
