Agent Incident Forensics: Capture Before You Need It
The customer sends a screenshot to support on a Tuesday. Their account shows a refund posted six days ago that they never asked for. Your CRO forwards the screenshot with one question: "What produced this?" You know an agent did it — the audit log says actor: refund-agent-v3. But the prompt has been edited four times since. The model id rotated last Thursday when finance switched providers to chase a 12% cost cut. The system prompt is templated from three retrieved documents, and the retrieval index was reindexed Monday. The conversation history was trimmed by the runtime to fit a smaller context window.
You can tell the CRO the agent did it. You cannot tell them why. That gap — between knowing an action happened and being able to reconstruct the inputs that caused it — is the gap most agent teams discover the first time someone outside engineering asks a real forensic question.
The classical answer is "we have logs." The classical answer is wrong. Logs of what the agent did (a refund of $84.20 was issued at 14:23:09) are not logs of what produced the action. The latter requires a snapshot of every input that the model saw at decision time, captured at write time, frozen, and indexed by something you can pivot on a week later. Most teams discover during their first incident that they captured the conclusion and not the premise.
What "reconstruct" actually requires
To re-derive an agent action you need a tuple, captured atomically when the action was emitted, that contains every variable the model conditioned on. The OpenTelemetry GenAI semantic conventions formalized a baseline of these fields in 2025 — gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.provider.name, gen_ai.operation.name — but the conventions are a floor, not a ceiling. A complete forensic record needs more.
The minimum capturable set:

- the full prompt as the model received it, after templating, retrieval injection, and tool-output interleaving
- the model id and exact version string from the provider
- every decode parameter: temperature, top-p, top-k, max tokens, frequency and presence penalties, stop sequences, seed if pinned
- the system prompt revision hash
- the tool schemas as serialized at request time
- every tool result the model ingested in the conversation up to that turn
- the input record from your own database at the time of the call
- the user identity and tenant
- the runtime version of your agent harness
- any safety-policy or guardrail configuration that shaped output filtering
The mistake that recurs is assuming "I logged the prompt" is enough. The prompt at code-write time is not the prompt the model saw. By the time it hit the model it had been mutated by retrieval, by tool results from previous turns, by truncation when the context overflowed, by a hidden suffix the harness adds, and by the safety classifier rewriting "the user asked for X" into "the user asked for X (filtered)". Forensic logging captures the post-mutation form, not the pre-mutation form. If you can only log one of the two, log the post-mutation form.
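In practice this is one record per model call. Below is a minimal sketch in Python; the gen_ai.* attribute names come from the OpenTelemetry GenAI semantic conventions cited above, while the ForensicRecord type and every other field name are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ForensicRecord:
    # Baseline attributes from the OpenTelemetry GenAI semantic conventions.
    gen_ai_provider_name: str              # gen_ai.provider.name
    gen_ai_request_model: str              # gen_ai.request.model, exact version string
    gen_ai_operation_name: str             # gen_ai.operation.name
    gen_ai_usage_input_tokens: int         # gen_ai.usage.input_tokens

    # The forensic extras the conventions do not mandate (names are illustrative).
    wire_prompt: list[dict[str, Any]]      # messages exactly as sent, post-mutation
    decode_config_hash: str                # SHA-256 of the canonicalized decode config
    system_prompt_rev: str                 # content hash of the system prompt template
    tool_schemas: list[dict[str, Any]]     # tool schemas as serialized at request time
    tool_result_hashes: list[str]          # content hashes of every ingested tool result
    input_record_snapshot: dict[str, Any]  # your own DB row at call time
    user_id: str
    tenant_id: str
    harness_version: str                   # runtime version of the agent harness
    guardrail_config_hash: str             # safety-policy / output-filter configuration
```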
Decode parameters: the silent third axis
Most teams version their prompt and their model. Almost nobody versions their decode config with the same discipline, and decode config matters more than people realize. A temperature change from 0.2 to 0.4 changes the output distribution materially; a top_p change from 0.95 to 1.0 widens the tail; flipping a presence penalty from 0 to 0.5 will cause the model to avoid repeating itself in ways that look like a behavioral regression. If you replay an incident with the wrong decode config you get a different output and conclude wrongly that the model "got better" or "got worse" when in fact you changed the variable you forgot to pin.
The pragmatic rule: treat the decode config as part of the prompt. Hash it together with the prompt body, store the hash in the trace, and store the full decode config in a content-addressed table keyed by that hash. When the same config recurs across a million requests you store it once. When you replay, you fetch the full config by hash and re-create the exact inference path.
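A minimal sketch of that pattern, with a plain dict standing in for the content-addressed table; the function names and the canonicalization choice (sorted-key JSON) are assumptions, not a specific library's API.

```python
import hashlib
import json

def canonical_hash(obj) -> str:
    """Stable SHA-256 over a JSON-serializable object: sorted keys, no whitespace."""
    blob = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def capture_decode_config(config_table: dict, prompt_body: str, decode_config: dict) -> dict:
    """Return the hashes to put in the trace row; store the full config once by hash."""
    config_hash = canonical_hash(decode_config)
    config_table.setdefault(config_hash, decode_config)   # identical configs dedupe
    return {
        # hash of prompt and decode config together, per "treat decode config as prompt"
        "prompt_and_decode_hash": canonical_hash({"prompt": prompt_body, "decode": decode_config}),
        "decode_config_hash": config_hash,                 # lookup key at replay time
    }

# Usage: a million requests with the same config store one row in config_table.
config_table: dict[str, dict] = {}
trace_fields = capture_decode_config(
    config_table,
    "You are a refunds agent...",
    {"temperature": 0.2, "top_p": 0.95, "max_tokens": 512, "seed": 7},
)
```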
Content-addressed prompt revs
Prompt versions are usually managed as either named tags ("prompt-v3.1") or as the inline string. Both fail forensically. Named tags can be rewritten — a teammate fixes a typo and republishes "prompt-v3.1" without bumping the name. Inline strings are bulky and turn every trace row into a kilobyte of text duplicated across millions of requests. The right primitive is content-addressed: take the SHA-256 of the canonicalized prompt template, store the prompt body once in a prompt_revs table keyed by that hash, and store only the hash in the trace.
This gives you three things at once: deduplication (the same prompt rev is stored once regardless of traffic), tamper evidence (if the body changes, the hash changes, so a teammate cannot silently rewrite history), and trivial diff during incident response (prompt rev a3f9... vs prompt rev b2c4... — pull both bodies and run a textual diff). Hash-chain audit trails extend this: each entry's hash includes the previous entry's hash, so an attacker cannot insert an event between two timestamps without breaking the chain. AuditableLLM is one published framework that uses this construction; you can implement the core idea in fifty lines of application code without adopting a framework.
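The fifty-lines claim is roughly right. Here is a hedged sketch of the core hash-chain construction in application code; it is the idea, not AuditableLLM's API: each appended entry commits to the previous entry's hash, and verification walks the chain from the genesis value.

```python
import hashlib
import json
import time

class HashChainLog:
    """Append-only log where each entry's hash commits to the previous entry's hash."""

    def __init__(self):
        self.entries: list[dict] = []
        self._prev_hash = "0" * 64                     # genesis value

    def append(self, event: dict) -> str:
        body = {"ts": time.time(), "prev": self._prev_hash, "event": event}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()
        self.entries.append({**body, "hash": digest})
        self._prev_hash = digest
        return digest

    def verify(self) -> bool:
        """Recompute every hash; an inserted, removed, or edited entry breaks the chain."""
        prev = "0" * 64
        for e in self.entries:
            body = {"ts": e["ts"], "prev": e["prev"], "event": e["event"]}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode("utf-8")
            ).hexdigest()
            if e["prev"] != prev or e["hash"] != digest:
                return False
            prev = digest
        return True
```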
Conversation history is the trapdoor
The thing that bites teams hardest in production is that the conversation history you stored for turn N is not the conversation history the model saw at turn N. Runtimes truncate, summarize, and compress history under context-window pressure. Some agents summarize every K turns and discard the originals. Some carry tool results inline and trim them when they grow too large. Some prepend a moving "memory" buffer that swaps entries in and out based on recency.
When you reconstruct an incident, replaying with the full transcript will not reproduce the action because the model never saw the full transcript. You have to capture what was sent to the model on the wire, byte-for-byte, with the truncation and summarization already applied. This is annoying because the truncated form is much larger than the delta-from-previous-turn, and most teams default to logging the delta. Resist that default. The wire form is the only form that replays.
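One way to make the wire form the thing you log is to wrap the provider call so the capture and the request serialize from the same object. A sketch, assuming a generic client.chat interface and a content-addressed blob store (both assumptions, not a specific SDK):

```python
import hashlib
import json

def call_with_capture(client, blob_store, *, model: str, messages: list[dict], **decode):
    """Capture the request in the exact form it is sent, then make the call."""
    # Serialize the payload we are about to send; this is the form that replays,
    # truncation and summarization already applied.
    wire_bytes = json.dumps(
        {"model": model, "messages": messages, **decode},
        sort_keys=True, separators=(",", ":"),
    ).encode("utf-8")
    wire_hash = hashlib.sha256(wire_bytes).hexdigest()
    blob_store.put(wire_hash, wire_bytes)         # content-addressed; retries dedupe

    # client.chat is an assumed generic interface, not a specific SDK method.
    response = client.chat(model=model, messages=messages, **decode)
    return wire_hash, response
```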
Tool results are evidence
In agent systems, the model's behavior is downstream of tool outputs. A search tool returned a stale document. A database query returned a row that has since been updated. A pricing API returned an old quote. The agent's "wrong" decision was correct conditional on the tool results it received — which is usually the answer your CRO actually needs to hear, because it shifts the postmortem from "the model is broken" to "our retrieval index was stale on Tuesday."
Capture every tool input and every tool output as part of the trace, with the same content-addressing approach. If the tool result is large (a 4MB JSON blob from a search index), store the body in object storage and put the SHA in the trace. The Air Canada chatbot incident, where the bot quoted a refund policy that the website's other pages contradicted, is a forensic exercise that turns trivial when you have the retrieval result captured at decision time and unsolvable when you don't. Without the captured retrieval result, you cannot tell whether the model hallucinated the wrong policy or whether retrieval handed it the wrong document. With the result, the postmortem writes itself.
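For the large-result case, a sketch using boto3's put_object; the bucket layout and key scheme are assumptions:

```python
import hashlib
import boto3

s3 = boto3.client("s3")

def capture_tool_result(bucket: str, tool_name: str, result_bytes: bytes) -> str:
    """Store a large tool result by content hash; return the hash for the trace row."""
    sha = hashlib.sha256(result_bytes).hexdigest()
    key = f"tool-results/{tool_name}/{sha}"
    s3.put_object(Bucket=bucket, Key=key, Body=result_bytes)  # idempotent by content
    return sha
```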
Tiered storage so the bill stays sane
Capturing all of this for every request will produce orders of magnitude more bytes than your current logs. The instinct is to sample, but sampling is wrong for forensics — the request you sampled out is the one the customer escalates next month. The right pattern is to capture everything and tier the storage by age.
A practical layout: hot tier on SSD for the last 24–72 hours, where engineers can query interactively during an active incident; warm tier on cheaper block storage for 30–90 days, where postmortems and recent disputes get resolved; cold tier on object storage (S3, GCS) with index-only metadata for 6 months to several years, where compliance and rare deep-dive forensics live. Cold-tier byte cost is roughly two orders of magnitude below hot-tier byte cost, and most forensic queries — "give me the trace for refund 8472" — are satisfied with a hash lookup, not a full scan, so the latency penalty for cold-tier retrieval doesn't matter.
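The cold tier is the easy part to automate. A sketch of an S3 lifecycle rule that transitions trace bodies to an archive class after 90 days and expires them after two years; the bucket, prefix, and windows are illustrative, and the hot and warm tiers would live in your tracing backend's own retention settings:

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="agent-forensics",                     # illustrative bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "trace-bodies-to-cold",
            "Filter": {"Prefix": "traces/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "GLACIER_IR"},   # warm to cold
            ],
            "Expiration": {"Days": 730},          # drop after two years of retention
        }],
    },
)
```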
This matters for compliance. The EU AI Act requires high-risk system operators to retain logs for at least six months and to ensure those logs allow "full reconstructability of algorithmic decisions" — a phrase that doesn't soften under interpretation. Six months at cold-tier rates is affordable; six months at hot-tier rates is not. If you intend to be compliant in 2026 and 2027 without lighting your observability budget on fire, the tiering decision is upstream of the compliance decision.
Index by what investigators actually pivot on
Storing the data is half the problem. The other half is being able to find it. The default observability schema indexes by trace id and timestamp, which is fine for the request "what did the agent do at 14:23:09" but useless for the request "show me every agent decision in the last 90 days that touched customer record 41822." Forensic indices need to pivot on business entities — customer id, transaction id, ticket id, document id — not just on observability identifiers.
The pragmatic approach is to extract those entity ids during request capture and index them as separate columns or fields. When the CRO asks "what did the agent do to this customer," you do an entity lookup and get back every trace that touched the customer, ordered by time, with the prompt rev and tool result for each. Without that index, the same query becomes a full scan of a year of data.
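A sketch of that entity index, using sqlite3 purely for illustration; in production these would be indexed columns in your trace store, and the argument keys being scanned are assumptions about your tool schemas:

```python
import sqlite3

conn = sqlite3.connect("forensics.db")
conn.execute("""CREATE TABLE IF NOT EXISTS trace_entities (
    trace_id TEXT, entity_type TEXT, entity_id TEXT, ts REAL)""")
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_entity ON trace_entities (entity_type, entity_id)")

def index_entities(trace_id: str, ts: float, tool_calls: list[dict]) -> None:
    """Pull business ids out of tool-call arguments and index them against the trace."""
    for call in tool_calls:
        args = call.get("arguments", {})
        for key in ("customer_id", "transaction_id", "ticket_id", "document_id"):
            if key in args:
                conn.execute("INSERT INTO trace_entities VALUES (?, ?, ?, ?)",
                             (trace_id, key, str(args[key]), ts))
    conn.commit()

# "What did the agent do to customer 41822" becomes an index lookup, not a scan:
rows = conn.execute("""SELECT trace_id, ts FROM trace_entities
                       WHERE entity_type = 'customer_id' AND entity_id = '41822'
                       ORDER BY ts""").fetchall()
```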
Replay is the proof, not the goal
The reason to capture all of this is not just to read it — it's to replay it. Deterministic replay means feeding the captured prompt, decode config, model id, and tool results into a harness and getting the same output (or a measurably similar output, since some models are not bitwise deterministic even at temperature zero). Replay turns the forensic record from a static log into an executable artifact: you can mutate the prompt and re-run, swap models and re-run, fix a tool bug and verify the agent now produces the correct action on the bad input.
This is also how you regression-test before a model rotation. Before swapping providers, replay the last 30 days of agent decisions through the new model and diff. The decisions that diverge are the regressions you would have shipped to production blind. Teams that have this in place treat model rotations as routine; teams that don't treat them as a betting event.
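A sketch of that pre-rotation replay, assuming the captured records carry the wire prompt and decode config from the earlier sections; the client interface and the .action field on the response are assumptions:

```python
def replay_and_diff(records: list[dict], client, candidate_model: str) -> list[dict]:
    """Re-run captured decisions through a candidate model and report divergences."""
    divergences = []
    for rec in records:
        replayed = client.chat(
            model=candidate_model,            # the only variable that changes
            messages=rec["wire_prompt"],      # captured post-mutation messages
            **rec["decode_config"],           # pinned temperature, top_p, seed, etc.
        )
        # Exact-match diff on the emitted action; free-text outputs need a
        # semantic comparison instead. The .action field is an assumption.
        if replayed.action != rec["original_action"]:
            divergences.append({
                "trace_id": rec["trace_id"],
                "was": rec["original_action"],
                "now": replayed.action,
            })
    return divergences
```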
The discipline is to capture before you need it
Every forensic capability listed above costs little during steady state — a few extra columns in a trace row, a content-addressed table, a tiered-storage policy. It costs a great deal the first time you need it and don't have it, because there is no retroactive way to capture what the model saw last Tuesday. The agent that issued the bad refund is gone; the prompt rev was overwritten; the tool result was a transient API call; the conversation history was summarized and discarded.
Build the capture before the first incident. The team that does this looks slow in the first quarter and looks fast every quarter after, because every incident lands as a query against the trace store rather than as an archaeology project across six engineers and a long Slack thread. The team that doesn't will eventually be asked, in a high-stakes meeting, "what produced this output," and will have to answer "we don't know."
- https://artificialintelligenceact.eu/article/12/
- https://artificialintelligenceact.eu/article/19/
- https://www.helpnetsecurity.com/2026/04/16/eu-ai-act-logging-requirements/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-events/
- https://www.mdpi.com/2079-9292/15/1/56
- https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-8/
- https://oneuptime.com/blog/post/2026-02-06-tiered-storage-opentelemetry-data/view
- https://logz.io/blog/warm-tier/
- https://medium.com/@sharanharsoor/cost-optimization-in-llm-observability-how-langfuse-handles-petabytes-without-breaking-the-bank-0b0451242d1e
- https://tetrate.io/learn/ai/mcp/mcp-audit-logging
- https://medium.com/@1nick1patel1/the-agent-incident-postmortem-template-i-reuse-every-week-a41c554e8fc8
- https://latentmesh.ai/blog/who-owns-the-agents-mistake/
- https://www.evidentlyai.com/blog/ai-failures-examples
