The Evidence Locker Your Agent Doesn't Keep

May 31, 2026 · 9 min read

Your traces log every token. They log every tool call, every retry, every retrieval latency, every model id. They look exhaustive. Then a regulator, a customer, or your own incident channel asks the one question that should be easy: what did the model actually see at the moment it decided? And you discover that your traces recorded the questions the agent asked but not the answers it was looking at when it decided.

The retrieved chunks have rotated out of the vector store because the corpus was reindexed last Tuesday. The tool response was a streamed payload, and you stored only its final-state summary because storing the full stream tripled your bill. The system prompt was assembled at runtime from a feature flag that has since flipped twice, and your flag service does not retain historical values by timestamp. You have full observability over what happened: the call graph, the token counts, the latencies. You have nothing about what the model was answering against. That gap is the difference between a trace and a decision record, and most teams have not noticed they only built one of the two.

The Trace Tells You the Shape; The Decision Record Tells You the Substance

A trace is operational. It is built for the SRE question — where did the request go, how long did it take, which call failed. The schema is optimized for that: spans, parent ids, durations, error codes, a sparse handful of attributes per span. Every existing observability vendor extended this schema for LLMs by adding token counts, model ids, and prompt-and-completion strings. That extension is necessary but it is not the same artifact as a decision record.

A decision record is forensic. It is built for a different question — what state of the world did this agent commit to when it acted. That state includes the retrieved documents at the version they were retrieved (not a pointer to a mutable corpus), the tool outputs as the agent received them (not a re-fetch that may have changed), the system prompt as it was assembled at runtime (not the template plus a flag id), the model version including any silent provider-side rollover, and the configuration state of every guardrail and feature flag that gated the agent's behavior. The trace tells you the shape of the run. The decision record tells you the substance the agent had to work with.
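As a rough illustration, the state listed above could be collected into one immutable record. This is a minimal sketch, not a schema from any observability product: every class and field name here (`DecisionRecord`, `RetrievedDoc`, `ToolOutput`, `flags`, and the sample values) is hypothetical.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class RetrievedDoc:
    chunk_id: str   # pointer into the mutable corpus, kept only for cross-reference
    rank: int       # position in the result list at decision time
    text: str       # the content as retrieved, stored by value


@dataclass(frozen=True)
class ToolOutput:
    tool_name: str
    raw_response: str  # what the agent received, not a later re-fetch


@dataclass(frozen=True)
class DecisionRecord:
    decision_id: str
    system_prompt: str        # the runtime-assembled string, not template plus flag id
    model_version: str        # version string reported for this specific call
    retrieved: Tuple[RetrievedDoc, ...]
    tool_outputs: Tuple[ToolOutput, ...]
    flags: Mapping[str, Any]  # flag and guardrail values, not ids


# Illustrative values only.
record = DecisionRecord(
    decision_id="dec-0001",
    system_prompt="You are a refunds assistant. Strict policy mode is ON.",
    model_version="example-model-2026-05-01",
    retrieved=(RetrievedDoc("chunk-17", 1, "Refunds are available within 30 days."),),
    tool_outputs=(ToolOutput("lookup_order", '{"order_id": "A1", "status": "delivered"}'),),
    flags=MappingProxyType({"strict_refund_policy": True}),
)
```

`frozen=True` blocks attribute reassignment, and the tuples and `MappingProxyType` keep the collections read-only, so the record cannot quietly drift after it is written.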

The two artifacts overlap, which is what makes the gap hard to see. Your trace already contains the prompt and the completion. It looks like the decision record.

It is not. The prompt in the trace is the rendered string captured after the flag flipped. The completion is the final answer, and the trace tells you the model returned that completion — but it cannot tell you that the retrieval that produced the context block in that prompt was sourced from a chunk that has since been re-embedded and now sorts at rank 7 instead of rank 1. The trace recorded the surface. The substance underneath has moved.

The Snapshot You Discarded for Cost

Every team that has built non-trivial LLM observability has had a conversation that ends with "we don't store the full retrieval payload, just the chunk ids and the scores." The reasoning is sound on its face: vector retrievals can return dozens of kilobytes per call, an agent loop can issue tens of retrievals per run, and a system handling a million decisions a month is staring at terabytes of evidence with a long tail of access. Storing the chunk ids and re-deriving the content from the vector store at query time is a fiftyfold cost reduction. So that is what gets shipped.
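To make the scale concrete, here is the arithmetic with illustrative numbers. The specific values (50 KB per retrieval payload, 20 retrievals per run, about 1 KB for a record holding only chunk ids and scores) are assumptions picked to sit inside the ranges described above, not measurements from any system.

```python
KB = 1024
TIB = 1024 ** 4

# Assumed, illustrative inputs.
payload_bytes_per_retrieval = 50 * KB   # full retrieved content
pointer_bytes_per_retrieval = 1 * KB    # chunk ids and scores only
retrievals_per_run = 20
decisions_per_month = 1_000_000


def monthly_bytes(bytes_per_retrieval: int) -> int:
    """Bytes written per month if every retrieval stores this much."""
    return bytes_per_retrieval * retrievals_per_run * decisions_per_month


full_tib_per_month = monthly_bytes(payload_bytes_per_retrieval) / TIB
pointer_tib_per_month = monthly_bytes(pointer_bytes_per_retrieval) / TIB
ratio = full_tib_per_month / pointer_tib_per_month

print(f"full payloads: {full_tib_per_month:.2f} TiB/month")
print(f"pointers only: {pointer_tib_per_month:.3f} TiB/month")
print(f"ratio: {ratio:.0f}x")
```

Under these assumptions the full payloads come to a little under one TiB a month, which reaches double-digit terabytes within a year, and the pointer-only record is fifty times smaller.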

The snapshot that gets discarded for cost is exactly the snapshot that lets you reconstruct the agent's epistemic state when it acted. Six months later the vector store has been reindexed twice. The chunk ids in the trace point at content that no longer exists — or worse, at content that exists but has been re-embedded and re-chunked, and now contains slightly different text.

You can re-run the retrieval against the current store and get a result, but that result is not what the agent saw. You have lost the ability to answer the original question — not because you lacked observability, but because your observability was a pointer system and the things it pointed at moved.
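The failure mode is easy to reproduce in miniature. A minimal sketch, assuming a trace entry that stores a chunk id, a score, and (as an addition for this example) a SHA-256 of the chunk text at decision time; the dict `vector_store` stands in for a real vector database, and all names and sample texts are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Stand-in for a vector store: chunk id -> current chunk text.
vector_store = {"chunk-17": "Refunds are available within 30 days of purchase."}

# Decision time: the trace keeps a pointer, not the content.
seen_by_agent = vector_store["chunk-17"]
trace_entry = {
    "chunk_id": "chunk-17",
    "score": 0.91,
    "content_sha256": content_hash(seen_by_agent),
}

# Months later: the corpus is re-chunked and the same id now holds different text.
vector_store["chunk-17"] = "Refunds are available within 14 days, subject to review."


def resolve(entry: dict, store: dict):
    """Follow the pointer; report whether the result is what the agent saw."""
    current = store.get(entry["chunk_id"])
    if current is None:
        return None, False
    return current, content_hash(current) == entry["content_sha256"]


text_now, matches = resolve(trace_entry, vector_store)
print(matches)  # False: the pointer still resolves, but not to the original content
```

The hash only detects the drift. It cannot bring the original text back, which is why the content itself has to be preserved somewhere the reindex cannot reach.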

The same pattern recurs everywhere. Tool responses stored as a final-state summary lose the streaming intermediate states the agent reacted to. Feature flags referenced by id rather than snapshotted-by-value lose their historical state when the flag service garbage-collects old configurations. System prompts referenced by template-id-plus-variables lose their meaning when the template is edited in place and the old version is overwritten. In each case the team has a defensible cost reason. In each case the team has, without quite meaning to, made the decisions irreproducible.

Patterns That Close the Gap

The remediation is conceptually simple and architecturally annoying. The principle is that the decision record must be self-contained — recoverable without depending on any mutable system that existed at decision time. The patterns that achieve this:

Immutable per-decision context bundles. For every agent decision worth preserving, write a single bundle that contains the full assembled system prompt, the full retrieved content (not pointers), the full tool responses as observed, the model id including provider version, and the configuration snapshot of every flag the agent read. Write the bundle to cold storage, address it by content hash, and reference it from the trace. The trace stays cheap; the bundle is the evidence.
Retrieval-result hashing with content-addressed lookup. When you retrieve a chunk, hash the chunk content and store the hash in the trace, then ensure the chunk is preserved in a content-addressed store keyed by that hash. Re-indexing the vector store does not invalidate the chunk — it just produces a new vector for the same content, and the old content remains addressable. This is the same trick that powers LLVM's content-addressable storage and every build cache that survives a refactor: identity by content, not by location.
Snapshot, don't reference, the configuration state. Feature flags, prompt templates, model versions, and tool schemas should be written into the trace as values at decision time, not referenced as ids. The convenience of "look up flag X at time T" is exactly the convenience that breaks when the flag service retires the configuration. Storing the value inline costs bytes; storing the reference costs the ability to answer questions about the past.
Tier retention by evidence grade, not by age. Operational traces are cheap to lose after thirty days because they answer the SRE question and the SRE question has a thirty-day half-life. Decision records are not cheap to lose after thirty days because the auditor, the regulator, or the plaintiff's lawyer arrives on a schedule unrelated to your retention policy. The right partition is not by recency but by criticality: routine decisions can age out, incident-grade decisions are preserved indefinitely, and the policy that distinguishes them lives in the system, not in someone's memory.
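The first three patterns can be sketched together. This is a minimal illustration under stated assumptions: an in-memory dict stands in for cold storage, SHA-256 over canonical JSON is one reasonable choice of content address, and every name (`ContentStore`, `write_bundle`, the field names, the sample values) is hypothetical rather than taken from an existing tool.

```python
import hashlib
import json


class ContentStore:
    """Write-once store keyed by the SHA-256 of the stored bytes."""

    def __init__(self) -> None:
        self._blobs = {}  # hex digest -> bytes; stand-in for cold storage

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # same content, same key: idempotent
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored content does not match its address")
        return data


def write_bundle(store, *, system_prompt, model_version, chunks, tool_responses, flags):
    """Persist one decision's context and return the small reference the trace keeps."""
    # Pattern 2: each chunk is also addressable on its own by content hash.
    chunk_hashes = [store.put(text.encode("utf-8")) for text in chunks]
    # Patterns 1 and 3: full content and flag values go in by value, not by id.
    bundle = {
        "system_prompt": system_prompt,
        "model_version": model_version,
        "chunks": chunks,
        "tool_responses": tool_responses,
        "flags": flags,
    }
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {"bundle_sha256": store.put(canonical), "chunk_sha256": chunk_hashes}


store = ContentStore()
context = dict(
    system_prompt="You are a refunds assistant. Strict policy mode is ON.",
    model_version="example-model-2026-05-01",
    chunks=["Refunds are available within 30 days of purchase."],
    tool_responses=[{"tool": "lookup_order", "response": {"status": "delivered"}}],
    flags={"strict_refund_policy": True},
)
trace_ref = write_bundle(store, **context)
restored = json.loads(store.get(trace_ref["bundle_sha256"]))
```

Serializing with `sort_keys=True` makes the address depend on the content rather than on key order, so writing the same context twice yields the same hash. The trace carries only `trace_ref`; everything needed to answer "what did the agent see" lives in the bundle.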

The annoying part is that none of this is a one-line config change. Every existing observability vendor has built schemas around the trace model, and the decision-record model is a different schema with different write patterns and different retention economics. Teams that have built this layer have built it themselves, usually after an incident where they discovered they could not answer the question that mattered.

The Conversation You Have With the Auditor

The scenario that turns this from an architectural opinion into an operational necessity is the one where someone outside the team asks you to reconstruct a decision. It might be a regulator, under one of the audit-trail provisions of the EU AI Act for high-risk systems that take effect in August 2026 and require continuous structured compliance evidence rather than policy documents. It might be a plaintiff in a lawsuit alleging that your agent denied them a service or approved a fraudulent action. It might be your own incident review, asking why the agent shipped a particular wrong answer to a particular user on a particular Tuesday.

In each case the conversation has the same shape. You are asked to reconstruct what the agent had to work with. You open your observability tool, which is excellent, and you confidently produce the trace — spans, model calls, tool invocations, latencies.

The auditor reads it and asks the follow-up question: what did the agent retrieve, and what did the retrieval contain? You point at the chunk ids. The auditor asks to see the chunks. You explain that the vector store has been reindexed. The auditor writes something in a notebook. The conversation has now revealed that the system is observable in the operational sense and unaccountable in the evidentiary sense, and those are not the same property.

The teams that survive this conversation well are not the ones who logged more aggressively. They are the ones who decided, before the conversation existed, that decision records were a different artifact than traces, and built the infrastructure to produce both. The teams that survive it badly produce the trace with confidence and then watch the auditor calibrate downward in real time.

The Forensics Discipline That Has To Exist

The broader point is that decision-time evidence preservation is a discipline, distinct from operational logging, and it needs to be staffed and budgeted and owned. It is not a feature of your observability stack — your observability stack was built for a different question. It is not a feature of your data warehouse — the warehouse stores facts, not the substrate those facts were derived from. It is not a feature of your prompt management tool — that tool versions the template, but it does not snapshot the runtime-assembled result, and the runtime-assembled result is what the model actually saw.

It is its own layer, with its own schema, its own retention policy, its own cost model, and its own SLOs. The SLO is not uptime; it is reconstructability — given a decision id and a question about the agent's epistemic state at that moment, can you produce a sufficient answer? Teams that can will be able to answer the questions of 2027 with the evidence of 2026. Teams that cannot will discover that "we have full observability" was a sentence about the present tense, and that the question they are being asked about is in the past, and that the past is exactly the tense their architecture did not preserve.
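One way to make reconstructability measurable is to sample stored decisions and check that each can still be rebuilt from preserved evidence alone. A minimal sketch, assuming bundles are JSON blobs in a content-addressed store (modeled here as a plain dict from SHA-256 hex digest to bytes); the required field names and both function names are hypothetical.

```python
import hashlib
import json

# Assumed minimum contents of a bundle; adjust to whatever your records hold.
REQUIRED_FIELDS = ("system_prompt", "model_version", "chunks", "tool_responses", "flags")


def is_reconstructable(bundle_sha256: str, blobs: dict) -> bool:
    """True if the bundle exists, is unaltered, and has every required field."""
    data = blobs.get(bundle_sha256)
    if data is None:
        return False  # aged out or never written
    if hashlib.sha256(data).hexdigest() != bundle_sha256:
        return False  # content no longer matches its address
    try:
        bundle = json.loads(data)
    except ValueError:
        return False  # not valid JSON
    return isinstance(bundle, dict) and all(f in bundle for f in REQUIRED_FIELDS)


def reconstructability_rate(bundle_hashes, blobs: dict) -> float:
    """Fraction of sampled decisions whose evidence can still be produced."""
    hashes = list(bundle_hashes)
    if not hashes:
        return 1.0
    return sum(is_reconstructable(h, blobs) for h in hashes) / len(hashes)


# One intact bundle and one reference whose bundle is gone.
good = json.dumps({field: "" for field in REQUIRED_FIELDS}).encode("utf-8")
good_key = hashlib.sha256(good).hexdigest()
blobs = {good_key: good}
rate = reconstructability_rate([good_key, "0" * 64], blobs)
print(rate)  # 0.5
```

A check like this verifies presence and integrity only. Whether the preserved content is a sufficient answer to a specific question still takes human judgment.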

