Your Agent's Audit Log Records Everything Except the Reason
Compliance forwards you a ticket. A customer was denied a refund by your support agent three weeks ago, they have escalated, and now someone needs to explain the decision. You feel calm about this, because you instrumented everything. Every prompt, every tool call, every retrieved chunk, every token count, every latency number — it is all in the trace, and you can pull it up in seconds.
You pull it up. You can see the agent received the refund request. You can see it called get_order_history, then check_return_window, then lookup_policy. You can see the exact policy text it retrieved. You can see the final message it sent: refund denied. The trace is complete. Every span is green. And you still cannot answer the question, because the trace shows you that the agent denied the refund and shows you everything it looked at, but it does not show you why those inputs added up to no. The reason lived in how the model weighed the context, and that weighing was never an artifact. It was never written down anywhere.
This is the gap between a trace and an explanation, and almost every team that says "we have full observability" has not noticed they only built the first half.
A complete trace is not a complete answer
The confusion is understandable, because for traditional software a trace really was an explanation. If a deterministic function returns the wrong value, the stack trace plus the inputs is the reason — you can read the branches, follow the conditionals, and the logic that produced the output is right there in the code. The execution path and the justification are the same object.
Agents broke that equivalence quietly. A single agent action now spans multiple model calls, several tool invocations, a retrieval step or two, and maybe a handoff to a sub-agent — and the decisive step is not any of those. It is the moment the model takes a pile of context and collapses it into a choice. That collapse happens inside a forward pass. It does not emit a span. It does not write a row. Your logging captures everything around the decision and nothing of it.
So you end up with a perfect record of the wrong layer. You logged the inputs and the outputs because those are the things that are easy to serialize. Reasoning is not a string your framework hands you; it is a property of how the weights responded to the prompt, and unless you explicitly asked the model to externalize it, it evaporated the instant the response finished streaming. "We can see what happened" is true. "We can see why" was never in scope, and nobody said so out loud.
The practical test is simple. Pull any past decision your agent made and ask: could I defend this specific outcome to someone who is unhappy about it, using only what I stored? Not "could I see the steps" — could I state the reason. For most teams the honest answer is no, and they find that out for the first time when a regulator, a lawyer, or an angry customer is already on the line.
"Full observability" is a smaller claim than it sounds
Observability and explainability get used as synonyms, and they are not even close. Observability answers what the system did: which tools fired, with what arguments, how many tokens, how long each step took. It is the operational view, and it is genuinely valuable — for latency regressions, for cost, for catching a tool that throws. Explainability answers why the system did it: why this tool and not that one, why this output given those inputs, whether the logic actually holds.
You can have flawless observability and zero explainability at the same time. That is, in fact, the default state of a well-instrumented agent. The dashboards are full, the traces are clean, the p99 looks great, and not one byte of it tells you why the refund was denied.
The reason this gap survives so long undetected is that observability tooling is sold as the complete answer, and for debugging it nearly is. When an engineer debugs a bad agent run, they reconstruct the why in their own head — they read the retrieved policy, look at the prompt, and infer the reasoning. That works because the engineer is supplying the missing layer with their own judgment, in real time, for one case. It does not scale, it is not durable, and it is not something you can hand to a non-engineer six months later. The "why" was never in the log; it was always in the debugger's head. Audit is the moment that arrangement falls apart, because audit needs the reason to be an artifact, attributable and frozen in time, not a fresh inference by whoever happens to be looking.
The reason in the log might be fiction
The obvious fix is to make the model show its work: log the chain-of-thought and call it the explanation. This is better than nothing, and it is also a trap if you treat the logged reasoning as ground truth.
The uncomfortable finding from interpretability research is that a model's stated reasoning is not a reliable readout of the computation that produced its answer. A 2025 study of chain-of-thought reasoning "in the wild" (arXiv:2503.08679) documented implicit post-hoc rationalization: ask a model "is X bigger than Y?" and "is Y bigger than X?" separately, and it will sometimes answer yes to both, then generate a fluent, confident justification for each contradictory answer. In those cases the reasoning was not the cause of the answer. The answer came first, and the reasoning was assembled afterward to fit it.
The same work catalogs restoration errors — the model makes a mistake, silently corrects it, and never mentions the correction — and unfaithful shortcuts, where the visible logic is plainly invalid but lands on the right answer anyway. Rates vary by model: lighter models rationalized at double-digit percentages in some tests, frontier models far less, but none were perfectly faithful. The point is not that chain-of-thought is worthless. It is that an unverified reasoning trace is a plausible story, and a plausible story is exactly the thing you do not want to hand a regulator while presenting it as fact.
This matters because the failure mode is silent and asymmetric. A fabricated rationale does not look fabricated. It looks like a clean, well-structured explanation — often better-written than a real one. If your audit process is "retrieve the logged reasoning and read it aloud," you have built a system that produces confident answers to why with no guarantee the answer is true, and a wrong-but-fluent reason in a legal proceeding is worse than no reason at all.
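The paired-question pattern described above can be turned into a cheap probe: ask both orderings of a comparison in separate calls and flag pairs whose answers cannot both be true. The sketch below is illustrative only. It assumes a hypothetical `ask` callable that wraps whatever model client is in use and returns the reply text; the function names and the yes/no parsing are assumptions, not the method of the cited study.

```python
from typing import Callable, Optional


def parse_yes_no(text: str) -> Optional[bool]:
    """Map a free-text reply to True (yes), False (no), or None (unclear)."""
    words = text.strip().lower().split()
    first = words[0].strip(".,:;!") if words else ""
    if first == "yes":
        return True
    if first == "no":
        return False
    return None


def contradictory_pair(ask: Callable[[str], str], x: str, y: str) -> bool:
    """Return True when the model gives the same answer to both orderings.

    `ask` is a hypothetical wrapper around a model call. For two strictly
    different quantities, "X > Y" and "Y > X" cannot both be yes or both be
    no, so matching answers mean at least one justification was fitted to
    an answer rather than producing it.
    """
    forward = parse_yes_no(ask(f"Is {x} bigger than {y}? Answer yes or no first."))
    reverse = parse_yes_no(ask(f"Is {y} bigger than {x}? Answer yes or no first."))
    if forward is None or reverse is None:
        return False  # unparseable replies are not counted as contradictions
    return forward == reverse
```

A probe like this only measures how often the model contradicts itself on questions with a known structure. It says nothing about whether any individual logged rationale is faithful, which is why the stored reason still needs to be labeled as a self-report.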
Capture rationale at decision time; do not reconstruct it later
If the reason is not in the trace, and the model's after-the-fact story is not trustworthy, the move is to capture rationale as a deliberate, structured step at the moment of decision — not to recover it afterward.
Three patterns, in increasing strength:
- Make the model commit to a reason before it acts. Force a structured output where the agent states the decision, the principal factors behind it, and the policy or rule it is applying — and require that before the action executes, not after. A reason produced before the action at least has a causal claim on it; a reason produced after is free to be a rationalization. This does not make the reason perfectly faithful, but it changes its status from narration to commitment.
- Snapshot the decision-time inputs as a first-class record. A decision record is not the raw trace. It is a small, structured artifact: which policy version was in force, what the retrieved facts were, what confidence the agent reported, which factors it weighed and how, what alternatives it considered and rejected. The trace says the agent called lookup_policy; the decision record says this clause, this version, applied this way, and here is what it ruled out. When the policy changes next quarter, the record still shows what was true the day the decision was made.
- Treat the rationale as data, not prose. Store reasons as structured fields (factor, weight, policy ID, confidence, outcome) so they are queryable. Then you can ask questions across decisions: how often did "insufficient account history" drive a denial, did that rate jump after a prompt change, are the reasons clustering in a way that looks like bias. Free-text reasoning is unauditable at scale; structured rationale is the difference between a debugging tool and a governance tool.
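The three patterns above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the `DecisionRecord` fields, the `act_with_rationale` gate, the `denial_drivers` query, and the policy identifiers in the usage are all hypothetical names chosen for the example, and the in-memory list stands in for whatever durable store is actually used.

```python
from collections import Counter
from dataclasses import asdict, dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Factor:
    name: str      # e.g. "outside_return_window"
    weight: float  # relative importance as reported by the agent


@dataclass(frozen=True)
class DecisionRecord:
    decision: str                            # e.g. "deny_refund"
    policy_id: str                           # which clause was applied
    policy_version: str                      # version in force at decision time
    factors: Tuple[Factor, ...]              # what the agent says it weighed
    rejected_alternatives: Tuple[str, ...]   # what it considered and ruled out
    confidence: float                        # self-reported, not calibrated
    source: str = "model_self_report"        # faithfulness caveat, stored with the reason


def act_with_rationale(record: DecisionRecord,
                       store: List[dict],
                       execute: Callable[[str], None]) -> None:
    """Persist the committed rationale first, then run the action."""
    if not record.factors or not record.policy_id:
        raise ValueError("refusing to act without a stated reason and policy")
    store.append(asdict(record))  # the reason is on record before the action runs
    execute(record.decision)


def denial_drivers(store: List[dict]) -> Counter:
    """Count which top-weighted factor drove each stored denial."""
    counts: Counter = Counter()
    for rec in store:
        if rec["decision"].startswith("deny"):
            top = max(rec["factors"], key=lambda f: f["weight"])
            counts[top["name"]] += 1
    return counts


# Usage with made-up values: the record is stored before the action executes.
store: List[dict] = []
executed: List[str] = []
record = DecisionRecord(
    decision="deny_refund",
    policy_id="returns-4.2",        # hypothetical policy clause
    policy_version="2025-01",       # hypothetical version label
    factors=(Factor("outside_return_window", 0.8), Factor("item_opened", 0.2)),
    rejected_alternatives=("approve_refund", "offer_store_credit"),
    confidence=0.9,
)
act_with_rationale(record, store, executed.append)
print(denial_drivers(store))
```

The ordering inside `act_with_rationale` is the point of the design: if the record cannot be written, the action does not run, so there is never an executed decision without a committed reason attached to it.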
The cost is real: latency for the extra reasoning step, tokens, schema design, storage. But notice the alternative cost. Reconstructing a reason after the fact means an engineer re-derives it months later, under pressure, from inputs that may no longer reflect the system that made the call — the prompt was updated, the policy moved, the model version rolled. The reconstruction is itself a fresh inference, and a fresh inference is not an audit trail. It is a guess with a timestamp.
Design for the regulator, not the debugger
The deepest mistake is designing the audit trail for the wrong reader. Logs built for engineers optimize for "I can find the bug." Logs built for accountability have to optimize for "a hostile outside party can be convinced the decision was legitimate," and those are different specs.
The external pressure is no longer hypothetical. The EU AI Act's Article 12 makes automatic event logging a hard requirement for high-risk systems, with obligations under Annex III enforceable from August 2026, log retention of at least six months, and penalties reaching into the tens of millions of euros. In US consumer lending, the CFPB has already stated the position plainly: a creditor cannot excuse a vague denial reason by claiming its model is too complex to explain. If the model cannot produce a specific, accurate reason, the model cannot be used for that decision. "It's a black box" is not a legal defense; it is an admission.
So design backward from the audit, not forward from the framework's defaults. A few properties that distinguish a defensible trail from a debugging convenience:
- Attribution. Every decision record ties to a specific principal, model version, prompt version, and policy version. "The agent decided" is not attributable; "this configuration, on this date, decided" is.
- Immutability. The record is frozen at decision time and tamper-evident. A reason you can quietly edit later is not evidence.
- Reconstruction without the live system. You should be able to answer why from the stored record alone, without re-running the agent — because by the time anyone asks, the agent is three versions downstream and the answer it gives now is not the answer it gave then.
- Faithfulness caveats, stated. If a reason is a model self-report, label it as one. Do not let a post-hoc rationalization get promoted to "the official reason" because it was the only text in the log.
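One common way to get the immutability and reconstruction properties listed above is an append-only log in which each entry's hash covers the entry before it. The sketch below uses only the Python standard library; the field names (`model_version`, `prompt_version`, `reason_source`, and so on), the helper names, and the sample values are assumptions made for illustration, not a prescribed schema.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log: List[dict], payload: dict) -> dict:
    """Append a decision record whose hash is chained to the prior entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"payload": payload, "prev_hash": prev_hash,
             "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry


def first_tampered_index(log: List[dict]) -> Optional[int]:
    """Return the index of the first entry that fails verification, else None."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        expected = _digest(prev_hash, entry["payload"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return i
        prev_hash = entry["hash"]
    return None


def state_reason(entry: dict) -> str:
    """Render the stored reason without calling the model or the live agent."""
    p = entry["payload"]
    return (f"{p['decision']} under policy {p['policy_id']} "
            f"(version {p['policy_version']}), decided by {p['model_version']} "
            f"with prompt {p['prompt_version']}. "
            f"Reason ({p['reason_source']}): {p['reason']}")


# Usage with made-up identifiers.
log: List[dict] = []
append_record(log, {
    "decision": "deny_refund",
    "policy_id": "returns-4.2", "policy_version": "2025-01",
    "model_version": "model-a", "prompt_version": "prompt-17",
    "reason_source": "model self-report",
    "reason": "purchase was outside the return window",
})
print(state_reason(log[0]))
print(first_tampered_index(log))
```

A hash chain on its own only detects edits by someone who does not recompute every later hash. For the trail to count as evidence, the latest hash also has to be anchored somewhere the writer cannot quietly rewrite, such as a separate system or a write-once store.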
None of this is exotic engineering. It is mostly a decision to spend tokens and schema effort on the why layer with the same seriousness you already spend on the what layer — and to make that decision now, while the only cost is a sprint, rather than later, while the cost is a deposition.
The question to ask before you need the answer
Go pull a real decision your agent made last month — a denial, an escalation, a refusal, anything with a stakeholder who could be unhappy. Try to write the one-paragraph reason a regulator would accept, using only what you stored. If you can, your audit trail is real. If you find yourself opening the prompt and inferring, you have an observability system wearing an audit trail's name tag.
The fix is not more logging. You almost certainly log too much already. The fix is to recognize that why is a separate artifact from what, that it has to be captured deliberately at decision time, that the model's own story about its reasoning is a draft and not a verdict, and that the reader you are building for is not a friendly engineer with a debugger — it is a skeptical outsider, months from now, who was not in the room. Build the trail that survives that conversation, and the debugging case comes free. Build only the debugging case, and you will discover the gap at the worst possible moment: with the question already asked, and the reason already gone.
- https://www.isaca.org/resources/news-and-trends/industry-news/2025/the-growing-challenge-of-auditing-agentic-ai
- https://www.ibm.com/think/insights/building-trustworthy-ai-agents-compliance-auditability-explainability
- https://artificialintelligenceact.eu/article/12/
- https://www.helpnetsecurity.com/2026/04/16/eu-ai-act-logging-requirements/
- https://arxiv.org/abs/2503.08679
- https://www.consumerfinance.gov/about-us/newsroom/cfpb-issues-guidance-on-credit-denials-by-lenders-using-artificial-intelligence/
- https://www.skadden.com/insights/publications/2024/01/cfpb-applies-adverse-action-notification-requirement
- https://atlan.com/know/what-are-decision-traces-for-ai-agents/
- https://www.elixirdata.co/blog/ai-agent-decision-traces-vs-logs-audit-trail-compliance
- https://www.leapter.com/post/why-your-ai-agents-cant-pass-an-audit
