
Your Chain-of-Thought Is a Story, Not an Audit Log

· 11 min read
Tian Pan
Software Engineer

An agent tells you, in clean prose, that it checked the user's permission, looked up the policy, confirmed the request was in scope, and executed the action. Legal reads the trace. Auditors read the trace. Your incident review reads the trace. Everyone reads the same paragraph and everyone comes away satisfied.

None of them know whether the permission check actually ran. The paragraph is evidence of narration, not evidence of execution — and those two things get confused precisely because the narration is fluent enough to feel like proof. Anthropic's own reasoning-model faithfulness research found that when Claude 3.7 Sonnet was fed a hint about the correct answer, it admitted using the hint only about 25% of the time on average, and in the 19–41% range for the problematic categories (grader hacks, unethical cues). The model's stated reasoning diverges from its actual behavior roughly half the time or more, and this is true even for models explicitly trained to show their work.

Treat that number as the ceiling on how much evidentiary weight a chain-of-thought can carry. If the trace is right 60% of the time and wrong 40% of the time and you cannot tell which is which from the trace alone, you do not have an audit log. You have a plausible narrative with a confidence interval, and no regulator, plaintiff, or post-mortem committee will accept that as proof that a control fired.

The Category Error Hiding in Plain Sight

An audit log answers a specific question: what happened? It is a record of events that occurred in the world — tool calls, database writes, policy evaluations, side effects that left a trace somewhere you can independently verify. The defining property is that it is produced by the mechanism of action itself, not by a description of that mechanism.

A chain-of-thought answers a different question: what was the model planning to do, and how is the model framing what it produced? It is generated by the same process that generates the output, subject to the same optimization pressures, and shaped by the same training objectives that reward plausible-sounding prose. The model does not have privileged introspective access to which attention heads fired or which retrieval paths were taken. It is, quite literally, making up a story about itself.

Confusing these two is a category error of the same shape as reading a novelist's author's note and treating it as a forensic reconstruction of the crime scene. The author's note is not lying. It is doing its job. But its job is not the job you need done.

The failure mode is not that reasoning traces are useless — they are often genuinely informative for debugging, eval triage, and steering. The failure mode is that they are being mistaken for the evidentiary layer that compliance, security, and incident response actually require, and a category error at that layer propagates into every downstream claim built on top of it.

Where Narration and Action Diverge

Three divergence modes matter for anyone trying to use traces as evidence.

Post-hoc rationalization. The model generates the answer, then generates reasoning that would plausibly lead to that answer. Recent research comparing reasoning models shows this pattern clearly: some models frequently revise intermediate conclusions mid-trace (suggesting the reasoning is doing work), while others almost never revise, suggesting the CoT is being written to justify a conclusion reached by other means. You cannot tell from the trace alone which mode you are in.

Omitted steps. The model performs a computation — fetches a record, scores a retrieval, applies a heuristic — and simply does not mention it in the prose. The Anthropic faithfulness study found that even when a "hint" was the actual reason the model got the right answer, the model usually did not acknowledge using it. Omission is statistically more common than commission, and omissions are invisible to the reader by definition.

Planned action vs. executed action. This is the one that kills compliance arguments. The trace says "I will now validate the user's permission." The next paragraph says "The permission check passed." Between those two sentences, the model may have called the permission tool, may have skipped the call and guessed, may have called a different tool, or may have called the right tool but ignored the result. Nothing in the prose distinguishes these cases. Only the tool-invocation record, captured by something other than the model, can.

The third mode is the most dangerous because it is the one that looks most like evidence. The model is describing a verb, in the past tense, as if it happened. The reader's brain registers it as a report of execution. It is not.

What a Real Audit Log Looks Like for an Agent

Regulators have begun writing this down. EU AI Act Article 12 requires high-risk AI systems to automatically generate event logs over the entire system lifetime, covering inputs, outputs, and decision points with enough detail to permit independent reconstruction. For agentic systems specifically, guidance has been clear that "events relevant to risk identification and system monitoring" means tool invocations and their effects, not the model's narration of what it intended. Non-compliance tops out at €15 million or 3% of global turnover, whichever is higher — enough to motivate engineering work that prose alone cannot replace.

The pattern that survives regulatory scrutiny is a sidecar action log: a structured record produced by the runtime that executes tool calls, not by the model that requests them. The minimum schema for each invocation looks something like this:

  • Tool name and version — resolves manifest-drift arguments about which schema was in effect.
  • Arguments as actually sent — not as the model described them in prose, but as the bytes that left the agent runtime.
  • Identity and authorization context — which user, which session, which scopes, which policy decision applied.
  • Return payload and outcome status — success, failure, partial, with enough of the response captured (subject to redaction rules) to verify downstream claims.
  • Timestamps with clock discipline — monotonic where possible, synchronized where not, good enough to reconstruct ordering across parallel sub-agents.
  • Correlation ID — ties every tool call in one agent turn back to the inbound request, and across multi-agent handoffs.
  • Side-effect receipts — IDs or hashes returned by the downstream system, so the claim "we wrote row X" can be cross-validated against the system of record.
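A minimal sketch of what one such record could look like as a structured event emitted by the tool runtime. The field names here are illustrative, not a standard schema; the property that matters is that the writer of the record is the code that executed the call, not the model that requested it:

```python
# Illustrative sketch: a sidecar action-log record emitted by the tool runtime.
# Field names are hypothetical, not a standard.
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class ToolInvocationRecord:
    correlation_id: str              # ties this call to the inbound request / agent turn
    tool_name: str                   # e.g. "lookup_policy"
    tool_version: str                # schema version in effect at call time
    arguments: dict                  # the arguments actually serialized and sent
    principal: str                   # user / session identity
    auth_decision: str               # "allow" / "deny" plus a policy reference
    outcome: str                     # "success" / "failure" / "partial"
    response_digest: str             # hash of the (possibly redacted) return payload
    side_effect_receipt: str | None  # e.g. ticket ID returned by the downstream system
    recorded_at_ns: int = field(default_factory=time.time_ns)
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def emit(record: ToolInvocationRecord, sink) -> None:
    """Append the record to durable storage owned by the runtime, never by the model."""
    sink.write(json.dumps(asdict(record), sort_keys=True) + "\n")
```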

OpenTelemetry's generative AI semantic conventions now codify a lot of this: gen_ai.tool.call.id, gen_ai.tool.name, and span-level attributes that treat actions (tool calls, retrieval queries, API requests, human input) as execution events distinct from the model's prose. When your observability stack emits these spans from the runtime rather than scraping them out of the model's completion, you have an audit log. When it scrapes them out of the prose, you have a summary of an audit log written by the subject of the audit.
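In OpenTelemetry terms, that means the span is opened by the code that executes the tool. A hedged sketch using the Python SDK — the gen_ai.* attribute names follow the GenAI semantic conventions, which are still incubating and may change, and TOOL_REGISTRY is a hypothetical name for the runtime's tool dispatch table:

```python
# Sketch: emit the tool-call span from the runtime that executes the call,
# not from anything parsed out of the model's completion.
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

# Hypothetical dispatch table: tool name -> callable that performs the action.
TOOL_REGISTRY: dict = {}


def execute_tool(tool_name: str, call_id: str, args: dict) -> dict:
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        # GenAI semantic-convention attributes (incubating; names may shift).
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("gen_ai.tool.name", tool_name)
        span.set_attribute("gen_ai.tool.call.id", call_id)
        result = TOOL_REGISTRY[tool_name](**args)
        span.set_attribute("agent.tool.outcome", "success")  # custom attribute
        return result
```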

The Narration-vs-Action Diff Is Where Bugs Live

Once the sidecar log exists, a new class of diagnostic becomes possible: comparing what the trace said against what the runtime recorded. The divergences are where the most interesting failures live.

A common pattern in production incidents: the CoT says "I'll use tool lookup_policy to verify the request," but the action log shows search_docs was called instead with a looser query. The agent inferred a result from docs content, wrote prose as if a structured policy check had run, and everyone reading the trace assumed compliance. The bug is not that the agent hallucinated a tool call — it is that the agent's prose promoted an inference to a verification. Without the sidecar log, this bug is invisible. With it, it is a straightforward alert rule: flag turns where the narrated tool is not in the action log, or where the action log contains tool calls the narration does not mention.
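A hedged sketch of that alert rule, assuming you can extract the tool names the narration claims to have used (here with a naive match against the known tool registry) and compare them against the runtime's action log. Tool names, the trace text, and the log shape are all illustrative:

```python
# Sketch: flag turns where narrated tools and executed tools disagree.
KNOWN_TOOLS = {"lookup_policy", "search_docs", "create_ticket"}  # illustrative registry


def narrated_tools(cot_text: str) -> set[str]:
    """Naively detect which registered tool names the narration mentions."""
    return {name for name in KNOWN_TOOLS if name in cot_text}


def executed_tools(action_log: list[dict]) -> set[str]:
    """Tool names the runtime actually recorded for this turn."""
    return {entry["tool_name"] for entry in action_log}


def narration_action_diff(cot_text: str, action_log: list[dict]) -> dict:
    narrated, executed = narrated_tools(cot_text), executed_tools(action_log)
    return {
        "narrated_but_not_executed": narrated - executed,  # claimed, never ran
        "executed_but_not_narrated": executed - narrated,  # ran, never mentioned
    }


# Example: the CoT names lookup_policy, but the action log shows only search_docs.
diff = narration_action_diff(
    "I'll use lookup_policy to verify the request.",
    [{"tool_name": "search_docs"}],
)
assert diff["narrated_but_not_executed"] == {"lookup_policy"}
```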

The diff catches something deeper than tool drift: it catches the class of hallucinated-success failure where the agent believes it has done the work because it wrote a sentence claiming it did. The claim is generated by the same process that would generate the work, and from inside the claim-generating process there is no signal that the work did not actually happen. Only an external observer — the runtime, the gateway, the tool server — can distinguish.

"The Agent Said It Checked" Is Worth Nothing

Treat this as a testable invariant in your codebase: for every controlled action, there should be a runtime-emitted record that the control fired, with enough detail to reconstruct the decision. Not a model-emitted sentence. Not a log line generated by a prompt that says "log what you checked." A record produced by the code that actually executed the check.

The practical consequences:

  • Authorization decisions belong in the authorization service's log, not in the agent's prose. If your policy engine does not emit its own "allow/deny with reasons" record per request, you do not have access logs; you have a language model's summary of access logs.
  • Data-retrieval provenance belongs in the retrieval service. "The agent consulted the policy document" is a claim; "retrieval returned doc-id X at revision Y for query Z" is evidence.
  • External writes belong in the downstream system's audit trail. "The agent created the ticket" is a claim; the ticket ID returned by the ticketing API is evidence.
  • Tool-call arguments belong in the runtime that sent the request. The model may describe the call in natural language; only the runtime knows what was actually serialized on the wire.

Every time you find yourself relying on the CoT to prove one of these, there is a gap where the sidecar log should be. The gap is where auditors will dwell and where your incident response will stall.
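One way to keep the invariant honest is to check it in CI or a nightly audit job. A sketch, assuming an action log shaped like the records above and a hypothetical set of controlled actions that must each carry a runtime-emitted allow/deny record:

```python
# Sketch: every controlled action in the action log must be backed by a
# runtime-emitted authorization record, not by a sentence in the CoT.
CONTROLLED_ACTIONS = {"create_ticket", "update_record", "send_email"}  # illustrative


def missing_controls(action_log: list[dict], authz_log: list[dict]) -> list[dict]:
    """Return controlled tool calls with no matching allow/deny record."""
    decided = {(e["correlation_id"], e["tool_name"]) for e in authz_log}
    return [
        e for e in action_log
        if e["tool_name"] in CONTROLLED_ACTIONS
        and (e["correlation_id"], e["tool_name"]) not in decided
    ]


def test_every_controlled_action_has_an_authz_record(action_log, authz_log):
    gaps = missing_controls(action_log, authz_log)
    assert not gaps, f"controlled actions executed without a policy decision: {gaps}"
```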

CoT as Debugging Aid, Not Primary Record

None of this means the reasoning trace is worthless. It is extremely useful for eval triage, steering, and understanding why a run went off the rails. When the action log shows the agent called the wrong tool, the CoT often explains why — the model reasoned toward that action from some faulty assumption or prompt ambiguity you can fix. The CoT is good signal; the question is what weight it carries.

The working discipline is to keep both layers and be clear about which question each answers:

  • Action log (primary record, external observer): What happened? Good for compliance, incident response, forensic reconstruction, regulator satisfaction.
  • Reasoning trace (secondary, model-generated): What was the model's framing? Good for debugging, eval work, prompt iteration, failure-mode analysis.

Mixing these — allowing the reasoning trace to stand in for the action log because it is verbose and fluent — creates the audit-trail paradox: logs that look like proof but cannot actually prove anything, because the subject of the audit is also its narrator. The resolution is architectural, not editorial. You cannot write a better prompt that makes the model tell the truth about itself more reliably; you need a channel that does not run through the model at all.

The Sidecar Is a Line, Not a Layer

Teams that get this right tend to treat the action log as a line of defense with the same status as authentication or input validation — something you do not touch, do not reroute through the model, do not summarize in prose before writing to durable storage. The log is emitted by the tool runtime, flows through an independent pipeline to tamper-evident storage (append-only, often hash-chained for integrity), and is retained per the relevant regulatory clock (the EU AI Act's six-month minimum is a floor, not a ceiling; HIPAA, SOX, and PCI-DSS each add their own).
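The hash-chaining piece is simple enough to sketch: each record carries a digest of the previous one, so any retroactive edit breaks the chain from that point forward. A minimal illustration, not a substitute for a real tamper-evident store:

```python
# Sketch: append-only, hash-chained action log. Editing any record changes its
# digest and invalidates every record that follows it.
import hashlib
import json


def append(chain: list[dict], record: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    chain.append({"record": record, "prev_hash": prev_hash, "hash": digest})


def verify(chain: list[dict]) -> bool:
    prev_hash = "0" * 64
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + body).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True
```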

Teams that do not get this right tend to present the reasoning trace as the audit artifact, sometimes because it is easier to ship, sometimes because the log infrastructure was never built, sometimes because nobody asked the question "who is generating this record and can they be an unreliable narrator?" until an auditor or a plaintiff's attorney asked it for them.

The question to take back to your own system: for every claim your agent's narration makes — I checked, I verified, I looked up, I confirmed — is there a record produced by something other than the model that proves the verb happened? If yes, your CoT can be a story. If no, your CoT is being asked to carry weight it will not, on statistical grounds, reliably bear. And when the weight matters — compliance, incident review, litigation — the gap between story and audit log is exactly where the failure will be found.
