Skip to main content

The Agent Flight Recorder: Capture These Fields Before Your First Incident

· 13 min read
Tian Pan
Software Engineer

The first time an agent goes sideways in production — it deletes the wrong row, emails the wrong customer, burns $400 of inference on a single task, or tells a regulated user something legally exposed — the team opens the logs and discovers what they actually have: a CloudWatch stream of tool-call names with truncated arguments, a "user prompt" field that captured only the latest turn, and no record of which model version actually ran. The provider rolled the alias forward two weeks ago. The system prompt lives in a config service that wasn't snapshotted. Temperature wasn't logged because the framework default was 0.7 and "everyone knows that." The tool result that triggered the bad action exceeded the log line size and got truncated to "...".

You cannot reconstruct the decision. You can only guess. Six months later you have a pile of "why did it do that" reports with no answers, and the team starts treating the agent like weather — something that happens to you, not something you debug.

The flight recorder discipline is the cheapest thing you will ever ship that prevents this, and the most expensive thing you will ever ship if you wait until the first incident to start. The fields below are the bare minimum, the storage shape is non-negotiable, and the sampling and privacy boundaries have to be designed alongside — not retrofitted.

Why your existing logs cannot reconstruct the decision

Classical request logging assumes the request is the input and the response is the output. For an LLM agent, neither is true. The "input" is a bundle: a system prompt that is itself a large versioned artifact, a tool registry whose schemas are part of the contract, a context window assembled by a non-deterministic truncation policy, RAG-injected chunks pulled from an index that itself drifts, sampling parameters that change the output distribution, and a model identifier that is usually an alias the provider can repoint without notice. The "output" is a sequence: a chain of tool calls and tool results threaded through multiple model turns, possibly with thinking tokens the provider exposes only conditionally.

If any one of those inputs is missing from your record, you cannot replay the decision. And if you cannot replay it, you cannot tell whether the agent did the wrong thing because the model regressed, the prompt drifted, the index was stale, the tool returned different bytes than you expect, or the sampling rolled differently. Every postmortem then collapses into "we updated the prompt and it seems better now."

The classical SRE habit of recording metrics, latency, and a sampled subset of request bodies is not enough. An agent is a non-deterministic distributed system whose observability needs are closer to a payments ledger than a stateless API. You need an append-only record of every byte that crossed the boundary into and out of the model, with the versioned context that produced it, retained long enough to outlast the bug report.

The minimum field set the recorder must capture

There are no optional fields here. Every one of these has been the decisive piece of evidence in someone's postmortem.

Resolved model identity, not the alias. Capture the actual model version the provider returned, not the alias you sent. claude-sonnet-4 is an alias. claude-sonnet-4-20250929 is a version. Aliases roll. Versions don't. If your provider returns a model field in the response, log that field, not the one you put in the request. If it doesn't, pin to a versioned identifier in the request and never use bare aliases in production.

System prompt content hash and a pointer to the immutable artifact. The system prompt is a large blob that changes often. Storing the full text in every span is wasteful and creates a PII surface. Store a content hash (SHA-256 of the rendered prompt, after all template variables are interpolated) on every call, and store the prompt body itself once, in an immutable registry keyed by that hash. This is the same pattern as a content-addressable Git blob, and for the same reason: you want every trace to point at exactly what ran, even if the prompt has been updated 40 times since.

Full tool registry snapshot, not just the names. The tool schemas are part of the input — they affect the model's selection and argument generation. A tool that gained an optional parameter last Tuesday changed the agent's behavior even though no one touched the agent code. Snapshot the full tool registry (names, descriptions, JSON schemas) by content hash, the same way you snapshot the system prompt, and pin the hash on every agent run.

Sampling parameters, every one. Temperature, top-p, top-k, max tokens, presence and frequency penalties, stop sequences, and the seed if you set one. The OpenTelemetry GenAI semantic conventions formalize most of these as gen_ai.request.* attributes; if you are starting fresh, follow that schema so your recorder is portable across vendors.

The complete context that was sent. Not just the latest user turn. The full message array, including all prior turns, all tool results that were threaded back in, and the exact RAG chunks that were injected — each chunk tagged with a source identifier and the version of the index it came from. If your truncation policy dropped earlier turns, record what was dropped and why. If you cannot answer "what did the model see" to byte equivalence, your recorder is incomplete.

The full response, including thinking tokens where the provider exposes them. Some incidents are debuggable only by reading the model's reasoning chain. Capture the thinking output when the API returns it, store it under the same access controls as the visible response, and treat its retention with the same rigor.

Every tool call with full arguments and full results. The arguments the model emitted, the resolved arguments after any post-processing, the result the tool returned (full payload, not truncated), and the latency. Side-effecting tools — anything that writes, sends, pays, or deletes — need 100% capture. No exceptions, no sampling, no log line size limits.

A stable session and trace identifier that ties the whole task together. A single user task may produce dozens of model calls across multiple agent turns. They have to join. The OpenTelemetry GenAI conventions give you gen_ai.conversation.id and gen_ai.agent.id for this. Use them.

Append-only, months of retention, separate vault

The shape of the store matters as much as the fields. Three properties are non-negotiable.

The store must be append-only. Forensic records that can be edited are not forensic records. If the same incident review can produce different evidence depending on who looked at it second, the record has no value in a regulated context and limited value in any context. Use object storage with versioning and write-once semantics, or a ledger-style append log. The audit trail must outlast the question.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates