
The Agent Flight Recorder: Capture These Fields Before Your First Incident

Tian Pan · Software Engineer · 12 min read

The first time an agent goes sideways in production — it deletes the wrong row, emails the wrong customer, burns $400 of inference on a single task, or tells a regulated user something legally exposed — the team opens the logs and discovers what they actually have: a CloudWatch stream of tool-call names with truncated arguments, a "user prompt" field that captured only the latest turn, and no record of which model version actually ran. The provider rolled the alias forward two weeks ago. The system prompt lives in a config service that wasn't snapshotted. Temperature wasn't logged because the framework default was 0.7 and "everyone knows that." The tool result that triggered the bad action exceeded the log line size and got truncated to "...".

You cannot reconstruct the decision. You can only guess. Six months later you have a pile of "why did it do that" reports with no answers, and the team starts treating the agent like weather — something that happens to you, not something you debug.

The flight recorder discipline is the cheapest thing you will ever ship that prevents this, and the most expensive thing you will ever ship if you wait until the first incident to start. The fields below are the bare minimum, the storage shape is non-negotiable, and the sampling and privacy boundaries have to be designed alongside — not retrofitted.

Why your existing logs cannot reconstruct the decision

Classical request logging assumes the request is the input and the response is the output. For an LLM agent, neither is true. The "input" is a bundle: a system prompt that is itself a large versioned artifact, a tool registry whose schemas are part of the contract, a context window assembled by a non-deterministic truncation policy, RAG-injected chunks pulled from an index that itself drifts, sampling parameters that change the output distribution, and a model identifier that is usually an alias the provider can repoint without notice. The "output" is a sequence: a chain of tool calls and tool results threaded through multiple model turns, possibly with thinking tokens the provider exposes only conditionally.

If any one of those inputs is missing from your record, you cannot replay the decision. And if you cannot replay it, you cannot tell whether the agent did the wrong thing because the model regressed, the prompt drifted, the index was stale, the tool returned different bytes than you expected, or the sampling rolled differently. Every postmortem then collapses into "we updated the prompt and it seems better now."

The classical SRE habit of recording metrics, latency, and a sampled subset of request bodies is not enough. An agent is a non-deterministic distributed system whose observability needs are closer to a payments ledger than a stateless API. You need an append-only record of every byte that crossed the boundary into and out of the model, with the versioned context that produced it, retained long enough to outlast the bug report.

The minimum field set the recorder must capture

There are no optional fields here. Every one of these has been the decisive piece of evidence in someone's postmortem.

Resolved model identity, not the alias. Capture the actual model version the provider returned, not the alias you sent. claude-sonnet-4 is an alias. claude-sonnet-4-20250929 is a version. Aliases roll. Versions don't. If your provider returns a model field in the response, log that field, not the one you put in the request. If it doesn't, pin to a versioned identifier in the request and never use bare aliases in production.
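
A minimal sketch, assuming the Anthropic Python SDK: the response object carries a model field with the version that actually served the request, and that is the value that belongs in the record. Other providers expose the same thing under a similar field; log whichever one the response carries.

```python
import json
import logging

import anthropic

client = anthropic.Anthropic()

def recorded_call(messages, requested_model="claude-sonnet-4-20250929"):
    response = client.messages.create(
        model=requested_model, max_tokens=1024, messages=messages
    )
    logging.info(json.dumps({
        "gen_ai.request.model": requested_model,  # the identifier we sent
        "gen_ai.response.model": response.model,  # the version that actually ran
    }))
    return response
```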

System prompt content hash and a pointer to the immutable artifact. The system prompt is a large blob that changes often. Storing the full text in every span is wasteful and creates a PII surface. Store a content hash (SHA-256 of the rendered prompt, after all template variables are interpolated) on every call, and store the prompt body itself once, in an immutable registry keyed by that hash. This is the same pattern as a content-addressable Git blob, and for the same reason: you want every trace to point at exactly what ran, even if the prompt has been updated 40 times since.
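
A minimal sketch of the content-addressable pattern, with an in-memory dict standing in for the immutable registry: hash the rendered prompt after interpolation, store the body once under that hash, and put only the hash on each span.

```python
import hashlib

registry: dict[str, str] = {}  # stand-in for an immutable, write-once store

def register_prompt(template: str, **variables) -> str:
    rendered = template.format(**variables)  # hash *after* interpolation
    digest = hashlib.sha256(rendered.encode("utf-8")).hexdigest()
    registry.setdefault(digest, rendered)    # first write wins; never overwrite
    return digest                            # this hash goes on every call's span
```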

Full tool registry snapshot, not just the names. The tool schemas are part of the input — they affect the model's selection and argument generation. A tool that gained an optional parameter last Tuesday changed the agent's behavior even though no one touched the agent code. Snapshot the full tool registry (names, descriptions, JSON schemas) by content hash, the same way you snapshot the system prompt, and pin the hash on every agent run.
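
The same sketch extends to the tool registry; the one extra step is canonicalizing the JSON before hashing, so that key order and whitespace cannot produce two different hashes for the same registry.

```python
import hashlib
import json

def tool_registry_hash(tools: list[dict]) -> str:
    # Canonical form: sorted keys, no insignificant whitespace.
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```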

Sampling parameters, every one. Temperature, top-p, top-k, max tokens, presence and frequency penalties, stop sequences, and the seed if you set one. The OpenTelemetry GenAI semantic conventions formalize most of these as gen_ai.request.* attributes; if you are starting fresh, follow that schema so your recorder is portable across vendors.
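
A minimal sketch of what that looks like with the OpenTelemetry Python API; the attribute names follow the GenAI semantic conventions, and the values shown are placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-flight-recorder")

with tracer.start_as_current_span("chat claude-sonnet-4-20250929") as span:
    span.set_attribute("gen_ai.request.temperature", 0.7)
    span.set_attribute("gen_ai.request.top_p", 0.95)
    span.set_attribute("gen_ai.request.max_tokens", 1024)
    span.set_attribute("gen_ai.request.stop_sequences", ["</answer>"])
    span.set_attribute("gen_ai.request.seed", 42)  # only if you actually set one
    # ... make the model call inside this span ...
```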

The complete context that was sent. Not just the latest user turn. The full message array, including all prior turns, all tool results that were threaded back in, and the exact RAG chunks that were injected — each chunk tagged with a source identifier and the version of the index it came from. If your truncation policy dropped earlier turns, record what was dropped and why. If you cannot answer "what did the model see" to byte equivalence, your recorder is incomplete.
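
One way to make "what did the model see" answerable is to write the whole bundle as a single record. A sketch, with illustrative (not standardized) field names:

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    source_id: str       # where the chunk came from
    index_version: str   # version of the index it was pulled from
    text: str            # the exact text injected into the context

@dataclass
class ContextRecord:
    messages: list[dict]                  # full message array, every prior turn
    retrieved: list[RetrievedChunk] = field(default_factory=list)
    dropped_turns: list[dict] = field(default_factory=list)  # what truncation cut
    truncation_reason: str | None = None                     # and why it cut it
```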

The full response, including thinking tokens where the provider exposes them. Some incidents are debuggable only by reading the model's reasoning chain. Capture the thinking output when the API returns it, store it under the same access controls as the visible response, and treat its retention with the same rigor.

Every tool call with full arguments and full results. The arguments the model emitted, the resolved arguments after any post-processing, the result the tool returned (full payload, not truncated), and the latency. Side-effecting tools — anything that writes, sends, pays, or deletes — need 100% capture. No exceptions, no sampling, no log line size limits.
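
A sketch of a capture wrapper around tool execution; record() stands in for a write to the append-only store described below.

```python
import time

def record(event: dict) -> None:
    ...  # stand-in for a write to the append-only forensic store

def run_tool_recorded(name: str, fn, raw_args: dict, resolved_args: dict):
    start = time.monotonic()
    result = fn(**resolved_args)
    record({
        "tool.name": name,
        "tool.raw_arguments": raw_args,            # exactly what the model emitted
        "tool.resolved_arguments": resolved_args,  # after post-processing
        "tool.result": result,                     # full payload, never truncated
        "tool.latency_ms": (time.monotonic() - start) * 1000,
    })
    return result
```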

A stable session and trace identifier that ties the whole task together. A single user task may produce dozens of model calls across multiple agent turns. They have to join. The OpenTelemetry GenAI conventions give you gen_ai.conversation.id and gen_ai.agent.id for this. Use them.
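
Stamping them is one line each; the sketch below assumes the session identifier comes from whatever session layer already exists.

```python
from opentelemetry import trace

def stamp_identity(span: trace.Span, session_id: str, agent_id: str) -> None:
    span.set_attribute("gen_ai.conversation.id", session_id)  # joins the whole task
    span.set_attribute("gen_ai.agent.id", agent_id)           # which agent ran
```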

Append-only, months of retention, separate vault

The shape of the store matters as much as the fields. Three properties are non-negotiable.

The store must be append-only. Forensic records that can be edited are not forensic records. If the same incident review can produce different evidence depending on who looked at it second, the record has no value in a regulated context and limited value in any context. Use object storage with versioning and write-once semantics, or a ledger-style append log. The audit trail must outlast the question.
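
A sketch of write-once semantics using S3 Object Lock, assuming a bucket created with Object Lock enabled; in compliance mode, no credential can modify or delete the record before the retention date passes.

```python
import datetime
import json

import boto3

s3 = boto3.client("s3")

def append_trace(trace_id: str, record: dict) -> None:
    s3.put_object(
        Bucket="agent-flight-recorder",  # hypothetical bucket name
        Key=f"traces/{trace_id}.json",
        Body=json.dumps(record).encode("utf-8"),
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=datetime.datetime.now(datetime.timezone.utc)
        + datetime.timedelta(days=180),
    )
```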

Retention is measured in months, not days. A bug report that arrives 90 days after the bad action — common for finance, healthcare, or any flow with delayed human review — needs evidence that still exists. Pick retention to match your slowest feedback loop, not your typical one. If a quarterly compliance audit can ask about an action from the last quarter, you need a quarter of full traces, not a week.

The forensic store is a privacy vault, not a log file. It is the most concentrated PII surface in your system: full user inputs, full model outputs, full tool results that may include account details, transactions, or medical context. Treat its access controls and retention policy with the rigor of a backup vault. Tag spans by sensitivity, restrict raw payload access to a small named role, log every read, and design separate retention for high-risk traces. This is the trade the GenAI semantic conventions encode by storing prompt and response content in events rather than span attributes — events can be filtered or dropped at the collector before they reach indexed storage.

Sample by risk, not by volume

Capturing 100% of everything will, for some workloads, cost more than the inference itself. The fix is not to sample uniformly. It is to sample by the consequences of being unable to replay.

The right discipline is a tiered sampling policy keyed to the path, not the request:

  • 100% on any path that includes a side-effecting tool. Anything that writes, sends, pays, deletes, posts, or mutates external state. The cost of being unable to reconstruct a wrong financial transaction or a leaked customer email is unbounded. Pay for the full record.
  • 100% on high-stakes read paths. Anything that returns a regulated answer (medical, legal, financial) or feeds a decision a human will rubber-stamp.
  • Higher rates on traces that triggered any human-in-the-loop intervention or a guardrail rejection. These are pre-incident signals. Keep them.
  • Lower sampling on low-stakes chat surfaces — small-talk, brainstorming, surfaces with no side effects. 1–10% is reasonable, with the option to flip to 100% temporarily for a debugging window.
  • Always 100% of metadata even when you sample payloads down. Token counts, latency, model version, prompt hash, tool names, and outcomes belong on every trace. Cardinality on these is bounded; the cost is in the payloads.

The corollary: if your gateway routes both a side-effecting workflow and a chat surface, you cannot apply one global sampling rate. The recorder needs to know what kind of path it is on. Tag the span at the call site, not at the collector.
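
Concretely, the policy can be as small as a lookup keyed by risk class, evaluated at the call site. A sketch, with illustrative class names and rates:

```python
import random

SAMPLE_RATES = {
    "side_effecting": 1.0,    # writes, sends, pays, deletes: always capture
    "high_stakes_read": 1.0,  # regulated answers, human rubber-stamp inputs
    "guardrail_flagged": 0.5, # pre-incident signals: keep most of them
    "low_stakes_chat": 0.05,  # small talk, brainstorming
}

def should_capture_payload(risk_class: str) -> bool:
    # Unknown classes fail closed to 100% capture. Metadata is always recorded;
    # only payloads are subject to this decision.
    return random.random() < SAMPLE_RATES.get(risk_class, 1.0)
```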

The recorder is not real until you can replay

The flight recorder has a failure mode that is harder to detect than missing fields: it captures what looks like enough but cannot actually reproduce the run. The schema drifted, the truncation policy changed, the tool registry snapshot was a pointer to a mutable file, the system prompt hash was computed before template interpolation rather than after — any one of these turns the trace into an artifact that looks complete and isn't.

The discipline that proves the recorder works is replay. Take a captured trace, feed it into a sandbox harness that stubs every nondeterministic dependency (the model with the recorded response, every tool with the recorded result, the clock with the recorded timestamp), and verify the agent's behavior reconstructs to byte equivalence within sampling tolerance. If the recorded sampling parameters and seed produce the same tokens, your record is byte-equivalent; if you ran at temperature greater than zero without a seed, your record should still reconstruct the control flow — the sequence of tool calls and decisions — even if individual token sequences vary.
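
A sketch of a control-flow replay check. The trace shape and the run_agent entry point are illustrative, not a standard; the point is that the model and every tool are stubbed with recorded values and the sequence of tool calls must match.

```python
def replay(trace: dict, run_agent) -> None:
    model_turns = iter(trace["model_responses"])
    recorded = [(c["tool.name"], c["tool.resolved_arguments"])
                for c in trace["tool_calls"]]
    replayed: list[tuple] = []

    def stub_model(_messages, **_params):
        return next(model_turns)  # recorded response, never a live call

    def stub_tool(name: str, args: dict):
        replayed.append((name, args))
        return trace["tool_calls"][len(replayed) - 1]["tool.result"]

    run_agent(model=stub_model, tool_executor=stub_tool, context=trace["context"])
    assert replayed == recorded, "replay diverged from the recorded control flow"
```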

Run the replay test as a CI check on a representative sample of production traces. Treat a replay failure the same as a typecheck failure: the build is red. The team that learns the recorder didn't capture system prompt versions during the first incident review now has six months of unanswerable reports. The team that catches that gap in CI ships a fix the same week.

The fields nobody captures until the second incident

Three categories of evidence reliably get missed on the first pass and reliably matter in the second incident.

The harness state, not just the model state. The agent loop has its own state: which step number it is on, which sub-agent invoked it, the budget counters (tokens spent, tool calls made, wall-clock elapsed), the value of any feature flags that gated this run, and the resolved configuration of the routing layer that picked this model. When the bug is "the agent stopped early" or "the agent looped 40 times," the model trace is fine and the harness state is the smoking gun.
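
A sketch of the harness-state record, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class HarnessState:
    step_number: int                 # which loop iteration this call belongs to
    invoking_agent_id: str | None    # the sub-agent's caller, if any
    tokens_spent: int                # budget counters at call time
    tool_calls_made: int
    wall_clock_elapsed_s: float
    feature_flags: dict[str, bool]   # flags that gated this run
    routing_config: str              # resolved config of the model router
```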

The judge configuration on any LLM-as-judge step. When an agent uses an LLM to evaluate or pick between candidates, the judge is itself a production model call with its own version, prompt, and sampling parameters. It needs the same recorder discipline as the primary agent. A judge that silently rolled forward is one of the most common causes of "the metric got better and the product got worse."

The retrieval index version and the exact chunks returned. Not just the query, not just the top-k IDs — the actual chunk text and the version of the index. Indexes are rebuilt. Chunks are re-embedded. Source documents change. An agent that hallucinated a policy because the retrieved chunk was the old version is a different bug from one that hallucinated because the model regressed, and you cannot tell them apart without the chunk text and the index version on the trace.

Build the recorder before the first incident, not after

Every team that has lived through this learns the same lesson: the fields you didn't capture are the fields that mattered. The model version, the prompt hash, the full tool result — these are not exotic asks. They cost a few extra lines per span. The reason they get skipped is that nothing has gone wrong yet, and the team is shipping features. The day something goes wrong, the cost of not having them is paid in trust, not in engineering hours.

Standards are converging. The OpenTelemetry GenAI semantic conventions give you a portable schema for the model-call surface. An IETF draft on agent audit trails proposes a JSON record format for autonomous AI systems. The EU AI Act mandates automatic recording of events for high-risk AI systems starting in August 2026. The shape of the recorder is no longer a research question. The question is whether you wire it up before or after the incident that forces you to.

The cheapest version of this is a wrapper around your provider SDK that adds the eight fields above to a structured log line and writes to an append-only bucket with a 90-day lifecycle. The expensive version is a full agent harness with replay-based CI, content-addressable prompt and tool registries, and a tiered sampling policy keyed to risk class. Most teams need something in between. None of them need zero. The agent is a non-deterministic distributed system. Treat its trace the way payments treat their ledger — and the day the call comes in asking "why did it do that," you will have an answer.
