
AI-Native Logging: Capture Decisions, Not Just I/O

Tian Pan · Software Engineer · 10 min read

A customer support agent was generating hallucinated troubleshooting steps for 12% of tickets. The HTTP logs showed 200 OK across the board. Latency was normal. Error rates were flat. The system looked healthy by every conventional metric — and it was quietly fabricating answers at scale.

When engineers finally instrumented the decision layer, the root cause emerged in minutes: similarity scores for retrieved chunks were all below 0.4, confidence in the context was 0.28, and yet the model's stated output confidence read 0.91. A massive mismatch — invisible in traditional logs, obvious in a trace that captured the decision state.

This is the fundamental problem with applying conventional logging to LLM systems. I/O logs tell you your system ran. AI-native logging tells you whether it reasoned correctly.

What Conventional Logging Gets Wrong

The pattern inherited from web services is simple: log inputs and outputs, record status codes and latency, alert when error rates spike. This works because deterministic services produce the same output for the same input. Failures are binary: the service either returned the right thing or didn't.

LLM agents break all of these assumptions.

A support agent that hallucinates returns HTTP 200 with a plausible-looking response. A retrieval step that pulls in stale or irrelevant documents reports success. A model that picks the wrong tool completes the call without error. The system is running. It is not working. Conventional logs cannot distinguish between these states.

There is a deeper structural problem: conventional logs are flat. An agent processing a single user request might invoke three tools, make four model calls, update memory state twice, and branch its reasoning based on a retrieval result. A linear log stream collapses this entire hierarchy into a sequence of timestamped events with no causal connective tissue. You see that the agent called search_kb, that it then called send_response, but not that the search returned garbage, the model's confidence collapsed, and the response was fabricated anyway.

The decision logic — the stuff that actually determines whether the agent reasoned correctly — lives entirely between those logged events. And in most production systems today, it is completely invisible.

What Gets Lost Between the Lines

Think through what a conventional log captures for a three-step agent interaction:

10:22:34 | tool_call | function=search_kb | status=success
10:22:35 | llm_call | model=claude-3-5 | input_tokens=2500 | output_tokens=180
10:22:36 | tool_call | function=send_response | status=success

Now think through what actually happened:

  • The search query had an embedding mismatch with the index
  • Retrieved chunks had similarity scores of 0.42, 0.38, and 0.35 — all below useful threshold
  • The model recognized the low-quality context (confidence: 0.28) but produced a high-confidence output anyway (0.91)
  • The output contained fabricated steps from a procedure the model "remembered" from training

None of this appears in the log. Three status=success lines, one confused user.

The categories of information that go uncaptured are consistent across production failures:

  • Decision rationale: Why did the agent choose this tool over the alternatives it considered?
  • Rejected alternatives: What did the agent consider and reject, and with what confidence levels?
  • Confidence signals: Where did the model's certainty drop? Where did it recover? Where did input and output confidence diverge?
  • Retrieval quality: What similarity scores did chunks receive? Did the model acknowledge uncertainty about context quality?
  • State mutations: What memory was read before the decision? What was written afterward? How did state shape the next step?
  • Intermediate reasoning: What did the model's chain-of-thought produce between tool calls?

These are not edge-case debugging curiosities. They are the primary diagnostic surface for the most common agent failure modes: hallucination, incorrect tool selection, loop behavior, cascading retrieval degradation.
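
To make the list concrete, here is a minimal sketch of what a structured decision record might look like in Python. The class and field names (DecisionRecord, state_reads, and so on) are illustrative, not a standard schema; they simply give each category above a slot.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DecisionRecord:
    """Hypothetical structured record emitted at one agent decision point."""
    decision_point: str                         # e.g. "tool_selection", "response_generation"
    intent: str                                 # what the agent was trying to accomplish
    selected: str                               # the chosen tool or action
    alternatives: list[str] = field(default_factory=list)        # options considered and rejected
    confidence: Optional[float] = None          # model-reported confidence in the choice
    context_confidence: Optional[float] = None  # confidence in the ingested context
    retrieval_scores: list[float] = field(default_factory=list)  # similarity scores for retrieved chunks
    state_reads: list[str] = field(default_factory=list)         # memory keys read before deciding
    state_writes: list[str] = field(default_factory=list)        # memory keys written afterward
    reasoning: Optional[str] = None             # chain-of-thought fragment, if available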

The Trace-as-Tree Paradigm

The structural fix is to stop thinking about agent logs as linear streams and start thinking about them as trees.

Every user request roots a trace. Every agent step branches it: each model call, tool invocation, retrieval query, and memory operation becomes a child span. Structured attributes attach to each span — not just inputs and outputs, but decision metadata.

Root Span: user_query "billing issue"
├─ Span: agent_step_1
│ ├─ decision_point: tool_selection
│ ├─ intent: "retrieve customer billing information"
│ ├─ alternatives_considered: [query_crm, search_kb, escalate]
│ ├─ selected: query_crm
│ ├─ confidence: 0.92
│ └─ Span: tool_call:query_crm (latency: 342ms, result: customer_record)
├─ Span: agent_step_2
│ ├─ decision_point: response_generation
│ ├─ intent: "explain billing discrepancy to user"
│ ├─ context_confidence: 0.87
│ └─ Span: model_call (tokens: 1250→220, confidence: 0.85)
└─ Span: send_response

This structure does several things flat logs cannot. First, causality becomes explicit: you can see that step 2's reasoning was influenced by step 1's retrieval result. Second, failure surfaces naturally: a span with context_confidence: 0.28 adjacent to a span with output_confidence: 0.91 immediately flags a mismatch worth investigating. Third, the full decision path is reconstructable: you know not just what the agent did but what it could have done instead.
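
With OpenTelemetry (discussed below), this tree falls out of ordinary span nesting: each with-block opens a child of whatever span is currently active. The sketch below is a simplified skeleton under assumed names; query_crm is a hypothetical tool, and the attribute names are illustrative rather than standardized.

from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

def query_crm(query: str) -> dict:
    """Stand-in for a real CRM lookup (hypothetical)."""
    return {"customer_id": "c-123", "balance_due": 42.0}

def handle_request(user_query: str) -> None:
    # Root span: one per user request
    with tracer.start_as_current_span("user_query") as root:
        root.set_attribute("query", user_query)

        # Child span: the first decision point, carrying its decision metadata
        with tracer.start_as_current_span("agent_step_1") as step:
            step.set_attribute("decision_point", "tool_selection")
            step.set_attribute("intent", "retrieve customer billing information")
            step.set_attribute("alternatives_considered", ["query_crm", "search_kb", "escalate"])
            step.set_attribute("selected", "query_crm")
            step.set_attribute("confidence", 0.92)

            # Grandchild span: the tool call itself, nested under the decision
            with tracer.start_as_current_span("tool_call:query_crm") as call:
                record = query_crm(user_query)
                call.set_attribute("result_type", "customer_record")

Without an exporter configured these spans are no-ops; a batch exporter setup appears later in the post.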

The tree representation is not cosmetic. Research on multi-agent system failures found that state-of-the-art open-source agents achieve correctness rates as low as 25%, with failures including malformed tool calls, misinterpreted instructions, and loss of coherent state after 15+ tool calls. Tree traces surface exactly which branching point produced the breakdown. Linear logs just produce a wall of events with a timeout at the end.

What a Useful Agent Log Entry Looks Like

A well-instrumented span captures three surfaces simultaneously.

The cognitive surface records the agent's reasoning: intent, the alternatives it considered, which it selected, and at what confidence. For extended-reasoning models, this includes chain-of-thought fragments and the points where the model revised its thinking. These entries look expensive to store but are cheap — a few hundred bytes per decision point — and are the primary surface for diagnosing why an agent chose incorrectly.

The operational surface records execution: tool names, arguments, results, latency, retry count. This is the layer most teams already instrument. It is necessary but not sufficient.

The contextual surface records the environment: which prompt template version was used, model parameters, retrieval similarity scores, token usage, user and session identifiers for correlation. This layer makes traces reproducible. When you want to understand why the agent behaved differently on Tuesday than Monday, prompt version and retrieval metadata are what tell you.

Concretely, a useful span for a tool-selection decision contains:

{
"trace_id": "a1b2c3d4",
"span_id": "x1y2z3",
"name": "agent_tool_selection",
"decision_point": "tool_selection",
"intent": "retrieve customer billing information",
"selected_tool": "query_crm",
"alternatives": ["search_kb", "escalate"],
"confidence": 0.92,
"model": "claude-sonnet-4-6",
"input_tokens": 1250,
"latency_ms": 12
}

Compare this to what most teams actually log: function name, status, duration. The useful span is roughly four times larger. Given that storage for LLM observability data is negligible compared to inference costs — a few hundred gigabytes per month costs a few dollars — the tradeoff is straightforwardly correct.

OpenTelemetry as the Emerging Standard

The instrumentation landscape has largely converged on OpenTelemetry as the structural foundation. The OTel GenAI semantic conventions — introduced in 2024 and now part of the official semantic conventions, though still evolving — define standardized attribute names so that all tooling interprets spans consistently: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.response.finish_reasons, and the rest.

Most major agent frameworks now emit traces via OpenTelemetry natively. Pydantic AI, LangChain, and smolagents all have OTel support. The platform layer — LangSmith for LangChain-heavy stacks, Langfuse as the open-source option, Arize Phoenix for RAG-intensive systems — consumes these traces and adds evaluation and alerting on top.

The key instrumentation principle is to wrap every decision boundary, not every function call. The overhead of capturing a span comes from the structure and transmission, not the decision metadata itself. A promiscuous approach that instruments every line produces noise that makes debugging harder. A targeted approach that captures model calls, tool selections, retrieval steps, and state mutations produces a trace that is both usable for debugging and affordable to store.

One practical pattern: instrument at 100% sampling for error cases and for spans where confidence drops below a threshold; apply sampling for routine successful operations. This gives you full fidelity where you need it without generating petabytes of redundant data.
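
Confidence values are only known after a span finishes, so a head-based sampler cannot make this call; in practice the pattern is tail filtering, either in the collector or in process. Below is a rough in-process sketch that wraps whatever exporter you already use; the confidence attribute name and the thresholds are assumptions, not a standard.

import random

from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult
from opentelemetry.trace import StatusCode

class ConfidenceAwareExporter(SpanExporter):
    """Keep every error or low-confidence span; sample the routine successes."""

    def __init__(self, wrapped: SpanExporter, keep_rate: float = 0.10,
                 confidence_floor: float = 0.5):
        self._wrapped = wrapped
        self._keep_rate = keep_rate              # fraction of routine spans to keep
        self._confidence_floor = confidence_floor

    def export(self, spans) -> SpanExportResult:
        kept = []
        for span in spans:
            attrs = span.attributes or {}
            confidence = attrs.get("confidence", 1.0)   # assumed attribute name
            is_error = span.status.status_code is StatusCode.ERROR
            if is_error or confidence < self._confidence_floor or random.random() < self._keep_rate:
                kept.append(span)
        return self._wrapped.export(kept) if kept else SpanExportResult.SUCCESS

    def shutdown(self) -> None:
        self._wrapped.shutdown()

At larger scale the same logic usually lives in the OpenTelemetry Collector's tail-sampling processor rather than in application code.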

The Confidence Mismatch Signal

One specific pattern is worth calling out because it appears repeatedly across production failure modes: the confidence mismatch between input context quality and output confidence.

Models that have ingested low-quality context — irrelevant retrieval results, truncated tool outputs, contradictory information in the conversation — sometimes produce high-confidence outputs anyway. This mismatch is the hallucination fingerprint. It is invisible in I/O logs because both the retrieval success and the model response success look identical to a correct interaction.

When you capture retrieval similarity scores alongside model confidence signals, the pattern becomes detectable. A span where retrieval_confidence: 0.31 immediately precedes output_confidence: 0.89 should trigger investigation. You can build automated alerting on this mismatch once it is captured in structured form.
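
Once both numbers live in structured span attributes, the check itself is a few lines. A sketch, assuming attribute names retrieval_confidence and output_confidence and a gap threshold you would tune to your own traffic:

def find_confidence_mismatches(spans, gap_threshold: float = 0.4):
    """Flag spans where the model claimed far more confidence than its context warranted.

    `spans` is any iterable of dicts carrying the (assumed) attributes
    `retrieval_confidence` and `output_confidence` recorded during tracing.
    """
    mismatches = []
    for span in spans:
        retrieval = span.get("retrieval_confidence")
        output = span.get("output_confidence")
        if retrieval is None or output is None:
            continue
        if output - retrieval >= gap_threshold:
            mismatches.append({
                "trace_id": span.get("trace_id"),
                "span_id": span.get("span_id"),
                "retrieval_confidence": retrieval,
                "output_confidence": output,
            })
    return mismatches

# Example: the failure from the opening anecdote would be flagged immediately.
alerts = find_confidence_mismatches([
    {"trace_id": "a1b2c3d4", "span_id": "s1",
     "retrieval_confidence": 0.28, "output_confidence": 0.91},
])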

The same pattern appears in tool selection: an agent that selects a tool with confidence: 0.48 in a domain where it typically operates with confidence: 0.85+ is signaling uncertainty that warrants attention. Without the decision metadata, you cannot distinguish a confident correct action from an uncertain guess that happened to succeed.

The Practical Case for Verbose Tracing

The pushback against richer logging is usually storage cost or performance overhead. Neither holds up under inspection.

Storage is cheap. A fully verbose trace for a single agent interaction — all three surfaces, full span metadata — runs to about 2KB. At 100,000 requests per month, that is roughly 200MB, which costs effectively nothing in object storage; even traces a hundred times more verbose stay in the tens of gigabytes and at most a few dollars a month. Meanwhile, the inference cost for those same requests runs to thousands of dollars. The ratio is so lopsided that storage cost is not a real constraint for any team that is seriously trying to debug agent behavior.

Performance overhead is real but manageable. The solution is asynchronous trace emission: write spans to a local buffer synchronously, flush to the observability backend asynchronously, and accept a bounded delay in trace availability. Most observability SDKs do this by default. The alternative — blocking agent execution on log writes to slow storage — is the actual performance problem, and it is not intrinsic to verbose logging.
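
In the OpenTelemetry Python SDK this is exactly what the batch processor does: spans land in a bounded in-memory queue and a background thread flushes them to the backend. A minimal setup sketch; the endpoint is a placeholder for wherever your collector or platform lives.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="https://collector.example.com/v1/traces"),  # placeholder
        max_queue_size=2048,          # spans buffered in memory before dropping
        schedule_delay_millis=5000,   # background flush interval
        max_export_batch_size=512,    # spans per export request
    )
)
trace.set_tracer_provider(provider)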

The practical case for richer tracing is that the debugging cost it prevents is enormous. A production hallucination that is invisible in I/O logs may require hours of trace replay and reproduction to diagnose. A production hallucination that is visible in a confidence-mismatch span takes minutes.

What This Requires You to Change

AI-native logging is not a new library — it is a discipline applied to the instrumentation you are already building. The shift requires three things.

First, treat decision points as first-class log events. When an agent selects a tool, emit a span with the alternatives and confidence. When a retrieval step returns results, emit similarity scores. When a model call completes, emit the input confidence alongside the output.

Second, structure your logs hierarchically. A flat stream of events is not a trace. A tree of spans with parent-child relationships is. Most modern observability frameworks support this natively; use it.

Third, correlate across surfaces. A trace_id threaded through every span — model calls, tool calls, retrieval steps, state mutations — is what allows you to reconstruct the full decision path when something goes wrong. Without it, you have a collection of events. With it, you have a story.
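
With OpenTelemetry the identifier propagates between spans automatically; the remaining work is stamping the same trace_id onto anything that is not a span, such as plain application logs. A small sketch of pulling the active identifiers for correlation:

import logging

from opentelemetry import trace

logger = logging.getLogger("support-agent")

def log_with_trace(message: str, **fields):
    """Attach the active trace_id to an ordinary log line for later correlation."""
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        fields["trace_id"] = format(ctx.trace_id, "032x")  # same hex form the backend shows
        fields["span_id"] = format(ctx.span_id, "016x")
    logger.info(message, extra={"fields": fields})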

The agents that are reliably debuggable in production are the ones where engineers can look at a trace and see not just what happened, but why the agent made each choice it made, what it was uncertain about, and where the reasoning went wrong. Getting there does not require exotic tooling. It requires treating decision state as log data — and shipping instrumentation that captures it.
