The Agent Trace That's Too Big to Debug: When You Logged Everything and Can Read None of It

May 17, 2026 · 11 min read

Software Engineer

The standard advice for agent observability is four words long: log the full trace. Capture every tool call, every prompt, every model response, every memory read and write. Teams comply. Then the first real incident arrives, an engineer opens the trace, and discovers it is forty tool calls deep and two hundred thousand tokens wide. The trace is technically complete. It is also practically unreadable.

What follows is a familiar ritual. The engineer scrolls. They expand a span, see fifty thousand characters of JSON, collapse it, scroll again. Ten minutes in, they find the one model turn where the agent picked the wrong tool, buried among thirty-seven turns that did exactly what they were supposed to. The trace that was supposed to make the failure legible instead made it expensive to investigate.

This is not a tooling gap. Every observability vendor will happily ingest your agent spans, and the OpenTelemetry GenAI semantic conventions now give you a clean schema for invoke_agent, chat, and execute_tool spans. The gap is conceptual. We adopted a definition of observability — "did we capture it" — that was correct for request-response services and is quietly wrong for agents. For an agent, capture is the easy part. The hard part, the part that decides whether an incident gets resolved in twenty minutes or two days, is whether a human can find the one thing that mattered while the incident is still warm.

Capture completeness is not an observability strategy

In a traditional service, a trace has maybe five to fifteen spans, and the structure mirrors the call graph you already have in your head. You know the shape before you look. Debugging is mostly a matter of reading the spans in order and noticing which one is red.

An agent trace breaks all three assumptions. The structure is emergent — the agent decided how many steps to take, which tools to call, and when to loop, so you do not know the shape until you read it. The volume is enormous — a single run with eight tool calls routinely produces fifty thousand tokens of log output, and a single span carrying a full prompt can be two hundred thousand characters of JSON on its own. And the failure is rarely a red span. The agent did not crash. It returned a 200, finished its task, and produced an answer that was wrong, slow, or no longer wanted. Nothing in the trace is colored red because nothing technically failed.
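A rough way to see two of these broken assumptions on your own data is to check which spans are marked as errors and where the bytes actually sit. The sketch below assumes spans exported as plain dictionaries with `name`, `status`, and `attributes` keys; that layout and both helper names are hypothetical, not a vendor format.

```python
import json
from typing import Any, Dict, List, Tuple

Span = Dict[str, Any]  # hypothetical export shape: name, status, attributes


def error_spans(spans: List[Span]) -> List[str]:
    """Names of spans marked as failed; the 'red' ones in a trace viewer."""
    return [s["name"] for s in spans if s.get("status") == "ERROR"]


def payload_sizes(spans: List[Span]) -> List[Tuple[str, int]]:
    """(span name, serialized character count) pairs, largest first."""
    sized = [(s["name"], len(json.dumps(s.get("attributes", {})))) for s in spans]
    return sorted(sized, key=lambda pair: pair[1], reverse=True)


# Illustrative run: nothing failed, and one prompt-carrying span dominates.
demo = [
    {"name": "chat", "status": "OK", "attributes": {"prompt": "x" * 200_000}},
    {"name": "execute_tool search_orders", "status": "OK", "attributes": {"query": ""}},
]
red = error_spans(demo)
largest_name, largest_chars = payload_sizes(demo)[0]
```

On a run like this, `red` is empty even though the empty `query` is the actual problem, and nearly all of the characters belong to a span that has nothing to do with it.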

So "log everything" produces a trace where the signal — the three steps that changed the outcome — is statistically drowned by the thirty-seven steps that were fine. The engineer's job becomes a search problem with no index. And searching is the worst possible thing to make a human do under incident pressure, because the cost of the search is paid in the currency you have least of: time before the context goes cold and the on-call engineer moves on.

There is a second, slower failure here too. Industry reports on AI observability costs describe bills rising forty to two hundred percent after teams instrument their agent workloads, because chain-of-thought traces generate ten to fifty times the telemetry of an equivalent API call. So the "log everything" reflex is not free. You are paying a large and growing storage bill for data whose primary effect, in the moment you actually need it, is to slow you down. That is a strange thing to optimize for.

The decision spine: what a human actually needs first

When an engineer opens an agent trace during an incident, they are not asking "what did the agent do?" They are asking a much narrower question: "where did this go wrong?" The answer to that question almost never lives in the token-level detail. It lives in the decisions: the goal the agent was pursuing, the tools it chose, the branch points where it could have gone two ways and picked one.

Call that the decision spine. It is the skeleton of the run: a compact, ordered list of what the agent decided and why, stripped of the prompt bodies and tool payloads. A decision spine for a forty-call trace might be twenty lines long. It fits on a screen. An engineer can read it in fifteen seconds and form a hypothesis — "it called search_orders with an empty query on step nine, that's the divergence" — before expanding a single token-level span.

The token detail still matters. When you need it, you need all of it. But it is reference material, not the front page. The architectural move is to render the spine and the detail as two separate layers, with the spine as the default view and the detail one click away. Most observability tools today invert this: they show you the full trace tree and let you drill in. For a fifteen-span service trace, drilling in is fine. For a forty-span agent trace, drilling in is the scrolling problem with extra steps.
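One way to picture the two layers is a view object that holds only the spine and fetches a payload when a step is expanded. This is a minimal sketch, not any particular tool's design: `SpineEntry`, `TraceView`, and the `load_detail` callback are hypothetical names, and the fake store stands in for whatever backend holds the full spans.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SpineEntry:
    span_id: str
    summary: str  # one line describing the decision


@dataclass
class TraceView:
    spine: List[SpineEntry]
    load_detail: Callable[[str], str]  # hypothetical: full payload by span id
    _cache: Dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Default view: numbered spine lines only, no payloads fetched."""
        return "\n".join(
            f"{i:>2}. {e.summary}" for i, e in enumerate(self.spine, start=1)
        )

    def expand(self, step: int) -> str:
        """The 'one click': load a single step's full detail on demand."""
        entry = self.spine[step - 1]
        if entry.span_id not in self._cache:
            self._cache[entry.span_id] = self.load_detail(entry.span_id)
        return self._cache[entry.span_id]


fetched: List[str] = []  # records which payloads were actually loaded


def fake_store(span_id: str) -> str:
    fetched.append(span_id)
    return '{"span": "%s", "prompt": "..."}' % span_id


view = TraceView(
    spine=[
        SpineEntry("a1", "model requested search_orders"),
        SpineEntry("a2", "tool search_orders(query='')"),
    ],
    load_detail=fake_store,
)
front_page = view.render()  # touches no payloads
detail = view.expand(2)     # loads exactly one
```

The design choice worth noticing is that rendering never calls `load_detail`, so the cost of the front page is independent of how large the underlying prompts are.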

Building the spine is not free, but it is cheap relative to what it saves. Each model turn already declares its intent: the gen_ai.response.finish_reasons field tells you whether it stopped to call a tool, and the tool name and the arguments needed for a one-line summary are right there in the span. A short summarization pass over the trace, run once at ingest or lazily on first open, turns each decision into one readable line. The expensive thing is not generating the spine. The expensive thing is the status quo, where every engineer regenerates the spine in their head, by scrolling, every time.
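A minimal sketch of such a pass, assuming spans arrive as dictionaries with a `start` timestamp and an `attributes` map. The keys `gen_ai.operation.name`, `gen_ai.tool.name`, and `gen_ai.response.finish_reasons` come from the OpenTelemetry GenAI conventions. Everything else is an assumption for illustration: the `args_summary` attribute, the dictionary layout, the helper names, and the literal finish reason `"tool_calls"`, which varies by provider.

```python
from typing import Any, Dict, List, Optional

Span = Dict[str, Any]  # hypothetical export shape: start, attributes


def spine_line(step: int, span: Span) -> Optional[str]:
    """One readable line for a decision-bearing span, or None to skip it."""
    attrs = span.get("attributes", {})
    op = attrs.get("gen_ai.operation.name")
    if op == "execute_tool":
        tool = attrs.get("gen_ai.tool.name", "<unknown tool>")
        # "args_summary" is a hypothetical attribute written at ingest time.
        summary = attrs.get("args_summary", "")
        return f"{step:>2}. tool  {tool}({summary})"
    if op == "chat":
        reasons = attrs.get("gen_ai.response.finish_reasons", [])
        # "tool_calls" is an assumed, provider-specific finish reason.
        outcome = "requested a tool" if "tool_calls" in reasons else "answered"
        return f"{step:>2}. model {outcome}"
    return None  # wrapper spans such as invoke_agent add no decision


def build_spine(spans: List[Span]) -> List[str]:
    """Order spans by start time and keep one line per decision."""
    lines: List[str] = []
    for span in sorted(spans, key=lambda s: s["start"]):
        line = spine_line(len(lines) + 1, span)
        if line is not None:
            lines.append(line)
    return lines


spans = [
    {"start": 2, "attributes": {"gen_ai.operation.name": "execute_tool",
                                "gen_ai.tool.name": "search_orders",
                                "args_summary": "query=''"}},
    {"start": 1, "attributes": {"gen_ai.operation.name": "chat",
                                "gen_ai.response.finish_reasons": ["tool_calls"]}},
    {"start": 3, "attributes": {"gen_ai.operation.name": "chat",
                                "gen_ai.response.finish_reasons": ["stop"]}},
    {"start": 0, "attributes": {"gen_ai.operation.name": "invoke_agent"}},
]
spine = build_spine(spans)
```

Here `spine` holds three lines for four spans: the wrapper span is dropped, and the empty `search_orders` query is visible without opening any payload.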

Jump to the divergence

A decision spine tells you what happened. It does not, by itself, tell you where the run left the rails. Those are different questions, and the second one deserves its own affordance.

About Tian Pan
