
Your Agent Traces Are Lying: Cardinality, Sampling, and Span Hierarchies for LLM Agents

11 min read
Tian Pan
Software Engineer

Your tracing dashboard says the agent made eight calls to serve a user request. In reality, it made forty-seven. Your head-based sampler quietly dropped most of them. The ones you kept are technically correct but causally useless — child spans orphaned from a root the sampler threw away.

This is not a visualization bug. It is the predictable outcome of pointing distributed tracing infrastructure designed for ten-span HTTP fan-outs at systems that generate hundreds of spans per user turn. Default OpenTelemetry configurations systematically undercount the work agents do, and the teams running those agents usually do not notice until a customer complains about latency the trace viewer says does not exist.

Agent observability is not a harder version of microservice observability. The shape of the data is different, the failure modes are different, and the cost curve is different. Treating it like a spicier web backend is how you end up with a tracing bill that doubles in a quarter while your mean-time-to-diagnose gets worse, not better.

The cardinality math nobody warned you about

Start with a single-turn interaction. A traditional REST endpoint fans out to maybe ten spans: the HTTP handler, a few database queries, a cache read, an external API call. Tracing tools were built with this shape in mind. Tail-sampling processor documentation uses ten-span traces in its examples. Default retention quotas assume it. Span storage pricing is calibrated to it.

Now instrument an agent. A reasonable ReAct loop for a customer support chatbot might look like this for one user message: intent classification (one LLM call), tool selection (one LLM call), parallel tool execution (three tool spans), tool result validation (one LLM call), retrieval (one vector DB query, one reranker call), answer drafting (one LLM call), safety check (one LLM call), tool call for a follow-up lookup (one LLM call plus one API call), final response (one LLM call). That is roughly thirteen operations. Each produces a top-level span plus child spans for HTTP, serialization, and retries. Realistic count: thirty to sixty spans per turn.

Multiply that by a five-turn conversation. Published estimates for mid-sized deployments describe a typical trajectory: 50,000 user messages per day, 200,000 LLM invocations, one million spans, four million metric data points, 400 MB of logs. Cost reports from teams that bolted AI workloads onto existing Datadog, Honeycomb, or New Relic setups show observability bills rising by 40% to 200%, depending on retention and custom metrics.

The ten-to-fifty-times volume multiplier is not the hardest part. The hard part is that the per-span payload is different. Each gen_ai span under the OpenTelemetry semantic conventions wants to carry prompts, completions, token counts, model parameters, and tool arguments. Each of those attributes can be kilobytes. A traditional span is hundreds of bytes. You are not just paying 10x–50x more for spans — you are paying 10x–50x more for larger spans, and most tracing backends charge by attribute size.
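To make the payload problem concrete, here is a minimal sketch of what a single LLM-call span ends up carrying, using the OpenTelemetry Python SDK. The token and model attributes are part of the published gen_ai conventions; the prompt and completion attributes are illustrative, since the exact names for content capture have shifted across semconv versions.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.llm")

def record_llm_call(prompt: str, completion: str, usage: dict) -> None:
    with tracer.start_as_current_span("gen_ai.chat") as span:
        # Small, fixed-size attributes from the gen_ai semantic conventions.
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        span.set_attribute("gen_ai.usage.input_tokens", usage["input_tokens"])
        span.set_attribute("gen_ai.usage.output_tokens", usage["output_tokens"])
        # Content capture is where the bytes come from: prompts and completions
        # run to kilobytes each. These attribute names are illustrative; the
        # exact names for content capture vary by semconv version and SDK.
        span.set_attribute("gen_ai.prompt", prompt)
        span.set_attribute("gen_ai.completion", completion)
```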

Why head sampling silently corrupts your agent traces

Head-based sampling decides at the root span whether to keep a trace. It is fast, stateless, and makes cost predictable. It is also the default in most OTel SDKs. For traditional services it is fine: losing 90% of healthy traces is acceptable because the 10% you keep are representative.
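For concreteness, this is roughly what that default looks like in the Python SDK: a parent-based ratio sampler whose keep/drop decision is made once, from the root span's trace ID, and inherited by every child.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of traces. The decision is derived from the trace ID at the
# root and inherited by every child span: fast, stateless, and blind to
# anything that happens after the root span starts.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)
```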

For agents, head sampling is actively destructive. Two reasons.

First, a single agent run is a dependent sequence, not a statistical population. You do not want to sample 10% of the LLM calls in a run — you want 100% of the calls from the runs you keep, and zero from the ones you drop. Any partial capture produces a trace that lies about what happened. A span tree with three missing leaves does not tell you the agent took an unplanned detour; it shows you an agent that went from step 2 to step 7 with no explanation.

Second, the interesting events are rare. Slow turns, hallucinated tool calls, reasoning loops, cost spikes — these are the traces you need, and they are exactly the ones you cannot identify at the root span. By the time the sampler sees the first LLM call, it has no way to know the agent will go on to make forty-five more. A 1% head sampler applied to a bimodal latency distribution will preserve plenty of fast traces and almost none of the slow ones, because slow traces were always the minority.
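A back-of-the-envelope simulation makes that arithmetic concrete. The distribution parameters below are assumptions (5% of turns slow, 1% head sampling), not measurements:

```python
import random

random.seed(0)
SAMPLE_RATE = 0.01   # 1% head sampling, decided blindly at the root
P_SLOW = 0.05        # assume ~5% of turns fall in the slow mode

kept = {"fast": 0, "slow": 0}
total = {"fast": 0, "slow": 0}
for _ in range(100_000):
    mode = "slow" if random.random() < P_SLOW else "fast"
    total[mode] += 1
    if random.random() < SAMPLE_RATE:
        kept[mode] += 1

print(kept, total)
# Roughly {'fast': 950, 'slow': 50} kept out of {'fast': 95000, 'slow': 5000}:
# the kept set is dominated by fast traces, and ~99% of the slow turns you
# actually need to debug never reach the backend.
```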

There is a subtler failure: sampling only drops whole executions rather than individual calls when your instrumentation is clean. In agent code that spans multiple frameworks — LangGraph on top of the OpenAI SDK on top of a custom tool router — context propagation often breaks. Each framework starts its own trace because nobody passed through the parent context. Your sampler sees each fragment as a separate root span and makes independent decisions. You end up with one kept fragment and four dropped ones, so the "trace" in your viewer is a single disconnected sub-tree.
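Here is what "passing through the parent context" looks like in plain OTel Python, assuming you control the seam between frameworks; the router function is a placeholder for whatever sits on the other side of that seam.

```python
from opentelemetry import context, trace

tracer = trace.get_tracer("agent")

def handle_turn(user_message: str) -> None:
    with tracer.start_as_current_span("agent.turn"):
        # Capture the active context before handing off to a framework,
        # thread pool, or callback that will not inherit it automatically.
        parent_ctx = context.get_current()
        run_tool_router(user_message, parent_ctx)

def run_tool_router(user_message: str, parent_ctx) -> None:
    # Start this span under the caller's context so it becomes a child of
    # agent.turn instead of the root of a new, separately sampled trace.
    with tracer.start_as_current_span("tool.router", context=parent_ctx):
        ...  # framework-specific tool dispatch goes here
```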

Designing a span hierarchy that survives retention

Assume you will have to throw most spans away. The hierarchy should be designed so that the spans you keep still answer the questions you actually ask.

The root span is the agent run, not the HTTP request. This is the most common mistake teams make when they bolt gen_ai conventions onto an existing service: they leave the web handler as root, so the agent turn becomes one level deep in a tree that is already four levels deep from middleware. Make the agent run its own trace boundary. Link upward to the HTTP request with a span link if you need the join, but let the agent own the root.
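A sketch of that boundary in OTel Python: an empty Context() forces a new trace root, and the span link preserves the join to the inbound request. The span names here are a naming choice, not a convention.

```python
from opentelemetry import trace
from opentelemetry.context import Context

tracer = trace.get_tracer("agent")

def handle_request(http_span: trace.Span, user_message: str) -> None:
    with tracer.start_as_current_span(
        "agent.run",
        context=Context(),  # empty context: no parent, so this starts a new trace
        links=[trace.Link(http_span.get_span_context())],  # join back to the HTTP request
    ):
        ...  # turns, steps, and actions live under this root
```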

Under the root, use three coarse levels before you go fine-grained (see the sketch after the list):

  1. Turn — one user message and everything done to answer it.
  2. Step — one iteration of the planner/executor loop (intent → plan → execute → observe).
  3. Action — one LLM call, tool invocation, retrieval, or validation within a step.
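A minimal sketch of that hierarchy in OTel Python. The agent.* span names and attributes, and the plan_steps helper, are placeholders for whatever your planner loop actually exposes:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent")

def answer(user_message: str, turn_index: int) -> None:
    with tracer.start_as_current_span("agent.turn") as turn:          # level 1: turn
        turn.set_attribute("agent.turn.index", turn_index)
        for step_index, actions in enumerate(plan_steps(user_message)):
            with tracer.start_as_current_span("agent.step") as step:  # level 2: step
                step.set_attribute("agent.step.index", step_index)
                for action in actions:
                    # level 3: action, i.e. one LLM call, tool invocation,
                    # retrieval, or validation within the step
                    with tracer.start_as_current_span(f"agent.action.{action.kind}"):
                        action.run()
```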