Your Agent Traces Are Lying: Cardinality, Sampling, and Span Hierarchies for LLM Agents
Your tracing dashboard says the agent made eight calls to serve a user request. In reality, it made forty-seven. Your head-based sampler quietly dropped most of them. The ones you kept are technically correct but causally useless — child spans orphaned from a root their parent sampler threw away.
This is not a visualization bug. It is the predictable outcome of pointing distributed tracing infrastructure designed for ten-span HTTP fan-outs at systems that generate hundreds of spans per user turn. Default OpenTelemetry configurations systematically undercount the work agents do, and the teams running those agents usually do not notice until a customer complains about latency the trace viewer says does not exist.
Agent observability is not a harder version of microservice observability. The shape of the data is different, the failure modes are different, and the cost curve is different. Treating it like a spicier web backend is how you end up with a tracing bill that doubles in a quarter while your mean-time-to-diagnose gets worse, not better.
The cardinality math nobody warned you about
Start with a single-turn interaction. A traditional REST endpoint fans out to maybe ten spans: the HTTP handler, a few database queries, a cache read, an external API call. Tracing tools were built with this shape in mind. Tail-sampling processor documentation uses ten-span traces in its examples. Default retention quotas assume it. Span storage pricing is calibrated to it.
Now instrument an agent. A reasonable ReAct loop for a customer support chatbot might look like this for one user message: intent classification (one LLM call), tool selection (one LLM call), parallel tool execution (three tool spans), tool result validation (one LLM call), retrieval (one vector DB query, one reranker call), answer drafting (one LLM call), safety check (one LLM call), tool call for a follow-up lookup (one LLM call plus one API call), final response (one LLM call). That is roughly thirteen operations. Each produces a top-level span plus child spans for HTTP, serialization, and retries. Realistic count: thirty to sixty spans per turn.
Multiply that by a five-turn conversation. Published estimates for mid-sized deployments describe a typical trajectory: 50,000 user messages per day, 200,000 LLM invocations, one million spans, four million metric data points, 400 MB of logs. Cost reports from teams that bolted AI workloads onto existing Datadog, Honeycomb, or New Relic setups land between a 40% and 200% observability bill increase, depending on retention and custom metrics.
The ten-to-fifty-times volume multiplier is not the hardest part. The hard part is that the per-span payload is different. Each gen_ai span under the OpenTelemetry semantic conventions wants to carry prompts, completions, token counts, model parameters, and tool arguments. Each of those attributes can be kilobytes. A traditional span is hundreds of bytes. You are not just paying 10x–50x more for spans — you are paying 10x–50x more for larger spans, and most tracing backends charge by attribute size.
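The compounding of span count and span size is easy to make concrete. A back-of-the-envelope sketch, using the published volume estimate above and illustrative byte sizes (the `TRADITIONAL_SPAN_BYTES` and `GENAI_SPAN_BYTES` constants are assumptions, not measurements):

```python
# One day of agent traffic for a mid-sized deployment.
# Volume figures follow the published estimate; byte sizes are illustrative.
MESSAGES_PER_DAY = 50_000       # user messages per day
SPANS_PER_MESSAGE = 20          # mid-range of the 30-60 spans-per-turn estimate
TRADITIONAL_SPAN_BYTES = 300    # typical HTTP/DB span with small attributes
GENAI_SPAN_BYTES = 8_000        # span carrying prompt/completion excerpts

spans_per_day = MESSAGES_PER_DAY * SPANS_PER_MESSAGE
agent_bytes = spans_per_day * GENAI_SPAN_BYTES
baseline_bytes = spans_per_day * TRADITIONAL_SPAN_BYTES  # same count, small spans

print(f"spans/day:        {spans_per_day:,}")
print(f"agent ingest/day: {agent_bytes / 1e9:.1f} GB")
print(f"size multiplier:  {GENAI_SPAN_BYTES // TRADITIONAL_SPAN_BYTES}x on top of the span-count multiplier")
```

The point of the arithmetic: even before you multiply by the 10x–50x span count, the per-span size alone is a ~25x factor against a traditional workload at these assumed sizes.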
Why head sampling silently corrupts your agent traces
Head-based sampling decides at the root span whether to keep a trace. It is fast, stateless, and makes cost predictable. It is also the default in most OTel SDKs. For traditional services it is fine: losing 90% of healthy traces is acceptable because the 10% you keep are representative.
For agents, head sampling is actively destructive. Two reasons.
First, a single agent run is a dependent sequence, not a statistical population. You do not want to sample 10% of the LLM calls in a run — you want 100% of the calls from the runs you keep, and zero from the ones you drop. Any partial capture produces a trace that lies about what happened. A span tree with three missing leaves does not tell you the agent took an unplanned detour; it shows you an agent that went from step 2 to step 7 with no explanation.
Second, the interesting events are rare. Slow turns, hallucinated tool calls, reasoning loops, cost spikes — these are the traces you need, and they are exactly the ones you cannot identify at the root span. By the time the sampler sees the first LLM call, it has no way to know the agent will go on to make forty-five more. A 1% head sampler applied to a bimodal latency distribution will preserve plenty of fast traces and almost none of the slow ones, because slow traces were always the minority.
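A quick simulation makes the bimodal case concrete (synthetic population, fixed seed, numbers chosen only for illustration):

```python
import random

random.seed(7)

# Synthetic bimodal population: 95% fast turns, 5% slow turns.
traces = (["fast"] * 95_000) + (["slow"] * 5_000)
random.shuffle(traces)

# Head sampling at 1%: the decision is made at the root span,
# blind to how slow or long the run will turn out to be.
kept = [t for t in traces if random.random() < 0.01]

slow_kept = kept.count("slow")
print(f"kept total: {len(kept)}, slow traces kept: {slow_kept} of 5,000")
```

The sampler is unbiased, which is exactly the problem: the slow traces survive only in proportion to their share of traffic, so the population you most need to debug is reduced to a few dozen specimens.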
There is a subtler failure: sampling only drops whole executions, rather than individual calls, when your instrumentation is clean. In agent code that spans multiple frameworks — LangGraph on top of the OpenAI SDK on top of a custom tool router — context propagation often breaks. Each framework starts its own trace because nobody passed through the parent context. Your sampler sees each fragment as a separate root span and makes independent decisions. You end up with one kept fragment and four dropped ones, so the "trace" in your viewer is a single disconnected sub-tree.
Designing a span hierarchy that survives retention
Assume you will have to throw most spans away. The hierarchy should be designed so that the spans you keep still answer the questions you actually ask.
The root span is the agent run, not the HTTP request. This is the most common mistake teams make when they bolt gen_ai conventions onto an existing service: they leave the web handler as root, so the agent turn becomes one level deep in a tree that is already four levels deep from middleware. Make the agent run its own trace boundary. Link upward to the HTTP request with a span link if you need the join, but let the agent own the root.
Under the root, use three coarse levels before you go fine-grained:
- Turn — one user message and everything done to answer it.
- Step — one iteration of the planner/executor loop (intent → plan → execute → observe).
- Action — one LLM call, tool invocation, retrieval, or validation within a step.
The trap is to skip the step level because each step usually has a single action. Do not skip it. The step span is what lets you answer "did the agent loop?" without walking every leaf. The step span carries the planner's natural-language plan as an attribute, which is the single most useful piece of context when a trace looks wrong. Without it, you are reading a list of LLM calls with no narrative.
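The three levels can be sketched as plain data structures. This is a minimal model for illustration, not part of any SDK; the class and field names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Action:
    """One LLM call, tool invocation, retrieval, or validation."""
    kind: str    # e.g. "llm_call", "tool", "retrieval"
    name: str

@dataclass
class Step:
    """One planner/executor iteration. Carries the plan as an attribute."""
    plan: str    # the planner's natural-language plan
    actions: list[Action] = field(default_factory=list)

@dataclass
class Turn:
    """One user message and everything done to answer it."""
    user_message: str
    steps: list[Step] = field(default_factory=list)

    def looped(self, threshold: int = 3) -> bool:
        # Answer "did the agent loop?" from step spans alone,
        # without walking every leaf action.
        return len(self.steps) > threshold

turn = Turn("where is my order?")
turn.steps.append(Step("look up the order, then summarize status",
                       [Action("tool", "orders.lookup"),
                        Action("llm_call", "draft_answer")]))
print(turn.looped())
```

Note that loop detection needed only the step count, and the plan string travels with the step rather than being reconstructed from leaf spans.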
Tool calls and LLM calls at the action level should follow the gen_ai.* semantic conventions — gen_ai.operation.name, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens. The conventions graduated to stable for core attributes in mid-2025, and most vendors now auto-instrument them. The value is not just visualization: it is that a single query across all agent runs can ask "which model is my p99 tool-calling latency regressing on?" without custom attributes per framework.
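With those attributes in place, the cross-run query reduces to one aggregation over span attributes. A stdlib sketch, with made-up model names and latencies standing in for real span data:

```python
from collections import defaultdict
from statistics import quantiles

# Synthetic action spans carrying the stable gen_ai.* attributes.
spans = [
    {"gen_ai.operation.name": "chat", "gen_ai.request.model": "model-a",
     "duration_ms": d} for d in (120, 130, 125, 900)
]
spans += [
    {"gen_ai.operation.name": "chat", "gen_ai.request.model": "model-b",
     "duration_ms": d} for d in (200, 210, 205, 215)
]

# Group latency by model across every agent run, framework-agnostic.
by_model = defaultdict(list)
for s in spans:
    if s["gen_ai.operation.name"] == "chat":
        by_model[s["gen_ai.request.model"]].append(s["duration_ms"])

for model, latencies in sorted(by_model.items()):
    # quantiles(n=100) yields 99 cut points; index 98 is the p99.
    p99 = quantiles(latencies, n=100, method="inclusive")[98]
    print(model, round(p99, 1))
```

The same grouping key works whether the span came from LangGraph, the OpenAI SDK, or a hand-rolled tool router, which is the entire value of the shared conventions.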
One more hierarchy rule: never let retry loops produce sibling spans at the same level as the primary call. Wrap retries in a retry span whose children are the individual attempts. Otherwise a burst of transient 429s from an LLM provider turns one action into five peer spans, and your "number of LLM calls per turn" metric becomes unreliable.
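The retry-wrapping rule can be sketched with a hypothetical span-tree structure (plain dicts here; a real implementation would open OpenTelemetry spans instead):

```python
def with_retry(span_tree: list, name: str, fn, max_attempts: int = 3):
    """Run fn under one retry span whose children are the attempts."""
    retry_span = {"name": f"{name}.retry", "children": []}
    span_tree.append(retry_span)  # exactly one peer span, however many attempts
    for attempt in range(1, max_attempts + 1):
        child = {"name": f"{name}.attempt", "attempt": attempt}
        retry_span["children"].append(child)
        try:
            result = fn()
            child["status"] = "ok"
            return result
        except Exception as exc:
            child["status"] = "error"
            child["error"] = str(exc)
            if attempt == max_attempts:
                raise

# A flaky call that fails twice with a transient 429, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("429 Too Many Requests")
    return "ok"

tree = []
result = with_retry(tree, "llm_call", flaky)
print(result, len(tree), len(tree[0]["children"]))  # one peer span, three attempt children
```

The "LLM calls per turn" metric counts `tree` entries and stays at one regardless of how many 429s the provider returned.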
Tail sampling for agents: keep the weird ones
Once your hierarchy is sound, the question becomes what to throw away. Tail-based sampling — deciding after the full trace arrives at the collector — is the only sampling that works for agents, because the signals you care about only exist at the end.
A working tail-sampling policy for agent workloads has four keepers and one rate limiter:
- Keep all errors. Any span with status error, any tool span that timed out, any LLM span that hit a content filter. Volume is naturally low.
- Keep all outliers. Tail-sampling processors support latency and span-count policies. Keep traces above p99 latency and traces with more spans than your expected p95 per-turn count. Reasoning loops and context-overflow recoveries land here.
- Keep all high-cost traces. Agent cost has a fat tail: a small percentage of runs consume most of the token budget. A gen_ai.usage.total_tokens threshold policy ensures you never miss the $3 user turn you will explain to your CFO next week.
- Keep all low-eval traces. If you run an inline evaluator, attach the score as a trace attribute and keep anything below your trust threshold. This is the only reliable way to study "the model being weird" after the fact.
- Rate-limit the healthy ones. A probabilistic policy at 1–5% on everything else gives you enough baseline traffic to compute healthy-path SLOs without ballooning storage.
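The whole policy reads as a short decision function over a completed trace. A stdlib sketch in which the thresholds and the trace-summary field names are assumptions to tune against your own distributions:

```python
import random

# Illustrative thresholds; derive yours from measured p95/p99 values.
P99_LATENCY_MS = 30_000
P95_SPAN_COUNT = 60
TOKEN_BUDGET = 50_000
EVAL_TRUST_THRESHOLD = 0.7
BASELINE_RATE = 0.02            # 2% of healthy traces for SLO baselines

def keep_trace(trace: dict, rng=random) -> tuple[bool, str]:
    """Tail-sampling decision on a completed trace summary."""
    if trace.get("has_error"):
        return True, "error"
    if trace["latency_ms"] > P99_LATENCY_MS or trace["span_count"] > P95_SPAN_COUNT:
        return True, "outlier"
    if trace["total_tokens"] > TOKEN_BUDGET:
        return True, "high_cost"
    eval_score = trace.get("eval_score")
    if eval_score is not None and eval_score < EVAL_TRUST_THRESHOLD:
        return True, "low_eval"
    return rng.random() < BASELINE_RATE, "baseline"

# A fast, cheap turn that looped: 140 spans trips the outlier keeper.
print(keep_trace({"has_error": False, "latency_ms": 1_200,
                  "span_count": 140, "total_tokens": 9_000}))
```

The ordering matters only for the reason label; any single triggered keeper is enough to retain the trace.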
The cost of tail sampling is real. The collector must buffer traces in memory until the decision window closes, and decision windows for agents need to be measured in tens of seconds because agent runs are slow. Plan for a tail-sampling collector tier with enough RAM to hold one decision window of full-volume traces, and deploy it as a dedicated pool — not co-located with ingestion collectors. Losing the tail sampler to OOM during a traffic spike is how you silently go back to 100% sampling at the backend for thirty minutes.
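Sizing that buffer is simple arithmetic once you pin the decision window. Every constant below is an illustrative assumption, not a recommendation:

```python
# RAM needed to buffer one tail-sampling decision window.
# All figures are illustrative assumptions for a peak-traffic estimate.
TRACES_PER_SECOND = 30          # size for bursts, not the daily average
AVG_TRACE_BYTES = 200_000       # 30-60 spans with trimmed gen_ai payloads
DECISION_WINDOW_S = 45          # agent runs are slow; windows span tens of seconds

buffer_bytes = TRACES_PER_SECOND * AVG_TRACE_BYTES * DECISION_WINDOW_S
print(f"buffer: {buffer_bytes / 1e9:.2f} GB per collector, before processor overhead")
```

At these numbers the window fits comfortably in a small dedicated pool; the figure that moves it most is average trace size, which is another argument for the payload work in the next section.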
Move the payload out of the span
The last lever is the one most teams reach for last and should reach for first: most of the cost is in the attributes, not the span count. A trace of thirty spans with full prompts and completions attached can easily be a megabyte. The same trace with prompts moved to object storage and referenced by ID is twenty kilobytes.
Three patterns, in order of how much operational work they take:
Truncate at the span. Set a hard cap on gen_ai.prompt and gen_ai.completion attribute lengths — 2KB each is generous. If you need the full text, record it once per trace on the root span, not on every child. This costs no new infrastructure and typically removes 60–80% of the span storage bill.
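A truncation helper is a few lines; the marker string here is a hypothetical convention, and capping in bytes rather than characters keeps the limit honest for multi-byte text:

```python
MAX_ATTR_BYTES = 2048  # hard cap per gen_ai.prompt / gen_ai.completion

def truncate_attr(value: str, limit: int = MAX_ATTR_BYTES) -> str:
    """Cap a payload attribute, marking the cut so readers know it happened."""
    encoded = value.encode("utf-8")
    if len(encoded) <= limit:
        return value
    marker = "...[truncated]"
    # Slice bytes, then decode leniently in case the cut lands mid-codepoint.
    head = encoded[: limit - len(marker)].decode("utf-8", errors="ignore")
    return head + marker

span_attrs = {"gen_ai.prompt": truncate_attr("x" * 10_000)}
print(len(span_attrs["gen_ai.prompt"].encode("utf-8")))
```

Apply it at the instrumentation layer, before export, so the cap holds no matter which backend is downstream.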
Shift payloads to events. OpenTelemetry logs and events are cheaper to store than span attributes in most backends because they are not indexed. Emit the full prompt as a log event linked to the span via trace ID. The viewer will stitch them back together.
Shift payloads to blob storage. Write prompt/completion bodies to S3, R2, or a dedicated store keyed by trace ID, and attach only the object URL to the span. This is the right model for any team with regulatory retention requirements — you can set a shorter lifecycle policy on the trace (hot for queries) than on the payload (warm for audit), and you can encrypt the payload separately without re-encrypting every span.
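The blob pattern, sketched with an in-memory dict standing in for S3/R2. The bucket name, key scheme, and the gen_ai.prompt.ref attribute are all hypothetical choices, not established conventions:

```python
import hashlib
import json

blob_store: dict[str, bytes] = {}   # stand-in for S3/R2

def offload_payload(trace_id: str, span_id: str, payload: str) -> str:
    """Write the payload to blob storage; return a reference for the span."""
    body = payload.encode("utf-8")
    digest = hashlib.sha256(body).hexdigest()[:16]  # content-addressed suffix
    key = f"prompts/{trace_id}/{span_id}-{digest}.json"
    blob_store[key] = body
    return f"s3://agent-payloads/{key}"            # hypothetical bucket

url = offload_payload("trace-abc", "span-1",
                      json.dumps({"prompt": "example prompt text"}))
span_attrs = {"gen_ai.prompt.ref": url}            # reference, not payload
print(url)
```

Keying by trace and span ID keeps the join trivial from the trace viewer, and the content digest in the key makes accidental overwrites visible.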
Whichever pattern you pick, stop treating gen_ai.prompt as a free-form attribute. It is a payload masquerading as metadata, and every tracing backend prices it as the latter.
What changes when you get this right
Operating AI systems in production is fundamentally a visibility problem. The bugs are stochastic, the root causes are emergent, and the only way to debug an agent that looped eighteen times instead of three is to look at all eighteen loops. A trace viewer that shows you four of them is not a tool — it is a liability, because it builds a story your team will believe and act on.
Teams that invest in agent-aware tracing tend to converge on the same three-part answer: a hierarchy that matches the agent's actual control flow, tail sampling that privileges anomalies over coverage, and payload handling that separates the metadata you query from the text you occasionally need to read. None of it is exotic. All of it requires you to stop configuring your tracing like it is still 2018.
The bigger shift is philosophical. Distributed tracing used to be an optimization tool — you turned it on when you needed to hunt down a p99 latency regression. For agents, tracing is the product surface for your engineering team. It is how you know what the system did, why it did it, and whether to trust it tomorrow. Configure it like that matters, because it does.
- https://opentelemetry.io/blog/2025/ai-agent-observability/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- https://opentelemetry.io/blog/2022/tail-sampling/
- https://opentelemetry.io/docs/concepts/sampling/
- https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md
- https://oneuptime.com/blog/post/2026-04-01-ai-workload-observability-cost-crisis/view
- https://uptrace.dev/blog/opentelemetry-ai-systems
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://www.datadoghq.com/blog/parent-child-vs-span-links-tracing/
- https://blog.sentry.io/ai-agent-observability-developers-guide-to-agent-monitoring/
- https://openai.github.io/openai-agents-python/ref/tracing/spans/
- https://dev.to/aws/why-ai-agents-fail-3-failure-modes-that-cost-you-tokens-and-time-1flb
