Agent Trace Sampling: When 'Log Everything' Costs $80K and Still Misses the Regression
The bill arrived in March. Eighty-one thousand dollars on traces alone, up from twelve thousand in November. The team had turned on full agent tracing in October on the theory that more visibility was always better. By Q1 the observability line was running ahead of the inference line, and when an actual regression hit production, the trace that contained the failure was buried under twenty million successful spans nobody needed.
The mistake was not the decision to instrument. The mistake was importing a request-tracing mental model into a workload that does not behave like requests.
A typical web request produces a span tree with a handful of children: handler, database call, cache lookup, downstream service. An agent request produces a tree with five LLM calls, three tool invocations, two vector lookups, intermediate scratchpads, and a planner that reconsiders three of those steps. The same sampling policy that worked for the API gateway — head-sample 1%, keep everything else representative — produces a trace store where the median trace is a 200-span monster, the long tail is the only thing that matters, and the rate at which you discover incidents is uncorrelated with the rate at which you spend money.
Why Request-Level Sampling Breaks for Agents
Span-tree size is the first-order issue. Distributed-tracing teams have measured trace volume at roughly five times log volume across typical workloads. Agent workloads are an order of magnitude worse — practitioners estimate a RAG pipeline produces 10–50× the telemetry of an equivalent stateless API call, and a multi-step agent compounds that further. Teams adopting AI workload monitoring on top of an existing tracing setup commonly report observability bill increases of 40–200%.
Cost is only half the problem. The deeper issue is that head-based sampling — the decision-at-trace-origin model that most APM tools default to — was designed for a world where every successful trace looked like every other successful trace, so a 1% sample of successes was statistically the same as a 100% sample. Agent traces violate that assumption. A successful agent run that took eleven tool calls and three planner revisions is not interchangeable with a successful run that took two tool calls and zero revisions. Both pass the eval. Both look fine in aggregate. One of them is going to be the failure mode when next week's prompt change accidentally up-weights the planner's tendency to spiral.
The other failure mode is worse. The trace that contains the actual regression — the rare path where a tool returned a confusing error, the model interpreted it as a user instruction, and the agent made an unauthorized call — is by definition rare. If you head-sample at 1%, you keep that trace one time in a hundred. The other ninety-nine times it happens, you have a metric that ticked up by 0.001 percentage points and no artifact to investigate.
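The arithmetic behind that claim is easy to check. The sketch below assumes each occurrence of the failure is head-sampled independently at the same rate; the function name is hypothetical and exists only for illustration.

```python
def capture_probability(sample_rate: float, occurrences: int) -> float:
    """Chance that at least one of `occurrences` failing traces survives head sampling.

    Assumes every trace is kept independently with probability `sample_rate`.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    # P(at least one kept) = 1 - P(all dropped)
    return 1.0 - (1.0 - sample_rate) ** occurrences


if __name__ == "__main__":
    for n in (1, 10, 100, 500):
        p = capture_probability(0.01, n)
        print(f"{n:>4} occurrences at 1% head sampling: {p:.1%} chance of one retained trace")
```

At a 1% rate the failure has to recur about a hundred times before the odds of holding even one example pass roughly 63%, which is a long time to run with a live regression and nothing to inspect.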
Tail Sampling, Cost-Tier Sampling, and the Hybrid Default
The mature pattern, well-documented in OpenTelemetry's tail-sampling guidance, is to delay the sampling decision until the full trace is available so the policy can keep traces that are interesting and drop traces that are routine. For agents, "interesting" needs three lenses, not one.
Always-trace failure paths. Any trace where a tool returned an error, the model produced a refusal it shouldn't have, the safety filter fired, or the policy engine intervened — keep them all. These are the traces that get pulled up in incident review, in eval-set construction, and in red-team analysis. The volume is small. The marginal storage cost is rounding-error against their value. Datadog's defaults reflect this — they retain a fixed budget of error traces per second (around ten by default) on top of the representative-success budget, and let you mark high-value transactions for 100% retention.
Head-sample successful paths. Routine successes do not need 100% storage. A 1–10% head sample of the boring successful runs gives you the volumetric base rate for SLO computation, dashboard rollups, and traffic-shape analysis. The OpenTelemetry recommendation here is straightforward: SDK-side head sampling at the tracer reduces wire volume before anything expensive happens, and the loss is statistical, not categorical.
Tail-sample by cost percentile. The signal that traditional tail sampling captures — latency outliers — is necessary but not sufficient for agents. Cost is a parallel dimension and often the more useful one. A trace that took 23 seconds is interesting; a trace that consumed $1.40 in tokens because the planner looped is more interesting, because token cost is the signal that catches reasoning-loop failures and runaway tool usage that latency alone might miss (a fast loop is still expensive). Configure the tail sampler to retain traces above the 95th and 99th percentiles of token spend per request, alongside the latency outliers. These two distributions are correlated but not identical, and the traces that live in the gap — high-cost, low-latency — are where prompt regressions hide.
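The percentile cut can be sketched in a few lines. This is an illustration under stated assumptions, not the collector's own mechanism: as far as the tail-sampling processor's README describes it, the numeric_attribute policy takes fixed bounds, so the percentiles would have to be computed from a recent window of traffic and fed in as thresholds. The function names and the sample cost figures below are hypothetical.

```python
from statistics import quantiles


def percentile_thresholds(costs_usd, percentiles=(95, 99)):
    """Return {percentile: cost threshold} from a window of per-trace token costs."""
    if len(costs_usd) < 2:
        raise ValueError("need at least two observations to estimate percentiles")
    # n=100 yields 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = quantiles(costs_usd, n=100, method="inclusive")
    return {p: cuts[p - 1] for p in percentiles}


def is_cost_outlier(cost_usd: float, threshold_usd: float) -> bool:
    """True when a trace's token spend meets or exceeds the chosen threshold."""
    return cost_usd >= threshold_usd


if __name__ == "__main__":
    # Hypothetical window: mostly cheap runs, a few pricier ones, one planner loop.
    window = [0.02] * 95 + [0.10] * 4 + [1.40]
    thresholds = percentile_thresholds(window)
    for p, value in thresholds.items():
        print(f"p{p} token-cost threshold: ${value:.3f}")
    print("looping trace retained:", is_cost_outlier(1.40, thresholds[99]))
```

Recomputing the thresholds on a schedule keeps the cut meaningful as prompts and models change; a fixed dollar figure chosen once will drift out of the tail or swallow it.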
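The architecturally useful framing is that this is no longer one sampling decision; it is three policies composed. The OpenTelemetry collector's tail-sampling processor supports policy composition natively: string_attribute, latency, numeric_attribute, and status_code policies listed side by side behave as an OR (a trace is kept if any of them votes to sample), and the and policy type requires all of its sub-policies to match. Most teams will run a hybrid: light head sampling (50% or so) at the SDK to cap network egress, then a gateway collector applying the tail policy to what remains. The trade-off is that a failure trace dropped at the SDK never reaches the tail sampler, so the head rate sets the ceiling on how many failure paths can actually be kept.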
The Storage Problem Nobody Forecasted
- https://www.datadoghq.com/architecture/mastering-distributed-tracing-data-volume-challenges-and-datadogs-approach-to-efficient-sampling/
- https://opentelemetry.io/blog/2022/tail-sampling/
- https://opentelemetry.io/docs/concepts/sampling/
- https://uptrace.dev/opentelemetry/sampling
- https://uptrace.dev/blog/opentelemetry-ai-systems
- https://openobserve.ai/blog/opentelemetry-for-llms/
- https://openobserve.ai/blog/head-and-tail-based-sampling/
- https://oneuptime.com/blog/post/2026-04-01-ai-workload-observability-cost-crisis/view
- https://oneuptime.com/blog/post/2026-02-06-head-based-vs-tail-based-sampling-opentelemetry/view
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md
- https://langfuse.com/docs/observability/overview
- https://pydantic.dev/articles/ai-observability-pricing-comparison
- https://www.digitalapplied.com/blog/agent-observability-2026-evals-traces-cost-guide
