Agent Trace Sampling: When 'Log Everything' Costs $80K and Still Misses the Regression
The bill arrived in March. Eighty-one thousand dollars on traces alone, up from twelve in November. The team had turned on full agent tracing in October on the theory that more visibility was always better. By Q1 the observability line was running ahead of the inference line — and when an actual regression hit production, the trace that contained the failure was buried under twenty million successful spans nobody needed.
The mistake was not the decision to instrument. The mistake was importing a request-tracing mental model into a workload that does not behave like requests.
A typical web request produces a span tree with a handful of children: handler, database call, cache lookup, downstream service. An agent request produces a tree with five LLM calls, three tool invocations, two vector lookups, intermediate scratchpads, and a planner that reconsiders three of those steps. The same sampling policy that worked for the API gateway — head-sample at 1% and trust the kept traces to stay representative — produces a trace store where the median trace is a 200-span monster, the long tail is the only thing that matters, and the rate at which you discover incidents is uncorrelated with the rate at which you spend money.
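To make that shape concrete, here is an illustrative span tree for a single agent request, sketched as YAML for readability. Every name and token count is hypothetical; the fan-out is the point.

```yaml
# One agent request: five LLM calls, three tool invocations, two vector
# lookups, and a planner that revises its own earlier steps.
agent.request:
  planner.draft_plan:
    llm.call_1: {tokens: 1800}
  retrieval.vector_lookup_1: {}
  retrieval.vector_lookup_2: {}
  tool.search: {}
  planner.revise_plan:            # the planner reconsiders earlier steps
    llm.call_2: {tokens: 2200}
  tool.fetch_page: {}
  tool.summarize:
    llm.call_3: {tokens: 3100}
  planner.finalize:
    llm.call_4: {tokens: 900}
  response.compose:
    llm.call_5: {tokens: 1200}
```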
Why Request-Level Sampling Breaks for Agents
Span-tree size is the first-order issue. Distributed-tracing teams have measured trace volume at roughly five times log volume across typical workloads. Agent workloads are an order of magnitude worse — practitioners estimate a RAG pipeline produces 10–50× the telemetry of an equivalent stateless API call, and a multi-step agent compounds that further. Teams adopting AI workload monitoring on top of an existing tracing setup commonly report observability bill increases of 40–200%.
Cost is only half the problem. The deeper issue is that head-based sampling — the decision-at-trace-origin model that most APM tools default to — was designed for a world where every successful trace looked like every other successful trace, so a 1% sample of successes was statistically the same as a 100% sample. Agent traces violate that assumption. A successful agent run that took eleven tool calls and three planner revisions is not interchangeable with a successful run that took two tool calls and zero revisions. Both pass the eval. Both look fine in aggregate. One of them is going to be the failure mode when next week's prompt change accidentally up-weights the planner's tendency to spiral.
The other failure mode is worse. The trace that contains the actual regression — the rare path where a tool returned a confusing error, the model interpreted it as a user instruction, and the agent made an unauthorized call — is by definition rare. If you head-sample at 1%, you keep that trace one time in a hundred. The other ninety-nine times it happens, you have a metric that ticked up by 0.001 percentage points and no artifact to investigate.
Tail Sampling, Cost-Tier Sampling, and the Hybrid Default
The mature pattern, well-documented in OpenTelemetry's tail-sampling guidance, is to delay the sampling decision until the full trace is available so the policy can keep traces that are interesting and drop traces that are routine. For agents, "interesting" needs three lenses, not one.
Always-trace failure paths. Any trace where a tool returned an error, the model produced a refusal it shouldn't have, the safety filter fired, or the policy engine intervened — keep them all. These are the traces that get pulled up in incident review, in eval-set construction, and in red-team analysis. The volume is small, and the marginal storage cost is a rounding error against their value. Datadog's defaults reflect this — they retain a fixed budget of error traces per second (around ten by default) on top of the representative-success budget, and let you mark high-value transactions for 100% retention.
Head-sample successful paths. Routine successes do not need 100% storage. A 1–10% head sample of the boring successful runs gives you the volumetric base rate for SLO computation, dashboard rollups, and traffic-shape analysis. The OpenTelemetry recommendation here is straightforward: SDK-side head sampling at the tracer reduces wire volume before anything expensive happens, and the loss is statistical, not categorical.
Tail-sample by cost percentile. The signal that traditional tail sampling captures — latency outliers — is necessary but not sufficient for agents. Cost is a parallel dimension and often the more useful one. A trace that took 23 seconds is interesting; a trace that consumed $1.40 in tokens because the planner looped is more interesting, because token cost is the signal that catches reasoning-loop failures and runaway tool usage that latency alone might miss (a fast loop is still expensive). Configure the tail sampler to retain traces above the 95th and 99th percentiles of token spend per request, alongside the latency outliers. These two distributions are correlated but not identical, and the traces that live in the gap — high-cost, low-latency — are where prompt regressions hide.
The architecturally useful framing is that this is no longer one sampling decision; it is three policies composed. The OpenTelemetry collector's tail-sampling processor supports policy composition natively — string_attribute, latency, numeric_attribute, and status_code policies can be ANDed and ORed in the configuration. Most teams will run a hybrid: light head sampling (50% or so) at the SDK to cap network egress, then a gateway collector applying the tail policy to what remains.
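A minimal sketch of that hybrid, assuming the collector-contrib tail_sampling processor. The attribute name `agent.tokens_total` is a placeholder that presumes your instrumentation writes cumulative token spend onto the root span, and every threshold is illustrative; the processor takes only static values, so "the p95 of token spend" becomes a number you re-derive from your own distribution periodically.

```yaml
# Head sampling happens upstream, in the SDK, via the standard env vars:
#   OTEL_TRACES_SAMPLER=parentbased_traceidratio
#   OTEL_TRACES_SAMPLER_ARG=0.5
# The gateway collector then applies the composed tail policy:
processors:
  tail_sampling:
    decision_wait: 60s            # sized in the next section
    policies:
      # Lens 1: keep every failure path.
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Lens 2: a representative slice of routine successes.
      - name: baseline-successes
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
      # Lens 3a: latency outliers.
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 20000
      # Lens 3b: cost outliers; the high-cost, low-latency gap lives here.
      - name: high-token-spend
        type: numeric_attribute
        numeric_attribute:
          key: agent.tokens_total          # hypothetical attribute
          min_value: 50000                 # stands in for your measured p95
          max_value: 9223372036854775807   # int64 max: effectively unbounded
```

Policies in the list compose as an OR: a trace kept by any one of them is kept, which is exactly the semantics the three lenses need. The `and` and `composite` policy types cover cases where a single lens needs several conditions at once.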
The Storage Problem Nobody Forecasted
Run those three policies and the storage bill stops being linear in traffic. It becomes a function of error rate, cost-distribution shape, and the percentage of business-critical paths flagged for full retention. That is good — it means cost scales with information value rather than with raw QPS. It is also harder to forecast, which catches finance teams off guard.
The decision_wait parameter is the operational control that matters most here. The OpenTelemetry guidance recommends setting it to 2–3× your p99 trace latency, because spans from slow downstream services can arrive after the sampling decision and produce incomplete traces. Agent traces have p99 latencies that span tens of seconds, so decision_wait windows of 30–90 seconds are normal. That is 30–90 seconds of in-memory span buffering at the collector, which makes the collector itself a critical component — if it crashes mid-window, you lose every trace in that buffer, including the errors. Multiple practitioner write-ups flag this as the surprise: tail sampling moves the reliability problem upstream into your observability infrastructure.
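The knobs that govern that buffer sit on the same processor; a sketch with illustrative values:

```yaml
processors:
  tail_sampling:
    decision_wait: 60s                 # 2-3x your p99 trace latency
    num_traces: 100000                 # in-memory cap on traces awaiting a decision
    expected_new_traces_per_sec: 1000
    # num_traces should cover at least decision_wait * expected_new_traces_per_sec,
    # or traces get evicted before a sampling decision is reached.
```

At 1,000 new traces per second and a 60-second window, the collector holds roughly 60,000 in-flight traces; if the median agent trace runs 200 spans, that is on the order of twelve million buffered spans, which is why the collector deserves the same capacity planning and redundancy as any other stateful service.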
The storage tiering question is the next layer down. Hot storage for error and high-cost traces (queryable in seconds, retained 30–90 days), warm storage for representative successes (queryable in minutes, retained a week), cold storage for raw telemetry (compressed, retained for the duration regulators or eval teams need). Vendor pricing models map onto this tiering unevenly. Langfuse charges per "billable unit" — an aggregate of traces, spans, and eval scores, where a 10-span trace with one auto-eval costs about 12 units. Arize bills on both span count and raw GB. LangSmith bills per trace. The cost arbitrage between vendors at moderate scale (50M spans/month) can run an order of magnitude or more in either direction depending on which billing dimension your traces load most heavily.
The self-hosting math has shifted enough to deserve a recalculation in 2026. Open-source tracing stacks (Langfuse, Phoenix, OpenObserve) running on a 4-core/16GB node handle 5M spans/day for $50–80/month of compute, and an 8-core/32GB node with NVMe handles 10M+. For a team that has already crossed the $5K/month threshold on managed observability for AI traces, the self-host break-even is closer than it used to be, and the operational cost is mostly absorbed by the SRE team rather than landing on the AI engineer's budget.
The Cost-Spike Failure Mode
There is one tail-sampling failure mode worth knowing about before it happens to you. If your sampling policy retains all traces above some latency or cost threshold, and you experience a network event or vendor degradation that pushes a large fraction of traces over that threshold simultaneously, your trace exporter will dump every one of them into the storage backend at once. The bill spikes. Practitioner reports describe surprise charges of several thousand dollars from a single bad afternoon when an upstream provider's latency tripled and the tail sampler dutifully retained every now-anomalous trace.
The mitigation is a cap on the percentage of traces the tail sampler is allowed to retain in any rolling window. The tail-sampling processor itself ships rate_limiting and composite policy types that bound retained spans per second; configure one of them to cap your worst-case bill. Reasonable defaults are something like "retain 100% of errors, 100% of cost-anomalies, and at most 5% of total trace volume in any one-hour window — drop the rest with a warning." That preserves the incident-debugging value while putting a ceiling on the bill.
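The composite policy type is one way to express that cap: it takes a total spans-per-second budget and splits it across sub-policies in priority order. A sketch, with budget numbers as placeholders to size against your own traffic (`agent.tokens_total` is the same hypothetical attribute as above):

```yaml
processors:
  tail_sampling:
    decision_wait: 60s
    policies:
      - name: capped-retention
        type: composite
        composite:
          max_total_spans_per_second: 2000   # the worst-case-bill ceiling
          policy_order: [errors, cost-anomalies, baseline]
          composite_sub_policy:
            - name: errors
              type: status_code
              status_code:
                status_codes: [ERROR]
            - name: cost-anomalies
              type: numeric_attribute
              numeric_attribute:
                key: agent.tokens_total
                min_value: 50000
                max_value: 9223372036854775807
            - name: baseline
              type: probabilistic
              probabilistic:
                sampling_percentage: 5
          rate_allocation:                   # how the budget is split
            - policy: errors
              percent: 50
            - policy: cost-anomalies
              percent: 30
            - policy: baseline
              percent: 20
```

Two caveats: the budget unit is spans per second, not dollars, so translate your bill ceiling into span volume before setting it; and once the cap bites, even error traces beyond their allocation are dropped, so alert when that happens rather than discovering it in an incident review.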
Operational Discipline
Three pieces of discipline distinguish teams that get value from agent tracing from teams that get bills.
The sampling policy is configuration, not code. It needs to be reviewable, version-controlled, and tunable without a deploy. The collector contrib's tail-sampling processor is YAML-configurable; treat the policy as you would treat any production config — diffs reviewed, changes alerted, rollback documented.
The trace store is where eval sets come from. The strongest argument for keeping every error trace is that those traces become the negative examples in your next eval iteration. A trace store that has discarded 99% of failures has discarded 99% of your eval material. The economics of paying to keep error traces look different when you account for what each trace would have cost to reconstruct from scratch in a labeling pipeline.
Cost-per-correctness is the right unit. The most useful metric to pin to the sampling policy is not "spans per dollar" but "incidents resolved per dollar of trace storage." Teams that build that dashboard tend to discover that their highest-value spend is the always-on retention of error traces and the cost-percentile tail; the rest of the spend is debatable.
The Architectural Realization
Agent observability is not a request-tracing problem with a higher cardinality multiplier. It is a different storage problem, with a different ratio of signal to volume, a different cost-attribution surface, and a different relationship between what you keep and what you can debug. The reflex to "log everything" is what stateless-service intuition tells you to do. It produces a trace store that is expensive to maintain, slow to query, and missing the long-tail traces that contain the actual incidents.
The shift is from sampling-as-cost-control to sampling-as-curation. The traces you keep are the ones future-you will want — the failures, the cost outliers, the runs that took an unusual path through the planner — and the volume of routine successes you keep is just enough to compute aggregate health. Configure the policy that way, treat the collector as production infrastructure, cap the worst-case bill, and the eighty-thousand-dollar surprise becomes a fifteen-thousand-dollar tool that your incident-response team actually uses.
Sources
- https://www.datadoghq.com/architecture/mastering-distributed-tracing-data-volume-challenges-and-datadogs-approach-to-efficient-sampling/
- https://opentelemetry.io/blog/2022/tail-sampling/
- https://opentelemetry.io/docs/concepts/sampling/
- https://uptrace.dev/opentelemetry/sampling
- https://uptrace.dev/blog/opentelemetry-ai-systems
- https://openobserve.ai/blog/opentelemetry-for-llms/
- https://openobserve.ai/blog/head-and-tail-based-sampling/
- https://oneuptime.com/blog/post/2026-04-01-ai-workload-observability-cost-crisis/view
- https://oneuptime.com/blog/post/2026-02-06-head-based-vs-tail-based-sampling-opentelemetry/view
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/tailsamplingprocessor/README.md
- https://langfuse.com/docs/observability/overview
- https://pydantic.dev/articles/ai-observability-pricing-comparison
- https://www.digitalapplied.com/blog/agent-observability-2026-evals-traces-cost-guide
