Sampling Bias in Agent Traces: Why Your Debug Dataset Silently Excludes the Failures You Care About
The debug corpus your team stares at every Monday is not a representative sample of production. It is an actively biased one, and the bias is in exactly the wrong direction. Head-based sampling at 1% needs a rare catastrophic failure mode to recur roughly a hundred times before it keeps a single example, while retaining median requests by the thousands — and most teams discover this only when a failure mode that has been quietly recurring for months finally drives a refund or an outage, and they go looking for examples in the trace store and find none.
This is not an exotic edge case. It is the default behavior of every observability stack that was designed for stateless web services and then pointed at a long-horizon agent. The same sampling math that worked fine for HTTP request tracing systematically erases the trajectories that matter most when each "request" is a thirty-step plan that may invoke a dozen tools, regenerate three subplans, and consume tens of thousands of tokens before something subtle goes wrong on step twenty-seven.
The fix is not "sample more." Sampling more makes the bill explode without changing the bias — you just get more of what you already had too much of. The fix is to change what you sample, keyed on outcomes you can only know after the trajectory finishes. That requires throwing out the head-based defaults and rebuilding the retention layer around tail signals, anomaly weighting, and bounded reservoirs that survive the long tail of agent execution.
Why Head-Based Sampling Is The Wrong Default For Agents
Head-based sampling makes the keep/drop decision at the start of a trace, before anything has happened. It is fast, stateless, and works well when the trace cost is roughly uniform and the failure rate is roughly constant — neither of which holds for agents.
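For concreteness, a minimal head-based sampler looks like the sketch below. The function is hypothetical but representative of most SDK defaults: it hashes the trace ID against a rate at the moment the trace starts, so nothing about cost, length, or outcome can influence the result.

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Keep/drop decided at trace start, from the trace ID alone.

    Deterministic hashing keeps the decision consistent across services,
    but no outcome signal exists yet, so none can be consulted.
    """
    digest = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return digest / 0xFFFFFFFF < rate
```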
Two properties of agent workloads break the head-based assumption:
- Trace cost is bimodal, not normal. A bored agent answers in two LLM calls; a confused one spends fifty. If you sample 1% of starts uniformly, you keep a representative slice of trace counts but a wildly skewed slice of behaviors, because the long, expensive, interesting trajectories are rare as a fraction of started requests. The traces with the most signal are the ones least likely to be retained.
- Failure rates are not stationary. Failures cluster around new tool deployments, around specific user cohorts, around content distributions that drift week over week. A 1% uniform sample needs roughly 100 occurrences of a failure mode in the wild before the expected number of captured examples reaches one (the arithmetic is spelled out after this list), and by that point the failure has already reached enough users to matter. AI workloads also generate roughly 10–50× more telemetry per request than traditional API calls, so the temptation to sample harder grows in tandem with the value of each rare retained trace.
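That break-even point hides a coin flip:

```python
p = 0.01                        # head-based sampling rate
n = 100                         # occurrences of the failure mode in production

expected_captures = p * n       # 1.0: one retained example, on average
p_zero_captures = (1 - p) ** n  # ~0.366: the chance the store still has nothing

print(expected_captures, round(p_zero_captures, 3))
```

Even after 100 occurrences, there is a roughly 37% chance the trace store holds zero examples. "We expect one by now" and "we can find one" are different claims.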
The combination is brutal. You retain a mountain of mediocre, cheap, successful traces, and the corpus you debug from over-represents the common case and under-represents the failures worth fixing. The team's mental model of "what production looks like" gradually narrows to the most boring slice of it.
Tail-Based Sampling: Decide After You Know What Happened
Tail-based sampling defers the keep/drop decision until the trace completes and outcome signals are in hand. The OpenTelemetry community settled on this as the right default for distributed systems years ago for the same reason it now matters for agents: the most useful signal arrives at the end. A trace that finishes with an error, an unusual token spend, or a low judge score is worth more than a thousand trace starts.
A workable retention policy on top of tail sampling looks like this (a code sketch follows the list):
- 100% of traces with errors, timeouts, or unhandled exceptions. These are the cheapest correct decision in the entire pipeline.
- 100% of traces above a cost threshold — for example, the top 5% by token spend, or any trace that exceeded a per-request budget. Expensive traces are either failures-in-disguise or load-bearing edge cases; either way, you want them.
- 100% of traces below an eval-score threshold when an online judge or rubric runs. If you have a quality signal at all, retention should follow it.
- A small probabilistic baseline — 1–5% of healthy, cheap, passing traces — so you keep a coverage layer for distribution drift detection.
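A minimal sketch of that policy as a tail-time decision function follows. Every field name on `Trace` and every threshold here is an illustrative assumption, not any particular vendor's API:

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trace:
    # Hypothetical outcome signals, available only once the trace completes.
    had_error: bool               # errors, timeouts, unhandled exceptions
    token_cost: int               # total token spend
    eval_score: Optional[float]   # online judge score, if one ran

COST_P95 = 50_000       # assumed top-5% token-spend threshold
EVAL_FLOOR = 0.6        # assumed minimum acceptable judge score
BASELINE_RATE = 0.02    # probabilistic floor for healthy traces

def keep(trace: Trace) -> bool:
    """Tail-based keep/drop: the union of outcome filters and a random floor."""
    if trace.had_error:
        return True                          # 100% of errored traces
    if trace.token_cost >= COST_P95:
        return True                          # 100% of expensive traces
    if trace.eval_score is not None and trace.eval_score < EVAL_FLOOR:
        return True                          # 100% of low-scoring traces
    return random.random() < BASELINE_RATE   # coverage layer for drift
```

The union structure matters: each clause is an independent guarantee, and the random floor is what preserves a view of the healthy distribution.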
This is a different shape of dataset than head-based sampling produces. The corpus skews toward trouble on purpose, which is the whole point. When you go looking for "what does it look like when the agent gets confused about the schema of a tool we deprecated last week," the answer is in the corpus instead of being permanently lost.
The trade-off tail sampling imposes is buffering. You have to hold spans in memory until the trace completes (or until an event horizon expires), which constrains the topology of your collector deployment. Hindsight, the research system that made the post-hoc trace selection idea concrete, runs with a typical event horizon of around five seconds; long-horizon agents will need a lot more, and the buffer cost is real. You are explicitly trading some collector RAM for a debug corpus that does not lie to you.
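The buffering machinery itself is conceptually small. A sketch, with a made-up span shape and a deliberately generous horizon, since a thirty-step agent trace can stay open for minutes:

```python
import time
from collections import defaultdict

EVENT_HORIZON_S = 300  # assumption: minutes, not seconds, for long agent runs

class TailBuffer:
    """Holds spans per trace until the trace completes or the horizon expires."""

    def __init__(self):
        self._spans = defaultdict(list)   # trace_id -> [span, ...]
        self._started = {}                # trace_id -> first-seen timestamp

    def add_span(self, trace_id, span):
        self._started.setdefault(trace_id, time.monotonic())
        self._spans[trace_id].append(span)

    def complete(self, trace_id):
        """Trace finished: hand its spans to the tail-time keep/drop decision."""
        return self._spans.pop(trace_id, []), self._started.pop(trace_id, None)

    def expire(self):
        """Evict traces past the event horizon; this is what bounds collector RAM."""
        now = time.monotonic()
        stale = [t for t, s in self._started.items() if now - s > EVENT_HORIZON_S]
        return {t: self.complete(t) for t in stale}
```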
Anomaly-Biased Retention Beyond Errors
Errors and timeouts are the easy axis. The harder and more valuable axis is the silent failure — the trace that completes, returns a plausible-looking answer, and still represents something you should fix. These do not flag themselves. You have to define the anomaly signals that surface them.
A few that pay for themselves quickly (a combined scorer is sketched after the list):
- Outcome divergence from declared plan. If your agent emits a plan in step one and the executed action sequence diverges past some edit-distance threshold, retain the trace. This catches mid-flight steering failures and silent replanning loops that completed but did so wastefully.
- Tool retry counts above baseline. A trace that called the same tool four times with mutating arguments before "succeeding" is almost always interesting, even if the final response was acceptable.
- Token-per-useful-output ratio. Traces where the ratio of consumed tokens to characters of final answer sits in the far right tail usually represent the model thrashing. Keep all of them.
- Disagreement against a cheaper baseline. Run a Haiku-class judge or a deterministic rule check on a fraction of traces; retain every trace where the judge and the agent's own self-report disagree. These are the post-hoc rationalizations and the hallucinated successes — exactly the failures users complain about and trace stores never seem to contain.
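The first three signals combine naturally into a single scorer, sketched below. The `Trajectory` fields and every threshold are placeholders to calibrate against your own baselines; the judge-disagreement signal needs its own pipeline and is omitted here:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    planned_steps: list[str]    # plan emitted at step one
    executed_steps: list[str]   # actions actually taken
    max_tool_retries: int       # worst retry count for any single tool
    tokens_consumed: int
    answer_chars: int

def edit_distance(a: list[str], b: list[str]) -> int:
    """Standard Levenshtein distance over step sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def is_anomalous(t: Trajectory) -> bool:
    plan_divergence = edit_distance(t.planned_steps, t.executed_steps)
    token_ratio = t.tokens_consumed / max(t.answer_chars, 1)
    return (
        plan_divergence > 3          # assumed plan-divergence threshold
        or t.max_tool_retries > 2    # assumed retry baseline
        or token_ratio > 200         # assumed long-tail cutoff
    )
```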
Datadog's tracing agent has shipped a version of this idea for a long time: even with aggressive head-based sampling enabled, it overlays a separate error-trace ingestion path that guarantees a floor of error coverage (ten errors per second, by default). The principle generalizes — your retention layer should be the union of a probabilistic floor and a set of outcome-keyed filters, never just one or the other.
Cost-Aware Reservoir Sampling For Long-Horizon Agents
The remaining problem is bounding the bill. If you naively keep "every error trace," a single broken downstream dependency turns into a million identical retained traces overnight, and your observability cost spikes to match. The literature on weighted reservoir sampling solves exactly this shape of problem: maintain a fixed-size reservoir, where each candidate item enters with probability proportional to its weight, and items already in the reservoir are evicted by newcomers in a way that preserves the weighted sample property.
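The classic construction is the Efraimidis–Spirakis "A-Res" scheme: each arriving item draws the key u^(1/w) for a uniform random u, and the reservoir keeps the k items with the largest keys. A sketch:

```python
import heapq
import itertools
import random

class WeightedReservoir:
    """Fixed-size weighted reservoir (Efraimidis–Spirakis A-Res)."""

    def __init__(self, k: int):
        self.k = k
        self._heap = []                # (key, seq, item); smallest key at the root
        self._seq = itertools.count()  # tiebreaker so items are never compared

    def offer(self, item, weight: float):
        if weight <= 0:
            return
        key = random.random() ** (1.0 / weight)    # heavier items draw larger keys
        entry = (key, next(self._seq), item)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif key > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)   # evict the current smallest key

    def items(self):
        return [item for _, _, item in self._heap]
```

Memory stays at k entries per reservoir no matter how many traces are offered.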
Mapped onto agent traces, that becomes (the sketch after this list combines all three):
- Per-failure-mode reservoirs, not a single global one. Hash the trace by error-class-plus-tool-plus-prompt-version (or whatever your dimensionality of interest is) and run a bounded reservoir per bucket. A noisy bucket cannot drown out a quiet one. You always have some coverage of every failure mode in production, and you stop paying for the thousand-and-first identical retry storm.
- Weight by anomaly strength, not uniformly. A trace that is three standard deviations above baseline cost gets a higher reservoir weight than one that is barely past the threshold. Over a debugging session, the reservoir gravitates toward the most extreme exemplars of each failure class, which is what humans actually want to look at.
- Time-decay the weights. A trace from yesterday should outweigh a trace from three months ago for the same failure class, because production has moved on. Exponential decay on the reservoir weights handles this without a separate eviction policy.
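Putting the three bullets together on top of the `WeightedReservoir` above: the bucket key, the anomaly-strength scale, and the one-week half-life are all assumptions to tune. Note that decay is applied at insertion time, which approximates a time-decayed sample rather than re-weighting items already resident:

```python
import math
import time
from collections import defaultdict

HALF_LIFE_S = 7 * 24 * 3600  # assumed one-week half-life for recency decay

class BucketedReservoirs:
    """One bounded reservoir per failure mode, weighted by anomaly strength."""

    def __init__(self, per_bucket: int = 50):
        self._buckets = defaultdict(lambda: WeightedReservoir(per_bucket))

    def offer(self, trace, anomaly_sigma: float):
        # Bucket by whatever dimensionality you care about; these trace
        # fields are hypothetical.
        bucket = (trace.error_class, trace.tool, trace.prompt_version)
        age_s = time.time() - trace.finished_at
        decay = math.exp(-math.log(2) * age_s / HALF_LIFE_S)
        # Stronger anomalies and fresher traces carry more weight.
        self._buckets[bucket].offer(trace, weight=max(anomaly_sigma, 0.1) * decay)
```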
The combination — bounded per-bucket reservoirs, anomaly-weighted entry, time-decayed weights — gives you a debug corpus whose size you control, whose composition tracks current production, and whose contents systematically over-represent the failures you should be debugging. The bill becomes predictable: it scales with the number of buckets you care about, not the number of times the worst bucket fires.
The Diagnostic Question To Ask Your Trace Store This Week
You can run a one-line audit on whether your existing sampling is producing a usable debug corpus: take the last user-reported failure your team triaged, and ask the trace store for examples that look like it from the seven days before the user reported it. If the answer is zero, your sampling is biased away from the failures you are paid to find. If the answer is "we had to manually re-enable verbose logging and wait for it to happen again," same problem, slower diagnosis.
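Spelled out as code, against a hypothetical in-memory view of the trace store (the dict fields are illustrative):

```python
from datetime import datetime, timedelta

def audit(traces: list[dict], failure_signature: str, reported_at: datetime) -> list[dict]:
    """The audit, spelled out: which retained traces match the reported
    failure, from the seven days before the report?"""
    window_start = reported_at - timedelta(days=7)
    return [
        t for t in traces
        if t["error_class"] == failure_signature    # hypothetical field
        and window_start <= t["finished_at"] <= reported_at
    ]
```

An empty result is the diagnosis.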
Most teams that ask the question for the first time discover they have been debugging from a corpus that excludes their hardest failure modes by construction, while paying full price to retain the easiest ones in bulk. The fix is not adding more storage. It is moving the keep/drop decision to the end of the trace, weighting it by outcome anomalies, and bounding it inside per-failure-mode reservoirs that survive long enough to still be there when someone finally goes looking.
The shape of your debug dataset is a design choice. Right now most teams have made it by accident. Make it on purpose, before the next post-mortem reveals that the trace you needed was sampled out a week before the incident.
- https://www.digitalapplied.com/blog/agent-observability-2026-evals-traces-cost-guide
- https://oneuptime.com/blog/post/2026-04-01-ai-workload-observability-cost-crisis/view
- https://oneuptime.com/blog/post/2026-02-06-head-based-vs-tail-based-sampling-opentelemetry/view
- https://opentelemetry.io/blog/2022/tail-sampling/
- https://www.langchain.com/articles/agent-observability
- https://www.datadoghq.com/architecture/mastering-distributed-tracing-data-volume-challenges-and-datadogs-approach-to-efficient-sampling/
- https://innodata.com/trace-datasets-for-agentic-ai/
- https://wandb.ai/site/articles/ai-agent-observability/
- https://arxiv.org/abs/1904.04126
- https://arxiv.org/pdf/2202.05769
- https://www.usenix.org/system/files/nsdi23-zhang-lei.pdf
- https://www.helicone.ai/blog/the-complete-guide-to-LLM-observability-platforms
