Trace Sampling for Agents: Which of 10 Million Daily Spans Are Worth Keeping
A web service request produces five spans on a busy day. A modern agent session produces fifty, sometimes a thousand if the planner decides to recurse. The uniform 1% sampler your platform team copy-pasted from the microservices era will, with near certainty, drop the rare failure you actually care about — because the failure is rare, and uniform sampling has no opinion about rarity.
The honest version of "we have full observability on our agents" sounds different from the marketing version. It sounds like: we keep the traces that matter, drop the ones that don't, and we know in advance which is which. Every word in that sentence is load-bearing, and the platform teams that ignored sampling design until the bill arrived are now learning the discipline backwards — under cost pressure, after a quarter of incidents that were "in the data" but evicted before anyone looked.
This post is about the four decisions that drive your agent trace bill and your incident MTTR at the same time: what to sample at trace start, what to keep at trace end, how to stratify so rare populations don't disappear, and how long to keep what you kept. None of these are technically novel — most have OpenTelemetry primitives — but the operating points that work for a stateless API don't work for an agent, and copying the defaults forward is how teams end up with both an expensive observability stack and a thin one.
Why Web-Service Sampling Defaults Break for Agents
The defaults you inherited assume a roughly linear relationship between traffic and trace volume: more requests, more spans, more storage, all scaling proportionally. The instrumentation a typical service emits — entry span, a few downstream calls, a database hit — fits in a kilobyte and tells you the request's story end-to-end.
Agents violate every part of that model. A single user message fans out into intent classification, retrieval, ranking, multiple LLM turns with reasoning, several tool calls each producing their own retries, and a final synthesis pass. Industry telemetry estimates put a typical agent trace at roughly twenty-five kilobytes — about fifty times larger than a traditional API trace — and a moderately busy product at five hundred thousand traces per day with millions of spans riding on top. AI workloads generate ten to fifty times more telemetry than the equivalent legacy service, and that ratio shows up directly on the observability invoice.
The next default that breaks is head-based sampling. A `TraceIdRatioBasedSampler` flips a coin at trace start and either keeps everything that follows or nothing — cheap, simple, and exactly wrong for agent workloads. The interesting properties of an agent trace (did it error, did it cost more than expected, did the eval score drop, did the planner loop) are not knowable at trace start. The trace where the agent recursively retried a malformed tool call thirty times and burned $4 looks identical at span zero to the trace where it answered correctly on the first turn. Head-based sampling drops both with the same probability, and you find out which you needed when an account manager opens a ticket about a customer's bill.
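For concreteness, here is the coin flip expressed with the OpenTelemetry Python SDK's built-in samplers; the 1% ratio is the illustrative value from above:

```python
# Head-based sampling with the OpenTelemetry Python SDK. The keep/drop
# coin is flipped at the root span, from the trace ID alone, before any
# of the interesting attributes (error, cost, eval score) exist.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~1% of traces; children follow the parent's decision, so the whole
# trace is kept or dropped together, whether it errored or not.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.01)))
trace.set_tracer_provider(provider)
```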
The third default that breaks is uniform sampling across populations. Production agents serve heterogeneous traffic: a handful of enterprise tenants generating millions of sessions a day, alongside hundreds of small tenants generating dozens. A flat one-percent sampler retains thousands of traces from the elephants and zero from the long tail — and the failure modes you most need to debug, like a tenant whose schema doesn't match a tool's expected input, live exclusively in the long tail.
What "Interesting" Means at Trace End
Tail-based sampling moves the keep/drop decision to the moment the trace completes. The collector buffers spans for the lifetime of the trace, evaluates a policy when the root span closes, and then either ships the whole trace to long-term storage or discards it. The cost is real — buffering memory, additional collector infrastructure, the operational complexity of running a stateful sampling tier — but for agents, it's the only sampling mode that knows what just happened.
The policy that survives contact with production has roughly four buckets; a sketch of the evaluator follows the list:
- Keep everything that errored or timed out. Errors are by far the most valuable traces in the corpus, and they are by definition rare. A 100% retention rule on `status=error` costs you almost nothing in storage and saves you the conversation where the on-call engineer asks why the failing trace from last Tuesday isn't in the system.
- Keep everything above a cost threshold. Define the threshold as the top one to five percent of traces by token spend, or as an absolute dollar cutoff per session. Agents fail expensively before they fail loudly — the trace where your planner spun in a loop for thirty seconds is your incident, and it is also the trace your CFO asks about next quarter.
- Keep everything with a low eval score. If you have online scoring or a heuristic for output quality (refusal rate, JSON parse failure, downstream user-rejection click), retain 100% of the traces that scored below threshold. These are the traces that taught you something the model didn't already know.
- Keep a stratified slice of the healthy population. One to five percent is fine for the rest, but stratify by tenant, by feature, by tool-call type, or by user cohort so the sample preserves shape. A flat probabilistic sampler over an unbalanced corpus produces a sample that looks nothing like the population.
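Here is that sketch, as a keep/drop function evaluated when the trace completes. `Trace`, the threshold constants, and the `stratum_deficit` bookkeeping are hypothetical stand-ins for whatever your collector exposes, not a real collector API:

```python
import random
from dataclasses import dataclass

@dataclass
class Trace:                       # hypothetical summary of a completed trace
    status: str                    # "ok" | "error" | "timeout"
    cost_usd: float                # total token spend for the session
    eval_score: float | None       # online quality score, if scored
    stratum: tuple                 # e.g. (tenant_id, task_type)

COST_CUTOFF = 0.80                 # assumed: ~p95 of recent per-trace spend
EVAL_FLOOR = 0.6                   # assumed quality threshold
BASELINE_RATE = 0.02               # 2% of the healthy population

def keep(trace: Trace, stratum_deficit: dict[tuple, int]) -> bool:
    if trace.status in ("error", "timeout"):
        return True                                 # bucket 1: 100% of failures
    if trace.cost_usd >= COST_CUTOFF:
        return True                                 # bucket 2: cost outliers
    if trace.eval_score is not None and trace.eval_score < EVAL_FLOOR:
        return True                                 # bucket 3: low eval scores
    if stratum_deficit.get(trace.stratum, 0) > 0:   # bucket 4: per-stratum floor...
        stratum_deficit[trace.stratum] -= 1
        return True
    return random.random() < BASELINE_RATE          # ...plus the baseline slice
```

The order matters: the unconditional buckets short-circuit before the probabilistic one, so a trace that is both errored and cheap is still kept.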
The fourth bullet is where most teams under-invest, and it's the one that costs them later. Errors and cost outliers tell you about today's incident. The stratified healthy sample is your eval feedstock and your distributional baseline — it's how you'll notice next quarter that the seventh-largest tenant's average tool-call latency drifted, or that the new model rev quietly changed the refusal rate on a specific task type.
Stratification Is the Difference Between Coverage and Bias
Agents serve power-law-distributed traffic. A naive sampler under a power law produces a sample that looks like the head and tells you nothing about the tail. The corrective is to define strata explicitly — tenant tier, region, task type, tool used — and target a per-stratum sample rate or a per-stratum minimum.
Per-stratum minima are the version of this that actually works. Set a floor: every tenant retains at least N traces per day, every task type retains at least M traces per day, regardless of the global rate. The elephants will be over-represented anyway because they generate so much volume; the floor protects the tail without explicitly capping the head.
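A minimal sketch of that floor, assuming a single collector instance holding the counter in memory (a real deployment would need this state shared or sharded) and illustrative numbers:

```python
import random
from collections import defaultdict

MIN_PER_DAY = 50      # assumed floor per (tenant, task_type) stratum
GLOBAL_RATE = 0.01    # flat rate applied once the floor is met

kept_today: dict[tuple, int] = defaultdict(int)   # reset at day rollover

def keep_for_stratum(stratum: tuple) -> bool:
    if kept_today[stratum] < MIN_PER_DAY:
        kept_today[stratum] += 1             # floor: protects the long tail
        return True
    return random.random() < GLOBAL_RATE     # head: elephants dominate anyway
```

As written, the floor is first-come: it keeps the first fifty traces of the day per stratum. A per-stratum reservoir sample removes that time-of-day bias, at the cost of buffering.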
The stratification also has to be principal-aware in a way the legacy patterns aren't. The non-negotiable rule is "tag at the root span, propagate to every child": `user_id`, `tenant_id`, `session_id`, `task_type`, `model_version`, `prompt_version`, and `tool_set` attached at the entry point and inherited through every LLM call, retrieval span, and tool invocation. Without that propagation, the tail-based sampler at the collector has no way to make a per-tenant decision, and the cost-attribution rollup you'll need at the end of the quarter has nothing to roll up.
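A sketch of that pattern with the OpenTelemetry Python API, tagging the root span directly and mirroring the same keys into baggage so they travel with the context. `run_agent` is a hypothetical stand-in for your agent loop, and the remaining keys (`model_version`, `prompt_version`, `tool_set`) follow the same pattern:

```python
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("agent")

def start_session(user_id: str, tenant_id: str,
                  session_id: str, task_type: str) -> None:
    with tracer.start_as_current_span("agent.session") as root:
        tags = {"user_id": user_id, "tenant_id": tenant_id,
                "session_id": session_id, "task_type": task_type}
        for key, value in tags.items():
            root.set_attribute(key, value)                   # tag the root span
            context.attach(baggage.set_baggage(key, value))  # and the context
        run_agent()  # hypothetical agent loop; spans created inside
                     # inherit the current context and its baggage
```

Baggage rides the trace context across process boundaries; copying baggage entries onto each child span still needs a span processor, which the OTel contrib ecosystem has packages for and which is otherwise a few lines to write yourself.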
The stratification surface is also where rare-failure detection lives. The traces that contain genuinely novel agent failures — a new prompt-injection vector, a tool returning malformed JSON for the first time, a retrieval surfacing a document that violates a content policy — are by definition in the long tail of attribute combinations. A sampler that targets coverage of attribute combinations rather than just trace count is the one that catches those before they become public.
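One cheap approximation of that idea: keep the first few traces of every attribute combination you haven't seen before. The budget and the choice of attributes are assumptions:

```python
from collections import Counter

NOVELTY_BUDGET = 5   # assumed: keep the first 5 traces per new combination

seen: Counter = Counter()

def keep_if_novel(attrs: dict) -> bool:
    combo = (attrs.get("tenant_id"), attrs.get("task_type"),
             attrs.get("tool"), attrs.get("model_version"))
    seen[combo] += 1
    return seen[combo] <= NOVELTY_BUDGET   # novel surface: retain unconditionally
```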
The Storage Tier Hierarchy You Actually Need
Once tail-based sampling has selected what to keep, the next decision is how long to keep it. Treating "kept" as a binary — either the trace is in storage or it's gone — is the second bill-shaped mistake teams make.
The tiering that holds up has at least three rungs; the aging logic is sketched after the list:
- Hot tier, twenty-four to seventy-two hours. Full trace payloads, indexed and queryable, available for live debugging. This is the tier the on-call engineer hits during an incident, and it's expensive per gigabyte. Most traces should expire here.
- Warm tier, thirty days. Full payloads at lower cost (object storage with a query layer, or a columnar OLAP system like ClickHouse that achieves ten-to-fifty-times compression on agent traces). This is the tier your eval team and fine-tune feedstock pipelines read from. Cheap enough that retaining the stratified healthy sample plus everything error-flagged is sustainable.
- Cold tier, indefinitely, only for explicitly promoted traces. Traces tied to incidents under investigation, traces tagged for legal or compliance hold, traces curated as evaluation gold-set members. Cheap per gigabyte, expensive to query. The promotion mechanism — incident response promotes traces, eval team promotes traces, nothing else — is the discipline that keeps the cold tier from becoming a second hot tier.
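The aging logic is small enough to write down next to the policy document. A sketch, with the TTLs taken from the illustrative values in the list above:

```python
def tier_for(age_days: float, promoted: bool) -> str | None:
    if promoted:
        return "cold"   # explicit promotion is the only path in
    if age_days <= 3:
        return "hot"    # full payload, indexed, expensive per GB
    if age_days <= 30:
        return "warm"   # full payload, object storage / OLAP
    return None         # payload expired; only the metadata survives
```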
A fourth tier worth designing in: a metadata-only index that survives even after the payload expires. Per-trace metadata — duration, total cost, error status, top-level user/tenant tags, eval score — is tiny compared to the trace itself, and keeping ninety days of metadata costs almost nothing. When a customer asks about a session from two months ago and the payload is gone, the metadata can still tell you the trace existed, when it ran, what it cost, and whether it succeeded — enough to either move on or, if the answer matters, to schedule a re-execution.
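A sketch of what such a record might carry. The field names are illustrative; the point is the size, tens of bytes against a roughly twenty-five-kilobyte payload, which is why ninety days of them costs almost nothing:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TraceReceipt:              # illustrative field names
    trace_id: str
    tenant_id: str
    user_id: str
    started_at: int              # unix seconds
    duration_ms: int
    total_cost_usd: float
    status: str                  # "ok" | "error" | "timeout"
    eval_score: float | None
```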
The metadata index is also what makes the per-tenant promise honest. "We don't have the trace, but we have the receipt" is a real answer for the long tail. "We don't know if a trace ever existed" is the answer that ends a customer relationship.
The Finance Conversation Nobody Wanted To Have
The platform team that promised "we trace everything" did so when the agent caseload was small enough that the promise was true and cheap. The promise becomes a problem somewhere between five hundred thousand traces a day and ten million — the inflection where your observability vendor's invoice starts arriving with the same energy as your inference bill, and the CFO starts asking which line item is which.
The walk-back is uncomfortable but unavoidable. The honest update sounds like: we will keep one hundred percent of the traces that matter, with explicit definitions of "matter," and we will be transparent about what we drop. The sampling policy becomes a documented contract — versioned, reviewed, owned — rather than a default someone set in 2024 and nobody touched since.
The retention numbers should appear on the same review document as the inference budget. Both are storage costs that scale with usage and both are subject to the same tradeoff: spend more for higher-fidelity post-incident reconstruction, or spend less and accept that some incidents will be debugged from logs and metrics alone. Treating them as a unified observability budget — with the stratification policy, the tier transitions, the cold-tier promotion criteria, and the metadata-index retention all written down — is what separates the platform teams that survived the agent-cost transition from the ones still arguing about it in retro.
Sampling Is Now an Observability-Defining Decision
The pattern across all four of these — what to sample at start, what to keep at end, how to stratify, how long to retain — is that sampling is no longer a knob you set once and forget. For agents, sampling is the load-bearing piece of your observability strategy. The decisions you make in the sampling tier determine whether next quarter's incident is a forty-five-minute investigation or a three-day archeology project, and whether your eval pipeline gets fed representative data or a biased slice of one tenant's usage.
The teams shipping reliably have already converged on a recognizable shape: tail-based sampling at the collector, stratified retention with per-tenant minima, three-tier storage with explicit promotion criteria, a metadata index that outlives the payload, and a documented policy that the on-call engineer, the eval team, and the finance partner have all signed off on. None of this is exotic. What's exotic is doing it before the bill forces the conversation, instead of after.
Sources
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- https://www.datadoghq.com/architecture/mastering-distributed-tracing-data-volume-challenges-and-datadogs-approach-to-efficient-sampling/
- https://docs.datadoghq.com/opentelemetry/ingestion_sampling/
- https://oneuptime.com/blog/post/2026-01-24-head-based-vs-tail-based-sampling/view
- https://oneuptime.com/blog/post/2026-04-01-ai-workload-observability-cost-crisis/view
- https://www.digitalapplied.com/blog/agent-observability-2026-evals-traces-cost-guide
- https://uptrace.dev/blog/opentelemetry-ai-systems
- https://www.controltheory.com/resources/tail-sampling-with-the-otel-collector/
- https://langfuse.com/docs/observability/overview
- https://www.getmaxim.ai/blog/basics-of-ai-observability-sessions-traces-and-spans/
