Skip to main content

One post tagged with "sampling"

View all tags

Sampling Bias in Agent Traces: Why Your Debug Dataset Silently Excludes the Failures You Care About

· 9 min read
Tian Pan
Software Engineer

The debug corpus your team stares at every Monday is not a representative sample of production. It is an actively biased one, and the bias is in exactly the wrong direction. Head-based sampling at 1% retains the median request a hundred times before it keeps a single rare catastrophic trajectory — and most teams discover this only when a failure mode that has been quietly recurring for months finally drives a refund or an outage, and they go looking for examples in the trace store and find none.

This is not an exotic edge case. It is the default behavior of every observability stack that was designed for stateless web services and then pointed at a long-horizon agent. The same sampling math that worked fine for HTTP request tracing systematically erases the trajectories that matter most when each "request" is a thirty-step plan that may invoke a dozen tools, regenerate three subplans, and consume tens of thousands of tokens before something subtle goes wrong on step twenty-seven.

The fix is not "sample more." Sampling more makes the bill explode without changing the bias — you just get more of what you already had too much of. The fix is to change what you sample, keyed on outcomes you can only know after the trajectory finishes. That requires throwing out the head-based defaults and rebuilding the retention layer around tail signals, anomaly weighting, and bounded reservoirs that survive the long tail of agent execution.