
The OpenTelemetry Tail Sampler That Dropped Exactly the LLM Spans Your Post-Mortem Needed

· 11 min read
Tian Pan
Software Engineer

A user pings support: "the assistant told me to cancel my service to update my address, that's insane." Your team opens the incident, asks for the conversation ID, drops it into the tracing UI, and gets a polite "no spans found for this trace." The 24-hour retention window closed an hour ago. The tail sampler decided this conversation was a routine success because the response was a syntactically valid JSON object, returned with a 200, in 1.4 seconds. By every signal your collector understood, nothing happened.

The model returned a sentence that destroyed a customer relationship, and your observability pipeline classified it as uneventful. This is not a bug in the sampler. The sampler did exactly what you configured it to do. The problem is that the policy you wrote was designed for a request-response world where "success" and "worth keeping" were close enough to be the same thing, and you ported it unmodified into a system where they are not.

A Sampling Policy Designed for the Wrong Failure Mode

Tail-based sampling is one of the most useful tools the OpenTelemetry ecosystem ships. You let traces complete, look at the full shape, and decide what to keep. The canonical configuration is roughly: keep everything that errored, keep the long-latency outliers, keep a small probabilistic sample of the rest, drop the bulk. It is the right answer for a payments API or an order pipeline, where the population of traces you most want to debug is exactly the population that returned a non-200 or stretched past a p99 boundary.
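As a rough illustration, the canonical policy can be modeled as a single decision function. This is a minimal Python sketch of the logic, not the OpenTelemetry Collector's actual configuration or API: `TraceSummary`, `keep_trace`, the 5-second latency threshold, and the 1% sample rate are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a completed trace, as a tail sampler might see it."""
    trace_id: str
    status_code: int    # HTTP status of the root span
    duration_ms: float  # end-to-end latency


def keep_trace(trace: TraceSummary,
               latency_threshold_ms: float = 5000.0,
               sample_percent: float = 1.0) -> str:
    """Return the name of the policy that keeps the trace, or 'drop'."""
    # Policy 1: keep everything that errored (5xx is an assumption here).
    if trace.status_code >= 500:
        return "error"
    # Policy 2: keep long-latency outliers.
    if trace.duration_ms >= latency_threshold_ms:
        return "latency"
    # Policy 3: keep a small sample of the rest. Hashing the trace ID into
    # 10,000 buckets makes the decision deterministic per trace.
    digest = hashlib.sha256(trace.trace_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000
    if bucket < sample_percent * 100:
        return "probabilistic"
    return "drop"
```

Every branch reads only status and latency. A trace that returns a 200 inside the normal latency band can only be kept by the sampling lottery in the last branch.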

The policy is built on an implicit contract: the system can signal its own failures. A 5xx is a failure. A 10-second response that should have taken 200ms is a failure. The trace tells you, through the values it emitted, that it is worth your attention. The sampler's job is to listen to those signals and keep what the application has already labeled.

LLM workloads break the contract in the most boring way imaginable: they look fine. The model returns a value. The value parses. The function-calling layer accepts it. The downstream service consumes it without complaint. The latency lands in the normal band because the model produces tokens at roughly its usual rate regardless of whether those tokens form a correct answer or a confidently wrong one. Nothing in the span attributes carries the information that the output was a hallucination, an off-policy refusal, a tool call against the wrong identifier, or a piece of generic advice that a domain expert would recognize as wrong on sight.

Your sampler sees the trace, sees no error, sees latency in the green band, and classifies it according to the only rules it has. The rules say: this is a successful trace, keep 1% of these, the rest expire at the 24-hour boundary. The post-mortem you are about to write is now constrained by a decision made by a configuration file three quarters ago for a different kind of system.

When "200 OK" Stops Carrying the Same Information

The interesting failure modes in LLM systems sit in a layer that the transport protocol cannot see. The provider returns a 200 because the inference completed. Your tool layer returns a 200 because the JSON conformed to the schema. Your downstream service returns a 200 because the action it was instructed to perform did not raise. Every link in the chain reports success, and the user experience is that the assistant told them something untrue, called the wrong tool, or gave them an answer that was right on the surface and structurally misaligned with the question they actually asked.

This is the part of the architecture conversation that observability teams have not finished having. In a request-response world, the status code is a compressed summary of "did the system do the thing." That summary is good enough that the entire industry agreed to build sampling, alerting, and SLO frameworks on top of it. In a probabilistic-output world, the status code only summarizes "did the inference complete" — which is a much weaker claim. A model that has drifted, a prompt that has aged out of relevance, a tool description that misled the planner, a context window that ran out of useful signal three turns ago — none of these surface as 5xx. They surface as a user who stops trusting the product.

The practical consequence is that the most expensive failures in your AI feature are exactly the ones that look most ordinary to a sampler that was tuned for a service that returned errors when things went wrong. The samples you need for the debugging conversation tomorrow are the ones you most predictably throw away today.

Putting Semantic Quality Into the Sampling Decision

The fix is not to keep more traces. Storage budgets are real and the volume of inference spans in any serious agent system makes a 100% retention policy untenable. The fix is to widen the sampler's notion of "interesting" beyond status and latency, and let signals from the application's quality layer participate in the decision.

The most direct version of this is judge-driven sampling. At ingest, you run a cheap evaluator against the conversation — a fast classifier, a small judge model, a structured rubric over the final output — and write its score into a span attribute. The tail sampler then has a new policy: keep any trace whose judge score falls below some threshold, in addition to the error and latency policies. The cost of running the judge is paid in inference, not storage, and the inference is at small-model rates against outputs you already produced. The benefit is that the policy is now sensitive to a signal the application can generate but the transport could not.

The judge does not have to be sophisticated. A response-relevance check, a "did the output match the question shape," a "does the function call reference an entity that appears in the prompt" — these are minimal, cheap, deterministic-enough rubrics that filter out the genuinely uneventful traffic and surface the candidates a human would want to look at. Once you have the signal in the span, the sampler does the work it was always designed to do: keep the unusual, discard the routine, where "unusual" now includes a category the protocol could not name.
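To make the idea concrete, here is a minimal sketch of one such rubric: the "does the function call reference an entity that appears in the prompt" check, written into a span attribute that a sampling policy can read. The `Span` class, the attribute name `app.judge.score`, and the 0.5 threshold are hypothetical choices for illustration, not part of any OpenTelemetry convention.

```python
from dataclasses import dataclass, field

# Hypothetical attribute name; application-specific, not a standard convention.
JUDGE_ATTR = "app.judge.score"


@dataclass
class Span:
    """Stand-in for a real span: just a bag of attributes."""
    attributes: dict = field(default_factory=dict)


def entity_grounding_score(prompt: str, tool_args: dict) -> float:
    """Fraction of string tool-call arguments that appear verbatim in the prompt."""
    values = [v for v in tool_args.values() if isinstance(v, str) and v]
    if not values:
        return 1.0  # nothing to check, so treat the call as grounded
    prompt_lower = prompt.lower()
    grounded = sum(1 for v in values if v.lower() in prompt_lower)
    return grounded / len(values)


def annotate(span: Span, prompt: str, tool_args: dict) -> None:
    """Run the rubric and record its score on the span."""
    span.attributes[JUDGE_ATTR] = entity_grounding_score(prompt, tool_args)


def keep_for_low_judge_score(span: Span, threshold: float = 0.5) -> bool:
    """Extra sampling policy: keep traces the judge scored below the threshold."""
    score = span.attributes.get(JUDGE_ATTR)
    return score is not None and score < threshold
```

A substring match is crude and will misfire on paraphrased or normalized identifiers. The point of the sketch is only that a cheap, mostly deterministic check can produce a number the sampler is able to act on.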

A complementary pattern is to give judge-graded retention an upgrade path. Several of the platform players have already adopted the convention that when an online evaluator runs against a trace, the trace gets auto-upgraded to extended retention. The mechanism is the same in spirit: the act of evaluating the trace is itself a signal that the trace is worth keeping, and the retention layer should react to that rather than treating evaluation and storage as independent concerns.

Conversation-Scoped Retention and the User-Feedback Loop

Judge-driven sampling catches the class of failures an automated evaluator can identify. It does not catch the failures only the user can identify: the answer that was technically correct but missed the point, the recommendation that was reasonable in isolation and wrong for this specific customer, the explanation that sounded fine until the user actually tried to use it. For these, the signal arrives later, often hours later, when the user marks a response unhelpful or escalates to support.

Conversation-scoped retention closes this gap. The unit of retention is not the individual span or trace; it is the session. The collector tags every span in a conversation with a stable session identifier, holds the full set in extended retention for some window measured in days rather than hours, and only finalizes the sampling decision once that window closes without negative feedback. If feedback arrives within the window, the entire session is upgraded to permanent retention regardless of how innocuous any individual span looked at the moment it landed.
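The deferred decision described above can be sketched as a small in-memory buffer keyed by session. This is an illustrative model under stated assumptions, not a collector component: `SessionRetention`, its method names, and the `baseline_decision` callback are all hypothetical, and a real implementation would need durable storage rather than Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    first_seen: float                      # timestamp in seconds
    spans: list = field(default_factory=list)
    negative_feedback: bool = False


class SessionRetention:
    """Hold whole sessions for a window; decide only once the window closes."""

    def __init__(self, window_seconds: float, baseline_decision):
        self.window_seconds = window_seconds
        self.baseline_decision = baseline_decision  # callable(Session) -> bool
        self.pending = {}    # session_id -> Session, decision still deferred
        self.permanent = {}  # session_id -> Session, kept for good

    def add_span(self, session_id: str, span, now: float) -> None:
        if session_id in self.permanent:
            self.permanent[session_id].spans.append(span)
            return
        session = self.pending.setdefault(session_id, Session(session_id, now))
        session.spans.append(span)

    def record_feedback(self, session_id: str, negative: bool) -> bool:
        """Upgrade the whole session if negative feedback arrives in time."""
        session = self.pending.get(session_id)
        if session is None or not negative:
            return False
        session.negative_feedback = True
        self.permanent[session_id] = self.pending.pop(session_id)
        return True

    def finalize_expired(self, now: float) -> dict:
        """Apply the baseline policy to sessions whose window has closed."""
        decisions = {}
        for sid, session in list(self.pending.items()):
            if now - session.first_seen >= self.window_seconds:
                del self.pending[sid]
                keep = self.baseline_decision(session)
                if keep:
                    self.permanent[sid] = session
                decisions[sid] = keep
        return decisions
```

The design choice worth noticing is that feedback upgrades the session as a unit, so the turns before the bad output survive along with it.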

The cost is bounded by the length of the deferred-decision window and the volume of sessions that pass through it; only sessions that draw negative feedback are kept beyond the window. For most products that overhead is small enough to absorb. The benefit is that the post-mortem conversation no longer starts with "the traces expired." It starts with the full session that produced the complaint, including the turns before the bad output where the model's context was already drifting, including the tool calls that succeeded structurally and shaped the wrong plan.

The deeper shift here is treating the user's feedback as a first-class observability signal rather than an analytics signal. Engineering teams have historically treated thumbs-up/thumbs-down and "this didn't help" links as product telemetry, routed to a different system than the one that holds the trace data. The result is that the post-mortem requires correlating two stores that were not designed to be joined, on identifiers that may or may not survive the trip between systems. The simplest version of the fix is to write the feedback event as a span on the original trace and let the existing trace-retention policy carry both.

Surfacing the Signals the Sampler Was Never Told to Recognize

Beyond judge scores and user feedback, there is a longer list of signals AI workloads can emit that traditional samplers were never built to consider. Tool-call rejection rates inside a single conversation — the model trying the same argument shape against the same tool until it gives up — say something the latency histogram does not. Token consumption that spikes far above the median for a given prompt class usually indicates the agent has entered some kind of degenerate loop, even if it eventually terminated successfully. Refusal-pattern drift, where the model starts hedging or declining where it previously answered, indicates a model-version change underneath you that your release notes did not announce.
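Two of these signals are simple enough to sketch. The example below is a hedged illustration: the function names, the 3x-median factor, and the three-repeat cutoff are assumptions picked for the example, and real thresholds would have to be tuned per prompt class.

```python
from collections import Counter
from statistics import median


def token_spike(tokens_used: int, history: list, factor: float = 3.0) -> bool:
    """True if usage exceeds `factor` times the median for this prompt class."""
    if not history:
        return False  # no baseline yet, so no signal
    return tokens_used > factor * median(history)


def repeated_rejections(tool_calls: list, min_repeats: int = 3) -> bool:
    """True if the same tool was rejected repeatedly with the same argument shape.

    Each entry in tool_calls is (tool_name, args_dict, accepted).
    The "shape" of a call is its tool name plus its sorted argument names.
    """
    rejected = Counter(
        (name, tuple(sorted(args)))
        for name, args, accepted in tool_calls
        if not accepted
    )
    return any(count >= min_repeats for count in rejected.values())


def derived_attributes(tokens_used: int, history: list, tool_calls: list) -> dict:
    """Attributes to write onto the span so the sampler can see the signals."""
    return {
        "app.signal.token_spike": token_spike(tokens_used, history),
        "app.signal.tool_retry_loop": repeated_rejections(tool_calls),
    }
```

The `app.signal.*` attribute names are hypothetical. They sit in the application-specific layer above whatever the standard conventions define.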

None of these are standard tail-sampling policies. They are derived signals that have to be computed at ingest, written into span attributes, and exposed to the sampler. The GenAI semantic conventions in OpenTelemetry are converging on a vocabulary for some of this — token usage, model parameters, tool-call structure — but the conventions remain in active development through 2026, and the application-specific signals (your judge score, your domain rubric, your feedback signal) live above the standard layer and require teams to invent their own attribute schema. That work is unavoidable. The standard layer will eventually stabilize what is common; the things that matter most for your product will not be part of it.

The discipline this requires is unfamiliar. Sampling policies historically lived with the platform team and changed once every few quarters when somebody noticed the storage bill. AI-aware sampling policies need to live closer to the application: the team writing the evaluators is the team who knows which output patterns are worth preserving, and the policy needs to follow the rubric. Treating the sampling configuration as an artifact owned jointly by the platform team and the AI team — versioned with the eval suite, reviewed when the eval rubric changes — is a small organizational shift that solves a large class of "we don't have the trace anymore" incidents.

What Changes in the Post-Mortem

The version of this conversation that engineering leadership cares about is what gets recovered when a real incident lands. Today, when a user complains about an output your AI feature produced, the time to first useful data is bounded by whether the sampler kept the trace. If it did, you can start investigating immediately. If it did not, you spend the first hour of the incident trying to reproduce the failure, often unsuccessfully, because a probabilistic system does not reliably reproduce its own failures.

In the version where sampling policies have evolved to recognize semantic quality, the trace is almost always there. The judge flagged it at ingest, or the user feedback upgraded the retention, or the unusual token signature in the agent loop caused the policy to keep it. The conversation starts with data. The data is imperfect — judge scores are themselves noisy, feedback is sparse, derived signals can mislead — but the alternative is an empty UI where a sample size of zero is teaching your team that this kind of failure cannot be debugged.

The cultural piece worth naming is that observability practices were imported into AI systems from a world where they worked beautifully. Tail sampling was one of the most elegant designs distributed tracing produced. It is not the discipline's fault that it was tuned for a different failure surface. The work in front of teams running AI features in production is to keep the parts of the observability stack that still apply — span structure, context propagation, attribute conventions — and update the parts that quietly assumed "200 OK" carried as much information as it used to. The sampler is one of the parts that needs updating, and the longer the update is deferred, the more post-mortems begin with the sentence that should not be possible in 2026: we don't have the data.
