The OpenTelemetry Tail Sampler That Dropped Exactly the LLM Spans Your Post-Mortem Needed
A user pings support: "the assistant told me to cancel my service to update my address, that's insane." Your team opens the incident, asks for the conversation ID, drops it into the tracing UI, and gets a polite "no spans found for this trace." The 24-hour retention window closed an hour ago. The tail sampler decided this conversation was a routine success because the response was a syntactically valid JSON object, returned with a 200, in 1.4 seconds. By every signal your collector understood, nothing happened.
The model returned a sentence that destroyed a customer relationship, and your observability pipeline classified it as uneventful. This is not a bug in the sampler. The sampler did exactly what you configured it to do. The problem is that the policy you wrote was designed for a request-response world where "success" and "worth keeping" were close enough to be the same thing, and you ported it unmodified into a system where they are not.
