
The Streamed-Response Trace Schema Gap: Why Your APM Lies About LLM Latency

10 min read
Tian Pan
Software Engineer

A pager fires at 02:14: customer reports that the assistant "freezes mid-sentence" on long answers. You open the trace. The span for the LLM call shows 8.4 seconds — green, within SLO, no error attribute, finish reason stop. The dashboard widget that aggregates p95 latency for that endpoint is sitting at 9.1s, exactly where it has been for a month. By every signal the APM exposes, the request succeeded.

The user saw the first 200 milliseconds look great, watched the next four seconds produce a coherent paragraph, then watched the same three-sentence fragment repeat for the remaining four seconds before the connection ended. The stuck content loop is a real failure, and the trace knows nothing about it — because the trace was designed for a system that finishes when it returns, not for a system whose behavior is the wall of intermediate state it produced along the way.

This gap is the single most expensive thing an AI engineering team can ignore in 2026. The dashboards say the system is healthy. The users say it is broken. Both are correct, because they are looking at different events, and the events the dashboard collapses into a duration metric are exactly the events the users are reacting to.

Why the request/response span model fails for streams

Distributed tracing was built around a contract that fit microservices well: a span has a start time, an end time, and a status. The interesting events — the auth check, the cache miss, the downstream call — happen at boundaries that look like other spans nested inside, and the duration between start and end is a reasonable proxy for "how long did the work take." Add a finish-reason attribute, a token count, an error event, and you have something close to what production engineers need to debug a typical RPC.

A streamed LLM response violates the contract in a specific way. The work is not bounded; it is a process unfolding in time, and the user is consuming the intermediate state as it arrives. Time-to-first-token at 180ms feels instant. Time-to-first-token at 2.4s feels broken even if total wall time is identical. A model that produces 600 tokens evenly across six seconds feels fluid. A model that produces 540 of those tokens in the first 1.2 seconds and then stalls for the next 4.8 seconds is the same total latency and a different product. The single duration metric the span carries cannot tell those apart. Worse, every dashboard built on that metric will report them as equivalent.

The OpenTelemetry GenAI semantic conventions that stabilized in early 2026 land most of the static attributes the community needed — model name, input and output token counts, finish reason, prompt and completion as events. They explicitly do not solve the temporal-shape problem. The span still ends when the stream ends, and the attributes still describe the totals, not the trajectory.

What gets lost between the boundaries

Five distinct failure modes live entirely inside the window the span is currently treating as opaque:

The first is time-to-first-token regression. A model that starts producing in 180ms versus 2.4s is the same gen_ai.response.duration in many implementations, because the duration metric covers the whole stream. Users disagree with the duration; they tell you the slow one feels broken even though it produced more tokens per second on average. TTFT is the single biggest perceived-latency lever for streaming UIs, and the span model treats it as an internal detail of the first chunk handler.

The second is mid-stream stalls and content loops. Time-per-output-token (TPOT) and mean time-between-tokens (TBT) are useful averages, but the average drowns out the tail. A four-second stall in the middle of an eight-second response barely moves TPOT because TPOT is normalized by the large number of tokens that arrived quickly before and after. The paper Metron from the inference-serving community has been explicit about this for over a year: TPOT and normalized latency statistically obscure the user-visible failures. Tail-TBT is what reveals stalls. Nobody in the typical observability stack is computing tail-TBT.
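To make the statistical point concrete, here is a minimal sketch of the arithmetic in Python, assuming you kept per-chunk arrival timestamps in milliseconds since the request started; the helper name and the steady-then-stall example data are illustrative, not taken from any library or real trace:

from statistics import quantiles

def tbt_stats(arrival_ms: list[float]) -> dict[str, float]:
    """Compute inter-token gap statistics from per-chunk arrival times.

    arrival_ms: arrival time of each chunk, in ms since the request started.
    The mean is roughly what a TPOT-style average reports; the max is what
    reveals a stall.
    """
    gaps = [b - a for a, b in zip(arrival_ms, arrival_ms[1:])]
    if not gaps:
        return {}
    pct = quantiles(gaps, n=100, method="inclusive")  # pct[k-1] ~ k-th percentile
    return {
        "tbt_mean_ms": sum(gaps) / len(gaps),
        "tbt_p50_ms": pct[49],
        "tbt_p99_ms": pct[98],
        "tbt_max_ms": max(gaps),
    }

# 188 chunks arriving at steady 24 ms gaps, then one 4.18 s stall at the end.
arrivals = [184.0 + 24.0 * i for i in range(188)]
arrivals.append(arrivals[-1] + 4180.0)
print(tbt_stats(arrivals))

On that data the p50 and p99 gaps stay at 24 ms and the mean only roughly doubles, while the max exposes the 4.18-second stall, which is exactly the separation the averaged metrics erase.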

The third is mid-stream tool-call decisions. In agentic streams, the model emits tokens, decides to call a tool, the tool runs, the model resumes streaming. Each transition is a state change the span model has no first-class place for. Some teams approximate this with child spans for tool calls, which works for the dispatch but loses the relationship between when in the output stream the decision happened and what the model was producing right before it. "We called the search tool 1.8 seconds in" is a different debugging signal than "we called the search tool."

The fourth is partial-output abort signals. The user hits stop, the client disconnects, the server kills the upstream call, or the model emits a finish_reason of length because the context budget ran out mid-thought. To the APM these often resolve as "request completed" because the stream technically ended. To the user, three of those four were failures. Untangling them requires recording the tail event as something more granular than "completed."

The fifth is content-quality drift inside the window. The output looked good for the first 200 tokens, drifted into hedging language, and ended with a hallucinated citation. There is no static span attribute that captures this, but a checkpointed quality probe at fixed token intervals — even a cheap one — would let you bisect when the drift started. Today, almost no production stack does this. The team finds out from a thumbs-down five hours later.
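One example of what a cheap probe can look like, sketched in Python and aimed at the stuck-content loop from the opening incident rather than at semantic drift; the window sizes are arbitrary illustrations and a production probe would likely be task-specific:

def tail_is_looping(partial: str, min_period: int = 20, max_period: int = 400) -> bool:
    """Crude loop probe: does the partial output end with the same fragment twice in a row?

    Run at each checkpoint; True means the last p characters exactly repeat
    the p characters before them for some plausible fragment length p.
    """
    n = len(partial)
    for p in range(min_period, min(max_period, n // 2) + 1):
        if partial[-p:] == partial[-2 * p:-p]:
            return True
    return False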

The discipline that has to land

Treat streaming as a real-time data stream whose observability primitives are different from request/response. That is the architectural framing. In practice it decomposes into three concrete moves.

Token-time-axis events as first-class span attributes. Stop relying on a single duration field. Record at minimum: first-token timestamp, finish timestamp, per-chunk arrival timestamps for at least the first N chunks and a logarithmic sample beyond that, and the cumulative output length at each. The OpenTelemetry GenAI conventions already encourage emitting inputs and outputs as events rather than attributes; the same mechanism extends naturally to per-chunk events. Most teams have all the data inside their streaming SDK and throw it away before it crosses the span boundary.
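A minimal sketch of the capture, assuming an OpenTelemetry span is already open and that chunks arrive as an iterator of text deltas from whatever streaming SDK is in use; the gen_ai.response.ttft_ms attribute and the gen_ai.chunk event name follow this post's convention, not a finalized semantic convention:

import time
from typing import Iterable, Iterator
from opentelemetry import trace

def traced_stream(chunks: Iterable[str], span: trace.Span,
                  dense_chunks: int = 32) -> Iterator[str]:
    """Yield chunks unchanged while recording their arrival on the span.

    Records first-token latency, per-chunk arrival events (every chunk for the
    first dense_chunks, then a doubling sample), and the final duration.
    """
    start = time.monotonic()
    tokens = 0
    chars = 0
    next_sampled = dense_chunks
    for i, chunk in enumerate(chunks):
        now_ms = (time.monotonic() - start) * 1000.0
        tokens += 1          # chunk count as a proxy; use real token counts if the SDK exposes them
        chars += len(chunk)
        if i == 0:
            span.set_attribute("gen_ai.response.ttft_ms", now_ms)
        if i < dense_chunks or i >= next_sampled:
            span.add_event("gen_ai.chunk", {"t_ms": now_ms, "tokens": tokens, "chars": chars})
            if i >= next_sampled:
                next_sampled *= 2    # logarithmic sampling beyond the dense window
        yield chunk
    span.set_attribute("gen_ai.response.duration_ms", (time.monotonic() - start) * 1000.0)

Wrapping the provider's chunk iterator with a generator like this inside the existing start_as_current_span block is the whole integration; nothing downstream of the SDK has to change.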

Partial-output checkpoints captured at fixed intervals or interesting tokens. Every N tokens, or at every transition the application cares about (the model starts emitting JSON, the tool-call delimiter appears, a stop sequence is approached), record a checkpoint event with the partial content length, current token count, and elapsed time. This is what makes a stall debuggable: you can see the timeline of length=412 at t=1.1s, length=412 at t=2.0s, length=412 at t=3.0s. The dashboard widget that detects "checkpoint stalled" is a one-line alert once the data is there.
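A sketch of one way to capture those checkpoints, assuming an async chunk iterator and an already-open span; the one-second interval, the event name, and the counter fields are illustrative choices. Driving the checkpoint off the wall clock rather than off token arrival is what makes the frozen timeline visible:

import asyncio
import time
from opentelemetry import trace

async def consume_with_checkpoints(chunks, span: trace.Span, every_s: float = 1.0):
    """Consume a stream while a wall-clock watchdog records checkpoint events.

    Token-driven events go quiet during a stall; a time-driven checkpoint keeps
    firing, so a frozen stream shows up as repeated events with the same length.
    """
    start = time.monotonic()
    state = {"tokens": 0, "length": 0, "done": False}

    async def watchdog():
        while not state["done"]:
            span.add_event("gen_ai.checkpoint", {
                "t_ms": (time.monotonic() - start) * 1000.0,
                "tokens": state["tokens"],
                "length": state["length"],
            })
            await asyncio.sleep(every_s)

    task = asyncio.create_task(watchdog())
    try:
        async for chunk in chunks:
            state["tokens"] += 1
            state["length"] += len(chunk)
    finally:
        state["done"] = True
        await task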

A tail-event taxonomy that distinguishes how the stream actually ended. "Completed" is not enough. Distinguish at minimum: stream_completed_natural (model emitted stop), stream_completed_length_cap (hit max tokens), stream_stalled (no token in N seconds while connection alive), client_disconnect (TCP/SSE close mid-stream), server_abort (upstream killed it), tool_handoff (paused for tool call), safety_intervention (refusal mid-stream). Half of these resolve as identical status codes in raw HTTP semantics; the difference matters enormously for product decisions.
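Sketched as code, the taxonomy is an enum plus a classifier over whatever end-of-stream facts the serving layer can observe; the StreamEnd fields and the three-second stall threshold are assumptions for illustration, not a standard:

from dataclasses import dataclass
from enum import Enum

class TailEvent(str, Enum):
    COMPLETED_NATURAL = "stream_completed_natural"         # model emitted stop
    COMPLETED_LENGTH_CAP = "stream_completed_length_cap"   # hit max tokens
    STALLED = "stream_stalled"                             # no token in N seconds, connection alive
    CLIENT_DISCONNECT = "client_disconnect"                # TCP/SSE close mid-stream
    SERVER_ABORT = "server_abort"                          # upstream killed it
    TOOL_HANDOFF = "tool_handoff"                          # paused for a tool call
    SAFETY_INTERVENTION = "safety_intervention"            # refusal mid-stream

@dataclass
class StreamEnd:
    finish_reason: str | None   # provider finish_reason, if one ever arrived
    client_closed: bool         # did the client close the connection first?
    max_gap_ms: float           # longest observed time-between-tokens
    pending_tool_call: bool     # did the model hand off to a tool?

def classify_tail(end: StreamEnd, stall_threshold_ms: float = 3000.0) -> TailEvent:
    if end.pending_tool_call:
        return TailEvent.TOOL_HANDOFF
    if end.client_closed and end.finish_reason is None:
        return TailEvent.CLIENT_DISCONNECT
    if end.max_gap_ms >= stall_threshold_ms:
        return TailEvent.STALLED
    if end.finish_reason == "length":
        return TailEvent.COMPLETED_LENGTH_CAP
    if end.finish_reason == "stop":
        return TailEvent.COMPLETED_NATURAL
    if end.finish_reason == "content_filter":
        return TailEvent.SAFETY_INTERVENTION
    return TailEvent.SERVER_ABORT   # stream died with no finish_reason at all

Note that the stall check runs before the finish_reason checks, which is why the example span below can carry finish_reason = stop and still be labeled stream_stalled.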

A short example of what the trace should hold

Here is what one span carrying a stalled response should look like in attribute form, compressed:

gen_ai.request.model = claude-opus-4-7
gen_ai.response.ttft_ms = 184
gen_ai.response.duration_ms = 8420
gen_ai.response.finish_reason = stop
gen_ai.tail_event = stream_stalled
gen_ai.response.tbt_p50_ms = 24
gen_ai.response.tbt_p99_ms = 31
gen_ai.response.tbt_max_ms = 4180
gen_ai.checkpoints = [
{ t: 184, tokens: 1, bytes: 4 },
{ t: 612, tokens: 38, bytes: 162 },
{ t: 1280, tokens: 102, bytes: 480 },
{ t: 2400, tokens: 188, bytes: 894 },
{ t: 3200, tokens: 188, bytes: 894 },
{ t: 4400, tokens: 188, bytes: 894 },
{ t: 8420, tokens: 188, bytes: 894 }
]

The duration is still 8420ms. The finish reason is still stop. The same dashboard query that passed it as healthy now has every signal needed to flag it as a stall: TBT max of 4.18 seconds despite a p99 of 31ms, checkpoints frozen for the last 5+ seconds, tail event labeled. The widget the SRE team writes against this is straightforward; the data was simply absent before.
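For completeness, the rule itself is small; a sketch assuming the attributes above are queryable as a flat dict, with thresholds that are tuning choices rather than standards:

def looks_stalled(attrs: dict, gap_factor: float = 10.0, frozen_ms: float = 3000.0) -> bool:
    """Flag a span whose stream stalled even though it nominally completed.

    Two independent signals: the worst inter-token gap dwarfs the p99,
    or the trailing checkpoints stopped advancing for several seconds.
    """
    tbt_p99 = attrs.get("gen_ai.response.tbt_p99_ms", 0.0)
    tbt_max = attrs.get("gen_ai.response.tbt_max_ms", 0.0)
    if tbt_p99 and tbt_max > gap_factor * tbt_p99:
        return True
    checkpoints = attrs.get("gen_ai.checkpoints", [])
    if len(checkpoints) >= 2:
        last, prev = checkpoints[-1], checkpoints[-2]
        if last["tokens"] == prev["tokens"] and last["t"] - prev["t"] >= frozen_ms:
            return True
    return False

Against the span above it fires on both conditions: the max gap is more than a hundred times the p99, and the final two checkpoints are four seconds apart at the same token count.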

What it costs to ignore the gap

The hidden bill for treating streams as opaque is paid in three places, all of them invisible on the dashboard that says everything is fine.

The first is silent quality regressions after a model swap. A new model version with worse TTFT but a higher token rate can land at the same total duration, look identical on the existing latency dashboard, and feel worse in user surveys. The team will rationalize the satisfaction drop ("seasonality, unrelated UI changes, a product launch nearby") because they have no instrumentation to attribute it to TTFT specifically. Engineers spend weeks reverse-engineering customer feedback that the trace already saw and discarded.

The second is invisible reliability incidents. Stuck content loops, mid-stream stalls, and tool-handoff failures aggregate into a class of bug that does not show up in error rate, does not page anyone, and slowly trains users that the product is flaky. Every user-reported complaint requires an engineer to pull the literal raw event log from the LLM provider — if they even kept it — because the trace summary erased the signal. Mean-time-to-diagnosis stretches from minutes to days.

The third is the wrong architectural conclusions. A team looking at request/response-shaped dashboards will keep optimizing for total latency, which is the metric the dashboards show. They will buy a faster inference cluster. They will not invest in time-to-first-token, will not invest in tail-TBT, will not build the partial-output observability. Six months in, the user complaints have not moved, the cluster is more expensive, and the team is shipping the wrong fix because the instrumentation pointed at the wrong number.

Streaming as a different kind of system

The architectural realization at the bottom of this is simple: streaming responses are not a faster form of request/response. They are a real-time data stream whose interesting events are temporal, whose failure modes are shaped like time-series anomalies, and whose user experience is determined by the trajectory of intermediate state rather than the boundary values. The observability primitives the industry built for stateless RPC do not fit. Span-with-attributes captures the shape badly enough that teams ship measurable regressions and never see them.

The fix is not exotic. The data is already inside the SDK. The mechanism for emitting it (span events, with timestamps, in the same trace) is already in OpenTelemetry. What is missing is the discipline to treat the wall of intermediate state as load-bearing — to record TTFT and tail-TBT as first-class attributes, to checkpoint partial output, to distinguish how a stream actually ended. Once a team does that, the next on-call rotation will look at a stalled trace and immediately see the stall. That is the entire bar. It is also the bar most production AI stacks are not clearing in 2026, and the gap is showing up in every quarterly retention review where the metric "feels worse than the dashboard says."

The teams that close it first will not look like they are doing anything special. Their pagers will fire on real failures, their model swaps will land without a satisfaction cliff, and their post-incident reviews will reference the specific token at which the stream went wrong. The teams that do not will keep writing reports about how the model "regressed" and never know which 4.2-second window did the damage.
