17 posts tagged with "streaming"

The Streamed-Response Trace Schema Gap: Why Your APM Lies About LLM Latency

· 10 min read
Tian Pan
Software Engineer

A pager fires at 02:14: customer reports that the assistant "freezes mid-sentence" on long answers. You open the trace. The span for the LLM call shows 8.4 seconds — green, within SLO, no error attribute, finish reason stop. The dashboard widget that aggregates p95 latency for that endpoint is sitting at 9.1s, exactly where it has been for a month. By every signal the APM exposes, the request succeeded.

The user saw the first 200 milliseconds look great, watched the next four seconds produce a coherent paragraph, then watched the same three-sentence fragment repeat for the remaining four seconds before the connection ended. The stuck content loop is a real failure, and the trace knows nothing about it — because the trace was designed for a system that finishes when it returns, not for a system whose behavior is the wall of intermediate state it produced along the way.
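
One way to make a loop like this visible is to compute a signal over the streamed text itself and record it alongside the span. The following is a minimal sketch, assuming the failure signature is a block of sentences repeating back to back at the end of the output. The function name `has_repeating_tail` and its `window` and `min_repeats` thresholds are hypothetical, chosen for illustration rather than taken from any tracing library.

```python
import re


def has_repeating_tail(text: str, window: int = 3, min_repeats: int = 3) -> bool:
    """Return True if the last `window` sentences repeat back to back
    at least `min_repeats` times at the end of `text`."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < window * min_repeats:
        return False
    tail = sentences[-window:]
    repeats = 0
    end = len(sentences)
    # Walk backwards one window at a time while the block matches the tail.
    while end >= window and sentences[end - window:end] == tail:
        repeats += 1
        end -= window
    return repeats >= min_repeats


looping = "Here is the plan. " + "Step one. Step two. Step three. " * 3
healthy = "One. Two. Three. Four. Five. Six. Seven. Eight. Nine."
```

A check like this could run on the accumulated text every few chunks, with the result stored as a span attribute so the trace carries at least one fact about the content of the stream.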

Hyrum's Law for Streamed Reasoning: Pacing, Pauses, and Intermediate Tokens Are an Undocumented Contract

· 11 min read
Tian Pan
Software Engineer

A team upgrades from a frontier model to its faster successor. The eval suite is green. Final answers match. Tool-call schemas are identical. The structured outputs validate against the same JSON schema they always did. They ship. Within a day, support tickets pile up: "the assistant feels rushed," "it's not really thinking anymore," "something is off." The product manager pulls telemetry and finds task-completion rates unchanged. The engineering team double-checks the eval and the schema and finds nothing wrong. The complaint is real, but the contract — as the team defined it — is intact.

What changed is the texture of the stream. The old model paused for 800 milliseconds before calling a tool, emitted a "Let me check that..." preamble, and dribbled tokens at roughly 35 per second with natural-feeling clusters around clause boundaries. The new model emits tokens at 90 per second, never pauses, and skips the preamble entirely. None of that was in any documented contract. All of it was load-bearing.

This is Hyrum's law, and streaming makes its surface area enormous. Any observable behavior of your system will be depended on by somebody — and a streaming AI surface exposes far more observable behavior than the team realizes.

Streaming Tool Results Break Request-Response Agent Planners

· 10 min read
Tian Pan
Software Engineer

A SQL tool ships rows as they come off the wire. The agent calls it expecting a result. The harness, written a year earlier when every tool was request-response, dutifully buffers the whole stream into a single string before invoking the model. Forty seconds later, the buffer is 200 KB, the context window is half-eaten, and the agent is reasoning about row 47,000 of a query it could have stopped at row 30. Nobody designed this failure — it falls out of treating "the tool returned" as the only event the planner reacts to.

The shift to streaming tools is happening below the planner's awareness. SQL engines emit progressive result sets. Document fetchers yield pages. Search APIs return hits in batches as relevance scores stabilize. MCP's Streamable HTTP transport, the 2025-03-26 spec replacement for HTTP+SSE, makes incremental responses a first-class transport mode rather than an exotic capability. The wire is ready. The planners on top of it are not.
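
The alternative to buffering can be sketched as a consumer that reacts to each row and stops early. This is an illustration under stated assumptions, not a real harness API: `consume_until`, `fake_sql_rows`, and the `is_enough` predicate are hypothetical names, and the tool is modeled as a plain Python generator.

```python
from typing import Callable, Iterable, List


def consume_until(row_stream: Iterable[dict],
                  is_enough: Callable[[List[dict]], bool],
                  max_rows: int) -> List[dict]:
    """Pull rows one at a time and stop as soon as the planner has enough."""
    rows: List[dict] = []
    try:
        for row in row_stream:
            rows.append(row)
            if len(rows) >= max_rows or is_enough(rows):
                break
    finally:
        # Generators expose close(); calling it lets the producer release
        # its cursor or connection instead of running to completion.
        close = getattr(row_stream, "close", None)
        if callable(close):
            close()
    return rows


def fake_sql_rows(total: int, produced: List[int]):
    """Hypothetical stand-in for a streaming SQL tool."""
    for i in range(total):
        produced.append(i)  # records how many rows were actually produced
        yield {"row": i}


produced: List[int] = []
rows = consume_until(fake_sql_rows(47_000, produced),
                     is_enough=lambda seen: len(seen) >= 30,
                     max_rows=1_000)
```

Here the producer is asked for only 30 of its 47,000 rows, because the consumer treats each row as an event rather than waiting for "the tool returned".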

Streaming Structured Output: Why Your Parser Hangs on Token 47

· 11 min read
Tian Pan
Software Engineer

The first time a team builds a streaming AI feature with structured output, the bug is always the same. The model is generating fine. The chunks are arriving fine. But somewhere around token 47, the parser hangs, the UI freezes, or, worse, a half-formed enum value gets routed to a downstream tool that quietly does the wrong thing. The team adds a try/catch around JSON.parse, considers itself done, and ships. Two weeks later, a sibling team complains that the streaming UI feels janky after the response gets long. A quarter later, an incident review asks why a "Delete" tool call fired on a record that the model was still describing as "DeleteIfEmpty."

The bug is not in any single token. The bug is that token-streaming and structured output are architecturally at odds, and most frameworks paper over the conflict with prayer. A schema says "this is a complete object." A token stream says "here are the bytes one at a time." Every intermediate state between those two endpoints is, by definition, invalid against the schema. The team's job is to decide what to do during those intermediate states — and most teams have not made that decision explicitly.
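
One explicit decision is "never act on an intermediate state". A minimal sketch of that policy follows, assuming tool calls arrive as a single top-level JSON object. `ToolCallGate` and `ALLOWED_ACTIONS` are hypothetical names for illustration; only `json.loads` and `json.JSONDecodeError` come from the Python standard library.

```python
import json
from typing import Optional

ALLOWED_ACTIONS = {"Delete", "DeleteIfEmpty"}  # hypothetical enum for this sketch


class ToolCallGate:
    """Accumulates streamed chunks and releases a tool call only once the
    whole document parses and its enum field holds a complete, known value."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> Optional[dict]:
        self._buffer += chunk
        try:
            call = json.loads(self._buffer)
        except json.JSONDecodeError:
            return None  # still an intermediate state: do not act on it
        if not isinstance(call, dict) or call.get("action") not in ALLOWED_ACTIONS:
            raise ValueError(f"unexpected tool call: {call!r}")
        return call


gate = ToolCallGate()
first = gate.feed('{"action": "Delete')
second = gate.feed('IfEmpty", "record_id": 7}')
```

The prefix `"Delete` never reaches the dispatcher, because the gate returns nothing until the object is balanced. Re-parsing the full buffer on every chunk is quadratic in the worst case, which is acceptable for a sketch but not for long responses.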

Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

· 11 min read
Tian Pan
Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.
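
The gap distribution is cheap to compute if arrival timestamps are recorded per token. The sketch below assumes timestamps in integer milliseconds; `inter_token_gaps`, `stall_summary`, and the one-second stall threshold are hypothetical choices for illustration.

```python
from typing import Dict, List


def inter_token_gaps(arrivals_ms: List[int]) -> List[int]:
    """Gaps between consecutive token arrival timestamps, in milliseconds."""
    return [later - earlier for earlier, later in zip(arrivals_ms, arrivals_ms[1:])]


def stall_summary(arrivals_ms: List[int], stall_threshold_ms: int = 1000) -> Dict[str, int]:
    gaps = inter_token_gaps(arrivals_ms)
    if not gaps:
        return {"total_ms": 0, "max_gap_ms": 0, "stalls": 0}
    return {
        "total_ms": arrivals_ms[-1] - arrivals_ms[0],
        "max_gap_ms": max(gaps),
        "stalls": sum(1 for gap in gaps if gap >= stall_threshold_ms),
    }


# Two made-up streams with the same total time.
smooth_ms = list(range(0, 4001, 100))                               # steady 100 ms cadence
stalled_ms = list(range(0, 750, 50)) + list(range(3200, 4001, 50))  # 2.5 s pause mid-stream

smooth = stall_summary(smooth_ms)
stalled = stall_summary(stalled_ms)
```

Both streams report the same total time of 4,000 ms, so a total-latency percentile cannot tell them apart. Only the maximum gap and the stall count differ.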

Speculative Decoding Is a Streaming Protocol Decision, Not an Inference Optimization

· 12 min read
Tian Pan
Software Engineer

The "identical output" guarantee that ships with every speculative decoding paper is a guarantee about token distributions, not about what your user sees. Read the proofs carefully and you find a clean mathematical equivalence: the rejection-sampling acceptance criterion is designed so that the output distribution after speculation is exactly the distribution the target model would have produced on its own. That guarantee binds the bytes that leave the inference engine. It says nothing about the bytes that arrived on the user's screen five hundred milliseconds ago and have to be taken back.

If you stream draft tokens to the client the moment the small model emits them, you are running an A/B test on your own users every time the verifier rejects a suffix. Half a paragraph rewrites itself. A function name changes after the IDE has already syntax-highlighted it. A TTS voice has already pronounced "the answer is likely no" before the verifier swaps in "the answer is yes, with caveats." The math says the final distribution is the same as the slow path. The user's experience says they watched the model change its mind in public.

This is the part of speculative decoding that doesn't make it into the speedup numbers. It is also the part that turns "free 3× throughput" into a half-quarter of streaming-protocol work that nobody scoped.
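
The conservative protocol choice is to stream only tokens the verifier has already accepted. A minimal sketch follows, assuming each speculation round is reported as a tuple of draft tokens, the number of draft tokens accepted, and the token the target model supplies after that prefix. The tuple shape and the name `verified_stream` are hypothetical and do not correspond to any specific inference engine's API.

```python
from typing import Iterable, Iterator, List, Tuple

# (draft_tokens, accepted_count, target_token) for one speculation round.
Round = Tuple[List[str], int, str]


def verified_stream(rounds: Iterable[Round]) -> Iterator[str]:
    """Yield only tokens that can no longer be taken back."""
    for draft_tokens, accepted_count, target_token in rounds:
        # The accepted prefix of the draft is final.
        yield from draft_tokens[:accepted_count]
        # The token chosen by the target model replaces the rejected suffix
        # (or extends the draft when every draft token was accepted).
        yield target_token


rounds = [
    (["the", "answer", "is", "likely", "no"], 3, "yes"),
    ([",", "with", "caveats"], 3, "."),
]
shown = list(verified_stream(rounds))
```

The rejected draft suffix "likely no" never reaches the client. The cost is that the user sees tokens in bursts at round boundaries rather than the moment the draft model emits them.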

Streaming JSON Parsers: The Gap Between Tokens and Typed Objects

· 12 min read
Tian Pan
Software Engineer

The model is emitting JSON token by token. Your UI wants to render fields the moment they materialize — a confidence score before the long answer body, the arguments of a tool call as the model fills them in. Then someone wires up JSON.parse on every chunk and the whole thing falls over, because JSON.parse is all-or-nothing. It needs a balanced document to return anything. Until the model emits the closing brace, you have nothing to show.

This is not a parser problem you can fix with a try/catch. The standard JSON parser was designed against a content-length-known HTTP response. Partial input is not a state it models — it is "input error." When you treat a token stream as if it were an HTTP body, you inherit thirty years of "the document is either complete or invalid," and your UI pays the bill.
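
One common workaround is to close whatever the prefix has left open and parse the result. The sketch below is a simplified illustration of that idea in Python, not a production streaming parser: `parse_partial` is a hypothetical helper, it assumes the top-level value is an object or array, and it returns `None` for prefixes it cannot complete (for example, one that ends in the middle of a key).

```python
import json
from typing import Any, Optional

_CLOSERS = {"{": "}", "[": "]"}


def parse_partial(prefix: str) -> Optional[Any]:
    """Best-effort parse of a JSON prefix by closing open strings and containers."""
    stack = []          # closers still owed, innermost last
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in _CLOSERS:
            stack.append(_CLOSERS[ch])
        elif ch in "}]" and stack:
            stack.pop()

    candidate = prefix
    if in_string:
        if escaped:
            candidate = candidate[:-1]  # drop a dangling backslash
        candidate += '"'
    candidate += "".join(reversed(stack))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # caller keeps showing the last good snapshot


snapshot = parse_partial('{"confidence": 0.9, "answer": "The qu')
mid_key = parse_partial('{"confidence": 0.9, "ans')
```

Values in a snapshot may be truncated: a string or number at the very end of the prefix can still grow. That is fine for rendering and unsafe for anything that triggers an action.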

The Cancellation Tax: Your Inference Bill After the User Hits Stop

· 9 min read
Tian Pan
Software Engineer

Your stop button is a lie. When a user clicks it, your UI stops rendering tokens; your provider, in most configurations, keeps generating them. The bytes never reach a browser, but they reach your invoice. The gap between what the user saw and what you paid for is the cancellation tax, and it is the single most under-reported line item on AI cost dashboards.

The reason the tax exists is structural. Autoregressive inference is a GPU-bound pipeline: by the time your client closes the TCP connection, the request has already been scheduled, its KV cache has been populated, and the model is emitting tokens at 30–200 per second. Most serving stacks do not check for client liveness between tokens. They finish the job, log the usage, and bill you. The client saw ten tokens; the log recorded eight hundred. Langfuse, Datadog, and every other observability platform will faithfully report the eight hundred, because that's what the provider's usage block reported.
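
What a liveness check between tokens looks like can be sketched with a cancellation flag. This is a toy model under stated assumptions: `generate_until_cancelled` is a hypothetical function, the token source is a plain generator rather than a real decode loop, and a `threading.Event` stands in for whatever signal a serving stack would use to learn that the client disconnected.

```python
import threading
from typing import Iterable, Iterator, List


def generate_until_cancelled(token_source: Iterable[str],
                             cancelled: threading.Event) -> Iterator[str]:
    """Emit tokens, checking the cancellation flag before each one."""
    for token in token_source:
        if cancelled.is_set():
            break  # stop generating instead of finishing the job
        yield token


cancelled = threading.Event()
delivered: List[str] = []
for token in generate_until_cancelled((f"tok{i}" for i in range(800)), cancelled):
    delivered.append(token)
    if len(delivered) == 10:
        cancelled.set()  # the user hit stop
```

With the check in place, the ten tokens the client saw are the ten tokens emitted. Without it, the loop would run through all eight hundred.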

The Output Commitment Problem: Why Streaming Self-Correction Destroys User Trust More Than the Original Error

· 10 min read
Tian Pan
Software Engineer

A user asks your agent a question. Tokens start flowing. Three sentences in, the model writes "Actually, let me reconsider — " and pivots to a different answer. The revised answer is better. The user closes the tab.

This is the output commitment problem, and it is one of the most consistently underestimated UX failures in shipped AI products. The engineering mindset treats self-correction as a feature — the model noticed its own error, that is the system working as intended. The user-perception mindset treats it as a disaster — the product demonstrated, live, that its first confident claim was wrong. Those two readings are both correct, and they do not reconcile on their own.

The core asymmetry is that streaming makes thinking legible, and legible thinking is auditable thinking. A model that hallucinated silently and then produced a clean final answer would look competent. The same model, streaming every half-thought, looks like it is flailing. The answer quality is identical. The perception is not.

Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

· 11 min read
Tian Pan
Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.
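
Recording TTFT separately from total latency takes only a thin wrapper around the token iterator. The sketch below assumes the stream is exposed as a Python iterable; `TimedStream` is a hypothetical helper, and the injectable `clock` parameter exists so the example can be exercised without real delays.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class TimedStream:
    """Wraps a token iterable and records time to first token and total time."""

    def __init__(self, tokens: Iterable[str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._tokens = tokens
        self._clock = clock
        self.started_at = clock()
        self.ttft_s: Optional[float] = None
        self.total_s: Optional[float] = None

    def __iter__(self) -> Iterator[str]:
        for token in self._tokens:
            if self.ttft_s is None:
                # First token observed: this is the number users feel.
                self.ttft_s = self._clock() - self.started_at
            yield token
        self.total_s = self._clock() - self.started_at
```

After the stream is fully consumed, `ttft_s` and `total_s` can be emitted as two separate metrics, tagged by model, route, and tenant, so a routing change that keeps total latency flat still shows up as a TTFT shift.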

The Latency Perception Gap: Why a 3-Second Stream Feels Faster Than a 1-Second Batch

· 11 min read
Tian Pan
Software Engineer

Your users don't have a stopwatch. They have feelings. And those feelings diverge from wall-clock reality in ways that matter enormously for how you build AI interfaces. A response that appears character-by-character over three seconds will consistently feel faster to users than a response that materializes all at once after one second — even though the batch system is objectively faster. This isn't irrational or a bug in human cognition. It's a well-documented perceptual phenomenon, and if you're building AI products without accounting for it, you're optimizing for the wrong metric.

This post breaks down the psychology behind latency perception, the metrics that actually predict user satisfaction, the frontend patterns that exploit these perceptual quirks, and when streaming adds more complexity than it's worth.

SSE vs WebSockets vs gRPC Streaming for LLM Apps: The Protocol Decision That Bites You Later

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM features pick a streaming protocol the same way they pick a font: quickly and without much thought. Then they live with the consequences for years. The first time the choice bites you is usually in production: a Cloudflare 524 timeout that corrupts your SSE stream, a WebSocket server that leaks memory under sustained load, or a gRPC-Web integration that worked fine in unit tests and silently fails when a client needs to send messages upstream. The protocol shapes your failure modes. Picking based on benchmark throughput is the wrong frame.

Every major LLM provider — OpenAI, Anthropic, Cohere, Hugging Face — streams tokens over Server-Sent Events. That fact is a strong prior, but not because SSE is fast. It's because SSE is stateless, trivially compatible with HTTP infrastructure, and its failure modes are predictable. The question is whether your application has requirements that force you off that path.
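
Part of what makes SSE predictable is how little there is to parse. The sketch below handles only the `data` field and comment lines of the event-stream format and ignores `event`, `id`, and `retry`. `iter_sse_data` is a hypothetical helper that takes already-decoded text lines, not a client library.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each event in a Server-Sent Events stream."""
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data_lines:
                yield "\n".join(data_lines)
                data_lines = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # one leading space after the colon is stripped
        if field == "data":
            data_lines.append(value)


wire = [": ping\n", "data: Hel\n", "\n", "data: lo\n", "data: world\n", "\n"]
events = list(iter_sse_data(wire))
```

An event that is cut off before its terminating blank line is never yielded, which is one reason a truncated SSE stream tends to fail in a visible way rather than delivering half an event.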