
Inter-Token Jitter: The Streaming UX Failure Your p95 Dashboards Can't See

11 min read
Tian Pan
Software Engineer

Your latency dashboard is green. Time-to-first-token is under the 800ms target on p95. Total completion time is under the four-second budget on p99. Then a senior PM forwards a support thread: "the assistant froze for like three seconds in the middle of an answer," "it stuttered and then dumped a whole paragraph," "I thought it crashed." Three users uninstalled this week with the same complaint. Nobody on the team can reproduce it on their laptop, and every metric you log says the system is healthy.

The metric that would explain the bug is the one you're not measuring: the distribution of gaps between consecutive tokens. A clean p95 total time can hide a stream where 8% of responses contain a 2.5-second pause halfway through, and to a user watching characters appear in real time, that pause reads as a broken system — not a slow one. Your dashboard is measuring the movie's runtime; your user is watching the movie.

TTFT and TPOT Are Snapshots; Streaming Is a Continuous Experience

The standard LLM serving metrics — time-to-first-token (TTFT), end-to-end latency (E2EL), and time-per-output-token (TPOT) — were designed for benchmarking, not for UX. TTFT tells you how long the user waits before anything happens. E2EL tells you how long the full answer takes. TPOT, defined as (E2EL - TTFT) / output_tokens, is the average decode interval per token.

Average is the trap. TPOT is a scalar derived from start-and-end timestamps; it cannot represent a distribution. A response that emits 200 tokens in four seconds with one 2-second pause has the same TPOT as one that emits 200 tokens at a perfectly steady 20ms cadence. The user experiences these as completely different products. One feels alive; the other feels dead and then alive again.
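
To make the blindness concrete, here is a tiny sketch with hypothetical gap values (not from any production trace): two streams with the same TPOT, only one of which a user would call healthy.

```python
# Two hypothetical streams, each ~200 tokens over ~4 seconds of decode time.
# Inter-token gaps are in milliseconds.
steady = [20.0] * 200                             # perfectly even cadence
stalled = [10.0] * 100 + [2000.0] + [10.0] * 99   # one 2-second pause mid-stream

def tpot(gaps):
    """Average decode interval, i.e. roughly (E2EL - TTFT) / output_tokens."""
    return sum(gaps) / len(gaps)

print(tpot(steady), tpot(stalled))   # ~20.0 vs ~20.0 -> indistinguishable
print(max(steady), max(stalled))     # 20.0 vs 2000.0 -> the gap the user actually felt
```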

What you actually need is per-stream inter-token latency (ITL) as a first-class signal — measured at the token boundary, recorded as a distribution per stream, and aggregated as a distribution-of-distributions across streams. Anyscale and NVIDIA both call this out: ITL variance, often referred to as jitter, is more detrimental to perceived UX than a consistently slow but steady stream. Users will accept a 60ms steady cadence; they will not accept a 25ms cadence with periodic 1.5-second hangs, even though the second one finishes faster on the wall clock.

A healthy ITL chart looks like a narrow flat band: p50, p95, and p99 hugging each other. The moment p99 starts pulling away from p50 mid-decode, you have a UX bug whose blast radius is invisible to every other metric on your dashboard.

Where Mid-Stream Pauses Actually Come From

Once you start measuring ITL distributions, the next question is what causes the long-tail spikes. The honest answer is "five different things, and you need to attribute them," because the mitigations are different.

Provider-side scheduling jitter. Most hosted inference engines run continuous batching: incoming requests join an in-flight batch at the next decode step. When a large prefill request arrives mid-batch, the next decode step gets stretched while the prefill is fused in (or chunked in, depending on the scheduler). Your stream pauses. There is nothing wrong with your code, and your provider's aggregate p99 dashboard probably looks fine because the spike is per-stream, not per-request.

KV-cache pressure. During a long generation, if total KV-cache memory across the running batch exceeds the device budget, the scheduler must either swap a stream's cache to host memory, evict it entirely (forcing recomputation when it resumes), or pause it until memory frees. Each of these manifests as an ITL spike on the affected stream — and the affected streams are disproportionately the long ones, which are the streams users are most invested in.

Speculative-decoding rejection events. Spec decoding feels free when draft acceptance rates are high, but every rejected draft is a wasted forward pass that the user pays for in latency. On out-of-distribution prompts where the draft model disagrees with the target, you can see acceptance drop from 80% to 30%, and what looked like a 2× speedup becomes a 1.2× slowdown — concentrated in the middle of the response, where the model has drifted from the draft's training distribution.
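
A rough back-of-envelope shows why the acceptance rate dominates. The sketch below uses the standard expected-tokens-per-round result for speculative decoding and assumes, purely for illustration, a draft length of 4 and a draft pass costing 10% of a target pass; your numbers will differ.

```python
def spec_decode_speedup(alpha, gamma=4, draft_cost=0.1):
    """Rough speculative-decoding speedup vs. plain autoregressive decoding.

    alpha: per-token draft acceptance rate
    gamma: draft tokens proposed per round
    draft_cost: cost of one draft forward pass relative to one target pass
    """
    # Expected tokens accepted per round (standard speculative-decoding result).
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # One round costs gamma draft passes plus one target verification pass.
    cost_per_round = gamma * draft_cost + 1.0
    # Plain decoding pays one target pass per token.
    return expected_tokens / cost_per_round

print(round(spec_decode_speedup(0.8), 2))  # ~2.4x when the draft agrees often
print(round(spec_decode_speedup(0.3), 2))  # ~1.0x -- the speedup has evaporated
```

Add a heavier draft model or per-round scheduling overhead and the second case dips below 1x, which is where the mid-response slowdown in the paragraph above comes from.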

Network buffering between provider and client. TCP buffering, intermediate proxies, and the browser's own paint scheduling can all batch tokens in transit. The provider sent you 50 tokens at a steady cadence; the browser revealed them all in one frame because it decided that was efficient. The pause is real to the user even though no inference component caused it.

Connection-layer hiccups. Mobile networks, flaky Wi-Fi, and CDN reroutes inject pauses that have nothing to do with the model. Research on LLM streaming under unstable networks (e.g., the Eloquent transmission scheme) shows that retransmits during packet loss are perceived as model stalls when the user has no reason to think otherwise.

The taxonomy matters because the fix differs. KV pressure wants a smaller batch or paged-attention-aware admission control. Spec-decoding rejection wants a draft model targeted to your traffic mix or a confidence-based fallback. Network jitter wants a client-side buffer. Lumping all of these under "p99 latency is bad" hides which one is biting you today.

The Metric That Has To Land: ITL as a First-Class SLI

You cannot fix what you cannot see, and you cannot see ITL jitter from the metrics most teams expose. The change is straightforward in description and slightly tedious in implementation.

Record a timestamp at the moment each token (or chunk, if you can't get per-token granularity) becomes available to the client. The deltas between those timestamps are the per-stream ITL series. Emit them as a histogram per stream and aggregate across streams to produce:

  • Per-stream p99 ITL — the worst gap inside a single response. This is the user-perceived stall metric. If it's 2 seconds, someone watched the screen freeze.
  • Cross-stream p99 of per-stream p99 ITL — the tail of the tail. This catches the streams that contain the worst stalls, not the streams whose stalls happen to align with everyone else's.
  • Jitter ratio, defined as p99_ITL / p50_ITL per stream. A ratio above 5 is a flag that the stream is unstable even if the mean looks fine.
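
A minimal sketch of the aggregation, assuming you already collect a list of inter-token gaps per stream in milliseconds; the function names are illustrative, not from any particular metrics library.

```python
import numpy as np

def per_stream_itl_stats(gaps_ms):
    """gaps_ms: inter-token gaps for one stream, in milliseconds."""
    gaps = np.asarray(gaps_ms, dtype=float)
    p50 = np.percentile(gaps, 50)
    p99 = np.percentile(gaps, 99)
    return {
        "p50_itl": p50,
        "p99_itl": p99,              # worst gap inside this response
        "jitter_ratio": p99 / p50,   # > 5 flags an unstable stream
    }

def cross_stream_p99_of_p99(all_streams):
    """all_streams: list of per-stream gap lists. Returns the tail of the tail."""
    per_stream_p99s = [per_stream_itl_stats(gaps)["p99_itl"] for gaps in all_streams]
    return np.percentile(per_stream_p99s, 99)
```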

Where to measure matters. Server-side ITL captures provider behavior; client-side ITL captures the experience the user actually has, including network effects. You want both, because mitigations land at different layers. Don't derive ITL from start-to-end timestamps and a token count; that gets you TPOT, which is the metric that hid the bug in the first place.

A practical SLO framing: define ITL goodput as the percentage of streams whose entire ITL distribution stayed under a target threshold (e.g., no single inter-token gap above 400ms). This is harsher than TPOT-based goodput and closer to what users actually feel. Recent work on LLM serving SLOs explicitly argues for this kind of distribution-aware metric over scalar means, because the scalar version cannot detect partial-stream failure.
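
Continuing the sketch above, goodput under this framing is just the fraction of streams whose worst gap stayed under the threshold (the 400ms figure is the example from the paragraph, not a universal target).

```python
def itl_goodput(all_streams, threshold_ms=400.0):
    """Fraction of streams in which no single inter-token gap exceeded the threshold."""
    good = sum(1 for gaps in all_streams if max(gaps) <= threshold_ms)
    return good / len(all_streams)
```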

Smoothing Is a Trade, Not a Free Lunch

Once you can see jitter, the next decision is how much you're willing to spend to hide it. The architectures fall into three buckets, each with a different cost profile.

Client-side smoothing buffers are the cheapest move. The client receives tokens at whatever cadence the network delivers them and releases them to the rendering layer at a steady rate — typically a 33ms-per-chunk reveal that stays comfortably above the screen refresh boundary. You buy a few hundred milliseconds of extra latency at the start of the stream in exchange for the elimination of mid-stream stalls smaller than the buffer depth. Google's Bard and several research systems (notably Andes) used this approach explicitly. The user perceives this as smoother even though end-to-end time is fractionally worse, because their loss function is variance, not mean.
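
Here is a minimal sketch of the idea, written in Python as an async generator to stay consistent with the other snippets (a real client would do this in the browser's render loop); the 33ms cadence and the token source are assumptions.

```python
import asyncio

async def smoothed(token_stream, reveal_interval=0.033):
    """Release tokens to the renderer at a steady cadence, whatever the network does.

    token_stream: async iterator of tokens as the network delivers them.
    reveal_interval: seconds between reveals (~33ms here).
    """
    buffer = []
    done = False

    async def fill():
        nonlocal done
        async for tok in token_stream:
            buffer.append(tok)
        done = True

    filler = asyncio.create_task(fill())
    while not done or buffer:
        if buffer:
            yield buffer.pop(0)          # reveal exactly one token per tick
        await asyncio.sleep(reveal_interval)
    await filler
```

The only stalls that survive this are the ones longer than whatever the buffer has accumulated, which is exactly the trade the paragraph above describes.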

Server-side accept-then-flush patterns trade slightly more compute for hidden micro-pauses. The server holds back tokens for a small window (e.g., 50ms) and then flushes them at a fixed cadence, masking sub-window scheduling jitter from the client entirely. This is most effective when you control the client and can keep the client buffer shallow.
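
A sketch of the server half, assuming an async token iterator from the inference engine and a `send` callable that writes one chunk to the client; both are hypothetical names.

```python
import asyncio

async def accept_then_flush(token_stream, send, window=0.05):
    """Collect tokens for a small window, then flush them as one chunk at a fixed cadence.

    Sub-window scheduling jitter from the engine never reaches the client.
    """
    pending = []

    async def collect():
        async for tok in token_stream:
            pending.append(tok)

    collector = asyncio.create_task(collect())
    while not collector.done() or pending:
        await asyncio.sleep(window)       # fixed flush cadence
        if pending:
            chunk = "".join(pending)
            pending.clear()
            await send(chunk)             # one chunk per window
    await collector
```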

Mid-stream provider failover is the heavyweight option. A watchdog at the streaming layer detects an ITL spike beyond a threshold (say, 1.5 seconds with no token), cancels the in-flight stream, and reissues the remaining generation against a fallback provider with the conversation-so-far as context. Done well, this can hide a provider-side stall from the user; done badly, it produces a coherence break where the second half of the answer doesn't match the first half. Reserve this for surfaces where stalls are catastrophic (voice, real-time agents) and accept the operational complexity.
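
A sketch of the watchdog half, assuming a per-token async iterator from the primary provider and a `restart_on_fallback` coroutine that reissues the remaining generation with the text-so-far as context; both names are placeholders for whatever your stack exposes.

```python
import asyncio

async def stream_with_failover(primary_stream, restart_on_fallback, stall_timeout=1.5):
    """Yield tokens from the primary; if no token arrives within stall_timeout seconds,
    abandon it and continue the generation on a fallback provider."""
    emitted = []
    it = primary_stream.__aiter__()
    while True:
        try:
            tok = await asyncio.wait_for(it.__anext__(), timeout=stall_timeout)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError:
            # Stall detected: hand the partial text to the fallback and keep streaming.
            async for tok in restart_on_fallback("".join(emitted)):
                emitted.append(tok)
                yield tok
            return
        emitted.append(tok)
        yield tok
```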

The decision boils down to which surface is paying for which failure. A coding assistant where users skim tokens as they appear benefits enormously from smoothing — they don't read at 50 tokens/second anyway, so the extra few hundred milliseconds is invisible. A voice agent where TTS reads the stream live needs the opposite trade-off: minimize buffering, accept some jitter, and use server-side patterns that smooth without adding head-of-line latency.

Evals That Actually Catch Jitter Bugs

The eval discipline most teams use for streaming is "did the response come out, and was it correct?" That eval cannot fail on a stream that produced the right text with a 2-second mid-response pause. To catch jitter regressions before they hit users, you need synthetic eval harnesses that look at the shape of the stream, not just the content.

Two patterns work in practice:

  • Stream-replay against the client. Record real provider streams (timestamps + chunks), inject synthetic pauses at controlled positions (a sketch follows this list), and feed them into your client renderer. Score user-perceived quality on the rendered stream — either via a panel of raters or via a perceptual model that approximates human reading-flow judgment. This isolates the client's smoothing from the provider's behavior, so a regression in either layer surfaces independently.
  • Jitter-injected load tests against the serving stack. Run a representative traffic mix against your inference layer with concurrency deliberately pushed past what you provision in production, capture per-stream ITL distributions, and gate releases on the cross-stream p99-of-p99 ITL not regressing. Most providers will pass a TTFT/TPOT regression test even when their KV pressure profile has shifted; the jitter-shaped test catches that drift.
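
Here is a minimal sketch of the pause-injection replay, assuming a recorded trace of (inter-chunk delay, chunk) pairs; `renderer.push` stands in for whatever your client under test exposes.

```python
import asyncio

async def replay_with_injected_pause(recorded, pause_at_chunk, pause_s):
    """Replay a recorded stream, adding one synthetic stall at a controlled position.

    recorded: list of (delay_since_previous_chunk_seconds, chunk_text) pairs
              captured from a real provider stream.
    pause_at_chunk: index of the chunk immediately after the injected stall.
    pause_s: length of the injected stall, in seconds.
    """
    for i, (delay, chunk) in enumerate(recorded):
        if i == pause_at_chunk:
            await asyncio.sleep(pause_s)   # the synthetic mid-stream stall
        await asyncio.sleep(delay)
        yield chunk

# Feed the replayed stream into the client renderer under test, e.g.:
#   async for chunk in replay_with_injected_pause(trace, pause_at_chunk=40, pause_s=2.0):
#       renderer.push(chunk)
```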

These evals are not hard to build, but they require committing to the idea that streaming is a continuous experience that needs continuous evaluation. A scoreboard of single-number averages will keep blessing the version that broke the UX.

The Architectural Realization

The reason this metric stays missing on most teams is cultural, not technical. Latency dashboards inherit their vocabulary from request/response systems, where one timestamp at start and one at end captures everything that matters. Streaming inverts that contract: the response isn't an event, it's a process, and the user's loss function depends on the shape of the process, not its endpoints. Your p95 is a snapshot. Your user is watching a movie, and snapshots cannot detect a film that buffers mid-scene.

The path forward is to treat ITL distributions the way SREs already treat percentile distributions for everything else: as a first-class signal, with goodput defined against the distribution, not the mean. Smoothing buffers, server-side flushing, and failover policies become design choices applied per-surface based on what the surface is trading for. The teams that get this right ship streaming experiences that feel calm even when the underlying inference layer is ragged. The teams that don't keep wondering why their dashboards are green and their NPS is sliding.

If you take one thing into next sprint: add per-stream p99 ITL to the dashboard you check before every release. The first time you look at it, you will find a bug. That's the point.
