
Speculative Decoding Is a Streaming Protocol Decision, Not an Inference Optimization

12 min read
Tian Pan
Software Engineer

The "identical output" guarantee that ships with every speculative decoding paper is a guarantee about token distributions, not about what your user sees. Read the proofs carefully and you find a clean mathematical equivalence: the rejection-sampling acceptance criterion is designed so that the output distribution after speculation is exactly the distribution the target model would have produced on its own. That guarantee binds the bytes that leave the inference engine. It says nothing about the bytes that arrived on the user's screen five hundred milliseconds ago and have to be taken back.

If you stream draft tokens to the client the moment the small model emits them, you are running an A/B test on your own users every time the verifier rejects a suffix. Half a paragraph rewrites itself. A function name changes after the IDE has already syntax-highlighted it. A TTS voice has already pronounced "the answer is likely no" before the verifier swaps in "the answer is yes, with caveats." The math says the final distribution is the same as the slow path. The user's experience says they watched the model change its mind in public.

This is the part of speculative decoding that doesn't make it into the speedup numbers. It is also the part that turns "free 3× throughput" into half a quarter of streaming-protocol work that nobody scoped.

The Mechanism, From a Streaming Perspective

Speculative decoding pairs a small draft model with a large target model. The draft proposes K tokens at a time. The target verifies all K in parallel — one forward pass instead of K — and accepts a prefix of them based on a rejection-sampling criterion that compares the two models' probabilities token-by-token. Whenever a token is rejected, everything after it in the draft is also discarded, and the target produces a single corrected token from the last accepted position. The cycle restarts.
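A minimal sketch of one draft/verify cycle, assuming toy model interfaces (`sample(ctx)` and `prob(ctx, token)`) that stand in for real engine calls; the rejection branch is simplified to a plain target sample rather than the full residual-distribution sample, but the accept/reject shape is the one that matters for streaming:

```python
import random

def speculative_step(prefix, draft, target, K=5):
    """One draft/verify cycle, sketched with toy model interfaces.

    Assumes `draft` and `target` expose sample(ctx) -> token and
    prob(ctx, token) -> float. A real engine scores all K drafted
    positions with a single batched forward pass of the target.
    """
    # 1. Draft model proposes K tokens autoregressively (cheap).
    ctx = list(prefix)
    drafted = []
    for _ in range(K):
        tok = draft.sample(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Target verifies each position with the rejection-sampling criterion.
    ctx = list(prefix)
    accepted = []
    for tok in drafted:
        p_t = target.prob(ctx, tok)
        p_d = draft.prob(ctx, tok)
        if random.random() < min(1.0, p_t / max(p_d, 1e-12)):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Reject this token and the whole suffix after it; the target emits
            # one corrected token from the last accepted position. (The full
            # algorithm samples it from a residual distribution; simplified here.)
            accepted.append(target.sample(ctx))
            return accepted
    # All K accepted: the verify pass also yields one bonus target token for free.
    accepted.append(target.sample(ctx))
    return accepted
```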

Acceptance rates in production EAGLE-style systems sit in the 60–80% range. EAGLE-3 keeps acceptance flat across draft positions instead of decaying, which is why it sees 3–6× wall-clock speedups over autoregressive decoding. But "60–80% acceptance" is per-token, on a per-step window of 4 to 8 tokens. The chance that every token in a given speculation window is accepted is much lower — and the chance that some draft window in a long response gets a rejection is essentially one.
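A back-of-the-envelope check makes the gap concrete, assuming (optimistically) independent per-token acceptance:

```python
# At 70% per-token acceptance and a draft window of K = 5, the probability that
# an entire window survives verification is only ~17%; a few-hundred-token
# response will therefore see rejections across dozens of windows.
per_token_acceptance = 0.70
K = 5
print(per_token_acceptance ** K)   # 0.16807
```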

So the question for a streaming serving stack is not whether rejections happen. They will, on every nontrivial response. The question is what your protocol does when they do.

Two Protocols, Two Different Products

You can pick exactly one of these:

Token-by-token push. The server flushes each draft token to the client as soon as it's emitted. Latency is the minimum possible. Time-to-first-token tracks the draft model's speed, not the target's. The cost is that any rejected suffix has to be retracted — usually by sending a "rewind to position N" control frame, then streaming the corrected continuation. The client's text widget rewrites itself live.

Accept-then-flush. The server holds draft tokens in a buffer until verification confirms the prefix that survives. Only verified tokens leave the server. The output is monotonic — it only grows, never rewinds. The cost is added latency: roughly one draft+verify cycle of buffering, sometimes more if you batch.

Most chat surfaces today use a third option that pretends to be the first: token-by-token push with no retraction support, on the assumption that the user won't notice or care. They might not. Until they do — when a code completion that they already started copying changes mid-stream, or when a generated SQL query in a notebook cell gets a column name swapped after they've already pressed run.

The interesting design realization is that this is not really a server-side performance choice. It's a product surface choice. You should be picking the protocol per-surface, not per-deployment.

Surfaces Where Retraction Is Free

Pure conversational chat, rendered as plain text in a scrolling viewport, is forgiving. Users have come to expect generated text to have small course-corrections; the cognitive load of "the model edited the previous sentence" is low when the previous sentence is still on screen and the edit is local.

Rendered markdown chat is forgiving until the markdown is partially rendered as a structured element — a table that grew a column, a code block that got re-fenced, a LaTeX equation that was syntactically valid mid-stream and is now being replaced with a different valid form. Retraction at the token level can produce intermediate states that aren't valid markdown, which means the renderer either flickers or has to debounce, which means the time-to-first-paint of the rich element gets worse than the protocol promised.

If your client renders streamed content into a textarea or a contenteditable, you also have to think about cursor behavior, selection, and undo stack. A naive "just rewrite the suffix" is fine until the user clicks into the middle of the response while it's still streaming.

Surfaces Where Retraction Is Catastrophic

Voice TTS pipelines are the obvious one. Once a phoneme has been synthesized and played, you cannot un-play it. If your TTS is consuming the LLM stream tokenwise to keep mouth-to-ear latency under a second, you have already committed to the audio for the rejected suffix. Your options at that point are bad: cut audio mid-word and resynthesize (jarring), let the wrong content finish and contradict yourself one sentence later, or buffer enough tokens before TTS starts that you've given up the latency win that motivated streaming in the first place. The right answer is usually to buffer at sentence or clause boundaries — but that means the TTS layer needs the accept-then-flush protocol, not token-by-token, regardless of what the chat UI on the same backend uses.

Structured form-fill and tool calls belong here too. If the model is streaming JSON into a typed parser that incrementally updates a UI form — date picker, dropdown selection, address fields — the parser may have already committed to a value, fired a side effect (autocomplete API lookup, validation), or shown the user a confirmation. Retraction means rolling back UI state, which most form libraries are not built for. Worse, if a tool call has already begun executing because the JSON for the tool name and arguments was streamed and parsed before verification finished, you have just moved the read-after-write race from the inference engine into the tool-execution layer.

Generated code that's running in a sandbox is the variant that bites the hardest in agentic systems. If your harness streams generated code into a worker that begins evaluating it as soon as a syntactically complete statement arrives, a rejected suffix isn't just a UI rewrite — it's potentially a partially-executed file write, a partially-mutated database row, or an HTTP request that already left the building. The mathematical equivalence to the target model's output buys you nothing here. The system already took an action based on a token the verifier later threw away.

The pattern is consistent: the moment a consumer downstream of the LLM stream is hard to roll back, your speculative decoding protocol is no longer just an inference detail. It's a correctness property of the whole pipeline.

The Eval That Catches This: User-Visible Token Churn

Inference teams measure speculative decoding with acceptance rate and tokens-per-second. Neither metric is the right one for the streaming question. Acceptance rate is a property of the model pairing, not of the user experience. Tokens-per-second is the throughput out of the inference engine — it doesn't distinguish between "the user saw monotonically growing output" and "the user saw a 200-character paragraph rewrite itself three times."

The metric you actually want is user-visible token churn: for each output, the ratio of (total tokens streamed to client) / (final tokens in the response). With pure token-by-token push and a 70% acceptance rate, this number can easily reach 1.3–1.5 — meaning a quarter to a third of the bytes you sent down the wire were retracted before the response was final. With accept-then-flush it's exactly 1.0 by construction.
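A sketch of how the metric could be computed from a gateway-side stream log, assuming each logged event is either a token push or a rewind (the event shapes are illustrative, not a real SDK):

```python
def user_visible_token_churn(events) -> float:
    """(total tokens streamed to the client) / (tokens in the final response).

    `events` is an illustrative log: ("tokens", [...]) pushes tokens to the
    client, ("rewind", n) truncates the visible output back to position n.
    Accept-then-flush streams never emit rewinds, so the ratio is exactly 1.0.
    """
    pushed = 0      # everything that ever went down the wire
    visible = []    # what the client ends up displaying
    for kind, payload in events:
        if kind == "tokens":
            pushed += len(payload)
            visible.extend(payload)
        elif kind == "rewind":
            del visible[payload:]
    return pushed / max(len(visible), 1)
```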

Adding this metric to the evaluation harness changes the conversation from "we got a 3× speedup, ship it" to "we got a 3× speedup with X% retraction overhead, here's the breakdown by product surface." It also gives you a knob to tune: lowering the speculation window K trades acceptance-rate-driven throughput for less retraction. The optimum is different for chat than for voice than for code execution, and the data tells you so.

A useful sub-metric is retracted-byte severity: not all retracted tokens are equal. A rejected suffix that began "the answer is" being replaced by one that also begins "the answer is" (same opening words, different continuation) is nearly invisible to the user. A retracted "yes" being replaced with "no" is a content reversal. Detecting reversals usually means computing a string-distance or semantic-distance between the rejected suffix and its replacement, which is more work but catches the cases that look bad in user testing.
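Severity can be approximated cheaply with a surface-level string distance per retraction, assuming the gateway logs each rejected suffix alongside its replacement; catching true semantic reversals reliably needs an embedding or NLI model on top of this:

```python
import difflib

def retraction_severity(rejected_suffix: str, replacement: str) -> float:
    """0.0 = textually identical replacement, 1.0 = complete rewrite.

    Surface distance under-weights short semantic flips such as
    "yes" -> "no", so treat this as a triage score, not a verdict.
    """
    matcher = difflib.SequenceMatcher(None, rejected_suffix, replacement)
    return 1.0 - matcher.ratio()
```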

Architectural Patterns That Hold Up

A few patterns are worth committing to early, before the cost of retrofitting them gets quoted in person-quarters:

Per-surface protocol selection at the gateway. The serving layer should expose at least two streaming modes — token-push and accept-then-flush — and clients should declare which one they want via the request. The same model, same draft, same verifier, can serve both: the difference is purely whether the gateway holds the token in a buffer until verification confirms it, or releases it immediately. Treating the protocol as a request-time choice rather than a deployment-time configuration is the cleanest decoupling.
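A sketch of what request-time mode selection could look like inside the gateway, with illustrative frame shapes; the draft/verify loop upstream is identical in both modes, only the release policy differs:

```python
from enum import Enum

class StreamMode(Enum):
    TOKEN_PUSH = "token_push"            # release draft tokens immediately; rewinds allowed
    ACCEPT_THEN_FLUSH = "accept_flush"   # release only verified tokens; output is monotonic

class GatewayStream:
    """Turns draft/verify cycles into wire frames for the mode the client asked for."""

    def __init__(self, mode: StreamMode) -> None:
        self.mode = mode
        self.sent = 0   # number of tokens the client currently has on screen

    def on_draft(self, draft_tokens):
        """Called when the draft model emits a window of K tokens."""
        if self.mode is StreamMode.TOKEN_PUSH:
            self.sent += len(draft_tokens)
            return [{"type": "tokens", "tokens": list(draft_tokens)}]
        return []   # accept-then-flush holds everything until verification

    def on_verify(self, draft_tokens, n_accepted, corrected_token=None):
        """Called after the target verifies the window; n_accepted is the surviving prefix length."""
        if self.mode is StreamMode.ACCEPT_THEN_FLUSH:
            verified = list(draft_tokens[:n_accepted])
            if corrected_token is not None:
                verified.append(corrected_token)
            self.sent += len(verified)
            return [{"type": "tokens", "tokens": verified}]

        # TOKEN_PUSH already pushed the whole window; retract any rejected suffix,
        # then push the target's corrected (or bonus) token.
        frames = []
        rejected = len(draft_tokens) - n_accepted
        if rejected:
            self.sent -= rejected
            frames.append({"type": "rewind", "to": self.sent})
        if corrected_token is not None:
            frames.append({"type": "tokens", "tokens": [corrected_token]})
            self.sent += 1
        return frames
```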

Server-side commit boundaries. Even in the token-push mode, the gateway can elect to flush at semantic boundaries — sentence end, JSON object end, code-block end — rather than every token. This is a soft accept-then-flush that gives most of the latency back without paying for full-buffer retraction risk. The sentence-end flush model is what TTS pipelines almost always settle on; there's no reason it has to be a TTS-specific behavior.
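A sketch of a commit-boundary buffer, with an illustrative boundary set; a real gateway would pick boundaries per surface (sentence end for TTS, complete JSON values for form fill, closed fences for code blocks):

```python
import re

# Illustrative boundaries: sentence end or blank line; a real gateway would add
# surface-specific ones (closed code fence, end of a complete JSON object, ...).
BOUNDARY = re.compile(r"(.*?(?:[.!?]\s|\n\n))", re.DOTALL)

class BoundaryFlusher:
    """Buffers verified text and releases it only at semantic commit boundaries."""

    def __init__(self) -> None:
        self.buffer = ""

    def feed(self, verified_text: str) -> str:
        """Append newly verified text; return whatever is now safe to push downstream."""
        self.buffer += verified_text
        released = ""
        while (m := BOUNDARY.match(self.buffer)):
            released += m.group(1)
            self.buffer = self.buffer[m.end():]
        return released

    def close(self) -> str:
        """Flush the remaining tail when the stream ends."""
        released, self.buffer = self.buffer, ""
        return released
```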

Retraction-aware client protocols. If you're going to do token-push, the wire format needs explicit "rewind to position N" frames, and clients need to handle them as a first-class event, not as a malformed stream. SSE and WebSocket-based stacks usually need a thin envelope on top to express this; QUIC-based stacks can model it more naturally with stream offsets. Either way, "the stream is monotonic append-only" is a property the protocol should not silently assume.
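On the client side, the envelope only needs one extra frame type; a minimal sketch of a handler, with frame shapes matching the gateway sketch above (again illustrative, not a real wire format):

```python
import json

def apply_frame(visible_tokens: list[str], frame: str) -> list[str]:
    """Apply one stream frame to the tokens the client currently shows.

    {"type": "tokens", "tokens": [...]} appends; {"type": "rewind", "to": N}
    truncates the visible output back to position N. The essential property is
    that a rewind is a first-class event the client handles, not a malformed stream.
    """
    msg = json.loads(frame)
    if msg["type"] == "tokens":
        return visible_tokens + msg["tokens"]
    if msg["type"] == "rewind":
        return visible_tokens[: msg["to"]]
    raise ValueError(f"unexpected frame type: {msg['type']}")
```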

Tool-call execution gates. No tool call fires until its full JSON has been verified by the target model, regardless of whether intermediate tokens were streamed to the client for display purposes. The display stream and the action stream are different streams, with different commit semantics. Conflating them is how speculative decoding turns into a confused-deputy bug.
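A sketch of such a gate, assuming the gateway exposes the display stream and the verified stream separately; `execute` stands in for whatever tool runtime you use:

```python
import json

class ToolCallGate:
    """Splits the display stream from the action stream.

    Draft tokens may be shown to the user immediately, but a tool call only
    executes once its full JSON has arrived on the *verified* stream and
    parses cleanly. `execute(name, arguments)` is a placeholder callback.
    """

    def __init__(self, execute) -> None:
        self.execute = execute
        self.verified_json = ""

    def on_display_tokens(self, text: str) -> str:
        return text   # safe to render; never safe to act on

    def on_verified_tokens(self, text: str) -> None:
        self.verified_json += text
        try:
            call = json.loads(self.verified_json)   # complete object => safe to fire
        except json.JSONDecodeError:
            return                                  # still partial; keep waiting
        self.execute(call["name"], call.get("arguments", {}))
        self.verified_json = ""
```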

The Org Conversation This Is Really About

Speculative decoding usually arrives in a serving stack as an inference-team initiative: a 3× throughput win, validated against perplexity and downstream eval suites, ready to ship. The eval suites are correct that quality is preserved. The throughput numbers are correct that latency is improved. What's missing is the cross-functional review with the surfaces that consume the stream: the chat product team, the voice agent team, the agentic tool framework, the SDKs that customers use to integrate.

Those surfaces have implicit contracts with the inference layer that the inference team didn't sign. "The stream is append-only" is not a property the inference team ever promised — but it's a property every downstream consumer assumed, because every prior decoding strategy provided it for free.

The pattern to adopt is to treat speculative decoding rollouts the way you treat protocol changes: with explicit RFCs to consuming surfaces, opt-in client behavior, surface-specific eval metrics, and a deprecation path for clients that aren't retraction-aware. The inference team gets to ship the speedup. The product surfaces get to control how that speedup is exposed. And the org gets to stop relearning, in incident postmortems, that "identical output distribution" is not the same property as "identical user experience."

Closing Thought

The thing speculative decoding is doing well is borrowing a trick from CPU branch prediction: speculate on the cheap path, verify in parallel, roll back if wrong. CPUs got away with this for decades because the speculation was invisible — the rolled-back work was never observable from outside the pipeline. The hardware paid the cost of speculation but the software contract held.

Modern serving stacks broke that invariant the moment they started streaming draft tokens to a remote client over a public network. The speculation is now observable, by definition, because the speedup requires it to be observable — that's the whole point of pushing tokens early. The architectural job in front of us is to put the observability back where it belongs: at well-defined commit boundaries, under client control, with metrics that measure user experience and not just engine throughput. Speculative decoding is here to stay. The streaming protocol around it is the part still being designed.
