
2 posts tagged with "speculative-decoding"


Speculative Decoding Is a Streaming Protocol Decision, Not an Inference Optimization

12 min read
Tian Pan
Software Engineer

The "identical output" guarantee that ships with every speculative decoding paper is a guarantee about token distributions, not about what your user sees. Read the proofs carefully and you find a clean mathematical equivalence: the rejection-sampling acceptance criterion is designed so that the output distribution after speculation is exactly the distribution the target model would have produced on its own. That guarantee binds the bytes that leave the inference engine. It says nothing about the bytes that arrived on the user's screen five hundred milliseconds ago and have to be taken back.

If you stream draft tokens to the client the moment the small model emits them, you are running an A/B test on your own users every time the verifier rejects a suffix. Half a paragraph rewrites itself. A function name changes after the IDE has already syntax-highlighted it. A TTS voice has already pronounced "the answer is likely no" before the verifier swaps in "the answer is yes, with caveats." The math says the final distribution is the same as the slow path. The user's experience says they watched the model change its mind in public.
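What that looks like on the wire is a protocol choice. A hypothetical event stream for the eager variant, where drafts are shown before verification, has to carry a retraction message; the event names and structure here are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class StreamEvent:
    # Hypothetical wire format: "draft" text may still be retracted,
    # "commit" text is final, "retract" tells the client to erase the
    # last n_tokens of draft text it has already rendered.
    kind: str
    text: str = ""
    n_tokens: int = 0

def finish_round(shown_draft, n_accepted, correction, emit):
    """Close out one verifier round under the eager protocol (sketch).

    shown_draft: draft tokens already pushed to the client
    n_accepted:  how many of them the verifier kept
    correction:  the token the verifier substituted at the first rejection
    """
    n_rejected = len(shown_draft) - n_accepted
    if n_rejected:
        emit(StreamEvent("retract", n_tokens=n_rejected))  # text the user saw disappears
    emit(StreamEvent("commit", text="".join(shown_draft[:n_accepted]) + correction))
```

The conservative alternative, holding drafts server-side and streaming only committed text, never needs `retract`, at the cost of tokens arriving in verification-sized bursts instead of a smooth trickle.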

This is the part of speculative decoding that doesn't make it into the speedup numbers. It is also the part that turns "free 3x throughput" into half a quarter of streaming-protocol work that nobody scoped.

Speculative Decoding in Practice: The Free Lunch That Isn't Quite Free

10 min read
Tian Pan
Software Engineer

Your 70-billion-parameter model spends most of its inference time waiting on memory, not doing math. Modern GPUs can perform hundreds of arithmetic operations for every byte they read from memory, yet autoregressive Transformer decoding performs only a handful of operations per byte loaded. The hardware is idling while your users are waiting. Speculative decoding exploits this gap by having a small, fast model draft multiple tokens ahead, then letting the large model verify them all in one parallel pass. The promise is 2–3x latency reduction with mathematically identical output quality. The reality is more nuanced.
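The shape of the loop is easy to sketch; `draft_next` and `target_block` below are placeholder callables standing in for the two models, and verification is shown greedy for brevity:

```python
def speculative_decode_step(draft_next, target_block, prefix, k=4):
    """One speculate-then-verify round (greedy verification, sketched).

    draft_next(ctx)            -> next token from the small model
    target_block(prefix, toks) -> the large model's own choice at each of
                                  the k drafted positions, computed in a
                                  single parallel forward pass

    The large model's weights are read from memory once per round instead
    of once per token, which is where the latency win comes from.
    """
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)          # k cheap sequential steps
        drafted.append(tok)
        ctx.append(tok)

    verified = target_block(prefix, drafted)   # one expensive parallel pass

    out = list(prefix)
    for proposed, wanted in zip(drafted, verified):
        out.append(wanted)             # the target's choice is always what ships
        if proposed != wanted:         # first disagreement ends the round
            break
    return out
```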

After two years of production deployments across Google Search, coding assistants, and open-source serving frameworks, speculative decoding has graduated from research curiosity to standard optimization. But "standard" does not mean "drop-in." The technique has sharp edges around draft model selection, batch size sensitivity, and memory overhead that determine whether you get a 3x speedup or a net slowdown.
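Whether it pays off comes down to arithmetic. Under the usual i.i.d.-acceptance approximation, with per-token acceptance rate `alpha`, draft length `gamma`, and a draft model that costs `c` of a target forward pass per token, the break-even is easy to compute; the numbers below are illustrative, not measurements:

```python
def expected_speedup(alpha, gamma, c):
    """Back-of-envelope speedup per speculative round.

    Expected tokens emitted per round: (1 - alpha**(gamma + 1)) / (1 - alpha).
    Cost per round: gamma draft steps plus one target verification pass.
    Ignores batching effects, which can erase the win on their own.
    """
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_round = gamma * c + 1
    return tokens_per_round / cost_per_round

print(expected_speedup(alpha=0.8, gamma=4, c=0.05))  # ~2.8x: well-matched draft model
print(expected_speedup(alpha=0.3, gamma=4, c=0.30))  # ~0.65x: a net slowdown
```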