Speculative Decoding in Practice: The Free Lunch That Isn't Quite Free

· 10 min read
Tian Pan
Software Engineer

Your 70-billion-parameter model spends most of its inference time waiting on memory, not doing math. Modern GPUs can perform hundreds of arithmetic operations for every byte they read from memory, yet autoregressive Transformer decoding performs only a handful of operations per byte loaded. The hardware is idling while your users are waiting. Speculative decoding exploits this gap by having a small, fast model draft multiple tokens ahead, then letting the large model verify them all in one parallel pass. The promise is 2–3x latency reduction with mathematically identical output quality. The reality is more nuanced.

After two years of production deployments across Google Search, coding assistants, and open-source serving frameworks, speculative decoding has graduated from research curiosity to standard optimization. But "standard" does not mean "drop-in." The technique has sharp edges around draft model selection, batch size sensitivity, and memory overhead that determine whether you get a 3x speedup or a net slowdown.

How Speculative Decoding Actually Works

The core idea borrows from CPU branch prediction. Instead of generating tokens one at a time from the expensive target model, you run a cheap draft model forward for K steps, producing K candidate tokens. Then the target model processes all K candidates in a single forward pass—which costs roughly the same as generating one token, because the bottleneck is loading weights from memory, not the computation itself.
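The draft-then-verify loop is easiest to see in the greedy (argmax) case. The following is a minimal sketch, not a production implementation: `target_next` and `draft_next` are hypothetical toy stand-ins for real models, and the verification list comprehension stands in for what a real system would compute in one batched forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def target_next(prefix: List[int]) -> int:
    """Hypothetical 'expensive' model: a deterministic toy rule."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_next(prefix: List[int]) -> int:
    """Hypothetical 'cheap' model: agrees with the target except
    when the prefix length is a multiple of 4."""
    guess = target_next(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 0 else guess


def plain_greedy(prompt: List[int], n_new: int, target: NextToken) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


def speculative_greedy(prompt: List[int], n_new: int, k: int,
                       draft: NextToken, target: NextToken) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, verification rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < n_new:
        # 1) Draft k candidate tokens autoregressively with the cheap model.
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft(tokens + drafted))
        # 2) Verify: the target's choice at each of the k + 1 positions.
        #    (Simulated here with k + 1 calls; counted as one round.)
        rounds += 1
        verified = [target(tokens + drafted[:i]) for i in range(k + 1)]
        # 3) Keep the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and drafted[n_ok] == verified[n_ok]:
            n_ok += 1
        # 4) Always append the target's own token at the first disagreement
        #    (or the extra token if every draft matched).
        tokens += drafted[:n_ok] + [verified[n_ok]]
    return tokens[: len(prompt) + n_new], rounds


spec_tokens, spec_rounds = speculative_greedy([1, 2, 3], 24, 4, draft_next, target_next)
baseline = plain_greedy([1, 2, 3], 24, target_next)
print(spec_tokens == baseline, spec_rounds)
```

With these toy models the speculative output matches plain greedy decoding token for token while using far fewer verification rounds than the 24 target calls the baseline needs.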

The verification step uses a modified rejection sampling scheme. For each drafted token, the target model compares its own probability p for that token against the draft model's probability q. If p ≥ q, the token is accepted. If not, it is accepted with probability p/q and rejected otherwise. When a token is rejected, all subsequent draft tokens are discarded, and the target model samples a correction token from an adjusted distribution, proportional to max(0, p − q) over the vocabulary. If every drafted token is accepted, the same target pass also yields one extra token for free.
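The accept/reject rule can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions: distributions are plain dictionaries over a tiny vocabulary, and the names `verify`, `residual`, and `sample` are hypothetical helpers, not any library's API.

```python
import random
from typing import Dict, List, Sequence, Tuple

Dist = Dict[str, float]  # token -> probability


def residual(p: Dist, q: Dist) -> Dist:
    """Adjusted distribution used after a rejection: normalize max(0, p - q)."""
    raw = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    total = sum(raw.values())
    if total == 0.0:
        # p and q are identical here, so a rejection cannot really happen.
        return dict(p)
    return {t: v / total for t, v in raw.items()}


def sample(dist: Dist, rng: random.Random) -> str:
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


def verify(drafted: Sequence[str], q_dists: Sequence[Dist],
           p_dists: Sequence[Dist], rng: random.Random) -> Tuple[List[str], int]:
    """Verify K drafted tokens.

    q_dists holds the K draft distributions the tokens were sampled from;
    p_dists holds the K + 1 target distributions from one forward pass.
    Returns (emitted tokens, number of drafted tokens accepted).
    """
    out: List[str] = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        # Accept with probability min(1, p(tok) / q(tok)).
        accept_prob = min(1.0, p.get(tok, 0.0) / q[tok])
        if rng.random() < accept_prob:
            out.append(tok)
        else:
            # First rejection: emit a correction token and drop the rest.
            out.append(sample(residual(p, q), rng))
            return out, i
    # Every draft accepted: the target pass also gives one extra token.
    out.append(sample(p_dists[len(drafted)], rng))
    return out, len(drafted)
```

Each call emits between 1 and K + 1 tokens, which is why even a poor draft never produces fewer tokens per target pass than ordinary decoding.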

This rejection sampling mechanism is what makes the guarantee possible: the output distribution is mathematically identical to what the target model would have produced on its own. You are not approximating or distilling. You are generating the exact same distribution, faster.
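The exactness claim can be checked numerically for a single position. The sketch below computes, in closed form, the distribution of the token that comes out of one accept-or-correct step; the two distributions are made-up illustrative values, and `output_distribution` is a hypothetical helper name.

```python
from typing import Dict

Dist = Dict[str, float]


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Distribution of the emitted token when a draft token is sampled
    from q, accepted with probability min(1, p/q), and otherwise replaced
    by a sample from normalize(max(0, p - q))."""
    tokens = sorted(set(p) | set(q))
    # P(draft = t and accepted) = q(t) * min(1, p(t)/q(t)) = min(p(t), q(t))
    accepted = {t: min(p.get(t, 0.0), q.get(t, 0.0)) for t in tokens}
    reject_mass = 1.0 - sum(accepted.values())
    raw = {t: max(0.0, p.get(t, 0.0) - q.get(t, 0.0)) for t in tokens}
    raw_total = sum(raw.values())
    out = {}
    for t in tokens:
        corrected = reject_mass * raw[t] / raw_total if raw_total > 0 else 0.0
        out[t] = accepted[t] + corrected
    return out


# Illustrative (made-up) target and draft distributions for one position.
p = {"the": 0.6, "a": 0.3, "an": 0.1}
q = {"the": 0.3, "a": 0.5, "an": 0.2}

result = output_distribution(p, q)
print(all(abs(result[t] - p[t]) < 1e-12 for t in p))
```

The draft distribution q drops out entirely: min(p, q) plus the rescaled max(0, p − q) adds back up to p, whatever q is. A worse draft changes how often tokens are accepted, not what gets emitted.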

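The expected number of tokens produced per verification round (accepted draft tokens plus the one token the target model always contributes) follows a clean formula based on the acceptance rate α and the number of speculated tokens γ (the K above), assuming each drafted token is accepted independently with probability α: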

τ = (1 - α^(γ+1)) / (1 - α)

At α = 0.8 with γ = 5, you get roughly 3.7 tokens per round on average. Since each round costs approximately one target-model forward pass, that is a 3.7x reduction in the number of expensive forward passes needed. The wall-clock speedup is somewhat lower, because the draft model's γ forward passes are not free.
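The formula is easy to evaluate directly. A small sketch (the function name is illustrative) that also shows how the gain flattens as γ grows:

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """tau = (1 - alpha^(gamma+1)) / (1 - alpha): expected tokens produced
    per verification round, assuming each drafted token is accepted
    independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # limit of the formula as alpha -> 1
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


for gamma in (1, 3, 5, 8):
    print(gamma, round(expected_tokens_per_round(0.8, gamma), 2))
```

For α = 0.8 and γ = 5 this returns about 3.69. Going from γ = 5 to γ = 8 adds well under one expected token per round, so longer drafts yield diminishing returns.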

The Draft Model Selection Problem

Draft model selection is where the "free lunch" framing breaks down. The choice determines your acceptance rate, which determines whether speculative decoding helps or hurts. And the intuition most people start with—pick the smallest model that still predicts well—turns out to be incomplete.

Recent large-scale benchmarks revealed a counterintuitive finding: a draft model's language modeling accuracy (perplexity) has little correlation with its throughput contribution to speculative decoding. The draft model's latency is a far stronger determinant of end-to-end speedup. A slightly less accurate model that runs 3x faster will outperform a more accurate but slower draft, because the verification step catches errors anyway.
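A simple cost model shows why draft latency can outweigh accuracy. The sketch below uses the standard idealized speedup estimate, τ / (γ·c + 1), where c is the cost of one draft step relative to one target forward pass. The acceptance rates and cost ratios are made-up illustrative values, and the model ignores real-world overheads such as scheduling and KV-cache management.

```python
def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain decoding.

    alpha:      per-token acceptance rate (assumed independent per token)
    gamma:      number of drafted tokens per round
    cost_ratio: time of one draft step / time of one target forward pass
    """
    if alpha == 1.0:
        tokens_per_round = float(gamma + 1)
    else:
        tokens_per_round = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    # One round costs gamma draft steps plus one target pass.
    round_cost = gamma * cost_ratio + 1.0
    return tokens_per_round / round_cost


# Hypothetical drafts: A is more accurate, B is 3x faster per step.
accurate_but_slow = estimated_speedup(alpha=0.8, gamma=5, cost_ratio=0.15)
rough_but_fast = estimated_speedup(alpha=0.7, gamma=5, cost_ratio=0.05)

print(round(accurate_but_slow, 2), round(rough_but_fast, 2))
```

Under these assumed numbers the less accurate draft comes out ahead, because every round pays γ draft steps up front whether or not the tokens are accepted.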

Size ratio matters. In practice, draft models should be roughly 1/10 to 1/50 the size of the target, and sometimes smaller still. Llama 3.2-1B drafting for Llama 3.1-70B (about 1/70) achieves strong results largely because same-family models share tokenization and training distribution, yielding higher acceptance rates than a generic small model of similar size.

Domain specificity matters more than scale. Off-the-shelf draft models often struggle on domain-specific tasks or very long contexts. Fine-tuning a draft model on your production query distribution can improve acceptance rates by 20–40%. For high-volume workloads, this investment pays for itself quickly. One practical insight from production teams: curating the data mix—balancing conversational, instruction-following, and code domains—had more impact on draft quality than simply scaling dataset size.

The break-even acceptance rate is around 0.55–0.60. Below this range, draft and verification overhead consume most of the gains from parallel verification. At acceptance rates of 0.6 or higher with 5+ speculated tokens, speedups in the 2–3x range become realistic, with the upper end requiring rates well above 0.6. Below 0.5, you are likely better off without speculative decoding entirely.

When Speculative Decoding Hurts

The technique has a clear failure mode that maps directly to hardware utilization. Speculative decoding trades extra compute for reduced memory traffic. At low batch sizes (1–10 concurrent requests), the GPU is memory-bound—sitting idle between weight loads—so spending compute on draft-then-verify is pure upside. But as batch size increases, the GPU becomes compute-bound, and speculative decoding's extra verification work starts competing for the arithmetic capacity that is now the bottleneck.
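A back-of-the-envelope roofline check makes the batch-size effect concrete. This is a rough sketch under strong assumptions: it counts only weight reads and about 2 FLOPs per weight per token, ignores KV-cache traffic, attention cost, and the draft model, and uses a hypothetical hardware ridge point of 300 FLOPs per byte. Real crossover points will differ and tend to arrive at smaller batch sizes.

```python
def arithmetic_intensity(batch_size: int, tokens_per_pass: int,
                         bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Each weight is read once and used in about 2 FLOPs (multiply and add)
    for every token position processed across the batch.
    """
    return 2.0 * batch_size * tokens_per_pass / bytes_per_param


def regime(batch_size: int, gamma: int, ridge_flops_per_byte: float) -> str:
    """Classify a verification pass over gamma + 1 positions per request."""
    intensity = arithmetic_intensity(batch_size, gamma + 1)
    return "memory-bound" if intensity < ridge_flops_per_byte else "compute-bound"


RIDGE = 300.0  # hypothetical: peak FLOP/s divided by memory bandwidth

for batch in (1, 8, 64):
    plain = arithmetic_intensity(batch, 1)
    spec = arithmetic_intensity(batch, 6)  # gamma = 5 drafted + 1
    print(batch, plain, spec, regime(batch, 5, RIDGE))
```

Verifying γ + 1 positions multiplies arithmetic intensity by γ + 1. At batch size 1 that extra work fits into otherwise idle compute; at large batch sizes it pushes the pass over the ridge point, where the additional FLOPs cost real time.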

Concrete numbers from production benchmarks:

  • Batch size 1–10: 2–3x speedup, the sweet spot
  • Batch size 10–30: Diminishing returns, 1.3–1.8x typical
  • Batch size 32+: Often performs worse than standard decoding

This means speculative decoding is ideal for interactive, latency-sensitive applications (chatbots, code completion, real-time assistants) where you are serving individual users and care about inter-token latency and total response time. It does not improve time-to-first-token, which is dominated by prompt prefill. It is counterproductive for batch processing workloads like bulk document summarization or offline evaluation, where you are already saturating GPU compute with large batches.

There are additional scenarios where it backfires:

  • Very short responses: If your typical output is under 20 tokens, there is not enough generation to amortize the draft overhead.