Speculative Decoding in Practice: The Free Lunch That Isn't Quite Free
Your 70-billion-parameter model spends most of its inference time waiting on memory, not doing math. Modern GPUs can perform hundreds of arithmetic operations for every byte they read from memory, yet autoregressive Transformer decoding performs only a handful of operations per byte loaded. The hardware is idling while your users are waiting. Speculative decoding exploits this gap by having a small, fast model draft multiple tokens ahead, then letting the large model verify them all in one parallel pass. The promise is 2–3x latency reduction with mathematically identical output quality. The reality is more nuanced.
After two years of production deployments across Google Search, coding assistants, and open-source serving frameworks, speculative decoding has graduated from research curiosity to standard optimization. But "standard" does not mean "drop-in." The technique has sharp edges around draft model selection, batch size sensitivity, and memory overhead that determine whether you get a 3x speedup or a net slowdown.
How Speculative Decoding Actually Works
The core idea borrows from CPU branch prediction. Instead of generating tokens one at a time from the expensive target model, you run a cheap draft model forward for K steps, producing K candidate tokens. Then the target model processes all K candidates in a single forward pass—which costs roughly the same as generating one token, because the bottleneck is loading weights from memory, not the computation itself.
The verification step uses a modified rejection sampling scheme. For each drafted token, the target model compares its own probability distribution against the draft model's. If the target model assigns equal or higher probability to the drafted token, it accepts. If not, it accepts with probability equal to the ratio of target-to-draft probability, and rejects otherwise. When a token is rejected, all subsequent draft tokens are discarded, and the target model samples a correction token from an adjusted residual distribution.
This rejection sampling mechanism is what makes the guarantee possible: the output distribution is mathematically identical to what the target model would have produced on its own. You are not approximating or distilling. You are generating the exact same distribution, faster.
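The accept/reject rule is compact enough to sketch in full. This is an illustrative NumPy version, not any framework's actual implementation; `draft_tokens`, `q`, and `p` are assumed inputs (the drafted token ids and the two models' per-position probability distributions):

```python
import numpy as np

def verify_drafts(draft_tokens, q, p, rng):
    """Modified rejection sampling over K drafted tokens.

    draft_tokens: K token ids proposed by the draft model.
    q: (K, V) draft-model probabilities at each drafted position.
    p: (K+1, V) target-model probabilities (extra row for the bonus token).
    Returns 1..K+1 accepted token ids.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        # Accept outright if p >= q; otherwise accept with probability p/q.
        # (q[i, tok] > 0 is guaranteed, since the draft sampled this token.)
        if rng.random() < min(1.0, p[i, tok] / q[i, tok]):
            out.append(int(tok))
            continue
        # Reject: discard the remaining drafts and resample from the
        # residual distribution max(p - q, 0), renormalized.
        residual = np.maximum(p[i] - q[i], 0.0)
        out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
        return out
    # All K drafts accepted: the verification pass has already computed the
    # distribution for position K+1, so a bonus token comes for free.
    out.append(int(rng.choice(p.shape[1], p=p[len(draft_tokens)])))
    return out
```

Note the asymmetry: a rejection ends the round early, while full acceptance yields K+1 tokens, since the target's verification pass gives the next-position distribution at no extra cost.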
The expected number of tokens accepted per verification round follows a clean formula based on the acceptance rate α and the number of speculated tokens γ:
τ = (1 - α^(γ+1)) / (1 - α)
At α = 0.8 with γ = 5, you accept roughly 3.7 tokens per round on average. Since each round costs approximately one target-model forward pass, that is a roughly 3.7x reduction in the number of expensive forward passes needed.
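The formula is easy to check numerically (plain Python, no framework assumed):

```python
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """tau = (1 - alpha^(gamma+1)) / (1 - alpha): expected tokens emitted
    per verification round, counting the bonus/correction token."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# alpha = 0.8, gamma = 5 -> roughly 3.7 tokens per round
print(round(expected_tokens_per_round(0.8, 5), 2))  # 3.69
```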
The Draft Model Selection Problem
Draft model selection is where the "free lunch" framing breaks down. The choice determines your acceptance rate, which determines whether speculative decoding helps or hurts. And the intuition most people start with—pick the smallest model that still predicts well—turns out to be incomplete.
Recent large-scale benchmarks revealed a counterintuitive finding: a draft model's standalone language-modeling quality (as measured by perplexity) correlates only weakly with its throughput contribution to speculative decoding. The draft model's latency is a far stronger determinant of end-to-end speedup. A slightly less accurate model that runs 3x faster will outperform a more accurate but slower draft, because the verification step catches errors anyway.
Size ratio matters. In practice, draft models should be 1/10 to 1/50 the size of the target. Llama 3.2-1B drafting for Llama 3.1-70B achieves strong results precisely because same-family models share tokenization and training distribution, yielding higher acceptance rates than a generic small model of similar size.
Domain specificity matters more than scale. Off-the-shelf draft models often struggle on domain-specific tasks or very long contexts. Fine-tuning a draft model on your production query distribution can improve acceptance rates by 20–40%. For high-volume workloads, this investment pays for itself quickly. One practical insight from production teams: curating the data mix—balancing conversational, instruction-following, and code domains—had more impact on draft quality than simply scaling dataset size.
The acceptance rate threshold is around 0.55–0.60. Below this, verification overhead consumes the gains from parallel token generation. At acceptance rates of 0.6 or higher with 5+ speculated tokens, you can reliably expect 2–3x speedups. Below 0.5, you are likely better off without speculative decoding entirely.
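These break-even points fall out of a simple cost model: extend the expected-tokens formula with a draft-cost ratio c (the draft model's per-step latency as a fraction of a target forward pass, an assumed parameter you would measure on your own hardware). This sketch ignores batching effects and verification-kernel overhead:

```python
def estimated_speedup(alpha: float, gamma: int, c: float) -> float:
    """Walltime speedup estimate: expected tokens per round divided by the
    round's cost in target-forward-pass units (gamma draft steps at relative
    cost c, plus one target verification pass)."""
    tau = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return tau / (gamma * c + 1)
```

With α = 0.8, γ = 5, and a draft costing 5% of a target pass (c = 0.05), this predicts roughly 3x. At α = 0.5 with a slower draft (c = 0.3), it drops below 1.0 — a net slowdown — which is why both acceptance rate and draft latency have to clear their thresholds.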
When Speculative Decoding Hurts
The technique has a clear failure mode that maps directly to hardware utilization. Speculative decoding trades extra compute for reduced memory traffic. At low batch sizes (1–10 concurrent requests), the GPU is memory-bound—sitting idle between weight loads—so spending compute on draft-then-verify is pure upside. But as batch size increases, the GPU becomes compute-bound, and speculative decoding's extra verification work starts competing for the arithmetic capacity that is now the bottleneck.
Concrete numbers from production benchmarks:
- Batch size 1–10: 2–3x speedup, the sweet spot
- Batch size 10–30: Diminishing returns, 1.3–1.8x typical
- Batch size 32+: Often performs worse than standard decoding
This means speculative decoding is ideal for interactive, latency-sensitive applications—chatbots, code completion, real-time assistants—where you are serving individual users and care about time-to-first-token and inter-token latency. It is counterproductive for batch processing workloads like bulk document summarization or offline evaluation, where you are already saturating GPU compute with large batches.
There are additional scenarios where it backfires:
- Very short responses: If your typical output is under 20 tokens, there is not enough generation to amortize the draft overhead.
- High-temperature creative generation: Random sampling with high temperature produces distributions that draft models predict poorly, tanking acceptance rates.
- Memory-constrained deployments: The draft model's weights (1–8 GB), its KV cache, and verification tensor allocations all consume GPU memory that could otherwise go to larger batch sizes or longer context windows.
The Framework Landscape: vLLM, SGLang, and TensorRT-LLM
Speculative decoding moved from experimental to production-ready across all major serving frameworks in 2025. Each framework has different strengths.
vLLM is the easiest starting point. It supports draft-model speculative decoding, EAGLE, and n-gram based speculation out of the box. Configuration is a few command-line flags. The community is large, bugs get found and fixed quickly, and the continuous batching integration is mature.
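To illustrate the "few command-line flags" claim — exact flag names have shifted across vLLM releases (recent versions take a JSON `--speculative-config`; older ones used separate flags), so treat this as a sketch and check `vllm serve --help` for your installed version:

```shell
# Sketch: serve Llama 3.1-70B with a same-family 1B draft model,
# speculating 5 tokens ahead. Verify flag spelling against your vLLM version.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --speculative-config '{"model": "meta-llama/Llama-3.2-1B-Instruct", "num_speculative_tokens": 5}'
```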
SGLang offers marginally better performance at moderate concurrency and a stronger draft model training story through its SpecForge tooling. However, production teams have found bugs in SGLang's token counting for speculative batches that deflated measured throughput by approximately 35%. The server computed output token counts incorrectly, making benchmarks look worse than reality. The lesson: always validate speculative decoding benchmarks with server-side metrics, not just client-side timing.
TensorRT-LLM delivers the highest raw performance on NVIDIA hardware, particularly with FP8 quantization on H200 GPUs (3.6x throughput improvement reported). But it requires model conversion through NVIDIA's build pipeline and is less flexible for experimentation.
One production lesson that applies across all frameworks: three bugs were found in official EAGLE3 model releases during validation work, including one that caused generation to silently produce truncated outputs—the server returned valid-looking responses shorter than requested length, with no error or warning. Silent failures in speculative decoding are particularly insidious because the technique's correctness guarantee only holds when the implementation is bug-free.
Beyond Draft Models: EAGLE, Medusa, and Self-Speculative Approaches
The draft-model approach is the most proven, but the field has developed alternatives that trade different things.
EAGLE and EAGLE-3 attach a lightweight autoregressive prediction head directly to the target model's internal layers, extracting embeddings from low, middle, and high layers. This eliminates the separate draft model entirely. EAGLE-3 uses dynamic draft trees to explore multiple generation paths simultaneously, generating longer branches of predictable text. Typical acceptance rates hit 80%, yielding 2.5–2.8x speedups. The downside is that you need a pre-trained EAGLE variant for your specific model, and these are not always available.
Medusa adds multiple prediction heads to the target model that each predict a future token position. It avoids draft model overhead but requires modifying and retraining the target model itself—a significant investment that most teams cannot justify unless they control the entire model lifecycle.
Self-speculative decoding (SWIFT) uses the target model itself as the draft by skipping certain layers during the draft phase. This requires no additional model artifacts at all. The tradeoff is modest speedups (1.3–1.6x), making it most attractive when GPU memory is too tight for a separate draft model.
N-gram speculation is the simplest approach: it looks at patterns in the prompt or previously generated text and drafts tokens based on string matching. It requires no neural network at all and works surprisingly well for structured outputs like JSON, SQL, or templated text. For unstructured natural language, it provides minimal benefit.
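The string-matching idea is simple enough to show in full. This is a toy version of prompt-lookup-style drafting, not any framework's implementation; token ids stand in for real tokenizer output, and the window sizes are arbitrary:

```python
def ngram_draft(tokens, n=3, k=5):
    """Draft up to k tokens by matching the trailing n-gram against
    earlier text (prompt + output so far). Returns [] if no match.
    """
    if len(tokens) < n:
        return []
    tail = tokens[-n:]
    # Scan right-to-left for the most recent earlier occurrence of the tail,
    # excluding the tail's own position at len(tokens) - n.
    for i in range(len(tokens) - n - 1, -1, -1):
        if tokens[i:i + n] == tail:
            # Propose whatever followed that occurrence last time.
            return tokens[i + n:i + n + k]
    return []
```

For repetitive structures (JSON keys, SQL boilerplate, quoted context), the tail often has an exact earlier match, so the drafted continuation is frequently accepted; for free-form prose it rarely matches, which is why the benefit is minimal there.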
Designing Speculative Decoding In from Day One
The most consequential practical insight from production teams is architectural: speculative decoding must be designed in early. Retrofitting it into an existing serving pipeline is painful because it touches batching logic, memory management, KV cache allocation, and monitoring.
Teams that treat speculative decoding as a first-class architectural primitive report cleaner capacity planning and more predictable tail latency. This is especially true for non-streaming products—search results, document analysis, tool-use agents—where users do not see intermediate tokens. In these cases, all perceived latency is back-loaded, and speculative decoding's ability to reduce decode-phase latency translates directly into better user experience.
For monitoring, you need to track acceptance rate as a first-class metric alongside the usual latency and throughput numbers. Acceptance rate drift is your early warning signal for distribution shift—if your users start asking different types of questions, your draft model's alignment degrades silently. A generic draft model delivering baseline gains today can become a bottleneck tomorrow as your query distribution evolves.
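A minimal drift monitor might look like the following sketch. The 0.55 alert threshold matches the break-even range discussed earlier; the smoothing constant is an arbitrary illustrative choice:

```python
class AcceptanceMonitor:
    """Exponential moving average of per-round acceptance rate,
    used as an early-warning signal for query-distribution drift."""

    def __init__(self, alert_below: float = 0.55, smoothing: float = 0.01):
        self.alert_below = alert_below
        self.smoothing = smoothing
        self.ema = None

    def record(self, accepted: int, drafted: int) -> bool:
        """Record one verification round; True means the drift alert fires."""
        rate = accepted / drafted
        self.ema = rate if self.ema is None else (
            (1 - self.smoothing) * self.ema + self.smoothing * rate)
        return self.ema < self.alert_below
```

Feeding this from your serving loop (one call per verification round) turns acceptance rate into an alertable metric rather than something you only look at during the initial benchmark.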
Adaptive systems like Online Speculative Decoding (OSD) address this by continuously fine-tuning the draft model to the evolving query distribution during serving, using knowledge distillation. This is the frontier: speculative decoding as a living system that adapts to your traffic, not a static optimization you configure once and forget.
The Decision Framework
Before enabling speculative decoding, run through this checklist:
- Workload type: Interactive and latency-sensitive? Proceed. Batch offline processing? Probably skip.
- Batch size: Typically under 10 concurrent requests per GPU? Strong candidate. Regularly above 32? Likely counterproductive.
- Output length: Typical responses over 50 tokens? Good. Under 20 tokens? Marginal benefit.
- Acceptance rate: Benchmark your draft model on real production queries. Below 0.55? Do not enable. Above 0.65? Expect meaningful gains.
- Memory headroom: Can you spare 2–8 GB for draft model weights plus KV cache overhead? If not, consider self-speculative or n-gram approaches.
- Framework maturity: Are you on vLLM, SGLang, or TensorRT-LLM? First-class support exists. Custom serving stack? Expect significant integration work.
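The checklist collapses into a rough go/no-go heuristic. The thresholds below are this article's rules of thumb, not universal constants, and any real decision should be backed by benchmarks on production traffic:

```python
def should_enable_spec_decoding(batch_size: int,
                                typical_output_tokens: int,
                                measured_acceptance: float,
                                spare_memory_gb: float) -> bool:
    """Go/no-go sketch mirroring the checklist's rule-of-thumb thresholds."""
    if batch_size > 32:
        return False   # compute-bound: verification competes for FLOPs
    if typical_output_tokens < 20:
        return False   # too little generation to amortize draft overhead
    if measured_acceptance < 0.55:
        return False   # verification overhead eats the gains
    if spare_memory_gb < 2:
        return False   # no headroom for draft weights + KV cache
    return True
```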
Speculative decoding is the most impactful single optimization for interactive LLM inference available today. Google uses it in production for AI Overviews in Search. It reduces not just latency but also energy costs and hardware requirements—fewer expensive forward passes means fewer machines for the same traffic. But it demands careful draft model selection, honest benchmarking on your actual workload, and architectural investment in your serving pipeline. The lunch is real. The bill comes in engineering time.
Sources
- https://research.google/blog/looking-back-at-speculative-decoding/
- https://bentoml.com/llm/inference-optimization/speculative-decoding
- https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
- https://nebius.com/blog/posts/moe-spec-decoding
- https://introl.com/blog/speculative-decoding-llm-inference-speedup-guide-2025
- https://blog.premai.io/speculative-decoding-2-3x-faster-llm-inference-2026/
- https://www.bentoml.com/blog/3x-faster-llm-inference-with-speculative-decoding
- https://huggingface.co/blog/lujangusface/tw-eagle3-gpu
