
Speculative Decoding in Production: Free Tokens and Hidden Traps

9 min read
Tian Pan
Software Engineer

Most LLM inference bottlenecks come down to one uncomfortable fact: the GPU is waiting on memory bandwidth, not compute. Each token generated requires loading the entire model's weights from HBM, and that transfer dominates runtime. Speculative decoding was designed to exploit this gap — but the gains depend on conditions your benchmark almost certainly didn't test.

Teams that ship speculative decoding into production often see it underperform lab numbers by 40–60%. Not because the technique is flawed, but because the workload characteristics differ in ways that matter: larger batch sizes, shorter outputs, stricter output constraints. Understanding when speculative decoding actually helps — and when it silently hurts — is the prerequisite for deploying it responsibly.

How It Works: Draft, Verify, Repeat

The mechanism is a classic parallel-processing trick. Instead of generating one token at a time with the expensive target model, you generate K candidate tokens (typically 5–7) with a lightweight draft model, then verify all K in a single forward pass of the target model.

The target model checks each candidate sequentially: if draft token i matches what the target would have picked, it's accepted and the process moves to i+1. At the first mismatch (position j), the target model provides the correct token for position j, and the draft cycle restarts. Any accepted prefix — even a partial one — is cheaper than generating those tokens sequentially with the target.
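The draft-verify cycle above can be sketched in a few lines. This is a toy greedy variant: `draft_next` and `target_argmax` are hypothetical stand-ins for the two models, each mapping a token prefix to a next-token id, and the verify loop queries position by position where a real system would run one batched target forward pass.

```python
def speculative_step(draft_next, target_argmax, prefix, k=5):
    """One draft-verify cycle (greedy sketch): propose k tokens with the
    cheap draft model, then check them against the target's choices."""
    # Draft phase: k cheap sequential predictions.
    candidates = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        candidates.append(t)
        ctx.append(t)

    # Verify phase: in a real system this is ONE forward pass of the target
    # model over all k positions; here we query it position by position.
    accepted = []
    ctx = list(prefix)
    for t in candidates:
        if target_argmax(ctx) == t:          # match: accept and advance
            accepted.append(t)
            ctx.append(t)
        else:                                # first mismatch: target supplies the fix
            accepted.append(target_argmax(ctx))
            break
    else:
        # All k accepted: the same target pass also yields one bonus token.
        accepted.append(target_argmax(ctx))
    return accepted
```

Note the asymmetry: the draft runs k small sequential passes, but the target amortizes all k verifications (plus one correction or bonus token) into a single expensive pass.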

The output guarantee is the whole point: mathematically, the final token sequence is identical to what the target model would have generated alone. No approximations, no quality loss. Just latency reduction when the draft model is right often enough.

The Real Speedup Numbers

Published benchmarks tend to report the best-case scenario. Here's what holds up in practice:

  • vLLM on CNN/DailyMail summarization: 2.8x speedup using prompt lookup decoding (no auxiliary model)
  • TensorRT-LLM on NVIDIA H200: up to 3.6x throughput improvement
  • AWS Trainium, decode-heavy workloads: up to 3x acceleration
  • EAGLE-3 draft models: 3.0–6.5x speedup vs. vanilla autoregressive generation

These numbers are real — but they were measured at batch size 1–4 with long outputs and high acceptance rates. That's the regime where speculative decoding was designed to shine.

The acceptance rate (α) is the core variable. It measures how often the draft model predicts the same token the target model would have chosen. At α = 0.6, you get roughly 2.4x speedup with 5 speculative tokens. At α = 0.8, that becomes 3.7x. Below α = 0.5, verification overhead outweighs the benefit and you're slower than baseline.
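The figures above follow from a simple geometric model. Assuming each of the K draft tokens is accepted independently with probability α, and draft-model cost is negligible, the expected tokens produced per target pass is (1 − α^(K+1)) / (1 − α):

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens per target forward pass under a geometric acceptance
    model: each of k draft tokens is independently accepted with probability
    alpha, and the pass always yields one bonus/correction token."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# With negligible draft cost, this is also the speedup over plain
# one-token-per-pass autoregressive decoding:
print(round(expected_tokens_per_pass(0.6, 5), 1))  # 2.4
print(round(expected_tokens_per_pass(0.8, 5), 1))  # 3.7
```

Real speedups sit below this bound because the draft model isn't free and acceptance isn't independent across positions, but the formula is a useful upper envelope for sizing experiments.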

Typical real-world acceptance rates fall between 0.6 and 0.8 — not the near-perfect 0.95 that some theoretical treatments assume. Task type matters enormously: summarization and completion of structured patterns yield high acceptance; open-ended creative generation and multilingual outputs yield low acceptance.

The Measurement Mistake That Misleads Everyone

Most speculative decoding evaluations measure token throughput (tokens generated per second across a batch). Your users experience wall-clock latency (time from their request to the last token). These metrics diverge exactly when batch sizes grow.

At batch size 1, speculative decoding reduces wall-clock latency because the GPU is memory-bandwidth-bound: generating more tokens per target-model pass is genuinely faster. At batch size 32+, the GPU is compute-bound. Verification overhead grows with batch size because the target model must process the entire batch × K speculative tokens simultaneously. The latency per individual request can actually increase.

The transition point varies by GPU and model size, but as a rough guide: speculative decoding helps wall-clock latency when your concurrent request count is below 4–8. Above that, the gains from speculation are consumed by verification overhead at scale.
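A toy cost model makes the regime change concrete. All parameters here are illustrative assumptions, not measurements: below some batch size the target pass is bandwidth-bound, so verifying K extra tokens is nearly free; above it the pass is compute-bound and cost scales with tokens processed.

```python
def spec_speedup(batch, alpha=0.7, k=5, compute_bound_batch=8):
    """Toy per-request speedup of speculative decoding vs. batch size.
    alpha, k, and compute_bound_batch are illustrative assumptions.

    Memory-bound (small batch): a target pass costs ~1 unit whether it
    verifies 1 token or k+1, so speculation keeps its full multiplier.
    Compute-bound (large batch): pass cost scales with tokens processed,
    so verifying k+1 tokens costs ~(k+1)x and the multiplier is eaten."""
    tokens = (1 - alpha ** (k + 1)) / (1 - alpha)  # expected tokens per pass
    verify_cost = 1.0 if batch < compute_bound_batch else float(k + 1)
    return tokens / verify_cost  # values < 1 mean slower than baseline
```

In this sketch, speculation wins roughly 2.9x at batch 1 and loses to baseline past the compute-bound threshold, which mirrors the "works in the demo, regresses in production" pattern described below. Real hardware transitions gradually rather than at a hard threshold.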

This explains a common failure mode: a team benchmarks speculative decoding at batch size 1 during development, sees a 2.5x speedup, ships it — then observes a 5% latency regression in production, where p50 concurrency is 16. The technique works exactly as advertised. The workload changed.

Draft Model Selection Is Not Obvious

The draft model must share an identical tokenizer and vocabulary with the target model. This is a hard constraint. Vocabulary mismatch causes acceptance rates to collapse to near-zero, making speculative decoding slower than baseline without any warning signal beyond degraded performance metrics.
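Because the failure is silent, it's worth checking compatibility explicitly before pairing models. A minimal sketch: compare the token→id maps for an exact match (in practice you might load them via Hugging Face's `AutoTokenizer.from_pretrained(name).get_vocab()` — assumed usage; the function below just takes plain dicts).

```python
def check_vocab_match(draft_vocab, target_vocab):
    """Summarize mismatches between two token->id maps. Anything short of
    an exact match is a red flag for speculative decoding: a token id
    accepted from the draft would be reinterpreted by the target."""
    only_draft = draft_vocab.keys() - target_vocab.keys()
    only_target = target_vocab.keys() - draft_vocab.keys()
    id_conflicts = {t for t in draft_vocab.keys() & target_vocab.keys()
                    if draft_vocab[t] != target_vocab[t]}
    return {
        "compatible": not (only_draft or only_target or id_conflicts),
        "missing_in_target": len(only_draft),
        "missing_in_draft": len(only_target),
        "id_mismatches": len(id_conflicts),
    }
```

Run this once at deploy time; it's far cheaper than diagnosing a mysteriously collapsed acceptance rate in production.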

Within that constraint, draft model selection is a trade-off between three factors:

Draft model size vs. acceptance rate: Smaller drafts are faster to run but produce lower-quality candidates. A 1.5B parameter draft for a 13B target is a common starting point; a 7B draft for a 70B target. The optimal size depends on your task distribution, not just raw parameter counts.

Output embedding dominance: The language modeling head (the matrix mapping from hidden states to vocabulary logits) often dominates draft model inference time for large vocabularies (50k+ tokens). A "small" draft model with a large vocabulary can be slower than expected. Recent vocabulary trimming techniques address this by pruning rarely-used tokens from the draft vocabulary.

EAGLE-style fine-tuned drafts vs. independent models: Purpose-trained draft models that predict based on the target's hidden states (EAGLE, EAGLE-2, EAGLE-3) consistently outperform independently trained small models for the same vocabulary. EAGLE-3 maintains 70–80% acceptance rates across all generation positions, while naive draft models degrade at longer positions due to error accumulation.

If you don't want to train a draft model, vLLM's prompt lookup decoding uses n-gram matching against the input prompt — no auxiliary model required. It achieves 2.8x speedup on repetitive tasks like summarization and code completion where the output echoes input phrases. The overhead is negligible, making it a safe first experiment.
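The idea behind prompt lookup is simple enough to sketch directly: find where the last few generated tokens also appear in the prompt, and propose the tokens that followed there as the draft. This is an illustration of the n-gram technique, not vLLM's implementation; parameter names are ours.

```python
def prompt_lookup_draft(prompt_ids, context_ids, ngram=3, k=5):
    """Draft k candidate tokens without any auxiliary model: match the last
    `ngram` context tokens against the prompt and copy what followed."""
    if len(context_ids) < ngram:
        return []
    tail = context_ids[-ngram:]
    # Simple linear scan from the end of the prompt backwards.
    for i in range(len(prompt_ids) - ngram, -1, -1):
        if prompt_ids[i:i + ngram] == tail:
            return prompt_ids[i + ngram:i + ngram + k]
    return []  # no match: fall back to normal decoding this step
```

When the output echoes the input (summaries quoting the source, code completing a visible signature), the copied span is often exactly right, which is why acceptance rates on those tasks are so high.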

Where It Fails in Production

Three failure modes show up repeatedly once speculative decoding handles real traffic.

Structured output corruption. When speculative decoding is combined with reasoning parsers and structured output constraints (JSON schemas, regex grammars), tokens can be silently dropped during batch verification. The constraint enforcement never completes correctly. This is documented in vLLM and affects any workflow that combines speculative decoding with constrained generation. The mitigation is to disable speculative decoding for constrained generation paths entirely.

Ragged tensor misalignment. In batch inference, different sequences in a batch accept different numbers of speculative tokens, creating misaligned tensor shapes that GPU parallelism handles poorly. Naive implementations can silently produce incorrect outputs at non-trivial probability. Newer implementations (2025) solve the alignment problem but at a performance cost that reduces the speedup.

MoE routing breakdown. Mixture-of-Experts models route tokens through different expert subnetworks. Draft and target models may route the same token through different experts, which breaks the acceptance rate math entirely. Speculative decoding on MoE architectures often performs worse than baseline. If your serving stack uses a MoE model (Mixtral, DeepSeek-MoE, or similar), validate acceptance rates empirically before committing to the architecture.

The Operational Overhead Nobody Mentions

Running speculative decoding means running two models, not one.

GPU memory usage grows by 10–20GB on an H100 for the draft model state alone. You need to version, test, and update two models in concert — any tokenizer change to the target model requires retraining the draft model. KV cache management becomes more complex because both draft and target maintain separate KV states that must be kept synchronized across requests.

For teams running on managed inference providers (Bedrock, Vertex AI, Together AI), speculative decoding may already be active under the hood — some providers apply it transparently to reduce latency. In that case, the operational burden falls on the provider, not you. The flip side is you have no control over when it's applied or disabled.

For self-hosted serving with vLLM or TensorRT-LLM, enabling speculative decoding requires choosing a draft model, monitoring acceptance rates per traffic segment, and debugging degraded performance when the acceptance rate drops on a new request distribution. It's not a configuration flag you set once and forget.
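Per-segment acceptance monitoring doesn't need to be elaborate. A minimal sketch (the segment names and the 0.5 break-even floor are illustrative assumptions): feed it proposed/accepted counts from your serving loop and alert on segments drifting below break-even.

```python
from collections import defaultdict

class AcceptanceMonitor:
    """Track acceptance rate per traffic segment and flag segments where
    speculation has likely become a net loss (alpha below the floor)."""

    def __init__(self, floor=0.5):
        self.floor = floor
        self.stats = defaultdict(lambda: [0, 0])  # segment -> [accepted, proposed]

    def record(self, segment, proposed, accepted):
        s = self.stats[segment]
        s[0] += accepted
        s[1] += proposed

    def alpha(self, segment):
        accepted, proposed = self.stats[segment]
        return accepted / proposed if proposed else None

    def below_floor(self):
        return [seg for seg in self.stats
                if (a := self.alpha(seg)) is not None and a < self.floor]
```

Segmenting matters because the aggregate α can look healthy while one traffic class (say, multilingual requests) runs below break-even and quietly pays a latency penalty.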

When to Actually Use It

Speculative decoding is the right tool in a narrow but important set of conditions:

  • Your serving workload is latency-sensitive and primarily interactive (batch size 1–4)
  • Output sequences are long (>500 tokens) — the longer the generation, the more the math favors speculation
  • Your task has predictable structure (summarization, code completion, continuation tasks with high n-gram overlap)
  • You're running on a single GPU or a small cluster where memory bandwidth, not compute, is the bottleneck
  • You can tolerate the engineering cost of maintaining a draft model

If your priority is maximizing throughput for batch workloads, improving time-to-first-token (which speculative decoding doesn't address), or minimizing operational complexity, the technique likely isn't worth the investment. Quantization, continuous batching, and efficient KV cache management typically give more reliable wins for those goals.

The speedup is real. But it's a speedup for a specific regime — interactive, long, structured generation at low concurrency. Before deploying, run your actual traffic distribution through a latency simulation at expected concurrency. The benchmark you read almost certainly tested the best case.

Where This Goes

The current trajectory is toward self-speculative decoding — methods that use the target model itself to generate draft tokens by skipping layers (LayerSkip, Speculative Streaming). These eliminate the vocabulary compatibility problem and the operational burden of a separate model, at the cost of some speedup compared to a purpose-trained draft. For most production teams, that trade-off will increasingly favor the self-speculative approach as those techniques mature.

Speculative decoding scaling laws (2025) now allow predicting optimal draft model size before training, which reduces the experimental cost of finding the right draft configuration. And EAGLE-3's improved acceptance rate stability across long sequences suggests the remaining gap between lab and production benchmarks is closing. The technique is becoming less fragile, but the workload-matching requirement remains unchanged.

The free tokens are real. So are the traps. Map your workload before you ship.
