Use a small draft model to propose multiple tokens in parallel, then verify them with the large target model in a single forward pass. Achieves 2-3× latency speedup with identical output distribution.
Autoregressive LLM generation is sequential: tokens are generated one at a time, and each step requires a full forward pass through the model. For a 70B-parameter model, each token costs roughly 140 GFLOPs (about 2 FLOPs per parameter). The throughput bottleneck, however, is memory bandwidth rather than compute: generating a single token requires streaming all 70B parameters (~140 GB at FP16) from HBM to the compute units. On A100-class hardware with ~2 TB/s of HBM bandwidth, this caps single-stream decoding at roughly 15 tokens/second.
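The bandwidth bound can be sanity-checked with back-of-the-envelope arithmetic; the figures below (FP16 weights, ~2 TB/s HBM) are illustrative assumptions, not measurements:

```python
def decode_rate_tokens_per_s(n_params: float, bytes_per_param: float,
                             hbm_bw_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode rate when decoding is
    memory-bandwidth-bound: each token streams all weights from HBM once."""
    weight_bytes = n_params * bytes_per_param
    return hbm_bw_bytes_per_s / weight_bytes

# 70B parameters at FP16 (2 bytes/param), ~2e12 B/s HBM bandwidth (assumed)
rate = decode_rate_tokens_per_s(70e9, 2, 2e12)
print(round(rate, 1))  # ≈ 14.3 tokens/s
```

Quantizing weights (fewer bytes per parameter) raises this ceiling proportionally, which is why INT8/INT4 inference decodes faster on the same hardware.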
Speculative decoding exploits the observation that a small model (e.g. 7B) can correctly predict the next 3–5 tokens most of the time. If we can verify all those predictions in a single large-model forward pass, we get multiple tokens per expensive call.
At each step of speculative decoding:

1. The draft model autoregressively proposes γ candidate tokens.
2. The target model scores all γ candidates (plus one extra position) in a single forward pass.
3. Each draft token x̃ is accepted with probability min(1, p(x̃)/q(x̃)), where q is the draft distribution and p the target distribution; the first rejection discards that token and everything after it.
4. One additional token is sampled from the target model (from the residual distribution if a rejection occurred), so every cycle emits at least one token.
The acceptance criterion min(1, p/q) is the key insight: it guarantees that the output distribution of speculative decoding is identical to that of the target model, with no quality degradation. The math: accept x̃ with probability p(x̃)/q(x̃) when p(x̃) < q(x̃), and always accept when p(x̃) ≥ q(x̃); combined with resampling from the residual distribution on rejection (below), the resulting marginal distribution equals p.
When a token is rejected at position i, we sample from the residual distribution max(0, p - q) / Z rather than simply resampling from p. This ensures no token is wasted — the rejection produces a valid target sample that's different from the draft token.
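The combined accept/residual rule can be checked empirically on a toy vocabulary. The sketch below draws many samples through the scheme and verifies they follow the target distribution p, not the draft distribution q:

```python
import random

def spec_sample(p, q, rng):
    """One speculative-sampling step over a toy vocabulary: the draft
    proposes from q, we accept with prob min(1, p/q), otherwise we sample
    from the renormalised residual max(0, p - q)."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    return rng.choices(vocab, weights=[r / z for r in residual])[0]

# Empirical check: output frequencies should match p even though q differs
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(100_000):
    counts[spec_sample(p, q, rng)] += 1
freqs = [c / 100_000 for c in counts]
print([round(f, 2) for f in freqs])
```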
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

@torch.no_grad()
def speculative_decode(draft_model, target_model, tokenizer, prompt,
                       gamma=4, max_new=100):
    # No KV cache: written for clarity, not speed.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(draft_model.device)
    generated = input_ids.clone()
    while generated.shape[1] - input_ids.shape[1] < max_new:
        n_ctx = generated.shape[1]
        # Step 1: draft gamma tokens, keeping each full draft distribution
        draft_ids = generated.clone()
        draft_dists = []
        for _ in range(gamma):
            logits = draft_model(draft_ids).logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, 1)
            draft_dists.append(probs[0])
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        # Step 2: verify with target model (single forward pass)
        target_logits = target_model(draft_ids).logits[0, n_ctx - 1:, :]
        target_probs = torch.softmax(target_logits, dim=-1)
        # Step 3: accept each draft token with probability min(1, p/q)
        accepted = 0
        for i in range(gamma):
            draft_tok = draft_ids[0, n_ctx + i].item()
            p = target_probs[i, draft_tok].item()
            q = draft_dists[i][draft_tok].item()
            if torch.rand(1).item() < min(1.0, p / q):
                accepted += 1
            else:
                break
        new_tokens = draft_ids[:, n_ctx:n_ctx + accepted]
        generated = torch.cat([generated, new_tokens], dim=-1)
        # Step 4: one extra target-sampled token. On rejection, sample from
        # the renormalised residual max(0, p - q); if all gamma tokens were
        # accepted, sample from the target distribution directly.
        if accepted < gamma:
            residual = torch.clamp(target_probs[accepted] - draft_dists[accepted], min=0)
            next_dist = residual / residual.sum()
        else:
            next_dist = target_probs[gamma]
        bonus_tok = torch.multinomial(next_dist, 1).view(1, 1)
        generated = torch.cat([generated, bonus_tok], dim=-1)
    generated = generated[:, : input_ids.shape[1] + max_new]
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```
The draft model is the most important hyperparameter. Requirements: (1) the same vocabulary and tokenizer as the target model, (2) fast enough that γ draft steps cost less than one target step, (3) a high acceptance rate (≥ 0.7). A good pair is typically a small model from the same family as the target, e.g. a 7B draft for a 70B target.
For a 70B/7B pair, if the per-position acceptance rate is α = 0.8 and γ = 4, the expected tokens per target call (under an i.i.d.-acceptance approximation) is (1 − α^(γ+1))/(1 − α) ≈ 3.4, giving ~3× fewer target calls and ~2–3× wall-clock speedup (draft overhead reduces the theoretical gain).
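The expectation above follows from a geometric series: a cycle emits k accepted tokens plus the bonus token with probability α^k(1−α), so the expected emitted count per target call can be computed directly:

```python
def expected_tokens_per_target_call(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass (accepted draft
    tokens plus the bonus token), assuming each draft token is accepted
    independently with probability alpha."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

print(round(expected_tokens_per_target_call(0.8, 4), 2))  # 3.36
```

Note the diminishing returns in γ: because a single rejection discards everything after it, raising γ beyond 4–8 adds little once α^γ is small.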
Speculative decoding primarily reduces latency (time for a single request), not throughput (total tokens/second across many requests). When the GPU is already fully utilised with batched requests, speculative decoding may not help, since the verification step has the same batch dimension as the target-only baseline. It shines in low-batch, latency-sensitive settings: interactive chat, streaming completions, and single-request pipelines where the GPU would otherwise sit idle between tokens.
Speculative decoding accelerates autoregressive LLM inference by using a small, fast draft model to propose multiple candidate tokens, which a larger target model then verifies in a single parallel forward pass. Tokens that the target model agrees with are accepted; the first rejection discards the remaining draft tokens and triggers a corrected sample from the target distribution at that position. When the draft model's predictions align well with the target, multiple tokens are accepted per target model step, increasing effective throughput.
| Configuration | Acceptance Rate | Speedup | Memory Overhead |
|---|---|---|---|
| Same-family (7B draft / 70B target) | 70–85% | 2–3× | +7B params |
| Cross-family (mismatched vocab) | N/A | N/A (incompatible) | N/A |
| Self-speculative (layer skipping) | 65–75% | 1.5–2× | None |
| Medusa (multiple heads) | 60–80% | 2–3× | Small MLP heads |
The acceptance rate — the fraction of draft tokens accepted by the target model — is the key metric determining speculative decoding speedup. Acceptance rates above 70% typically produce 2× or better throughput improvements; rates below 50% may produce little benefit after accounting for the overhead of running the draft model. Acceptance rates vary significantly by task type: continuation of predictable text (filling in boilerplate, completing common phrases) achieves high acceptance rates, while creative generation and complex reasoning produce lower rates as the draft model's predictions diverge from the target's choices.
Speculative decoding preserves exact output equivalence with standard autoregressive decoding: every accepted token sequence has the same probability under the target model's distribution as if it had been generated without speculative decoding. This mathematical equivalence means speculative decoding is a pure latency optimization with no quality trade-off, unlike quantization or pruning, which exchange some quality for efficiency. Note the guarantee is distributional rather than deterministic: sampling still involves randomness, so reproducibility additionally requires fixed seeds. The distributional guarantee is what matters for applications that must certify their outputs are statistically indistinguishable from the target model's.
```python
# Speculative decoding with HuggingFace
from transformers import pipeline

# Load draft and target models; the draft model must share its
# vocabulary with the target model.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B-Instruct",
    assistant_model="meta-llama/Llama-3.2-3B-Instruct",  # draft
    device_map="auto",
)
```
Draft model selection is critical for achieving high speculative decoding speedup. The draft model must use the same vocabulary and tokenizer as the target model to ensure token-level compatibility — a draft model tokenizing "Python" as one token must produce the same token ID as the target model for acceptance verification to work correctly. For Llama-family models, using a smaller Llama variant (3B draft for 70B target) provides good vocabulary compatibility. Cross-family speculative decoding (using a Mistral model as draft for a Llama target) is not directly supported without additional alignment techniques.
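A quick compatibility check is to compare the two tokenizers' vocabularies directly. The sketch below operates on plain vocab dicts (in practice you would pass `tokenizer.get_vocab()` from each model); the toy vocabularies are illustrative:

```python
def vocabs_compatible(draft_vocab: dict, target_vocab: dict) -> bool:
    """True if every draft token maps to the same ID in the target vocab,
    which is what token-level acceptance verification requires."""
    return all(target_vocab.get(tok) == idx for tok, idx in draft_vocab.items())

# Toy example: same tokenization vs. a mismatched segmentation
llama_like = {"<s>": 0, "Py": 1, "thon": 2}
print(vocabs_compatible(llama_like, llama_like))                      # True
print(vocabs_compatible(llama_like, {"<s>": 0, "Pyth": 1, "on": 2}))  # False
```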
Batch speculative decoding in high-throughput serving scenarios introduces complexity because different requests in the same batch may have different acceptance rates and different numbers of accepted tokens per step. The batch cannot advance uniformly by the same number of tokens per step, requiring careful state management to track where each request is in its generation process. Production serving frameworks like SGLang and vLLM implement batched speculative decoding with these complexities handled internally, but the performance benefit per request decreases as batch sizes grow due to the heterogeneous acceptance pattern across requests.
Speculative decoding acceptance rate monitoring provides a leading indicator of serving efficiency. When acceptance rates drop below approximately 60%, the overhead of running the draft model exceeds its speedup benefit, and disabling speculative decoding improves overall throughput. Acceptance rates vary significantly with prompt type: structured output generation (JSON, code) tends to have high acceptance because the draft model handles predictable patterns well, while creative generation with high entropy outputs has lower acceptance. Monitoring acceptance rates by request category enables adaptive speculative decoding that activates only for request types where it provides net benefit.
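A minimal sketch of that adaptive behaviour, assuming a hypothetical serving loop that reports per-step acceptance counts: track an exponential moving average of the acceptance rate and disable speculation when it falls below a threshold (both the threshold and decay below are illustrative, not tuned values):

```python
class SpeculationGate:
    """Tracks an EMA of the acceptance rate and gates speculative decoding."""

    def __init__(self, threshold: float = 0.6, decay: float = 0.95):
        self.threshold = threshold
        self.decay = decay
        self.ema = 1.0  # start optimistic so speculation begins enabled

    def update(self, accepted: int, proposed: int) -> bool:
        """Record one verification step; return True if speculation stays on."""
        rate = accepted / proposed
        self.ema = self.decay * self.ema + (1 - self.decay) * rate
        return self.ema >= self.threshold

gate = SpeculationGate()
on = True
# A sustained run of 25% acceptance eventually switches speculation off
for _ in range(100):
    on = gate.update(accepted=1, proposed=4)
print(on)  # False
```

Tracking the EMA per request category (code vs. creative writing, say) rather than globally implements the per-category adaptivity described above.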
Speculative decoding latency gains are most pronounced for long-generation tasks. For short responses of 50 tokens or fewer, the overhead of initializing the draft model and verification loop reduces net speedup substantially. The break-even generation length — below which speculative decoding adds latency rather than reducing it — depends on draft model size and target model size, typically falling in the 20–40 token range for well-matched model pairs.
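The break-even length can be estimated with a simple cost model. All numbers below are illustrative assumptions (step times, overhead), not measurements of any particular model pair:

```python
def breakeven_length(t_target: float, t_draft: float, gamma: int,
                     tokens_per_cycle: float, overhead: float):
    """Smallest generation length at which speculative decoding wins.

    t_target: seconds per target forward pass
    t_draft: seconds per draft forward pass
    gamma: draft tokens proposed per cycle
    tokens_per_cycle: expected tokens emitted per cycle (accepted + bonus)
    overhead: fixed setup cost of the speculative path, in seconds
    """
    per_token_baseline = t_target
    per_token_spec = (gamma * t_draft + t_target) / tokens_per_cycle
    if per_token_spec >= per_token_baseline:
        return None  # speculation never wins at this acceptance rate
    n = 1
    while n * per_token_spec + overhead > n * per_token_baseline:
        n += 1
    return n

# Assumed: 20 ms target step, 2 ms draft step, gamma=4,
# ~3.4 tokens per cycle, 300 ms of fixed overhead
print(breakeven_length(0.020, 0.002, 4, 3.4, 0.300))  # 26
```

With these assumed costs the break-even point lands in the 20–40 token range quoted above; a slower draft model or lower acceptance rate pushes it higher.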