Use a small draft model to propose multiple tokens in parallel, then verify them with the large target model in a single forward pass. Achieves 2-3× latency speedup with identical output distribution.
Autoregressive LLM generation is sequential: tokens are generated one at a time, and each step requires a full forward pass through the model. For a 70B-parameter model, each token costs roughly 140 GFLOPs (about 2 FLOPs per parameter). The throughput bottleneck, however, is memory bandwidth rather than compute: generating a single token requires streaming all 70B parameters (~140 GB at FP16) from HBM to the compute units. On A100-class hardware with ~2 TB/s of HBM bandwidth, this caps single-stream decoding at roughly 15 tokens/second.
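The bandwidth bound can be sanity-checked with back-of-the-envelope arithmetic; the figures below (FP16 weights, ~2 TB/s HBM) are illustrative assumptions, not measurements:

```python
def decode_rate_tokens_per_s(n_params: float, bytes_per_param: float,
                             hbm_bw_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode rate when decoding is
    memory-bandwidth-bound: each token streams all weights from HBM once."""
    weight_bytes = n_params * bytes_per_param
    return hbm_bw_bytes_per_s / weight_bytes

# 70B parameters at FP16 (2 bytes/param), ~2e12 B/s HBM bandwidth (assumed)
rate = decode_rate_tokens_per_s(70e9, 2, 2e12)
print(round(rate, 1))  # ≈ 14.3 tokens/s
```

Quantizing weights (fewer bytes per parameter) raises this ceiling proportionally, which is why INT8/INT4 inference decodes faster on the same hardware.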
Speculative decoding exploits the observation that a small model (e.g. 7B) can correctly predict the next 3–5 tokens most of the time. If we can verify all those predictions in a single large-model forward pass, we get multiple tokens per expensive call.
At each step of speculative decoding:

1. The draft model autoregressively proposes γ candidate tokens.
2. The target model scores all γ candidates (plus one extra position) in a single forward pass.
3. Each draft token x̃ is accepted with probability min(1, p(x̃)/q(x̃)), where q is the draft distribution and p the target distribution; the first rejection discards that token and everything after it.
4. One additional token is sampled from the target model (from the residual distribution if a rejection occurred), so every cycle emits at least one token.
The acceptance criterion min(1, p/q) is the key insight: it guarantees that the output distribution of speculative decoding is identical to that of the target model, with no quality degradation. The math: accept x̃ with probability p(x̃)/q(x̃) when p(x̃) < q(x̃), and always accept when p(x̃) ≥ q(x̃); combined with resampling from the residual distribution on rejection (below), the resulting marginal distribution equals p.
When a token is rejected at position i, we sample from the residual distribution max(0, p - q) / Z rather than simply resampling from p. This ensures no token is wasted — the rejection produces a valid target sample that's different from the draft token.
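The combined accept/residual rule can be checked empirically on a toy vocabulary. The sketch below draws many samples through the scheme and verifies they follow the target distribution p, not the draft distribution q:

```python
import random

def spec_sample(p, q, rng):
    """One speculative-sampling step over a toy vocabulary: the draft
    proposes from q, we accept with prob min(1, p/q), otherwise we sample
    from the renormalised residual max(0, p - q)."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    return rng.choices(vocab, weights=[r / z for r in residual])[0]

# Empirical check: output frequencies should match p even though q differs
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(100_000):
    counts[spec_sample(p, q, rng)] += 1
freqs = [c / 100_000 for c in counts]
print([round(f, 2) for f in freqs])
```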
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

@torch.no_grad()
def speculative_decode(draft_model, target_model, tokenizer, prompt,
                       gamma=4, max_new=100):
    # No KV cache: written for clarity, not speed.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(draft_model.device)
    generated = input_ids.clone()
    while generated.shape[1] - input_ids.shape[1] < max_new:
        n_ctx = generated.shape[1]
        # Step 1: draft gamma tokens, keeping each full draft distribution
        draft_ids = generated.clone()
        draft_dists = []
        for _ in range(gamma):
            logits = draft_model(draft_ids).logits[:, -1, :]
            probs = torch.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, 1)
            draft_dists.append(probs[0])
            draft_ids = torch.cat([draft_ids, next_tok], dim=-1)
        # Step 2: verify with target model (single forward pass)
        target_logits = target_model(draft_ids).logits[0, n_ctx - 1:, :]
        target_probs = torch.softmax(target_logits, dim=-1)
        # Step 3: accept each draft token with probability min(1, p/q)
        accepted = 0
        for i in range(gamma):
            draft_tok = draft_ids[0, n_ctx + i].item()
            p = target_probs[i, draft_tok].item()
            q = draft_dists[i][draft_tok].item()
            if torch.rand(1).item() < min(1.0, p / q):
                accepted += 1
            else:
                break
        new_tokens = draft_ids[:, n_ctx:n_ctx + accepted]
        generated = torch.cat([generated, new_tokens], dim=-1)
        # Step 4: one extra target-sampled token. On rejection, sample from
        # the renormalised residual max(0, p - q); if all gamma tokens were
        # accepted, sample from the target distribution directly.
        if accepted < gamma:
            residual = torch.clamp(target_probs[accepted] - draft_dists[accepted], min=0)
            next_dist = residual / residual.sum()
        else:
            next_dist = target_probs[gamma]
        bonus_tok = torch.multinomial(next_dist, 1).view(1, 1)
        generated = torch.cat([generated, bonus_tok], dim=-1)
    generated = generated[:, : input_ids.shape[1] + max_new]
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```
The draft model is the most important hyperparameter. Requirements: (1) the same vocabulary and tokenizer as the target model, (2) fast enough that γ draft steps cost less than one target step, (3) a high acceptance rate (≥ 0.7). A good pair is typically a small model from the same family as the target, e.g. a 7B draft for a 70B target.
For a 70B/7B pair, if the per-position acceptance rate is α = 0.8 and γ = 4, the expected tokens per target call (under an i.i.d.-acceptance approximation) is (1 − α^(γ+1))/(1 − α) ≈ 3.4, giving ~3× fewer target calls and ~2–3× wall-clock speedup (draft overhead reduces the theoretical gain).
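The expectation above follows from a geometric series: a cycle emits k accepted tokens plus the bonus token with probability α^k(1−α), so the expected emitted count per target call can be computed directly:

```python
def expected_tokens_per_target_call(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass (accepted draft
    tokens plus the bonus token), assuming each draft token is accepted
    independently with probability alpha."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

print(round(expected_tokens_per_target_call(0.8, 4), 2))  # 3.36
```

Note the diminishing returns in γ: because a single rejection discards everything after it, raising γ beyond 4–8 adds little once α^γ is small.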
Speculative decoding primarily reduces latency (time for a single request), not throughput (total tokens/second across many requests). When the GPU is already fully utilised with batched requests, speculative decoding may not help, since the verification step has the same batch dimension as the target-only baseline. It shines in low-batch, latency-sensitive settings: interactive chat, streaming completions, and single-request pipelines where the GPU would otherwise sit idle between tokens.
Speculative decoding accelerates autoregressive LLM inference by using a small, fast draft model to propose multiple candidate tokens, which a larger target model then verifies in a single parallel forward pass. Tokens that the target model agrees with are accepted; the first rejection discards the remaining draft tokens and triggers a corrected sample from the target distribution at that position. When the draft model's predictions align well with the target, multiple tokens are accepted per target model step, increasing effective throughput.
| Configuration | Acceptance Rate | Speedup | Memory Overhead |
|---|---|---|---|
| Same-family (7B draft / 70B target) | 70–85% | 2–3× | +7B params |
| Cross-family (mismatched vocab) | N/A | N/A (incompatible) | N/A |
| Self-speculative (layer skipping) | 65–75% | 1.5–2× | None |
| Medusa (multiple heads) | 60–80% | 2–3× | Small MLP heads |
The acceptance rate — the fraction of draft tokens accepted by the target model — is the key metric determining speculative decoding speedup. Acceptance rates above 70% typically produce 2× or better throughput improvements; rates below 50% may produce little benefit after accounting for the overhead of running the draft model. Acceptance rates vary significantly by task type: continuation of predictable text (filling in boilerplate, completing common phrases) achieves high acceptance rates, while creative generation and complex reasoning produce lower rates as the draft model's predictions diverge from the target's choices.
Speculative decoding preserves exact output equivalence with standard autoregressive decoding: every accepted token sequence has the same probability under the target model's distribution as if it had been generated without speculative decoding. This mathematical equivalence means speculative decoding is a pure latency optimization with no quality trade-off, unlike quantization or pruning, which exchange some quality for efficiency. Note the guarantee is distributional rather than deterministic: sampling still involves randomness, so reproducibility additionally requires fixed seeds. The distributional guarantee is what matters for applications that must certify their outputs are statistically indistinguishable from the target model's.
```python
# Speculative decoding with HuggingFace
from transformers import pipeline

# Load draft and target models; the draft model must share its
# vocabulary with the target model.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-70B-Instruct",
    assistant_model="meta-llama/Llama-3.2-3B-Instruct",  # draft
    device_map="auto",
)
```
Draft model selection is critical for achieving high speculative decoding speedup. The draft model must use the same vocabulary and tokenizer as the target model to ensure token-level compatibility — a draft model tokenizing "Python" as one token must produce the same token ID as the target model for acceptance verification to work correctly. For Llama-family models, using a smaller Llama variant (3B draft for 70B target) provides good vocabulary compatibility. Cross-family speculative decoding (using a Mistral model as draft for a Llama target) is not directly supported without additional alignment techniques.
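A quick compatibility check is to compare the two tokenizers' vocabularies directly. The sketch below operates on plain vocab dicts (in practice you would pass `tokenizer.get_vocab()` from each model); the toy vocabularies are illustrative:

```python
def vocabs_compatible(draft_vocab: dict, target_vocab: dict) -> bool:
    """True if every draft token maps to the same ID in the target vocab,
    which is what token-level acceptance verification requires."""
    return all(target_vocab.get(tok) == idx for tok, idx in draft_vocab.items())

# Toy example: same tokenization vs. a mismatched segmentation
llama_like = {"<s>": 0, "Py": 1, "thon": 2}
print(vocabs_compatible(llama_like, llama_like))                      # True
print(vocabs_compatible(llama_like, {"<s>": 0, "Pyth": 1, "on": 2}))  # False
```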
Batch speculative decoding in high-throughput serving scenarios introduces complexity because different requests in the same batch may have different acceptance rates and different numbers of accepted tokens per step. The batch cannot advance uniformly by the same number of tokens per step, requiring careful state management to track where each request is in its generation process. Production serving frameworks like SGLang and vLLM implement batched speculative decoding with these complexities handled internally, but the performance benefit per request decreases as batch sizes grow due to the heterogeneous acceptance pattern across requests.
Speculative decoding acceptance rate monitoring provides a leading indicator of serving efficiency. When acceptance rates drop below approximately 60%, the overhead of running the draft model exceeds its speedup benefit, and disabling speculative decoding improves overall throughput. Acceptance rates vary significantly with prompt type: structured output generation (JSON, code) tends to have high acceptance because the draft model handles predictable patterns well, while creative generation with high entropy outputs has lower acceptance. Monitoring acceptance rates by request category enables adaptive speculative decoding that activates only for request types where it provides net benefit.
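A minimal sketch of that adaptive behaviour, assuming a hypothetical serving loop that reports per-step acceptance counts: track an exponential moving average of the acceptance rate and disable speculation when it falls below a threshold (both the threshold and decay below are illustrative, not tuned values):

```python
class SpeculationGate:
    """Tracks an EMA of the acceptance rate and gates speculative decoding."""

    def __init__(self, threshold: float = 0.6, decay: float = 0.95):
        self.threshold = threshold
        self.decay = decay
        self.ema = 1.0  # start optimistic so speculation begins enabled

    def update(self, accepted: int, proposed: int) -> bool:
        """Record one verification step; return True if speculation stays on."""
        rate = accepted / proposed
        self.ema = self.decay * self.ema + (1 - self.decay) * rate
        return self.ema >= self.threshold

gate = SpeculationGate()
on = True
# A sustained run of 25% acceptance eventually switches speculation off
for _ in range(100):
    on = gate.update(accepted=1, proposed=4)
print(on)  # False
```

Tracking the EMA per request category (code vs. creative writing, say) rather than globally implements the per-category adaptivity described above.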
Speculative decoding latency gains are most pronounced for long-generation tasks. For short responses of 50 tokens or fewer, the overhead of initializing the draft model and verification loop reduces net speedup substantially. The break-even generation length — below which speculative decoding adds latency rather than reducing it — depends on draft model size and target model size, typically falling in the 20–40 token range for well-matched model pairs.
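The break-even length can be estimated with a simple cost model. All numbers below are illustrative assumptions (step times, overhead), not measurements of any particular model pair:

```python
def breakeven_length(t_target: float, t_draft: float, gamma: int,
                     tokens_per_cycle: float, overhead: float):
    """Smallest generation length at which speculative decoding wins.

    t_target: seconds per target forward pass
    t_draft: seconds per draft forward pass
    gamma: draft tokens proposed per cycle
    tokens_per_cycle: expected tokens emitted per cycle (accepted + bonus)
    overhead: fixed setup cost of the speculative path, in seconds
    """
    per_token_baseline = t_target
    per_token_spec = (gamma * t_draft + t_target) / tokens_per_cycle
    if per_token_spec >= per_token_baseline:
        return None  # speculation never wins at this acceptance rate
    n = 1
    while n * per_token_spec + overhead > n * per_token_baseline:
        n += 1
    return n

# Assumed: 20 ms target step, 2 ms draft step, gamma=4,
# ~3.4 tokens per cycle, 300 ms of fixed overhead
print(breakeven_length(0.020, 0.002, 4, 3.4, 0.300))  # 26
```

With these assumed costs the break-even point lands in the 20–40 token range quoted above; a slower draft model or lower acceptance rate pushes it higher.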