How transformers avoid recomputing attention on every token — and how to tune it for throughput and memory.
In transformer attention, every token is projected into Q (query), K (key), and V (value) vectors. On each decoding step, the model computes attention over the current token and all previous tokens. Without caching, the K and V projections for every token already processed must be recomputed at every step: O(n) redundant projection work per step, O(n²) over a sequence of length n.
The KV cache stores the K and V projections from all previous tokens. When a new token arrives, the model computes only that token's K and V, then attends over the cached entries from all past steps. This cuts the per-step K/V projection cost from O(n) to O(1), reducing total projection work for the sequence from O(n²) to O(n). (The attention dot products themselves still touch all n cached entries, so attention remains O(n) per step.)
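The mechanic is easy to see in a toy single-head sketch (NumPy, random weights; illustrative only, not an optimized implementation): decoding with a KV cache yields exactly the same attention outputs as re-projecting K and V from scratch at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-query softmax attention over keys K and values V."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

tokens = rng.standard_normal((6, d))     # 6 token embeddings

# Decoding WITH a KV cache: one K/V projection per new token.
K_cache, V_cache = [], []
cached_outs = []
for x in tokens:
    K_cache.append(Wk @ x)               # compute this token's K and V once ...
    V_cache.append(Wv @ x)               # ... and append them to the cache
    cached_outs.append(attend(Wq @ x, np.array(K_cache), np.array(V_cache)))

# Decoding WITHOUT a cache: re-project K/V for the whole prefix every step.
for t, x in enumerate(tokens):
    K = tokens[: t + 1] @ Wk.T           # recomputed from scratch each step
    V = tokens[: t + 1] @ Wv.T
    out = attend(Wq @ x, K, V)
    assert np.allclose(out, cached_outs[t])  # identical result, far less work

print("cached and uncached decoding match")
```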
The cache lives in GPU VRAM and is evicted when the context window fills or memory runs out. It is the primary bottleneck for long-context inference, often consuming more memory than the model weights themselves.
The KV cache grows linearly with sequence length. The formula for cache memory is:
cache_bytes = 2 (K and V) × n_layers × n_kv_heads × head_dim × seq_len × 2 (bytes per fp16 value)
The first 2× accounts for both K and V; the final 2 is the byte size of fp16. For a 7B model with 32 layers, 32 heads, and 128-dim heads, that works out to 0.5 MB per token, so caching a 2K-token sequence costs ~1 GB. At 128K tokens it is 64 GB, several times the ~14 GB of fp16 model weights.
| Model | Seq 2K | Seq 8K | Seq 32K | Seq 128K |
|---|---|---|---|---|
| 7B (32L, 32H) | 1 GB | 4 GB | 16 GB | 64 GB |
| 13B (40L, 40H) | 1.6 GB | 6.3 GB | 25 GB | 100 GB |
| 70B (80L, 64H) | 5 GB | 20 GB | 80 GB | 320 GB |
| 405B (126L, 128H) | 16 GB | 63 GB | 252 GB | ~1 TB |
This is why long-context requests are expensive and why serving systems enforce limits like max_batch_tokens. A single batch of 16 requests × 128K-token context for a 70B model requires ~5 TB of VRAM for cache alone (figures assume full multi-head attention in fp16; GQA models divide them by the Q-to-KV head ratio). Even sharding across multiple GPUs becomes challenging.
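The formula is easy to sanity-check in code. The helper below simply applies it (assuming full MHA in fp16; parameter names are mine):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_param=2):
    """KV-cache size = 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_param

GIB = 1024**3
# 7B-class model: 32 layers, 32 KV heads (full MHA), 128-dim heads
print(kv_cache_bytes(32, 32, 128, 2048) / GIB)     # → 1.0 (GiB at 2K tokens)
print(kv_cache_bytes(32, 32, 128, 131072) / GIB)   # → 64.0 (GiB at 128K tokens)
# The same model with GQA and 8 KV heads shrinks the cache 4x:
print(kv_cache_bytes(32, 8, 128, 131072) / GIB)    # → 16.0 (GiB)
```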
Grouped Query Attention (GQA) reduces cache size by sharing K and V heads across multiple Q heads. Instead of one K/V head per Q head (Full Multi-Head Attention), GQA uses one K/V head per group of Q heads — typically a group size of 8. This shrinks cache by 8×.
Multi-Query Attention (MQA) is the extreme case: a single K/V head shared by all Q heads. Early models like Falcon and PaLM used MQA and saw modest quality drops but dramatic cache savings.
MLA (Multi-Head Latent Attention, DeepSeek) takes compression further by projecting K/V into a low-rank latent space before caching. Only the compressed latent is stored, then decompressed at attention time. This can achieve 2–4× additional cache reduction compared to GQA.
| Variant | K/V heads (for 32 Q heads) | Cache size vs MHA | Quality vs MHA |
|---|---|---|---|
| Full MHA | 32 | 100% | Baseline |
| GQA (g=8) | 4 | 12.5% | ≈ same |
| MQA | 1 | 3.1% | Slight drop (≈1%) |
| MLA (DeepSeek) | 1 (compressed) | 5–8% | No drop reported |
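The cache-size column follows directly from the KV-head count, as a quick check shows (MLA is excluded because its ratio depends on the latent dimension, not a head count):

```python
MHA_HEADS = 32  # full MHA keeps one KV head per Q head

def cache_fraction(n_kv_heads):
    """KV-cache size relative to full MHA with 32 KV heads."""
    return n_kv_heads / MHA_HEADS

print(f"GQA (g=8): {cache_fraction(4):.1%}")   # 32/8 = 4 KV heads → 12.5%
print(f"MQA:       {cache_fraction(1):.1%}")   # 1 KV head → 3.1%
```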
When context exceeds the window or cache fills up, something must be evicted. The strategy matters: drop the wrong tokens and the model loses crucial context (like the system prompt). Drop the right tokens and quality barely changes.
FIFO: when the cache fills, discard the oldest tokens. Simplest to implement (just a circular buffer), but document history is lost and system prompts evaporate quickly.
H2O (Heavy-Hitter Oracle): track which tokens have accumulated the highest cumulative attention scores across layers; keep those, discard the rest. Retains "important" tokens automatically.
StreamingLLM (attention sinks): always keep the first N tokens (system prompt, preamble) plus a sliding window of recent tokens, preserving both critical context and recent information.
PagedAttention: treat the cache as virtual memory, organizing it into logical blocks that map to physical GPU/CPU memory. Supports swapping, sharing between requests, and zero-copy transfers.
| Strategy | Impl. Complexity | Memory Efficiency | Quality Preservation | Production Use |
|---|---|---|---|---|
| FIFO | Very low | Good | Poor (loses context) | Legacy systems |
| H2O | Medium | Very good | Good | Research |
| StreamingLLM | Low | Good | Excellent | Emerging |
| PagedAttention | High | Excellent | Excellent | vLLM, state of the art |
PagedAttention (vLLM) is now the industry standard. Its virtual memory approach eliminates fragmentation and enables sharing of cache blocks across concurrent requests — crucial for batching efficiency.
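The attention-sink policy is compact enough to sketch in a few lines (a hypothetical helper, not the StreamingLLM implementation): keep the first `n_sink` positions plus a sliding window of recent ones, evicting everything in between.

```python
def visible_positions(seq_len, n_sink=4, window=8):
    """Positions whose K/V entries are kept: first n_sink plus the last `window`."""
    sinks = list(range(min(n_sink, seq_len)))
    recent = list(range(max(n_sink, seq_len - window), seq_len))
    return sinks + recent

# Short sequences fit entirely; long ones keep sinks plus a recent window.
print(visible_positions(10))   # → [0, 1, ..., 9] (nothing evicted yet)
print(visible_positions(20))   # → [0, 1, 2, 3, 12, 13, ..., 19]
```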
When many requests share an identical prefix (system prompt, few-shot examples, shared document), compute K/V once and reuse it across all requests. This amortizes the prefill cost and can reduce total tokens processed by 50–90% depending on prefix length.
SGLang, vLLM, and TensorRT-LLM all implement prefix cache (SGLang calls it "radix cache"). At scale, the savings are dramatic: a 100-token system prompt at 1000 requests/sec = 100K free tokens/sec that never need recomputation.
The cache is keyed by hashes of the token-ID sequence, typically computed per block, with each block's key covering its own tokens plus the entire prefix before it. When a new request arrives, the engine hashes its prefix blocks; on a match with a prior request, the cached K/V blocks are reused directly. Cached blocks usually live in local GPU memory, though some systems spill to CPU RAM or an external store at larger scale.
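A sketch of block-level prefix lookup (hypothetical helper names and block size; real engines hash token-ID blocks in a similar spirit):

```python
import hashlib

BLOCK = 16  # tokens per cache block (illustrative)

def block_keys(token_ids):
    """One key per full block; each key covers the block plus its whole prefix,
    so a mismatch at any block invalidates everything after it."""
    keys = []
    for end in range(BLOCK, len(token_ids) + 1, BLOCK):
        prefix = ",".join(map(str, token_ids[:end]))
        keys.append(hashlib.sha256(prefix.encode()).hexdigest())
    return keys

store = {}  # key -> cached K/V block (placeholder values here)

def prefill(token_ids):
    """Return how many blocks were served from the prefix cache."""
    hits = 0
    for key in block_keys(token_ids):
        if key in store:
            hits += 1          # K/V already computed by an earlier request
        else:
            store[key] = "kv"  # compute K/V for this block and cache it
    return hits

system_prompt = list(range(48))             # 3 blocks of shared prefix
print(prefill(system_prompt + [100, 101]))  # → 0 (cold cache)
print(prefill(system_prompt + [200, 201]))  # → 3 (shared prefix blocks reused)
```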
Speculative decoding uses a smaller, faster draft model to propose k tokens. The target model then verifies those k tokens in a single forward pass and contributes one token of its own, so each verification step can yield up to k+1 tokens. Draft tokens consistent with the target's distribution are accepted; the first mismatch and everything after it is discarded and resampled.
The KV cache interacts subtly: during the verification pass, the target model processes k+1 tokens in parallel, building cache for all of them. Accepted tokens' cache gets written; rejected tokens' cache is thrown away. This means you compute forward passes on tokens you ultimately discard, but the verification batch processes multiple tokens at once (higher GPU utilization).
| Metric | Regular Decoding | Speculative (k=4) | Speedup |
|---|---|---|---|
| Forward passes (draft + target) per output token | 1 | ~1.1–1.25 | 0.8–0.9× (extra compute) |
| Latency (wall-clock) | 1.0× | 0.7–0.85× | 1.2–1.4× |
| Throughput (tokens/sec) | 1.0× | 1.0–1.1× | 1.0–1.1× |
| KV cache overhead | Baseline | +5–10% (temp buffers) | Acceptable |
Speculative decoding trades a small increase in compute (wasted on rejected tokens) for much lower latency. It's effective when the draft model is fast (e.g., a 1B model as draft, 70B as target) and when the acceptance rate is high (>70%, which happens with well-calibrated models). Streaming applications benefit significantly.
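The economics can be sketched with a toy expected-value calculation (assuming an i.i.d. per-token acceptance rate alpha, a simplification of the real, correlated acceptance process):

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected output tokens per target-model verification pass.

    The draft proposes k tokens; the target accepts a prefix of them and
    always contributes one token itself (the bonus or resampled token).
    E[tokens] = sum_{i=0..k} alpha^i, since P(at least i drafts accepted) = alpha^i.
    """
    return sum(alpha**i for i in range(k + 1))

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_cycle(alpha, 4):.2f} tokens/target pass")
```

With a high acceptance rate (alpha = 0.9, k = 4) each expensive target pass yields about four tokens, which is where the latency win comes from; at alpha = 0.5 the yield drops below two and the draft overhead starts to dominate.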
```python
from vllm import LLM, SamplingParams
import time

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    enable_prefix_caching=True,  # prefix cache: reuse K/V for shared prefixes
    max_model_len=4096,
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Simulate a workload with a shared system prompt (benefits from prefix caching)
SYSTEM_PROMPT = "You are a helpful assistant. " * 50  # ~300 tokens of shared prefix
queries = [
    SYSTEM_PROMPT + f"Question {i}: What is {i} + {i}?"
    for i in range(20)
]

# First batch: cold cache
t0 = time.perf_counter()
outputs_cold = llm.generate(queries, params)
cold_time = time.perf_counter() - t0

# Second batch: warm cache (same system-prompt prefix)
t1 = time.perf_counter()
outputs_warm = llm.generate(queries, params)
warm_time = time.perf_counter() - t1

print(f"Cold cache: {cold_time:.2f}s")
print(f"Warm cache: {warm_time:.2f}s")
print(f"Speedup from prefix caching: {cold_time / warm_time:.1f}×")
# Typical: 1.5–3× speedup on workloads with long shared prefixes
```
In production serving systems, KV cache behavior is controlled by a handful of key parameters. Understanding these lets you trade off throughput, latency, and quality for your workload.
--max-model-len: The maximum sequence length. Larger values allow longer contexts but consume more cache. Set based on your longest expected request + prefill buffer.
--gpu-memory-utilization: Fraction of GPU VRAM the engine may use (typically 0.8–0.95); whatever remains after weights and activations becomes cache space. Higher = more concurrent requests, but risks OOM on bursts.
--enable-prefix-caching: Enable automatic caching and reuse of identical prefixes. Essential for workloads with shared system prompts.
--num-gpu-blocks-override: Manually specify cache block count. Useful for tuning to your VRAM constraints.
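These knobs interact. A rough budget calculator (illustrative; real engines profile activation memory rather than taking it as a constant, and the per-token cost here assumes a 7B full-MHA model in fp16) shows how the utilization fraction translates into cache blocks:

```python
def cache_blocks(vram_bytes, util, weight_bytes, activation_bytes,
                 block_tokens=16, bytes_per_token=524_288):
    """KV-cache blocks that fit after weights and activations are reserved.

    bytes_per_token defaults to 0.5 MiB: a 7B MHA model in fp16.
    """
    budget = vram_bytes * util - weight_bytes - activation_bytes
    return max(0, int(budget // (block_tokens * bytes_per_token)))

GIB = 1024**3
# 80 GiB GPU, 90% utilization, ~14 GiB fp16 7B weights, ~4 GiB activations
blocks = cache_blocks(80 * GIB, 0.90, 14 * GIB, 4 * GIB)
print(blocks, "blocks ≈", blocks * 16, "tokens of cache capacity")
```

Raising --gpu-memory-utilization grows the block pool linearly; --num-gpu-blocks-override simply pins the result of this kind of calculation by hand.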
KV cache is an inference optimisation. Understand it in this order:
In transformer decoding, every new token attends to all previous tokens. Without caching, that's O(n²) work per token. See Transformers for the attention mechanics.
KV cache trades VRAM for speed. At 70B in FP16, the cache for a 32K context runs to tens of GB. This is why quantised KV cache (FP8 or INT4) matters for long contexts.
vLLM implements PagedAttention (non-contiguous KV blocks) and prefix caching out of the box. For most teams, "use vLLM" is the entire KV cache strategy.