Inference Optimization

KV Cache Mechanics

How transformers avoid recomputing attention on every token — and how to tune it for throughput and memory.

O(n²) → O(n) — decoding cost with caching
O(1) — K/V work per token after prefill
Memory vs speed — the core tension
Contents
  1. What is the KV cache?
  2. Cache growth and memory
  3. GQA and MLA variants
  4. Cache eviction strategies
  5. Prefix caching
  6. Speculative decoding
  7. Production knobs
01 — Definition

What Is the KV Cache?

In transformer attention, every token computes Q (query), K (key), and V (value) matrices. On each decoding step, the model must compute attention over all previous tokens and the current token. Without caching, this means recomputing K and V for every token that has already been processed — O(n²) computation for a sequence of length n.

The KV cache stores the computed K and V matrices for all previous tokens. When a new token arrives, the model computes K and V only for that token, then attends over the cached K/V from all past steps. Per-token K/V work drops from O(n) to O(1), so total decoding cost falls from O(n²) to O(n) token computations (the attention scores themselves still scale with context length).

The cache lives in GPU VRAM and is evicted when the context window fills or memory runs out. It is the primary bottleneck for long-context inference, often consuming more memory than the model weights themselves.

KV Cache in Action

# Prefill phase (all input tokens at once):
keys, values = [], []
for token in input_tokens:
    k, v = compute_k_v(token)
    keys.append(k)
    values.append(v)
output = attention(queries, keys, values)        # all at once, high throughput

# Decoding phase (one token at a time):
for new_token in generated_tokens:
    k_new, v_new = compute_k_v(new_token)        # only for the new token
    keys.append(k_new)                           # cache grows
    values.append(v_new)
    output = attention(query_new, keys, values)  # reuse all cached KV
The speedup is large: without the KV cache, decoding a 100-token sequence means reprocessing the full prefix at every step — 1 + 2 + 3 + ... + 100 ≈ 5,050 token computations. With the cache it's 100 (one per token), roughly a 50× saving, and the gap grows linearly with sequence length.
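The arithmetic is easy to sanity-check by counting token positions processed (a back-of-envelope sketch, not a benchmark):

```python
def tokens_processed(n, cached):
    """Count token positions run through the model to decode n tokens."""
    if cached:
        return n  # one forward pass per new token
    # Without the cache, step i reprocesses the full prefix of length i.
    return sum(range(1, n + 1))

uncached = tokens_processed(100, cached=False)
cached = tokens_processed(100, cached=True)
print(uncached, cached, uncached / cached)  # 5050 100 50.5
```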
02 — Memory Scaling

Cache Growth and Memory Budget

The KV cache grows linearly with sequence length. The formula for cache memory is:

cache_bytes = 2 (K and V) × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_param

The leading 2 accounts for both K and V; bytes_per_param is 2 for fp16. For a 7B model with 32 layers, 32 KV heads, and 128-dim heads, caching a 2K-token sequence costs ~1 GB. At 128K tokens it's ~64 GB, several times the model's ~14 GB of fp16 weights.
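The formula translates directly into a helper (a sketch; real engines add per-block padding and allocator overhead on top):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_param=2):
    """KV cache size in bytes. 2x for K and V; bytes_per_param: 2 = fp16, 1 = fp8."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_param

# 7B-class model: 32 layers, 32 KV heads, 128-dim heads, 2K tokens, fp16
size = kv_cache_bytes(32, 32, 128, 2048)
print(f"{size / 2**30:.2f} GiB")  # 1.00 GiB
```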

| Model | Seq 2K | Seq 8K | Seq 32K | Seq 128K |
|---|---|---|---|---|
| 7B (32L, 32H) | 1 GB | 4 GB | 16 GB | 64 GB |
| 13B (40L, 40H) | 1.6 GB | 6.4 GB | 26 GB | 102 GB |
| 70B (80L, 64H) | 10 GB | 40 GB | 160 GB | 640 GB |
| 405B (126L, 128H) | 32 GB | 126 GB | 504 GB | ~2 TB (OOM) |

(Full MHA, 128-dim heads, fp16, per the formula above; GQA divides these numbers by the Q-to-KV-head ratio — see section 03.)

This is why long-context requests are expensive and why serving systems enforce limits like max_batch_tokens. A single batch of 16 requests at 128K-token context on a 70B model needs hundreds of gigabytes of VRAM for cache even with GQA, and terabytes with full MHA. Even sharding across multiple GPUs becomes challenging.

⚠️ Budget VRAM for cache first, model second. A 70B model in fp16 is ~140 GB of weights. If your GPU cluster has 320 GB total, that leaves ~180 GB for cache across the batch: plenty for 2K-token requests, but only a handful at 128K even with GQA.
03 — Attention Compression

Grouped Query Attention and MLA

Grouped Query Attention (GQA) reduces cache size by sharing K and V heads across multiple Q heads. Instead of one K/V head per Q head (Full Multi-Head Attention), GQA uses one K/V head per group of Q heads — typically a group size of 8. This shrinks cache by 8×.

Multi-Query Attention (MQA) is the extreme case: a single K/V head shared by all Q heads. Early models like Falcon and PaLM used MQA and saw modest quality drops but dramatic cache savings.

MLA (Multi-Head Latent Attention, DeepSeek) takes compression further by projecting K/V into a low-rank latent space before caching. Only the compressed latent is stored, then decompressed at attention time. This can achieve 2–4× additional cache reduction compared to GQA.

| Variant | K/V heads (for 32 Q heads) | Cache size vs MHA | Quality vs MHA |
|---|---|---|---|
| Full MHA | 32 | 100% | Baseline |
| GQA (g=8) | 4 | 12.5% | ≈ same |
| MQA | 1 | 3.1% | Slight drop (≈1%) |
| MLA (DeepSeek) | 1 (compressed latent) | 5–8% | No measurable drop |
Most modern frontier models (Llama 3, Mistral, Gemma) default to GQA. It's the sweet spot: 4–8× cache savings, depending on group size, with no measurable quality loss. MQA is rarely used now; MLA is promising but newer.
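Plugging the variants into the cache formula from section 02 makes the savings concrete (a sketch with illustrative 7B-class dimensions):

```python
def kv_cache_mb_per_token(layers, kv_heads, head_dim=128, bytes_per_param=2):
    """Per-token KV cache cost in MB: 2x for K and V, times bytes per element."""
    return 2 * layers * kv_heads * head_dim * bytes_per_param / 2**20

layers = 32  # 7B-class model with 32 query heads
for name, kv_heads in [("MHA", 32), ("GQA g=8", 4), ("MQA", 1)]:
    mb = kv_cache_mb_per_token(layers, kv_heads)
    print(f"{name:8s} {kv_heads:2d} KV heads  {mb:.6f} MB/token")
# MHA 0.5 MB/token; GQA g=8 0.0625 (8x smaller); MQA 0.015625 (32x smaller)
```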
04 — Memory Management

Cache Eviction Strategies

When context exceeds the window or cache fills up, something must be evicted. The strategy matters: drop the wrong tokens and the model loses crucial context (like the system prompt). Drop the right tokens and quality barely changes.

1. FIFO / Sliding Window — oldest out

When cache fills, discard the oldest tokens. Simplest to implement: just a circular buffer. But you lose document history and system prompts evaporate quickly.

  • Pros: Trivial to implement, minimal overhead
  • Cons: Loses system prompt, hurts long-context tasks
2. Heavy Hitter Oracle (H2O) — attention-aware

Track which tokens accumulated the highest cumulative attention scores across all layers. Keep those tokens, discard the rest. Retains "important" tokens automatically.

  • Pros: Preserves semantically important tokens
  • Cons: Tracking attention scores adds overhead, may miss subtle dependencies
3. StreamingLLM — anchored window

Always keep the first N tokens (system prompt, preamble) plus a sliding window of recent tokens. Combines both preservation of critical context and recent information.

  • Pros: Preserves system prompt, recent context intact
  • Cons: Requires tuning window size and anchor size
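The anchored-window policy is a few lines of bookkeeping. A minimal sketch (class and method names are illustrative, not any library's API):

```python
from collections import deque

class AnchoredKVCache:
    """Keep the first n_anchor entries forever plus a sliding recent window."""
    def __init__(self, n_anchor, n_recent):
        self.n_anchor = n_anchor
        self.anchors = []                     # first tokens, never evicted
        self.recent = deque(maxlen=n_recent)  # recent tokens, FIFO eviction

    def append(self, kv):
        if len(self.anchors) < self.n_anchor:
            self.anchors.append(kv)
        else:
            self.recent.append(kv)            # deque drops the oldest when full

    def contents(self):
        return self.anchors + list(self.recent)

cache = AnchoredKVCache(n_anchor=4, n_recent=8)
for t in range(100):
    cache.append(t)
print(cache.contents())  # tokens 0-3 (anchors) plus the last 8 tokens
```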
4. PagedAttention (vLLM) — virtual memory

Treat cache as virtual memory, organizing it into logical blocks that map to physical GPU/CPU memory. Supports swapping, sharing between requests, and zero-copy transfers.

  • Pros: No fragmentation, enables prefix sharing, efficient batch packing
  • Cons: Implementation complexity, requires special software support

Eviction Strategy Comparison

| Strategy | Impl. complexity | Memory efficiency | Quality preservation | Production use |
|---|---|---|---|---|
| FIFO | Very low | Good | Poor (loses context) | Legacy systems |
| H2O | Medium | Very good | Good | Research |
| StreamingLLM | Low | Good | Excellent | Emerging |
| PagedAttention | High | Excellent | Excellent | vLLM, state of the art |

PagedAttention (vLLM) is now the industry standard. Its virtual memory approach eliminates fragmentation and enables sharing of cache blocks across concurrent requests — crucial for batching efficiency.
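The core idea can be sketched as a toy block table backed by a shared free list (illustrative only; vLLM's real allocator also handles reference counting and copy-on-write for shared prefixes):

```python
BLOCK_SIZE = 16  # tokens per cache block

class BlockTable:
    """Toy allocator: per-request logical blocks map into a shared physical pool."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # free physical block IDs
        self.tables = {}                              # request_id -> [physical IDs]

    def append_token(self, req_id, pos):
        table = self.tables.setdefault(req_id, [])
        if pos % BLOCK_SIZE == 0:
            table.append(self.free.pop())  # any free block works: no fragmentation
        return table[-1]                   # physical block holding this token

    def release(self, req_id):
        """Return a finished request's blocks to the pool for reuse."""
        self.free.extend(self.tables.pop(req_id, []))

pool = BlockTable(num_physical_blocks=64)
for pos in range(40):                      # 40 tokens -> ceil(40/16) = 3 blocks
    pool.append_token("req-a", pos)
print(len(pool.tables["req-a"]))           # 3
pool.release("req-a")
print(len(pool.free))                      # 64
```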

05 — Amortization

Prefix Caching and Prompt Reuse

When many requests share an identical prefix (system prompt, few-shot examples, shared document), compute K/V once and reuse it across all requests. This amortizes the prefill cost and can reduce total tokens processed by 50–90% depending on prefix length.

SGLang, vLLM, and TensorRT-LLM all implement prefix cache (SGLang calls it "radix cache"). At scale, the savings are dramatic: a 100-token system prompt at 1000 requests/sec = 100K free tokens/sec that never need recomputation.

Prefix Cache Hit Example

Request 1:
  System prompt (100 tokens)     → compute K/V → cache by hash
  Few-shot examples (50 tokens)  → compute K/V → cache by hash
  User query 1 (20 tokens)       → compute K/V
  Total prefill compute: 170 tokens

Request 2 (same system prompt + examples, different query):
  System prompt (100 tokens)     → CACHE HIT, reuse stored K/V
  Few-shot examples (50 tokens)  → CACHE HIT, reuse stored K/V
  User query 2 (15 tokens)       → compute K/V
  Total prefill compute: 15 tokens (~91% saved)
⚠️ Prefix cache hit rate depends on prefix consistency: vary the suffix (user query), keep the prefix identical. A system prompt that changes per request ruins caching. Design your prompts to maximize reusable prefixes.

Prefix Cache Implementation

The cache is keyed by hashes of the token IDs: vLLM, for example, hashes each fixed-size block of tokens together with the hash of its prefix, so identical prefixes map to identical block keys. When a new request arrives, its prefix blocks are hashed; on a match, the stored K/V blocks are reused instead of recomputed. At single-node scale the cached blocks live in GPU memory; larger deployments can spill older blocks to CPU RAM or a remote cache tier.
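A toy version of the chained block-hash lookup (illustrative; real implementations track physical cache blocks and reference counts rather than token slices):

```python
import hashlib

BLOCK = 16  # tokens per cache block

def block_hashes(token_ids):
    """Hash each full block chained with its prefix hash, radix-style."""
    hashes, prev = [], b""
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        chunk = ",".join(map(str, token_ids[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes

store = {}  # block hash -> cached K/V block (stubbed here as the token slice)

def prefill(token_ids):
    """Return the number of full prefix blocks served from cache."""
    hits = 0
    for h, i in zip(block_hashes(token_ids), range(0, len(token_ids), BLOCK)):
        if h in store:
            hits += 1                          # reuse cached K/V, skip compute
        else:
            store[h] = token_ids[i:i + BLOCK]  # "compute" and cache the block
    return hits

shared = list(range(48))                       # 3 full blocks of shared prefix
print(prefill(shared + [901, 902]))            # 0 hits on the cold request
print(prefill(shared + [903, 904]))            # 3 hits: shared prefix reused
```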

06 — Latency Reduction

Speculative Decoding and Cache Interaction

Speculative decoding uses a smaller, faster draft model to propose k tokens. The target model then verifies those k tokens in a single forward pass; if every one is accepted, it also emits one bonus token of its own, for up to k+1 tokens per pass. Drafts that match the target's predictions are accepted; from the first mismatch on, the remaining drafts are discarded and the target's own token is used instead.

The KV cache interacts subtly: during the verification pass, the target model processes k+1 tokens in parallel, building cache for all of them. Accepted tokens' cache gets written; rejected tokens' cache is thrown away. This means you compute forward passes on tokens you ultimately discard, but the verification batch processes multiple tokens at once (higher GPU utilization).
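The accept/reject loop can be sketched with greedy verification and stand-in models (production systems use rejection sampling to preserve the target distribution exactly; all names here are illustrative):

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft proposes k tokens; target verifies and keeps the matching prefix."""
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)

    # In a real system this verification is ONE batched target pass over all k
    # positions; the toy target here is simply called per position.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        if target_model(ctx) == t:      # greedy match -> accept draft token
            accepted.append(t)
            ctx.append(t)
        else:                           # first mismatch: take target's token, stop
            accepted.append(target_model(ctx))
            break
    else:
        accepted.append(target_model(ctx))  # all matched: free bonus token
    return accepted

draft = lambda ctx: len(ctx) % 7            # toy models that often agree
target = lambda ctx: len(ctx) % 7 if len(ctx) % 5 else 0
print(speculative_step([1, 2, 3], draft, target, k=4))  # [3, 4, 0]
```

Note that only the accepted tokens' K/V entries are kept; the cache slots built for rejected drafts during verification are discarded.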

| Metric | Regular decoding | Speculative (k=4) | Effect |
|---|---|---|---|
| Compute per output token (draft + rejected included) | 1.0× | ~1.1–1.25× | 10–25% more compute |
| Wall-clock latency | 1.0× | 0.7–0.85× | 1.2–1.4× faster |
| Throughput (tokens/sec) | 1.0× | 1.0–1.1× | roughly flat |
| KV cache overhead | Baseline | +5–10% (temp buffers) | acceptable |

Speculative decoding trades a small increase in compute (wasted on rejected tokens) for much lower latency. It's effective when the draft model is fast (e.g., a 1B model as draft, 70B as target) and when the acceptance rate is high (>70%, which happens with well-calibrated models). Streaming applications benefit significantly.

Speculative decoding is production-ready in vLLM, SGLang, and LM Studio. Use it for latency-sensitive workloads (chatbots, real-time APIs) with compatible model pairs (both models must share the same tokenizer).
Python · Measure prefix-cache speedup with vLLM
from vllm import LLM, SamplingParams
import time

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",
          enable_prefix_caching=True,   # prefix cache = reuse shared prefixes
          max_model_len=4096)

params = SamplingParams(temperature=0.0, max_tokens=64)

# Simulate workload with shared system prompt (benefits from prefix caching)
SYSTEM_PROMPT = "You are a helpful assistant. " * 50  # ~300 tokens of shared prefix
queries = [
    SYSTEM_PROMPT + f"Question {i}: What is {i} + {i}?"
    for i in range(20)
]

# First batch — cold cache
t0 = time.perf_counter()
outputs_cold = llm.generate(queries, params)
cold_time = time.perf_counter() - t0

# Second batch — warm cache (same system prompt prefix)
t1 = time.perf_counter()
outputs_warm = llm.generate(queries, params)
warm_time = time.perf_counter() - t1

print(f"Cold cache: {cold_time:.2f}s")
print(f"Warm cache: {warm_time:.2f}s")
print(f"Speedup from prefix caching: {cold_time/warm_time:.1f}×")
# Typical: 1.5–3× speedup on workloads with long shared prefixes
07 — Practical Tuning

Production Knobs

In production serving systems, KV cache behavior is controlled by a handful of key parameters. Understanding these lets you trade off throughput, latency, and quality for your workload.

| Tool | Scope | KV-cache knobs |
|---|---|---|
| vLLM | Inference server | --max-model-len, --gpu-memory-utilization, --enable-prefix-caching, --num-gpu-blocks-override |
| SGLang | Inference server | mem_fraction, enable_prefix_cache, max_seq_len, disable_flashinfer |
| TensorRT-LLM | Inference server | --max_batch_size, --builder-opt-level, KV cache quantization |
| llama.cpp | Local | -c (context size), --cache-type-k, --cache-type-v |
| MLC LLM | Inference | kv_cache_quantization, flashinfer backend |
| Ollama | Local | OLLAMA_NUM_GPU (GPU memory allocation) |

Key Tuning Parameters

--max-model-len: The maximum sequence length. Larger values allow longer contexts but consume more cache. Set based on your longest expected request + prefill buffer.

--gpu-memory-utilization: Fraction of GPU VRAM the server may use in total for weights, activations, and cache (typically 0.8–0.95). Higher values leave more room for cache and concurrent requests, but risk OOM on bursts.

--enable-prefix-caching: Enable automatic caching and reuse of identical prefixes. Essential for workloads with shared system prompts.

--num-gpu-blocks-override: Manually specify cache block count. Useful for tuning to your VRAM constraints.

⚠️ Increasing max_model_len without reducing batch size causes OOM. Always budget VRAM for KV cache first. If you want longer contexts, reduce batch size or increase GPU count proportionally. The formula: VRAM_needed = model_weights + (batch_size × seq_len × cache_bytes_per_token).
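That budget formula as a helper, with assumed 70B GQA dimensions (80 layers, 8 KV heads, 128-dim heads, fp16); it ignores activation and allocator overhead, so treat the results as upper bounds:

```python
def max_concurrent_requests(total_vram_gb, weights_gb, seq_len,
                            cache_bytes_per_token):
    """How many seq_len-token requests fit after the weights are loaded?"""
    cache_budget = (total_vram_gb - weights_gb) * 2**30
    per_request = seq_len * cache_bytes_per_token
    return int(cache_budget // per_request)

# Assumed 70B GQA config: 2 (K and V) x 80 layers x 8 KV heads x 128 dim x fp16
per_token = 2 * 80 * 8 * 128 * 2           # 327,680 bytes (~0.31 MB) per token
print(max_concurrent_requests(320, 140, 2048, per_token))    # 288 at 2K context
print(max_concurrent_requests(320, 140, 131072, per_token))  # 4 at 128K context
```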


Learning Path

KV cache is an inference optimization. Understand it in this order:

Transformer Attn (how attention works) → Autoregressive Decode (token by token) → KV Cache (cache K/V, skip recompute) → PagedAttention (vLLM memory mgmt) → Prefix Caching (reuse across requests)
1. Understand autoregressive decoding first

In transformer decoding, every new token attends to all previous tokens. Without caching, that's O(n²) work per token. See Transformers for the attention mechanics.

2. Understand the memory-for-speed tradeoff

KV cache trades VRAM for speed. For a 70B model in fp16, the KV cache for a 32K context runs from tens to hundreds of gigabytes depending on the attention variant. This is why quantized KV cache (fp8 or int4) matters for long contexts.

3. Use vLLM — it handles all of this

vLLM implements PagedAttention (non-contiguous KV blocks) and prefix caching out of the box. For most teams, "use vLLM" is the entire KV cache strategy.