How transformers avoid recomputing attention on every token — and how to tune it for throughput and memory.
In transformer attention, every token is projected into Q (query), K (key), and V (value) vectors. On each decoding step, the model computes attention over the current token and all previous tokens. Without caching, the K and V projections for every token already processed must be recomputed at every step: O(n) redundant projection work per step, O(n²) over a sequence of length n.
The KV cache stores the K and V projections from all previous tokens. When a new token arrives, the model computes only that token's K and V, then attends over the cached entries from all past steps. This cuts the per-step K/V projection cost from O(n) to O(1), reducing total projection work for the sequence from O(n²) to O(n). (The attention dot products themselves still touch all n cached entries, so attention remains O(n) per step.)
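The mechanic is easy to see in a toy single-head sketch (NumPy, random weights; illustrative only, not an optimized implementation): decoding with a KV cache yields exactly the same attention outputs as re-projecting K and V from scratch at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-query softmax attention over keys K and values V."""
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

tokens = rng.standard_normal((6, d))     # 6 token embeddings

# Decoding WITH a KV cache: one K/V projection per new token.
K_cache, V_cache = [], []
cached_outs = []
for x in tokens:
    K_cache.append(Wk @ x)               # compute this token's K and V once ...
    V_cache.append(Wv @ x)               # ... and append them to the cache
    cached_outs.append(attend(Wq @ x, np.array(K_cache), np.array(V_cache)))

# Decoding WITHOUT a cache: re-project K/V for the whole prefix every step.
for t, x in enumerate(tokens):
    K = tokens[: t + 1] @ Wk.T           # recomputed from scratch each step
    V = tokens[: t + 1] @ Wv.T
    out = attend(Wq @ x, K, V)
    assert np.allclose(out, cached_outs[t])  # identical result, far less work

print("cached and uncached decoding match")
```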
The cache lives in GPU VRAM and is evicted when the context window fills or memory runs out. It is the primary bottleneck for long-context inference, often consuming more memory than the model weights themselves.
The KV cache grows linearly with sequence length. The formula for cache memory is:
cache_bytes = 2 (K and V) × n_layers × n_kv_heads × head_dim × seq_len × 2 (bytes per fp16 value)
The first 2× accounts for both K and V; the final 2 is the byte size of fp16. For a 7B model with 32 layers, 32 heads, and 128-dim heads, that works out to 0.5 MB per token, so caching a 2K-token sequence costs ~1 GB. At 128K tokens it is 64 GB, several times the ~14 GB of fp16 model weights.
| Model | Seq 2K | Seq 8K | Seq 32K | Seq 128K |
|---|---|---|---|---|
| 7B (32L, 32H) | 1 GB | 4 GB | 16 GB | 64 GB |
| 13B (40L, 40H) | 1.6 GB | 6.3 GB | 25 GB | 100 GB |
| 70B (80L, 64H) | 5 GB | 20 GB | 80 GB | 320 GB |
| 405B (126L, 128H) | 16 GB | 63 GB | 252 GB | ~1 TB |
This is why long-context requests are expensive and why serving systems enforce limits like max_batch_tokens. A single batch of 16 requests × 128K-token context for a 70B model requires ~5 TB of VRAM for cache alone (figures assume full multi-head attention in fp16; GQA models divide them by the Q-to-KV head ratio). Even sharding across multiple GPUs becomes challenging.
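The formula is easy to sanity-check in code. The helper below simply applies it (assuming full MHA in fp16; parameter names are mine):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_param=2):
    """KV-cache size = 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_param

GIB = 1024**3
# 7B-class model: 32 layers, 32 KV heads (full MHA), 128-dim heads
print(kv_cache_bytes(32, 32, 128, 2048) / GIB)     # → 1.0 (GiB at 2K tokens)
print(kv_cache_bytes(32, 32, 128, 131072) / GIB)   # → 64.0 (GiB at 128K tokens)
# The same model with GQA and 8 KV heads shrinks the cache 4x:
print(kv_cache_bytes(32, 8, 128, 131072) / GIB)    # → 16.0 (GiB)
```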
Grouped Query Attention (GQA) reduces cache size by sharing K and V heads across multiple Q heads. Instead of one K/V head per Q head (Full Multi-Head Attention), GQA uses one K/V head per group of Q heads — typically a group size of 8. This shrinks cache by 8×.
Multi-Query Attention (MQA) is the extreme case: a single K/V head shared by all Q heads. Early models like Falcon and PaLM used MQA and saw modest quality drops but dramatic cache savings.
MLA (Multi-Head Latent Attention, DeepSeek) takes compression further by projecting K/V into a low-rank latent space before caching. Only the compressed latent is stored, then decompressed at attention time. This can achieve 2–4× additional cache reduction compared to GQA.
| Variant | K/V heads (for 32 Q heads) | Cache size vs MHA | Quality vs MHA |
|---|---|---|---|
| Full MHA | 32 | 100% | Baseline |
| GQA (g=8) | 4 | 12.5% | ≈ same |
| MQA | 1 | 3.1% | Slight drop (≈1%) |
| MLA (DeepSeek) | 1 (compressed) | 5–8% | No drop reported |
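The cache-size column follows directly from the KV-head count, as a quick check shows (MLA is excluded because its ratio depends on the latent dimension, not a head count):

```python
MHA_HEADS = 32  # full MHA keeps one KV head per Q head

def cache_fraction(n_kv_heads):
    """KV-cache size relative to full MHA with 32 KV heads."""
    return n_kv_heads / MHA_HEADS

print(f"GQA (g=8): {cache_fraction(4):.1%}")   # 32/8 = 4 KV heads → 12.5%
print(f"MQA:       {cache_fraction(1):.1%}")   # 1 KV head → 3.1%
```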
When context exceeds the window or cache fills up, something must be evicted. The strategy matters: drop the wrong tokens and the model loses crucial context (like the system prompt). Drop the right tokens and quality barely changes.
FIFO: when the cache fills, discard the oldest tokens. Simplest to implement (just a circular buffer), but document history is lost and system prompts evaporate quickly.
H2O (Heavy-Hitter Oracle): track which tokens have accumulated the highest cumulative attention scores across layers; keep those, discard the rest. Retains "important" tokens automatically.
StreamingLLM (attention sinks): always keep the first N tokens (system prompt, preamble) plus a sliding window of recent tokens, preserving both critical context and recent information.
PagedAttention: treat the cache as virtual memory, organizing it into logical blocks that map to physical GPU/CPU memory. Supports swapping, sharing between requests, and zero-copy transfers.
| Strategy | Impl. Complexity | Memory Efficiency | Quality Preservation | Production Use |
|---|---|---|---|---|
| FIFO | Very low | Good | Poor (loses context) | Legacy systems |
| H2O | Medium | Very good | Good | Research |
| StreamingLLM | Low | Good | Excellent | Emerging |
| PagedAttention | High | Excellent | Excellent | vLLM, state of the art |
PagedAttention (vLLM) is now the industry standard. Its virtual memory approach eliminates fragmentation and enables sharing of cache blocks across concurrent requests — crucial for batching efficiency.
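The attention-sink policy is compact enough to sketch in a few lines (a hypothetical helper, not the StreamingLLM implementation): keep the first `n_sink` positions plus a sliding window of recent ones, evicting everything in between.

```python
def visible_positions(seq_len, n_sink=4, window=8):
    """Positions whose K/V entries are kept: first n_sink plus the last `window`."""
    sinks = list(range(min(n_sink, seq_len)))
    recent = list(range(max(n_sink, seq_len - window), seq_len))
    return sinks + recent

# Short sequences fit entirely; long ones keep sinks plus a recent window.
print(visible_positions(10))   # → [0, 1, ..., 9] (nothing evicted yet)
print(visible_positions(20))   # → [0, 1, 2, 3, 12, 13, ..., 19]
```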
When many requests share an identical prefix (system prompt, few-shot examples, shared document), compute K/V once and reuse it across all requests. This amortizes the prefill cost and can reduce total tokens processed by 50–90% depending on prefix length.
SGLang, vLLM, and TensorRT-LLM all implement prefix cache (SGLang calls it "radix cache"). At scale, the savings are dramatic: a 100-token system prompt at 1000 requests/sec = 100K free tokens/sec that never need recomputation.
The cache is keyed by hashes of the token-ID sequence, typically computed per block, with each block's key covering its own tokens plus the entire prefix before it. When a new request arrives, the engine hashes its prefix blocks; on a match with a prior request, the cached K/V blocks are reused directly. Cached blocks usually live in local GPU memory, though some systems spill to CPU RAM or an external store at larger scale.
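A sketch of block-level prefix lookup (hypothetical helper names and block size; real engines hash token-ID blocks in a similar spirit):

```python
import hashlib

BLOCK = 16  # tokens per cache block (illustrative)

def block_keys(token_ids):
    """One key per full block; each key covers the block plus its whole prefix,
    so a mismatch at any block invalidates everything after it."""
    keys = []
    for end in range(BLOCK, len(token_ids) + 1, BLOCK):
        prefix = ",".join(map(str, token_ids[:end]))
        keys.append(hashlib.sha256(prefix.encode()).hexdigest())
    return keys

store = {}  # key -> cached K/V block (placeholder values here)

def prefill(token_ids):
    """Return how many blocks were served from the prefix cache."""
    hits = 0
    for key in block_keys(token_ids):
        if key in store:
            hits += 1          # K/V already computed by an earlier request
        else:
            store[key] = "kv"  # compute K/V for this block and cache it
    return hits

system_prompt = list(range(48))             # 3 blocks of shared prefix
print(prefill(system_prompt + [100, 101]))  # → 0 (cold cache)
print(prefill(system_prompt + [200, 201]))  # → 3 (shared prefix blocks reused)
```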
Speculative decoding uses a smaller, faster draft model to propose k tokens. The target model then verifies those k tokens in a single forward pass and contributes one token of its own, so each verification step can yield up to k+1 tokens. Draft tokens consistent with the target's distribution are accepted; the first mismatch and everything after it is discarded and resampled.
The KV cache interacts subtly: during the verification pass, the target model processes k+1 tokens in parallel, building cache for all of them. Accepted tokens' cache gets written; rejected tokens' cache is thrown away. This means you compute forward passes on tokens you ultimately discard, but the verification batch processes multiple tokens at once (higher GPU utilization).
| Metric | Regular Decoding | Speculative (k=4) | Speedup |
|---|---|---|---|
| Forward passes (draft + target) per output token | 1 | ~1.1–1.25 | 0.8–0.9× (extra compute) |
| Latency (wall-clock) | 1.0× | 0.7–0.85× | 1.2–1.4× |
| Throughput (tokens/sec) | 1.0× | 1.0–1.1× | 1.0–1.1× |
| KV cache overhead | Baseline | +5–10% (temp buffers) | Acceptable |
Speculative decoding trades a small increase in compute (wasted on rejected tokens) for much lower latency. It's effective when the draft model is fast (e.g., a 1B model as draft, 70B as target) and when the acceptance rate is high (>70%, which happens with well-calibrated models). Streaming applications benefit significantly.
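The economics can be sketched with a toy expected-value calculation (assuming an i.i.d. per-token acceptance rate alpha, a simplification of the real, correlated acceptance process):

```python
def expected_tokens_per_cycle(alpha, k):
    """Expected output tokens per target-model verification pass.

    The draft proposes k tokens; the target accepts a prefix of them and
    always contributes one token itself (the bonus or resampled token).
    E[tokens] = sum_{i=0..k} alpha^i, since P(at least i drafts accepted) = alpha^i.
    """
    return sum(alpha**i for i in range(k + 1))

for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {expected_tokens_per_cycle(alpha, 4):.2f} tokens/target pass")
```

With a high acceptance rate (alpha = 0.9, k = 4) each expensive target pass yields about four tokens, which is where the latency win comes from; at alpha = 0.5 the yield drops below two and the draft overhead starts to dominate.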
```python
from vllm import LLM, SamplingParams
import time

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    enable_prefix_caching=True,  # prefix cache: reuse K/V for shared prefixes
    max_model_len=4096,
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# Simulate a workload with a shared system prompt (benefits from prefix caching)
SYSTEM_PROMPT = "You are a helpful assistant. " * 50  # ~300 tokens of shared prefix
queries = [
    SYSTEM_PROMPT + f"Question {i}: What is {i} + {i}?"
    for i in range(20)
]

# First batch: cold cache
t0 = time.perf_counter()
outputs_cold = llm.generate(queries, params)
cold_time = time.perf_counter() - t0

# Second batch: warm cache (same system-prompt prefix)
t1 = time.perf_counter()
outputs_warm = llm.generate(queries, params)
warm_time = time.perf_counter() - t1

print(f"Cold cache: {cold_time:.2f}s")
print(f"Warm cache: {warm_time:.2f}s")
print(f"Speedup from prefix caching: {cold_time / warm_time:.1f}×")
# Typical: 1.5–3× speedup on workloads with long shared prefixes
```
In production serving systems, KV cache behavior is controlled by a handful of key parameters. Understanding these lets you trade off throughput, latency, and quality for your workload.
--max-model-len: The maximum sequence length. Larger values allow longer contexts but consume more cache. Set based on your longest expected request + prefill buffer.
--gpu-memory-utilization: Fraction of GPU VRAM the engine may use (typically 0.8–0.95); whatever remains after weights and activations becomes cache space. Higher = more concurrent requests, but risks OOM on bursts.
--enable-prefix-caching: Enable automatic caching and reuse of identical prefixes. Essential for workloads with shared system prompts.
--num-gpu-blocks-override: Manually specify cache block count. Useful for tuning to your VRAM constraints.
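These knobs interact. A rough budget calculator (illustrative; real engines profile activation memory rather than taking it as a constant, and the per-token cost here assumes a 7B full-MHA model in fp16) shows how the utilization fraction translates into cache blocks:

```python
def cache_blocks(vram_bytes, util, weight_bytes, activation_bytes,
                 block_tokens=16, bytes_per_token=524_288):
    """KV-cache blocks that fit after weights and activations are reserved.

    bytes_per_token defaults to 0.5 MiB: a 7B MHA model in fp16.
    """
    budget = vram_bytes * util - weight_bytes - activation_bytes
    return max(0, int(budget // (block_tokens * bytes_per_token)))

GIB = 1024**3
# 80 GiB GPU, 90% utilization, ~14 GiB fp16 7B weights, ~4 GiB activations
blocks = cache_blocks(80 * GIB, 0.90, 14 * GIB, 4 * GIB)
print(blocks, "blocks ≈", blocks * 16, "tokens of cache capacity")
```

Raising --gpu-memory-utilization grows the block pool linearly; --num-gpu-blocks-override simply pins the result of this kind of calculation by hand.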
KV cache is an inference optimisation. Understand it in this order:
In transformer decoding, every new token attends to all previous tokens. Without caching, that's O(n²) work per token. See Transformers for the attention mechanics.
KV cache trades VRAM for speed. At 70B in FP16, the cache for a 32K context runs to tens of GB. This is why quantised KV cache (FP8 or INT4) matters for long contexts.
vLLM implements PagedAttention (non-contiguous KV blocks) and prefix caching out of the box. For most teams, "use vLLM" is the entire KV cache strategy.