vLLM's KV cache management system is inspired by OS virtual memory paging. It eliminates KV cache fragmentation, enabling near-100% GPU memory utilisation and 2–4× higher serving throughput.
During LLM inference, you cache key and value tensors for all previous tokens to avoid recomputing them. Naive implementations pre-allocate contiguous memory for the maximum sequence length. This causes two problems: (1) internal fragmentation — short sequences waste the unused tail of their allocation; (2) external fragmentation — after many sequences complete, the free memory is scattered in small gaps that can't fit new allocations even though total free memory is sufficient. In practice, pre-allocation wastes 60–80% of KV cache memory.
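To get a feel for the scale of that waste, here is a back-of-the-envelope calculation; the model dimensions below are those of a LLaMA-7B-class model, so adjust for your own model:

```python
# Rough KV cache sizing for a LLaMA-7B-like model in fp16.
layers, heads, head_dim = 32, 32, 128
bytes_per_token = 2 * layers * heads * head_dim * 2  # K and V, 2 bytes each (fp16)
print(bytes_per_token)  # 524288 bytes, about 0.5 MiB per token

# Pre-allocating for max_len=2048 when a sequence only reaches 512 tokens:
wasted = (2048 - 512) * bytes_per_token
print(wasted / 2**30)  # 0.75 GiB wasted for a single sequence
```

Multiply that per-sequence waste across a large batch and the 60–80% figure becomes plausible.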
PagedAttention, introduced in the vLLM paper (Kwon et al. 2023), borrows the OS virtual memory paging concept. The KV cache is divided into fixed-size physical blocks (e.g. 16 token positions each). Each sequence maintains a block table — a mapping from logical block indices to physical block indices, like a page table. Memory is allocated one block at a time as sequences grow, rather than pre-allocating the full maximum length. When a sequence finishes, its blocks are freed and can be immediately reused by new sequences with no fragmentation.
The key invariant: logical block i of sequence s maps to some physical block p. The attention kernel reads K and V from the physical block addresses in the block table, not from contiguous memory. This requires a custom CUDA kernel that handles the scatter-gather read pattern efficiently.
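A minimal sketch of that bookkeeping, in plain Python. The class names (`BlockAllocator`, `Sequence`) are invented here for illustration and are not vLLM's internals:

```python
BLOCK_SIZE = 16  # token positions per physical block

class BlockAllocator:
    """Hands out physical block indices from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)  # finished sequence: blocks reusable at once

class Sequence:
    """Grows one token at a time; block_table maps logical -> physical."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # block_table[logical_idx] = physical_idx
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, token_pos):
        """Where the attention kernel reads K/V for this position."""
        phys_block = self.block_table[token_pos // BLOCK_SIZE]
        return phys_block * BLOCK_SIZE + token_pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(20):          # 20 tokens allocate 2 blocks, not a 2048-token slab
    seq.append_token()
print(len(seq.block_table))  # 2
```

The `physical_slot` lookup is the scatter-gather pattern the custom CUDA kernel has to perform for every K/V read.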
Copy-on-write (CoW) blocks enable prefix caching: if two sequences share a common prefix (e.g. a system prompt), they can share physical blocks for the prefix, with CoW ensuring modifications don't affect the shared copy. This is the basis for vLLM's prefix caching feature.
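A toy illustration of reference-counted sharing with copy-on-write (again illustrative only, not vLLM's actual code):

```python
class CowBlockManager:
    """Physical blocks carry a refcount; writes to shared blocks fork them."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):          # a second sequence reuses a prefix block
        self.refcount[block] += 1
        return block

    def write(self, block):
        """Return a block safe to mutate, copying first if it is shared."""
        if self.refcount[block] == 1:
            return block             # exclusive owner: write in place
        self.refcount[block] -= 1    # copy-on-write: fork a private copy
        return self.alloc()

mgr = CowBlockManager(8)
prefix = mgr.alloc()        # system prompt lives here
shared = mgr.share(prefix)  # second request reuses it, no copy made
private = mgr.write(shared) # first divergent token triggers the copy
print(private != prefix)    # True: the shared prefix block is untouched
```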
```python
from vllm import LLM, SamplingParams

# vLLM uses PagedAttention internally — no configuration needed
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may use (weights + KV cache)
    max_model_len=8192,
    # swap_space=4,  # GB of CPU RAM for offloading cold KV blocks
)
params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch inference — PagedAttention manages memory across all sequences
prompts = [
    "Explain quantum entanglement in one paragraph.",
    "Write a Python function to compute Fibonacci numbers.",
    "What are the main causes of the French Revolution?",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:100], "...")
```

Online serving with an OpenAI-compatible API:

```shell
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
```
PagedAttention and continuous batching (iteration-level scheduling) work together to maximise throughput. Continuous batching adds new requests to a running batch as soon as any sequence finishes a token, rather than waiting for all sequences to complete. Without PagedAttention, the scheduler couldn't predict how much memory a new request needs. With block-level allocation, the scheduler knows exactly how many free blocks exist and can make fine-grained admission decisions.
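The admission decision reduces to simple arithmetic over free blocks. A sketch, assuming a block size of 16:

```python
import math

BLOCK_SIZE = 16

def blocks_needed(prompt_len, lookahead=1):
    """Blocks required to admit a request: its prompt plus at least
    one slot for the next generated token."""
    return math.ceil((prompt_len + lookahead) / BLOCK_SIZE)

def can_admit(prompt_len, free_blocks):
    return blocks_needed(prompt_len) <= free_blocks

# Conceptual scheduler step: admit waiting requests while blocks remain.
free_blocks = 10
waiting = [100, 50, 30]          # prompt lengths of queued requests
admitted = []
for p in waiting:
    if can_admit(p, free_blocks):
        admitted.append(p)
        free_blocks -= blocks_needed(p)
print(admitted, free_blocks)
```

Note the 50-token request is skipped while the 30-token one fits: block-level accounting lets the scheduler make exactly these fine-grained calls.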
The vLLM paper reports benchmarks on LLaMA-13B (see the table below). The gains are larger for longer sequences and more concurrent requests, where fragmentation under naive allocation is worst.
Two practical knobs: set `enable_prefix_caching=True` for workloads with shared system prompts, which reduces time-to-first-token for subsequent requests; and `swap_space`, which allows inactive KV blocks to spill to CPU RAM, enabling more concurrent sequences at the cost of swap-in latency when blocks are retrieved.

```python
from vllm import LLM

# PagedAttention internally manages KV cache blocks.
# Think of it like OS virtual memory: logical→physical mapping.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    kv_cache_dtype="auto",
)

# Behind the scenes:
# - Sequences don't need contiguous memory
# - Shared tokens can reuse physical blocks (prefix sharing, beam search)
# - Reduces fragmentation waste from ~80% to ~10%
outputs = llm.generate(["The future of AI is"])
# vLLM's PagedAttention handles all this transparently
```

| Method | Max Batch | Tokens/sec | Memory Util |
|---|---|---|---|
| Standard KV cache | 8 | 140 | 65% |
| PagedAttention (vLLM) | 64 | 460 | 92% |
| PagedAttention + prefix sharing | 80 | 520 | 94% |
The fragmentation problem, explained: in standard transformers, the KV cache for each sequence is allocated contiguously in GPU memory, sized for the maximum length. When sequences end at different times (common in batching), the unused tail of each allocation is wasted. With a batch of 64 sequences where the shortest has 512 tokens and the longest has 2048, the system allocates 2048 tokens' worth of memory for all 64 sequences; a sequence that stops at 512 tokens wastes 1536 token slots, roughly 0.75 GB for a 7B-class model in fp16. Across the batch, this adds up to severe inefficiency.
PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens per block) and allocates them on demand. Each sequence maintains a block table mapping logical block IDs to physical GPU memory blocks, similar to virtual memory in operating systems. When a sequence finishes generation, its physical blocks are returned to the free pool and reused for other sequences. This simple idea reduces memory waste from ~80% to ~10%, enabling 8-10x larger effective batch sizes.
Implementing PagedAttention correctly requires careful management of block allocation, deallocation, and synchronization with CUDA kernels. vLLM abstracts these details, but understanding the underlying mechanism helps debug performance bottlenecks. For example, block size choice (16 vs 32 tokens) affects CUDA warp efficiency and should be tuned per hardware and sequence length distribution.
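Block size also trades off against internal fragmentation inside each sequence's final, partially filled block. The arithmetic is simple (illustrative numbers):

```python
import math

def wasted_slots(seq_len, block_size):
    """Unused token slots in the sequence's last, partially filled block."""
    return block_size * math.ceil(seq_len / block_size) - seq_len

for bs in (16, 32):
    print(bs, wasted_slots(1000, bs))  # larger blocks waste more per sequence
```

Larger blocks give the kernel more contiguous work per read but waste more slots per sequence, which is why the sweet spot depends on the sequence length distribution.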
PagedAttention is a major breakthrough in serving efficiency, but research continues on further optimizations. Speculative decoding uses a smaller draft model to propose tokens that the large model then verifies, reducing the number of large-model forward passes needed. Flash-attention variants continue reducing I/O costs. Sparse attention patterns (attending only to nearby tokens or learned important tokens) further reduce computation for very long sequences.
Quantization of KV caches (storing keys/values in lower precision) combined with paging creates compounded efficiency gains. Prototyping shows that int8 quantization of KV caches with proper calibration loses negligible quality while saving 75% cache memory. Combining PagedAttention with KV quantization is the future of efficient long-context inference at scale.
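A minimal sketch of symmetric int8 quantization applied to a slice of a K/V vector, in pure Python for clarity (real implementations use fused kernels and per-channel or per-block scales; note the 75% saving is relative to an fp32 baseline, while int8 vs fp16 halves memory):

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: x is approximated by q * scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]      # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

kv = [0.8, -1.2, 0.05, 2.4, -0.33]              # a slice of a K/V vector
q, scale = quantize_int8(kv)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(kv, recovered))
print(max_err < scale)                           # error bounded by one step
```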
Hardware-software codesign drives future progress. Custom attention kernels for specific hardware (like NVIDIA H100 or TPU v5) can further optimize PagedAttention. vLLM actively collaborates with hardware vendors to ensure attention implementations saturate memory bandwidth and compute utilization, pushing toward the hardware's theoretical throughput ceiling.
Operating system concepts apply directly to PagedAttention: block tables are like page tables in virtual memory, physical blocks are like frames in memory, and the block allocator behaves like a memory manager. This analogy helps practitioners understand PagedAttention and debug issues. Virtual memory's decades of research on page replacement policies (LRU, FIFO, ARC) could inspire future PagedAttention optimizations.
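For instance, a cache of freed prefix blocks could use an LRU policy straight out of the OS playbook. A sketch (vLLM's actual eviction policy may differ):

```python
from collections import OrderedDict

class LruBlockCache:
    """Keeps recently used cached prefix blocks; evicts the coldest first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # prefix_hash -> physical block id

    def get(self, prefix_hash):
        if prefix_hash not in self.blocks:
            return None
        self.blocks.move_to_end(prefix_hash)   # mark as recently used
        return self.blocks[prefix_hash]

    def put(self, prefix_hash, block_id):
        self.blocks[prefix_hash] = block_id
        self.blocks.move_to_end(prefix_hash)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used

cache = LruBlockCache(capacity=2)
cache.put("sys-prompt-a", 3)
cache.put("sys-prompt-b", 7)
cache.get("sys-prompt-a")          # touch A so B becomes the coldest entry
cache.put("sys-prompt-c", 9)       # evicts B, the least recently used
print(sorted(cache.blocks))        # ['sys-prompt-a', 'sys-prompt-c']
```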
vLLM's implementation of PagedAttention on different hardware (NVIDIA, AMD, TPU) required adapting the block size and kernel implementation per platform. Block size choice affects CUDA warp efficiency and memory access patterns. Practitioners running vLLM on custom hardware should profile block sizes to find the Pareto frontier of throughput vs. latency for their workloads.
Preemption and scheduling in PagedAttention systems enable prioritizing requests by priority or deadline. A long-context request can be preempted (saving its state) if a shorter, higher-priority request arrives. Sophisticated schedulers optimize for total throughput, latency, or fairness. Understanding the scheduling policy helps predict inference behavior and optimize for specific deployment requirements.
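A toy version of a priority-based preemption decision (real schedulers also weigh swap costs and fairness; the policy below is invented for illustration):

```python
class Request:
    def __init__(self, rid, priority, blocks):
        self.rid, self.priority, self.blocks = rid, priority, blocks

def choose_victim(running, incoming, free_blocks):
    """Pick a running request to preempt (swap its blocks to CPU) when a
    higher-priority request arrives and cannot fit in free blocks."""
    if incoming.blocks <= free_blocks:
        return None                           # fits as-is, no preemption
    victim = min(running, key=lambda r: r.priority)
    if victim.priority < incoming.priority:
        return victim                         # preempt the lowest-priority job
    return None

running = [Request("long-ctx", priority=1, blocks=120),
           Request("chat", priority=5, blocks=10)]
urgent = Request("urgent", priority=9, blocks=40)
victim = choose_victim(running, urgent, free_blocks=8)
print(victim.rid)   # the long-context, low-priority request is preempted
```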
PagedAttention is an example of hardware-software codesign: the algorithm was designed with GPU memory hierarchy and capabilities in mind. Understanding GPU memory architecture (global memory, shared memory, registers) helps appreciate PagedAttention's cleverness in accessing blocks efficiently. Future hardware-software collaboration will likely follow similar principles—designing algorithms and hardware together for mutual benefit.
Different GPU generations (V100, A100, H100) have different memory bandwidths, cache hierarchies, and compute capabilities. PagedAttention implementations are tuned per hardware. vLLM maintains separate kernels for NVIDIA, AMD, and other vendors. This specialization extracts maximum efficiency from each platform. Practitioners deploying to new hardware should benchmark and potentially retune block sizes and kernel parameters.
Emerging hardware such as specialized AI accelerators (e.g. TPUs and Cerebras wafer-scale engines) may enable further PagedAttention optimizations. Some accelerators have custom memory controllers or programmable hardware that could simplify block management or enable new paging strategies. Following PagedAttention's example, optimal inference will involve algorithms and hardware designed together, not bolted on afterward.