vLLM's KV cache management system is inspired by OS virtual memory paging. It eliminates KV cache fragmentation, enabling near-100% GPU memory utilisation and 2–4× higher serving throughput.
During LLM inference, you cache key and value tensors for all previous tokens to avoid recomputing them. Naive implementations pre-allocate contiguous memory for the maximum sequence length. This causes two problems: (1) internal fragmentation — short sequences waste the unused tail of their allocation; (2) external fragmentation — after many sequences complete, the free memory is scattered in small gaps that can't fit new allocations even though total free memory is sufficient. In practice, pre-allocation wastes 60–80% of KV cache memory.
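To get a feel for the scale of that waste, here is a back-of-the-envelope calculation; the model dimensions below are those of a LLaMA-7B-class model, so adjust for your own model:

```python
# Rough KV cache sizing for a LLaMA-7B-like model in fp16.
layers, heads, head_dim = 32, 32, 128
bytes_per_token = 2 * layers * heads * head_dim * 2  # K and V, 2 bytes each (fp16)
print(bytes_per_token)  # 524288 bytes, about 0.5 MiB per token

# Pre-allocating for max_len=2048 when a sequence only reaches 512 tokens:
wasted = (2048 - 512) * bytes_per_token
print(wasted / 2**30)  # 0.75 GiB wasted for a single sequence
```

Multiply that per-sequence waste across a large batch and the 60–80% figure becomes plausible.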
PagedAttention, introduced in the vLLM paper (Kwon et al. 2023), borrows the OS virtual memory paging concept. The KV cache is divided into fixed-size physical blocks (e.g. 16 token positions each). Each sequence maintains a block table — a mapping from logical block indices to physical block indices, like a page table. Memory is allocated one block at a time as sequences grow, rather than pre-allocating the full maximum length. When a sequence finishes, its blocks are freed and can be immediately reused by new sequences with no fragmentation.
The key invariant: logical block i of sequence s maps to some physical block p. The attention kernel reads K and V from the physical block addresses in the block table, not from contiguous memory. This requires a custom CUDA kernel that handles the scatter-gather read pattern efficiently.
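A minimal sketch of that bookkeeping, in plain Python. The class names (`BlockAllocator`, `Sequence`) are invented here for illustration and are not vLLM's internals:

```python
BLOCK_SIZE = 16  # token positions per physical block

class BlockAllocator:
    """Hands out physical block indices from a free pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)  # finished sequence: blocks reusable at once

class Sequence:
    """Grows one token at a time; block_table maps logical -> physical."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # block_table[logical_idx] = physical_idx
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, token_pos):
        """Where the attention kernel reads K/V for this position."""
        phys_block = self.block_table[token_pos // BLOCK_SIZE]
        return phys_block * BLOCK_SIZE + token_pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(20):          # 20 tokens allocate 2 blocks, not a 2048-token slab
    seq.append_token()
print(len(seq.block_table))  # 2
```

The `physical_slot` lookup is the scatter-gather pattern the custom CUDA kernel has to perform for every K/V read.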
Copy-on-write (CoW) blocks enable prefix caching: if two sequences share a common prefix (e.g. a system prompt), they can share physical blocks for the prefix, with CoW ensuring modifications don't affect the shared copy. This is the basis for vLLM's prefix caching feature.
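A toy illustration of reference-counted sharing with copy-on-write (again illustrative only, not vLLM's actual code):

```python
class CowBlockManager:
    """Physical blocks carry a refcount; writes to shared blocks fork them."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):          # a second sequence reuses a prefix block
        self.refcount[block] += 1
        return block

    def write(self, block):
        """Return a block safe to mutate, copying first if it is shared."""
        if self.refcount[block] == 1:
            return block             # exclusive owner: write in place
        self.refcount[block] -= 1    # copy-on-write: fork a private copy
        return self.alloc()

mgr = CowBlockManager(8)
prefix = mgr.alloc()        # system prompt lives here
shared = mgr.share(prefix)  # second request reuses it, no copy made
private = mgr.write(shared) # first divergent token triggers the copy
print(private != prefix)    # True: the shared prefix block is untouched
```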
```python
from vllm import LLM, SamplingParams

# vLLM uses PagedAttention internally — no configuration needed
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may use (weights + KV cache)
    max_model_len=8192,
    # swap_space=4,  # GB of CPU RAM for offloading cold KV blocks
)
params = SamplingParams(temperature=0.7, max_tokens=512)

# Batch inference — PagedAttention manages memory across all sequences
prompts = [
    "Explain quantum entanglement in one paragraph.",
    "Write a Python function to compute Fibonacci numbers.",
    "What are the main causes of the French Revolution?",
]
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:100], "...")
```

Online serving with an OpenAI-compatible API:

```shell
vllm serve meta-llama/Meta-Llama-3-8B-Instruct --port 8000
```
PagedAttention and continuous batching (iteration-level scheduling) work together to maximise throughput. Continuous batching adds new requests to a running batch as soon as any sequence finishes a token, rather than waiting for all sequences to complete. Without PagedAttention, the scheduler couldn't predict how much memory a new request needs. With block-level allocation, the scheduler knows exactly how many free blocks exist and can make fine-grained admission decisions.
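The admission decision reduces to simple arithmetic over free blocks. A sketch, assuming a block size of 16:

```python
import math

BLOCK_SIZE = 16

def blocks_needed(prompt_len, lookahead=1):
    """Blocks required to admit a request: its prompt plus at least
    one slot for the next generated token."""
    return math.ceil((prompt_len + lookahead) / BLOCK_SIZE)

def can_admit(prompt_len, free_blocks):
    return blocks_needed(prompt_len) <= free_blocks

# Conceptual scheduler step: admit waiting requests while blocks remain.
free_blocks = 10
waiting = [100, 50, 30]          # prompt lengths of queued requests
admitted = []
for p in waiting:
    if can_admit(p, free_blocks):
        admitted.append(p)
        free_blocks -= blocks_needed(p)
print(admitted, free_blocks)
```

Note the 50-token request is skipped while the 30-token one fits: block-level accounting lets the scheduler make exactly these fine-grained calls.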
The vLLM paper reports benchmarks on LLaMA-13B (see the table below). The gains are larger for longer sequences and more concurrent requests, where fragmentation under naive allocation is worst.
Two practical knobs: set `enable_prefix_caching=True` for workloads with shared system prompts, which reduces time-to-first-token for subsequent requests; and `swap_space`, which allows inactive KV blocks to spill to CPU RAM, enabling more concurrent sequences at the cost of swap-in latency when blocks are retrieved.

```python
from vllm import LLM

# PagedAttention internally manages KV cache blocks.
# Think of it like OS virtual memory: logical→physical mapping.
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    gpu_memory_utilization=0.9,
    kv_cache_dtype="auto",
)

# Behind the scenes:
# - Sequences don't need contiguous memory
# - Shared tokens can reuse physical blocks (prefix sharing, beam search)
# - Reduces fragmentation waste from ~80% to ~10%
outputs = llm.generate(["The future of AI is"])
# vLLM's PagedAttention handles all this transparently
```

| Method | Max Batch | Tokens/sec | Memory Util |
|---|---|---|---|
| Standard KV cache | 8 | 140 | 65% |
| PagedAttention (vLLM) | 64 | 460 | 92% |
| PagedAttention + prefix sharing | 80 | 520 | 94% |
The fragmentation problem, explained: in standard transformers, the KV cache for each sequence is allocated contiguously in GPU memory, sized for the maximum length. When sequences end at different times (common in batching), the unused tail of each allocation is wasted. With a batch of 64 sequences where the shortest has 512 tokens and the longest has 2048, the system allocates 2048 tokens' worth of memory for all 64 sequences; a sequence that stops at 512 tokens wastes 1536 token slots, roughly 0.75 GB for a 7B-class model in fp16. Across the batch, this adds up to severe inefficiency.
PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens per block) and allocates them on demand. Each sequence maintains a block table mapping logical block IDs to physical GPU memory blocks, similar to virtual memory in operating systems. When a sequence finishes generation, its physical blocks are returned to the free pool and reused for other sequences. This simple idea reduces memory waste from ~80% to ~10%, enabling 8-10x larger effective batch sizes.
Implementing PagedAttention correctly requires careful management of block allocation, deallocation, and synchronization with CUDA kernels. vLLM abstracts these details, but understanding the underlying mechanism helps debug performance bottlenecks. For example, block size choice (16 vs 32 tokens) affects CUDA warp efficiency and should be tuned per hardware and sequence length distribution.
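Block size also trades off against internal fragmentation inside each sequence's final, partially filled block. The arithmetic is simple (illustrative numbers):

```python
import math

def wasted_slots(seq_len, block_size):
    """Unused token slots in the sequence's last, partially filled block."""
    return block_size * math.ceil(seq_len / block_size) - seq_len

for bs in (16, 32):
    print(bs, wasted_slots(1000, bs))  # larger blocks waste more per sequence
```

Larger blocks give the kernel more contiguous work per read but waste more slots per sequence, which is why the sweet spot depends on the sequence length distribution.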
PagedAttention is a major breakthrough in serving efficiency, but research continues on further optimizations. Speculative decoding uses a smaller draft model to propose tokens that the large model then verifies, reducing the number of large-model forward passes needed. Flash-attention variants continue reducing I/O costs. Sparse attention patterns (attending only to nearby tokens or learned important tokens) further reduce computation for very long sequences.
Quantization of KV caches (storing keys/values in lower precision) combined with paging creates compounded efficiency gains. Prototyping shows that int8 quantization of KV caches with proper calibration loses negligible quality while saving 75% cache memory. Combining PagedAttention with KV quantization is the future of efficient long-context inference at scale.
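A minimal sketch of symmetric int8 quantization applied to a slice of a K/V vector, in pure Python for clarity (real implementations use fused kernels and per-channel or per-block scales; note the 75% saving is relative to an fp32 baseline, while int8 vs fp16 halves memory):

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: x is approximated by q * scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [round(v / scale) for v in values]      # integers in [-127, 127]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

kv = [0.8, -1.2, 0.05, 2.4, -0.33]              # a slice of a K/V vector
q, scale = quantize_int8(kv)
recovered = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(kv, recovered))
print(max_err < scale)                           # error bounded by one step
```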
Hardware-software codesign drives future progress. Custom attention kernels for specific hardware (like NVIDIA H100 or TPU v5) can further optimize PagedAttention. vLLM actively collaborates with hardware vendors to ensure attention implementations saturate memory bandwidth and compute utilization, pushing toward the hardware's theoretical throughput ceiling.
Operating system concepts apply directly to PagedAttention: block tables are like page tables in virtual memory, physical blocks are like frames in memory, and the block allocator behaves like a memory manager. This analogy helps practitioners understand PagedAttention and debug issues. Virtual memory's decades of research on page replacement policies (LRU, FIFO, ARC) could inspire future PagedAttention optimizations.
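For instance, a cache of freed prefix blocks could use an LRU policy straight out of the OS playbook. A sketch (vLLM's actual eviction policy may differ):

```python
from collections import OrderedDict

class LruBlockCache:
    """Keeps recently used cached prefix blocks; evicts the coldest first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # prefix_hash -> physical block id

    def get(self, prefix_hash):
        if prefix_hash not in self.blocks:
            return None
        self.blocks.move_to_end(prefix_hash)   # mark as recently used
        return self.blocks[prefix_hash]

    def put(self, prefix_hash, block_id):
        self.blocks[prefix_hash] = block_id
        self.blocks.move_to_end(prefix_hash)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used

cache = LruBlockCache(capacity=2)
cache.put("sys-prompt-a", 3)
cache.put("sys-prompt-b", 7)
cache.get("sys-prompt-a")          # touch A so B becomes the coldest entry
cache.put("sys-prompt-c", 9)       # evicts B, the least recently used
print(sorted(cache.blocks))        # ['sys-prompt-a', 'sys-prompt-c']
```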
vLLM's implementation of PagedAttention on different hardware (NVIDIA, AMD, TPU) required adapting the block size and kernel implementation per platform. Block size choice affects CUDA warp efficiency and memory access patterns. Practitioners running vLLM on custom hardware should profile block sizes to find the Pareto frontier of throughput vs. latency for their workloads.
Preemption and scheduling in PagedAttention systems enable prioritizing requests by priority or deadline. A long-context request can be preempted (saving its state) if a shorter, higher-priority request arrives. Sophisticated schedulers optimize for total throughput, latency, or fairness. Understanding the scheduling policy helps predict inference behavior and optimize for specific deployment requirements.
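A toy version of a priority-based preemption decision (real schedulers also weigh swap costs and fairness; the policy below is invented for illustration):

```python
class Request:
    def __init__(self, rid, priority, blocks):
        self.rid, self.priority, self.blocks = rid, priority, blocks

def choose_victim(running, incoming, free_blocks):
    """Pick a running request to preempt (swap its blocks to CPU) when a
    higher-priority request arrives and cannot fit in free blocks."""
    if incoming.blocks <= free_blocks:
        return None                           # fits as-is, no preemption
    victim = min(running, key=lambda r: r.priority)
    if victim.priority < incoming.priority:
        return victim                         # preempt the lowest-priority job
    return None

running = [Request("long-ctx", priority=1, blocks=120),
           Request("chat", priority=5, blocks=10)]
urgent = Request("urgent", priority=9, blocks=40)
victim = choose_victim(running, urgent, free_blocks=8)
print(victim.rid)   # the long-context, low-priority request is preempted
```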
PagedAttention is an example of hardware-software codesign: the algorithm was designed with GPU memory hierarchy and capabilities in mind. Understanding GPU memory architecture (global memory, shared memory, registers) helps appreciate PagedAttention's cleverness in accessing blocks efficiently. Future hardware-software collaboration will likely follow similar principles—designing algorithms and hardware together for mutual benefit.
Different GPU generations (V100, A100, H100) have different memory bandwidths, cache hierarchies, and compute capabilities. PagedAttention implementations are tuned per hardware. vLLM maintains separate kernels for NVIDIA, AMD, and other vendors. This specialization extracts maximum efficiency from each platform. Practitioners deploying to new hardware should benchmark and potentially retune block sizes and kernel parameters.
Emerging hardware such as specialized AI accelerators (e.g. TPUs and Cerebras wafer-scale engines) may enable further PagedAttention optimizations. Some accelerators have custom memory controllers or programmable hardware that could simplify block management or enable new paging strategies. Following PagedAttention's example, optimal inference will involve algorithms and hardware designed together, not bolted on afterward.