High-throughput LLM serving engine with PagedAttention for efficient KV cache management and OpenAI-compatible API
Serving large language models in production presents unique challenges. Unlike training, where you process fixed batches with known sequence lengths, serving must handle dynamic request arrivals with variable prompt and generation lengths. The primary bottleneck is KV cache memory management: the KV cache (the stored key and value tensors for all previous tokens) grows with both sequence length and the number of concurrent requests, consuming significant GPU memory.
In standard transformer serving, KV cache is allocated statically. If you allocate space for a 2048-token sequence but a request only uses 512 tokens, the remaining 1536 token slots are wasted. Across many requests, this fragmentation can leave 60-80% of allocated KV cache unused. This memory waste forces systems to run with smaller batch sizes, reducing throughput.
Another serving challenge is request latency variability. Some requests may complete quickly (short generation), others slowly (long generation). With static batching, you must wait for the slowest request before processing a new batch. This leads to inefficient GPU utilization where hardware sits idle waiting for a single slow request to finish.
KV Cache Problem: For a 70B-class model in FP16 without grouped-query attention (80 layers, 64 heads of dimension 128), the KV cache costs roughly 2.5MB per token, so each statically allocated 2048-token slot reserves about 5GB and a few dozen slots consume hundreds of GB. If serving 8 requests with average context 512 tokens, you still allocate for 2048 tokens each, wasting 75% of that memory. This forces very small batch sizes, severely limiting throughput.
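The per-token cost is easy to estimate from the architecture. A minimal sketch, assuming a 70B-class configuration without grouped-query attention (80 layers, 64 heads of dimension 128, FP16); real models such as Llama-2-70B use GQA and need far less:

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: K and V tensors for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed 70B-class config: 80 layers, 64 KV heads, head_dim 128, FP16.
per_token = kv_cache_bytes_per_token(80, 64, 128)
print(per_token / 2**20)            # 2.5 (MiB per token)
print(2048 * per_token / 2**30)     # 5.0 (GiB per 2048-token slot)

# Static allocation waste: 8 slots of 2048 tokens, only 512 used on average.
used, allocated = 8 * 512, 8 * 2048
print(1 - used / allocated)         # 0.75 -> 75% of KV memory wasted
```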
PagedAttention is vLLM's core innovation, adapting the virtual memory concept from operating systems. In OS virtual memory, logical pages (typically 4KB) are mapped to physical frames, enabling dynamic allocation and memory overcommit. PagedAttention applies this to the KV cache: logical token blocks are mapped to physical blocks in GPU memory.
The key insight is that KV cache doesn't require contiguous memory. Attention computation can work with scattered memory locations if you maintain a mapping from logical block indices to physical block indices. This allows dynamic allocation: blocks are allocated on-demand as tokens arrive, and released when requests complete. Fragmentation is eliminated because you can allocate any physical block for any logical position.
With PagedAttention, KV cache is allocated in blocks (e.g., 16 tokens per block). If a request uses 512 tokens, it uses 32 blocks. A different request using 768 tokens uses 48 blocks. The blocks can come from any available physical memory locations; the page table maps them logically. When requests complete, blocks are freed and reused, eliminating fragmentation entirely.
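A toy sketch of the bookkeeping (illustrative only, not vLLM's implementation): each request's logical blocks map to arbitrary physical blocks drawn from one shared free pool, and freed blocks are immediately reusable.

```python
BLOCK_SIZE = 16  # tokens per block

class PagedKVCache:
    """Toy version of PagedAttention's block-table bookkeeping."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # shared free pool
        self.tables = {}   # request id -> list of physical block ids

    def allocate(self, req_id, num_tokens):
        # Allocate only as many blocks as the tokens actually need.
        n_blocks = (num_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.tables[req_id] = [self.free.pop() for _ in range(n_blocks)]

    def free_request(self, req_id):
        # A completed request's blocks go straight back to the pool.
        self.free.extend(self.tables.pop(req_id))

    def physical_location(self, req_id, token_idx):
        # Logical token position -> (physical block, offset). Blocks need
        # not be contiguous; attention reads through this indirection.
        table = self.tables[req_id]
        return table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

cache = PagedKVCache(num_physical_blocks=1024)
cache.allocate("req-a", 512)   # 32 blocks
cache.allocate("req-b", 768)   # 48 blocks
print(len(cache.tables["req-a"]), len(cache.tables["req-b"]))  # 32 48
cache.free_request("req-a")    # those 32 blocks are reusable at once
```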
PagedAttention Memory Efficiency: With 16-token blocks, a request occupies only the blocks it needs: the 512-token request above uses 32 blocks, a 768-token request 48. At batch_size=4 with requests of 512, 256, 1024, and 768 tokens, actual usage is 32+16+64+48 = 160 blocks, versus 4 × 128 = 512 blocks under static 2048-token allocation. Utilization rises from ~31% to ~95%, since at most one partially filled block is wasted per request.
vLLM's architecture consists of several key components: an async scheduling engine, continuous batching, prefix caching, and chunked prefill processing. The scheduler manages incoming requests, decides which requests to batch together, and handles KV cache allocation using PagedAttention.
Continuous batching (also called in-flight batching) allows requests to join or leave the batch at any decode step, not just at batch boundaries. This is crucial for throughput: when a request finishes generation and leaves the batch, a new request can immediately take its slot. This eliminates idle time waiting for batches to fill or for the slowest request to complete.
Prefix caching exploits the fact that many requests share common prefixes (system prompts, function definitions, retrieved documents). vLLM caches KV values for these shared prefixes. When a new request arrives with the same prefix, the cached KV values are reused, avoiding redundant computation and memory usage.
Chunked prefill addresses the mismatch between prefill (processing the initial prompt) and decode (generating one token at a time). The two phases have different characteristics: prefill is compute-bound (many tokens at once), decode is memory-bound (one token per request per step), and a long prompt's prefill can stall ongoing decodes for an entire step. By splitting prefill into smaller chunks, vLLM can interleave prefill work with decode steps, improving GPU utilization and smoothing latency.
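One way to picture the interleaving, as a sketch under assumed numbers (the 512-token per-step budget is hypothetical, not vLLM's default): decodes claim one token each, and whatever budget remains goes to the next chunk of a pending prefill.

```python
def plan_step(decode_reqs, prefill_remaining, token_budget=512):
    """Return (decode tokens, prefill chunk size) for one model step."""
    decode_tokens = len(decode_reqs)   # one token per decoding request
    prefill_chunk = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, max(prefill_chunk, 0)

# 100 requests decoding while a 2000-token prompt is prefilled in chunks:
print(plan_step(range(100), prefill_remaining=2000))  # (100, 412)
```

Each step makes progress on all decodes while chipping away at the prefill, instead of freezing every decode for one giant prefill pass.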
Continuous Batching Impact: With traditional static batching, a batch of 8 requests where one finishes quickly must wait for the 7 others. With continuous batching, that request's slot is immediately freed for a new request. For typical request distributions, this can increase throughput by 2-3x.
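The effect can be simulated with a toy scheduler (illustrative only): steps count model forward passes, and with continuous batching a short request no longer pins a batch slot until everyone finishes.

```python
from collections import deque

def continuous_batching(requests, max_num_seqs=8):
    """Toy simulation: a finished request frees its slot immediately."""
    waiting = deque(requests)
    running, steps = [], 0
    while waiting or running:
        while waiting and len(running) < max_num_seqs:  # fill free slots
            running.append(list(waiting.popleft()))
        steps += 1                                      # one decode step
        for req in running:
            req[1] -= 1                                 # each emits a token
        running = [r for r in running if r[1] > 0]
    return steps

def static_batching(requests, max_num_seqs=8):
    """Baseline: the whole batch waits for its slowest member."""
    steps, pending = 0, list(requests)
    while pending:
        batch, pending = pending[:max_num_seqs], pending[max_num_seqs:]
        steps += max(tokens for _, tokens in batch)
    return steps

reqs = [("short", 5)] + [(f"long{i}", 100) for i in range(8)]
print(static_batching(reqs), continuous_batching(reqs))  # 200 105
```

Here the ninth request starts as soon as the short one finishes, nearly halving total steps versus static batching.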
Deploying vLLM is straightforward for basic use cases. Installation via pip, launching an OpenAI-compatible server, and querying via standard client libraries are all simple. For production, you'll want to manage multiple workers (GPU processes), monitor throughput, and handle scaling across multiple nodes.
vLLM provides an OpenAI-compatible API, meaning clients written for OpenAI's API work unchanged with vLLM. This is significant for adoption because no client-side code changes are needed. Tools like LangChain, LlamaIndex, and custom applications integrate seamlessly.
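A minimal sketch using only the standard library (the model name and port are assumptions; any OpenAI client library works the same way against vLLM's `/v1` endpoints):

```python
import json
from urllib.request import Request, urlopen

def chat_request(model, messages, base_url="http://localhost:8000"):
    """Build a Chat Completions request against a local vLLM server."""
    payload = {"model": model, "messages": messages, "max_tokens": 64}
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("meta-llama/Llama-3.1-8B-Instruct",   # assumed model
                   [{"role": "user", "content": "Hello"}])
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# With a server running, send it and read the standard OpenAI response shape:
# resp = json.load(urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```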
For production deployments, vLLM is often run in containers (Docker) with Kubernetes orchestration. This allows scaling horizontally (adding more GPUs) and managing resource allocation. vLLM supports tensor parallelism to overcome single-GPU memory limits (splitting the model across multiple GPUs on the same node) as well as distributed inference across multiple nodes.
vLLM performance depends on many parameters. The main tuning knobs are: tensor parallelism (how many GPUs split the model), pipeline parallelism (less common), quantization (reduce precision), max_num_seqs (batch size), and max_model_len (context length). Understanding these parameters helps you extract maximum throughput.
Tensor parallelism is crucial for large models. A 70B model doesn't fit on a single A100 (80GB). With tensor parallelism across 8 A100s (640GB total), each GPU holds ~9B parameters, fitting comfortably. Inference latency increases due to communication overhead, but throughput often improves because you can run larger batches. For a 70B model on 8 A100s with max_batch_size=256, you get ~1000+ tokens/sec throughput.
Quantization reduces memory and bandwidth, allowing larger batches or longer contexts. vLLM supports several quantization schemes: AWQ (Activation-aware Weight Quantization), GPTQ (Generative Pre-trained Transformer Quantization), and FP8. With 4-bit quantization (AWQ/GPTQ), a 70B model uses ~35GB (vs 140GB FP16), enabling single-GPU deployment or larger batches.
max_num_seqs controls maximum batch size and is the primary lever for throughput vs latency trade-off. Higher max_num_seqs increases throughput but increases latency for individual requests (more requests in queue waiting to be served). Typical values: 256 for high throughput, 64 for balanced, 16 for low latency.
Tuning Strategy: Start with max_num_seqs=64, tensor_parallel_size = ceil(model weight bytes / 48GB) (leaving per-GPU headroom for KV cache on 80GB cards), and gpu_memory_utilization=0.9. Monitor throughput: on CUDA OOM, reduce max_num_seqs; if GPUs sit underutilized, increase it. For latency-sensitive workloads, reduce max_num_seqs and increase tensor parallelism.
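The rule of thumb above can be written down directly (a sketch; 48GB usable per 80GB GPU is an assumption leaving KV cache headroom, and real tensor-parallel sizes are usually rounded up to a power of two):

```python
import math

def starting_config(n_params_billion, bytes_per_param=2, usable_gb=48):
    """First-pass vLLM settings derived from the model's weight footprint."""
    weight_gb = n_params_billion * bytes_per_param   # FP16 = 2 bytes/param
    return {
        "tensor_parallel_size": math.ceil(weight_gb / usable_gb),
        "max_num_seqs": 64,
        "gpu_memory_utilization": 0.9,
    }

cfg = starting_config(70)            # 140 GB of FP16 weights
print(cfg["tensor_parallel_size"])   # 3 -> round up to 4 GPUs in practice
```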
Beyond basic serving, vLLM offers several advanced features for production use: speculative decoding, LoRA serving, prefix caching, and multimodal support. These features enable higher throughput, lower costs, and support for more complex workloads.
Speculative decoding uses a small "draft model" to generate candidate tokens, which the larger model then verifies in a single forward pass. This parallelizes decode steps, reducing total generation time by 2-3x when the draft model's predictions are frequently accepted. The draft model might be a smaller quantized version or a different model entirely.
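The accept/reject logic can be sketched in a few lines (illustrative; real implementations compare token probabilities rather than booleans):

```python
def tokens_per_target_pass(draft_proposals, target_agrees):
    """One speculative round: accept the longest agreed prefix of the
    draft's proposals, plus the target model's own next token.
    Without speculation this would always be 1 token per pass."""
    accepted = 0
    for _token, ok in zip(draft_proposals, target_agrees):
        if not ok:
            break
        accepted += 1
    return accepted + 1

# Draft proposes 4 tokens; the target model agrees with the first 3:
print(tokens_per_target_pass("abcd", [True, True, True, False]))  # 4
```

When acceptance rates are high, each expensive target-model pass yields several tokens, which is where the 2-3x speedup comes from.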
LoRA (Low-Rank Adaptation) serving allows serving multiple fine-tuned variants of a model without loading each separately. Multiple requests can share the base model but use different LoRA adapters. This is crucial for multi-tenant scenarios where each customer has their own fine-tuned model. vLLM multiplexes adapter weights efficiently.
Prefix caching automatically detects common prompts (system messages, retrieved documents) and caches their KV values. Subsequent requests with the same prefix reuse cached KV, reducing computation and memory. For RAG systems where documents are retrieved and prepended to prompts, this provides significant speedup.
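Block-level prefix caching can be sketched with cumulative hashing (a toy version of the idea, not vLLM's code): each full block is keyed by the hash of everything up to and including it, so a cached block is reusable only when the entire prefix before it matches.

```python
import hashlib

BLOCK = 16  # tokens per KV block

def prefix_block_keys(token_ids):
    """Cumulative per-block cache keys for a token sequence."""
    keys, h = [], hashlib.sha256()
    for start in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[start:start + BLOCK]).encode())
        keys.append(h.copy().hexdigest())
    return keys

system_prompt = list(range(64))               # stand-in for a tokenized prompt
a = prefix_block_keys(system_prompt + [7, 8, 9] * 16)
b = prefix_block_keys(system_prompt + [1, 2, 3] * 16)
shared = sum(1 for x, y in zip(a, b) if x == y)
print(shared)  # 4 -> the 64-token shared system prompt (4 blocks) is reusable
```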
Multimodal support in vLLM (still evolving) includes vision models that process images and text. vLLM handles image tokenization and KV caching for visual tokens, allowing efficient serving of models like LLaVA or GPT-4V-style architectures.
LoRA Serving Benefit: Instead of loading 8 separate 70B fine-tuned models in FP16 (8 × 140GB = 1120GB), load the base 70B model once (140GB) plus 8 LoRA adapters (~200MB each, 1.6GB total). Memory: 141.6GB vs 1120GB, a reduction of roughly 87%.
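The arithmetic, as a quick check (a 140GB FP16 base and ~200MB per adapter are the assumptions here; actual adapter size depends on rank and target modules):

```python
def serving_memory_gb(n_variants, base_gb=140, adapter_gb=0.2, share_base=True):
    """Memory to serve n fine-tuned variants of one base model."""
    if share_base:
        return round(base_gb + n_variants * adapter_gb, 1)  # base + adapters
    return n_variants * base_gb                             # n full copies

lora = serving_memory_gb(8)                      # shared base + 8 adapters
full = serving_memory_gb(8, share_base=False)    # 8 full FP16 copies
print(lora, full, round(1 - lora / full, 2))     # 141.6 1120 0.87
```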
Several alternatives to vLLM exist for serving large language models: Text Generation Inference (TGI), TensorRT-LLM, Ollama, and llama.cpp. Each has different strengths, use cases, and trade-offs. The choice depends on your deployment context, scale, and requirements.
Text Generation Inference (HuggingFace) is a robust production-grade server with features similar to vLLM's. It emphasizes stability, offers OpenAI-compatible APIs, and integrates well with the HuggingFace ecosystem. For teams already using HF models and tools, TGI may be preferable. Performance is comparable to vLLM's, though PagedAttention gives vLLM an edge at very high batch sizes.
TensorRT-LLM is NVIDIA's optimization engine for LLM inference. It provides superior performance on NVIDIA hardware through compiled kernels and advanced optimizations. However, it requires a TensorRT model export step, more infrastructure knowledge, and runs only on NVIDIA GPUs. For maximum performance on H100s, TensorRT-LLM is hard to beat.
Ollama and llama.cpp are lightweight, single-machine solutions optimized for CPU and consumer GPUs. They're ideal for local development, research, and resource-constrained environments. Performance is lower than vLLM/TGI, but they're simple to run locally without containers or complex setup.
| Feature | vLLM | TGI | TensorRT-LLM | Ollama | llama.cpp |
|---|---|---|---|---|---|
| Throughput (tok/s) | 800-1200 | 700-1000 | 1000-2000+ | 50-200 | 10-100 |
| Latency (P99) | 200-500ms | 250-600ms | 100-300ms | 1-5s | 5-30s |
| Setup Complexity | Easy | Easy | Hard | Trivial | Trivial |
| GPU Required | Yes (CUDA) | Yes (CUDA) | Yes (NVIDIA only) | Optional | Optional |
| Multi-GPU | Yes (Tensor parallel) | Yes (Tensor parallel) | Yes (Multi-method) | No | No |
| Quantization Support | AWQ, GPTQ, FP8 | GPTQ, bitsandbytes | INT8, INT4 | GGUF (builtin) | GGUF (builtin) |
| LoRA Support | Yes | Limited | No | No | No |
| OpenAI API | Yes | Yes | Via wrapper | Yes | Yes (llama-server) |
| Use Case | Production at scale | Production (HF ecosystem) | Max performance (NVIDIA) | Local development | Lightweight, offline |
vLLM is best suited for production serving at scale, especially for high-throughput batch scenarios. Its PagedAttention innovation, continuous batching, and comprehensive feature set make it the go-to choice for most organizations. If you're deploying in cloud (AWS, GCP, Azure), vLLM is the safest bet.
For research and prototyping, Ollama or llama.cpp on a laptop or local GPU is unbeatable for simplicity. For NVIDIA-specific optimization, TensorRT-LLM will give you 20-30% better performance. For teams integrated into HuggingFace ecosystem, TGI is a natural fit.
Decision Framework: Need max performance on H100s? TensorRT-LLM. Need easy setup for development? Ollama/llama.cpp. Need production at scale with multiple models/LoRAs? vLLM. Already using HF stack extensively? TGI.
Running vLLM in production requires visibility into four key metrics: GPU memory utilisation (target 85–90% to leave headroom for KV cache spikes), request queue depth (leading indicator of throughput saturation), prefill vs decode latency split (long prefill = large prompts; long decode = high output token count), and cache hit rate (measures prefix-caching effectiveness for repeated system prompts).
A minimal sidecar exporter that re-publishes these metrics for Prometheus:

```python
from prometheus_client import start_http_server, Gauge
import requests, time

VLLM_BASE = "http://localhost:8000"

gpu_mem = Gauge("vllm_gpu_mem_pct", "GPU KV cache utilisation %")
queue = Gauge("vllm_queue_depth", "Pending requests in queue")
cache_hit = Gauge("vllm_prefix_cache_hit_rate", "Prefix cache hit rate")

def scrape_vllm_metrics():
    # vLLM exposes Prometheus-format metrics on its own /metrics endpoint;
    # re-export the three we alert on under stable gauge names.
    r = requests.get(f"{VLLM_BASE}/metrics", timeout=2)
    for line in r.text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        value = float(line.split()[-1])
        if "vllm:gpu_cache_usage_perc" in line:
            gpu_mem.set(value * 100)
        elif "vllm:num_requests_waiting" in line:
            queue.set(value)
        elif "vllm:cpu_prefix_cache_hit_rate" in line:
            cache_hit.set(value)

start_http_server(9090)  # expose re-exported gauges to Prometheus
while True:
    scrape_vllm_metrics()
    time.sleep(15)
```
Set alerts on queue depth > 50 (scale-out trigger) and GPU memory > 95% (OOM risk). For high-concurrency deployments, run multiple vLLM processes behind a load balancer and route by session ID to maximise prefix-cache hits — sending the same user's requests to the same vLLM instance avoids cold-cache penalties on repeated system prompts.