Throughput, latency, and cost — how to run LLMs at scale in production
TTFT (Time to First Token): Latency from request arrival to the first output token. This is what the user perceives — how responsive the system feels. A 500ms TTFT is acceptable. A 5-second TTFT feels slow.
TBT (Time Between Tokens) / ITL (Inter-Token Latency): Latency between each subsequent token. This drives the streaming experience. If TBT is 50ms, the user sees about 20 tokens per second on screen — a natural reading pace. If TBT is 200ms, the text appears in bursts, feeling sluggish.
Throughput: Total tokens generated per second across all concurrent requests. This drives cost efficiency. Generate 500 tokens/sec/GPU and your GPU is fully utilized. Generate 50 tokens/sec/GPU and you're wasting 90% of its potential.
The fundamental tension: maximizing throughput requires large batches, which delays responses (high TTFT). Minimizing TTFT requires small batches, which wastes GPU utilization. Production systems must balance all three metrics within SLA bounds.
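These metrics compose simply: for a streamed response of N tokens, end-to-end latency is roughly TTFT + (N - 1) x TBT. A quick sanity-check helper (the numbers below are illustrative, not benchmarks):

```python
def response_latency_s(ttft_ms: float, tbt_ms: float, n_tokens: int) -> float:
    """End-to-end latency: time to first token, then one inter-token gap
    for each remaining token."""
    return (ttft_ms + (n_tokens - 1) * tbt_ms) / 1000

# 500 ms TTFT + 50 ms TBT: a 256-token answer streams in ~13.25 s,
# but the user starts reading after half a second.
fast = response_latency_s(500, 50, 256)

# Same TTFT with 200 ms TBT: ~51.5 s total; the stream itself feels sluggish.
slow = response_latency_s(500, 200, 256)
```

This is why TTFT and TBT are tracked separately: the same total latency feels very different depending on how early the first token arrives.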
Naive batching: collect requests until batch is full (say, 32), run inference together, return all results. Problem: the GPU sits idle while waiting for the batch to fill. Throughput suffers.
Continuous batching (Orca 2022): Don't wait. Start processing requests immediately. As one request finishes generating a token, remove it from the active batch. As new requests arrive, add them to fill empty slots. The batch is always full, always moving.
This is deceptively powerful. All major serving frameworks (vLLM, TGI, TRT-LLM, SGLang) implement continuous batching. Without it, you leave 5–10× throughput on the table.
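The scheduling idea fits in a few lines. A toy simulation (not any framework's actual scheduler) where finished sequences free batch slots that queued requests fill in the same step:

```python
from collections import deque

def simulate_continuous_batching(request_lengths: list[int], batch_size: int) -> int:
    """Toy simulation: each step, every active request decodes one token.
    Finished requests leave the batch; queued requests immediately take
    the freed slots, so the batch stays as full as possible."""
    waiting = deque(request_lengths)   # tokens left to generate, per request
    active: list[int] = []
    steps = 0
    while waiting or active:
        # Fill empty slots before the decode step
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step for everyone; requests on their last token finish
        active = [r - 1 for r in active if r > 1]
        steps += 1
    return steps

# 8 requests of mixed lengths, batch of 4: short requests finish early
# and hand their slots to queued requests mid-flight.
steps = simulate_continuous_batching([10, 3, 7, 2, 9, 4, 6, 5], batch_size=4)
```

With naive batching, the same workload would wait for full batches and pay for the longest sequence in each one; here every slot does useful work every step.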
Key insight: prefill (processing input tokens) is compute-bound. Decode (generating tokens one-by-one) is memory-bandwidth-bound. Modern systems separate them: prefill on one GPU pool, decode on another. Each optimized independently.
Disaggregated serving takes this further: run prefill and decode on completely separate GPU clusters. Prefill gets high-end GPUs with large memory for long prompts. Decode gets bandwidth-optimized GPUs for throughput. Route requests through a scheduler.
```python
from openai import OpenAI  # vLLM exposes an OpenAI-compatible API
import requests

# Point to the local vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

def stream_from_vllm(prompt: str, model: str = "mistral-7b"):
    """Stream tokens from a locally-served vLLM model."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        temperature=0.0,
        stream=True,
    )
    full_response = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        full_response += delta
    print()
    return full_response

# Example
response = stream_from_vllm("Explain speculative decoding in 3 bullet points.")

# Check server metrics
metrics = requests.get("http://localhost:8000/metrics").text
# Returns Prometheus metrics: vllm:num_requests_running, gpu_cache_usage_perc, etc.
```
The KV cache — stored key and value tensors for each token in the context — is the primary memory consumer during inference. As sequences get longer (and they do — 8K, 32K, 128K contexts are becoming standard), KV cache dominates.
With fixed-size KV cache pre-allocated per request, varying sequence lengths cause fragmentation. A request with 1K tokens uses 1K of 4K allocated cache — 3K wasted. Multiply across 100 concurrent requests and you've wasted significant memory.
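The numbers are easy to estimate: KV cache per sequence is 2 (one K and one V tensor) x layers x KV heads x head dimension x sequence length x bytes per element. A back-of-envelope calculator, using a Llama-2-7B-style shape purely as an illustrative assumption:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: two tensors (K and V)
    per layer per token, in bytes (default FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-2-7B-ish shape: 32 layers, 32 KV heads, head_dim 128, FP16
per_4k_seq = kv_cache_bytes(32, 32, 128, 4096)  # one 4K-token request
gb = per_4k_seq / 2**30                         # = 2.0 GiB per request
# 100 concurrent 4K-token requests would need ~200 GiB of KV cache alone,
# before counting the ~14 GB of FP16 weights.
```

This is why long contexts and high concurrency make the KV cache, not the weights, the binding memory constraint.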
PagedAttention (vLLM): Allocate KV cache in fixed-size "pages" (like OS virtual memory), each page holding the keys and values for a small fixed block of tokens (e.g., 16). Pages are allocated on demand: when a sequence needs more, it gets more. Internal fragmentation shrinks to at most the last, partially filled page.
Benefits cascade. Prefix sharing: requests with the same prompt prefix share the same physical pages, so a system prompt that appears in every request is cached once rather than once per request, eliminating nearly all of its KV memory cost. Beam search: explore multiple hypotheses without memory blowup, since every hypothesis shares the prefix pages.
| Approach | Fragmentation | Prefix sharing | Complexity |
|---|---|---|---|
| Fixed pre-allocated | High | No | Low |
| Dynamic per-request | Medium | No | Medium |
| PagedAttention | Minimal | Yes | High |
| Chunked prefill | Low | Partial | Medium |
PagedAttention is non-trivial to implement (block tables, fancy indexing into pages), but the memory savings are substantial. Much of vLLM's adoption traces back to getting PagedAttention right.
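The bookkeeping can be sketched in miniature: a pool of physical pages, sequences holding lists of page indices, and reference counts so shared prefixes are freed only when the last user releases them. This is a toy (`PAGE_SIZE` and the API are illustrative, far simpler than vLLM's real block manager):

```python
PAGE_SIZE = 16  # tokens per page (illustrative)

class PagedKVAllocator:
    """Toy page allocator: each sequence maps to a list of physical pages.
    Shared prefixes reuse the same physical pages via reference counts."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.refcount = [0] * num_pages

    def alloc_sequence(self, n_tokens: int) -> list[int]:
        n_pages = -(-n_tokens // PAGE_SIZE)  # ceil division
        pages = [self.free.pop() for _ in range(n_pages)]
        for p in pages:
            self.refcount[p] = 1
        return pages

    def fork(self, pages: list[int]) -> list[int]:
        """Share a prefix: a new sequence points at the same physical pages."""
        for p in pages:
            self.refcount[p] += 1
        return list(pages)

    def free_sequence(self, pages: list[int]) -> None:
        for p in pages:
            self.refcount[p] -= 1
            if self.refcount[p] == 0:
                self.free.append(p)

alloc = PagedKVAllocator(num_pages=256)
system_prompt = alloc.alloc_sequence(48)   # 3 pages for a shared system prompt
req_a = alloc.fork(system_prompt)          # both requests reuse those 3 pages
req_b = alloc.fork(system_prompt)          # zero extra KV memory for the prefix
```

Forking is O(pages) pointer bookkeeping, which is why prefix sharing and beam search come almost for free once paging exists.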
Most large LLMs don't fit on a single GPU, so the weights must be split across several. Four strategies, each with tradeoffs.
Tensor parallelism (TP): Split individual weight matrices across GPUs. Each GPU holds partial weights; all GPUs process the same batch in parallel. After each layer, an all-reduce combines the partial results.
Pipeline parallelism (PP): Split layers across GPUs. GPU 0 runs the first group of layers, GPU 1 the next, and so on. Each GPU processes micro-batches sequentially; activations flow through the pipeline.
Data parallelism (DP): Replicate the full model on each GPU. Each GPU handles a different request (or batch of requests). No inter-GPU communication during the forward pass.
Expert parallelism (EP): For Mixture-of-Experts models, route tokens to expert GPUs. Each GPU holds different expert FFN layers. Activation is sparse: each token routes to a few experts, not all of them.
| Strategy | Model fits GPU? | Comm overhead | Good for |
|---|---|---|---|
| DP | Yes (per replica) | Low | High throughput, multiple replicas |
| TP | No (splits model) | High (all-reduce) | Large models, low latency |
| PP | No (splits layers) | Medium | Very large models |
| TP+PP | No | Highest | Frontier models (175B+) |
In practice, teams use combinations: tensor parallelism within a node (exploiting NVLink), pipeline parallelism across nodes. Larger models need multiple strategies.
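The tensor-parallel combine step can be sketched in pure Python, with lists standing in for GPU shards: each "device" holds a slice of every weight row (the input dimension is split), computes a partial product, and an elementwise sum plays the role of the all-reduce. A toy sketch, not how real kernels are written:

```python
def matvec(W: list[list[float]], x: list[float]) -> list[float]:
    """Dense matrix-vector product: y[i] = sum_j W[i][j] * x[j]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, n_gpus: int) -> list[float]:
    """Split the input dimension across 'GPUs': each holds a slice of every
    row of W plus the matching slice of x, computes a partial output,
    then an elementwise sum (the all-reduce) combines the partials."""
    chunk = len(x) // n_gpus
    partials = []
    for g in range(n_gpus):
        lo, hi = g * chunk, (g + 1) * chunk
        W_shard = [row[lo:hi] for row in W]   # this GPU's slice of every row
        partials.append(matvec(W_shard, x[lo:hi]))
    # all-reduce: elementwise sum of the partial outputs across GPUs
    return [sum(vals) for vals in zip(*partials)]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 1, 1]
assert tensor_parallel_matvec(W, x, n_gpus=2) == matvec(W, x)  # [10, 26]
```

The all-reduce after every layer is why TP communication overhead is high and why TP is usually kept within a node, where NVLink makes that sum cheap.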
Decoding is memory-bandwidth-bound: on every step, the GPU must stream the full model weights from memory to produce just one token per sequence. The bottleneck is weight transfer, not compute.
Speculative decoding: Use a small, fast draft model to propose the next k tokens. The target (main) model verifies all k tokens in parallel. If all k are accepted, you get k tokens in one round-trip instead of k round-trips. In theory, k× speedup.
In practice: 70–85% of proposed tokens are accepted (the draft and target distributions don't match exactly). With k=4 draft tokens and a 75% acceptance rate, you get roughly 2–3× speedup with no quality loss (the output distribution is provably identical to the target model's).
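The expected gain follows from a short calculation. Under the simplifying assumption that each draft token is accepted independently with probability alpha, the expected tokens per verification round with k drafts is (1 - alpha^(k+1)) / (1 - alpha): the geometric run of accepted drafts, plus one token the target model always contributes. A sketch that ignores the draft model's own cost:

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass, with k draft
    tokens each accepted independently with probability alpha. The target
    model always contributes one token (on rejection or full acceptance),
    so this is the geometric series 1 + alpha + ... + alpha**k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# 75% acceptance, 4 draft tokens: ~3.05 tokens per target-model pass,
# i.e. ~3x fewer passes, before subtracting draft-model overhead.
speedup = expected_tokens_per_round(0.75, 4)
```

The curve flattens quickly in k: past 4–8 drafts, each extra draft token must survive a longer acceptance run, so aggressive draft lengths buy little.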
Variants: Small draft model (separate model), Medusa (draft heads on the main model, lower overhead), EAGLE (feature-level drafting, higher acceptance rates), prompt lookup decoding (copy from context, minimal compute).
| Variant | Draft source | Speedup | Quality loss | Setup complexity |
|---|---|---|---|---|
| Small draft model | Separate small LLM | 2–3× | None (exact) | Medium |
| Medusa | Extra draft heads | 1.5–2× | None (exact) | Low |
| EAGLE | Feature-level | 3–4× | None (exact) | Medium |
| Prompt lookup | Copies from prompt | 1.5–2.5× | None (exact) | Very low |
Speculative decoding is especially effective for long outputs. Short completions don't benefit as much. Worth measuring on your workload.
Production: deploy multiple replicas of your model. Need intelligent routing to maximize cache hit rates and throughput.
Strategies: Round-robin (simple, ignores load), least-connections (tracks active requests per replica), prefix-aware routing (send requests with same system prompt to same replica — huge cache hit rate improvement).
For chatbots, prefix-aware routing is critical. System prompt appears in every request. If 100 requests with the same system prompt hit 100 different replicas, each one recomputes the system prompt's KV cache. Route to same replica, compute once, share across 100 requests.
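A minimal version of prefix-aware routing just hashes the stable prefix to pick a replica, so identical system prompts always land on the same KV cache. A sketch (replica URLs are hypothetical; real routers also weigh load and fall back when a prefix runs hot):

```python
import hashlib

def route_by_prefix(system_prompt: str, replicas: list[str]) -> str:
    """Deterministically map a prompt prefix to one replica, so its KV
    cache is computed once and reused by every request sharing it."""
    digest = hashlib.sha256(system_prompt.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = ["http://vllm-1:8000", "http://vllm-2:8000", "http://vllm-3:8000"]
target = route_by_prefix("You are a helpful support agent.", replicas)
# Every request carrying this system prompt hits the same replica and
# reuses its cached system-prompt KV pages.
```

The weakness of plain modulo hashing is that adding or removing a replica remaps most prefixes; consistent hashing fixes that at the cost of a little complexity.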
Autoscaling: Scale replicas based on queue depth, not CPU/RAM. GPU serving is queue-depth-driven: requests queue up, replicas drain the queue. Scale up when queue depth exceeds an upper threshold; scale down when it falls below a lower one (the gap between the two prevents flapping).
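The scaling rule itself is a few lines. A sketch with made-up per-replica thresholds (tune them per workload):

```python
def desired_replicas(queue_depth: int, current: int,
                     scale_up_at: int = 20, scale_down_at: int = 2,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Queue-depth autoscaling: grow when requests pile up per replica,
    shrink when replicas drain faster than work arrives. The gap between
    the two thresholds provides hysteresis against flapping."""
    per_replica = queue_depth / max(current, 1)
    if per_replica > scale_up_at:
        current += 1
    elif per_replica < scale_down_at:
        current -= 1
    return max(min_replicas, min(max_replicas, current))

# 90 queued requests over 3 replicas = 30 per replica -> scale up to 4
up = desired_replicas(90, 3)
# 3 queued over 3 replicas = 1 per replica -> scale down to 2
down = desired_replicas(3, 3)
```

Stepping one replica at a time is deliberately conservative: GPU replicas take minutes to warm up, so overshooting in either direction is expensive.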
Model routing by capability: Easy queries (classification, factual lookup) route to small, fast model. Hard queries (reasoning, analysis) route to large, slow model. A tiny router model classifies difficulty (<100ms), then routes to appropriate model. Maximizes throughput, minimizes cost.
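A toy version of that routing step, with a keyword heuristic standing in for the real classifier model (model names and markers are illustrative):

```python
HARD_MARKERS = ("why", "explain", "analyze", "compare", "prove", "derive")

def route_model(query: str) -> str:
    """Toy difficulty router: long or reasoning-flavored queries go to the
    large model, everything else to the small one. In production this
    check is a small, fast classifier model, not a keyword list."""
    q = query.lower()
    if len(q.split()) > 30 or any(marker in q for marker in HARD_MARKERS):
        return "large-model"
    return "small-model"

# Cheap factual lookup stays on the small model; reasoning goes large.
easy = route_model("What is the capital of France?")
hard = route_model("Explain the tradeoffs between TP and PP.")
```

Even a mediocre router wins: if 70% of traffic is easy and the small model is 10× cheaper, blended cost drops by more than half.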
```python
import httpx
from dataclasses import dataclass

@dataclass
class VLLMReplica:
    url: str
    weight: float = 1.0
    active_requests: int = 0
    total_requests: int = 0
    error_count: int = 0

class LeastConnectionsBalancer:
    def __init__(self, replicas: list[str]):
        self.replicas = [VLLMReplica(url) for url in replicas]

    def get_replica(self) -> VLLMReplica:
        """Least-connections routing: pick the replica with the fewest active requests."""
        return min(self.replicas, key=lambda r: r.active_requests)

    async def generate(self, payload: dict) -> dict:
        replica = self.get_replica()
        replica.active_requests += 1
        replica.total_requests += 1
        try:
            async with httpx.AsyncClient(timeout=60) as client:
                resp = await client.post(
                    f"{replica.url}/v1/chat/completions",
                    json=payload
                )
                return resp.json()
        except Exception:
            replica.error_count += 1
            raise
        finally:
            replica.active_requests -= 1

    def stats(self) -> list[dict]:
        return [{"url": r.url, "total": r.total_requests,
                 "errors": r.error_count, "active": r.active_requests}
                for r in self.replicas]

# Usage
balancer = LeastConnectionsBalancer([
    "http://vllm-1:8000",
    "http://vllm-2:8000",
    "http://vllm-3:8000",
])
```
Mature open-source and proprietary serving frameworks exist. Each has strengths for different scenarios.
| Framework | Best for | Quantization | Multi-GPU | Production-ready |
|---|---|---|---|---|
| vLLM | General OSS serving | GPTQ, AWQ, FP8 | TP + PP | Yes |
| SGLang | Structured outputs, agents | FP8, GPTQ | TP | Yes |
| TRT-LLM | NVIDIA production | FP8, INT8 | TP + PP | Yes (complex) |
| TGI | HuggingFace models | AWQ, GPTQ | TP | Yes |
| Ollama | Local/dev use | GGUF | Limited | No (dev only) |
Starting point: vLLM. Default choice. Mature, well-documented, supports most models and quantizations.
Need structured outputs: SGLang. Native grammar constraints, agents, tree search.
Maximum performance on NVIDIA: TensorRT-LLM. Complex, but squeezes every bit of performance.
Already using HuggingFace pipeline: TGI. Integrates naturally.
Local development: Ollama. Easy setup, not for production.
LLM serving is its own engineering discipline. Here's the progression from first deployment to production-grade serving:
Start on managed APIs: OpenAI, Anthropic, or Google for your first 6 months in production. The operational overhead of self-hosting is real; validate the use case first.
Self-host when the economics flip: At ~$500+/month of API spend on a specific workload, self-hosting Llama or Mistral on a rented A100 typically breaks even. vLLM is the default choice: continuous batching, PagedAttention, easy setup.
Then add speculative decoding: A small draft model proposes tokens; the large model verifies them in parallel. Typical speedup: 2–3× on decode, with identical output quality. Supported natively in vLLM.