Throughput, latency, and cost — how to run LLMs at scale in production
TTFT (Time to First Token): Latency from request arrival to the first output token. This is what the user perceives — how responsive the system feels. A 500ms TTFT is acceptable. A 5-second TTFT feels slow.
TBT (Time Between Tokens) / ITL (Inter-Token Latency): Latency between each subsequent token. This drives the streaming experience. If TBT is 50ms, the user sees about 20 tokens per second on screen — a natural reading pace. If TBT is 200ms, the text appears in bursts, feeling sluggish.
Throughput: Total tokens generated per second across all concurrent requests. This drives cost efficiency. Generate 500 tokens/sec/GPU and your GPU is fully utilized. Generate 50 tokens/sec/GPU and you're wasting 90% of its potential.
The fundamental tension: maximizing throughput requires large batches, which delays responses (high TTFT). Minimizing TTFT requires small batches, which wastes GPU utilization. Production systems must balance all three metrics within SLA bounds.
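These metrics compose simply: for a streamed response of N tokens, end-to-end latency is roughly TTFT + (N - 1) x TBT. A quick sanity-check helper (the numbers below are illustrative, not benchmarks):

```python
def response_latency_s(ttft_ms: float, tbt_ms: float, n_tokens: int) -> float:
    """End-to-end latency: time to first token, then one inter-token gap
    for each remaining token."""
    return (ttft_ms + (n_tokens - 1) * tbt_ms) / 1000

# 500 ms TTFT + 50 ms TBT: a 256-token answer streams in ~13.25 s,
# but the user starts reading after half a second.
fast = response_latency_s(500, 50, 256)

# Same TTFT with 200 ms TBT: ~51.5 s total; the stream itself feels sluggish.
slow = response_latency_s(500, 200, 256)
```

This is why TTFT and TBT are tracked separately: the same total latency feels very different depending on how early the first token arrives.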
Naive batching: collect requests until batch is full (say, 32), run inference together, return all results. Problem: the GPU sits idle while waiting for the batch to fill. Throughput suffers.
Continuous batching (Orca 2022): Don't wait. Start processing requests immediately. As one request finishes generating a token, remove it from the active batch. As new requests arrive, add them to fill empty slots. The batch is always full, always moving.
This is deceptively powerful. All major serving frameworks (vLLM, TGI, TRT-LLM, SGLang) implement continuous batching. Without it, you leave 5–10× throughput on the table.
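The scheduling idea fits in a few lines. A toy simulation (not any framework's actual scheduler) where finished sequences free batch slots that queued requests fill in the same step:

```python
from collections import deque

def simulate_continuous_batching(request_lengths: list[int], batch_size: int) -> int:
    """Toy simulation: each step, every active request decodes one token.
    Finished requests leave the batch; queued requests immediately take
    the freed slots, so the batch stays as full as possible."""
    waiting = deque(request_lengths)   # tokens left to generate, per request
    active: list[int] = []
    steps = 0
    while waiting or active:
        # Fill empty slots before the decode step
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step for everyone; requests on their last token finish
        active = [r - 1 for r in active if r > 1]
        steps += 1
    return steps

# 8 requests of mixed lengths, batch of 4: short requests finish early
# and hand their slots to queued requests mid-flight.
steps = simulate_continuous_batching([10, 3, 7, 2, 9, 4, 6, 5], batch_size=4)
```

With naive batching, the same workload would wait for full batches and pay for the longest sequence in each one; here every slot does useful work every step.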
Key insight: prefill (processing input tokens) is compute-bound. Decode (generating tokens one-by-one) is memory-bandwidth-bound. Modern systems separate them: prefill on one GPU pool, decode on another. Each optimized independently.
Disaggregated serving takes this further: run prefill and decode on completely separate GPU clusters. Prefill gets high-end GPUs with large memory for long prompts. Decode gets bandwidth-optimized GPUs for throughput. Route requests through a scheduler.
```python
from openai import OpenAI  # vLLM exposes an OpenAI-compatible API
import requests

# Point to the local vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require auth by default
)

def stream_from_vllm(prompt: str, model: str = "mistral-7b"):
    """Stream tokens from a locally-served vLLM model."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
        temperature=0.0,
        stream=True,
    )
    full_response = ""
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)
        full_response += delta
    print()
    return full_response

# Example
response = stream_from_vllm("Explain speculative decoding in 3 bullet points.")

# Check server metrics
metrics = requests.get("http://localhost:8000/metrics").text
# Returns Prometheus metrics: vllm:num_requests_running, gpu_cache_usage_perc, etc.
```
The KV cache — stored key and value tensors for each token in the context — is the primary memory consumer during inference. As sequences get longer (and they do — 8K, 32K, 128K contexts are becoming standard), KV cache dominates.
With fixed-size KV cache pre-allocated per request, varying sequence lengths cause fragmentation. A request with 1K tokens uses 1K of 4K allocated cache — 3K wasted. Multiply across 100 concurrent requests and you've wasted significant memory.
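The numbers are easy to estimate: KV cache per sequence is 2 (one K and one V tensor) x layers x KV heads x head dimension x sequence length x bytes per element. A back-of-envelope calculator, using a Llama-2-7B-style shape purely as an illustrative assumption:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size for one sequence: two tensors (K and V)
    per layer per token, in bytes (default FP16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-2-7B-ish shape: 32 layers, 32 KV heads, head_dim 128, FP16
per_4k_seq = kv_cache_bytes(32, 32, 128, 4096)  # one 4K-token request
gb = per_4k_seq / 2**30                         # = 2.0 GiB per request
# 100 concurrent 4K-token requests would need ~200 GiB of KV cache alone,
# before counting the ~14 GB of FP16 weights.
```

This is why long contexts and high concurrency make the KV cache, not the weights, the binding memory constraint.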
PagedAttention (vLLM): Allocate KV cache in fixed-size "pages" (like OS virtual memory), each page holding the keys and values for a small fixed block of tokens (e.g., 16). Pages are allocated on demand: when a sequence needs more, it gets more. Internal fragmentation shrinks to at most the last, partially filled page.
Benefits cascade. Prefix sharing: requests with the same prompt prefix share the same physical pages, so a system prompt that appears in every request is cached once rather than once per request, eliminating nearly all of its KV memory cost. Beam search: explore multiple hypotheses without memory blowup, since every hypothesis shares the prefix pages.
| Approach | Fragmentation | Prefix sharing | Complexity |
|---|---|---|---|
| Fixed pre-allocated | High | No | Low |
| Dynamic per-request | Medium | No | Medium |
| PagedAttention | Minimal | Yes | High |
| Chunked prefill | Low | Partial | Medium |
PagedAttention is non-trivial to implement (block tables, fancy indexing into pages), but the memory savings are substantial. Much of vLLM's adoption traces back to getting PagedAttention right.
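The bookkeeping can be sketched in miniature: a pool of physical pages, sequences holding lists of page indices, and reference counts so shared prefixes are freed only when the last user releases them. This is a toy (`PAGE_SIZE` and the API are illustrative, far simpler than vLLM's real block manager):

```python
PAGE_SIZE = 16  # tokens per page (illustrative)

class PagedKVAllocator:
    """Toy page allocator: each sequence maps to a list of physical pages.
    Shared prefixes reuse the same physical pages via reference counts."""

    def __init__(self, num_pages: int):
        self.free = list(range(num_pages))
        self.refcount = [0] * num_pages

    def alloc_sequence(self, n_tokens: int) -> list[int]:
        n_pages = -(-n_tokens // PAGE_SIZE)  # ceil division
        pages = [self.free.pop() for _ in range(n_pages)]
        for p in pages:
            self.refcount[p] = 1
        return pages

    def fork(self, pages: list[int]) -> list[int]:
        """Share a prefix: a new sequence points at the same physical pages."""
        for p in pages:
            self.refcount[p] += 1
        return list(pages)

    def free_sequence(self, pages: list[int]) -> None:
        for p in pages:
            self.refcount[p] -= 1
            if self.refcount[p] == 0:
                self.free.append(p)

alloc = PagedKVAllocator(num_pages=256)
system_prompt = alloc.alloc_sequence(48)   # 3 pages for a shared system prompt
req_a = alloc.fork(system_prompt)          # both requests reuse those 3 pages
req_b = alloc.fork(system_prompt)          # zero extra KV memory for the prefix
```

Forking is O(pages) pointer bookkeeping, which is why prefix sharing and beam search come almost for free once paging exists.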
Most large LLMs don't fit on a single GPU, so the weights must be split across several. Four strategies, each with tradeoffs.
Tensor parallelism (TP): Split individual weight matrices across GPUs. Each GPU holds partial weights; all GPUs process the same batch in parallel. After each layer, an all-reduce combines the partial results.
Pipeline parallelism (PP): Split layers across GPUs. GPU 0 runs the first group of layers, GPU 1 the next, and so on. Each GPU processes micro-batches sequentially; activations flow through the pipeline.
Data parallelism (DP): Replicate the full model on each GPU. Each GPU handles a different request (or batch of requests). No inter-GPU communication during the forward pass.
Expert parallelism (EP): For Mixture-of-Experts models, route tokens to expert GPUs. Each GPU holds different expert FFN layers. Activation is sparse: each token routes to a few experts, not all of them.
| Strategy | Model fits GPU? | Comm overhead | Good for |
|---|---|---|---|
| DP | Yes (per replica) | Low | High throughput, multiple replicas |
| TP | No (splits model) | High (all-reduce) | Large models, low latency |
| PP | No (splits layers) | Medium | Very large models |
| TP+PP | No | Highest | Frontier models (175B+) |
In practice, teams use combinations: tensor parallelism within a node (exploiting NVLink), pipeline parallelism across nodes. Larger models need multiple strategies.
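The tensor-parallel combine step can be sketched in pure Python, with lists standing in for GPU shards: each "device" holds a slice of every weight row (the input dimension is split), computes a partial product, and an elementwise sum plays the role of the all-reduce. A toy sketch, not how real kernels are written:

```python
def matvec(W: list[list[float]], x: list[float]) -> list[float]:
    """Dense matrix-vector product: y[i] = sum_j W[i][j] * x[j]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def tensor_parallel_matvec(W, x, n_gpus: int) -> list[float]:
    """Split the input dimension across 'GPUs': each holds a slice of every
    row of W plus the matching slice of x, computes a partial output,
    then an elementwise sum (the all-reduce) combines the partials."""
    chunk = len(x) // n_gpus
    partials = []
    for g in range(n_gpus):
        lo, hi = g * chunk, (g + 1) * chunk
        W_shard = [row[lo:hi] for row in W]   # this GPU's slice of every row
        partials.append(matvec(W_shard, x[lo:hi]))
    # all-reduce: elementwise sum of the partial outputs across GPUs
    return [sum(vals) for vals in zip(*partials)]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 1, 1]
assert tensor_parallel_matvec(W, x, n_gpus=2) == matvec(W, x)  # [10, 26]
```

The all-reduce after every layer is why TP communication overhead is high and why TP is usually kept within a node, where NVLink makes that sum cheap.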
Decoding is memory-bandwidth-bound: on every step, the GPU must stream the full model weights from memory to produce just one token per sequence. The bottleneck is weight transfer, not compute.
Speculative decoding: Use a small, fast draft model to propose the next k tokens. The target (main) model verifies all k tokens in parallel. If all k are accepted, you get k tokens in one round-trip instead of k round-trips. In theory, k× speedup.
In practice: 70–85% of proposed tokens are accepted (the draft and target distributions don't match exactly). With k=4 draft tokens and a 75% acceptance rate, you get roughly 2–3× speedup with no quality loss (the output distribution is provably identical to the target model's).
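The expected gain follows from a short calculation. Under the simplifying assumption that each draft token is accepted independently with probability alpha, the expected tokens per verification round with k drafts is (1 - alpha^(k+1)) / (1 - alpha): the geometric run of accepted drafts, plus one token the target model always contributes. A sketch that ignores the draft model's own cost:

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass, with k draft
    tokens each accepted independently with probability alpha. The target
    model always contributes one token (on rejection or full acceptance),
    so this is the geometric series 1 + alpha + ... + alpha**k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# 75% acceptance, 4 draft tokens: ~3.05 tokens per target-model pass,
# i.e. ~3x fewer passes, before subtracting draft-model overhead.
speedup = expected_tokens_per_round(0.75, 4)
```

The curve flattens quickly in k: past 4–8 drafts, each extra draft token must survive a longer acceptance run, so aggressive draft lengths buy little.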
Variants: Small draft model (separate model), Medusa (draft heads on the main model, lower overhead), EAGLE (feature-level drafting, higher acceptance rates), prompt lookup decoding (copy from context, minimal compute).
| Variant | Draft source | Speedup | Quality loss | Setup complexity |
|---|---|---|---|---|
| Small draft model | Separate small LLM | 2–3× | None (exact) | Medium |
| Medusa | Extra draft heads | 1.5–2× | None (exact) | Low |
| EAGLE | Feature-level | 3–4× | None (exact) | Medium |
| Prompt lookup | Copies from prompt | 1.5–2.5× | None (exact) | Very low |
Speculative decoding is especially effective for long outputs. Short completions don't benefit as much. Worth measuring on your workload.
Production: deploy multiple replicas of your model. Need intelligent routing to maximize cache hit rates and throughput.
Strategies: Round-robin (simple, ignores load), least-connections (tracks active requests per replica), prefix-aware routing (send requests with same system prompt to same replica — huge cache hit rate improvement).
For chatbots, prefix-aware routing is critical. System prompt appears in every request. If 100 requests with the same system prompt hit 100 different replicas, each one recomputes the system prompt's KV cache. Route to same replica, compute once, share across 100 requests.
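A minimal version of prefix-aware routing just hashes the stable prefix to pick a replica, so identical system prompts always land on the same KV cache. A sketch (replica URLs are hypothetical; real routers also weigh load and fall back when a prefix runs hot):

```python
import hashlib

def route_by_prefix(system_prompt: str, replicas: list[str]) -> str:
    """Deterministically map a prompt prefix to one replica, so its KV
    cache is computed once and reused by every request sharing it."""
    digest = hashlib.sha256(system_prompt.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]

replicas = ["http://vllm-1:8000", "http://vllm-2:8000", "http://vllm-3:8000"]
target = route_by_prefix("You are a helpful support agent.", replicas)
# Every request carrying this system prompt hits the same replica and
# reuses its cached system-prompt KV pages.
```

The weakness of plain modulo hashing is that adding or removing a replica remaps most prefixes; consistent hashing fixes that at the cost of a little complexity.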
Autoscaling: Scale replicas based on queue depth, not CPU/RAM. GPU serving is queue-depth-driven: requests queue up, replicas drain the queue. Scale up when queue depth exceeds an upper threshold; scale down when it falls below a lower one (the gap between the two prevents flapping).
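The scaling rule itself is a few lines. A sketch with made-up per-replica thresholds (tune them per workload):

```python
def desired_replicas(queue_depth: int, current: int,
                     scale_up_at: int = 20, scale_down_at: int = 2,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    """Queue-depth autoscaling: grow when requests pile up per replica,
    shrink when replicas drain faster than work arrives. The gap between
    the two thresholds provides hysteresis against flapping."""
    per_replica = queue_depth / max(current, 1)
    if per_replica > scale_up_at:
        current += 1
    elif per_replica < scale_down_at:
        current -= 1
    return max(min_replicas, min(max_replicas, current))

# 90 queued requests over 3 replicas = 30 per replica -> scale up to 4
up = desired_replicas(90, 3)
# 3 queued over 3 replicas = 1 per replica -> scale down to 2
down = desired_replicas(3, 3)
```

Stepping one replica at a time is deliberately conservative: GPU replicas take minutes to warm up, so overshooting in either direction is expensive.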
Model routing by capability: Easy queries (classification, factual lookup) route to small, fast model. Hard queries (reasoning, analysis) route to large, slow model. A tiny router model classifies difficulty (<100ms), then routes to appropriate model. Maximizes throughput, minimizes cost.
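A toy version of that routing step, with a keyword heuristic standing in for the real classifier model (model names and markers are illustrative):

```python
HARD_MARKERS = ("why", "explain", "analyze", "compare", "prove", "derive")

def route_model(query: str) -> str:
    """Toy difficulty router: long or reasoning-flavored queries go to the
    large model, everything else to the small one. In production this
    check is a small, fast classifier model, not a keyword list."""
    q = query.lower()
    if len(q.split()) > 30 or any(marker in q for marker in HARD_MARKERS):
        return "large-model"
    return "small-model"

# Cheap factual lookup stays on the small model; reasoning goes large.
easy = route_model("What is the capital of France?")
hard = route_model("Explain the tradeoffs between TP and PP.")
```

Even a mediocre router wins: if 70% of traffic is easy and the small model is 10× cheaper, blended cost drops by more than half.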
```python
import httpx
from dataclasses import dataclass

@dataclass
class VLLMReplica:
    url: str
    weight: float = 1.0
    active_requests: int = 0
    total_requests: int = 0
    error_count: int = 0

class LeastConnectionsBalancer:
    def __init__(self, replicas: list[str]):
        self.replicas = [VLLMReplica(url) for url in replicas]

    def get_replica(self) -> VLLMReplica:
        """Least-connections routing: pick the replica with the fewest active requests."""
        return min(self.replicas, key=lambda r: r.active_requests)

    async def generate(self, payload: dict) -> dict:
        replica = self.get_replica()
        replica.active_requests += 1
        replica.total_requests += 1
        try:
            async with httpx.AsyncClient(timeout=60) as client:
                resp = await client.post(
                    f"{replica.url}/v1/chat/completions",
                    json=payload
                )
                return resp.json()
        except Exception:
            replica.error_count += 1
            raise
        finally:
            replica.active_requests -= 1

    def stats(self) -> list[dict]:
        return [{"url": r.url, "total": r.total_requests,
                 "errors": r.error_count, "active": r.active_requests}
                for r in self.replicas]

# Usage
balancer = LeastConnectionsBalancer([
    "http://vllm-1:8000",
    "http://vllm-2:8000",
    "http://vllm-3:8000",
])
```
Mature open-source and proprietary serving frameworks exist. Each has strengths for different scenarios.
| Framework | Best for | Quantization | Multi-GPU | Production-ready |
|---|---|---|---|---|
| vLLM | General OSS serving | GPTQ, AWQ, FP8 | TP + PP | Yes |
| SGLang | Structured outputs, agents | FP8, GPTQ | TP | Yes |
| TRT-LLM | NVIDIA production | FP8, INT8 | TP + PP | Yes (complex) |
| TGI | HuggingFace models | AWQ, GPTQ | TP | Yes |
| Ollama | Local/dev use | GGUF | Limited | No (dev only) |
Starting point: vLLM. Default choice. Mature, well-documented, supports most models and quantizations.
Need structured outputs: SGLang. Native grammar constraints, agents, tree search.
Maximum performance on NVIDIA: TensorRT-LLM. Complex, but squeezes every bit of performance.
Already using HuggingFace pipeline: TGI. Integrates naturally.
Local development: Ollama. Easy setup, not for production.
LLM serving is its own engineering discipline. Here's the progression from first deployment to production-grade serving:
Start on managed APIs: OpenAI, Anthropic, or Google for your first 6 months in production. The operational overhead of self-hosting is real; validate the use case first.
Self-host when the economics flip: At ~$500+/month of API spend on a specific workload, self-hosting Llama or Mistral on a rented A100 typically breaks even. vLLM is the default choice: continuous batching, PagedAttention, easy setup.
Then add speculative decoding: A small draft model proposes tokens; the large model verifies them in parallel. Typical speedup: 2–3× on decode, with identical output quality. Supported natively in vLLM.