Serving at scale — inference engines, hardware, quantization, and cloud deployment
Serving LLMs in production is a distinct engineering problem from training them. The bottleneck is not compute — it is GPU memory bandwidth, request concurrency, and cost per token at scale.
A production serving stack has five layers: (1) an inference engine (vLLM, TGI, SGLang) that handles batching and the KV cache; (2) a routing layer (LiteLLM, LLM Proxy) that selects the right model per request; (3) a caching layer (Redis) that caches tokens and embeddings; (4) hardware (H100s, A100s, multi-GPU nodes); (5) cloud orchestration (Kubernetes, Ray Serve).
Training is compute-bound (FLOP/s). Inference at batch size 1 is memory-bandwidth-bound: every generated token requires streaming all model weights from GPU memory, roughly 2 bytes per parameter at FP16. Inference at large batch sizes becomes compute-bound again, since the same weight reads are amortized across many concurrent requests and the matrix multiplies dominate.
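This bandwidth ceiling is easy to estimate with back-of-the-envelope arithmetic. A minimal sketch (the 7B parameter count and the 2.0 TB/s bandwidth figure are illustrative assumptions, and the bound ignores KV-cache traffic):

```python
# Rough upper bound on batch-1 decode speed: every generated token must
# stream all model weights from GPU memory at least once.
def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                       bandwidth_tb_s: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_param  # weights read per token
    return bandwidth_tb_s * 1e12 / bytes_per_token

# 7B model in FP16 on a ~2 TB/s GPU (A100-class):
print(f"{max_tokens_per_sec(7, 2.0, 2.0):.0f} tok/s upper bound")  # ~143 tok/s

# The same model quantized to INT4 moves a quarter of the bytes:
print(f"{max_tokens_per_sec(7, 0.5, 2.0):.0f} tok/s upper bound")  # ~571 tok/s
```

This is also why quantization speeds up batch-1 decoding: fewer bytes per parameter means fewer bytes streamed per token.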
Modern inference engines solve the batching problem with continuous batching: instead of waiting for a fixed batch to fill (static batching), new requests join the running batch at token-generation boundaries, and finished requests free their slots immediately. This dramatically improves throughput and GPU utilisation compared to static batching.
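A toy scheduler simulation makes the difference concrete (a sketch with made-up request lengths, not how any real engine is implemented):

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; short requests
    hold their slots idle in the meantime."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled on the next decode step."""
    pending = list(lengths)
    active, steps = [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        active = [t - 1 for t in active if t > 1]  # one decode step for all active
        steps += 1
    return steps

lengths = [10, 200, 15, 180, 12, 190, 20, 170]  # output tokens per request
print(static_batching_steps(lengths, 4))      # 390 decode steps
print(continuous_batching_steps(lengths, 4))  # 212 decode steps
```

Static batching spends 390 decode steps because short requests hold their slots until the longest request in the batch finishes; continuous batching completes the same work in 212 steps by refilling slots immediately.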
**vLLM.** Continuous batching plus a paged-attention KV cache. Among the fastest options for single-GPU inference, with an OpenAI-compatible API and near-zero setup.
```bash
# Serve a quantized Mistral-7B with continuous batching
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096 \
  --served-model-name mistral-7b \
  --port 8000

# Quick health check
curl http://localhost:8000/health

# Test inference (OpenAI-compatible API)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 64, "temperature": 0.0}'

# Monitor GPU utilisation
watch -n1 'nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader'
```
**Text Generation Inference (TGI).** Hugging Face's production-grade inference server: continuous batching, quantization, watermarking. Rust-based and highly optimized.
**SGLang.** Inference engine optimized for structured outputs: native support for regex-, JSON-schema-, and context-free-grammar-constrained decoding.
Quantization reduces model weight precision (FP32 → INT8, INT4, or even INT2). It cuts VRAM by 2–8× at the cost of a few percent of quality, depending on the method. For large models (70B+), quantization is effectively mandatory for single-GPU inference.
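The VRAM arithmetic behind that claim is simple. A sketch counting weights only (real deployments also need headroom for the KV cache and activations):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Weight memory only, ignoring KV cache and activation overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits, name in [(32, "FP32"), (16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"70B @ {name}: {weight_vram_gb(70, bits):5.0f} GB")
# 70B @ FP32:   280 GB  -> multi-node
# 70B @ FP16:   140 GB  -> 2x H100
# 70B @ INT8:    70 GB  -> 1x H100 (80 GB), tight
# 70B @ INT4:    35 GB  -> fits a single 40-48 GB card
```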
| Method | Precision | VRAM reduction | Speed | Quality |
|---|---|---|---|---|
| FP32 (baseline) | 32-bit float | 1× | 1× | 100% |
| FP16/BF16 | 16-bit float | 2× | 1–2× | 99% |
| INT8 (dynamic) | 8-bit int | 4× | 1–1.5× | 98% |
| GPTQ (INT4) | 4-bit int | 8× | 1.2–1.8× | 95–97% |
| AWQ (INT4) | 4-bit int | 8× | 1–1.5× | 96–98% |
| GGUF (INT4/2) | 2–4 bit int | 8–16× | 0.5–1× | 92–96% |
GPU choice for inference is driven by memory bandwidth, not peak FLOP/s, because decoding is bandwidth-bound. An H100 outperforms an A100 at inference largely because of its higher memory bandwidth (3.35 TB/s vs 1.6–2.0 TB/s depending on the A100 variant), not its higher peak compute.
| GPU | VRAM | Bandwidth | Cost/month | Best for |
|---|---|---|---|---|
| A100 (40GB) | 40GB | 1.6 TB/s | $500–700 | Training, large batch inference |
| H100 (80GB) | 80GB | 3.35 TB/s | $1200–1500 | Production inference, training |
| L40S | 48GB | 0.86 TB/s | $300–400 | Multi-modal, mixed workloads |
| RTX 4090 | 24GB | 1.0 TB/s | $150–250 | Fine-tuning, small models |
Data parallelism shards the batch across replicated copies of the model: simple, but it does not speed up a single large request. Tensor parallelism shards each layer's weight matrices across GPUs, so every GPU computes part of every layer; synchronization adds roughly 5–10% overhead. Pipeline parallelism splits layers across GPUs: communication is minimal, but pipeline bubbles make it harder to keep every GPU saturated.
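Column-sharded matrix multiplication, the core of tensor parallelism, can be sketched in a few lines of NumPy (a single-process illustration, not a multi-GPU implementation; the concatenation stands in for the all-gather that real systems run over NVLink):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))     # activations (batch, d_model)
W = rng.standard_normal((512, 2048))  # one layer's weight matrix

# Column-parallel sharding across 2 "GPUs": each holds half the output columns.
W0, W1 = np.split(W, 2, axis=1)
y0 = x @ W0                           # computed on "GPU 0"
y1 = x @ W1                           # computed on "GPU 1"
y = np.concatenate([y0, y1], axis=1)  # all-gather of partial outputs

assert np.allclose(y, x @ W)          # identical to the unsharded computation
print("sharded result matches:", y.shape)
```

Each shard holds only half the weights, which is exactly how tensor parallelism lets a model too large for one GPU's VRAM run across several.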
Deploying LLMs to the cloud requires: (1) containerization (Docker) plus orchestration (Kubernetes); (2) auto-scaling based on queue depth; (3) a model serving framework (Seldon, KServe); (4) monitoring and cost tracking.
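Point (2) deserves emphasis: CPU-utilisation signals are a poor autoscaling trigger for GPU inference. A hypothetical scaling rule keyed to queue depth might look like this (a sketch; the function name, thresholds, and damping policy are all illustrative, mirroring what a KEDA scaler or HPA external metric would compute):

```python
import math

def desired_replicas(queue_depth: int, target_queue_per_replica: int,
                     current: int, min_r: int = 1, max_r: int = 8) -> int:
    """Scale so each replica serves ~target_queue_per_replica queued requests."""
    want = math.ceil(queue_depth / target_queue_per_replica) if queue_depth else min_r
    # Dampen scale-down: drop at most one replica per evaluation cycle,
    # since LLM pods are slow to warm up (weight loading, CUDA graph capture).
    if want < current:
        want = current - 1
    return max(min_r, min(max_r, want))

print(desired_replicas(queue_depth=45, target_queue_per_replica=10, current=2))  # 5
print(desired_replicas(queue_depth=3,  target_queue_per_replica=10, current=5))  # 4
```

The asymmetry (scale up fast, scale down slowly) reflects the cost of cold starts: loading tens of gigabytes of weights takes minutes, while an idle replica only wastes money.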
Serving cost and user experience are dominated by two metrics: time-to-first-token (TTFT) and throughput (tokens/sec). Before choosing a serving solution, benchmark your specific workload: batch size, sequence length, and request concurrency all shift the optimal configuration significantly.
Key metrics to track: TTFT at p50/p95, tokens per second per GPU, GPU memory utilisation, and cost per 1M tokens. A model serving 10 tok/s naively on a single A100 can exceed 80 tok/s with continuous batching on the same hardware; the gap between naive and optimised serving is routinely 5× or more.
```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",
          tensor_parallel_size=1,
          max_model_len=4096)

prompts = ["Explain transformer attention in one paragraph."] * 32
params = SamplingParams(temperature=0.0, max_tokens=256)

# Warm-up run (CUDA graph capture / JIT compilation)
_ = llm.generate(prompts[:2], params)

# Timed benchmark
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Batch size  : {len(prompts)}")
print(f"Total time  : {elapsed:.2f}s")
print(f"Throughput  : {total_tokens/elapsed:.0f} tok/s")
print(f"Avg latency : {elapsed/len(prompts)*1000:.0f} ms/req")
print(f"Cost est.   : ${total_tokens/1e6 * 0.20:.4f} (at $0.20/1M tok)")
```
Go deeper into specific infrastructure domains: