INFRASTRUCTURE & SERVING

LLM Infrastructure

Serving at scale — inference engines, hardware, quantization, and cloud deployment

The metric: tokens/sec × batch size
The constraint: VRAM
The stack: vLLM → cloud
Contents
  1. The serving stack
  2. Inference engines
  3. Quantization
  4. Hardware
  5. Cloud deployment
  6. Benchmarking
  7. Child pages
  8. References
01 — Foundation

The Serving Stack

Serving LLMs in production is a distinct engineering problem from training them. The bottleneck is not compute — it is GPU memory bandwidth, request concurrency, and cost per token at scale.

The serving stack layers: (1) Inference engine (vLLM, TGI, SGLang) handles batching and KV cache. (2) Routing layer (LiteLLM, LLM Proxy) selects the right model. (3) Caching layer (Redis) caches tokens and embeddings. (4) Hardware (H100s, A100s, multi-GPU). (5) Cloud orchestration (Kubernetes, Ray Serve).

The biggest cost lever is not model choice — it is batching. A request processed alone can waste 80%+ of GPU capacity; vLLM's continuous batching fills that gap automatically. Profile tokens/sec and GPU utilization before scaling horizontally.

Bottleneck Shifts

Training: compute-bound (FLOP/s). Inference at batch=1: memory-bandwidth bound — every generated token must stream all model weights from memory (~2 bytes × parameter count at FP16). Inference at large batch sizes: compute-bound again, because the weight reads are amortized across many requests per matrix multiply.
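The batch=1 bound gives a useful back-of-envelope ceiling: each decoded token streams all weights once, so throughput tops out at bandwidth divided by model bytes. A minimal sketch — the GPU bandwidth and model sizes below are assumed example numbers, not measurements:

```python
# Back-of-envelope decode speed at batch size 1: every generated token
# must read all model weights from HBM, so the ceiling is
#   tokens/sec ≈ memory_bandwidth / (bytes_per_param × n_params)

def max_decode_tokens_per_sec(n_params: float, bytes_per_param: float,
                              bandwidth_bytes_per_sec: float) -> float:
    return bandwidth_bytes_per_sec / (bytes_per_param * n_params)

# 7B model in FP16 (2 bytes/param) on an A100-class GPU (~1.6e12 B/s)
fp16 = max_decode_tokens_per_sec(7e9, 2.0, 1.6e12)
# Same model quantized to INT4 (0.5 bytes/param): 4x fewer bytes to stream
int4 = max_decode_tokens_per_sec(7e9, 0.5, 1.6e12)

print(f"FP16 ceiling: ~{fp16:.0f} tok/s")   # ~114 tok/s
print(f"INT4 ceiling: ~{int4:.0f} tok/s")   # ~457 tok/s
```

This also shows why quantization speeds up batch=1 decoding: fewer bytes per parameter means fewer bytes to stream per token.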

02 — Serving Engines

Inference Engines (vLLM, TGI, SGLang)

Modern inference engines solve the batching problem with continuous batching: new requests join the running batch at token boundaries instead of waiting for the current batch to finish. This dramatically improves throughput over static batching (waiting for a batch to fill before starting).
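The throughput gap can be illustrated with an idealized toy model — this is not vLLM's actual scheduler, just the accounting behind it (request lengths and batch size are made up):

```python
import random

# Toy model: each request generates a different number of output tokens.
# Static batching: a batch of B requests holds the GPU for max(lengths)
# steps, so short requests wait on the longest one.
# Continuous batching (idealized): freed slots refill every step, so the
# total is roughly total_tokens / B steps.

def static_steps(lengths, batch_size):
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_steps(lengths, batch_size):
    total = sum(lengths)
    return -(-total // batch_size)  # ceiling division

random.seed(0)
lengths = [random.randint(10, 500) for _ in range(64)]  # output tokens/request
B = 8

s, c = static_steps(lengths, B), continuous_steps(lengths, B)
print(f"static    : {s} decode steps")
print(f"continuous: {c} decode steps")
print(f"speedup   : {s / c:.1f}x")
```

With uniformly random lengths the max of each batch far exceeds its mean, which is exactly the capacity static batching wastes.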

Popular Engines

1. vLLM — Industry standard

Continuous batching plus a paged-attention KV cache. Among the fastest single-GPU engines, with an OpenAI-compatible API and minimal setup.

  • Paged attention: stores the KV cache in fixed-size pages, reducing memory fragmentation
  • Continuous batching: interleaves requests at the token level
  • Supports quantization (GPTQ, AWQ), LoRA adapters, and multi-modal models
  • 10–20× faster than naive (unbatched) inference on typical workloads
Shell · Launch vLLM with AWQ quantisation for a 7B model
# Serve a quantized Mistral-7B with continuous batching
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096 \
  --served-model-name mistral-7b \
  --port 8000

# Quick health check
curl http://localhost:8000/health

# Test inference (OpenAI-compatible API)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 64, "temperature": 0.0}'

# Monitor GPU utilisation
watch -n1 nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
  --format=csv,noheader
2. Text Generation Inference (TGI) — Hugging Face

Production-grade inference server. Continuous batching, quantization, watermarking. Rust-based, highly optimized.

3. SGLang — Structured generation

Inference engine optimized for structured outputs. Native support for regex, JSON schema, context-free grammars.

  • Structured generation enforced during decoding, not via post-hoc filtering
  • Prefix caching: reuses computation across queries that share a prompt prefix
  • RadixAttention: a radix-tree KV cache for efficient prefix sharing
  • Exposes an OpenAI-compatible API, so it can stand in for a vLLM deployment
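A sketch of what a structured-generation request looks like, assuming an SGLang server exposing its OpenAI-compatible endpoint — the exact `response_format` field names vary by version, so treat the wiring below as an assumption to check against your server's docs:

```python
import json

# JSON schema the engine will enforce token-by-token during decoding
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

# Assumed OpenAI-style chat payload; "json_schema" response_format is how
# several engines (SGLang included) accept schema-constrained decoding.
payload = {
    "model": "mistral-7b",
    "messages": [{"role": "user",
                  "content": "Largest city in France, as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": schema},
    },
    "max_tokens": 128,
    "temperature": 0.0,
}

# POST this to the server's /v1/chat/completions endpoint; because decoding
# is constrained, the returned content always parses against the schema.
print(json.dumps(payload, indent=2))
```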
⚠️ vLLM dominance: As of 2025, vLLM is the de facto standard. TGI and SGLang offer specialized features (production hardening, structured output), but vLLM covers 95% of use cases.
03 — Model Compression

Quantization (GPTQ, AWQ, GGUF)

Quantization reduces the precision of model weights (FP32/FP16 → INT8, INT4, or even INT2). It cuts VRAM by 2–8×, typically at under ~5% accuracy loss. For large models (70B+), quantization is effectively mandatory for single-GPU inference.

Quantization Methods

Method          | Precision    | VRAM reduction | Speed    | Quality
FP32 (baseline) | 32-bit float | 1×             | 1×       | 100%
FP16/BF16       | 16-bit float | 2×             | 1–2×     | 99%
INT8 (dynamic)  | 8-bit int    | 4×             | 1–1.5×   | 98%
GPTQ (INT4)     | 4-bit int    | 8×             | 1.2–1.8× | 95–97%
AWQ (INT4)      | 4-bit int    | 8×             | 1–1.5×   | 96–98%
GGUF (INT4/2)   | 2–4 bit int  | 8–16×          | 0.5–1×   | 92–96%
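The VRAM numbers fall out of simple arithmetic — parameters × bits per parameter. A quick sketch (weights only; the KV cache and activations add several GB on top):

```python
# Weight-memory footprint at different precisions (weights only)
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for name, n in [("7B", 7e9), ("70B", 70e9)]:
    print(f"{name}: FP16 {weight_gb(n, 16):.1f} GB | "
          f"INT8 {weight_gb(n, 8):.1f} GB | "
          f"INT4 {weight_gb(n, 4):.1f} GB")
```

This is why a 70B model (140 GB at FP16) needs INT4 (35 GB) to fit a single 80 GB GPU with room left for the KV cache.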

Quantization Frameworks

  • GPTQ — INT4 quantization via gradient information; high quality, widely supported.
  • AWQ (Activation-aware) — INT4 using activation statistics; faster and often higher quality than GPTQ.
  • GGUF (llama.cpp) — INT2–INT4 quantization, optimized for CPU/edge inference.
  • bitsandbytes — 8-bit and 4-bit quantization for training and inference.
ℹ️ When to quantize: If your model fits in VRAM, quantization is optional. If it doesn't, use AWQ or GPTQ INT4 (~4× smaller than FP16 weights). For edge/mobile, use GGUF. Always benchmark on your specific hardware.
04 — Compute

Hardware (GPUs, Memory Bandwidth)

GPU choice is driven by memory bandwidth, not peak FLOP/s, because inference is bandwidth-bound. An H100 delivers roughly 2–3× the inference throughput of an A100, largely thanks to ~2× higher memory bandwidth (3.35 TB/s vs ~1.6–2.0 TB/s).

GPU Comparison

GPU         | VRAM  | Bandwidth | Cost/month | Best for
A100 (40GB) | 40 GB | 1.6 TB/s  | $500–700   | Training, large-batch inference
H100 (80GB) | 80 GB | 3.35 TB/s | $1200–1500 | Production inference, training
L40S        | 48 GB | 0.86 TB/s | $300–400   | Multi-modal, mixed workloads
RTX 4090    | 24 GB | 1.0 TB/s  | $150–250   | Fine-tuning, small models

Multi-GPU Inference

  • Data parallelism: replicate the model, shard the batch across GPUs. Simple, but doesn't speed up a single large request.
  • Tensor parallelism: shard each layer's weights across GPUs; every GPU computes part of every layer. Synchronization overhead ~5–10%.
  • Pipeline parallelism: split whole layers across GPUs. Works for sequential processing but is harder to keep saturated.

⚠️ Multi-GPU scaling: Tensor parallelism across 2–4 GPUs is practical. Beyond that, synchronization kills throughput. For 100+ concurrent users, scale horizontally (more machines) rather than vertically (more GPUs per machine).
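The tensor-parallel memory arithmetic can be sketched directly; the ~6 GB per-GPU allowance for KV cache and activations below is an assumed figure, not a measured one:

```python
# Per-GPU weight memory under tensor parallelism: weights are sharded
# evenly across GPUs, but each GPU still needs headroom for the KV cache
# and activations (modeled here as a flat, assumed overhead).
def per_gpu_gb(n_params: float, bytes_per_param: float,
               tp_degree: int, overhead_gb: float = 6.0) -> float:
    return n_params * bytes_per_param / 1e9 / tp_degree + overhead_gb

# 70B model in FP16 (~140 GB of weights) on 80 GB GPUs
for tp in (1, 2, 4, 8):
    need = per_gpu_gb(70e9, 2.0, tp)
    verdict = "fits on 80GB" if need <= 80 else "does NOT fit on 80GB"
    print(f"TP={tp}: ~{need:.0f} GB per GPU -> {verdict}")
```

TP=2 is the first degree at which a 70B FP16 model fits 80 GB cards — consistent with the advice above that 2–4-way tensor parallelism is the practical sweet spot.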
05 — Orchestration

Cloud Deployment Patterns

Deploying LLMs to cloud requires: (1) Container (Docker) + orchestration (Kubernetes). (2) Auto-scaling based on queue depth. (3) Model serving framework (Seldon, KServe). (4) Monitoring + cost tracking.
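Point (2) — scaling on queue depth rather than CPU — can be sketched as a policy function. The thresholds and step-down rule are assumptions; in Kubernetes this logic would live in an HPA driven by a custom queue-depth metric rather than hand-rolled code:

```python
import math

# Toy autoscaling policy keyed on request-queue depth.
def desired_replicas(queue_depth: int, current: int,
                     target_per_replica: int = 32,
                     min_r: int = 1, max_r: int = 8) -> int:
    # Replicas needed to keep per-replica backlog under the target
    want = math.ceil(queue_depth / target_per_replica) if queue_depth else min_r
    # Scale up immediately, but step down one replica at a time to
    # avoid flapping while in-flight requests drain.
    if want < current:
        want = current - 1
    return max(min_r, min(max_r, want))

print(desired_replicas(queue_depth=200, current=2))  # backlog -> scale to 7
print(desired_replicas(queue_depth=0, current=3))    # idle -> step down to 2
```

Queue depth tracks user-visible latency far better than GPU utilization does, since a saturated vLLM server reports high utilization whether its queue is empty or hours deep.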

Python · vLLM client + LiteLLM routing
# Query a local vLLM server via its OpenAI-compatible API
from openai import OpenAI

vllm_client = OpenAI(
    base_url='http://localhost:8000/v1',
    api_key='ignored',
)

resp = vllm_client.chat.completions.create(
    model='meta-llama/Meta-Llama-3-8B-Instruct',
    messages=[
        {'role': 'user', 'content': 'Explain KV caching in one paragraph.'}
    ],
    max_tokens=256,
)
print(resp.choices[0].message.content)

# Multi-model routing with fallbacks
import litellm

resp2 = litellm.completion(
    model='claude-haiku-4-5-20251001',
    messages=[{'role': 'user', 'content': 'Hello'}],
    fallbacks=['gpt-4o-mini', 'groq/llama3-8b-8192'],
)
print(resp2.choices[0].message.content)

Cloud Deployment Options

  • AWS Lambda + GPU (serverless) — pay-per-request; cold starts ~10 s. Good for bursty traffic.
  • Modal (serverless) — managed inference; auto-scaling; simple Python interface.
  • Replicate (managed) — model hosting; versioning; webhook predictions.
  • Together AI (managed) — inference API; fine-tuning; multi-model deployment.
  • vLLM on Kubernetes (self-hosted) — full control; horizontal scaling via StatefulSets.

Cost Optimization

  1. Profile baseline: measure tokens/sec on a small batch, calculate cost per token, and set a budget limit.
  2. Quantize: if the model doesn't fit, quantize to INT4 (~4× less VRAM than FP16, with ~3–5% quality loss).
  3. Batch aggressively: vLLM batching adds <1 ms latency but increases throughput 3–5×. Batch at the application level if needed.
  4. Model selection: smaller models (8B vs 70B) cost roughly 8× less to serve. A smaller model + RAG often beats a larger model alone.
  5. Caching: cache frequent responses (Redis). Semantic caching also catches near-duplicate queries.
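Step 1's cost-per-token calculation is simple enough to sketch; the $2/h rental price and the 10 vs 80 tok/s figures are assumed examples, not quotes:

```python
# Cost per 1M generated tokens from GPU rental price and measured throughput
def cost_per_million_tokens(gpu_usd_per_hour: float,
                            tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1e6

# Assumed numbers: a $2/h GPU serving 10 tok/s naive vs 80 tok/s batched
naive = cost_per_million_tokens(2.0, 10)
batched = cost_per_million_tokens(2.0, 80)
print(f"naive  : ${naive:.2f} per 1M tokens")
print(f"batched: ${batched:.2f} per 1M tokens")
```

The same hardware at the same hourly price drops from tens of dollars to a few dollars per million tokens once batching lifts throughput, which is why profiling comes before any other optimization.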
06 — Benchmarking

Performance Benchmarking & Cost Optimisation

Inference cost is dominated by two factors: time-to-first-token (TTFT) and throughput (tokens/sec). Before choosing a serving solution, benchmark your specific workload — batch size, sequence length, and request concurrency all shift the optimal configuration significantly.

Key metrics to track: TTFT at p50/p95, tokens-per-second per GPU, GPU memory utilisation, and cost-per-1M tokens. A model serving 10 tok/s on a single A100 can hit 80+ tok/s with continuous batching on the same hardware. The gap between naive and optimised serving is rarely less than 5×.

Python · vLLM throughput benchmark script
import time, statistics
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",
          tensor_parallel_size=1,
          max_model_len=4096)

prompts = ["Explain transformer attention in one paragraph."] * 32
params = SamplingParams(temperature=0.0, max_tokens=256)

# Warm-up run (JIT compilation)
_ = llm.generate(prompts[:2], params)

# Timed benchmark
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)

print(f"Batch size   : {len(prompts)}")
print(f"Total time   : {elapsed:.2f}s")
print(f"Throughput   : {total_tokens/elapsed:.0f} tok/s")
print(f"Avg latency  : {elapsed/len(prompts)*1000:.0f} ms/req")
print(f"Cost est.    : ${total_tokens/1e6 * 0.20:.4f} (at $0.20/1M tok)")
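The script above reports only means; the p50/p95 TTFT figures mentioned earlier need per-request samples. A sketch with the standard library — the latency values here are made up for illustration, and in a real harness each would be a `time.perf_counter()` delta measured around the first streamed token:

```python
import statistics

# Assumed per-request TTFT samples in milliseconds (illustrative only);
# note the long tail -- exactly what a mean would hide.
ttft_ms = [112, 95, 130, 480, 101, 99, 150, 122, 97, 610,
           105, 118, 93, 140, 108, 520, 111, 102, 125, 96]

# quantiles(n=100) returns the 99 percentile cut points
qs = statistics.quantiles(ttft_ms, n=100, method="inclusive")
p50, p95 = qs[49], qs[94]

print(f"TTFT p50: {p50:.0f} ms")
print(f"TTFT p95: {p95:.0f} ms")
```

Tail percentiles matter because a handful of slow requests (queueing behind a long prefill, for instance) dominate perceived latency even when the median looks healthy.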
07 — Explore

Related Topics

Go deeper into specific infrastructure domains:

  • LLM Serving — serving patterns, vLLM internals, batching strategies.
  • Quantization — GPTQ, AWQ, GGUF, quantization-aware training (QAT).
  • Hardware — GPU selection, memory bandwidth, multi-GPU scaling.
  • Cloud & Deployment — Kubernetes, auto-scaling, cost optimization strategies.
08 — Further Reading

References
