Serving at scale — inference engines, hardware, quantization, and cloud deployment
Serving LLMs in production is a distinct engineering problem from training them. The bottleneck is not compute — it is GPU memory bandwidth, request concurrency, and cost per token at scale.
A production serving stack has five layers: (1) an inference engine (vLLM, TGI, SGLang) that handles batching and the KV cache; (2) a routing layer (LiteLLM, LLM Proxy) that selects the right model per request; (3) a caching layer (Redis) that caches tokens and embeddings; (4) hardware (H100s, A100s, multi-GPU nodes); (5) cloud orchestration (Kubernetes, Ray Serve).
Training is compute-bound (FLOP/s). Inference at batch size 1 is memory-bandwidth-bound: every generated token requires streaming all model weights from GPU memory, roughly 2 bytes per parameter at FP16. Inference at large batch sizes becomes compute-bound again, since the same weight reads are amortized across many concurrent requests and the matrix multiplies dominate.
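This bandwidth ceiling is easy to estimate with back-of-the-envelope arithmetic. A minimal sketch (the 7B parameter count and the 2.0 TB/s bandwidth figure are illustrative assumptions, and the bound ignores KV-cache traffic):

```python
# Rough upper bound on batch-1 decode speed: every generated token must
# stream all model weights from GPU memory at least once.
def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                       bandwidth_tb_s: float) -> float:
    bytes_per_token = params_billion * 1e9 * bytes_per_param  # weights read per token
    return bandwidth_tb_s * 1e12 / bytes_per_token

# 7B model in FP16 on a ~2 TB/s GPU (A100-class):
print(f"{max_tokens_per_sec(7, 2.0, 2.0):.0f} tok/s upper bound")  # ~143 tok/s

# The same model quantized to INT4 moves a quarter of the bytes:
print(f"{max_tokens_per_sec(7, 0.5, 2.0):.0f} tok/s upper bound")  # ~571 tok/s
```

This is also why quantization speeds up batch-1 decoding: fewer bytes per parameter means fewer bytes streamed per token.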
Modern inference engines solve the batching problem with continuous batching: instead of waiting for a fixed batch to fill (static batching), new requests join the running batch at token-generation boundaries, and finished requests free their slots immediately. This dramatically improves throughput and GPU utilisation compared to static batching.
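A toy scheduler simulation makes the difference concrete (a sketch with made-up request lengths, not how any real engine is implemented):

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes; short requests
    hold their slots idle in the meantime."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled on the next decode step."""
    pending = list(lengths)
    active, steps = [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        active = [t - 1 for t in active if t > 1]  # one decode step for all active
        steps += 1
    return steps

lengths = [10, 200, 15, 180, 12, 190, 20, 170]  # output tokens per request
print(static_batching_steps(lengths, 4))      # 390 decode steps
print(continuous_batching_steps(lengths, 4))  # 212 decode steps
```

Static batching spends 390 decode steps because short requests hold their slots until the longest request in the batch finishes; continuous batching completes the same work in 212 steps by refilling slots immediately.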
**vLLM.** Continuous batching plus a paged-attention KV cache. Among the fastest options for single-GPU inference, with an OpenAI-compatible API and near-zero setup.
```bash
# Serve a quantized Mistral-7B with continuous batching
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096 \
  --served-model-name mistral-7b \
  --port 8000

# Quick health check
curl http://localhost:8000/health

# Test inference (OpenAI-compatible API)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b",
       "messages": [{"role": "user", "content": "Hello!"}],
       "max_tokens": 64, "temperature": 0.0}'

# Monitor GPU utilisation
watch -n1 'nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader'
```
**Text Generation Inference (TGI).** Hugging Face's production-grade inference server: continuous batching, quantization, watermarking. Rust-based and highly optimized.
**SGLang.** Inference engine optimized for structured outputs: native support for regex-, JSON-schema-, and context-free-grammar-constrained decoding.
Quantization reduces model weight precision (FP32 → INT8, INT4, or even INT2). It cuts VRAM by 2–8× at the cost of a few percent of quality, depending on the method. For large models (70B+), quantization is effectively mandatory for single-GPU inference.
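The VRAM arithmetic behind that claim is simple. A sketch counting weights only (real deployments also need headroom for the KV cache and activations):

```python
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Weight memory only, ignoring KV cache and activation overhead."""
    return params_billion * 1e9 * bits / 8 / 1e9

for bits, name in [(32, "FP32"), (16, "FP16"), (8, "INT8"), (4, "INT4")]:
    print(f"70B @ {name}: {weight_vram_gb(70, bits):5.0f} GB")
# 70B @ FP32:   280 GB  -> multi-node
# 70B @ FP16:   140 GB  -> 2x H100
# 70B @ INT8:    70 GB  -> 1x H100 (80 GB), tight
# 70B @ INT4:    35 GB  -> fits a single 40-48 GB card
```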
| Method | Precision | VRAM reduction | Speed | Quality |
|---|---|---|---|---|
| FP32 (baseline) | 32-bit float | 1× | 1× | 100% |
| FP16/BF16 | 16-bit float | 2× | 1–2× | 99% |
| INT8 (dynamic) | 8-bit int | 4× | 1–1.5× | 98% |
| GPTQ (INT4) | 4-bit int | 8× | 1.2–1.8× | 95–97% |
| AWQ (INT4) | 4-bit int | 8× | 1–1.5× | 96–98% |
| GGUF (INT4/2) | 2–4 bit int | 8–16× | 0.5–1× | 92–96% |
GPU choice for inference is driven by memory bandwidth, not peak FLOP/s, because decoding is bandwidth-bound. An H100 outperforms an A100 at inference largely because of its higher memory bandwidth (3.35 TB/s vs 1.6–2.0 TB/s depending on the A100 variant), not its higher peak compute.
| GPU | VRAM | Bandwidth | Cost/month | Best for |
|---|---|---|---|---|
| A100 (40GB) | 40GB | 1.6 TB/s | $500–700 | Training, large batch inference |
| H100 (80GB) | 80GB | 3.35 TB/s | $1200–1500 | Production inference, training |
| L40S | 48GB | 0.86 TB/s | $300–400 | Multi-modal, mixed workloads |
| RTX 4090 | 24GB | 1.0 TB/s | $150–250 | Fine-tuning, small models |
Data parallelism shards the batch across replicated copies of the model: simple, but it does not speed up a single large request. Tensor parallelism shards each layer's weight matrices across GPUs, so every GPU computes part of every layer; synchronization adds roughly 5–10% overhead. Pipeline parallelism splits layers across GPUs: communication is minimal, but pipeline bubbles make it harder to keep every GPU saturated.
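Column-sharded matrix multiplication, the core of tensor parallelism, can be sketched in a few lines of NumPy (a single-process illustration, not a multi-GPU implementation; the concatenation stands in for the all-gather that real systems run over NVLink):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))     # activations (batch, d_model)
W = rng.standard_normal((512, 2048))  # one layer's weight matrix

# Column-parallel sharding across 2 "GPUs": each holds half the output columns.
W0, W1 = np.split(W, 2, axis=1)
y0 = x @ W0                           # computed on "GPU 0"
y1 = x @ W1                           # computed on "GPU 1"
y = np.concatenate([y0, y1], axis=1)  # all-gather of partial outputs

assert np.allclose(y, x @ W)          # identical to the unsharded computation
print("sharded result matches:", y.shape)
```

Each shard holds only half the weights, which is exactly how tensor parallelism lets a model too large for one GPU's VRAM run across several.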
Deploying LLMs to the cloud requires: (1) containerization (Docker) plus orchestration (Kubernetes); (2) auto-scaling based on queue depth; (3) a model serving framework (Seldon, KServe); (4) monitoring and cost tracking.
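Point (2) deserves emphasis: CPU-utilisation signals are a poor autoscaling trigger for GPU inference. A hypothetical scaling rule keyed to queue depth might look like this (a sketch; the function name, thresholds, and damping policy are all illustrative, mirroring what a KEDA scaler or HPA external metric would compute):

```python
import math

def desired_replicas(queue_depth: int, target_queue_per_replica: int,
                     current: int, min_r: int = 1, max_r: int = 8) -> int:
    """Scale so each replica serves ~target_queue_per_replica queued requests."""
    want = math.ceil(queue_depth / target_queue_per_replica) if queue_depth else min_r
    # Dampen scale-down: drop at most one replica per evaluation cycle,
    # since LLM pods are slow to warm up (weight loading, CUDA graph capture).
    if want < current:
        want = current - 1
    return max(min_r, min(max_r, want))

print(desired_replicas(queue_depth=45, target_queue_per_replica=10, current=2))  # 5
print(desired_replicas(queue_depth=3,  target_queue_per_replica=10, current=5))  # 4
```

The asymmetry (scale up fast, scale down slowly) reflects the cost of cold starts: loading tens of gigabytes of weights takes minutes, while an idle replica only wastes money.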
Serving cost and user experience are dominated by two metrics: time-to-first-token (TTFT) and throughput (tokens/sec). Before choosing a serving solution, benchmark your specific workload: batch size, sequence length, and request concurrency all shift the optimal configuration significantly.
Key metrics to track: TTFT at p50/p95, tokens per second per GPU, GPU memory utilisation, and cost per 1M tokens. A model serving 10 tok/s naively on a single A100 can exceed 80 tok/s with continuous batching on the same hardware; the gap between naive and optimised serving is routinely 5× or more.
```python
import time

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2",
          tensor_parallel_size=1,
          max_model_len=4096)

prompts = ["Explain transformer attention in one paragraph."] * 32
params = SamplingParams(temperature=0.0, max_tokens=256)

# Warm-up run (CUDA graph capture / JIT compilation)
_ = llm.generate(prompts[:2], params)

# Timed benchmark
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Batch size  : {len(prompts)}")
print(f"Total time  : {elapsed:.2f}s")
print(f"Throughput  : {total_tokens/elapsed:.0f} tok/s")
print(f"Avg latency : {elapsed/len(prompts)*1000:.0f} ms/req")
print(f"Cost est.   : ${total_tokens/1e6 * 0.20:.4f} (at $0.20/1M tok)")
```
Go deeper into specific infrastructure domains: