Running models at scale is harder than training them. Learn infrastructure, system design, and production engineering for reliability and cost.
Running LLMs in production is fundamentally different from running traditional software. Traditional software is deterministic (same input → same output) and fast. LLMs are probabilistic, slow, and expensive. This changes everything about infrastructure, monitoring, and reliability.
LLMs are slow. Generating one token takes ~10-100ms on modern hardware, so a 100-token response takes 1-10 seconds, while users expect responses within 1-3 seconds. Solutions: streaming (show tokens as they arrive), smaller models, and inference optimization (quantization, distillation, KV caching).
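The streaming fix is worth seeing concretely. Below is a minimal sketch of measuring time-to-first-token (TTFT) over a token stream; the generator here only simulates a model's stream, whereas a real client would iterate over a streaming response such as OpenAI's `stream=True` chunks:

```python
import time

def stream_tokens(chunks):
    """Simulated token stream; swap in a real streaming API iterator."""
    for chunk in chunks:
        time.sleep(0.01)  # stand-in for per-token generation latency
        yield chunk

def consume_stream(stream):
    """Collect tokens while recording time-to-first-token and total latency."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # user sees output from here on
        parts.append(token)
    total = time.perf_counter() - start
    return "".join(parts), ttft, total

text, ttft, total = consume_stream(
    stream_tokens(["KV ", "caching ", "speeds ", "decoding."]))
print(f"TTFT {ttft*1000:.0f}ms, total {total*1000:.0f}ms: {text}")
```

The point of the two numbers: users perceive TTFT, not total latency, which is why streaming makes a 5-second response feel acceptable.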
LLM inference is expensive. Each token costs money (OpenAI: $0.001-0.01 per 1K tokens). A typical conversation costs $0.10-1.00. Multiply by millions of users and costs spiral. Solution: prompt caching, smaller models, local inference, aggressive optimization.
Same prompt → different output. Temperature, sampling strategy, model version all affect results. Traditional software assumes determinism; production LLM systems need to handle randomness. Solutions: temperature control, output parsing, fallback models, human review loops.
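The parse-and-retry pattern above can be sketched as a small wrapper. The "model" here is a stand-in iterator that fails once and then returns valid JSON; in a real system `generate` would be the LLM call, paired with `temperature=0` to reduce run-to-run variance:

```python
import json

def parse_json_output(raw: str) -> dict:
    """Parse model output as JSON, stripping code fences models often add."""
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)

def call_with_retries(generate, max_attempts: int = 3) -> dict:
    """Call generate() and retry until the output parses as JSON."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_json_output(generate())
        except (json.JSONDecodeError, ValueError) as e:
            last_error = e
    raise ValueError(f"No parseable output after {max_attempts} attempts") from last_error

# Stand-in for a flaky model: first output is garbage, second parses.
outputs = iter(["not json", '```json\n{"sentiment": "positive"}\n```'])
result = call_with_retries(lambda: next(outputs))
print(result)
```

Retries cost tokens, so keep `max_attempts` low and escalate to a fallback model or human review if the cap is hit.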
Using API vendors (OpenAI, Anthropic) means your app depends on their availability and pricing. They can change prices, deprecate models, or degrade performance. Solutions: multi-provider routing, fallbacks, caching, budget alerts.
Production GenAI systems rest on three pillars. All three must be strong, or the system fails.
Infrastructure. What: How to run models (cloud GPUs, inference optimizers, caching layers). Key questions: Where do we run models? How do we scale? How do we cache? How do we handle failures? Tools: vLLM, Ollama, Together AI, Baseten, Replicate.
System design. What: How to architect a system that uses LLMs reliably. Key questions: How do we route requests? How do we degrade gracefully? How do we fall back? How do we make LLM outputs deterministic? Patterns: Multi-provider routing, output caching, prompt versioning, circuit breakers.
Production engineering. What: How to monitor, debug, and operate LLM systems. Key questions: What metrics matter? How do we detect failures? How do we debug why an LLM failed? How do we track costs? Tools: Logging, tracing, dashboards, cost monitoring, user feedback loops.
| Pillar | Key Component | Primary Concern | Key Metric |
|---|---|---|---|
| Infrastructure | Model serving (vLLM, TGI) | Throughput & cost | Tokens/sec/GPU |
| Infrastructure | Caching layer | Reducing redundant calls | Cache hit rate |
| System Design | Retry & fallback | Reliability at P99 | Error rate, P99 latency |
| System Design | Async/streaming | Perceived responsiveness | Time-to-first-token |
| Prod Engineering | Observability (LangSmith) | Debug & audit trail | Trace coverage |
| Prod Engineering | Human-in-the-loop | Safety & quality gate | Review queue depth |
Where and how you run models affects everything: latency, cost, control, and reliability.
Managed APIs (OpenAI, Anthropic). Pros: No infrastructure to manage, always up to date, highest-quality models. Cons: No control, vendor lock-in, expensive at scale, rate limits, data privacy concerns. Best for: Prototypes, startups, apps where control isn't critical.
Self-hosted models (vLLM, Ollama, TGI). Pros: Full control, cheaper at scale, data stays local, customizable. Cons: You manage the infrastructure, latency optimization, and model updates. Best for: Production apps, on-prem deployments, privacy-critical work.
Quantization: Reduce model size and latency (int8 instead of fp32: roughly 4x smaller and typically 2-4x faster, with minimal quality loss). Distillation: Train a small model to mimic a large one (smaller, faster, cheaper). Batching: Process multiple requests together (higher throughput). KV Caching: Cache attention keys/values across decoding steps (large speedup for generation).
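To make the quantization idea concrete, here is a toy illustration of symmetric int8 quantization on a flat weight vector. Real systems quantize whole tensors (often per-channel) with libraries like bitsandbytes or GPTQ; this sketch only shows the mapping itself:

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0  # one scale for the whole vector
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [v * scale for v in q]

weights = [0.12, -0.98, 0.5, 0.03]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {max_err:.4f}")
```

Each weight now fits in one byte instead of four, and the reconstruction error is bounded by half a quantization step, which is why quality loss is usually small.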
Don't depend on one vendor. Use a router that tries primary provider (e.g., GPT-4), falls back to secondary (e.g., Claude), then tertiary (e.g., local model). Handles outages, rate limits, and cost optimization.
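A minimal router sketch for this pattern follows; the provider functions below are hypothetical stand-ins for real SDK calls (in production each would wrap an actual client, with rate-limit-aware error handling):

```python
def route(prompt, providers):
    """Try providers in priority order; return the first success.

    `providers` is a list of (name, callable) pairs; callables raise on failure.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as e:  # timeouts, rate limits, outages
            errors.append(f"{name}: {e}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))

# Hypothetical providers: the primary is rate limited, the secondary answers.
def gpt4_call(prompt):
    raise TimeoutError("rate limited")

def claude_call(prompt):
    return f"[claude] {prompt}"

name, answer = route("Summarize KV caching.",
                     [("gpt-4", gpt4_call), ("claude", claude_call)])
print(name, answer)
```

A production router would also track per-provider error rates and open a circuit breaker on a persistently failing provider rather than retrying it on every request.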
```python
import hashlib, time
from openai import OpenAI

client = OpenAI()
_cache: dict = {}  # replace with Redis in production

def semantic_cache_key(prompt: str, model: str) -> str:
    """Exact-match cache key (use embedding similarity for semantic caching)."""
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

def tracked_completion(prompt: str, model: str = "gpt-4o",
                       use_cache: bool = True) -> dict:
    """LLM call with timing, caching, and basic observability."""
    key = semantic_cache_key(prompt, model)
    if use_cache and key in _cache:
        return {"response": _cache[key], "source": "cache", "latency_ms": 0}
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency_ms = (time.perf_counter() - start) * 1000
    content = resp.choices[0].message.content
    _cache[key] = content
    return {
        "response": content,
        "source": "llm",
        "latency_ms": round(latency_ms),
        "tokens": resp.usage.total_tokens,
        "model": model,
    }

result = tracked_completion("What is the KV cache in LLMs?")
print(f"[{result['source']}] {result['latency_ms']}ms: {result['response'][:80]}...")
```
How to build systems that use LLMs reliably at scale.
Many users ask similar questions. Cache outputs (question → answer) with expiration. Before calling the LLM, check cache. Can reduce LLM calls by 50-80% depending on use case. Use Redis or similar.
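A minimal in-process sketch of an output cache with expiration follows; production systems would use Redis (`SETEX` handles the TTL for you), but the logic is the same:

```python
import time

class TTLCache:
    """In-process output cache with per-entry expiration."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store: dict = {}

    def get(self, key):
        hit = self.store.get(key)
        if hit is None:
            return None
        value, expires = hit
        if time.monotonic() > expires:  # stale entry: evict and miss
            del self.store[key]
            return None
        return value

    def set(self, key, value):
        self.store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=0.05)  # tiny TTL so the demo shows expiry
cache.set("what is rag?", "Retrieval-augmented generation ...")
first = cache.get("what is rag?")   # served from cache
time.sleep(0.06)
second = cache.get("what is rag?")  # expired: call the LLM again
print(first, second)
```

Normalize the key (lowercase, strip whitespace, include the model name) before lookup, or the hit rate will be far below what your traffic allows.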
Track prompt changes. If you improve a prompt, increment version. Old requests use old prompt, new requests use new prompt. Enables A/B testing and debugging ("this user is on prompt v2, not v3").
LLMs generate text freely; sometimes that's wrong (harmful, off-topic, wrong format). Add a filter layer: parse output, validate against schema, check for sensitive content, detect hallucinations. Reject and retry if guardrails fail.
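A minimal guardrail sketch combining a schema check with a naive blocklist follows; `REQUIRED_KEYS` and `BLOCKLIST` are illustrative placeholders (real systems use JSON Schema validators and dedicated safety classifiers):

```python
import json

REQUIRED_KEYS = {"answer", "confidence"}   # expected output schema
BLOCKLIST = {"ssn", "password"}            # naive sensitive-content check

def validate_output(raw: str):
    """Guardrail layer: returns (ok, parsed_or_reason).

    On failure the caller retries the LLM call or escalates to a human.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if any(term in raw.lower() for term in BLOCKLIST):
        return False, "sensitive content detected"
    if not REQUIRED_KEYS <= data.keys():
        return False, f"missing keys: {REQUIRED_KEYS - data.keys()}"
    return True, data

ok, result = validate_output('{"answer": "42", "confidence": 0.9}')
print(ok, result)
ok2, reason = validate_output('{"answer": "the password is hunter2", "confidence": 0.5}')
print(ok2, reason)
```

Run the cheap checks (parsing, schema) before the expensive ones (safety classifiers, hallucination detection) so most bad outputs are rejected early.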
What if the LLM fails? Have fallbacks: (1) cached response, (2) smaller/faster model, (3) hardcoded response, (4) human escalation. Graceful degradation beats hard failure.
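That fallback ladder can be sketched as a single function; the always-failing providers below are stand-ins to force the worst case:

```python
def generate_with_degradation(prompt, primary, cache_lookup, fallback_model,
                              canned="Sorry, we're having trouble. A human will follow up."):
    """Degradation ladder: primary model, then cache, then smaller model, then canned reply."""
    try:
        return "primary", primary(prompt)
    except Exception:
        pass
    cached = cache_lookup(prompt)
    if cached is not None:
        return "cache", cached
    try:
        return "fallback-model", fallback_model(prompt)
    except Exception:
        return "canned", canned  # never a hard failure

def down(prompt):
    """Stand-in for a provider that is out."""
    raise TimeoutError("provider outage")

# Both models down and no cached answer: the user still gets a response.
tier, text = generate_with_degradation("hello", primary=down,
                                       cache_lookup=lambda p: None,
                                       fallback_model=down)
print(tier, text)
```

Log which tier served each request; a rising share of cache or canned responses is often the first visible sign of a provider incident.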
Building the monitoring, observability, and operational practices that keep systems running reliably.
Latency: Time to first token, time to last token (measure both). Cost: Cost per request, cost per user, token consumption. Quality: Output correctness (via evaluation set or user feedback), hallucination rate, safety metrics. Availability: Uptime, error rates, cache hit rate. Throughput: Requests per second, tokens per second.
Log every LLM call: prompt, model, output, latency, cost, user. Trace requests end-to-end (user request → cache check → LLM call → guardrails → response). Essential for debugging "why did the LLM give a bad answer?"
LLM costs are real and grow fast. Monitor: (1) cost per feature, (2) cost per user, (3) cost per request. Set budgets and alerts. If a feature costs more than you can afford, optimize or disable. If a user is expensive, investigate.
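A minimal cost tracker along these lines is sketched below; the per-token price is illustrative, not an official rate:

```python
from collections import defaultdict

PRICE_PER_1K_TOKENS = {"gpt-4o": 0.005}  # illustrative rate; check your provider's pricing

class CostTracker:
    """Accumulate spend per user and per feature; alert when a budget is crossed."""

    def __init__(self, user_budget: float):
        self.user_budget = user_budget
        self.by_user = defaultdict(float)
        self.by_feature = defaultdict(float)
        self.alerts = []

    def record(self, user: str, feature: str, model: str, tokens: int) -> float:
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
        self.by_user[user] += cost
        self.by_feature[feature] += cost
        if self.by_user[user] > self.user_budget:
            self.alerts.append(f"budget alert: {user} at ${self.by_user[user]:.3f}")
        return cost

tracker = CostTracker(user_budget=0.10)
for _ in range(25):  # one heavy user, 25 requests of ~1K tokens each
    tracker.record("user-42", "chat", "gpt-4o", tokens=1000)
print(f"chat feature: ${tracker.by_feature['chat']:.3f}, alerts: {len(tracker.alerts)}")
```

In production the same counters would feed a metrics backend so alerts fire on dashboards, not in a Python list, but the aggregation keys (user, feature, model) are the part that matters.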
Define what "good" means for your LLM outputs (accuracy, tone, format). Build an evaluation set (100+ examples). Run evals regularly. If quality degrades, investigate: did the model version change? Did the prompt break? Did the context change?
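A sketch of a regression eval runner follows, using a toy model and a two-example eval set; a real eval set would hold 100+ cases and the checker functions would encode your actual quality criteria:

```python
def run_evals(generate, eval_set, threshold: float = 0.8):
    """Run an eval set against any model callable; flag if pass rate drops below threshold."""
    passed = sum(1 for case in eval_set if case["check"](generate(case["input"])))
    rate = passed / len(eval_set)
    status = "OK" if rate >= threshold else "DEGRADED: check model, prompt, and context changes"
    return rate, status

def toy_model(question: str) -> str:
    """Stand-in for a real LLM call."""
    return "Paris" if "France" in question else "unknown"

eval_set = [
    {"input": "Capital of France?",   "check": lambda out: "Paris" in out},
    {"input": "Capital of Atlantis?", "check": lambda out: out == "unknown"},
]
rate, status = run_evals(toy_model, eval_set)
print(rate, status)
```

Run this on every prompt change and on a schedule; a pass rate that drifts without any deploy usually means the upstream model version changed underneath you.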
Let users rate outputs (thumbs up/down). Log feedback. Use it to detect problems early ("Why did we get 50 downvotes today?"). Feed good examples back into fine-tuning or prompt optimization.
Before shipping a GenAI system, ensure you've covered these. Adapted from production ML and SRE best practices.
☐ Model chosen (or multi-model routing strategy). ☐ Inference latency measured and acceptable. ☐ Inference cost understood and budgeted. ☐ Error handling and fallbacks defined. ☐ Rate limiting implemented. ☐ Authentication & logging in place.
☐ Caching strategy (output cache, context cache, KV cache). ☐ Prompt versioning set up. ☐ Guardrails (output validation, safety checks) defined. ☐ Degradation paths clear (what happens on failure). ☐ Multi-provider fallback or at least one fallback.
☐ Metrics defined (latency, cost, quality, availability). ☐ Logging and tracing set up (can debug any failure). ☐ Cost monitoring and alerts in place. ☐ Evaluation set built (can measure quality). ☐ Dashboards created (can see system health at a glance). ☐ Runbooks written (how to respond to common issues).
☐ Quality baseline established (initial BLEU/accuracy/recall). ☐ Safety red-lines defined (what outputs are unacceptable). ☐ User feedback mechanism built (thumbs up/down or equivalent). ☐ Monitoring for quality drift (does quality degrade over time?). ☐ Process for responding to failures (who's on call?).
```python
import json, time
from dataclasses import dataclass, field
from pathlib import Path
from openai import OpenAI

client = OpenAI()

@dataclass
class ReviewItem:
    id: str
    input: str
    output: str
    flag_reason: str
    timestamp: float = field(default_factory=time.time)
    reviewed: bool = False
    approved: bool = False

class ReviewQueue:
    def __init__(self, path: str = "review_queue.jsonl"):
        self.path = Path(path)

    def flag(self, item: ReviewItem):
        with self.path.open("a") as f:
            f.write(json.dumps(item.__dict__) + "\n")
        print(f"[QUEUE] Flagged: {item.flag_reason} — {item.input[:50]}")

    def pending(self) -> list[dict]:
        if not self.path.exists():
            return []
        items = (json.loads(line) for line in self.path.read_text().splitlines())
        return [item for item in items if not item.get("reviewed")]

queue = ReviewQueue()

def safe_generate(user_input: str, system: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user_input}],
    ).choices[0].message.content
    # Flag if the output looks like a refusal or safety issue
    if any(w in resp.lower() for w in ["sorry", "cannot", "unsafe", "harmful"]):
        queue.flag(ReviewItem(
            id=f"r-{int(time.time())}",
            input=user_input, output=resp,
            flag_reason="potential_refusal_or_safety",
        ))
    return resp
```
Each production topic deserves careful study. Start with your biggest pain point.
Infrastructure: Model serving, inference optimization (quantization, distillation, caching), and multi-provider routing.
System design: Architecture patterns, caching strategies, guardrails, fallbacks, and graceful degradation.
Production engineering: Monitoring, logging, cost tracking, quality evaluation, and operational practices.