Operations & Scale

GenAI in Production

Running models at scale is harder than training them. Learn infrastructure, system design, and production engineering for reliability and cost.

01 — Context

Why GenAI Production Is Uniquely Hard

Running LLMs in production is fundamentally different from running traditional software. Traditional software is deterministic (same input → same output) and fast. LLMs are probabilistic, slow, and expensive. This changes everything about infrastructure, monitoring, and reliability.

Latency

LLMs are slow. Generating one token takes ~10-100ms on modern hardware, so a 100-token response takes 1-10 seconds, while users expect responses within 1-3 seconds. Solutions: streaming (show tokens as they arrive), smaller models, and inference optimization (quantization, distillation, KV caching).
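Streaming matters because it changes perceived latency: the user sees the first token in a few hundred milliseconds even if the full response takes seconds. A minimal sketch of measuring time-to-first-token against total latency, using a hypothetical fake token generator in place of a real streaming API call:

```python
import time
from typing import Iterator

def measure_stream(tokens: Iterator[str]) -> dict:
    """Consume a token stream, tracking time-to-first-token and total latency."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for tok in tokens:
        if ttft is None:
            ttft = time.perf_counter() - start   # user sees output here
        parts.append(tok)
    total = time.perf_counter() - start
    return {"text": "".join(parts),
            "ttft_ms": round((ttft or total) * 1000),
            "total_ms": round(total * 1000)}

def fake_stream():
    """Stand-in for an API token stream (e.g., stream=True in the OpenAI SDK)."""
    for tok in ["The ", "KV ", "cache ", "speeds ", "up ", "decoding."]:
        time.sleep(0.01)                          # simulate per-token delay
        yield tok

stats = measure_stream(fake_stream())
print(stats["ttft_ms"], "ms to first token vs", stats["total_ms"], "ms total")
```

With streaming, the number users feel is `ttft_ms`, not `total_ms` — which is why time-to-first-token appears as a key metric later in this page.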

Cost

LLM inference is expensive. Each token costs money (OpenAI: $0.001-0.01 per 1K tokens). A typical conversation costs $0.10-1.00. Multiply by millions of users and costs spiral. Solution: prompt caching, smaller models, local inference, aggressive optimization.

Non-Determinism

Same prompt → different output. Temperature, sampling strategy, model version all affect results. Traditional software assumes determinism; production LLM systems need to handle randomness. Solutions: temperature control, output parsing, fallback models, human review loops.
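One common way to tame non-determinism is to validate every output against a schema and retry on failure. A sketch of a validate-and-retry wrapper that works with any generation function; the flaky stand-in generator below is hypothetical, not a real API call:

```python
import json
from typing import Callable

def generate_valid(generate: Callable[[str], str], prompt: str,
                   required_keys: set, max_retries: int = 3) -> dict:
    """Call a (possibly non-deterministic) generator until the output parses
    as JSON and contains the required keys; raise after max_retries."""
    last_err = None
    for _ in range(max_retries):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
            if required_keys <= parsed.keys():
                return parsed
            last_err = f"missing keys: {required_keys - parsed.keys()}"
        except json.JSONDecodeError as e:
            last_err = str(e)
    raise ValueError(f"no valid output after {max_retries} tries: {last_err}")

# Flaky stand-in: fails once, then returns valid JSON.
attempts = iter(['not json', '{"sentiment": "positive", "score": 0.9}'])
result = generate_valid(lambda p: next(attempts), "classify: great product",
                        required_keys={"sentiment", "score"})
print(result["sentiment"])  # → positive
```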

Vendor Dependency

Using API vendors (OpenAI, Anthropic) means your app depends on their availability and pricing. They can change prices, deprecate models, or degrade performance. Solutions: multi-provider routing, fallbacks, caching, budget alerts.

⚠️ Production reality: Most LLM failures aren't about the model — they're about latency, cost, monitoring, and error handling. Plan for these problems, or your production system will fail spectacularly.
02 — Architecture

The Three Pillars of Production GenAI

Production GenAI systems rest on three pillars. All three must be strong, or the system fails.

Pillar 1: Infrastructure & Serving

What: How to run models (cloud GPUs, inference optimizers, caching layers). Key questions: Where do we run models? How do we scale? How do we cache? How do we handle failures? Tools: vLLM, Ollama, Together AI, Baseten, Replicate.

Pillar 2: System Design Patterns

What: How to architect a system that uses LLMs reliably. Key questions: How do we route requests? How do we degrade gracefully? How do we fall back? How do we make LLM outputs deterministic? Patterns: Multi-provider routing, output caching, prompt versioning, circuit breakers.

Pillar 3: Production Engineering

What: How to monitor, debug, and operate LLM systems. Key questions: What metrics matter? How do we detect failures? How do we debug why an LLM failed? How do we track costs? Tools: Logging, tracing, dashboards, cost monitoring, user feedback loops.

💡 Systems thinking: A production GenAI system is not just "call the LLM API and return the result." It's a sophisticated system with caching, fallbacks, monitoring, cost controls, and error handling. Design for all three pillars from day one.
| Pillar | Key Component | Primary Concern | Key Metric |
| --- | --- | --- | --- |
| Infrastructure | Model serving (vLLM, TGI) | Throughput & cost | Tokens/sec/GPU |
| Infrastructure | Caching layer | Reducing redundant calls | Cache hit rate |
| System Design | Retry & fallback | Reliability at P99 | Error rate, P99 latency |
| System Design | Async/streaming | Perceived responsiveness | Time-to-first-token |
| Prod Engineering | Observability (LangSmith) | Debug & audit trail | Trace coverage |
| Prod Engineering | Human-in-the-loop | Safety & quality gate | Review queue depth |
03 — Serving

Pillar 1: Infrastructure & Model Serving

Where and how you run models affects everything: latency, cost, control, and reliability.

Hosted APIs (OpenAI, Anthropic, Mistral)

Pros: No infrastructure to manage, always updated, highest-quality models. Cons: No control, vendor lock-in, expensive at scale, rate limits, data privacy concerns. Best for: Prototypes, startups, apps where control isn't critical.

Self-Hosted (vLLM, Ollama, llama.cpp)

Pros: Full control, cheaper at scale, data stays local, customizable. Cons: You manage infrastructure, latency optimization, and model updates yourself. Best for: Production apps, on-prem deployments, privacy-critical work.

Inference Optimization Techniques

Quantization: Reduce model size and latency (int8 instead of fp32 cuts memory ~4x and typically speeds inference 2-4x, with minimal quality loss). Distillation: Train a small model to mimic a large one (smaller, faster, cheaper). Batching: Process multiple requests together (higher throughput). KV Caching: Cache attention keys/values (large speedup for generation).
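To make the quantization idea concrete, here is a toy symmetric int8 quantizer in pure Python. This illustrates only the mapping (float → small integer + scale, and back); real inference stacks apply it per-tensor or per-channel with fused int8 kernels:

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: map floats to [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.52, -1.27, 0.003, 0.9]
q, scale = quantize_int8(w)            # 4 ints + 1 float instead of 4 floats
restored = dequantize(q, scale)
errors = [abs(a - b) for a, b in zip(w, restored)]
print("quantized:", q, "max error:", round(max(errors), 4))
```

The rounding error is bounded by half a quantization step (`scale / 2`), which is why quality loss stays small when weight magnitudes are well-behaved.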

Multi-Provider Strategy

Don't depend on one vendor. Use a router that tries primary provider (e.g., GPT-4), falls back to secondary (e.g., Claude), then tertiary (e.g., local model). Handles outages, rate limits, and cost optimization.
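A minimal sketch of that routing logic: providers are tried in priority order and the first success wins. The provider callables below are hypothetical stand-ins, not real SDK calls; production code would catch specific errors (timeouts, 429s, 5xx) rather than bare `Exception`:

```python
from typing import Callable

def route(providers: list[tuple[str, Callable[[str], str]]], prompt: str) -> dict:
    """Try providers in priority order; return the first successful response."""
    errors = {}
    for name, call in providers:
        try:
            return {"provider": name, "response": call(prompt)}
        except Exception as e:          # real code: catch timeouts, rate limits, 5xx
            errors[name] = str(e)
    raise RuntimeError(f"all providers failed: {errors}")

def flaky_primary(prompt: str) -> str:     # stand-in for a primary model that is down
    raise TimeoutError("primary timed out")

def secondary(prompt: str) -> str:         # stand-in for a fallback model
    return f"answer to: {prompt}"

result = route([("primary", flaky_primary), ("secondary", secondary)], "hi")
print(result["provider"])  # → secondary
```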

Infra best practice: Start with hosted APIs (OpenAI) for speed, but design for portability. Build an abstraction layer so you can swap providers. Monitor costs aggressively. Optimize latency obsessively — every 100ms matters for user experience.
Python · Production LLM call with latency tracking and semantic caching
import hashlib, time
from openai import OpenAI

client = OpenAI()
_cache: dict = {}   # replace with Redis in production

def semantic_cache_key(prompt: str, model: str) -> str:
    """Exact-match cache key (use embedding similarity for semantic caching)."""
    return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

def tracked_completion(prompt: str, model: str = "gpt-4o",
                       use_cache: bool = True) -> dict:
    """LLM call with timing, caching, and basic observability."""
    key = semantic_cache_key(prompt, model)

    if use_cache and key in _cache:
        return {"response": _cache[key], "source": "cache", "latency_ms": 0}

    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    latency_ms = (time.perf_counter() - start) * 1000
    content = resp.choices[0].message.content

    _cache[key] = content
    return {
        "response": content,
        "source": "llm",
        "latency_ms": round(latency_ms),
        "tokens": resp.usage.total_tokens,
        "model": model
    }

result = tracked_completion("What is the KV cache in LLMs?")
print(f"[{result['source']}] {result['latency_ms']}ms: {result['response'][:80]}...")
04 — System Design

Pillar 2: System Design Patterns

How to build systems that use LLMs reliably at scale.

Output Caching & Deduplication

Many users ask similar questions. Cache outputs (question → answer) with expiration. Before calling the LLM, check cache. Can reduce LLM calls by 50-80% depending on use case. Use Redis or similar.
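A minimal in-process sketch of an output cache with expiration. The class and TTL value are illustrative; in production you would use Redis with `SETEX`/`EXPIRE` so the cache is shared across workers:

```python
import time

class TTLCache:
    """Output cache with expiration (in production, use Redis SETEX instead)."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if time.monotonic() > expires:   # stale entry: drop it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key: str, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

cache = TTLCache(ttl_seconds=0.05)
cache.set("q: refund policy?", "30 days, no questions asked")
print(cache.get("q: refund policy?"))    # hit
time.sleep(0.06)
print(cache.get("q: refund policy?"))    # None — expired
```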

Prompt Versioning

Track prompt changes. If you improve a prompt, increment version. Old requests use old prompt, new requests use new prompt. Enables A/B testing and debugging ("this user is on prompt v2, not v3").

Output Guardrails

LLMs generate text freely; sometimes that's wrong (harmful, off-topic, wrong format). Add a filter layer: parse output, validate against schema, check for sensitive content, detect hallucinations. Reject and retry if guardrails fail.
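A sketch of such a filter layer combining schema validation with a simple banned-term check. The specific checks are illustrative only; real guardrail stacks add moderation APIs, PII detectors, and hallucination checks:

```python
import json

def run_guardrails(output: str, schema_keys: set,
                   banned_terms: set) -> tuple[bool, str]:
    """Validate an LLM output: parseable, matches schema, no banned content."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not schema_keys <= parsed.keys():
        return False, f"missing keys: {schema_keys - parsed.keys()}"
    text = json.dumps(parsed).lower()
    hits = {t for t in banned_terms if t in text}
    if hits:
        return False, f"banned content: {hits}"
    return True, "ok"

ok, reason = run_guardrails('{"answer": "Contact support for a refund."}',
                            schema_keys={"answer"},
                            banned_terms={"ssn", "password"})
print(ok, reason)  # → True ok
```

On a `False` result, the caller rejects the output and retries (possibly with a corrective instruction appended to the prompt), matching the reject-and-retry loop described above.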

Fallback & Degradation

What if the LLM fails? Have fallbacks: (1) cached response, (2) smaller/faster model, (3) hardcoded response, (4) human escalation. Graceful degradation beats hard failure.

PRODUCTION SYSTEM ARCHITECTURE

  User Request
    ↓ Rate Limiting (avoid abuse)
    ↓ Authentication & Logging
    ↓ Check Cache (have we seen this before?)
    ↓ Try Primary LLM (e.g., GPT-4)
    ↓ Run Guardrails (validate output)
    ↓ Cache Result
    ↓ Return to User

  On Failure:
    • LLM error? Try Secondary Model
    • Timeout? Return cached/default response
    • Invalid output? Retry with prompt fix
    • Rate limited? Queue and retry later
⚠️ Common mistake: Calling the LLM on every request, even if you've answered the same question before. This is expensive and slow. Cache everything, invalidate when data changes.
05 — Operations

Pillar 3: Production Engineering

Building the monitoring, observability, and operational practices that keep systems running reliably.

Key Metrics to Track

Latency: Time to first token, time to last token (measure both). Cost: Cost per request, cost per user, token consumption. Quality: Output correctness (via evaluation set or user feedback), hallucination rate, safety metrics. Availability: Uptime, error rates, cache hit rate. Throughput: Requests per second, tokens per second.

Logging & Tracing

Log every LLM call: prompt, model, output, latency, cost, user. Trace requests end-to-end (user request → cache check → LLM call → guardrails → response). Essential for debugging "why did the LLM give a bad answer?"
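A sketch of one structured log record per LLM call. The field names are illustrative; in practice you would emit these through a structured logger or as OpenTelemetry spans so records can be joined into end-to-end traces:

```python
import json, time, uuid

def log_llm_call(prompt: str, model: str, output: str,
                 latency_ms: int, cost_usd: float, user_id: str) -> dict:
    """Build and emit one structured log record for a single LLM call."""
    record = {
        "trace_id": str(uuid.uuid4()),  # join key across cache/LLM/guardrail steps
        "timestamp": time.time(),
        "model": model,
        "user_id": user_id,
        "prompt": prompt[:500],         # truncate; mind PII retention policies
        "output": output[:500],
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    print(json.dumps(record))           # real code: structured logger / OTel exporter
    return record

rec = log_llm_call("What is our refund policy?", "gpt-4o",
                   "30 days, no questions asked.", latency_ms=840,
                   cost_usd=0.004, user_id="u-123")
```

Because every record carries a `trace_id`, answering "why did the LLM give a bad answer?" becomes a query, not an archaeology project.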

Cost Monitoring

LLM costs are real and grow fast. Monitor: (1) cost per feature, (2) cost per user, (3) cost per request. Set budgets and alerts. If a feature costs more than you can afford, optimize or disable. If a user is expensive, investigate.
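A toy cost tracker with a budget alert. The per-1K-token prices below are placeholders for illustration, not real vendor pricing — always compute from your provider's current price sheet:

```python
# Placeholder per-1K-token prices; substitute your provider's actual pricing.
PRICES = {"gpt-4o": {"input": 0.005, "output": 0.015}}

class CostTracker:
    def __init__(self, daily_budget_usd: float):
        self.budget = daily_budget_usd
        self.spent = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Compute a request's cost from token counts; alert if over budget."""
        p = PRICES[model]
        cost = (input_tokens / 1000 * p["input"]
                + output_tokens / 1000 * p["output"])
        self.spent += cost
        if self.spent > self.budget:    # real code: page on-call / disable feature
            print(f"ALERT: ${self.spent:.2f} exceeds ${self.budget:.2f} budget")
        return cost

tracker = CostTracker(daily_budget_usd=0.10)
cost = tracker.record("gpt-4o", input_tokens=1200, output_tokens=400)
print(f"request cost: ${cost:.4f}, total today: ${tracker.spent:.4f}")
```

Tag each recorded cost with feature and user IDs and you get the three views the text calls for: cost per feature, per user, and per request.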

Quality & Evaluation

Define what "good" means for your LLM outputs (accuracy, tone, format). Build an evaluation set (100+ examples). Run evals regularly. If quality degrades, investigate: did the model version change? Did the prompt break? Did the context change?
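A minimal eval harness: run a model function over a fixed eval set and report accuracy. The two-case set and stand-in model are toys for illustration; as the text says, a real set should have 100+ examples and a scoring rule matched to your definition of "good":

```python
from typing import Callable

def run_evals(model_fn: Callable[[str], str], eval_set: list[dict]) -> float:
    """Score a model against a fixed eval set; returns accuracy in [0, 1]."""
    correct = sum(1 for case in eval_set
                  if case["expected"].lower() in model_fn(case["input"]).lower())
    return correct / len(eval_set)

# Toy eval set and stand-in model.
EVAL_SET = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
]
def toy_model(prompt: str) -> str:
    return {"capital of France?": "The capital is Paris.",
            "2 + 2?": "The answer is 5."}.get(prompt, "")

accuracy = run_evals(toy_model, EVAL_SET)
print(f"accuracy: {accuracy:.0%}")  # → 50%
```

Run this on every model-version or prompt change and store the scores; a sudden drop is your earliest signal that something upstream moved.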

User Feedback Loop

Let users rate outputs (thumbs up/down). Log feedback. Use it to detect problems early ("Why did we get 50 downvotes today?"). Feed good examples back into fine-tuning or prompt optimization.

Ops best practice: Instrument everything. Track latency, cost, quality, and availability. Set up alerts for anomalies. Build dashboards. Review metrics daily. Production GenAI is about running systems at scale, not just calling APIs.
06 — Planning

Minimal Production Checklist

Before shipping a GenAI system, ensure you've covered these. Adapted from production ML and SRE best practices.

Serving & Infra

☐ Model chosen (or multi-model routing strategy). ☐ Inference latency measured and acceptable. ☐ Inference cost understood and budgeted. ☐ Error handling and fallbacks defined. ☐ Rate limiting implemented. ☐ Authentication & logging in place.

System Design

☐ Caching strategy (output cache, context cache, KV cache). ☐ Prompt versioning set up. ☐ Guardrails (output validation, safety checks) defined. ☐ Degradation paths clear (what happens on failure). ☐ Multi-provider fallback or at least one fallback.

Production Engineering

☐ Metrics defined (latency, cost, quality, availability). ☐ Logging and tracing set up (can debug any failure). ☐ Cost monitoring and alerts in place. ☐ Evaluation set built (can measure quality). ☐ Dashboards created (can see system health at a glance). ☐ Runbooks written (how to respond to common issues).

Quality & Safety

☐ Quality baseline established (initial BLEU/accuracy/recall). ☐ Safety red-lines defined (what outputs are unacceptable). ☐ User feedback mechanism built (thumbs up/down or equivalent). ☐ Monitoring for quality drift (does quality degrade over time?). ☐ Process for responding to failures (who's on call?).

⚠️ Reality check: If you can't check all these boxes, you're not ready for production. It's okay — take the time to build the right foundation. Shipping without these is how you end up with $10K bills and angry users.
Python · Human-in-the-loop review queue for flagged outputs
import json, time
from dataclasses import dataclass, field
from openai import OpenAI
from pathlib import Path

client = OpenAI()

@dataclass
class ReviewItem:
    id: str
    input: str
    output: str
    flag_reason: str
    timestamp: float = field(default_factory=time.time)
    reviewed: bool = False
    approved: bool = False

class ReviewQueue:
    def __init__(self, path: str = "review_queue.jsonl"):
        self.path = Path(path)

    def flag(self, item: ReviewItem):
        with self.path.open("a") as f:
            f.write(json.dumps(item.__dict__) + "\n")
        print(f"[QUEUE] Flagged: {item.flag_reason} — {item.input[:50]}")

    def pending(self) -> list[dict]:
        if not self.path.exists():
            return []
        items = [json.loads(line) for line in self.path.read_text().splitlines()]
        return [item for item in items if not item.get("reviewed")]

queue = ReviewQueue()

def safe_generate(user_input: str, system: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user_input}]
    ).choices[0].message.content

    # Flag if output looks sensitive
    if any(w in resp.lower() for w in ["sorry", "cannot", "unsafe", "harmful"]):
        queue.flag(ReviewItem(
            id=f"r-{int(time.time())}",
            input=user_input, output=resp,
            flag_reason="potential_refusal_or_safety"
        ))
    return resp
07 — Explore

Deep Dives: Production Topics

Each production topic deserves careful study. Start with your biggest pain point.

Production Pillars

1. Infrastructure: Model serving, inference optimization (quantization, distillation, caching), and multi-provider routing.

2. System Design: Architecture patterns, caching strategies, guardrails, fallbacks, and graceful degradation.

3. Prod Engineering: Monitoring, logging, cost tracking, quality evaluation, and operational practices.

💡 Priority order: Start with infrastructure (measure latency/cost). Move to system design (add caching/fallbacks). Graduate to ops (monitoring/dashboards). Each layer multiplies reliability and reduces cost.
08 — Further Reading

References

ML Operations
  • Sculley, D. et al. (2015). Hidden Technical Debt in Machine Learning Systems. NeurIPS.