Production Engineering

MLOps for LLMs

Model registries, deployment pipelines, drift detection, and the operational practices that keep LLM products reliable

train → evaluate → deploy: the ML lifecycle
drift + regression: the twin risks
CI/CD for prompts: the new discipline
Sections
Section 1

LLM MLOps vs Traditional MLOps

Traditional MLOps: retrain model, run offline evals, deploy new weights, monitor prediction distribution

LLM MLOps adds: prompt versioning (prompts change more often than weights), LLM-as-judge (evaluating open-ended text output), vendor model deprecations (you didn't train the model), context window management (inputs are variable-length documents)

Aspect | Traditional ML | LLM MLOps
Deployable artifact | Model weights | Weights + prompts + config
Main regression signal | Metric on holdout set | LLM judge + golden set
Retraining trigger | Drift in input distribution | Model deprecation or quality drop
Latency source | Single inference | Multiple LLM calls, tool chains
Vendor dependency | Low (own model) | High (API providers)
Your model might be GPT-4o today. When OpenAI deprecates it, your pipelines break unless you test migrating to each new model before the deprecation date.
Section 2

Prompt Versioning and CI/CD

Prompts are code: version them in git, review them in PRs, test them before merging

Prompt CI/CD pipeline: commit prompt change → run eval suite → compare to baseline → block merge if score drops → deploy if passing

Prompt registry: store prompts with versions, tags (production/staging/experiment), eval scores, and deployment history
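One way to sketch such a registry in code; the field and function names below are illustrative, not from any particular tool:

```python
from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    """One entry in a prompt registry (illustrative schema)."""
    prompt_id: str          # e.g. "system_prompt"
    version: str            # e.g. "2.3"
    text: str
    tag: str                # "production" | "staging" | "experiment"
    eval_scores: dict = field(default_factory=dict)
    deployment_history: list = field(default_factory=list)

registry: dict = {}  # (prompt_id, version) -> PromptVersion

def register(p: PromptVersion) -> None:
    registry[(p.prompt_id, p.version)] = p

def promote(prompt_id: str, version: str) -> None:
    """Tag a version as production, demoting the previous production one."""
    for p in registry.values():
        if p.prompt_id == prompt_id and p.tag == "production":
            p.tag = "staging"
    entry = registry[(prompt_id, version)]
    entry.tag = "production"
    entry.deployment_history.append("promoted")
```

Keeping the registry keyed by (prompt_id, version) makes rollback a one-line re-promotion of an older version.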

Example: prompt versioning workflow

# prompts/system_v2.3.txt committed to git
# .github/workflows/eval.yml:
on: pull_request
jobs:
  eval:
    steps:
      - run: python evals/run_suite.py --prompt prompts/system_v2.3.txt
      - run: python evals/compare_to_baseline.py --threshold 0.02
# Blocks PR if score drops >2% on any category

Tools: LangSmith (prompt hub), Braintrust (prompt management), PromptFoo (YAML-based testing), W&B Prompts

Section 3

Model Registry and Lifecycle

Model registry: catalog of all models in use with: model ID, version, provider, deployment date, eval scores, deprecation date

For fine-tuned models: include training data version, training config, validation metrics, and lineage (base model → fine-tuned model)

Deprecation management: subscribe to provider changelogs, test new model versions on your golden set before forced migration, maintain fallback model config
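A scheduled job can surface upcoming deprecations from the registry. A minimal sketch, assuming registry entries shaped like the schema example in this section:

```python
from datetime import date, timedelta

def deprecation_warnings(registry_entries, horizon_days=90, today=None):
    """Return models whose deprecation date falls within the horizon,
    so migration tests can run before the forced cutoff."""
    today = today or date.today()
    warnings = []
    for entry in registry_entries:
        dep = date.fromisoformat(entry["deprecation_date"])
        if dep - today <= timedelta(days=horizon_days):
            warnings.append(
                f"{entry['model_id']}: deprecated {dep}, "
                f"fallback {entry['fallback']}"
            )
    return warnings
```

Run this daily in CI; a non-empty result should open a migration ticket, not just log a line.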

Example: model registry entry schema

{
  "model_id": "gpt-4o-2024-08-06",
  "provider": "openai",
  "deployed_to": ["search_pipeline", "summarization_service"],
  "eval_scores": {"mmlu": 85.7, "task_accuracy": 0.91},
  "deployed_at": "2024-08-15",
  "deprecation_date": "2025-06-01",
  "fallback": "gpt-4o-mini"
}
Section 4

Monitoring and Drift Detection

What to monitor: (1) latency (TTFT, total latency, P99), (2) cost (tokens in/out, per-request cost), (3) error rates (API errors, timeouts, retries), (4) output quality (LLM judge scores, user feedback), (5) input distribution (prompt length, query type shifts)

Drift signals: quality score drops, latency spikes, unexpected token count changes, new error codes from provider
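A minimal check for one of these signals, token counts per request, might look like the sketch below. The relative-change heuristic and threshold are illustrative; production systems often use statistical tests such as KS or PSI instead:

```python
def token_count_drift(baseline, current, rel_threshold=0.3):
    """Flag drift when mean tokens/request in the current window shifts
    more than rel_threshold relative to the baseline window."""
    base_mean = sum(baseline) / len(baseline)
    curr_mean = sum(current) / len(current)
    return abs(curr_mean - base_mean) / base_mean > rel_threshold
```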

Alerting thresholds: quality drops >5% → page on-call; latency P99 >2× baseline → alert; error rate >1% → alert
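These thresholds can be encoded directly. A sketch using the metric names and cutoffs above (fractions, so 0.05 means 5%):

```python
def alert_level(quality_drop, latency_p99, baseline_p99, error_rate):
    """Map current metrics to an alert level per the thresholds above."""
    if quality_drop > 0.05:
        return "page"      # quality drop >5% -> page on-call
    if latency_p99 > 2 * baseline_p99:
        return "alert"     # P99 latency >2x baseline
    if error_rate > 0.01:
        return "alert"     # error rate >1%
    return "ok"
```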

Layer | Metric | Tool
Infrastructure | GPU utilization, memory | Prometheus, Grafana
API provider | Latency, error rate, cost | Helicone, LangSmith
Application | Tokens/request, quality score | Custom + Langfuse
User | Thumbs up/down, session length | Custom analytics
Section 5

Canary Deploys and Shadow Testing

Canary deploy: send 5% of traffic to new model version. Monitor quality and latency for 24 hours. Roll back if metrics degrade.

Shadow testing: new model handles all requests but outputs are discarded. Compare to current model side-by-side.
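A shadow-testing wrapper can be sketched as follows; `current_model` and `shadow_model` stand in for whatever client calls you use:

```python
import concurrent.futures

def shadow_call(request, current_model, shadow_model, log):
    """Serve the current model's output; run the shadow model on the
    same input and log both for side-by-side comparison. The shadow
    result is never returned to the user."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        current_fut = pool.submit(current_model, request)
        shadow_fut = pool.submit(shadow_model, request)
        current_out = current_fut.result()
        try:
            shadow_out = shadow_fut.result(timeout=30)
        except Exception as exc:  # shadow failures must not affect users
            shadow_out = f"<error: {exc}>"
    log.append({"input": request,
                "current": current_out,
                "shadow": shadow_out})
    return current_out
```

Running both calls concurrently keeps the shadow model from adding latency to the user-facing path.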

A/B test: split traffic 50/50, measure business metrics (conversion, satisfaction, engagement)

Example: routing logic for canary deploy

import hashlib

def select_model(user_id: str) -> str:
    # Deterministic routing by user-ID hash (consistent user experience).
    # hashlib replaces built-in hash(), which varies across processes
    # when PYTHONHASHSEED is randomized.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    if bucket < 5:  # 5% canary
        return "gpt-4o-2024-11-20"  # new version
    return "gpt-4o-2024-08-06"  # stable version
Use deterministic routing (hash on user ID) so individual users have consistent experiences. Random routing can confuse users who see different behaviors in the same session.
Section 6

Cost Management

Token cost optimization: cache common prefixes (system prompt + few-shot examples), compress retrieved context, use cheaper models for simple subtasks

Prompt caching: Anthropic and OpenAI both offer cache_control / automatic prefix caching — system prompts processed once
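With Anthropic's Messages API, caching is requested by marking the stable prefix blocks with cache_control. The sketch below only builds the request payload, it does not call the API, and the model id is illustrative:

```python
def cached_request(system_prompt, few_shot_examples, user_msg):
    """Build a Messages-API-style payload whose stable prefix
    (system prompt + few-shot examples) is marked cacheable, so
    repeated requests reuse the already-processed prefix."""
    return {
        "model": "claude-sonnet-4-20250514",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": few_shot_examples,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

OpenAI instead applies prefix caching automatically; there the optimization is structural, keeping the stable prefix identical byte-for-byte across requests.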

Tiered routing: classify query complexity, route simple queries to mini/flash tier, complex to flagship tier
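A minimal tiered router, using a crude length-and-keyword heuristic in place of a real complexity classifier; the model names are placeholders:

```python
def route(query: str) -> str:
    """Route simple queries to a cheap tier, complex ones to the
    flagship tier. Production routers typically use a small
    classifier model rather than this keyword heuristic."""
    complex_markers = ("analyze", "compare", "explain why", "step by step")
    is_complex = (len(query.split()) > 50
                  or any(m in query.lower() for m in complex_markers))
    return "flagship-model" if is_complex else "mini-model"
```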

Technique | Savings | Complexity | Risk
Prefix caching | 50–90% on cached tokens | Low | None
Tiered routing | 40–80% | Medium | Quality regression on edge cases
Output compression | 10–30% | Low | Information loss
Shorter system prompts | 5–20% | Low | Quality regression
Smaller model for subsets | 30–70% | Medium | Quality regression
Measure cost per outcome (cost per successful task completion), not just cost per token. A cheaper model that requires 3 retries is more expensive than a reliable expensive model.
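That trade-off can be made concrete: expected cost per success depends on both the per-call price and the retry distribution. A sketch, assuming independent attempts:

```python
def cost_per_success(cost_per_call, success_rate, max_attempts=3):
    """Expected cost per successful completion when failures are
    retried up to max_attempts times (independent attempts assumed)."""
    p_fail = 1.0 - success_rate
    expected_calls = 0.0
    for attempt in range(1, max_attempts + 1):
        # succeed exactly on this attempt
        expected_calls += attempt * (p_fail ** (attempt - 1)) * success_rate
    # exhausting all attempts without success still costs max_attempts calls
    expected_calls += max_attempts * (p_fail ** max_attempts)
    overall_success = 1.0 - p_fail ** max_attempts
    return cost_per_call * expected_calls / overall_success
```

Comparing models at equal success_rate is misleading; compare cost_per_success at each model's observed reliability.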
Section 7

Incident Response and Runbooks

LLM-specific incidents: provider outage (API returns 500/503), model behavior change after silent update, prompt injection attack triggering unexpected behavior, cost spike from abnormally long inputs

Runbook entries: for each incident type — detection signal, first responder action, escalation path, rollback procedure

Circuit breaker: if error rate >5% for 60 seconds → switch to fallback model automatically
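A minimal in-process sketch of that rule; the model names are placeholders:

```python
import time

class CircuitBreaker:
    """Switch to the fallback model when the error rate over a sliding
    window exceeds a threshold (5% over 60 seconds, per the rule above)."""

    def __init__(self, threshold=0.05, window_s=60.0):
        self.threshold = threshold
        self.window_s = window_s
        self.events = []  # (timestamp, ok) pairs

    def record(self, ok, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, ok))
        cutoff = now - self.window_s
        self.events = [(t, s) for t, s in self.events if t >= cutoff]

    def select_model(self):
        errors = sum(1 for _, ok in self.events if not ok)
        if self.events and errors / len(self.events) > self.threshold:
            return "fallback-model"
        return "primary-model"
```

A production breaker also needs a recovery path (half-open state that probes the primary model) so traffic returns once the provider stabilizes.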

Post-mortems: document what changed, what signal detected it, how long until detection, remediation taken

LLM Ops: LangSmith (monitoring and debugging), Langfuse (tracing and cost tracking), Helicone (API observability), W&B (model tracking and experiments)
Observability: Arize Phoenix (LLM tracing platform)
Infrastructure: Prometheus + Grafana (metrics and dashboards)
