Model registries, deployment pipelines, drift detection, and the operational practices that keep LLM products reliable
Traditional MLOps: retrain model, run offline evals, deploy new weights, monitor prediction distribution
LLM MLOps adds: prompt versioning (prompts change more often than weights), LLM-as-judge (evaluating open-ended text output), vendor model deprecations (you didn't train the model), context window management (inputs are variable-length documents)
| Aspect | Traditional ML | LLM MLOps |
|---|---|---|
| Deployable artifact | Model weights | Weights + prompts + config |
| Main regression signal | Metric on holdout set | LLM judge + golden set |
| Retraining trigger | Drift in input distribution | Model deprecation or quality drop |
| Latency source | Single inference | Multiple LLM calls, tool chains |
| Vendor dependency | Low (own model) | High (API providers) |
Prompts are code: version them in git, review them in PRs, test them before merging
Prompt CI/CD pipeline: commit prompt change → run eval suite → compare to baseline → block merge if score drops → deploy if passing
Prompt registry: store prompts with versions, tags (production/staging/experiment), eval scores, and deployment history
Example: prompt versioning workflow
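A minimal sketch of that workflow, assuming a content-hash versioning scheme and a placeholder eval harness (`run_eval_suite`, `ci_gate`, and the 0.90 baseline are all illustrative names, not a real tool's API):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class PromptVersion:
    name: str
    text: str
    tag: str = "experiment"   # lifecycle: experiment -> staging -> production

    @property
    def version(self) -> str:
        # Content-addressed version: identical prompt text -> identical ID.
        return hashlib.sha256(self.text.encode()).hexdigest()[:12]

def run_eval_suite(prompt: PromptVersion) -> float:
    """Placeholder: score the prompt against a golden set, 0.0-1.0."""
    return 0.93

def ci_gate(candidate: PromptVersion, baseline_score: float) -> bool:
    """Block the merge if the candidate scores below the production baseline."""
    return run_eval_suite(candidate) >= baseline_score
```

In CI, `ci_gate` would run on every PR touching a prompt file, with the baseline score taken from the prompt currently tagged `production` in the registry.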
Tools: LangSmith (prompt hub), Braintrust (prompt management), PromptFoo (YAML-based testing), W&B Prompts
Model registry: catalog of all models in use with: model ID, version, provider, deployment date, eval scores, deprecation date
For fine-tuned models: include training data version, training config, validation metrics, and lineage (base model → fine-tuned model)
Deprecation management: subscribe to provider changelogs, test new model versions on your golden set before forced migration, maintain fallback model config
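The pre-migration test can be sketched as a golden-set comparison gate; `judge` is a stub standing in for an LLM-as-judge call, and all model names and scores here are illustrative:

```python
# Before a provider deprecation forces a swap, score the candidate model on
# the same golden set as the incumbent and gate the migration on the result.
GOLDEN_SET = [
    {"input": "Summarize this ticket.", "reference": "short summary"},
    {"input": "Classify this query.", "reference": "billing"},
]

def judge(model_id: str, example: dict) -> float:
    """Stub judge returning a 0-1 score for one example."""
    return 0.90 if model_id == "candidate-model" else 0.85

def golden_set_score(model_id: str) -> float:
    scores = [judge(model_id, ex) for ex in GOLDEN_SET]
    return sum(scores) / len(scores)

def safe_to_migrate(current: str, candidate: str, tolerance: float = 0.02) -> bool:
    # Migrate only if the candidate is within `tolerance` of the current model.
    return golden_set_score(candidate) >= golden_set_score(current) - tolerance
```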
Example: model registry entry schema
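One possible shape for such an entry, as a dataclass; the field names follow the list above but are assumptions, not any specific registry product's schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelRegistryEntry:
    model_id: str                      # e.g. a provider model string
    version: str
    provider: str                      # "openai", "anthropic", "self-hosted", ...
    deployed_at: str                   # ISO 8601 date
    eval_scores: dict[str, float] = field(default_factory=dict)
    deprecation_date: Optional[str] = None
    # Lineage fields, populated for fine-tuned models only:
    base_model: Optional[str] = None
    training_data_version: Optional[str] = None
    training_config: Optional[dict] = None

entry = ModelRegistryEntry(
    model_id="support-ft-v3",
    version="3.0.1",
    provider="openai",
    deployed_at="2025-01-15",
    eval_scores={"golden_set": 0.91},
    base_model="gpt-4o-mini",
    training_data_version="tickets-2024-12",
)
```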
What to monitor: (1) latency (TTFT, total latency, P99), (2) cost (tokens in/out, per-request cost), (3) error rates (API errors, timeouts, retries), (4) output quality (LLM judge scores, user feedback), (5) input distribution (prompt length, query type shifts)
Drift signals: quality score drops, latency spikes, unexpected token count changes, new error codes from provider
Alerting thresholds: quality drops >5% → page on-call; latency P99 >2× baseline → alert; error rate >1% → alert
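The thresholds above as a small check function; the metric dict structure is an assumption, to be wired to whatever metrics backend is in use:

```python
def check_alerts(current: dict, baseline: dict) -> list[str]:
    """Return alert strings for any threshold breached this window."""
    alerts = []
    if current["quality"] < baseline["quality"] * 0.95:       # >5% quality drop
        alerts.append("PAGE: quality dropped >5% vs baseline")
    if current["latency_p99"] > baseline["latency_p99"] * 2:  # P99 > 2x baseline
        alerts.append("ALERT: latency P99 >2x baseline")
    if current["error_rate"] > 0.01:                          # >1% errors
        alerts.append("ALERT: error rate >1%")
    return alerts
```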
| Layer | Metric | Tool |
|---|---|---|
| Infrastructure | GPU utilization, memory | Prometheus, Grafana |
| API provider | Latency, error rate, cost | Helicone, LangSmith |
| Application | Tokens/request, quality score | Custom + Langfuse |
| User | Thumbs up/down, session length | Custom analytics |
Canary deploy: send 5% of traffic to new model version. Monitor quality and latency for 24 hours. Roll back if metrics degrade.
Shadow testing: new model receives a copy of every request, but its outputs are never served to users. Compare them to the current model's outputs side-by-side offline.
A/B test: split traffic 50/50, measure business metrics (conversion, satisfaction, engagement)
Example: routing logic for canary deploy
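A sketch of the canary split, assuming deterministic per-user bucketing so a given user consistently hits the same version throughout the canary window (model names are placeholders):

```python
import hashlib

CANARY_PERCENT = 5

def route(user_id: str, stable: str = "model-v1", canary: str = "model-v2") -> str:
    # Hash the user ID into buckets 0-99; the lowest buckets go to the canary.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < CANARY_PERCENT else stable
```

Hash-based bucketing beats random sampling here: the canary cohort is stable across requests, so a quality regression shows up in the same users' sessions rather than being smeared across everyone.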
Token cost optimization: cache common prefixes (system prompt + few-shot examples), compress retrieved context, use cheaper models for simple subtasks
Prompt caching: Anthropic exposes explicit cache_control breakpoints, OpenAI applies automatic prefix caching; either way the static system prompt is processed once and reused across requests
Tiered routing: classify query complexity, route simple queries to mini/flash tier, complex to flagship tier
| Technique | Savings | Complexity | Risk |
|---|---|---|---|
| Prefix caching | 50–90% on cached tokens | Low | None |
| Tiered routing | 40–80% | Medium | Quality regression on edge cases |
| Output compression | 10–30% | Low | Information loss |
| Shorter system prompts | 5–20% | Low | Quality regression |
| Smaller model for subsets | 30–70% | Medium | Quality regression |
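The tiered-routing row above can be sketched as a cheap complexity classifier plus a tier-to-model map; the length/keyword heuristic and model names are illustrative assumptions (production systems often use a small classifier model instead):

```python
TIERS = {"simple": "mini-model", "complex": "flagship-model"}

COMPLEX_MARKERS = ("analyze", "compare", "multi-step", "explain why")

def classify(query: str) -> str:
    # Heuristic: long queries or reasoning keywords go to the flagship tier.
    q = query.lower()
    if len(q.split()) > 50 or any(m in q for m in COMPLEX_MARKERS):
        return "complex"
    return "simple"

def route_query(query: str) -> str:
    return TIERS[classify(query)]
```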
LLM-specific incidents: provider outage (API returns 500/503), model behavior change after silent update, prompt injection attack triggering unexpected behavior, cost spike from abnormally long inputs
Runbook entries: for each incident type — detection signal, first responder action, escalation path, rollback procedure
Circuit breaker: if error rate >5% for 60 seconds → switch to fallback model automatically
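That rule can be sketched as a sliding-window breaker; model names are placeholders, and the `now` parameter exists only to make the window testable:

```python
import time
from collections import deque
from typing import Optional

class CircuitBreaker:
    """Switch to a fallback model when the windowed error rate exceeds 5%."""

    def __init__(self, threshold: float = 0.05, window_s: float = 60.0,
                 primary: str = "primary-model", fallback: str = "fallback-model"):
        self.threshold = threshold
        self.window_s = window_s
        self.primary = primary
        self.fallback = fallback
        self.events: deque = deque()   # (timestamp, ok)

    def record(self, ok: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, ok))
        # Evict events that fell out of the window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()

    def model(self) -> str:
        if not self.events:
            return self.primary
        errors = sum(1 for _, ok in self.events if not ok)
        return self.fallback if errors / len(self.events) > self.threshold else self.primary
```

A production version would also need a recovery path (half-open probes back to the primary) so the breaker doesn't stay latched on the fallback forever.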
Post-mortems: document what changed, what signal detected it, how long until detection, remediation taken