Model registries, deployment pipelines, drift detection, and the operational practices that keep LLM products reliable
Traditional MLOps: retrain model, run offline evals, deploy new weights, monitor prediction distribution
LLM MLOps adds: prompt versioning (prompts change more often than weights), LLM-as-judge (evaluating open-ended text output), vendor model deprecations (you didn't train the model), context window management (inputs are variable-length documents)
| Aspect | Traditional ML | LLM MLOps |
|---|---|---|
| Deployable artifact | Model weights | Weights + prompts + config |
| Main regression signal | Metric on holdout set | LLM judge + golden set |
| Retraining trigger | Drift in input distribution | Model deprecation or quality drop |
| Latency source | Single inference | Multiple LLM calls, tool chains |
| Vendor dependency | Low (own model) | High (API providers) |
Prompts are code: version them in git, review them in PRs, test them before merging
Prompt CI/CD pipeline: commit prompt change → run eval suite → compare to baseline → block merge if score drops → deploy if passing
Prompt registry: store prompts with versions, tags (production/staging/experiment), eval scores, and deployment history
Example: prompt versioning workflow
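A minimal sketch of that workflow, assuming a content-hash versioning scheme and a placeholder eval harness (`run_eval_suite`, `ci_gate`, and the 0.90 baseline are all illustrative names, not a real tool's API):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class PromptVersion:
    name: str
    text: str
    tag: str = "experiment"   # lifecycle: experiment -> staging -> production

    @property
    def version(self) -> str:
        # Content-addressed version: identical prompt text -> identical ID.
        return hashlib.sha256(self.text.encode()).hexdigest()[:12]

def run_eval_suite(prompt: PromptVersion) -> float:
    """Placeholder: score the prompt against a golden set, 0.0-1.0."""
    return 0.93

def ci_gate(candidate: PromptVersion, baseline_score: float) -> bool:
    """Block the merge if the candidate scores below the production baseline."""
    return run_eval_suite(candidate) >= baseline_score
```

In CI, `ci_gate` would run on every PR touching a prompt file, with the baseline score taken from the prompt currently tagged `production` in the registry.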
Tools: LangSmith (prompt hub), Braintrust (prompt management), PromptFoo (YAML-based testing), W&B Prompts
Model registry: catalog of all models in use with: model ID, version, provider, deployment date, eval scores, deprecation date
For fine-tuned models: include training data version, training config, validation metrics, and lineage (base model → fine-tuned model)
Deprecation management: subscribe to provider changelogs, test new model versions on your golden set before forced migration, maintain fallback model config
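The pre-migration test can be sketched as a golden-set comparison gate; `judge` is a stub standing in for an LLM-as-judge call, and all model names and scores here are illustrative:

```python
# Before a provider deprecation forces a swap, score the candidate model on
# the same golden set as the incumbent and gate the migration on the result.
GOLDEN_SET = [
    {"input": "Summarize this ticket.", "reference": "short summary"},
    {"input": "Classify this query.", "reference": "billing"},
]

def judge(model_id: str, example: dict) -> float:
    """Stub judge returning a 0-1 score for one example."""
    return 0.90 if model_id == "candidate-model" else 0.85

def golden_set_score(model_id: str) -> float:
    scores = [judge(model_id, ex) for ex in GOLDEN_SET]
    return sum(scores) / len(scores)

def safe_to_migrate(current: str, candidate: str, tolerance: float = 0.02) -> bool:
    # Migrate only if the candidate is within `tolerance` of the current model.
    return golden_set_score(candidate) >= golden_set_score(current) - tolerance
```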
Example: model registry entry schema
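One possible shape for such an entry, as a dataclass; the field names follow the list above but are assumptions, not any specific registry product's schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelRegistryEntry:
    model_id: str                      # e.g. a provider model string
    version: str
    provider: str                      # "openai", "anthropic", "self-hosted", ...
    deployed_at: str                   # ISO 8601 date
    eval_scores: dict[str, float] = field(default_factory=dict)
    deprecation_date: Optional[str] = None
    # Lineage fields, populated for fine-tuned models only:
    base_model: Optional[str] = None
    training_data_version: Optional[str] = None
    training_config: Optional[dict] = None

entry = ModelRegistryEntry(
    model_id="support-ft-v3",
    version="3.0.1",
    provider="openai",
    deployed_at="2025-01-15",
    eval_scores={"golden_set": 0.91},
    base_model="gpt-4o-mini",
    training_data_version="tickets-2024-12",
)
```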
What to monitor: (1) latency (TTFT, total latency, P99), (2) cost (tokens in/out, per-request cost), (3) error rates (API errors, timeouts, retries), (4) output quality (LLM judge scores, user feedback), (5) input distribution (prompt length, query type shifts)
Drift signals: quality score drops, latency spikes, unexpected token count changes, new error codes from provider
Alerting thresholds: quality drops >5% → page on-call; latency P99 >2× baseline → alert; error rate >1% → alert
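The thresholds above as a small check function; the metric dict structure is an assumption, to be wired to whatever metrics backend is in use:

```python
def check_alerts(current: dict, baseline: dict) -> list[str]:
    """Return alert strings for any threshold breached this window."""
    alerts = []
    if current["quality"] < baseline["quality"] * 0.95:       # >5% quality drop
        alerts.append("PAGE: quality dropped >5% vs baseline")
    if current["latency_p99"] > baseline["latency_p99"] * 2:  # P99 > 2x baseline
        alerts.append("ALERT: latency P99 >2x baseline")
    if current["error_rate"] > 0.01:                          # >1% errors
        alerts.append("ALERT: error rate >1%")
    return alerts
```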
| Layer | Metric | Tool |
|---|---|---|
| Infrastructure | GPU utilization, memory | Prometheus, Grafana |
| API provider | Latency, error rate, cost | Helicone, LangSmith |
| Application | Tokens/request, quality score | Custom + Langfuse |
| User | Thumbs up/down, session length | Custom analytics |
Canary deploy: send 5% of traffic to new model version. Monitor quality and latency for 24 hours. Roll back if metrics degrade.
Shadow testing: new model receives a copy of every request, but its outputs are never served to users. Compare them to the current model's outputs side-by-side offline.
A/B test: split traffic 50/50, measure business metrics (conversion, satisfaction, engagement)
Example: routing logic for canary deploy
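A sketch of the canary split, assuming deterministic per-user bucketing so a given user consistently hits the same version throughout the canary window (model names are placeholders):

```python
import hashlib

CANARY_PERCENT = 5

def route(user_id: str, stable: str = "model-v1", canary: str = "model-v2") -> str:
    # Hash the user ID into buckets 0-99; the lowest buckets go to the canary.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return canary if bucket < CANARY_PERCENT else stable
```

Hash-based bucketing beats random sampling here: the canary cohort is stable across requests, so a quality regression shows up in the same users' sessions rather than being smeared across everyone.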
Token cost optimization: cache common prefixes (system prompt + few-shot examples), compress retrieved context, use cheaper models for simple subtasks
Prompt caching: Anthropic exposes explicit cache_control breakpoints, OpenAI applies automatic prefix caching; either way the static system prompt is processed once and reused across requests
Tiered routing: classify query complexity, route simple queries to mini/flash tier, complex to flagship tier
| Technique | Savings | Complexity | Risk |
|---|---|---|---|
| Prefix caching | 50–90% on cached tokens | Low | None |
| Tiered routing | 40–80% | Medium | Quality regression on edge cases |
| Output compression | 10–30% | Low | Information loss |
| Shorter system prompts | 5–20% | Low | Quality regression |
| Smaller model for subsets | 30–70% | Medium | Quality regression |
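The tiered-routing row above can be sketched as a cheap complexity classifier plus a tier-to-model map; the length/keyword heuristic and model names are illustrative assumptions (production systems often use a small classifier model instead):

```python
TIERS = {"simple": "mini-model", "complex": "flagship-model"}

COMPLEX_MARKERS = ("analyze", "compare", "multi-step", "explain why")

def classify(query: str) -> str:
    # Heuristic: long queries or reasoning keywords go to the flagship tier.
    q = query.lower()
    if len(q.split()) > 50 or any(m in q for m in COMPLEX_MARKERS):
        return "complex"
    return "simple"

def route_query(query: str) -> str:
    return TIERS[classify(query)]
```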
LLM-specific incidents: provider outage (API returns 500/503), model behavior change after silent update, prompt injection attack triggering unexpected behavior, cost spike from abnormally long inputs
Runbook entries: for each incident type — detection signal, first responder action, escalation path, rollback procedure
Circuit breaker: if error rate >5% for 60 seconds → switch to fallback model automatically
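That rule can be sketched as a sliding-window breaker; model names are placeholders, and the `now` parameter exists only to make the window testable:

```python
import time
from collections import deque
from typing import Optional

class CircuitBreaker:
    """Switch to a fallback model when the windowed error rate exceeds 5%."""

    def __init__(self, threshold: float = 0.05, window_s: float = 60.0,
                 primary: str = "primary-model", fallback: str = "fallback-model"):
        self.threshold = threshold
        self.window_s = window_s
        self.primary = primary
        self.fallback = fallback
        self.events: deque = deque()   # (timestamp, ok)

    def record(self, ok: bool, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self.events.append((now, ok))
        # Evict events that fell out of the window.
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()

    def model(self) -> str:
        if not self.events:
            return self.primary
        errors = sum(1 for _, ok in self.events if not ok)
        return self.fallback if errors / len(self.events) > self.threshold else self.primary
```

A production version would also need a recovery path (half-open probes back to the primary) so the breaker doesn't stay latched on the fallback forever.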
Post-mortems: document what changed, what signal detected it, how long until detection, remediation taken