Tracking model quality, cost, latency, and drift in production — from traces to dashboards
Traditional ML systems emit easily measured signals: prediction, ground truth, latency. LLMs are harder. Outputs are non-deterministic — the same prompt twice yields different responses. There's no single ground truth; quality is subjective and emerges over multiple interactions. Cost is per-call, not per-batch, and scales with context window size. Latency is multi-stage: tokenization, inference, post-processing.
Traditional monitoring asks: "Is the model accurate?" LLM monitoring asks: "Is this response helpful? Honest? Safe? Expensive? Is the model degrading over time?"
| Signal | What to track | How to measure | Alert threshold |
|---|---|---|---|
| Quality | Response correctness, helpfulness, safety | LLM-as-judge, user ratings, fallback rates | Judge score < 7/10 or unsafe content detected |
| Cost | Token usage, model routing, cache hits | Input tokens + output tokens, cache efficiency | Cost per interaction > budget or trend spike |
| Latency | End-to-end time, time to first token (TTFT), tokens per second (TPS) | p50, p95, p99 milliseconds | p95 latency > 5 s or SLA breach |
| Safety | Hallucinations, injection attacks, drift | Automated checks, manual review samples, user flags | Hallucination rate > 2% or injection attempt detected |
Quality is business value. Cost is business sustainability. Latency is user experience. Safety is risk. Monitor all four; they're not orthogonal. A model can be fast and cheap but produce hallucinations. It can be accurate but prohibitively expensive.
A trace is a complete record of a single request: prompt → LLM call → tool call → response → user feedback. Modern LLM tracing uses OpenTelemetry spans to capture hierarchical causality. Each LLM call is a span; tool calls are child spans. This structure powers debugging and root-cause analysis.
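A minimal sketch of that hierarchy, using plain dataclasses as a stand-in for real OpenTelemetry spans (names and attributes are illustrative, not the otel API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Simplified stand-in for an OpenTelemetry span."""
    name: str
    parent: Optional["Span"] = None
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child(self, name: str, **attrs) -> "Span":
        """Create a child span, preserving parent/child causality."""
        s = Span(name=name, parent=self, attributes=attrs)
        self.children.append(s)
        return s

# One request = one root span; the LLM call and its tool call nest beneath it.
trace = Span("request", attributes={"user_id": "u-123"})
llm = trace.child("llm_call", model="claude-3-5-haiku", input_tokens=812)
tool = llm.child("tool_call", tool="web_search")
```

Because every span records its parent, a failure in `tool_call` can be walked back up to the exact LLM call and request that triggered it.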
Sampling strategies for traces:
- Log everything: low volume; suitable for debugging.
- Random 10%: balanced coverage and cost.
- Stratified by model: sample more from cheaper models, less from expensive ones.
- Error-driven: always log failures; sample successes randomly.
Asking users "Was this helpful?" takes time. LLM-as-judge scores responses automatically. Train or prompt an LLM to rate other LLM outputs on a rubric: correctness, helpfulness, safety, tone. Score every response in near real-time. Compare against human labels to calibrate.
A good judge rubric needs:
- Clarity: each dimension must be observable from the text alone. "Helpful" is vague; "answers the user's question without hallucinating" is concrete.
- Completeness: cover the business-critical dimensions.
- Reproducibility: the same response should score similarly across judges.
Sample 200 responses. Have humans score them. Have the judge model score the same 200. Calculate agreement: Cohen's kappa, correlation, or agreement rate. Iterate on the rubric until agreement > 0.85. Then deploy the judge to production and monitor score drift quarterly.
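Cohen's kappa, one of the agreement measures mentioned, corrects raw agreement for chance. A sketch for categorical labels:

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement between paired human and judge labels."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_freq, j_freq = Counter(human), Counter(judge)
    # Expected agreement if both raters labeled independently at their own base rates
    expected = sum((h_freq[l] / n) * (j_freq[l] / n) for l in set(human) | set(judge))
    if expected == 1.0:
        return 1.0  # both raters used a single label; agreement is trivially perfect
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 is perfect agreement; 0.0 means no better than chance, which is why a raw agreement rate alone can be misleading when one label dominates.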
Each LLM call costs money. Input tokens are cheaper than output tokens. Long context windows increase cost. In production, cost compounds: a single user interaction might read 50KB of context, inflate your batch size, or trigger expensive retries.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Speed (tok/sec) | Quality |
|---|---|---|---|---|
| Claude 3.5 Haiku | $0.80 | $4.00 | 40 | Good |
| Claude 3 Opus | $15.00 | $75.00 | 15 | Excellent |
| GPT-4 Turbo | $10.00 | $30.00 | 12 | Excellent |
| Mixtral 8x7B | $0.27 | $0.81 | 25 | Fair |
Route simple requests to fast, cheap models (Haiku). Route complex queries to capable models (Opus). Decide by estimated task effort and the quality threshold the use case demands.
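A routing sketch under these assumptions: prices come from the table above, and the complexity heuristic (length plus marker keywords) is purely illustrative.

```python
# Per-1M-token (input, output) prices in USD, from the table above.
PRICES = {"haiku": (0.80, 4.00), "opus": (15.00, 75.00)}

def route(query: str) -> str:
    """Toy heuristic: long or analysis-heavy queries go to the capable model."""
    complex_markers = ("analyze", "compare", "step by step", "explain why")
    if len(query) > 500 or any(m in query.lower() for m in complex_markers):
        return "opus"
    return "haiku"

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the table prices."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000
```

In practice the heuristic is often replaced by a small classifier, but the shape is the same: a cheap decision up front, then a cost function per call for accounting.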
Cache system prompts, documentation, and conversation histories. Cache hits cost 10% of normal tokens. With repeated context (RAG, retrieval results), caching reduces costs 30–50%.
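A sketch of the blended input cost, assuming cached reads bill at 10% of the normal token price as stated above:

```python
def effective_input_cost(tokens: int, price_per_m: float, cache_hit_rate: float,
                         cached_ratio: float = 0.10) -> float:
    """Blend full-price and cached-read input token costs.

    cached_ratio=0.10 reflects cache hits costing 10% of normal tokens.
    """
    full = tokens * (1 - cache_hit_rate) * price_per_m
    cached = tokens * cache_hit_rate * price_per_m * cached_ratio
    return (full + cached) / 1_000_000
```

At a 50% hit rate the effective input cost is 0.55× the uncached cost, a 45% reduction, consistent with the 30–50% range above.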
Per-user monthly budgets. Per-request context length limits. If query exceeds budget, degrade gracefully (shorter context, faster model, cached response).
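One way to sketch the graceful-degradation policy (field names and limits are assumptions):

```python
def apply_budget(estimated_cost: float, remaining_budget: float, request: dict) -> dict:
    """Degrade the request instead of rejecting it when it would exceed budget."""
    if estimated_cost <= remaining_budget:
        return request
    degraded = dict(request)
    degraded["model"] = "haiku"                      # fall back to the cheaper model
    degraded["max_context_tokens"] = min(request.get("max_context_tokens", 8000), 2000)
    degraded["allow_cached_response"] = True         # serve from cache if possible
    return degraded
```

The key design choice is that over-budget users still get an answer, just a cheaper one; a hard rejection tends to be worse for user experience than a shorter-context response.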
Speculative decoding: a fast draft model proposes multiple tokens; the strong model verifies them in one pass. Total tokens processed are similar, but latency improves, and cost drops in some architectures.
Example cost optimization: a system processes 100K requests/day at an average of $0.005/request = $500/day. Implement routing: 60% to Haiku ($0.0008 avg), 40% to Opus ($0.015 avg). New cost: $0.00648/request = $648/day, with no quality loss on the easy queries. Caching then cuts 40%: final ≈ $389/day, roughly 22% below the $500 baseline after quality controls.
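Working those numbers directly (100K requests/day, a 60/40 split at the listed per-request averages, then a 40% caching cut):

```python
requests_per_day = 100_000
baseline = requests_per_day * 0.005                          # $500/day

# 60% of traffic at Haiku's $0.0008/request, 40% at Opus's $0.015/request
routed = requests_per_day * (0.60 * 0.0008 + 0.40 * 0.015)   # $648/day
after_caching = routed * (1 - 0.40)                          # caching cuts 40%

savings = 1 - after_caching / baseline                       # fraction saved vs. baseline
```

Routing alone raises daily cost (the Opus share is expensive); it is the combination with caching that lands below the baseline.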
Drift is silent model degradation. Input distribution shifts: users ask different questions. Output quality drifts: the model behaves differently (bug in prompt? model update? world changed?). Detect drift by tracking distributions over time.
- Monthly: compare judge scores this month vs. last month. If the difference exceeds 0.3 points, investigate.
- Quarterly: re-sample 500 responses, have humans score them, and compare to the judge. If agreement < 0.80, recalibrate.
- After model updates: always re-evaluate on a canary traffic sample before rolling out.
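The monthly comparison reduces to a small check, using the 0.3-point threshold above:

```python
from statistics import mean

def judge_score_drift(last_month: list, this_month: list, threshold: float = 0.3) -> bool:
    """Flag for investigation when the mean judge score moves more than `threshold` points."""
    return abs(mean(this_month) - mean(last_month)) > threshold
```

Comparing means is the simplest form; comparing full score distributions (e.g. the share of responses below 7/10) catches drift that averages can hide.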
Tooling options:
- Open-source LLM tracing & analytics platforms: native integration with LangChain and LlamaIndex; real-time dashboards, prompt management, cost tracking.
- ML observability for LLMs: embedding drift detection, quality scoring, cost analysis; integrates with major inference APIs.
- LLM monitoring proxies: drop-in replacements for the OpenAI and Anthropic APIs; traces, cost breakdowns, caching, and rate limiting built in.
- LangSmith: LangChain's tracing & evaluation platform; runs evals at scale, compares model versions, surfaces problematic traces.
- ML experiment tracking with LLM extensions: log traces, runs, and scores; compare model versions and deployment impact.
- OpenTelemetry: open standard for observability; vendor-agnostic tracing, metrics, and logs; exports to Datadog, Grafana, Jaeger, New Relic.
- Prometheus: time-series metrics database; scrapes metrics from apps and drives alerts; standard in Kubernetes environments.
- Grafana: visualization and dashboarding; works with Prometheus, InfluxDB, Elasticsearch; supports custom alerts and SLOs.