Production · Operations

LLM Monitoring & Observability

Tracking model quality, cost, latency, and drift in production — from traces to dashboards

4 signal types
8 sections
8 tools
Contents
  1. Why LLM monitoring differs
  2. The 4 signal types
  3. Tracing & logging
  4. Quality scoring in production
  5. Cost & token management
  6. Drift detection
  7. Tools & platforms
  8. References
01 — Context

Why LLM Monitoring Differs

Traditional ML systems emit easily measured signals: prediction, ground truth, latency. LLMs are harder. Outputs are non-deterministic — the same prompt twice yields different responses. There's no single ground truth; quality is subjective and emerges over multiple interactions. Cost is per-call, not per-batch, and scales with context window size. Latency is multi-stage: tokenization, inference, post-processing.

Traditional monitoring asks: "Is the model accurate?" LLM monitoring asks: "Is this response helpful? Honest? Safe? Expensive? Is the model degrading over time?"

💡 Three pillars of LLM observability: Tracing (what happened), Scoring (how good was it), and Alerting (when to act).
02 — Metrics

The 4 Signal Types

| Signal  | What to track | How to measure | Alert threshold |
|---------|---------------|----------------|-----------------|
| Quality | Response correctness, helpfulness, safety | LLM-as-judge, user ratings, fallback rates | Judge score < 7/10 or unsafe content detected |
| Cost | Token usage, model routing, cache hits | Input tokens + output tokens, cache efficiency | Cost per interaction > budget, or a sudden trend spike |
| Latency | End-to-end time, time to first token (TTFT), tokens per second (TPS) | p50, p95, p99 in milliseconds | p95 latency > 5 sec or SLA breach |
| Safety | Hallucinations, injection attacks, drift | Automated checks, manual review samples, user flags | Hallucination rate > 2% or injection attempt detected |
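The thresholds in the table can be wired into a simple per-interaction check. This is a minimal sketch, not a production alerting system; the `InteractionMetrics` fields and the `$0.01` default budget are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class InteractionMetrics:
    judge_score: float       # 0-10 quality rating from an LLM judge
    cost_usd: float          # total cost of this interaction
    p95_latency_ms: float    # rolling p95 latency for the endpoint
    hallucination_flag: bool # set by automated safety checks

def alerts(m: InteractionMetrics, budget_usd: float = 0.01) -> list[str]:
    """Return the names of the alert thresholds this interaction breached."""
    fired = []
    if m.judge_score < 7.0:
        fired.append("quality")
    if m.cost_usd > budget_usd:
        fired.append("cost")
    if m.p95_latency_ms > 5000:
        fired.append("latency")
    if m.hallucination_flag:
        fired.append("safety")
    return fired

# A fast, cheap, safe response that scored poorly fires only the quality alert:
# alerts(InteractionMetrics(6.5, 0.002, 1450, False)) -> ["quality"]
```

In practice these checks run over aggregates (rolling windows, percentiles) rather than single interactions, but the threshold logic is the same.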

Why All Four Matter

Quality is business value. Cost is business sustainability. Latency is user experience. Safety is risk. Monitor all four; no single signal implies the others. A model can be fast and cheap but produce hallucinations, or accurate but prohibitively expensive.

03 — Instrumentation

Tracing & Logging

A trace is a complete record of a single request: prompt → LLM call → tool call → response → user feedback. Modern LLM tracing uses OpenTelemetry spans to capture hierarchical causality. Each LLM call is a span; tool calls are child spans. This structure powers debugging and root-cause analysis.

Trace Structure

Request trace
├── User span
│   ├── Prompt sanitization span
│   ├── LLM call span
│   │   ├── Input tokens: 127
│   │   ├── Output tokens: 45
│   │   ├── Latency: 1240ms
│   │   └── Model: gpt-4-turbo
│   ├── Tool use span (if called)
│   │   ├── Tool: search
│   │   ├── Args: {"query": "..."}
│   │   └── Result: 3 docs
│   └── Post-processing span
└── Metrics: quality=8.2, cost=$0.002, latency=1450ms

Python Instrumentation with OpenTelemetry

import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Wire the SDK: provider -> exporter -> processor
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(JaegerExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# `client` is an already-configured OpenAI client
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("model", "gpt-4")
    span.set_attribute("input_tokens", 127)
    start = time.monotonic()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[...],
    )
    elapsed_ms = (time.monotonic() - start) * 1000
    # Prefer the API's token count over approximating with split()
    span.set_attribute("output_tokens", response.usage.completion_tokens)
    span.set_attribute("latency_ms", elapsed_ms)

Sampling Strategies

Log everything: low volume; suitable for debugging. Random 10%: balanced coverage and cost. Stratified by model: sample more from cheaper models, less from expensive ones. Error-driven: always log failures; sample successes randomly.
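The error-driven strategy reduces to a one-line decision per trace. A minimal sketch, assuming each trace is a dict with an `error` flag (a hypothetical shape; adapt to your trace schema):

```python
import random

def should_log(trace: dict, success_rate: float = 0.1) -> bool:
    """Error-driven sampling: keep every failure, sample successes.

    The 10% default for successes mirrors the "Random 10%" strategy;
    stratified sampling would vary `success_rate` per model or user tier.
    """
    if trace.get("error"):
        return True                        # always keep failures
    return random.random() < success_rate  # sample successes randomly
```

The decision must happen at write time, on the error flag alone; filtering on quality or safety scores after the fact reintroduces exactly the bias warned about below.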

⚠️ Avoid sampling bias. Random sampling is simple but stratified sampling (by user, model, latency) gives better coverage. Never filter logs by quality or safety signals after they occur — that hides the problems you need to detect.
04 — Automated Evaluation

Quality Scoring in Production

Asking users "Was this helpful?" is slow and yields sparse signal. LLM-as-judge scores responses automatically: prompt (or fine-tune) an LLM to rate other LLM outputs on a rubric covering correctness, helpfulness, safety, and tone. Score every response in near real time, and compare against human labels to calibrate.

Rubric Design

Clarity: each dimension must be observable from the text alone. "Helpful" is vague; "answers the user's question without hallucinating" is concrete. Completeness: cover the business-critical dimensions. Reproducibility: the same response should score similarly across judges.

Rubric for customer support responses:
  1. Answers the question directly (0–1)
  2. Mentions relevant product features (0–2)
  3. No hallucinated features (0–3)
  4. Tone is professional and empathetic (0–1)
  Total: 0–7 points
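A judge for this rubric can be as simple as a formatted prompt plus JSON parsing. This is a sketch under assumptions: the model name, the JSON response shape, and the expectation that the judge returns valid JSON are all illustrative, not a specific vendor's API contract beyond the OpenAI-style `chat.completions.create` call.

```python
import json

# The support rubric from above, embedded in a judge prompt.
JUDGE_PROMPT = """Rate the support response below on this rubric:
1. Answers the question directly (0-1)
2. Mentions relevant product features (0-2)
3. No hallucinated features (0-3)
4. Tone is professional and empathetic (0-1)
Return JSON only: {{"scores": [s1, s2, s3, s4], "total": <sum>}}

Question: {question}
Response: {response}"""

def judge(client, question: str, response: str) -> dict:
    """Score one response; returns the parsed rubric scores (0-7 total)."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap judge; swap per cost/accuracy needs
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  response=response)}],
    )
    return json.loads(reply.choices[0].message.content)
```

In production you would add retry-on-malformed-JSON handling and log the judge's raw output alongside the parsed scores.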

Calibration Against Human Labels

Sample 200 responses. Have humans score them. Have the judge model score the same 200. Calculate agreement: Cohen's kappa, correlation, or agreement rate. Iterate on the rubric until agreement > 0.85. Then deploy the judge to production and monitor score drift quarterly.

💡 Judge cost vs accuracy: A cheap model (Claude Haiku) scores faster and cheaper but with lower accuracy. A strong model (GPT-4) is slower and more expensive but more reliable. Many systems use a tiered approach: cheap judge for initial filtering, expensive judge for borderline cases.
05 — Economics

Cost & Token Management

Each LLM call costs money. Input tokens are cheaper than output tokens. Long context windows increase cost. In production, cost compounds: one heavy user interaction might read 50KB of context, inflate your batch size, or trigger expensive retries.

Model Pricing & ROI

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Speed (tok/sec) | Quality |
|-------|-----------------------|------------------------|-----------------|---------|
| Claude 3.5 Haiku | $0.80 | $4.00 | 40 | Good |
| Claude 3 Opus | $15.00 | $75.00 | 15 | Excellent |
| GPT-4 Turbo | $10.00 | $30.00 | 12 | Excellent |
| Mixtral 8x7B | $0.27 | $0.81 | 25 | Fair |

Cost Control Strategies

1

Model Routing — choose the right tool

Route simple requests to fast, cheap models (Haiku). Route complex queries to capable models (Opus). Measure: effort required + quality threshold.

2

Prompt Caching — reuse computation

Cache system prompts, documentation, and conversation histories. Cache hits cost 10% of normal tokens. With repeated context (RAG, retrieval results), caching reduces costs 30–50%.

3

Token Budgets — enforce quotas

Per-user monthly budgets. Per-request context length limits. If query exceeds budget, degrade gracefully (shorter context, faster model, cached response).

4

Speculative Decoding — predict ahead

A fast draft model proposes several tokens; the strong model verifies them in one pass. Total tokens generated are similar, so this is primarily a latency optimization; it reduces cost only in some serving architectures.

Example cost optimization: a system processes 100K requests/day at an average of $0.005/request = $500/day. Implement routing: 60% of traffic to Haiku ($0.0008 avg) and 40% to Opus ($0.015 avg), a blended 0.6 × $0.0008 + 0.4 × $0.015 = $0.00648/request = $648/day. Per-request cost rises because hard queries now hit Opus, but quality on easy queries doesn't degrade. Prompt caching then cuts token spend 40%: final cost ≈ $389/day, roughly 22% below the baseline with quality controls in place.
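A blended routing-plus-caching cost is easy to sanity-check in code. The sketch below uses the illustrative prices and traffic split from this section; `daily_cost` is a hypothetical helper, not a billing API.

```python
def daily_cost(requests: int, split: dict[str, float],
               price: dict[str, float], cache_discount: float = 0.0) -> float:
    """Blended daily cost: sum over models of (traffic share * per-request
    price), then remove the fraction of spend saved by prompt caching."""
    per_request = sum(share * price[m] for m, share in split.items())
    return requests * per_request * (1 - cache_discount)

baseline = daily_cost(100_000, {"single": 1.0}, {"single": 0.005})
routed = daily_cost(100_000, {"haiku": 0.6, "opus": 0.4},
                    {"haiku": 0.0008, "opus": 0.015}, cache_discount=0.4)
# baseline = 500.0, routed ≈ 388.8
```

Keeping this model in a notebook makes it cheap to re-run whenever provider prices or your traffic split change.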

06 — Maintenance

Drift Detection

Drift is silent model degradation. Input distribution shifts: users ask different questions. Output quality drifts: the model behaves differently (bug in prompt? model update? world changed?). Detect drift by tracking distributions over time.

Types of Drift

Input drift: the distribution of user queries shifts; users ask questions the system wasn't designed for. Output drift: the model behaves differently even on stable inputs, caused by a prompt bug, a provider-side model update, or a changed world.

Re-evaluation Triggers

Monthly: Compare judge scores this month vs. last month. If difference > 0.3 points, investigate. Quarterly: Re-sample 500 responses, have humans score them, compare to judge. If agreement < 0.80, recalibrate. After model updates: Always re-eval on a canary traffic sample before rolling out.

💡 Drift baseline: Statistical tests (Kolmogorov-Smirnov, chi-squared) detect distribution shift. But small shifts that impact users are often invisible statistically. Always pair numerical drift detection with qualitative review: read 20 sampled responses every month.
07 — Ecosystem

Tools & Platforms

Langfuse

Open-source LLM tracing & analytics. Native integration with LangChain, LlamaIndex. Real-time dashboards, prompt management, cost tracking.

Arize Phoenix

ML observability for LLMs. Embedding drift detection, quality scoring, cost analysis. Integrates with major inference APIs.

Helicone

LLM monitoring proxy. Drop-in replacement for OpenAI, Anthropic APIs. Traces, cost breakdowns, caching, rate limiting built-in.

LangSmith

LangChain's tracing & evaluation platform. Runs evals at scale, compares model versions, surfaces problematic traces.

Weights & Biases

ML experiment tracking with LLM extensions. Log traces, runs, scores. Compare model versions and deployment impact.

OpenTelemetry

Open standard for observability. Vendor-agnostic tracing, metrics, logs. Export to Datadog, Grafana, Jaeger, New Relic.

Prometheus

Time-series metrics database. Scrape metrics from apps, build alerts. Standard in Kubernetes environments.

Grafana

Visualization and dashboarding. Works with Prometheus, InfluxDB, Elasticsearch. Create custom alerts and SLOs.

08 — Further Reading

References
