Production · Operations

LLM Monitoring & Observability

Tracking model quality, cost, latency, and drift in production — from traces to dashboards

4 signal types
8 sections
8 tools
Contents
  1. Why LLM monitoring differs
  2. The 4 signal types
  3. Tracing & logging
  4. Quality scoring in production
  5. Cost & token management
  6. Drift detection
  7. Tools & platforms
  8. References
01 — Context

Why LLM Monitoring Differs

Traditional ML systems emit easily measured signals: prediction, ground truth, latency. LLMs are harder. Outputs are non-deterministic — the same prompt twice yields different responses. There's no single ground truth; quality is subjective and emerges over multiple interactions. Cost is per-call, not per-batch, and scales with context window size. Latency is multi-stage: tokenization, inference, post-processing.

Traditional monitoring asks: "Is the model accurate?" LLM monitoring asks: "Is this response helpful? Honest? Safe? Expensive? Is the model degrading over time?"

💡 Three pillars of LLM observability: Tracing (what happened), Scoring (how good was it), and Alerting (when to act).
02 — Metrics

The 4 Signal Types

| Signal  | What to track | How to measure | Alert threshold |
|---------|---------------|----------------|-----------------|
| Quality | Response correctness, helpfulness, safety | LLM-as-judge, user ratings, fallback rates | Judge score < 7/10 or unsafe content detected |
| Cost | Token usage, model routing, cache hits | Input tokens + output tokens, cache efficiency | Cost per interaction > budget, or a sudden trend spike |
| Latency | End-to-end time, time to first token (TTFT), tokens per second (TPS) | p50, p95, p99 in milliseconds | p95 latency > 5 sec or SLA breach |
| Safety | Hallucinations, injection attacks, drift | Automated checks, manual review samples, user flags | Hallucination rate > 2% or injection attempt detected |
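The thresholds in the table can be wired into a simple per-interaction check. This is a minimal sketch, not a production alerting system; the `InteractionMetrics` fields and the `$0.01` default budget are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class InteractionMetrics:
    judge_score: float       # 0-10 quality rating from an LLM judge
    cost_usd: float          # total cost of this interaction
    p95_latency_ms: float    # rolling p95 latency for the endpoint
    hallucination_flag: bool # set by automated safety checks

def alerts(m: InteractionMetrics, budget_usd: float = 0.01) -> list[str]:
    """Return the names of the alert thresholds this interaction breached."""
    fired = []
    if m.judge_score < 7.0:
        fired.append("quality")
    if m.cost_usd > budget_usd:
        fired.append("cost")
    if m.p95_latency_ms > 5000:
        fired.append("latency")
    if m.hallucination_flag:
        fired.append("safety")
    return fired

# A fast, cheap, safe response that scored poorly fires only the quality alert:
# alerts(InteractionMetrics(6.5, 0.002, 1450, False)) -> ["quality"]
```

In practice these checks run over aggregates (rolling windows, percentiles) rather than single interactions, but the threshold logic is the same.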

Why All Four Matter

Quality is business value. Cost is business sustainability. Latency is user experience. Safety is risk. Monitor all four; no single signal implies the others. A model can be fast and cheap but produce hallucinations, or accurate but prohibitively expensive.

03 — Instrumentation

Tracing & Logging

A trace is a complete record of a single request: prompt → LLM call → tool call → response → user feedback. Modern LLM tracing uses OpenTelemetry spans to capture hierarchical causality. Each LLM call is a span; tool calls are child spans. This structure powers debugging and root-cause analysis.

Trace Structure

Request trace
├── User span
│   ├── Prompt sanitization span
│   ├── LLM call span
│   │   ├── Input tokens: 127
│   │   ├── Output tokens: 45
│   │   ├── Latency: 1240ms
│   │   └── Model: gpt-4-turbo
│   ├── Tool use span (if called)
│   │   ├── Tool: search
│   │   ├── Args: {"query": "..."}
│   │   └── Result: 3 docs
│   └── Post-processing span
└── Metrics: quality=8.2, cost=$0.002, latency=1450ms

Python Instrumentation with OpenTelemetry

import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

# Wire the SDK: provider -> exporter -> processor
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(JaegerExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# `client` is an already-configured OpenAI client
with tracer.start_as_current_span("llm_call") as span:
    span.set_attribute("model", "gpt-4")
    span.set_attribute("input_tokens", 127)
    start = time.monotonic()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[...],
    )
    elapsed_ms = (time.monotonic() - start) * 1000
    # Prefer the API's token count over approximating with split()
    span.set_attribute("output_tokens", response.usage.completion_tokens)
    span.set_attribute("latency_ms", elapsed_ms)

Sampling Strategies

Log everything: low volume; suitable for debugging. Random 10%: balanced coverage and cost. Stratified by model: sample more from cheaper models, less from expensive ones. Error-driven: always log failures; sample successes randomly.
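The error-driven strategy reduces to a one-line decision per trace. A minimal sketch, assuming each trace is a dict with an `error` flag (a hypothetical shape; adapt to your trace schema):

```python
import random

def should_log(trace: dict, success_rate: float = 0.1) -> bool:
    """Error-driven sampling: keep every failure, sample successes.

    The 10% default for successes mirrors the "Random 10%" strategy;
    stratified sampling would vary `success_rate` per model or user tier.
    """
    if trace.get("error"):
        return True                        # always keep failures
    return random.random() < success_rate  # sample successes randomly
```

The decision must happen at write time, on the error flag alone; filtering on quality or safety scores after the fact reintroduces exactly the bias warned about below.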

⚠️ Avoid sampling bias. Random sampling is simple but stratified sampling (by user, model, latency) gives better coverage. Never filter logs by quality or safety signals after they occur — that hides the problems you need to detect.
04 — Automated Evaluation

Quality Scoring in Production

Asking users "Was this helpful?" is slow and yields sparse signal. LLM-as-judge scores responses automatically: prompt (or fine-tune) an LLM to rate other LLM outputs on a rubric covering correctness, helpfulness, safety, and tone. Score every response in near real time, and compare against human labels to calibrate.

Rubric Design

Clarity: each dimension must be observable from the text alone. "Helpful" is vague; "answers the user's question without hallucinating" is concrete. Completeness: cover the business-critical dimensions. Reproducibility: the same response should score similarly across judges.

Rubric for customer support responses:
  1. Answers the question directly (0–1)
  2. Mentions relevant product features (0–2)
  3. No hallucinated features (0–3)
  4. Tone is professional and empathetic (0–1)
  Total: 0–7 points
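A judge for this rubric can be as simple as a formatted prompt plus JSON parsing. This is a sketch under assumptions: the model name, the JSON response shape, and the expectation that the judge returns valid JSON are all illustrative, not a specific vendor's API contract beyond the OpenAI-style `chat.completions.create` call.

```python
import json

# The support rubric from above, embedded in a judge prompt.
JUDGE_PROMPT = """Rate the support response below on this rubric:
1. Answers the question directly (0-1)
2. Mentions relevant product features (0-2)
3. No hallucinated features (0-3)
4. Tone is professional and empathetic (0-1)
Return JSON only: {{"scores": [s1, s2, s3, s4], "total": <sum>}}

Question: {question}
Response: {response}"""

def judge(client, question: str, response: str) -> dict:
    """Score one response; returns the parsed rubric scores (0-7 total)."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap judge; swap per cost/accuracy needs
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  response=response)}],
    )
    return json.loads(reply.choices[0].message.content)
```

In production you would add retry-on-malformed-JSON handling and log the judge's raw output alongside the parsed scores.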

Calibration Against Human Labels

Sample 200 responses. Have humans score them. Have the judge model score the same 200. Calculate agreement: Cohen's kappa, correlation, or agreement rate. Iterate on the rubric until agreement > 0.85. Then deploy the judge to production and monitor score drift quarterly.

💡 Judge cost vs accuracy: A cheap model (Claude Haiku) scores faster and cheaper but with lower accuracy. A strong model (GPT-4) is slower and more expensive but more reliable. Many systems use a tiered approach: cheap judge for initial filtering, expensive judge for borderline cases.
05 — Economics

Cost & Token Management

Each LLM call costs money. Input tokens are cheaper than output tokens. Long context windows increase cost. In production, cost compounds: one heavy user interaction might read 50KB of context, inflate your batch size, or trigger expensive retries.

Model Pricing & ROI

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Speed (tok/sec) | Quality |
|-------|-----------------------|------------------------|-----------------|---------|
| Claude 3.5 Haiku | $0.80 | $4.00 | 40 | Good |
| Claude 3 Opus | $15.00 | $75.00 | 15 | Excellent |
| GPT-4 Turbo | $10.00 | $30.00 | 12 | Excellent |
| Mixtral 8x7B | $0.27 | $0.81 | 25 | Fair |

Cost Control Strategies

1

Model Routing — choose the right tool

Route simple requests to fast, cheap models (Haiku). Route complex queries to capable models (Opus). Measure: effort required + quality threshold.

2

Prompt Caching — reuse computation

Cache system prompts, documentation, and conversation histories. Cache hits cost 10% of normal tokens. With repeated context (RAG, retrieval results), caching reduces costs 30–50%.

3

Token Budgets — enforce quotas

Per-user monthly budgets. Per-request context length limits. If query exceeds budget, degrade gracefully (shorter context, faster model, cached response).

4

Speculative Decoding — predict ahead

A fast draft model proposes several tokens; the strong model verifies them in one pass. Total tokens generated are similar, so this is primarily a latency optimization; it reduces cost only in some serving architectures.

Example cost optimization: a system processes 100K requests/day at an average of $0.005/request = $500/day. Implement routing: 60% of traffic to Haiku ($0.0008 avg) and 40% to Opus ($0.015 avg), a blended 0.6 × $0.0008 + 0.4 × $0.015 = $0.00648/request = $648/day. Per-request cost rises because hard queries now hit Opus, but quality on easy queries doesn't degrade. Prompt caching then cuts token spend 40%: final cost ≈ $389/day, roughly 22% below the baseline with quality controls in place.
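A blended routing-plus-caching cost is easy to sanity-check in code. The sketch below uses the illustrative prices and traffic split from this section; `daily_cost` is a hypothetical helper, not a billing API.

```python
def daily_cost(requests: int, split: dict[str, float],
               price: dict[str, float], cache_discount: float = 0.0) -> float:
    """Blended daily cost: sum over models of (traffic share * per-request
    price), then remove the fraction of spend saved by prompt caching."""
    per_request = sum(share * price[m] for m, share in split.items())
    return requests * per_request * (1 - cache_discount)

baseline = daily_cost(100_000, {"single": 1.0}, {"single": 0.005})
routed = daily_cost(100_000, {"haiku": 0.6, "opus": 0.4},
                    {"haiku": 0.0008, "opus": 0.015}, cache_discount=0.4)
# baseline = 500.0, routed ≈ 388.8
```

Keeping this model in a notebook makes it cheap to re-run whenever provider prices or your traffic split change.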

06 — Maintenance

Drift Detection

Drift is silent model degradation. Input distribution shifts: users ask different questions. Output quality drifts: the model behaves differently (bug in prompt? model update? world changed?). Detect drift by tracking distributions over time.

Types of Drift

Input drift: the distribution of user queries shifts; users ask questions the system wasn't designed for. Output drift: the model behaves differently even on stable inputs, caused by a prompt bug, a provider-side model update, or a changed world.

Re-evaluation Triggers

Monthly: Compare judge scores this month vs. last month. If difference > 0.3 points, investigate. Quarterly: Re-sample 500 responses, have humans score them, compare to judge. If agreement < 0.80, recalibrate. After model updates: Always re-eval on a canary traffic sample before rolling out.

💡 Drift baseline: Statistical tests (Kolmogorov-Smirnov, chi-squared) detect distribution shift. But small shifts that impact users are often invisible statistically. Always pair numerical drift detection with qualitative review: read 20 sampled responses every month.
07 — Ecosystem

Tools & Platforms

Langfuse

Open-source LLM tracing & analytics. Native integration with LangChain, LlamaIndex. Real-time dashboards, prompt management, cost tracking.

Arize Phoenix

ML observability for LLMs. Embedding drift detection, quality scoring, cost analysis. Integrates with major inference APIs.

Helicone

LLM monitoring proxy. Drop-in replacement for OpenAI, Anthropic APIs. Traces, cost breakdowns, caching, rate limiting built-in.

LangSmith

LangChain's tracing & evaluation platform. Runs evals at scale, compares model versions, surfaces problematic traces.

Weights & Biases

ML experiment tracking with LLM extensions. Log traces, runs, scores. Compare model versions and deployment impact.

OpenTelemetry

Open standard for observability. Vendor-agnostic tracing, metrics, logs. Export to Datadog, Grafana, Jaeger, New Relic.

Prometheus

Time-series metrics database. Scrape metrics from apps, build alerts. Standard in Kubernetes environments.

Grafana

Visualization and dashboarding. Works with Prometheus, InfluxDB, Elasticsearch. Create custom alerts and SLOs.

08 — Further Reading

References
