EVALUATION & OBSERVABILITY

Evaluating LLM Systems

You can't improve what you don't measure — the discipline that separates prototypes from production

  • The stack: golden set + LLM judge
  • The lifecycle: offline → online
  • The rule: measure before you ship
Contents
  1. Why evaluation matters
  2. Offline evaluation
  3. RAG-specific evaluation
  4. LLM-as-judge pattern
  5. Production monitoring
  6. Child pages
  7. References
01 — Foundation

Why Evaluation Matters

You can't improve what you don't measure. Evaluation tells you if your model knows enough, if your RAG is retrieving the right chunks, and whether quality has quietly degraded since last week — before your users notice.

A typical LLM system has three stages of evaluation: offline (before deployment), canary (small traffic), and production (full rollout). Without rigorous offline evaluation, bugs reach production. Without monitoring, quality drift goes undetected for weeks.

Golden evaluation set: Build your golden set from day 1. Start with 20–30 representative questions and expected answers. Run the golden set every time you change a model, prompt, or pipeline — treat it like your test suite.

Three Levels of Measurement

Stage      | When                      | Method                         | Risk
Offline    | Before deployment         | Golden set, benchmarks         | Missing edge cases
Canary     | 5–10% production traffic  | LLM-as-judge, user feedback    | Statistical noise
Production | All traffic, ongoing      | Monitoring dashboards, alerts  | False positives
02 — Benchmarking

Offline Evaluation (Benchmarks + Golden Sets)

Offline evaluation is your first gate. You test on a fixed dataset of questions and expected answers before any code reaches production. It's fast, reproducible, and catches obvious failures.

Building a Golden Dataset

A golden dataset is a small (20–100 questions), curated collection of representative examples with ground-truth answers. Include edge cases: typos, ambiguous queries, multi-part questions, rare entities.

Golden dataset structure (JSON):

  [
    {
      "question": "What is our return policy?",
      "expected_answer": "30-day return window, no questions asked",
      "category": "policy"
    },
    {
      "question": "How long does shipping take to Alaska?",
      "expected_answer": "7-10 business days (standard), 2-3 business days (express)",
      "category": "shipping",
      "difficulty": "hard"
    },
    {
      "question": "What year was the company founded?",
      "expected_answer": "2019",
      "category": "company_facts",
      "difficulty": "easy"
    }
  ]

Offline Metrics

1. Exact Match — For factual answers
   Did the answer match the expected answer word-for-word or semantically? Best for Q&A with definitive answers (dates, names, policies).

2. BLEU / ROUGE — Overlap metrics
   Measure n-gram overlap between the generated and expected answers. Fast but shallow — high overlap doesn't guarantee correctness.

3. Semantic Similarity — Embedding-based
   Embed both the expected and actual answer, then compute cosine similarity. Captures meaning even when the wording differs.

4. LLM-as-Judge — Human-like grading
   Ask an LLM to grade the answer against the expected answer and a rubric. The most expressive option, but slower and costlier than the others.
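The first three metrics above need no model calls at all. A minimal sketch, using hand-rolled normalization, a unigram-F1 overlap score in the BLEU/ROUGE family, and a bag-of-words cosine as a cheap stand-in for embedding similarity (in practice you would embed with a real model):

```python
import re
from collections import Counter
from math import sqrt

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def exact_match(expected: str, actual: str) -> bool:
    """Metric 1: word-for-word match after normalization."""
    return normalize(expected) == normalize(actual)

def token_f1(expected: str, actual: str) -> float:
    """Metric 2 (overlap family): unigram precision/recall F1."""
    exp, act = normalize(expected).split(), normalize(actual).split()
    overlap = sum((Counter(exp) & Counter(act)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(act), overlap / len(exp)
    return 2 * p * r / (p + r)

def bow_cosine(expected: str, actual: str) -> float:
    """Metric 3 stand-in: bag-of-words cosine (swap in real embeddings)."""
    a, b = Counter(normalize(expected).split()), Counter(normalize(actual).split())
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Note how the overlap metric is shallow by construction: "return" and "returns" don't match, so a correct paraphrase can score poorly, which is exactly why semantic similarity and LLM-as-judge exist.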

03 — RAG Metrics

RAG-Specific Evaluation (RAGAS, TruLens, DeepEval)

RAG systems have unique evaluation challenges. You need to measure retrieval quality (did we get relevant documents?), answer quality, and grounding (is the answer supported by context?).

RAG Evaluation Frameworks

Framework
RAGAS
Reference-free metrics: faithfulness, answer relevancy, context precision/recall.
Framework
TruLens
Instrumentation + evals for LLM apps; tracing and metrics dashboards.
Framework
DeepEval
Synthetic test generation + metrics for RAG and LLM systems.
Framework
LlamaIndex Eval
Built-in evaluation suite for retrieval and generation pipelines.

Key RAG Metrics

  • Retrieval Precision (Context Precision): What fraction of retrieved chunks are actually relevant to the query?
  • Retrieval Recall (Context Recall): What fraction of ground-truth information was retrieved?
  • Faithfulness: Are the claims in the answer supported by the retrieved context?
  • Answer Relevancy: Does the final answer actually address the user's question?
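The two retrieval metrics are simple ratios once you have relevance labels. A minimal sketch, assuming each golden query comes with a hand-labeled set of relevant chunk IDs (frameworks like RAGAS estimate these reference-free instead):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant to the query."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of ground-truth relevant chunks that were retrieved."""
    if not relevant:
        return 1.0  # nothing to recall
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)

# Illustrative chunk IDs (not from any real index):
retrieved = ["chunk-a", "chunk-b", "chunk-c", "chunk-d"]
relevant = {"chunk-a", "chunk-c", "chunk-e"}
print(context_precision(retrieved, relevant))  # 2 of 4 retrieved are relevant
print(context_recall(retrieved, relevant))     # 2 of 3 relevant were retrieved
```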

ℹ️ Trade-offs: Retrieval precision and recall trade off — higher recall usually means more noise in the context. Typically you optimize for high recall (~0.8–0.9) at the retrieval stage, then use a reranker to filter down to the top 5 chunks for answer generation.
04 — Automated Grading

LLM-as-Judge Pattern

LLM-as-judge is the most practical approach for evaluating open-ended answers. Instead of hand-grading every output, you ask a strong model (Claude, GPT-4) to grade against a rubric. It's fast, scalable, and surprisingly accurate when given a good rubric.

from anthropic import Anthropic

client = Anthropic()

GOLDEN_SET = [
    {'question': 'What is your return policy?', 'expected': '30-day return window, no questions asked'},
    {'question': 'How long does shipping take?', 'expected': '3-5 business days for standard shipping'},
]

JUDGE_PROMPT = """Score this answer 1-5 against the expected answer.
5=Perfect, 4=Good (minor gaps), 3=Partial, 2=Mostly wrong, 1=Completely wrong.
Reply with only the number."""

def judge(question: str, expected: str, actual: str) -> int:
    resp = client.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=4,
        messages=[{'role': 'user', 'content': f'Question: {question}\nExpected: {expected}\nActual: {actual}\n\n{JUDGE_PROMPT}'}]
    )
    try:
        return int(resp.content[0].text.strip())
    except ValueError:
        return 0  # unparseable judge output counts as a failure

def evaluate(answer_fn) -> dict:
    """Run answer_fn over the golden set and average the judge scores."""
    scores = []
    for item in GOLDEN_SET:
        actual = answer_fn(item['question'])
        score = judge(item['question'], item['expected'], actual)
        scores.append({'q': item['question'], 'score': score})
    avg = sum(s['score'] for s in scores) / len(scores)
    return {'average': round(avg, 2), 'details': scores}

Judge Rubrics

A good rubric describes what each score level means and what the judge should look for. Be specific about what "good" looks like.

Rubric: Answer Correctness

  • 5: Answer is complete, accurate, and well-explained
  • 4: Answer is correct but missing minor details
  • 3: Answer is partially correct; major info missing
  • 2: Answer is mostly wrong or contradicts expected
  • 1: Answer is completely wrong or off-topic

Rubric: Grounding in Context

  • 5: Every claim is cited to retrieved context
  • 4: Most claims grounded; 1-2 unsupported
  • 3: Some claims grounded; several unsupported
  • 2: Most claims unsupported or hallucinated
  • 1: Completely hallucinated; no grounding
Python · LLM-as-judge with structured output scoring
from pydantic import BaseModel, Field
from openai import OpenAI

client = OpenAI()

class JudgmentResult(BaseModel):
    score: int = Field(ge=1, le=5, description="Quality score 1 (poor) to 5 (excellent)")
    reasoning: str = Field(description="One-sentence explanation of the score")
    passes_bar: bool = Field(description="True if score >= 3 (acceptable quality)")

def judge_response(question: str, response: str, criteria: str) -> JudgmentResult:
    system = f"""You are a calibrated evaluator. Use these criteria:
{criteria}
Score 1=wrong/harmful, 2=poor, 3=acceptable, 4=good, 5=excellent.
Be consistent: the same response quality should always get the same score."""

    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user",   "content": f"Question: {question}\n\nResponse:\n{response}"}
        ],
        response_format=JudgmentResult,
        temperature=0.0
    )
    return result.choices[0].message.parsed

# Example
j = judge_response(
    question="Explain what RAG is in 2 sentences.",
    response="RAG combines retrieval of relevant documents with LLM generation. It reduces hallucinations by grounding the model in external knowledge.",
    criteria="1. Factual accuracy  2. Completeness  3. Appropriate length"
)
print(f"Score: {j.score}/5 | Pass: {j.passes_bar}")
print(f"Reason: {j.reasoning}")
Tip: Test your judge rubric on a few golden examples first. Does the judge consistently give the scores you expect? Adjust the rubric if not.
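One way to run that check is to hand-score a handful of golden examples yourself, then compare the judge's scores against your labels. A small sketch (the score pairs below are illustrative):

```python
def calibration_report(pairs: list[tuple[int, int]]) -> dict:
    """pairs: (human_score, judge_score) on the same golden examples, 1-5 scale."""
    n = len(pairs)
    exact = sum(1 for h, j in pairs if h == j) / n
    within_one = sum(1 for h, j in pairs if abs(h - j) <= 1) / n
    bias = sum(j - h for h, j in pairs) / n  # > 0 means the judge is lenient
    return {"exact_agreement": exact, "within_one": within_one, "mean_bias": bias}

# Hypothetical labels from a hand-scored sample:
report = calibration_report([(5, 5), (4, 3), (3, 3), (2, 4), (1, 1)])
print(report)
```

If exact agreement is low or the mean bias is large, tighten the rubric (add concrete examples of each score level) before trusting the judge at scale.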
05 — Observability

Production Monitoring (Langfuse, LangSmith, W&B)

Offline evaluation gets you to deploy. Production monitoring ensures you catch quality regression before it affects users. Without monitoring, silent failures can persist for weeks.

Production Monitoring Tools

Observability
Langfuse
Open-source LLM observability: tracing, evals, cost tracking, dashboards.
Observability
LangSmith
LangChain's native platform for tracing, monitoring, and LLMOps.
Observability
Weights & Biases
ML monitoring for LLMs: evals, cost tracking, comparison dashboards.
Observability
Arize Phoenix
Production ML monitoring; detect drift and anomalies in real-time.

What to Monitor

1. Latency: P50, P95, P99 response times. Alert if P99 > threshold.
2. Cost per token: Track total and per-request cost across models.
3. Error rate: Crashes, API errors, timeouts. Alert if > 0.1%.
4. Quality metrics: LLM-as-judge scores on a production traffic sample (5–10%).
5. User feedback: Thumbs up/down, explicit ratings, complaint tickets.
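The latency percentiles in item 1 can be computed from raw samples with the standard library; a sketch assuming latencies are collected in milliseconds:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict:
    """P50/P95/P99 from raw latency samples (milliseconds)."""
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

def p99_alert(samples_ms: list[float], budget_ms: float) -> bool:
    """Item 1's alert condition: P99 above the latency budget."""
    return latency_percentiles(samples_ms)["p99"] > budget_ms
```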
⚠️ Sampling strategy: Running LLM-as-judge on every production request is expensive. Sample 5-10% of requests daily for evaluation. If quality drops below threshold, increase sampling frequency.
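Hash-based sampling is a simple way to implement this: the decision is deterministic per request, and raising the rate later only adds new requests to the sample rather than reshuffling it. A sketch (the 5% default mirrors the guidance above):

```python
import hashlib

def should_sample(request_id: str, rate: float = 0.05) -> bool:
    """Map the request ID to a stable value in [0, 1) and compare to the rate.
    The same request always gets the same decision, and every request sampled
    at 5% is still sampled if the rate is later raised to 10%."""
    h = int(hashlib.sha256(request_id.encode()).hexdigest()[:8], 16)
    return h / 0xFFFFFFFF < rate
```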

Detecting Quality Regression

Set up alerts for: (1) Judge score drops >5% day-over-day, (2) Retrieval precision drops below baseline, (3) Error rate spikes, (4) Latency increases >30%. When an alert fires, compare with recent code changes and revert if needed.
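Those four conditions can be wired into a single nightly check; a sketch with illustrative metric names (your monitoring tool's fields will differ):

```python
def regression_alerts(today: dict, baseline: dict) -> list[str]:
    """Compare today's metrics against yesterday's baseline; return fired alerts.
    Keys ('judge_score', 'retrieval_precision', ...) are hypothetical."""
    alerts = []
    if baseline["judge_score"] > 0 and \
            (baseline["judge_score"] - today["judge_score"]) / baseline["judge_score"] > 0.05:
        alerts.append("judge score dropped >5% day-over-day")
    if today["retrieval_precision"] < baseline["retrieval_precision"]:
        alerts.append("retrieval precision below baseline")
    if today["error_rate"] > 0.001:
        alerts.append("error rate above 0.1%")
    if today["p99_latency_ms"] > 1.3 * baseline["p99_latency_ms"]:
        alerts.append("P99 latency up >30%")
    return alerts
```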

06 — Continuous

Continuous Evaluation Pipeline

Single-run evaluations are not enough. Production AI systems need continuous evaluation: every model update, prompt change, or retrieval schema change should trigger an automated eval run that catches regressions before they reach users.

A practical pipeline: nightly eval job against a frozen golden set → LLM-as-judge scoring on sampled live traffic → alert on score drop >5% → human review queue for flagged outputs. Gate model promotions on eval passage in CI/CD. The golden set should cover both typical cases and known hard cases — edge cases that previously caused production incidents are especially valuable.

Python · Nightly eval job with regression gate
import json, statistics, sys
from openai import OpenAI

client = OpenAI()

def llm_score(question: str, expected: str, actual: str) -> float:
    """LLM-as-judge: returns 0.0–1.0."""
    prompt = f"""Rate how well the response answers the question compared to the expected answer.
Question: {question}
Expected: {expected}
Actual: {actual}
Reply with a single number 0-10 only."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5, temperature=0.0
    ).choices[0].message.content.strip()
    try:
        return float(resp) / 10.0
    except ValueError:
        return 0.0

def run_eval(golden_path: str, model: str, threshold: float = 0.80) -> dict:
    golden = [json.loads(l) for l in open(golden_path)]
    scores = []
    failures = []

    for item in golden:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}]
        ).choices[0].message.content
        score = llm_score(item["input"], item["expected"], response)
        scores.append(score)
        if score < 0.6:
            failures.append({"input": item["input"], "score": score, "response": response})

    avg = statistics.mean(scores)
    result = {"score": avg, "n": len(golden), "failures": failures, "passed": avg >= threshold}
    print(f"Eval: n={len(golden)} avg={avg:.3f} failures={len(failures)} → {'PASS' if result['passed'] else 'FAIL'}")
    return result

# Run and gate on result
result = run_eval("golden_set.jsonl", "gpt-4o", threshold=0.80)
if not result["passed"]:
    sys.exit(f"Eval regression: {result['score']:.3f} < 0.80")
07 — Explore

Related Topics

Dive deeper into specific evaluation practices and tools:

Benchmarks
Deep Dive
Standard benchmarks for evaluating LLMs: MMLU, HumanEval, GSM8K, and beyond.
RAG Evaluation
Deep Dive
Specialized metrics and tools for retrieval-augmented generation systems.
Observability
Deep Dive
Production monitoring, tracing, and dashboards for LLM applications.
08 — Further Reading

References
