You can't improve what you don't measure — the discipline that separates prototypes from production
You can't improve what you don't measure. Evaluation tells you if your model knows enough, if your RAG is retrieving the right chunks, and whether quality has quietly degraded since last week — before your users notice.
A typical LLM system has three stages of evaluation: offline (before deployment), canary (small traffic), and production (full rollout). Without rigorous offline evaluation, bugs reach production. Without monitoring, quality drift goes undetected for weeks.
| Stage | When | Method | Risk |
|---|---|---|---|
| Offline | Before deployment | Golden set, benchmarks | Missing edge cases |
| Canary | 5–10% production traffic | LLM-as-judge, user feedback | Statistical noise |
| Production | All traffic, ongoing | Monitoring dashboards, alerts | False positives |
Offline evaluation is your first gate. You test on a fixed dataset of questions and expected answers before any code reaches production. It's fast, reproducible, and catches obvious failures.
A golden dataset is a small (20–100 questions), curated collection of representative examples with ground-truth answers. Include edge cases: typos, ambiguous queries, multi-part questions, rare entities.
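A golden set can be as simple as one JSON object per line. A minimal sketch of building one — the questions, answers, and the `input`/`expected` field names here are illustrative, not from any real dataset:

```python
import json

# A few representative examples, including an edge case (typo in the query).
golden = [
    {"input": "What is our refund window?", "expected": "30 days from delivery."},
    {"input": "Whats teh refund windw?", "expected": "30 days from delivery."},  # typo edge case
    {"input": "Who founded the company and when?", "expected": "Jane Doe, in 2015."},
]

# One JSON object per line (JSONL) keeps the set easy to diff and version-control.
with open("golden_set.jsonl", "w") as f:
    for item in golden:
        f.write(json.dumps(item) + "\n")
```

Keeping the golden set in version control alongside the prompts means every change to either is reviewable.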
- **Exact match:** did the answer match the expected answer word-for-word or semantically? Best for Q&A with definitive answers (dates, names, policies).
- **N-gram overlap (BLEU/ROUGE):** measure n-gram overlap between the generated and expected answers. Fast but shallow: high overlap doesn't guarantee correctness.
- **Embedding similarity:** embed both the expected and actual answers and compute cosine similarity. Captures meaning even when the wording differs.
- **LLM-as-judge:** ask an LLM to grade the answer against the expected answer and a rubric. The most expressive option, but slower and costs an extra model call.
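The overlap and similarity checks can be sketched with a toy bag-of-words model — a real system would use a proper embedding model and a ROUGE library; `bow_vector` here is purely illustrative:

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Toy bag-of-words 'embedding': raw token counts (illustrative only)."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def unigram_overlap(expected: str, actual: str) -> float:
    """ROUGE-1-style recall: fraction of expected tokens present in the answer."""
    exp, act = bow_vector(expected), bow_vector(actual)
    matched = sum(min(exp[t], act[t]) for t in exp)
    return matched / sum(exp.values()) if exp else 0.0

expected = "The refund window is 30 days"
actual = "Refunds are accepted within a 30 day window"
print(round(unigram_overlap(expected, actual), 2))
print(round(cosine_similarity(bow_vector(expected), bow_vector(actual)), 2))
```

Note how both metrics miss that "refunds"/"refund" and "day"/"days" mean the same thing — exactly the shallowness that embedding models (and LLM judges) fix.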
RAG systems have unique evaluation challenges. You need to measure retrieval quality (did we get relevant documents?), answer quality, and grounding (is the answer supported by context?).
- **Retrieval precision (context precision):** what fraction of retrieved chunks are actually relevant to the query?
- **Retrieval recall (context recall):** what fraction of the ground-truth information was retrieved?
- **Faithfulness:** are the claims in the answer supported by the retrieved context?
- **Answer relevancy:** does the final answer actually address the user's question?
LLM-as-judge is the most practical approach for evaluating open-ended answers. Instead of hand-grading every output, you ask a strong model (Claude, GPT-4) to grade against a rubric. It's fast, scalable, and surprisingly accurate when given a good rubric.
A good rubric describes what each score level means and what the judge should look for. Be specific about what "good" looks like.
```python
from pydantic import BaseModel, Field
from openai import OpenAI

client = OpenAI()

class JudgmentResult(BaseModel):
    score: int = Field(ge=1, le=5, description="Quality score 1 (poor) to 5 (excellent)")
    reasoning: str = Field(description="One-sentence explanation of the score")
    passes_bar: bool = Field(description="True if score >= 3 (acceptable quality)")

def judge_response(question: str, response: str, criteria: str) -> JudgmentResult:
    system = f"""You are a calibrated evaluator. Use these criteria:
{criteria}
Score 1=wrong/harmful, 2=poor, 3=acceptable, 4=good, 5=excellent.
Be consistent: the same response quality should always get the same score."""
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Question: {question}\n\nResponse:\n{response}"},
        ],
        response_format=JudgmentResult,
        temperature=0.0,
    )
    return result.choices[0].message.parsed

# Example
j = judge_response(
    question="Explain what RAG is in 2 sentences.",
    response=(
        "RAG combines retrieval of relevant documents with LLM generation. "
        "It reduces hallucinations by grounding the model in external knowledge."
    ),
    criteria="1. Factual accuracy 2. Completeness 3. Appropriate length",
)
print(f"Score: {j.score}/5 | Pass: {j.passes_bar}")
print(f"Reason: {j.reasoning}")
```
Offline evaluation gets you to deploy. Production monitoring ensures you catch quality regression before it affects users. Without monitoring, silent failures can persist for weeks.
Set up alerts for: (1) Judge score drops >5% day-over-day, (2) Retrieval precision drops below baseline, (3) Error rate spikes, (4) Latency increases >30%. When an alert fires, compare with recent code changes and revert if needed.
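The day-over-day checks above reduce to simple comparisons against yesterday's metrics. A minimal sketch — the metric names are illustrative, and the 2x error-rate multiplier is an arbitrary choice for "spike":

```python
def check_alerts(today: dict, yesterday: dict) -> list[str]:
    """Compare today's metrics against yesterday's and return any fired alerts."""
    alerts = []
    if today["judge_score"] < yesterday["judge_score"] * 0.95:
        alerts.append("judge score dropped >5% day-over-day")
    if today["retrieval_precision"] < yesterday["retrieval_precision"]:
        alerts.append("retrieval precision below baseline")
    if today["error_rate"] > yesterday["error_rate"] * 2:  # "spike" threshold is a judgment call
        alerts.append("error rate spike")
    if today["p95_latency_ms"] > yesterday["p95_latency_ms"] * 1.30:
        alerts.append("latency up >30%")
    return alerts

yesterday = {"judge_score": 4.2, "retrieval_precision": 0.81, "error_rate": 0.010, "p95_latency_ms": 900}
today = {"judge_score": 3.9, "retrieval_precision": 0.82, "error_rate": 0.011, "p95_latency_ms": 950}
print(check_alerts(today, yesterday))
```

In production these values would come from your metrics store (e.g. daily aggregates of judge scores over sampled traffic), and the alert would page whoever owns the model rollout.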
Single-run evaluations are not enough. Production AI systems need continuous evaluation: every model update, prompt change, or retrieval schema change should trigger an automated eval run that catches regressions before they reach users.
A practical pipeline: nightly eval job against a frozen golden set → LLM-as-judge scoring on sampled live traffic → alert on score drop >5% → human review queue for flagged outputs. Gate model promotions on eval passage in CI/CD. The golden set should cover both typical cases and known hard cases — edge cases that previously caused production incidents are especially valuable.
```python
import json, statistics, sys
from openai import OpenAI

client = OpenAI()

def llm_score(question: str, expected: str, actual: str) -> float:
    """LLM-as-judge: returns 0.0–1.0."""
    prompt = f"""Rate how well the response answers the question compared to the expected answer.
Question: {question}
Expected: {expected}
Actual: {actual}
Reply with a single number 0-10 only."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0.0,
    ).choices[0].message.content.strip()
    try:
        return float(resp) / 10.0
    except ValueError:
        return 0.0  # an unparseable judge reply counts as a failure

def run_eval(golden_path: str, model: str, threshold: float = 0.80) -> dict:
    with open(golden_path) as f:
        golden = [json.loads(line) for line in f]
    scores = []
    failures = []
    for item in golden:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}],
        ).choices[0].message.content
        score = llm_score(item["input"], item["expected"], response)
        scores.append(score)
        if score < 0.6:  # collect low scorers for the human review queue
            failures.append({"input": item["input"], "score": score, "response": response})
    avg = statistics.mean(scores)
    result = {"score": avg, "n": len(golden), "failures": failures, "passed": avg >= threshold}
    print(f"Eval: n={len(golden)} avg={avg:.3f} failures={len(failures)} → {'PASS' if result['passed'] else 'FAIL'}")
    return result

# Run and gate on result: a non-zero exit fails the CI/CD step
result = run_eval("golden_set.jsonl", "gpt-4o", threshold=0.80)
if not result["passed"]:
    sys.exit(f"Eval regression: {result['score']:.3f} < 0.80")
Dive deeper into specific evaluation practices and tools: