Evaluation & Testing

LLM Evaluation in Practice

Building eval pipelines that actually catch regressions — metrics, frameworks, and the human-vs-automatic tradeoff

offline → shadow → live the eval funnel
LLM-as-judge the emerging standard
50+ evals before ship quality bar to set
Contents
  1. Why evals are hard
  2. The eval funnel
  3. Metrics for quality
  4. LLM-as-judge
  5. Golden sets
  6. Frameworks & infrastructure
  7. Anti-patterns
01 — The Challenge

Why Evals Are Hard

Classic software testing is deterministic and binary: a function either returns the correct value or it doesn't. LLM evaluation is fundamentally different. Model outputs are non-deterministic, open-ended, and contextual. A response can be correct in multiple ways, and "correctness" itself is often subjective.

The eval trilemma defines the core tension: accuracy (does your metric actually measure what matters?), cost (can you run it frequently without bankrupt budget?), and speed (do you get feedback before your next deployment?). You can usually achieve only two.

No single eval covers all failure modes. You need a suite. Common mistakes: evaluating only on benchmark datasets (train-test contamination and narrow coverage), using BLEU/ROUGE for generation (poor correlation with human judgment), and ignoring regression testing over time. These shortcuts feel fast until production fails.

⚠️ Critical insight: A model can score higher on MMLU while getting worse on your actual task. Always evaluate on task-specific data, not just benchmarks. Public benchmarks measure general capability; your product measures specific value.
02 — Layered Strategy

The Eval Funnel: Offline → Shadow → Live

Effective eval strategies have three layers. Each layer catches different failure modes and carries different risk profiles. Together they form a funnel that progressively de-risks model deployment.

The Three Layers

Layer 1 — Offline evals: Run before any deployment. Automated, fast, cheap. Catch obvious regressions on your golden set. Layer 2 — Shadow evals: New model handles real traffic but responses aren't shown to users. Compare outputs between old and new. Layer 3 — Online evals: A/B test. Real users, real stakes. Monitor engagement, satisfaction, thumbs-up/down signals.

Offline evals (automated, pre-deploy): ✓ Task-specific accuracy on golden set (target: >92%) ✓ Regression suite: 200 previously-failing test cases ✓ Safety classifier: 0 violations in red-team set Shadow evals (live traffic, no user impact): ✓ Side-by-side LLM judge: new vs old on 1000 requests ✓ Latency comparison (TTFT, TBT) Online evals (A/B test, 5% traffic): ✓ User satisfaction score (thumbs up/down) ✓ Session length and follow-up query rate
LayerSpeedCostSignal qualityRisk
OfflineFastLowMediumZero
ShadowSlowMediumHighZero
Online A/BSlowHighHighestReal users
Python · Eval funnel: offline → shadow → live with metrics dashboard
import json, statistics, time
from openai import OpenAI
from enum import Enum

client = OpenAI()

class EvalStage(Enum):
    OFFLINE = "offline"   # run against golden set before deploy
    SHADOW  = "shadow"    # run new model in parallel, compare to prod
    LIVE    = "live"      # sample % of live traffic for quality check

def offline_eval(golden_path: str, model: str) -> dict:
    """Stage 1: evaluate against frozen golden set."""
    golden = [json.loads(l) for l in open(golden_path)]
    scores = []
    for item in golden:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}],
            temperature=0.0
        ).choices[0].message.content
        # Exact-match or LLM-judge scoring
        score = 1.0 if item["expected"].lower() in resp.lower() else 0.0
        scores.append(score)
    return {"stage": "offline", "n": len(golden),
            "accuracy": round(statistics.mean(scores), 3),
            "passed": statistics.mean(scores) >= 0.80}

def shadow_eval(requests: list[dict], prod_model: str, new_model: str) -> dict:
    """Stage 2: run both models, flag where new model differs significantly."""
    diffs = []
    for req in requests[:100]:  # sample first 100
        prod_resp = client.chat.completions.create(
            model=prod_model,
            messages=[{"role": "user", "content": req["input"]}]
        ).choices[0].message.content
        new_resp = client.chat.completions.create(
            model=new_model,
            messages=[{"role": "user", "content": req["input"]}]
        ).choices[0].message.content
        # Simple diff: check response length ratio
        ratio = len(new_resp) / max(len(prod_resp), 1)
        if ratio < 0.5 or ratio > 2.0:
            diffs.append({"input": req["input"], "ratio": ratio})
    return {"stage": "shadow", "n": len(requests[:100]),
            "significant_diffs": len(diffs), "diff_rate": len(diffs)/100}

# Usage
result = offline_eval("golden_set.jsonl", "gpt-4o")
if result["passed"]:
    print(f"Offline eval PASSED ({result['accuracy']:.0%})")
else:
    raise SystemExit(f"Offline eval FAILED ({result['accuracy']:.0%})")
03 — Measuring Output

Metrics for Generation Quality

Exact match: For classification and extraction tasks. Not useful for open-ended generation. BLEU / ROUGE: n-gram overlap with a reference answer. Still used for translation and summarization but poorly correlates with actual quality. BERTScore: Embedding-based semantic similarity. Better than BLEU, still requires a reference. G-Eval: Use GPT-4 to score responses on rubric dimensions (fluency, coherence, relevance, accuracy). High correlation with human judgments.

MetricReference neededHuman correlationSpeedCost
Exact matchYesHigh (for extraction)FastFree
BLEU / ROUGEYesLow (generation)FastFree
BERTScoreYesMediumFastFree
G-Eval (LLM judge)OptionalHighSlowAPI cost
Human evalN/AGround truthVery slowHigh

The choice depends on your use case. For factual extraction, exact match is reliable. For creative generation, LLM judging is more trustworthy than n-gram overlap. Invest in building ground truth — a small set of human-evaluated examples that validates your automated metrics.

04 — Automated Preference

LLM-as-Judge

Use a capable LLM (GPT-4o, Claude) to evaluate outputs on custom rubrics. This scales human judgment without re-hiring annotators. But it introduces new failure modes: position bias (preferring the first response), verbosity bias (preferring longer answers), and self-serving bias (models prefer their own outputs).

Point-wise vs Pair-wise

Point-wise: Score each response independently on a 1–5 scale per dimension. Faster, easier to parallelize, but noisier. Pair-wise: Compare two responses, pick the better one. More reliable than point-wise because the relative comparison is clearer. Requires more judge calls but produces higher-quality signals.

Reference-free Evaluation

You don't need a ground-truth reference to use LLM-as-judge. The judge can evaluate coherence, safety, and helpfulness from first principles. This is more expensive than metrics but works for open-ended tasks where multiple correct answers exist.

You are evaluating an AI assistant's response. [TASK]: {task_description} [USER QUERY]: {query} [RESPONSE]: {response} Rate the response on: - Accuracy (1-5): Does it correctly answer the question? - Completeness (1-5): Does it cover all aspects? - Safety (1-5): Is it free of harmful content? Provide a brief justification for each score. Return JSON: {"accuracy": N, "completeness": N, "safety": N, "reasoning": "..."}

Failure Modes and Mitigations

Position bias: Judge prefers first response. Mitigation: swap order and average. Verbosity bias: Longer answers score higher. Mitigation: require conciseness in the rubric. Self-serving bias: Judge model prefers outputs from its own family. Mitigation: use diverse judge models (GPT-4o, Claude, Llama).

⚠️ Validation is essential: Always validate your LLM judge against human labels on a 50-100 item sample before trusting it at scale. Calibrate your rubric against ground truth.
05 — Source of Truth

Golden Sets and Regression Testing

Your golden dataset is a curated set of (input, expected output) pairs that define acceptable behavior. It's your product spec in eval form. Sources include real user queries (sampled and labeled), adversarial cases (red-team), and edge cases discovered in production.

Building and Growing Your Golden Set

Size guidance: Start with 100-200 cases for initial coverage. Grow to 500-1000 as your product matures. Prioritize diversity over raw size. Labeling: Have domain experts label these examples. A single label is better than disagreement. Versioning: Keep golden sets under version control. Track when cases were added and why.

Regression Testing Workflow

Run your golden set on every model change. Flag if score drops >2% on any category. Use slice analysis: don't just track overall accuracy — track by query type, language, user segment, and difficulty level. This catches regressions that hide in the aggregate.

Golden sets are your contract with users. If you can't write eval cases for a feature, you don't understand it well enough to ship. Evals force clarity.
06 — Tools Landscape

Eval Frameworks and Infrastructure

Many frameworks exist to manage eval pipelines. Choose based on your team's preferences and your existing stack. The key is integrating evals into CI/CD so they block bad deployments automatically.

Framework
OpenAI Evals
Model grading + dataset registry. Python-native. Best for GPT-based products.
Framework
LangSmith
Tracing + evals together. Integrates with LangChain. Visual dashboards.
Framework
Braintrust
Fast eval runs, git-native. Strong replay and debugging tools.
Framework
PromptFoo
CLI-first, YAML config. Great for prompt testing and comparison.
Framework
DeepEval
Many built-in metrics. Optimized for RAG and chatbots.
Framework
Weights & Biases
Eval logging + experiment tracking. Rich visualizations.
Framework
RAGAS
Specialized for RAG evaluation. Metrics for retrieval quality.
Framework
Arize Phoenix
Model monitoring + evals. Production focus.

CI/CD Integration

Run offline evals on every PR. Block merge if score drops below threshold. Log every eval run with model version, prompt version, and dataset version — this enables trend analysis and bisecting regressions to specific changes. Use smaller judge models for fast CI runs (Llama 2, Mistral). Reserve GPT-4o for weekly deep evals.

FrameworkStrengthsBest forOpen source
OpenAI EvalsModel grading, datasetsGPT-based productsYes
LangSmithTracing + evalsLangChain appsNo
BraintrustFast, git-nativeGeneral LLM productsNo
PromptFooCLI-first, YAMLPrompt testingYes
DeepEvalBuilt-in metricsRAG + chatbotsYes
07 — What to Avoid

Eval Anti-Patterns

Common pitfalls that lead teams to trust evals that don't actually measure product quality:

1

Teaching to the test — benchmark overfitting

Using benchmark datasets (MMLU, HumanEval) as your only eval. Models can overfit to benchmarks without improving on real tasks. Benchmarks are coarse signals of general capability, not product metrics.

2

Single-metric obsession — hidden regressions

Optimizing one number (e.g., helpfulness) while degrading others (safety, factuality). Always track a metric suite. Monitor at least: accuracy, safety, latency, and cost.

3

No human baseline — validating noise

Running LLM evals without validating against human judgments. You may be optimizing arbitrary scoring noise. Always calibrate your LLM judge against ground truth on a small sample.

4

Static golden set — stale signal

Never updating your golden set as user behavior evolves. Add 10-20 new cases quarterly. Retire cases that no longer reflect real use. Live evals keep pace with product drift.

08 — Further Reading

References

Academic Papers
Documentation & Guides
Practitioner Writing