Building eval pipelines that actually catch regressions — metrics, frameworks, and the human-vs-automatic tradeoff
Classic software testing is deterministic and binary: a function either returns the correct value or it doesn't. LLM evaluation is fundamentally different. Model outputs are non-deterministic, open-ended, and contextual. A response can be correct in multiple ways, and "correctness" itself is often subjective.
The eval trilemma defines the core tension: accuracy (does your metric actually measure what matters?), cost (can you run it frequently without bankrupting your budget?), and speed (do you get feedback before your next deployment?). You can usually achieve only two.
No single eval covers all failure modes. You need a suite. Common mistakes: evaluating only on benchmark datasets (train-test contamination and narrow coverage), using BLEU/ROUGE for generation (poor correlation with human judgment), and ignoring regression testing over time. These shortcuts feel fast until production fails.
Effective eval strategies have three layers. Each layer catches different failure modes and carries different risk profiles. Together they form a funnel that progressively de-risks model deployment.
- Layer 1 — Offline evals: run before any deployment. Automated, fast, cheap. Catch obvious regressions on your golden set.
- Layer 2 — Shadow evals: the new model handles real traffic, but its responses aren't shown to users. Compare outputs between the old and new models.
- Layer 3 — Online evals: an A/B test. Real users, real stakes. Monitor engagement, satisfaction, and thumbs-up/down signals.
| Layer | Speed | Cost | Signal quality | Risk |
|---|---|---|---|---|
| Offline | Fast | Low | Medium | Zero |
| Shadow | Slow | Medium | High | Zero |
| Online A/B | Slow | High | Highest | Real users |
```python
import json
import statistics
from enum import Enum

from openai import OpenAI

client = OpenAI()

class EvalStage(Enum):
    OFFLINE = "offline"  # run against golden set before deploy
    SHADOW = "shadow"    # run new model in parallel, compare to prod
    LIVE = "live"        # sample % of live traffic for quality check

def offline_eval(golden_path: str, model: str) -> dict:
    """Stage 1: evaluate against a frozen golden set."""
    with open(golden_path) as f:
        golden = [json.loads(line) for line in f]
    scores = []
    for item in golden:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}],
            temperature=0.0,
        ).choices[0].message.content
        # Exact-match scoring; swap in an LLM judge for open-ended tasks
        scores.append(1.0 if item["expected"].lower() in resp.lower() else 0.0)
    accuracy = statistics.mean(scores)
    return {"stage": "offline", "n": len(golden),
            "accuracy": round(accuracy, 3),
            "passed": accuracy >= 0.80}

def shadow_eval(requests: list[dict], prod_model: str, new_model: str) -> dict:
    """Stage 2: run both models, flag where the new model differs significantly."""
    sample = requests[:100]  # cap the shadow sample
    diffs = []
    for req in sample:
        prod_resp = client.chat.completions.create(
            model=prod_model,
            messages=[{"role": "user", "content": req["input"]}],
        ).choices[0].message.content
        new_resp = client.chat.completions.create(
            model=new_model,
            messages=[{"role": "user", "content": req["input"]}],
        ).choices[0].message.content
        # Crude diff signal: flag large response-length divergence
        ratio = len(new_resp) / max(len(prod_resp), 1)
        if ratio < 0.5 or ratio > 2.0:
            diffs.append({"input": req["input"], "ratio": ratio})
    return {"stage": "shadow", "n": len(sample),
            "significant_diffs": len(diffs),
            "diff_rate": len(diffs) / max(len(sample), 1)}

# Usage
result = offline_eval("golden_set.jsonl", "gpt-4o")
if result["passed"]:
    print(f"Offline eval PASSED ({result['accuracy']:.0%})")
else:
    raise SystemExit(f"Offline eval FAILED ({result['accuracy']:.0%})")
```
- Exact match: For classification and extraction tasks. Not useful for open-ended generation.
- BLEU / ROUGE: n-gram overlap with a reference answer. Still used for translation and summarization, but correlates poorly with human judgment.
- BERTScore: Embedding-based semantic similarity. Better than BLEU, but still requires a reference.
- G-Eval: Use GPT-4 to score responses on rubric dimensions (fluency, coherence, relevance, accuracy). High correlation with human judgments.
| Metric | Reference needed | Human correlation | Speed | Cost |
|---|---|---|---|---|
| Exact match | Yes | High (for extraction) | Fast | Free |
| BLEU / ROUGE | Yes | Low (generation) | Fast | Free |
| BERTScore | Yes | Medium | Fast | Free |
| G-Eval (LLM judge) | Optional | High | Slow | API cost |
| Human eval | N/A | Ground truth | Very slow | High |
The choice depends on your use case. For factual extraction, exact match is reliable. For creative generation, LLM judging is more trustworthy than n-gram overlap. Invest in building ground truth — a small set of human-evaluated examples that validates your automated metrics.
Use a capable LLM (GPT-4o, Claude) to evaluate outputs on custom rubrics. This scales human judgment without re-hiring annotators. But it introduces new failure modes: position bias (preferring the first response), verbosity bias (preferring longer answers), and self-serving bias (models prefer their own outputs).
- Point-wise: Score each response independently on a 1–5 scale per dimension. Faster and easier to parallelize, but noisier.
- Pair-wise: Compare two responses and pick the better one. More reliable than point-wise because the relative comparison is clearer; requires more judge calls but produces higher-quality signals.
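A point-wise judge can be reduced to two pieces: a rubric prompt and a score parser. The sketch below assumes a judge that returns JSON scores; the rubric dimensions and the `call_judge` callable are illustrative stand-ins for whatever judge model you wire in.

```python
import json
import statistics
from typing import Callable

RUBRIC = ["fluency", "coherence", "relevance", "accuracy"]

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a point-wise rubric prompt (dimension names are illustrative)."""
    dims = ", ".join(RUBRIC)
    return (
        f"Rate the answer on each dimension ({dims}) from 1 to 5.\n"
        'Respond with JSON only, e.g. {"fluency": 4, ...}.\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )

def pointwise_score(question: str, answer: str,
                    call_judge: Callable[[str], str]) -> float:
    """Average the per-dimension scores returned by the judge model."""
    raw = call_judge(build_judge_prompt(question, answer))
    scores = json.loads(raw)
    return statistics.mean(scores[d] for d in RUBRIC)

# Stubbed judge for illustration; in practice call_judge wraps an API call
fake = lambda prompt: '{"fluency": 5, "coherence": 4, "relevance": 4, "accuracy": 3}'
print(pointwise_score("What is RAG?", "Retrieval-augmented generation...", fake))  # 4.0
```

Keeping the judge behind a callable makes the scoring logic unit-testable without network calls, which matters once evals run in CI.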
You don't need a ground-truth reference to use LLM-as-judge. The judge can evaluate coherence, safety, and helpfulness from first principles. This is more expensive than metrics but works for open-ended tasks where multiple correct answers exist.
- Position bias: the judge prefers the first response. Mitigation: swap the order and average.
- Verbosity bias: longer answers score higher. Mitigation: require conciseness in the rubric.
- Self-serving bias: a judge model prefers outputs from its own family. Mitigation: use diverse judge models (GPT-4o, Claude, Llama).
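The swap-and-average mitigation for position bias can be sketched as a wrapper around any pair-wise judge: run the comparison in both orders and only count a verdict that survives the swap. The `judge` callable and its "first"/"second" return convention are assumptions for illustration.

```python
from typing import Callable

def debiased_pairwise(prompt: str, resp_a: str, resp_b: str,
                      judge: Callable[[str, str, str], str]) -> str:
    """Run the pair-wise judge twice with the order swapped; only a
    verdict that is consistent across both orders counts as a win."""
    first = judge(prompt, resp_a, resp_b)    # judge returns "first" or "second"
    second = judge(prompt, resp_b, resp_a)   # same pair, order swapped
    if first == "first" and second == "second":
        return "A"   # A won in both orders
    if first == "second" and second == "first":
        return "B"   # B won in both orders
    return "tie"     # verdict flipped with order: position-biased, call it a tie

# A pathological judge that always prefers whichever response is shown first
positional = lambda p, x, y: "first"
print(debiased_pairwise("q", "ans1", "ans2", positional))  # tie
```

A judge with pure position bias never produces a consistent verdict here, so its noise is converted into ties instead of spurious wins.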
Your golden dataset is a curated set of (input, expected output) pairs that define acceptable behavior. It's your product spec in eval form. Sources include real user queries (sampled and labeled), adversarial cases (red-team), and edge cases discovered in production.
- Size: Start with 100-200 cases for initial coverage and grow to 500-1000 as your product matures. Prioritize diversity over raw size.
- Labeling: Have domain experts label the examples, and resolve disagreements into a single authoritative label.
- Versioning: Keep golden sets under version control. Track when cases were added and why.
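A minimal sketch of what a versioned golden-set entry can look like, extending the `input`/`expected` fields the offline eval reads with provenance metadata. The extra field names (`category`, `added`, `reason`) are an assumed schema, not a standard.

```python
import datetime
import json

def add_golden_case(path: str, input_text: str, expected: str,
                    category: str, reason: str) -> dict:
    """Append one labeled case with provenance metadata (schema is illustrative)."""
    case = {
        "input": input_text,
        "expected": expected,
        "category": category,  # enables slice analysis later
        "added": datetime.date.today().isoformat(),
        "reason": reason,       # why this case exists, e.g. a production incident
    }
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")
    return case
```

Recording `added` and `reason` alongside the label is what makes quarterly pruning and regression bisection possible later.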
Run your golden set on every model change. Flag if score drops >2% on any category. Use slice analysis: don't just track overall accuracy — track by query type, language, user segment, and difficulty level. This catches regressions that hide in the aggregate.
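Slice analysis and the >2% regression flag can be sketched in a few lines, assuming each eval result carries a `category` label and a 0/1 `score`:

```python
import statistics
from collections import defaultdict

def slice_accuracy(results: list[dict]) -> dict[str, float]:
    """Group per-case scores by category and compute accuracy per slice."""
    by_slice = defaultdict(list)
    for r in results:
        by_slice[r["category"]].append(r["score"])
    return {k: statistics.mean(v) for k, v in by_slice.items()}

def regressions(baseline: dict[str, float], current: dict[str, float],
                threshold: float = 0.02) -> list[str]:
    """Flag slices whose accuracy dropped by more than `threshold` (2% default)."""
    return [k for k in baseline
            if k in current and baseline[k] - current[k] > threshold]

# Overall accuracy can look flat while one slice regresses badly:
base = {"en": 0.95, "de": 0.90}
curr = {"en": 0.97, "de": 0.84}
print(regressions(base, curr))  # ['de']
```

Here the English slice improved enough to mask the German drop in the aggregate, which is exactly the failure mode per-slice tracking catches.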
Many frameworks exist to manage eval pipelines. Choose based on your team's preferences and your existing stack. The key is integrating evals into CI/CD so they block bad deployments automatically.
Run offline evals on every PR. Block merge if score drops below threshold. Log every eval run with model version, prompt version, and dataset version — this enables trend analysis and bisecting regressions to specific changes. Use smaller judge models for fast CI runs (Llama 2, Mistral). Reserve GPT-4o for weekly deep evals.
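A CI gate along these lines compares the current run against a stored baseline, logs the versions needed for bisecting, and fails the build on a significant drop. The file layout and record keys are an assumed convention, not a framework API.

```python
import json

def ci_gate(run: dict, baseline_path: str, max_drop: float = 0.02) -> bool:
    """Compare an eval run to the stored baseline; fail CI on a >2% drop."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    drop = baseline["accuracy"] - run["accuracy"]
    record = {  # log enough context to bisect a regression later
        "model": run["model"],
        "prompt_version": run["prompt_version"],
        "dataset_version": baseline["dataset_version"],
        "accuracy": run["accuracy"],
        "drop": round(drop, 4),
    }
    print(json.dumps(record))
    return drop <= max_drop

# In CI: sys.exit(0 if ci_gate(run, "baseline.json") else 1)
```

Emitting the record as one JSON line per run keeps the eval history greppable, so trend analysis needs no extra infrastructure.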
| Framework | Strengths | Best for | Open source |
|---|---|---|---|
| OpenAI Evals | Model grading, datasets | GPT-based products | Yes |
| LangSmith | Tracing + evals | LangChain apps | No |
| Braintrust | Fast, git-native | General LLM products | No |
| PromptFoo | CLI-first, YAML | Prompt testing | Yes |
| DeepEval | Built-in metrics | RAG + chatbots | Yes |
Common pitfalls that lead teams to trust evals that don't actually measure product quality:
Using benchmark datasets (MMLU, HumanEval) as your only eval. Models can overfit to benchmarks without improving on real tasks. Benchmarks are coarse signals of general capability, not product metrics.
Optimizing one number (e.g., helpfulness) while degrading others (safety, factuality). Always track a metric suite. Monitor at least: accuracy, safety, latency, and cost.
Running LLM evals without validating against human judgments. You may be optimizing arbitrary scoring noise. Always calibrate your LLM judge against ground truth on a small sample.
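The simplest calibration check is agreement rate: on a small human-labeled sample, what fraction of pass/fail verdicts does the judge reproduce? A sketch, assuming binary labels:

```python
def judge_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of cases where the judge matches the human verdict —
    a cheap sanity check before trusting an LLM judge at scale."""
    assert len(judge_scores) == len(human_scores)
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)

# 1 = pass, 0 = fail on a small human-labeled sample
human = [1, 1, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1]
print(judge_agreement(judge, human))  # 0.75
```

If agreement is near chance, the judge's scores are noise and anything optimized against them is arbitrary; for graded (non-binary) scores, a rank correlation is the analogous check.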
Never updating your golden set as user behavior evolves. Add 10-20 new cases quarterly. Retire cases that no longer reflect real use. A living golden set keeps pace with product drift.