Define what 'good' means before writing a single line of pipeline code. The most important and most skipped step in LLM system development.
Most LLM projects fail not because the model is bad but because the team never defined what success looks like. Without a clear eval, you're flying blind: you can't tell if a prompt change is an improvement, you can't compare model versions, and you can't know when the system is ready to ship. Eval design should happen before you write your first LLM call — it forces clarity on what the task actually is and what "done" means.
The discipline: write three things before any code — (1) a task description, (2) three examples of ideal outputs, and (3) three examples of unacceptable outputs. If you can't write these, the task isn't defined well enough to build.
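One lightweight way to enforce this discipline is to keep the spec as a checked-in data structure and gate pipeline work on its completeness. A minimal sketch — the field names, example task, and `spec_is_complete` helper are illustrative, not a standard:

```python
# Hypothetical task spec -- written before any pipeline code exists.
TASK_SPEC = {
    "description": (
        "Summarise a customer support ticket into 2-3 sentences that a "
        "triage agent can act on without reading the full thread."
    ),
    "ideal_outputs": [
        "Customer cannot log in after password reset; 403 error on web only. Wants access today.",
        "Refund requested for a duplicate charge on the March invoice. Second contact about this.",
        "Feature request: CSV export for dashboards. Not blocking; customer otherwise satisfied.",
    ],
    "unacceptable_outputs": [
        "The customer has a problem.",                          # too vague to act on
        "A summary citing an order number not in the ticket.",  # hallucinated detail
        "A ten-sentence restatement of the whole thread.",      # fails the length constraint
    ],
}

def spec_is_complete(spec: dict) -> bool:
    """The task is defined well enough to build only if all three parts exist."""
    return (
        bool(spec.get("description"))
        and len(spec.get("ideal_outputs", [])) >= 3
        and len(spec.get("unacceptable_outputs", [])) >= 3
    )
```

The unacceptable examples are as valuable as the ideal ones: they become the first assertions in the eval suite.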
Think in three levels: (1) automated proxy metrics — cheap, fast signals (rule-based checks, LLM-judge scores) you can run on every change; (2) task metrics — direct measures of task quality (accuracy, faithfulness, pass@k) computed on a labelled eval set; (3) business metrics — what the system actually exists to move, measurable only in production and only over time.
Build your eval system to capture all three levels — automated proxies for fast iteration, task metrics for CI gates, and periodic business metric measurement for strategic decisions.
```python
METRIC_GUIDE = {
    "summarisation": {
        "primary": "factual_consistency",  # does the summary contain only info from the source?
        "secondary": ["coverage", "conciseness"],
        "automated": ["BERTScore", "SummaC", "LLM-judge(1-10)"],
        "avoid": ["ROUGE"],  # low correlation with human preference
    },
    "qa_retrieval": {
        "primary": "answer_correctness",  # is the answer factually correct?
        "secondary": ["groundedness", "completeness"],
        "automated": ["exact_match", "F1", "LLM-judge"],
        "avoid": ["BLEU"],
    },
    "code_generation": {
        "primary": "functional_correctness",  # does the code run and pass tests?
        "secondary": ["style", "efficiency"],
        "automated": ["pass@k", "unit_test_pass_rate"],
        "avoid": ["syntax_check_only"],  # passing syntax != correct
    },
    "classification": {
        "primary": "accuracy",
        "secondary": ["precision", "recall", "calibration"],
        "automated": ["sklearn.metrics"],
        "avoid": ["accuracy_alone"],  # misleading on imbalanced classes
    },
}
```
Human evaluation: Ground truth. Expensive ($1–10 per annotation), slow (days not seconds), but the only way to validate that your automated metrics actually track quality. Use for: initial validation of your eval setup, periodic spot-checks (5–10% of production outputs), ambiguous cases that automated evals disagree on.
LLM-as-judge: Use GPT-4 or Claude to rate outputs. ~$0.01 per eval, runs in seconds, scales infinitely. Correlation with human preference: 0.7–0.9 depending on task. Use for: regression testing on every change, A/B comparison, bulk evaluation of large datasets. Always validate judge scores against human labels first.
Rule-based assertions: Fastest and most reliable for binary properties. "Output is valid JSON", "response mentions the product name", "no toxic content detected". Use as pre-filters before more expensive evaluation.
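The assertion layer can be plain predicate functions run before any model-based scoring. A minimal sketch — the two checks come from the examples above; `PRODUCT_NAME` is a placeholder for illustration:

```python
import json

PRODUCT_NAME = "AcmeWidget"  # placeholder -- substitute your own

def is_valid_json(output: str) -> bool:
    """Binary check: 'output is valid JSON'."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def mentions_product(output: str) -> bool:
    """Binary check: 'response mentions the product name'."""
    return PRODUCT_NAME.lower() in output.lower()

def run_assertions(output: str, checks) -> dict:
    """Run cheap binary checks first; only outputs passing all of them
    proceed to LLM-judge or human evaluation."""
    return {check.__name__: check(output) for check in checks}

results = run_assertions('{"answer": "AcmeWidget ships in 2 days"}',
                         [is_valid_json, mentions_product])
```

Because each check is a named function, failures are self-describing in logs, and the same predicates can double as production guardrails.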
```python
def establish_baselines(eval_dataset: list[dict], llm_call) -> dict:
    """Measure these baselines before optimising anything.

    Assumes an `evaluate(output, expected) -> float` scorer is in scope.
    """
    baselines = {}

    # 1. Zero-shot baseline (no examples in the prompt)
    zero_shot_scores = [evaluate(llm_call(item["input"]), item["expected"])
                        for item in eval_dataset]
    baselines["zero_shot"] = sum(zero_shot_scores) / len(zero_shot_scores)

    # 2. Naive baseline (simplest possible non-LLM approach)
    #    e.g. for summarisation: the first ~500 characters of the source
    naive_scores = [evaluate(item["input"][:500], item["expected"])
                    for item in eval_dataset]
    baselines["naive"] = sum(naive_scores) / len(naive_scores)

    # 3. Human ceiling (sample of human-written outputs, where available)
    human_scores = [evaluate(item["human_output"], item["expected"])
                    for item in eval_dataset if "human_output" in item]
    if human_scores:
        baselines["human_ceiling"] = sum(human_scores) / len(human_scores)

    print("Baselines established:")
    for name, score in baselines.items():
        print(f"  {name}: {score:.3f}")
    return baselines
```
A good eval dataset needs: (1) coverage — represents the full distribution of real inputs, including edge cases; (2) quality labels — correct answers verified by domain experts, not just the LLM itself; (3) size — enough examples to detect statistically significant changes (usually 100–500 for task evals, 50–100 for more expensive human-reviewed sets).
Construction steps: sample from real production data (or simulate if pre-launch), have domain experts label 100–200 examples, use those to calibrate your automated eval, then expand with LLM-assisted labeling (verified by human spot-checks). Never use the same data for both prompt development and eval.
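The "never reuse the same data" rule is easy to enforce mechanically with a deterministic hash-based split, which keeps each example in the same bucket even as the dataset grows. A sketch, assuming each example carries a stable `input` string:

```python
import hashlib

def split_dev_eval(examples: list[dict], eval_fraction: float = 0.3):
    """Deterministically partition examples so the prompt-development (dev)
    set and the eval set never overlap: the same input always hashes to
    the same bucket, even across dataset versions."""
    dev, eval_set = [], []
    for ex in examples:
        digest = hashlib.sha256(ex["input"].encode("utf-8")).digest()
        bucket = digest[0] / 255.0  # stable pseudo-random value in [0, 1]
        (eval_set if bucket < eval_fraction else dev).append(ex)
    return dev, eval_set

examples = [{"input": f"question {i}", "expected": "..."} for i in range(1000)]
dev, eval_set = split_dev_eval(examples)
```

Unlike `random.shuffle`, this assignment survives re-runs and dataset growth, so a prompt tuned on the dev set can never silently see eval examples.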
Selecting evaluation metrics requires matching the metric's measurement assumptions to the quality dimensions that actually matter for the application. Many teams default to reference-based metrics (BLEU, ROUGE) because they are simple to compute, then discover that these metrics correlate poorly with human quality judgments for the tasks being evaluated. The table below maps task types to appropriate primary and secondary metrics.
| Task type | Primary metric | Secondary metric | Avoid |
|---|---|---|---|
| RAG question answering | Faithfulness (LLM judge) | Answer relevance | BLEU (no reference match) |
| Summarization | Coverage (LLM judge) | ROUGE-L (recall) | ROUGE-1 precision alone |
| Classification | Accuracy / F1 | Calibration (ECE) | Accuracy on imbalanced data |
| Code generation | Pass@k (unit tests) | Edit distance to reference | BLEU on code |
| Dialogue | Human preference rate | Task completion rate | Perplexity |
```python
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Define a test case from RAG pipeline output
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."],
)

faithfulness = FaithfulnessMetric(threshold=0.7)
relevancy = AnswerRelevancyMetric(threshold=0.7)

faithfulness.measure(test_case)
relevancy.measure(test_case)
print(f"Faithfulness: {faithfulness.score:.2f}, Relevancy: {relevancy.score:.2f}")
```
Evaluation frequency decisions involve balancing the cost of running evaluations against the risk of shipping quality regressions. Running a full eval suite on every pull request is ideal for catching regressions early but is cost-prohibitive when the suite takes hours to run or uses expensive LLM-as-judge metrics. A tiered evaluation approach — a fast smoke-test suite on every commit, a full suite on main branch merges, and a comprehensive dataset including adversarial examples on release candidates — provides regression coverage at each criticality level without creating cost barriers to frequent iteration during development.
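A tiered setup can be as simple as a config mapping each trigger to a suite size and metric set. A sketch — the tier names, sizes, and metric labels are illustrative:

```python
# Illustrative tier definitions -- sizes and metric names are assumptions.
EVAL_TIERS = {
    "commit":  {"n_examples": 20,  "metrics": ["assertions"]},
    "merge":   {"n_examples": 200, "metrics": ["assertions", "llm_judge"]},
    "release": {"n_examples": 500, "metrics": ["assertions", "llm_judge", "adversarial"]},
}

def build_suite(tier: str, dataset: list[dict]) -> dict:
    """Select the slice of the eval dataset and the metric set for a tier.
    The slice is fixed and versioned so runs are comparable over time."""
    cfg = EVAL_TIERS[tier]
    return {
        "examples": dataset[: cfg["n_examples"]],
        "metrics": cfg["metrics"],
    }

smoke = build_suite("commit", [{"input": str(i)} for i in range(1000)])
```

The commit tier should run in seconds (rule-based assertions only); expensive LLM-judge and adversarial sets are reserved for merges and release candidates.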
Statistical significance testing on evaluation metric changes prevents teams from acting on noise. When comparing two model versions on a 200-example test set, a difference of a few percentage points on a binary metric typically falls within sampling noise and is not significant at the conventional 0.05 threshold. McNemar's test for paired binary metrics and paired t-tests for continuous metrics are the appropriate statistical tools for determining whether an observed improvement is reliable. Establishing minimum detectable effect sizes before designing evaluation experiments ensures that test sets are sized to provide sufficient power for the quality differences the team cares about detecting.
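McNemar's test for paired binary outcomes needs only the counts of examples where the two versions disagree. An exact version can be sketched with `scipy`, assuming per-example correctness booleans for both versions:

```python
from scipy.stats import binomtest

def mcnemar_exact_p(a_correct: list[bool], b_correct: list[bool]) -> float:
    """Exact McNemar test: under H0 (no quality difference), each
    discordant pair is equally likely to favour either version."""
    b = sum(1 for a, bb in zip(a_correct, b_correct) if a and not bb)  # A right, B wrong
    c = sum(1 for a, bb in zip(a_correct, b_correct) if bb and not a)  # B right, A wrong
    if b + c == 0:
        return 1.0  # the versions never disagree
    return binomtest(b, n=b + c, p=0.5, alternative="two-sided").pvalue

# Toy paired comparison on 200 examples: B fixes 12 of A's errors, breaks 2
a_results = [True] * 140 + [False] * 60
b_results = [True] * 138 + [False] * 2 + [True] * 12 + [False] * 48
p = mcnemar_exact_p(a_results, b_results)
```

Because the test conditions only on discordant pairs, it ignores the (usually large) mass of examples both versions get right or wrong, which is what gives it power on small eval sets.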
Calibration of LLM-as-judge metrics against human annotations is essential before using automated metrics to drive decisions. An LLM judge that systematically overscores verbose responses or underscores technically correct but tersely written answers will push prompt optimization in the wrong direction. Computing the Pearson or Spearman correlation between judge scores and human preference scores on a held-out calibration set of 100–200 examples identifies systematic biases before they propagate into evaluation-driven development decisions. If correlation is below 0.6, the judge prompt or judge model requires adjustment before the metric provides reliable signal.
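The calibration check itself is a few lines with `scipy`, assuming paired judge scores and human scores for the same outputs; the 0.6 threshold below mirrors the one in the text:

```python
from scipy.stats import pearsonr, spearmanr

def check_judge_calibration(judge_scores, human_scores, threshold=0.6):
    """Correlate automated judge scores against human labels on a held-out
    calibration set; below the threshold, fix the judge prompt or model
    before trusting the metric."""
    pearson, _ = pearsonr(judge_scores, human_scores)
    spearman, _ = spearmanr(judge_scores, human_scores)
    return {
        "pearson": pearson,
        "spearman": spearman,
        "usable": min(pearson, spearman) >= threshold,
    }

# Toy data: the judge overscores by a constant offset but preserves ranking
human = [1, 2, 3, 4, 5, 3, 2, 4]
judge = [3, 4, 5, 6, 7, 5, 4, 6]
report = check_judge_calibration(judge, human)
```

Comparing Pearson and Spearman is itself diagnostic: a high Spearman with a low Pearson suggests the judge preserves ordering but compresses or stretches the scale, which matters if you threshold on absolute scores.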
Evaluation pipeline automation reduces the overhead of running evaluations on every model change to near zero. A CI/CD pipeline that triggers the evaluation suite on main branch merges, logs results to MLflow or Weights & Biases, and posts a summary comment to the pull request with metric comparisons against the previous baseline creates a culture of continuous quality measurement. Teams that establish this infrastructure early in development find that quality regressions are caught within hours of the causative change rather than discovered during manual review cycles days or weeks later, dramatically reducing the debugging cost of quality issues.
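The PR-comment step of such a pipeline reduces to comparing the current run against a stored baseline and rendering a table. A sketch — the function name, tolerance, and "(regression)" flag are illustrative, and both runs are assumed to report the same metric keys:

```python
def format_metric_comment(current: dict, baseline: dict, tolerance: float = 0.01) -> str:
    """Render a markdown table comparing this run's metrics to the stored
    baseline, flagging any metric that dropped by more than `tolerance`."""
    lines = ["| Metric | Baseline | Current | Delta |", "|---|---|---|---|"]
    for name in sorted(current):
        delta = current[name] - baseline[name]
        flag = " (regression)" if delta < -tolerance else ""
        lines.append(
            f"| {name} | {baseline[name]:.3f} | {current[name]:.3f} | {delta:+.3f}{flag} |"
        )
    return "\n".join(lines)

comment = format_metric_comment(
    current={"faithfulness": 0.80, "relevancy": 0.91},
    baseline={"faithfulness": 0.85, "relevancy": 0.90},
)
```

The CI job then posts `comment` to the pull request and, optionally, fails the build when any "(regression)" row appears.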
Eval dataset contamination — where documents from the evaluation set appear in the model's training data — produces artificially inflated benchmark scores that do not reflect real-world quality. For custom fine-tuned models, verifying that evaluation examples were not present in the fine-tuning data requires deduplication checks between the eval and training datasets. For proprietary models where training data is not disclosed, using evaluation examples that were created after the model's knowledge cutoff date, or constructing evaluation queries from internal data that could not have been in public training corpora, provides contamination-resistant quality measurements.
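A first-pass deduplication check between eval and fine-tuning data can be done with normalised exact-match hashing; real pipelines often add n-gram or embedding near-duplicate detection on top. A sketch:

```python
import hashlib
import re

def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences don't hide a duplicate."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_contaminated(eval_examples: list[str], training_texts: list[str]) -> list[str]:
    """Return eval examples whose normalised text appears verbatim
    in the fine-tuning data."""
    train_hashes = {
        hashlib.sha256(_normalise(t).encode()).hexdigest() for t in training_texts
    }
    return [
        ex for ex in eval_examples
        if hashlib.sha256(_normalise(ex).encode()).hexdigest() in train_hashes
    ]

train = ["The  capital of France is Paris.", "Water boils at 100 C."]
evals = ["the capital of france is paris.", "Who wrote Hamlet?"]
leaked = find_contaminated(evals, train)  # catches the case/whitespace variant
```

Any example `find_contaminated` returns should be removed from the eval set (or from the training data) before benchmark scores are reported.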