Define what 'good' means before writing a single line of pipeline code. The most important and most skipped step in LLM system development.
Most LLM projects fail not because the model is bad but because the team never defined what success looks like. Without a clear eval, you're flying blind: you can't tell if a prompt change is an improvement, you can't compare model versions, and you can't know when the system is ready to ship. Eval design should happen before you write your first LLM call — it forces clarity on what the task actually is and what "done" means.
The discipline: write three things before any code — (1) a task description, (2) three examples of ideal outputs, and (3) three examples of unacceptable outputs. If you can't write these, the task isn't defined well enough to build.
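One lightweight way to enforce this discipline is to keep the spec as a checked-in data structure and gate pipeline work on its completeness. A minimal sketch — the field names, example task, and `spec_is_complete` helper are illustrative, not a standard:

```python
# Hypothetical task spec -- written before any pipeline code exists.
TASK_SPEC = {
    "description": (
        "Summarise a customer support ticket into 2-3 sentences that a "
        "triage agent can act on without reading the full thread."
    ),
    "ideal_outputs": [
        "Customer cannot log in after password reset; 403 error on web only. Wants access today.",
        "Refund requested for a duplicate charge on the March invoice. Second contact about this.",
        "Feature request: CSV export for dashboards. Not blocking; customer otherwise satisfied.",
    ],
    "unacceptable_outputs": [
        "The customer has a problem.",                          # too vague to act on
        "A summary citing an order number not in the ticket.",  # hallucinated detail
        "A ten-sentence restatement of the whole thread.",      # fails the length constraint
    ],
}

def spec_is_complete(spec: dict) -> bool:
    """The task is defined well enough to build only if all three parts exist."""
    return (
        bool(spec.get("description"))
        and len(spec.get("ideal_outputs", [])) >= 3
        and len(spec.get("unacceptable_outputs", [])) >= 3
    )
```

The unacceptable examples are as valuable as the ideal ones: they become the first assertions in the eval suite.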
Think in three levels: (1) automated proxy metrics — cheap, fast signals (rule-based checks, LLM-judge scores) you can run on every change; (2) task metrics — direct measures of task quality (accuracy, faithfulness, pass@k) computed on a labelled eval set; (3) business metrics — what the system actually exists to move, measurable only in production and only over time.
Build your eval system to capture all three levels — automated proxies for fast iteration, task metrics for CI gates, and periodic business metric measurement for strategic decisions.
```python
METRIC_GUIDE = {
    "summarisation": {
        "primary": "factual_consistency",  # does the summary contain only info from the source?
        "secondary": ["coverage", "conciseness"],
        "automated": ["BERTScore", "SummaC", "LLM-judge(1-10)"],
        "avoid": ["ROUGE"],  # low correlation with human preference
    },
    "qa_retrieval": {
        "primary": "answer_correctness",  # is the answer factually correct?
        "secondary": ["groundedness", "completeness"],
        "automated": ["exact_match", "F1", "LLM-judge"],
        "avoid": ["BLEU"],
    },
    "code_generation": {
        "primary": "functional_correctness",  # does the code run and pass tests?
        "secondary": ["style", "efficiency"],
        "automated": ["pass@k", "unit_test_pass_rate"],
        "avoid": ["syntax_check_only"],  # passing syntax != correct
    },
    "classification": {
        "primary": "accuracy",
        "secondary": ["precision", "recall", "calibration"],
        "automated": ["sklearn.metrics"],
        "avoid": ["accuracy_alone"],  # misleading on imbalanced classes
    },
}
```
Human evaluation: Ground truth. Expensive ($1–10 per annotation), slow (days not seconds), but the only way to validate that your automated metrics actually track quality. Use for: initial validation of your eval setup, periodic spot-checks (5–10% of production outputs), ambiguous cases that automated evals disagree on.
LLM-as-judge: Use GPT-4 or Claude to rate outputs. ~$0.01 per eval, runs in seconds, scales infinitely. Correlation with human preference: 0.7–0.9 depending on task. Use for: regression testing on every change, A/B comparison, bulk evaluation of large datasets. Always validate judge scores against human labels first.
Rule-based assertions: Fastest and most reliable for binary properties. "Output is valid JSON", "response mentions the product name", "no toxic content detected". Use as pre-filters before more expensive evaluation.
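The assertion layer can be plain predicate functions run before any model-based scoring. A minimal sketch — the two checks come from the examples above; `PRODUCT_NAME` is a placeholder for illustration:

```python
import json

PRODUCT_NAME = "AcmeWidget"  # placeholder -- substitute your own

def is_valid_json(output: str) -> bool:
    """Binary check: 'output is valid JSON'."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def mentions_product(output: str) -> bool:
    """Binary check: 'response mentions the product name'."""
    return PRODUCT_NAME.lower() in output.lower()

def run_assertions(output: str, checks) -> dict:
    """Run cheap binary checks first; only outputs passing all of them
    proceed to LLM-judge or human evaluation."""
    return {check.__name__: check(output) for check in checks}

results = run_assertions('{"answer": "AcmeWidget ships in 2 days"}',
                         [is_valid_json, mentions_product])
```

Because each check is a named function, failures are self-describing in logs, and the same predicates can double as production guardrails.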
```python
def establish_baselines(eval_dataset: list[dict], llm_call) -> dict:
    """Measure these baselines before optimising anything.

    Assumes an `evaluate(output, expected) -> float` scorer is in scope.
    """
    baselines = {}

    # 1. Zero-shot baseline (no examples in the prompt)
    zero_shot_scores = [evaluate(llm_call(item["input"]), item["expected"])
                        for item in eval_dataset]
    baselines["zero_shot"] = sum(zero_shot_scores) / len(zero_shot_scores)

    # 2. Naive baseline (simplest possible non-LLM approach)
    #    e.g. for summarisation: the first ~500 characters of the source
    naive_scores = [evaluate(item["input"][:500], item["expected"])
                    for item in eval_dataset]
    baselines["naive"] = sum(naive_scores) / len(naive_scores)

    # 3. Human ceiling (sample of human-written outputs, where available)
    human_scores = [evaluate(item["human_output"], item["expected"])
                    for item in eval_dataset if "human_output" in item]
    if human_scores:
        baselines["human_ceiling"] = sum(human_scores) / len(human_scores)

    print("Baselines established:")
    for name, score in baselines.items():
        print(f"  {name}: {score:.3f}")
    return baselines
```
A good eval dataset needs: (1) coverage — represents the full distribution of real inputs, including edge cases; (2) quality labels — correct answers verified by domain experts, not just the LLM itself; (3) size — enough examples to detect statistically significant changes (usually 100–500 for task evals, 50–100 for more expensive human-reviewed sets).
Construction steps: sample from real production data (or simulate if pre-launch), have domain experts label 100–200 examples, use those to calibrate your automated eval, then expand with LLM-assisted labeling (verified by human spot-checks). Never use the same data for both prompt development and eval.
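The "never reuse the same data" rule is easy to enforce mechanically with a deterministic hash-based split, which keeps each example in the same bucket even as the dataset grows. A sketch, assuming each example carries a stable `input` string:

```python
import hashlib

def split_dev_eval(examples: list[dict], eval_fraction: float = 0.3):
    """Deterministically partition examples so the prompt-development (dev)
    set and the eval set never overlap: the same input always hashes to
    the same bucket, even across dataset versions."""
    dev, eval_set = [], []
    for ex in examples:
        digest = hashlib.sha256(ex["input"].encode("utf-8")).digest()
        bucket = digest[0] / 255.0  # stable pseudo-random value in [0, 1]
        (eval_set if bucket < eval_fraction else dev).append(ex)
    return dev, eval_set

examples = [{"input": f"question {i}", "expected": "..."} for i in range(1000)]
dev, eval_set = split_dev_eval(examples)
```

Unlike `random.shuffle`, this assignment survives re-runs and dataset growth, so a prompt tuned on the dev set can never silently see eval examples.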
Selecting evaluation metrics requires matching the metric's measurement assumptions to the quality dimensions that actually matter for the application. Many teams default to reference-based metrics (BLEU, ROUGE) because they are simple to compute, then discover that these metrics correlate poorly with human quality judgments for the tasks being evaluated. The table below maps task types to appropriate primary and secondary metrics.
| Task type | Primary metric | Secondary metric | Avoid |
|---|---|---|---|
| RAG question answering | Faithfulness (LLM judge) | Answer relevance | BLEU (no reference match) |
| Summarization | Coverage (LLM judge) | ROUGE-L (recall) | ROUGE-1 precision alone |
| Classification | Accuracy / F1 | Calibration (ECE) | Accuracy on imbalanced data |
| Code generation | Pass@k (unit tests) | Edit distance to reference | BLEU on code |
| Dialogue | Human preference rate | Task completion rate | Perplexity |
```python
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Define a test case from RAG pipeline output
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    retrieval_context=["France is a country in Western Europe. Its capital is Paris."],
)

faithfulness = FaithfulnessMetric(threshold=0.7)
relevancy = AnswerRelevancyMetric(threshold=0.7)

faithfulness.measure(test_case)
relevancy.measure(test_case)
print(f"Faithfulness: {faithfulness.score:.2f}, Relevancy: {relevancy.score:.2f}")
```
Evaluation frequency decisions involve balancing the cost of running evaluations against the risk of shipping quality regressions. Running a full eval suite on every pull request is ideal for catching regressions early but is cost-prohibitive when the suite takes hours to run or uses expensive LLM-as-judge metrics. A tiered evaluation approach — a fast smoke-test suite on every commit, a full suite on main branch merges, and a comprehensive dataset including adversarial examples on release candidates — provides regression coverage at each criticality level without creating cost barriers to frequent iteration during development.
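A tiered setup can be as simple as a config mapping each trigger to a suite size and metric set. A sketch — the tier names, sizes, and metric labels are illustrative:

```python
# Illustrative tier definitions -- sizes and metric names are assumptions.
EVAL_TIERS = {
    "commit":  {"n_examples": 20,  "metrics": ["assertions"]},
    "merge":   {"n_examples": 200, "metrics": ["assertions", "llm_judge"]},
    "release": {"n_examples": 500, "metrics": ["assertions", "llm_judge", "adversarial"]},
}

def build_suite(tier: str, dataset: list[dict]) -> dict:
    """Select the slice of the eval dataset and the metric set for a tier.
    The slice is fixed and versioned so runs are comparable over time."""
    cfg = EVAL_TIERS[tier]
    return {
        "examples": dataset[: cfg["n_examples"]],
        "metrics": cfg["metrics"],
    }

smoke = build_suite("commit", [{"input": str(i)} for i in range(1000)])
```

The commit tier should run in seconds (rule-based assertions only); expensive LLM-judge and adversarial sets are reserved for merges and release candidates.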
Statistical significance testing on evaluation metric changes prevents teams from acting on noise. When comparing two model versions on a 200-example test set, a difference of a few percentage points on a binary metric typically falls within sampling noise and is not significant at the conventional 0.05 threshold. McNemar's test for paired binary metrics and paired t-tests for continuous metrics are the appropriate statistical tools for determining whether an observed improvement is reliable. Establishing minimum detectable effect sizes before designing evaluation experiments ensures that test sets are sized to provide sufficient power for the quality differences the team cares about detecting.
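McNemar's test for paired binary outcomes needs only the counts of examples where the two versions disagree. An exact version can be sketched with `scipy`, assuming per-example correctness booleans for both versions:

```python
from scipy.stats import binomtest

def mcnemar_exact_p(a_correct: list[bool], b_correct: list[bool]) -> float:
    """Exact McNemar test: under H0 (no quality difference), each
    discordant pair is equally likely to favour either version."""
    b = sum(1 for a, bb in zip(a_correct, b_correct) if a and not bb)  # A right, B wrong
    c = sum(1 for a, bb in zip(a_correct, b_correct) if bb and not a)  # B right, A wrong
    if b + c == 0:
        return 1.0  # the versions never disagree
    return binomtest(b, n=b + c, p=0.5, alternative="two-sided").pvalue

# Toy paired comparison on 200 examples: B fixes 12 of A's errors, breaks 2
a_results = [True] * 140 + [False] * 60
b_results = [True] * 138 + [False] * 2 + [True] * 12 + [False] * 48
p = mcnemar_exact_p(a_results, b_results)
```

Because the test conditions only on discordant pairs, it ignores the (usually large) mass of examples both versions get right or wrong, which is what gives it power on small eval sets.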
Calibration of LLM-as-judge metrics against human annotations is essential before using automated metrics to drive decisions. An LLM judge that systematically overscores verbose responses or underscores technically correct but tersely written answers will push prompt optimization in the wrong direction. Computing the Pearson or Spearman correlation between judge scores and human preference scores on a held-out calibration set of 100–200 examples identifies systematic biases before they propagate into evaluation-driven development decisions. If correlation is below 0.6, the judge prompt or judge model requires adjustment before the metric provides reliable signal.
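The calibration check itself is a few lines with `scipy`, assuming paired judge scores and human scores for the same outputs; the 0.6 threshold below mirrors the one in the text:

```python
from scipy.stats import pearsonr, spearmanr

def check_judge_calibration(judge_scores, human_scores, threshold=0.6):
    """Correlate automated judge scores against human labels on a held-out
    calibration set; below the threshold, fix the judge prompt or model
    before trusting the metric."""
    pearson, _ = pearsonr(judge_scores, human_scores)
    spearman, _ = spearmanr(judge_scores, human_scores)
    return {
        "pearson": pearson,
        "spearman": spearman,
        "usable": min(pearson, spearman) >= threshold,
    }

# Toy data: the judge overscores by a constant offset but preserves ranking
human = [1, 2, 3, 4, 5, 3, 2, 4]
judge = [3, 4, 5, 6, 7, 5, 4, 6]
report = check_judge_calibration(judge, human)
```

Comparing Pearson and Spearman is itself diagnostic: a high Spearman with a low Pearson suggests the judge preserves ordering but compresses or stretches the scale, which matters if you threshold on absolute scores.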
Evaluation pipeline automation reduces the overhead of running evaluations on every model change to near zero. A CI/CD pipeline that triggers the evaluation suite on main branch merges, logs results to MLflow or Weights & Biases, and posts a summary comment to the pull request with metric comparisons against the previous baseline creates a culture of continuous quality measurement. Teams that establish this infrastructure early in development find that quality regressions are caught within hours of the causative change rather than discovered during manual review cycles days or weeks later, dramatically reducing the debugging cost of quality issues.
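The PR-comment step of such a pipeline reduces to comparing the current run against a stored baseline and rendering a table. A sketch — the function name, tolerance, and "(regression)" flag are illustrative, and both runs are assumed to report the same metric keys:

```python
def format_metric_comment(current: dict, baseline: dict, tolerance: float = 0.01) -> str:
    """Render a markdown table comparing this run's metrics to the stored
    baseline, flagging any metric that dropped by more than `tolerance`."""
    lines = ["| Metric | Baseline | Current | Delta |", "|---|---|---|---|"]
    for name in sorted(current):
        delta = current[name] - baseline[name]
        flag = " (regression)" if delta < -tolerance else ""
        lines.append(
            f"| {name} | {baseline[name]:.3f} | {current[name]:.3f} | {delta:+.3f}{flag} |"
        )
    return "\n".join(lines)

comment = format_metric_comment(
    current={"faithfulness": 0.80, "relevancy": 0.91},
    baseline={"faithfulness": 0.85, "relevancy": 0.90},
)
```

The CI job then posts `comment` to the pull request and, optionally, fails the build when any "(regression)" row appears.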
Eval dataset contamination — where documents from the evaluation set appear in the model's training data — produces artificially inflated benchmark scores that do not reflect real-world quality. For custom fine-tuned models, verifying that evaluation examples were not present in the fine-tuning data requires deduplication checks between the eval and training datasets. For proprietary models where training data is not disclosed, using evaluation examples that were created after the model's knowledge cutoff date, or constructing evaluation queries from internal data that could not have been in public training corpora, provides contamination-resistant quality measurements.
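A first-pass deduplication check between eval and fine-tuning data can be done with normalised exact-match hashing; real pipelines often add n-gram or embedding near-duplicate detection on top. A sketch:

```python
import hashlib
import re

def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences don't hide a duplicate."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_contaminated(eval_examples: list[str], training_texts: list[str]) -> list[str]:
    """Return eval examples whose normalised text appears verbatim
    in the fine-tuning data."""
    train_hashes = {
        hashlib.sha256(_normalise(t).encode()).hexdigest() for t in training_texts
    }
    return [
        ex for ex in eval_examples
        if hashlib.sha256(_normalise(ex).encode()).hexdigest() in train_hashes
    ]

train = ["The  capital of France is Paris.", "Water boils at 100 C."]
evals = ["the capital of france is paris.", "Who wrote Hamlet?"]
leaked = find_contaminated(evals, train)  # catches the case/whitespace variant
```

Any example `find_contaminated` returns should be removed from the eval set (or from the training data) before benchmark scores are reported.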