Curated, human-verified examples that act as your regression suite. The foundation of reliable LLM development — build them early, maintain them continuously, and run them on every change.
A "golden test set" is a curated collection of (input, expected_output) pairs where the expected outputs have been verified by humans to represent correct, high-quality responses. "Golden" means: (1) human-verified — not generated by the same LLM you're evaluating; (2) stable — the expected answers don't change unless you deliberately update them; (3) representative — covers the full range of inputs your system will encounter in production, including edge cases; (4) versioned — tracked in Git so you know exactly what changed and when.
Start small and iterate:
import json
from dataclasses import asdict, dataclass


@dataclass
class GoldenExample:
    id: str
    category: str                 # "typical", "edge_case", "adversarial", "regression"
    input: str | dict             # the user input or full conversation
    expected_output: str          # human-verified ideal response
    expected_facts: list[str]     # key facts that must appear in any acceptable answer
    forbidden_content: list[str]  # things that must NOT appear
    notes: str = ""               # why this example is in the set
    added_date: str = ""
    added_by: str = ""


# Store as JSONL — one example per line, easy to diff in git
def save_golden_set(examples: list[GoldenExample], path: str):
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(asdict(ex)) + "\n")


def load_golden_set(path: str) -> list[GoldenExample]:
    examples = []
    with open(path) as f:
        for line in f:
            examples.append(GoldenExample(**json.loads(line)))
    return examples


# Store in git alongside your code:
#   evals/golden/customer_support.jsonl
#   evals/golden/product_descriptions.jsonl
#   evals/golden/sql_generation.jsonl
from typing import Callable


async def run_golden_eval(
    examples: list[GoldenExample],
    system_under_test: Callable,
    judge_llm: Callable,
) -> dict:
    results = []
    for ex in examples:
        output = await system_under_test(ex.input)

        # Check hard assertions first (fast, cheap); compare case-insensitively
        hard_pass = all(fact.lower() in output.lower() for fact in ex.expected_facts)
        hard_fail = any(
            forbidden.lower() in output.lower() for forbidden in ex.forbidden_content
        )

        # LLM judge for soft quality (slower, more expensive)
        judge_score = await judge_llm(
            f"Rate this response 1-10 compared to the ideal.\n"
            f"Question: {ex.input}\nIdeal: {ex.expected_output}\nActual: {output}"
        )

        results.append({
            "id": ex.id,
            "category": ex.category,
            "hard_pass": hard_pass and not hard_fail,
            "judge_score": judge_score,
            "output": output,
        })

    by_category = {}
    for r in results:
        by_category.setdefault(r["category"], []).append(r)

    return {
        "overall_pass_rate": sum(r["hard_pass"] for r in results) / len(results),
        "mean_judge_score": sum(r["judge_score"] for r in results) / len(results),
        "by_category": {
            cat: {
                "pass_rate": sum(r["hard_pass"] for r in items) / len(items),
                "mean_score": sum(r["judge_score"] for r in items) / len(items),
            }
            for cat, items in by_category.items()
        },
    }
A golden set is a living document: add an example for every production failure you fix, retire examples that no longer reflect real usage, and record every change in version control so score movements stay explainable.
Golden sets and A/B testing are complementary. Golden sets catch regressions — changes that break cases that previously worked. A/B testing measures improvements — whether a new prompt does better on the full distribution of real traffic. Use golden sets to gate releases (don't ship if pass rate drops), and A/B testing to measure whether a change actually improves the metrics you care about in production.
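The release-gating rule described here can be sketched as a comparison against a stored baseline. This is a minimal sketch: `gate_release` and its threshold defaults are illustrative, and the summaries are assumed to have the shape returned by `run_golden_eval` above.

```python
def gate_release(
    baseline: dict,
    candidate: dict,
    max_pass_rate_drop: float = 0.0,    # illustrative: no regression allowed
    max_judge_score_drop: float = 0.25, # illustrative tolerance for judge noise
) -> bool:
    """Return True if the candidate run may ship.

    `baseline` and `candidate` are eval summaries with "overall_pass_rate"
    and "mean_judge_score" keys, as produced by run_golden_eval.
    """
    pass_ok = (
        candidate["overall_pass_rate"]
        >= baseline["overall_pass_rate"] - max_pass_rate_drop
    )
    score_ok = (
        candidate["mean_judge_score"]
        >= baseline["mean_judge_score"] - max_judge_score_drop
    )
    return pass_ok and score_ok
```

Wired into CI, this makes the golden set a hard gate: the pipeline runs the eval, compares against the committed baseline summary, and fails the build when the gate returns False.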
A golden test set is a curated collection of high-quality evaluation examples with verified correct answers, used as the authoritative benchmark for measuring LLM system quality. Unlike dynamic evaluation data that may change, golden test sets are stable references against which all system versions are compared, providing a consistent quality baseline over time.
| Property | Good Golden Set | Poor Golden Set |
|---|---|---|
| Coverage | Represents full task diversity | Overweights easy/common cases |
| Difficulty | Mix of easy, medium, hard | Only easy cases (inflates scores) |
| Ground truth quality | Verified by domain experts | Generated by the same model being tested |
| Stability | Rarely modified, versioned | Frequently changed, unversioned |
| Size | 100–500 examples | 10 (too small) or 10,000 (expensive to evaluate) |
Domain expert verification is the critical quality gate that distinguishes truly golden test sets from convenience samples. Answers generated by an LLM and accepted without expert review encode the model's knowledge biases and errors into the ground truth, causing evaluations to measure alignment with the model's existing behavior rather than actual correctness. For factual domains — medical, legal, financial — expert review is non-negotiable. For subjective quality dimensions, calibration sessions where multiple annotators independently label examples and discuss disagreements establish reliable inter-rater agreement before scaling annotation.
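The calibration sessions mentioned above are usually judged with a chance-corrected agreement statistic such as Cohen's kappa. A minimal stdlib sketch (`cohens_kappa` is an illustrative helper, not from the text):

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators labeling the same items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys()
    )
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

A common rule of thumb is to continue calibration rounds until kappa clears an agreed bar before scaling annotation out.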
Test set contamination prevention requires isolating evaluation data from training and fine-tuning pipelines. If examples from the golden test set appear in fine-tuning data — even as few-shot examples in prompts or as augmented variants — the model will be evaluated against data it has effectively memorized, producing inflated scores that do not represent generalization quality. Hashing evaluation examples and adding them to a blocklist checked during data pipeline construction prevents accidental contamination as the system evolves.
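A minimal version of that hash-and-blocklist check might look like the following sketch; the normalization scheme and function names are assumptions, not a fixed API.

```python
import hashlib


def example_hash(text: str) -> str:
    """Hash a normalized input so trivial whitespace/case edits still match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_blocklist(golden_inputs: list[str]) -> set[str]:
    """Hash every golden-set input once, at test-set build time."""
    return {example_hash(t) for t in golden_inputs}


def filter_training_data(candidates: list[str], blocklist: set[str]) -> list[str]:
    """Drop any training example whose normalized hash appears in the golden set."""
    return [t for t in candidates if example_hash(t) not in blocklist]
```

Running `filter_training_data` inside the data pipeline keeps golden examples (and near-verbatim copies) out of fine-tuning sets; fuzzier variants still require semantic deduplication on top.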
# Golden test set structure
golden_set = [
    {
        "id": "gs-001",
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "difficulty": "easy",
        "category": "factual-geography",
        "verified_by": "human",
        "added_date": "2024-01-15",
    },
    # ... more examples
]
Stratified sampling ensures that golden test sets cover the full distribution of difficulty levels and categories proportional to their frequency in production traffic. Oversampling easy cases produces test sets that appear high-quality in aggregate (high average score) while masking poor performance on the hard edge cases that frustrate users most. Tracking difficulty distribution metrics alongside quality scores provides a complete picture — a model might score 95% on easy cases but only 40% on hard cases, an important distinction that aggregate accuracy obscures.
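Stratified sampling along these lines can be sketched as follows; the helper name and quota rounding are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(pool: list[dict], traffic_freq: dict[str, float],
                      n: int, seed: int = 0) -> list[dict]:
    """Draw `n` examples whose category mix mirrors production traffic.

    `pool` is a list of dicts with a "category" key; `traffic_freq` maps each
    category to its observed share of production requests (summing to 1.0).
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for ex in pool:
        by_cat[ex["category"]].append(ex)
    sample = []
    for cat, freq in traffic_freq.items():
        # Per-category quota, capped by how many candidates actually exist
        quota = min(round(n * freq), len(by_cat[cat]))
        sample.extend(rng.sample(by_cat[cat], quota))
    return sample
```

Reporting per-category scores alongside the aggregate (as `run_golden_eval` above does) then exposes exactly the easy-vs-hard gap this paragraph warns about.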
Golden test set maintenance cadence should align with the rate of domain change. For static knowledge domains — historical facts, mathematical identities, legal definitions — a test set can remain valid for years. For dynamic domains — current events, product availability, regulatory requirements — annual or quarterly review cycles are necessary to remove outdated examples and add new ones reflecting current domain state. Versioning the test set with semantic versions and maintaining a changelog of additions and removals provides transparency about what quality scores measure at each point in time.
Adversarial golden test set examples deliberately probe known model weaknesses: long-context reasoning, negation handling, multi-hop deduction chains, numerical reasoning, and prompt injection resistance. Including adversarial examples ensures that the test set catches regressions in these challenging areas that might be masked by high scores on straightforward examples. Red-teaming sessions, where domain experts deliberately try to construct examples that fool the system, are a productive source of adversarial golden set additions that capture realistic attack patterns.
Test set size requirements depend on the statistical precision needed for quality measurements. Detecting a 2-percentage-point improvement in a binary quality metric with 80% power requires approximately 2,000 test examples, while detecting a 5-point improvement requires only 300. For expensive human-evaluated metrics, a 300-example test set often provides sufficient precision for the quality differences that matter in practice. Prioritizing test set representativeness over size — ensuring coverage of all important categories and difficulty levels — produces more actionable quality signals than a large but distribution-mismatched test set.
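The sample sizes quoted above are consistent with the standard normal-approximation formula for a proportion test. A sketch, assuming a baseline pass rate near 90%, 5% two-sided significance, and 80% power (all assumptions for illustration):

```python
import math


def required_n(delta: float, p: float = 0.9,
               z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate examples needed to detect an absolute change `delta`
    in a pass rate near `p`:

        n ≈ (z_alpha + z_beta)^2 * p * (1 - p) / delta^2

    z_alpha = 1.96 (5% two-sided), z_beta = 0.84 (80% power).
    """
    return math.ceil((z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)
```

The key property is that n scales with 1/delta²: halving the detectable effect quadruples the required test set, which is why aiming for a 2-point resolution costs several times more than a 5-point one.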