Curated, human-verified examples that act as your regression suite. The foundation of reliable LLM development — build them early, maintain them continuously, and run them on every change.
A "golden test set" is a curated collection of (input, expected_output) pairs where the expected outputs have been verified by humans to represent correct, high-quality responses. "Golden" means: (1) human-verified — not generated by the same LLM you're evaluating; (2) stable — the expected answers don't change unless you deliberately update them; (3) representative — covers the full range of inputs your system will encounter in production, including edge cases; (4) versioned — tracked in Git so you know exactly what changed and when.
Start small and iterate:
import json
from dataclasses import asdict, dataclass


@dataclass
class GoldenExample:
    id: str
    category: str                 # "typical", "edge_case", "adversarial", "regression"
    input: str | dict             # the user input or full conversation
    expected_output: str          # human-verified ideal response
    expected_facts: list[str]     # key facts that must appear in any acceptable answer
    forbidden_content: list[str]  # things that must NOT appear
    notes: str = ""               # why this example is in the set
    added_date: str = ""
    added_by: str = ""


# Store as JSONL — one example per line, easy to diff in git
def save_golden_set(examples: list[GoldenExample], path: str):
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(asdict(ex)) + "\n")


def load_golden_set(path: str) -> list[GoldenExample]:
    examples = []
    with open(path) as f:
        for line in f:
            examples.append(GoldenExample(**json.loads(line)))
    return examples


# Store in git alongside your code:
#   evals/golden/customer_support.jsonl
#   evals/golden/product_descriptions.jsonl
#   evals/golden/sql_generation.jsonl
from typing import Callable


async def run_golden_eval(
    examples: list[GoldenExample],
    system_under_test: Callable,
    judge_llm: Callable,
) -> dict:
    results = []
    for ex in examples:
        output = await system_under_test(ex.input)

        # Check hard assertions first (fast, cheap); compare case-insensitively
        hard_pass = all(fact.lower() in output.lower() for fact in ex.expected_facts)
        hard_fail = any(
            forbidden.lower() in output.lower() for forbidden in ex.forbidden_content
        )

        # LLM judge for soft quality (slower, more expensive)
        judge_score = await judge_llm(
            f"Rate this response 1-10 compared to the ideal.\n"
            f"Question: {ex.input}\nIdeal: {ex.expected_output}\nActual: {output}"
        )

        results.append({
            "id": ex.id,
            "category": ex.category,
            "hard_pass": hard_pass and not hard_fail,
            "judge_score": judge_score,
            "output": output,
        })

    by_category = {}
    for r in results:
        by_category.setdefault(r["category"], []).append(r)

    return {
        "overall_pass_rate": sum(r["hard_pass"] for r in results) / len(results),
        "mean_judge_score": sum(r["judge_score"] for r in results) / len(results),
        "by_category": {
            cat: {
                "pass_rate": sum(r["hard_pass"] for r in items) / len(items),
                "mean_score": sum(r["judge_score"] for r in items) / len(items),
            }
            for cat, items in by_category.items()
        },
    }
A golden set is a living document: add an example for every production failure you fix, retire examples that no longer reflect real usage, and record every change in version control so score movements stay explainable.
Golden sets and A/B testing are complementary. Golden sets catch regressions — changes that break cases that previously worked. A/B testing measures improvements — whether a new prompt does better on the full distribution of real traffic. Use golden sets to gate releases (don't ship if pass rate drops), and A/B testing to measure whether a change actually improves the metrics you care about in production.
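The release-gating rule described here can be sketched as a comparison against a stored baseline. This is a minimal sketch: `gate_release` and its threshold defaults are illustrative, and the summaries are assumed to have the shape returned by `run_golden_eval` above.

```python
def gate_release(
    baseline: dict,
    candidate: dict,
    max_pass_rate_drop: float = 0.0,    # illustrative: no regression allowed
    max_judge_score_drop: float = 0.25, # illustrative tolerance for judge noise
) -> bool:
    """Return True if the candidate run may ship.

    `baseline` and `candidate` are eval summaries with "overall_pass_rate"
    and "mean_judge_score" keys, as produced by run_golden_eval.
    """
    pass_ok = (
        candidate["overall_pass_rate"]
        >= baseline["overall_pass_rate"] - max_pass_rate_drop
    )
    score_ok = (
        candidate["mean_judge_score"]
        >= baseline["mean_judge_score"] - max_judge_score_drop
    )
    return pass_ok and score_ok
```

Wired into CI, this makes the golden set a hard gate: the pipeline runs the eval, compares against the committed baseline summary, and fails the build when the gate returns False.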
A golden test set is a curated collection of high-quality evaluation examples with verified correct answers, used as the authoritative benchmark for measuring LLM system quality. Unlike dynamic evaluation data that may change, golden test sets are stable references against which all system versions are compared, providing a consistent quality baseline over time.
| Property | Good Golden Set | Poor Golden Set |
|---|---|---|
| Coverage | Represents full task diversity | Overweights easy/common cases |
| Difficulty | Mix of easy, medium, hard | Only easy cases (inflates scores) |
| Ground truth quality | Verified by domain experts | Generated by the same model being tested |
| Stability | Rarely modified, versioned | Frequently changed, unversioned |
| Size | 100–500 examples | 10 (too small) or 10,000 (expensive to evaluate) |
Domain expert verification is the critical quality gate that distinguishes truly golden test sets from convenience samples. Answers generated by an LLM and accepted without expert review encode the model's knowledge biases and errors into the ground truth, causing evaluations to measure alignment with the model's existing behavior rather than actual correctness. For factual domains — medical, legal, financial — expert review is non-negotiable. For subjective quality dimensions, calibration sessions where multiple annotators independently label examples and discuss disagreements establish reliable inter-rater agreement before scaling annotation.
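The calibration sessions mentioned above are usually judged with a chance-corrected agreement statistic such as Cohen's kappa. A minimal stdlib sketch (`cohens_kappa` is an illustrative helper, not from the text):

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators labeling the same items."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys()
    )
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

A common rule of thumb is to continue calibration rounds until kappa clears an agreed bar before scaling annotation out.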
Test set contamination prevention requires isolating evaluation data from training and fine-tuning pipelines. If examples from the golden test set appear in fine-tuning data — even as few-shot examples in prompts or as augmented variants — the model will be evaluated against data it has effectively memorized, producing inflated scores that do not represent generalization quality. Hashing evaluation examples and adding them to a blocklist checked during data pipeline construction prevents accidental contamination as the system evolves.
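A minimal version of that hash-and-blocklist check might look like the following sketch; the normalization scheme and function names are assumptions, not a fixed API.

```python
import hashlib


def example_hash(text: str) -> str:
    """Hash a normalized input so trivial whitespace/case edits still match."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_blocklist(golden_inputs: list[str]) -> set[str]:
    """Hash every golden-set input once, at test-set build time."""
    return {example_hash(t) for t in golden_inputs}


def filter_training_data(candidates: list[str], blocklist: set[str]) -> list[str]:
    """Drop any training example whose normalized hash appears in the golden set."""
    return [t for t in candidates if example_hash(t) not in blocklist]
```

Running `filter_training_data` inside the data pipeline keeps golden examples (and near-verbatim copies) out of fine-tuning sets; fuzzier variants still require semantic deduplication on top.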
# Golden test set structure
golden_set = [
    {
        "id": "gs-001",
        "input": "What is the capital of France?",
        "expected_output": "Paris",
        "difficulty": "easy",
        "category": "factual-geography",
        "verified_by": "human",
        "added_date": "2024-01-15",
    },
    # ... more examples
]
Stratified sampling ensures that golden test sets cover the full distribution of difficulty levels and categories proportional to their frequency in production traffic. Oversampling easy cases produces test sets that appear high-quality in aggregate (high average score) while masking poor performance on the hard edge cases that frustrate users most. Tracking difficulty distribution metrics alongside quality scores provides a complete picture — a model might score 95% on easy cases but only 40% on hard cases, an important distinction that aggregate accuracy obscures.
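Stratified sampling along these lines can be sketched as follows; the helper name and quota rounding are illustrative assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(pool: list[dict], traffic_freq: dict[str, float],
                      n: int, seed: int = 0) -> list[dict]:
    """Draw `n` examples whose category mix mirrors production traffic.

    `pool` is a list of dicts with a "category" key; `traffic_freq` maps each
    category to its observed share of production requests (summing to 1.0).
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for ex in pool:
        by_cat[ex["category"]].append(ex)
    sample = []
    for cat, freq in traffic_freq.items():
        # Per-category quota, capped by how many candidates actually exist
        quota = min(round(n * freq), len(by_cat[cat]))
        sample.extend(rng.sample(by_cat[cat], quota))
    return sample
```

Reporting per-category scores alongside the aggregate (as `run_golden_eval` above does) then exposes exactly the easy-vs-hard gap this paragraph warns about.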
Golden test set maintenance cadence should align with the rate of domain change. For static knowledge domains — historical facts, mathematical identities, legal definitions — a test set can remain valid for years. For dynamic domains — current events, product availability, regulatory requirements — annual or quarterly review cycles are necessary to remove outdated examples and add new ones reflecting current domain state. Versioning the test set with semantic versions and maintaining a changelog of additions and removals provides transparency about what quality scores measure at each point in time.
Adversarial golden test set examples deliberately probe known model weaknesses: long-context reasoning, negation handling, multi-hop deduction chains, numerical reasoning, and prompt injection resistance. Including adversarial examples ensures that the test set catches regressions in these challenging areas that might be masked by high scores on straightforward examples. Red-teaming sessions, where domain experts deliberately try to construct examples that fool the system, are a productive source of adversarial golden set additions that capture realistic attack patterns.
Test set size requirements depend on the statistical precision needed for quality measurements. Detecting a 2-percentage-point improvement in a binary quality metric with 80% power requires approximately 2,000 test examples, while detecting a 5-point improvement requires only 300. For expensive human-evaluated metrics, a 300-example test set often provides sufficient precision for the quality differences that matter in practice. Prioritizing test set representativeness over size — ensuring coverage of all important categories and difficulty levels — produces more actionable quality signals than a large but distribution-mismatched test set.
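The sample sizes quoted above are consistent with the standard normal-approximation formula for a proportion test. A sketch, assuming a baseline pass rate near 90%, 5% two-sided significance, and 80% power (all assumptions for illustration):

```python
import math


def required_n(delta: float, p: float = 0.9,
               z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate examples needed to detect an absolute change `delta`
    in a pass rate near `p`:

        n ≈ (z_alpha + z_beta)^2 * p * (1 - p) / delta^2

    z_alpha = 1.96 (5% two-sided), z_beta = 0.84 (80% power).
    """
    return math.ceil((z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2)
```

The key property is that n scales with 1/delta²: halving the detectable effect quadruples the required test set, which is why aiming for a 2-point resolution costs several times more than a 5-point one.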