Building eval pipelines that actually catch regressions — metrics, frameworks, and the human-vs-automatic tradeoff
Classic software testing is deterministic and binary: a function either returns the correct value or it doesn't. LLM evaluation is fundamentally different. Model outputs are non-deterministic, open-ended, and contextual. A response can be correct in multiple ways, and "correctness" itself is often subjective.
The eval trilemma defines the core tension: accuracy (does your metric actually measure what matters?), cost (can you run it frequently without bankrupting your budget?), and speed (do you get feedback before your next deployment?). You can usually achieve only two.
No single eval covers all failure modes. You need a suite. Common mistakes: evaluating only on benchmark datasets (train-test contamination and narrow coverage), using BLEU/ROUGE for generation (poor correlation with human judgment), and ignoring regression testing over time. These shortcuts feel fast until production fails.
Effective eval strategies have three layers. Each layer catches different failure modes and carries different risk profiles. Together they form a funnel that progressively de-risks model deployment.
- Layer 1 — Offline evals: run before any deployment. Automated, fast, cheap. Catch obvious regressions on your golden set.
- Layer 2 — Shadow evals: the new model handles real traffic, but its responses aren't shown to users. Compare outputs between the old and new models.
- Layer 3 — Online evals: an A/B test. Real users, real stakes. Monitor engagement, satisfaction, and thumbs-up/down signals.
| Layer | Speed | Cost | Signal quality | Risk |
|---|---|---|---|---|
| Offline | Fast | Low | Medium | Zero |
| Shadow | Slow | Medium | High | Zero |
| Online A/B | Slow | High | Highest | Real users |
```python
import json
import statistics
from enum import Enum

from openai import OpenAI

client = OpenAI()

class EvalStage(Enum):
    OFFLINE = "offline"  # run against golden set before deploy
    SHADOW = "shadow"    # run new model in parallel, compare to prod
    LIVE = "live"        # sample % of live traffic for quality check

def offline_eval(golden_path: str, model: str) -> dict:
    """Stage 1: evaluate against a frozen golden set."""
    with open(golden_path) as f:
        golden = [json.loads(line) for line in f]
    scores = []
    for item in golden:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}],
            temperature=0.0,
        ).choices[0].message.content
        # Exact-match scoring; swap in an LLM judge for open-ended tasks
        scores.append(1.0 if item["expected"].lower() in resp.lower() else 0.0)
    accuracy = statistics.mean(scores)
    return {"stage": "offline", "n": len(golden),
            "accuracy": round(accuracy, 3),
            "passed": accuracy >= 0.80}

def shadow_eval(requests: list[dict], prod_model: str, new_model: str) -> dict:
    """Stage 2: run both models, flag where the new model differs significantly."""
    sample = requests[:100]  # cap the shadow sample
    diffs = []
    for req in sample:
        prod_resp = client.chat.completions.create(
            model=prod_model,
            messages=[{"role": "user", "content": req["input"]}],
        ).choices[0].message.content
        new_resp = client.chat.completions.create(
            model=new_model,
            messages=[{"role": "user", "content": req["input"]}],
        ).choices[0].message.content
        # Crude diff signal: flag large response-length divergence
        ratio = len(new_resp) / max(len(prod_resp), 1)
        if ratio < 0.5 or ratio > 2.0:
            diffs.append({"input": req["input"], "ratio": ratio})
    return {"stage": "shadow", "n": len(sample),
            "significant_diffs": len(diffs),
            "diff_rate": len(diffs) / max(len(sample), 1)}

# Usage
result = offline_eval("golden_set.jsonl", "gpt-4o")
if result["passed"]:
    print(f"Offline eval PASSED ({result['accuracy']:.0%})")
else:
    raise SystemExit(f"Offline eval FAILED ({result['accuracy']:.0%})")
```
- Exact match: For classification and extraction tasks. Not useful for open-ended generation.
- BLEU / ROUGE: n-gram overlap with a reference answer. Still used for translation and summarization, but correlates poorly with human judgment.
- BERTScore: Embedding-based semantic similarity. Better than BLEU, but still requires a reference.
- G-Eval: Use GPT-4 to score responses on rubric dimensions (fluency, coherence, relevance, accuracy). High correlation with human judgments.
| Metric | Reference needed | Human correlation | Speed | Cost |
|---|---|---|---|---|
| Exact match | Yes | High (for extraction) | Fast | Free |
| BLEU / ROUGE | Yes | Low (generation) | Fast | Free |
| BERTScore | Yes | Medium | Fast | Free |
| G-Eval (LLM judge) | Optional | High | Slow | API cost |
| Human eval | N/A | Ground truth | Very slow | High |
The choice depends on your use case. For factual extraction, exact match is reliable. For creative generation, LLM judging is more trustworthy than n-gram overlap. Invest in building ground truth — a small set of human-evaluated examples that validates your automated metrics.
Use a capable LLM (GPT-4o, Claude) to evaluate outputs on custom rubrics. This scales human judgment without re-hiring annotators. But it introduces new failure modes: position bias (preferring the first response), verbosity bias (preferring longer answers), and self-serving bias (models prefer their own outputs).
- Point-wise: Score each response independently on a 1–5 scale per dimension. Faster and easier to parallelize, but noisier.
- Pair-wise: Compare two responses and pick the better one. More reliable than point-wise because the relative comparison is clearer; requires more judge calls but produces higher-quality signals.
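A point-wise judge can be reduced to two pieces: a rubric prompt and a score parser. The sketch below assumes a judge that returns JSON scores; the rubric dimensions and the `call_judge` callable are illustrative stand-ins for whatever judge model you wire in.

```python
import json
import statistics
from typing import Callable

RUBRIC = ["fluency", "coherence", "relevance", "accuracy"]

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a point-wise rubric prompt (dimension names are illustrative)."""
    dims = ", ".join(RUBRIC)
    return (
        f"Rate the answer on each dimension ({dims}) from 1 to 5.\n"
        'Respond with JSON only, e.g. {"fluency": 4, ...}.\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )

def pointwise_score(question: str, answer: str,
                    call_judge: Callable[[str], str]) -> float:
    """Average the per-dimension scores returned by the judge model."""
    raw = call_judge(build_judge_prompt(question, answer))
    scores = json.loads(raw)
    return statistics.mean(scores[d] for d in RUBRIC)

# Stubbed judge for illustration; in practice call_judge wraps an API call
fake = lambda prompt: '{"fluency": 5, "coherence": 4, "relevance": 4, "accuracy": 3}'
print(pointwise_score("What is RAG?", "Retrieval-augmented generation...", fake))  # 4.0
```

Keeping the judge behind a callable makes the scoring logic unit-testable without network calls, which matters once evals run in CI.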
You don't need a ground-truth reference to use LLM-as-judge. The judge can evaluate coherence, safety, and helpfulness from first principles. This is more expensive than metrics but works for open-ended tasks where multiple correct answers exist.
- Position bias: the judge prefers the first response. Mitigation: swap the order and average.
- Verbosity bias: longer answers score higher. Mitigation: require conciseness in the rubric.
- Self-serving bias: a judge model prefers outputs from its own family. Mitigation: use diverse judge models (GPT-4o, Claude, Llama).
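The swap-and-average mitigation for position bias can be sketched as a wrapper around any pair-wise judge: run the comparison in both orders and only count a verdict that survives the swap. The `judge` callable and its "first"/"second" return convention are assumptions for illustration.

```python
from typing import Callable

def debiased_pairwise(prompt: str, resp_a: str, resp_b: str,
                      judge: Callable[[str, str, str], str]) -> str:
    """Run the pair-wise judge twice with the order swapped; only a
    verdict that is consistent across both orders counts as a win."""
    first = judge(prompt, resp_a, resp_b)    # judge returns "first" or "second"
    second = judge(prompt, resp_b, resp_a)   # same pair, order swapped
    if first == "first" and second == "second":
        return "A"   # A won in both orders
    if first == "second" and second == "first":
        return "B"   # B won in both orders
    return "tie"     # verdict flipped with order: position-biased, call it a tie

# A pathological judge that always prefers whichever response is shown first
positional = lambda p, x, y: "first"
print(debiased_pairwise("q", "ans1", "ans2", positional))  # tie
```

A judge with pure position bias never produces a consistent verdict here, so its noise is converted into ties instead of spurious wins.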
Your golden dataset is a curated set of (input, expected output) pairs that define acceptable behavior. It's your product spec in eval form. Sources include real user queries (sampled and labeled), adversarial cases (red-team), and edge cases discovered in production.
- Size: Start with 100-200 cases for initial coverage and grow to 500-1000 as your product matures. Prioritize diversity over raw size.
- Labeling: Have domain experts label the examples, and resolve disagreements into a single authoritative label.
- Versioning: Keep golden sets under version control. Track when cases were added and why.
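A minimal sketch of what a versioned golden-set entry can look like, extending the `input`/`expected` fields the offline eval reads with provenance metadata. The extra field names (`category`, `added`, `reason`) are an assumed schema, not a standard.

```python
import datetime
import json

def add_golden_case(path: str, input_text: str, expected: str,
                    category: str, reason: str) -> dict:
    """Append one labeled case with provenance metadata (schema is illustrative)."""
    case = {
        "input": input_text,
        "expected": expected,
        "category": category,  # enables slice analysis later
        "added": datetime.date.today().isoformat(),
        "reason": reason,       # why this case exists, e.g. a production incident
    }
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")
    return case
```

Recording `added` and `reason` alongside the label is what makes quarterly pruning and regression bisection possible later.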
Run your golden set on every model change. Flag if score drops >2% on any category. Use slice analysis: don't just track overall accuracy — track by query type, language, user segment, and difficulty level. This catches regressions that hide in the aggregate.
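Slice analysis and the >2% regression flag can be sketched in a few lines, assuming each eval result carries a `category` label and a 0/1 `score`:

```python
import statistics
from collections import defaultdict

def slice_accuracy(results: list[dict]) -> dict[str, float]:
    """Group per-case scores by category and compute accuracy per slice."""
    by_slice = defaultdict(list)
    for r in results:
        by_slice[r["category"]].append(r["score"])
    return {k: statistics.mean(v) for k, v in by_slice.items()}

def regressions(baseline: dict[str, float], current: dict[str, float],
                threshold: float = 0.02) -> list[str]:
    """Flag slices whose accuracy dropped by more than `threshold` (2% default)."""
    return [k for k in baseline
            if k in current and baseline[k] - current[k] > threshold]

# Overall accuracy can look flat while one slice regresses badly:
base = {"en": 0.95, "de": 0.90}
curr = {"en": 0.97, "de": 0.84}
print(regressions(base, curr))  # ['de']
```

Here the English slice improved enough to mask the German drop in the aggregate, which is exactly the failure mode per-slice tracking catches.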
Many frameworks exist to manage eval pipelines. Choose based on your team's preferences and your existing stack. The key is integrating evals into CI/CD so they block bad deployments automatically.
Run offline evals on every PR. Block merge if score drops below threshold. Log every eval run with model version, prompt version, and dataset version — this enables trend analysis and bisecting regressions to specific changes. Use smaller judge models for fast CI runs (Llama 2, Mistral). Reserve GPT-4o for weekly deep evals.
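A CI gate along these lines compares the current run against a stored baseline, logs the versions needed for bisecting, and fails the build on a significant drop. The file layout and record keys are an assumed convention, not a framework API.

```python
import json

def ci_gate(run: dict, baseline_path: str, max_drop: float = 0.02) -> bool:
    """Compare an eval run to the stored baseline; fail CI on a >2% drop."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    drop = baseline["accuracy"] - run["accuracy"]
    record = {  # log enough context to bisect a regression later
        "model": run["model"],
        "prompt_version": run["prompt_version"],
        "dataset_version": baseline["dataset_version"],
        "accuracy": run["accuracy"],
        "drop": round(drop, 4),
    }
    print(json.dumps(record))
    return drop <= max_drop

# In CI: sys.exit(0 if ci_gate(run, "baseline.json") else 1)
```

Emitting the record as one JSON line per run keeps the eval history greppable, so trend analysis needs no extra infrastructure.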
| Framework | Strengths | Best for | Open source |
|---|---|---|---|
| OpenAI Evals | Model grading, datasets | GPT-based products | Yes |
| LangSmith | Tracing + evals | LangChain apps | No |
| Braintrust | Fast, git-native | General LLM products | No |
| PromptFoo | CLI-first, YAML | Prompt testing | Yes |
| DeepEval | Built-in metrics | RAG + chatbots | Yes |
Common pitfalls that lead teams to trust evals that don't actually measure product quality:
Using benchmark datasets (MMLU, HumanEval) as your only eval. Models can overfit to benchmarks without improving on real tasks. Benchmarks are coarse signals of general capability, not product metrics.
Optimizing one number (e.g., helpfulness) while degrading others (safety, factuality). Always track a metric suite. Monitor at least: accuracy, safety, latency, and cost.
Running LLM evals without validating against human judgments. You may be optimizing arbitrary scoring noise. Always calibrate your LLM judge against ground truth on a small sample.
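The simplest calibration check is agreement rate: on a small human-labeled sample, what fraction of pass/fail verdicts does the judge reproduce? A sketch, assuming binary labels:

```python
def judge_agreement(judge_scores: list[int], human_scores: list[int]) -> float:
    """Fraction of cases where the judge matches the human verdict —
    a cheap sanity check before trusting an LLM judge at scale."""
    assert len(judge_scores) == len(human_scores)
    matches = sum(j == h for j, h in zip(judge_scores, human_scores))
    return matches / len(judge_scores)

# 1 = pass, 0 = fail on a small human-labeled sample
human = [1, 1, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 1, 1]
print(judge_agreement(judge, human))  # 0.75
```

If agreement is near chance, the judge's scores are noise and anything optimized against them is arbitrary; for graded (non-binary) scores, a rank correlation is the analogous check.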
Never updating your golden set as user behavior evolves. Add 10-20 new cases quarterly. Retire cases that no longer reflect real use. A living golden set keeps pace with product drift.