You can't improve what you don't measure — the discipline that separates prototypes from production
You can't improve what you don't measure. Evaluation tells you if your model knows enough, if your RAG is retrieving the right chunks, and whether quality has quietly degraded since last week — before your users notice.
A typical LLM system has three stages of evaluation: offline (before deployment), canary (small traffic), and production (full rollout). Without rigorous offline evaluation, bugs reach production. Without monitoring, quality drift goes undetected for weeks.
| Stage | When | Method | Risk |
|---|---|---|---|
| Offline | Before deployment | Golden set, benchmarks | Missing edge cases |
| Canary | 5–10% production traffic | LLM-as-judge, user feedback | Statistical noise |
| Production | All traffic, ongoing | Monitoring dashboards, alerts | False positives |
Offline evaluation is your first gate. You test on a fixed dataset of questions and expected answers before any code reaches production. It's fast, reproducible, and catches obvious failures.
A golden dataset is a small (20–100 questions), curated collection of representative examples with ground-truth answers. Include edge cases: typos, ambiguous queries, multi-part questions, rare entities.
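A golden set can be as simple as one JSON object per line. A minimal sketch of building one — the questions, answers, and the `input`/`expected` field names here are illustrative, not from any real dataset:

```python
import json

# A few representative examples, including an edge case (typo in the query).
golden = [
    {"input": "What is our refund window?", "expected": "30 days from delivery."},
    {"input": "Whats teh refund windw?", "expected": "30 days from delivery."},  # typo edge case
    {"input": "Who founded the company and when?", "expected": "Jane Doe, in 2015."},
]

# One JSON object per line (JSONL) keeps the set easy to diff and version-control.
with open("golden_set.jsonl", "w") as f:
    for item in golden:
        f.write(json.dumps(item) + "\n")
```

Keeping the golden set in version control alongside the prompts means every change to either is reviewable.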
- **Exact match:** did the answer match the expected answer word-for-word or semantically? Best for Q&A with definitive answers (dates, names, policies).
- **N-gram overlap (BLEU/ROUGE):** measure n-gram overlap between the generated and expected answers. Fast but shallow: high overlap doesn't guarantee correctness.
- **Embedding similarity:** embed both the expected and actual answers and compute cosine similarity. Captures meaning even when the wording differs.
- **LLM-as-judge:** ask an LLM to grade the answer against the expected answer and a rubric. The most expressive option, but slower and costs an extra model call.
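The overlap and similarity checks can be sketched with a toy bag-of-words model — a real system would use a proper embedding model and a ROUGE library; `bow_vector` here is purely illustrative:

```python
import math
from collections import Counter

def bow_vector(text: str) -> Counter:
    """Toy bag-of-words 'embedding': raw token counts (illustrative only)."""
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def unigram_overlap(expected: str, actual: str) -> float:
    """ROUGE-1-style recall: fraction of expected tokens present in the answer."""
    exp, act = bow_vector(expected), bow_vector(actual)
    matched = sum(min(exp[t], act[t]) for t in exp)
    return matched / sum(exp.values()) if exp else 0.0

expected = "The refund window is 30 days"
actual = "Refunds are accepted within a 30 day window"
print(round(unigram_overlap(expected, actual), 2))
print(round(cosine_similarity(bow_vector(expected), bow_vector(actual)), 2))
```

Note how both metrics miss that "refunds"/"refund" and "day"/"days" mean the same thing — exactly the shallowness that embedding models (and LLM judges) fix.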
RAG systems have unique evaluation challenges. You need to measure retrieval quality (did we get relevant documents?), answer quality, and grounding (is the answer supported by context?).
- **Retrieval precision (context precision):** what fraction of retrieved chunks are actually relevant to the query?
- **Retrieval recall (context recall):** what fraction of the ground-truth information was retrieved?
- **Faithfulness:** are the claims in the answer supported by the retrieved context?
- **Answer relevancy:** does the final answer actually address the user's question?
LLM-as-judge is the most practical approach for evaluating open-ended answers. Instead of hand-grading every output, you ask a strong model (Claude, GPT-4) to grade against a rubric. It's fast, scalable, and surprisingly accurate when given a good rubric.
A good rubric describes what each score level means and what the judge should look for. Be specific about what "good" looks like.
```python
from pydantic import BaseModel, Field
from openai import OpenAI

client = OpenAI()

class JudgmentResult(BaseModel):
    score: int = Field(ge=1, le=5, description="Quality score 1 (poor) to 5 (excellent)")
    reasoning: str = Field(description="One-sentence explanation of the score")
    passes_bar: bool = Field(description="True if score >= 3 (acceptable quality)")

def judge_response(question: str, response: str, criteria: str) -> JudgmentResult:
    system = f"""You are a calibrated evaluator. Use these criteria:
{criteria}
Score 1=wrong/harmful, 2=poor, 3=acceptable, 4=good, 5=excellent.
Be consistent: the same response quality should always get the same score."""
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": f"Question: {question}\n\nResponse:\n{response}"},
        ],
        response_format=JudgmentResult,
        temperature=0.0,
    )
    return result.choices[0].message.parsed

# Example
j = judge_response(
    question="Explain what RAG is in 2 sentences.",
    response=(
        "RAG combines retrieval of relevant documents with LLM generation. "
        "It reduces hallucinations by grounding the model in external knowledge."
    ),
    criteria="1. Factual accuracy 2. Completeness 3. Appropriate length",
)
print(f"Score: {j.score}/5 | Pass: {j.passes_bar}")
print(f"Reason: {j.reasoning}")
```
Offline evaluation gets you to deploy. Production monitoring ensures you catch quality regression before it affects users. Without monitoring, silent failures can persist for weeks.
Set up alerts for: (1) Judge score drops >5% day-over-day, (2) Retrieval precision drops below baseline, (3) Error rate spikes, (4) Latency increases >30%. When an alert fires, compare with recent code changes and revert if needed.
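The day-over-day checks above reduce to simple comparisons against yesterday's metrics. A minimal sketch — the metric names are illustrative, and the 2x error-rate multiplier is an arbitrary choice for "spike":

```python
def check_alerts(today: dict, yesterday: dict) -> list[str]:
    """Compare today's metrics against yesterday's and return any fired alerts."""
    alerts = []
    if today["judge_score"] < yesterday["judge_score"] * 0.95:
        alerts.append("judge score dropped >5% day-over-day")
    if today["retrieval_precision"] < yesterday["retrieval_precision"]:
        alerts.append("retrieval precision below baseline")
    if today["error_rate"] > yesterday["error_rate"] * 2:  # "spike" threshold is a judgment call
        alerts.append("error rate spike")
    if today["p95_latency_ms"] > yesterday["p95_latency_ms"] * 1.30:
        alerts.append("latency up >30%")
    return alerts

yesterday = {"judge_score": 4.2, "retrieval_precision": 0.81, "error_rate": 0.010, "p95_latency_ms": 900}
today = {"judge_score": 3.9, "retrieval_precision": 0.82, "error_rate": 0.011, "p95_latency_ms": 950}
print(check_alerts(today, yesterday))
```

In production these values would come from your metrics store (e.g. daily aggregates of judge scores over sampled traffic), and the alert would page whoever owns the model rollout.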
Single-run evaluations are not enough. Production AI systems need continuous evaluation: every model update, prompt change, or retrieval schema change should trigger an automated eval run that catches regressions before they reach users.
A practical pipeline: nightly eval job against a frozen golden set → LLM-as-judge scoring on sampled live traffic → alert on score drop >5% → human review queue for flagged outputs. Gate model promotions on eval passage in CI/CD. The golden set should cover both typical cases and known hard cases — edge cases that previously caused production incidents are especially valuable.
```python
import json, statistics, sys
from openai import OpenAI

client = OpenAI()

def llm_score(question: str, expected: str, actual: str) -> float:
    """LLM-as-judge: returns 0.0–1.0."""
    prompt = f"""Rate how well the response answers the question compared to the expected answer.
Question: {question}
Expected: {expected}
Actual: {actual}
Reply with a single number 0-10 only."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=5,
        temperature=0.0,
    ).choices[0].message.content.strip()
    try:
        return float(resp) / 10.0
    except ValueError:
        return 0.0  # an unparseable judge reply counts as a failure

def run_eval(golden_path: str, model: str, threshold: float = 0.80) -> dict:
    with open(golden_path) as f:
        golden = [json.loads(line) for line in f]
    scores = []
    failures = []
    for item in golden:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}],
        ).choices[0].message.content
        score = llm_score(item["input"], item["expected"], response)
        scores.append(score)
        if score < 0.6:  # collect low scorers for the human review queue
            failures.append({"input": item["input"], "score": score, "response": response})
    avg = statistics.mean(scores)
    result = {"score": avg, "n": len(golden), "failures": failures, "passed": avg >= threshold}
    print(f"Eval: n={len(golden)} avg={avg:.3f} failures={len(failures)} → {'PASS' if result['passed'] else 'FAIL'}")
    return result

# Run and gate on result: a non-zero exit fails the CI/CD step
result = run_eval("golden_set.jsonl", "gpt-4o", threshold=0.80)
if not result["passed"]:
    sys.exit(f"Eval regression: {result['score']:.3f} < 0.80")
Dive deeper into specific evaluation practices and tools: