SECTION 01
The Evaluation Bottleneck
Evaluating LLM quality is hard. Manual human annotation is the gold standard but is expensive and slow: hiring annotators, defining rubrics, ensuring consistency across thousands of examples. For most labs, human evaluation costs $5-50 per example depending on task complexity.
Scale: If you evaluate 10k model outputs annually (a typical baseline for a new model), manual evaluation costs $50k-500k. With budget constraints, you can only evaluate 1-5% of your outputs, missing important failure modes.
The Solution: LLM-as-Judge
Use a strong, reliable LLM (GPT-4, Claude) to automatically evaluate outputs from other models (including weaker versions of itself). A single GPT-4 API call costs ~$0.001-0.01 per evaluation, roughly 1000x cheaper than human annotation.
Why it Works
- Pattern recognition: LLMs are trained on high-quality text; they learn what "good" looks like
- Consistency: LLM judges are near-deterministic at temperature 0 (though most APIs don't guarantee exact reproducibility)
- Multi-dimensional: Can evaluate along multiple criteria (accuracy, safety, helpfulness, reasoning)
- Explanation: Judges can provide chain-of-thought reasoning for their scores
Limitations
- Not perfect: LLM judges correlate with human judgement ~80-90%, not 100%
- Biased: Judges have their own biases (preference for longer answers, certain writing styles)
- Domain-specific: GPT-4 is a generalist; domain experts outperform it in specialized fields
- Cost still adds up: Evaluating 100k outputs with GPT-4 still costs $1000+
Key Insight: LLM judges are not replacements for human evaluation; they're augmenters. Use judges to screen 100% of outputs (cheap), then use humans to spot-check controversial cases (~5%) for ground truth.
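The screen-everything, spot-check-some workflow can be sketched in a few lines. The thresholds and sample rate below are illustrative assumptions, not prescriptions:

```python
import random

def select_for_human_review(scores, ambiguous=(4.0, 8.0), sample_rate=0.05, seed=0):
    """Return indices of judge-scored outputs to route to human reviewers.

    Everything the judge scored inside the ambiguous band goes to humans;
    a small random sample of the clear-cut cases is added so the judge
    itself stays auditable.
    """
    rng = random.Random(seed)
    flagged = {i for i, s in enumerate(scores) if ambiguous[0] <= s <= ambiguous[1]}
    clear = [i for i in range(len(scores)) if i not in flagged]
    if clear:
        k = max(1, int(len(clear) * sample_rate))
        flagged.update(rng.sample(clear, k))
    return sorted(flagged)
```

With this split, humans see every borderline case plus a trickle of "easy" ones that catch judge drift.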
SECTION 02
Judging Formats
There are several ways to structure LLM judgement. Choice depends on your task:
1. Pointwise Scoring (Absolute)
Rate a single output on a rubric, independent of other outputs:
Task: Evaluate this response to "How does photosynthesis work?"
Response: "Photosynthesis is a process where plants convert
sunlight into chemical energy. It happens in two stages:
light-dependent reactions in the thylakoid membrane, and
the Calvin cycle in the stroma..."
Rubric:
- Accuracy (0-10): Does it explain the mechanism correctly?
- Completeness (0-10): Does it cover key concepts?
- Clarity (0-10): Is it understandable?
- Depth (0-10): Is it appropriately detailed?
Judge output:
{
  "accuracy": 9,
  "completeness": 8,
  "clarity": 9,
  "depth": 8,
  "overall": 8.5,
  "reasoning": "Response accurately explains both stages. Minor: could mention chlorophyll role."
}
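A pointwise judge like this is typically driven by a small prompt builder; the rubric wording below mirrors the example above and is illustrative, and the resulting string would be sent to whichever chat-completion API you use:

```python
# Rubric dimensions from the pointwise example above (illustrative).
RUBRIC = {
    "accuracy": "Does it explain the mechanism correctly? (0-10)",
    "completeness": "Does it cover key concepts? (0-10)",
    "clarity": "Is it understandable? (0-10)",
    "depth": "Is it appropriately detailed? (0-10)",
}

def build_pointwise_prompt(question: str, response: str) -> str:
    """Assemble a pointwise judging prompt: task, response, rubric,
    and a JSON output instruction the caller can parse."""
    rubric_lines = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    return (
        f"Task: Evaluate this response to {question!r}\n\n"
        f"Response: {response!r}\n\n"
        f"Rubric:\n{rubric_lines}\n\n"
        'Respond with JSON: {"accuracy": N, "completeness": N, '
        '"clarity": N, "depth": N, "overall": N, "reasoning": "..."}'
    )
```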
2. Pairwise Comparison (Relative)
Compare two outputs and decide which is better:
- A: Response from Model 1
- B: Response from Model 2
- Question: Which is better? (A / B / Tie)
Advantage: Easier for judges (binary choice) and robust to judge bias (both A and B get the same bias, so differences stand out). Used by Chatbot Arena.
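One common way to harden pairwise judging is to ask the same question twice with the order swapped and only accept verdicts that survive the swap; a sketch:

```python
# Map a verdict given on swapped inputs back to the original labels.
SWAP = {"A": "B", "B": "A", "Tie": "Tie"}

def debiased_verdict(verdict_ab: str, verdict_ba: str) -> str:
    """verdict_ab: judge's pick with order (A, B).
    verdict_ba: judge's pick with order (B, A), in the swapped labels.
    If the two runs disagree after mapping back, call it a tie rather
    than trust either position-dependent answer."""
    mapped = SWAP[verdict_ba]
    return verdict_ab if verdict_ab == mapped else "Tie"
```

A verdict that flips when the order flips is position noise, not a quality signal, so it collapses to "Tie".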
3. Reference-Based Evaluation
Compare output to a gold standard reference answer:
Question: "What's the capital of France?"
Reference Answer: "Paris"
Model Output: "The capital of France is Paris, the largest
city by population in the country."
Evaluation: Output matches reference (CORRECT) and adds
helpful context. Score: 10/10.
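For short factual answers like this one, a cheap normalized string check can grade most cases before any LLM judge is invoked. This is a sketch; a real grader would also need alias and paraphrase handling:

```python
import re

def contains_reference(output: str, reference: str) -> bool:
    """Check whether the reference answer appears in the model output,
    ignoring case and punctuation. Good enough to pre-screen short
    factual QA; ambiguous cases still go to an LLM judge."""
    def normalize(s: str) -> str:
        return re.sub(r"[^a-z0-9 ]", "", s.lower())
    return normalize(reference) in normalize(output)
```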
4. Reference-Free Evaluation
Judge output quality without a reference. Relies on rubric and judge knowledge:
- No "answer key"
- Judge evaluates based on quality criteria (coherence, factuality, helpfulness)
- Harder to standardize but more realistic for open-ended tasks
5. G-Eval Framework
A structured approach combining form-filling and chain-of-thought:
G-Eval steps:
1. Aspect (what to evaluate): Coherence, Relevance, etc.
2. Evaluation Criteria (rubric): Clear definitions of 1-5 or 1-10
3. Evaluation Steps (chain-of-thought): Judge breaks down reasoning
4. Output Format (JSON): Structured score + explanation
Example:
{
  "aspect": "helpfulness",
  "criteria": {
    "1": "Not helpful at all",
    "3": "Somewhat helpful, missing key info",
    "5": "Directly addresses question"
  },
  "reasoning": "Output answers the core question and provides examples, meeting criteria 5.",
  "score": 5
}
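The G-Eval paper additionally proposes weighting the candidate scores by the judge's token probabilities rather than taking the single sampled score, which yields finer-grained values. A sketch (how you obtain the probabilities from logprobs is API-specific and assumed here):

```python
def expected_score(score_probs: dict[int, float]) -> float:
    """score_probs maps each candidate score (e.g. 1-5) to the judge
    model's probability of emitting that score token. The expected
    value turns a coarse 1-5 scale into a continuous score."""
    total = sum(score_probs.values())  # renormalize in case probs don't sum to 1
    return sum(s * p for s, p in score_probs.items()) / total
```

For example, a judge that puts 0.6 on "4" and 0.4 on "5" yields 4.4 instead of a hard 4 or 5.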
Format Choice: Pairwise is easiest for judges (fewer errors). Pointwise gives richer feedback. Reference-based works for factual tasks. G-Eval is best for detailed analysis with reasoning.
SECTION 03
Prompt Design for Judges
The judge prompt is critical. Poor prompts lead to noisy, biased judgements. Here's how to design good judge prompts:
1. Clear Rubric
Bad rubric: "Rate how good this answer is. 1-5 scale."
Problem: Too vague. What makes something "good"?
Good rubric:
"Rate on CORRECTNESS (1-5):
1 = Factually wrong
2 = Partially correct, significant errors
3 = Mostly correct with minor errors
4 = Correct, complete, well-explained
5 = Correct, comprehensive, excellent clarity"
2. Chain-of-Thought Reasoning
Ask the judge to explain before scoring:
Prompt structure:
"First, analyze the response:
1. Is the core claim accurate?
2. Are there omissions or errors?
3. Is the explanation clear?
Then, assign a score 1-5."
Benefit: Judge's reasoning is visible. If you disagree,
you can see where the judge went wrong.
3. Positive & Negative Examples
Provide exemplars to calibrate the judge:
"Here are examples of scores:
Example 1 (Score 5):
Question: 'How does photosynthesis work?'
Response: 'Photosynthesis converts light energy to chemical
energy via light reactions and Calvin cycle...'
Reason: Accurate, complete, clear.
Example 2 (Score 2):
Question: 'How does photosynthesis work?'
Response: 'Plants make energy from sun.'
Reason: Oversimplified, missing key mechanisms.
Now evaluate this response: ..."
4. JSON Output Schema
Structure the output for programmatic parsing:
"Respond with JSON:
{
  \"analysis\": \"Detailed reasoning\",
  \"score\": 4,
  \"confidence\": 0.85,
  \"strengths\": [\"...\"],
  \"weaknesses\": [\"...\"],
  \"suggestions\": \"How to improve\"
}"
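Even with a schema in the prompt, judges sometimes wrap the JSON in prose or markdown fences, so a defensive parser saves many failed evaluations; a sketch:

```python
import json
import re

def parse_judge_json(text: str):
    """Pull the first {...} span out of a judge reply and parse it.
    Returns None when no valid JSON object is found, so callers can
    retry or route the example to a fallback."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```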
5. Avoid Bias Signals
Don't include information that biases the judge:
- Model names: Don't mention "GPT-4's answer vs GPT-3.5's" → judge has pre-existing bias
- Human feedback: Don't say "this was written by an expert" → judge anchors on authority
- Formatting: Make all outputs similar length/style → avoid length bias
Prompt Template: Always use: clear rubric + exemplars + CoT reasoning + JSON output + bias mitigation. This 5-part formula dramatically improves judge consistency.
SECTION 04
Biases in LLM Judges
LLM judges are not objective. They have systematic biases that skew evaluations:
1. Verbosity Bias
Judges prefer longer responses, even if unnecessary:
- Short correct answer: "The capital is Paris" → Score 7/10
- Long correct answer: "The capital of France is Paris, a city in north-central France..." → Score 9/10
- Both are correct, but verbose answer scores higher
2. Position Bias (A/B Preference)
In pairwise comparisons, the first option sometimes scores higher:
- With order (A, B), the judge picks A 55% of the time
- With the order swapped to (B, A), the judge picks A only 45% of the time
- The 10-point swing reflects position, not quality; a position-neutral judge would give the same verdicts in both orders
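Position bias is easy to quantify: run the same comparisons in both orders and measure how often the judge picks whichever response sits in the first slot. A sketch (ties are ignored here for simplicity):

```python
def first_slot_rate(verdicts_order1: list[str], verdicts_order2: list[str]) -> float:
    """Each verdict is 'A' or 'B', where 'A' is the response shown first
    in that run. Pooled over both presentation orders, a rate near 0.5
    means little position bias; well above 0.5 means the judge favors
    the first slot."""
    picks_first = sum(v == "A" for v in verdicts_order1)
    picks_first += sum(v == "A" for v in verdicts_order2)
    return picks_first / (len(verdicts_order1) + len(verdicts_order2))
```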
3. Self-Enhancement Bias
If the judge is GPT-4, it scores GPT-4's outputs higher than competitors:
- GPT-4 vs Claude: GPT-4 wins 60% (should be 50% if equal)
- Use different models as judges to average out this bias
4. Sycophancy
Judges give higher scores to responses that agree with the judge's trained opinions:
- Question: "Is climate change real?"
- GPT-4 (trained on data with scientific consensus) scores affirmative responses higher
- This is appropriate for factual questions but problematic for subjective ones
5. Coherence Over Correctness
Judges reward well-written but wrong answers:
- Well-written wrong answer: "The Earth revolves around the Sun in a perfect circle..." → Score 7/10
- Correct but poorly-written: "earth revolves sun. not perfect circle, ellipse." → Score 4/10
- Second is more correct, but first scores higher due to writing quality
Mitigation Strategies
- Multiple judges: Use 3-5 judges; average their scores. Different models/biases cancel out.
- Blind evaluations: Hide model identities so judges don't have pre-existing biases.
- Explicit rubrics: Force judges to score on specific criteria, not overall "goodness".
- Human spot-checks: Regularly compare judge scores to human scores; recalibrate if diverging.
- Adversarial examples: Test judges with known weak/strong examples; verify they score correctly.
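The first two mitigations combine naturally: score with several judges, take a robust aggregate, and escalate to humans when the panel disagrees. A minimal sketch, where the disagreement threshold is an assumption to tune per task:

```python
from statistics import median

def aggregate_judges(per_judge_scores: dict[str, float], disagreement: float = 3.0) -> dict:
    """per_judge_scores maps judge name -> score on a shared scale.
    The median resists a single outlier judge; a wide spread flags
    the example for human spot-checking."""
    scores = list(per_judge_scores.values())
    spread = max(scores) - min(scores)
    return {"score": median(scores), "needs_human": spread >= disagreement}
```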
Critical: LLM judges are biased. Never trust a single judge's score. Always use multiple judges and human spot-checks to validate.
SECTION 05
MT-Bench & Chatbot Arena
Two landmark projects demonstrating LLM judging at scale:
MT-Bench (Multi-Turn Benchmark)
A benchmark of multi-turn conversations with GPT-4 as judge. Evaluates models' ability to handle follow-up questions and maintain context.
- Dataset: 80 curated multi-turn questions spanning domains (writing, math, coding, roleplay)
- Judge: GPT-4 with detailed rubric
- Metric: Pairwise comparison (which model's response is better?)
- Results: Strong correlation with human evaluation (>0.85)
Chatbot Arena (LMSYS)
A crowdsourced competition where users vote on which model gives better responses; GPT-4 judges are also used as a supplementary ranking signal.
- Scale: 100k+ crowdsourced comparisons; GPT-4 judges thousands
- Models: GPT-4, Claude, Llama, Mistral, etc.
- Judge format: Pairwise comparison (blind: users don't see model names)
- Ranking: Elo rating system (like chess rankings). Stronger models gain points.
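The Elo update behind this kind of ranking takes only a few lines: after each comparison, the winner takes rating points from the loser in proportion to how surprising the result was (K = 32 is a common choice, assumed here):

```python
def elo_update(rating_a: float, rating_b: float, outcome: float, k: float = 32.0):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    Beating a much stronger opponent moves ratings more than
    beating a weaker one; total rating points are conserved."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (outcome - expected_a)
    return rating_a + delta, rating_b - delta
```

Two equally rated models (1000 each) end at 1016 and 984 after a single decisive comparison.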
Findings
- GPT-4 and Claude are consistently top-ranked
- Open-source models (Llama, Mistral) are catching up
- Pairwise judge scores correlate well with human votes (r > 0.8)
- Domain variation is huge: a model strong in coding may be weak in creative writing
MT-Bench leaderboard (simplified):
Model | GPT-4 Judge Score | Human Eval
GPT-4-Turbo | 9.1 | 9.0
Claude-3-Opus | 8.9 | 8.8
Llama-2-70B | 7.1 | 7.3
Mistral-Large | 7.5 | 7.4
GPT-3.5-Turbo | 6.9 | 6.8
Judge scores align well with human evaluation.
Lesson from Arena: Pairwise evaluation (A vs B) is robust. Blind evaluation (hide model names) improves consistency. Crowdsourcing + judge validation together provide confidence in rankings.
SECTION 06
Building a Custom Judge
Here's a practical example of building a judge for a specific task (code generation):
# Custom Judge for Code Generation
import json

import anthropic

JUDGE_PROMPT = """You are an expert code reviewer. Evaluate this code solution on:
1. CORRECTNESS (1-5): Does it solve the problem?
2. EFFICIENCY (1-5): Is it optimized?
3. READABILITY (1-5): Clear variable names, comments?
4. SAFETY (1-5): Error handling, edge cases?
Respond with JSON:
{
  "correctness": N,
  "efficiency": N,
  "readability": N,
  "safety": N,
  "reasoning": "...",
  "overall": N
}"""

def judge_code(problem: str, code: str) -> dict:
    """Judge a code solution."""
    client = anthropic.Anthropic()
    prompt = f"""Problem: {problem}
Solution:
{code}
{JUDGE_PROMPT}"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"error": "Failed to parse judge response"}

def batch_evaluate(problems_codes: list[tuple]) -> list[dict]:
    """Evaluate multiple solutions."""
    results = []
    for problem, code in problems_codes:
        judgment = judge_code(problem, code)
        results.append({
            "problem": problem[:50] + "...",
            "code": code[:50] + "...",
            "judgment": judgment,
        })
    return results

# Example
problems = [
    ("Write a function to check if a number is prime",
     "def is_prime(n):\n    for i in range(2, int(n**0.5) + 1):\n        if n % i == 0:\n            return False\n    return n > 1"),
    ("Reverse a list in-place",
     "def reverse(arr):\n    arr = arr[::-1]"),  # Bug: doesn't modify in-place
]
results = batch_evaluate(problems)
for result in results:
    print(f"Problem: {result['problem']}")
    print(f"Overall: {result['judgment'].get('overall', 'N/A')}")
    print(f"Reasoning: {result['judgment'].get('reasoning', '')}")
    print()
Building Blocks
- Judge Model: GPT-4 or Claude (stronger = better quality)
- Rubric: Clear criteria (correctness, efficiency, etc.)
- Prompt: Task description + rubric + output format
- Error Handling: Fallback if judge fails to parse or respond
- Batch Evaluation: Process many examples; track statistics
Optimization Tips
- Use cheaper models for coarse filtering: GPT-3.5 for first pass, GPT-4 only for borderline cases
- Cache judge prompts: Reuse rubrics across evaluations
- Parallelize: Run multiple judge calls simultaneously
- Constrain output: Force JSON format to avoid parsing errors
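Judge calls are I/O-bound API requests, so the parallelization tip above usually means a thread pool; a sketch that works with any judge function:

```python
from concurrent.futures import ThreadPoolExecutor

def judge_parallel(judge_fn, items, max_workers: int = 8) -> list:
    """Apply judge_fn to every item concurrently, preserving input
    order. Keep max_workers below the API provider's rate limit."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_fn, items))
```

For example, `judge_parallel(lambda pc: judge_code(*pc), problems)` would fan out the batch from the code above.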
Template Pattern: 1) Define rubric, 2) Write prompt, 3) Create judge function, 4) Batch evaluate, 5) Analyze results. This pattern is reusable across domains.
SECTION 07
Judge Calibration
Before deploying a judge, verify it aligns with human judgement. Calibration is crucial for confidence.
Validation Process
1. Collect Human Labels
Sample 50-100 examples. Have humans (2-3 per example) score or compare. This is your ground truth.
2. Run Judge on Same Examples
Get judge scores for the same 50-100 examples.
3. Compute Correlation
from scipy.stats import spearmanr, pearsonr

human_scores = [8, 7, 9, 6, 5, 8, 9, 7, 8, 6]  # Human average per example
judge_scores = [8, 6, 9, 6, 4, 9, 8, 7, 8, 5]  # Judge score per example

# Correlation
pearson_r, pearson_p = pearsonr(human_scores, judge_scores)
spearman_r, spearman_p = spearmanr(human_scores, judge_scores)
print(f"Pearson r: {pearson_r:.3f}")    # Linear correlation
print(f"Spearman r: {spearman_r:.3f}")  # Rank correlation

# Interpretation:
# r > 0.8: Strong agreement
# r = 0.6-0.8: Good agreement
# r < 0.6: Weak agreement (don't use judge)
4. Analyze Disagreements
Where does the judge disagree with humans?
- Off-by-one errors: Judge scores 7 when human says 8. Acceptable; both in same ballpark.
- Systematic bias: Judge always scores 2 points lower. Recalibrate (adjust judge prompt or use different model).
- Outliers: Judge scores 3 when human says 9. Investigate: Is this a human error or judge failure?
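A quick check for the systematic-bias case is the mean (human - judge) difference over the validation set: a stable nonzero offset means the judge is uniformly harsh or lenient and can be corrected for. A sketch:

```python
from statistics import mean

def calibration_offset(human_scores, judge_scores) -> float:
    """Mean (human - judge) difference. Near 0: no systematic bias.
    Consistently positive: the judge is harsher than humans;
    consistently negative: more lenient."""
    return mean(h - j for h, j in zip(human_scores, judge_scores))
```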
5. Calibration Plot
Visualize judge vs human agreement:
Plot: X-axis = Human score, Y-axis = Judge score
Points along diagonal y=x: Perfect agreement
Points above diagonal: Judge over-scores
Points below diagonal: Judge under-scores
Ideally: Tight cluster around diagonal
Problem pattern: Judge consistently above/below diagonal
Fix: Retune judge prompt or choose different judge model
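The same diagonal check can be computed numerically before plotting anything; counting points above and below the diagonal (within a tolerance, assumed 0.5 here) surfaces the over/under-scoring pattern directly:

```python
def calibration_summary(human_scores, judge_scores, tol: float = 0.5) -> dict:
    """Count examples where the judge over-scores, under-scores, or
    agrees with humans (within tol). A lopsided over/under count is
    the 'consistently above/below diagonal' problem pattern."""
    over = sum(1 for h, j in zip(human_scores, judge_scores) if j - h > tol)
    under = sum(1 for h, j in zip(human_scores, judge_scores) if h - j > tol)
    return {"over": over, "under": under, "agree": len(human_scores) - over - under}
```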
When to Trust vs Distrust LLM Judges
Scenario | Correlation | Decision
Factual QA (single right answer) | > 0.85 | Use judge confidently; spot-check 5%
Creative writing, subjective tasks | 0.70-0.80 | Use judge with caution; validate 10-20%
Technical evaluation (code, math) | > 0.80 | Use judge; requires good rubric
Safety/toxicity detection | 0.75-0.85 | Use for screening; humans validate flagged
Any task | < 0.65 | Don't use; task too subjective or rubric poor
Continuous Monitoring
Even after deployment, monitor judge performance:
- Quarterly: Re-validate on fresh human labels. Has correlation drifted?
- Weekly: Check for outliers. Are there patterns in judge failures?
- Per-update: If you update the judge prompt or model, re-calibrate
Golden Rule: Never deploy a judge without calibration. Invest in 50-100 human labels. The cost (100 × $5 = $500) is tiny compared to the cost of wrong decisions based on uncalibrated judges (wrong model selections, missed safety issues).