SECTION 01
The Evaluation Bottleneck
Evaluating LLM quality is hard. Manual human annotation is the gold standard but is expensive and slow: hiring annotators, defining rubrics, ensuring consistency across thousands of examples. For most labs, human evaluation costs $5-50 per example depending on task complexity.
Scale: If you evaluate 10k model outputs annually (a typical baseline for a new model), manual evaluation costs $50k-500k. With budget constraints, you can only evaluate 1-5% of your outputs, missing important failure modes.
The Solution: LLM-as-Judge
Use a strong, reliable LLM (GPT-4, Claude) to automatically evaluate outputs from other models (including weaker versions of itself). A single GPT-4 API call costs ~$0.001-0.01 per evaluation, roughly 1000x cheaper than human annotation.
Why it Works
- Pattern recognition: LLMs are trained on high-quality text; they learn what "good" looks like
- Consistency: LLM judges are near-deterministic at temperature 0 (though most APIs don't guarantee exact reproducibility)
- Multi-dimensional: Can evaluate along multiple criteria (accuracy, safety, helpfulness, reasoning)
- Explanation: Judges can provide chain-of-thought reasoning for their scores
Limitations
- Not perfect: LLM judges correlate with human judgement ~80-90%, not 100%
- Biased: Judges have their own biases (preference for longer answers, certain writing styles)
- Domain-specific: GPT-4 is a generalist; domain experts outperform it in specialized fields
- Cost still adds up: Evaluating 100k outputs with GPT-4 still costs $1000+
Key Insight: LLM judges are not replacements for human evaluation; they're augmenters. Use judges to screen 100% of outputs (cheap), then use humans to spot-check controversial cases (~5%) for ground truth.
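The screen-everything, spot-check-some workflow can be sketched in a few lines. The thresholds and sample rate below are illustrative assumptions, not prescriptions:

```python
import random

def select_for_human_review(scores, ambiguous=(4.0, 8.0), sample_rate=0.05, seed=0):
    """Return indices of judge-scored outputs to route to human reviewers.

    Everything the judge scored inside the ambiguous band goes to humans;
    a small random sample of the clear-cut cases is added so the judge
    itself stays auditable.
    """
    rng = random.Random(seed)
    flagged = {i for i, s in enumerate(scores) if ambiguous[0] <= s <= ambiguous[1]}
    clear = [i for i in range(len(scores)) if i not in flagged]
    if clear:
        k = max(1, int(len(clear) * sample_rate))
        flagged.update(rng.sample(clear, k))
    return sorted(flagged)
```

With this split, humans see every borderline case plus a trickle of "easy" ones that catch judge drift.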
SECTION 02
Judging Formats
There are several ways to structure LLM judgement. Choice depends on your task:
1. Pointwise Scoring (Absolute)
Rate a single output on a rubric, independent of other outputs:
Task: Evaluate this response to "How does photosynthesis work?"
Response: "Photosynthesis is a process where plants convert
sunlight into chemical energy. It happens in two stages:
light-dependent reactions in the thylakoid membrane, and
the Calvin cycle in the stroma..."
Rubric:
- Accuracy (0-10): Does it explain the mechanism correctly?
- Completeness (0-10): Does it cover key concepts?
- Clarity (0-10): Is it understandable?
- Depth (0-10): Is it appropriately detailed?
Judge output:
{
  "accuracy": 9,
  "completeness": 8,
  "clarity": 9,
  "depth": 8,
  "overall": 8.5,
  "reasoning": "Response accurately explains both stages. Minor: could mention chlorophyll role."
}
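A pointwise judge like this is typically driven by a small prompt builder; the rubric wording below mirrors the example above and is illustrative, and the resulting string would be sent to whichever chat-completion API you use:

```python
# Rubric dimensions from the pointwise example above (illustrative).
RUBRIC = {
    "accuracy": "Does it explain the mechanism correctly? (0-10)",
    "completeness": "Does it cover key concepts? (0-10)",
    "clarity": "Is it understandable? (0-10)",
    "depth": "Is it appropriately detailed? (0-10)",
}

def build_pointwise_prompt(question: str, response: str) -> str:
    """Assemble a pointwise judging prompt: task, response, rubric,
    and a JSON output instruction the caller can parse."""
    rubric_lines = "\n".join(f"- {name}: {desc}" for name, desc in RUBRIC.items())
    return (
        f"Task: Evaluate this response to {question!r}\n\n"
        f"Response: {response!r}\n\n"
        f"Rubric:\n{rubric_lines}\n\n"
        'Respond with JSON: {"accuracy": N, "completeness": N, '
        '"clarity": N, "depth": N, "overall": N, "reasoning": "..."}'
    )
```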
2. Pairwise Comparison (Relative)
Compare two outputs and decide which is better:
- A: Response from Model 1
- B: Response from Model 2
- Question: Which is better? (A / B / Tie)
Advantage: Easier for judges (binary choice) and robust to judge bias (both A and B get the same bias, so differences stand out). Used by Chatbot Arena.
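One common way to harden pairwise judging is to ask the same question twice with the order swapped and only accept verdicts that survive the swap; a sketch:

```python
# Map a verdict given on swapped inputs back to the original labels.
SWAP = {"A": "B", "B": "A", "Tie": "Tie"}

def debiased_verdict(verdict_ab: str, verdict_ba: str) -> str:
    """verdict_ab: judge's pick with order (A, B).
    verdict_ba: judge's pick with order (B, A), in the swapped labels.
    If the two runs disagree after mapping back, call it a tie rather
    than trust either position-dependent answer."""
    mapped = SWAP[verdict_ba]
    return verdict_ab if verdict_ab == mapped else "Tie"
```

A verdict that flips when the order flips is position noise, not a quality signal, so it collapses to "Tie".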
3. Reference-Based Evaluation
Compare output to a gold standard reference answer:
Question: "What's the capital of France?"
Reference Answer: "Paris"
Model Output: "The capital of France is Paris, the largest
city by population in the country."
Evaluation: Output matches reference (CORRECT) and adds
helpful context. Score: 10/10.
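For short factual answers like this one, a cheap normalized string check can grade most cases before any LLM judge is invoked. This is a sketch; a real grader would also need alias and paraphrase handling:

```python
import re

def contains_reference(output: str, reference: str) -> bool:
    """Check whether the reference answer appears in the model output,
    ignoring case and punctuation. Good enough to pre-screen short
    factual QA; ambiguous cases still go to an LLM judge."""
    def normalize(s: str) -> str:
        return re.sub(r"[^a-z0-9 ]", "", s.lower())
    return normalize(reference) in normalize(output)
```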
4. Reference-Free Evaluation
Judge output quality without a reference. Relies on rubric and judge knowledge:
- No "answer key"
- Judge evaluates based on quality criteria (coherence, factuality, helpfulness)
- Harder to standardize but more realistic for open-ended tasks
5. G-Eval Framework
A structured approach combining form-filling and chain-of-thought:
G-Eval steps:
1. Aspect (what to evaluate): Coherence, Relevance, etc.
2. Evaluation Criteria (rubric): Clear definitions of 1-5 or 1-10
3. Evaluation Steps (chain-of-thought): Judge breaks down reasoning
4. Output Format (JSON): Structured score + explanation
Example:
{
  "aspect": "helpfulness",
  "criteria": {
    "1": "Not helpful at all",
    "3": "Somewhat helpful, missing key info",
    "5": "Directly addresses question"
  },
  "reasoning": "Output answers the core question and provides examples, meeting criteria 5.",
  "score": 5
}
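The G-Eval paper additionally proposes weighting the candidate scores by the judge's token probabilities rather than taking the single sampled score, which yields finer-grained values. A sketch (how you obtain the probabilities from logprobs is API-specific and assumed here):

```python
def expected_score(score_probs: dict[int, float]) -> float:
    """score_probs maps each candidate score (e.g. 1-5) to the judge
    model's probability of emitting that score token. The expected
    value turns a coarse 1-5 scale into a continuous score."""
    total = sum(score_probs.values())  # renormalize in case probs don't sum to 1
    return sum(s * p for s, p in score_probs.items()) / total
```

For example, a judge that puts 0.6 on "4" and 0.4 on "5" yields 4.4 instead of a hard 4 or 5.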
Format Choice: Pairwise is easiest for judges (fewer errors). Pointwise gives richer feedback. Reference-based works for factual tasks. G-Eval is best for detailed analysis with reasoning.
SECTION 03
Prompt Design for Judges
The judge prompt is critical. Poor prompts lead to noisy, biased judgements. Here's how to design good judge prompts:
1. Clear Rubric
Bad rubric: "Rate how good this answer is. 1-5 scale."
Problem: Too vague. What makes something "good"?
Good rubric:
"Rate on CORRECTNESS (1-5):
1 = Factually wrong
2 = Partially correct, significant errors
3 = Mostly correct with minor errors
4 = Correct, complete, well-explained
5 = Correct, comprehensive, excellent clarity"
2. Chain-of-Thought Reasoning
Ask the judge to explain before scoring:
Prompt structure:
"First, analyze the response:
1. Is the core claim accurate?
2. Are there omissions or errors?
3. Is the explanation clear?
Then, assign a score 1-5."
Benefit: Judge's reasoning is visible. If you disagree,
you can see where the judge went wrong.
3. Positive & Negative Examples
Provide exemplars to calibrate the judge:
"Here are examples of scores:
Example 1 (Score 5):
Question: 'How does photosynthesis work?'
Response: 'Photosynthesis converts light energy to chemical
energy via light reactions and Calvin cycle...'
Reason: Accurate, complete, clear.
Example 2 (Score 2):
Question: 'How does photosynthesis work?'
Response: 'Plants make energy from sun.'
Reason: Oversimplified, missing key mechanisms.
Now evaluate this response: ..."
4. JSON Output Schema
Structure the output for programmatic parsing:
"Respond with JSON:
{
  \"analysis\": \"Detailed reasoning\",
  \"score\": 4,
  \"confidence\": 0.85,
  \"strengths\": [\"...\"],
  \"weaknesses\": [\"...\"],
  \"suggestions\": \"How to improve\"
}"
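Even with a schema in the prompt, judges sometimes wrap the JSON in prose or markdown fences, so a defensive parser saves many failed evaluations; a sketch:

```python
import json
import re

def parse_judge_json(text: str):
    """Pull the first {...} span out of a judge reply and parse it.
    Returns None when no valid JSON object is found, so callers can
    retry or route the example to a fallback."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```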
5. Avoid Bias Signals
Don't include information that biases the judge:
- Model names: Don't mention "GPT-4's answer vs GPT-3.5's" → judge has pre-existing bias
- Human feedback: Don't say "this was written by an expert" → judge anchors on authority
- Formatting: Make all outputs similar length/style → avoid length bias
Prompt Template: Always use: clear rubric + exemplars + CoT reasoning + JSON output + bias mitigation. This 5-part formula dramatically improves judge consistency.
SECTION 04
Biases in LLM Judges
LLM judges are not objective. They have systematic biases that skew evaluations:
1. Verbosity Bias
Judges prefer longer responses, even if unnecessary:
- Short correct answer: "The capital is Paris" → Score 7/10
- Long correct answer: "The capital of France is Paris, a city in north-central France..." → Score 9/10
- Both are correct, but verbose answer scores higher
2. Position Bias (A/B Preference)
In pairwise comparisons, the first option sometimes scores higher:
- With order (A, B), the judge picks A 55% of the time
- With the order swapped to (B, A), the judge picks A only 45% of the time
- The 10-point swing reflects position, not quality; a position-neutral judge would give the same verdicts in both orders
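Position bias is easy to quantify: run the same comparisons in both orders and measure how often the judge picks whichever response sits in the first slot. A sketch (ties are ignored here for simplicity):

```python
def first_slot_rate(verdicts_order1: list[str], verdicts_order2: list[str]) -> float:
    """Each verdict is 'A' or 'B', where 'A' is the response shown first
    in that run. Pooled over both presentation orders, a rate near 0.5
    means little position bias; well above 0.5 means the judge favors
    the first slot."""
    picks_first = sum(v == "A" for v in verdicts_order1)
    picks_first += sum(v == "A" for v in verdicts_order2)
    return picks_first / (len(verdicts_order1) + len(verdicts_order2))
```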
3. Self-Enhancement Bias
If the judge is GPT-4, it scores GPT-4's outputs higher than competitors:
- GPT-4 vs Claude: GPT-4 wins 60% (should be 50% if equal)
- Use different models as judges to average out this bias
4. Sycophancy
Judges give higher scores to responses that agree with the judge's trained opinions:
- Question: "Is climate change real?"
- GPT-4 (trained on data with scientific consensus) scores affirmative responses higher
- This is appropriate for factual questions but problematic for subjective ones
5. Coherence Over Correctness
Judges reward well-written but wrong answers:
- Well-written wrong answer: "The Earth revolves around the Sun in a perfect circle..." → Score 7/10
- Correct but poorly-written: "earth revolves sun. not perfect circle, ellipse." → Score 4/10
- Second is more correct, but first scores higher due to writing quality
Mitigation Strategies
- Multiple judges: Use 3-5 judges; average their scores. Different models/biases cancel out.
- Blind evaluations: Hide model identities so judges don't have pre-existing biases.
- Explicit rubrics: Force judges to score on specific criteria, not overall "goodness".
- Human spot-checks: Regularly compare judge scores to human scores; recalibrate if diverging.
- Adversarial examples: Test judges with known weak/strong examples; verify they score correctly.
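The first two mitigations combine naturally: score with several judges, take a robust aggregate, and escalate to humans when the panel disagrees. A minimal sketch, where the disagreement threshold is an assumption to tune per task:

```python
from statistics import median

def aggregate_judges(per_judge_scores: dict[str, float], disagreement: float = 3.0) -> dict:
    """per_judge_scores maps judge name -> score on a shared scale.
    The median resists a single outlier judge; a wide spread flags
    the example for human spot-checking."""
    scores = list(per_judge_scores.values())
    spread = max(scores) - min(scores)
    return {"score": median(scores), "needs_human": spread >= disagreement}
```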
Critical: LLM judges are biased. Never trust a single judge's score. Always use multiple judges and human spot-checks to validate.
SECTION 05
MT-Bench & Chatbot Arena
Two landmark projects demonstrating LLM judging at scale:
MT-Bench (Multi-Turn Benchmark)
A benchmark of multi-turn conversations with GPT-4 as judge. Evaluates models' ability to handle follow-up questions and maintain context.
- Dataset: 80 curated multi-turn questions spanning domains (writing, math, coding, roleplay)
- Judge: GPT-4 with detailed rubric
- Metric: Pairwise comparison (which model's response is better?)
- Results: Strong correlation with human evaluation (>0.85)
Chatbot Arena (LMSYS)
A crowdsourced competition where users vote on which model gives better responses; GPT-4 judges are also used as a supplementary ranking signal.
- Scale: 100k+ crowdsourced comparisons; GPT-4 judges thousands
- Models: GPT-4, Claude, Llama, Mistral, etc.
- Judge format: Pairwise comparison (blind: users don't see model names)
- Ranking: Elo rating system (like chess rankings). Stronger models gain points.
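The Elo update behind this kind of ranking takes only a few lines: after each comparison, the winner takes rating points from the loser in proportion to how surprising the result was (K = 32 is a common choice, assumed here):

```python
def elo_update(rating_a: float, rating_b: float, outcome: float, k: float = 32.0):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    Beating a much stronger opponent moves ratings more than
    beating a weaker one; total rating points are conserved."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    delta = k * (outcome - expected_a)
    return rating_a + delta, rating_b - delta
```

Two equally rated models (1000 each) end at 1016 and 984 after a single decisive comparison.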
Findings
- GPT-4 and Claude are consistently top-ranked
- Open-source models (Llama, Mistral) are catching up
- Pairwise judge scores correlate well with human votes (r > 0.8)
- Domain variation is huge: a model strong in coding may be weak in creative writing
MT-Bench leaderboard (simplified):
Model | GPT-4 Judge Score | Human Eval
GPT-4-Turbo | 9.1 | 9.0
Claude-3-Opus | 8.9 | 8.8
Llama-2-70B | 7.1 | 7.3
Mistral-Large | 7.5 | 7.4
GPT-3.5-Turbo | 6.9 | 6.8
Judge scores align well with human evaluation.
Lesson from Arena: Pairwise evaluation (A vs B) is robust. Blind evaluation (hide model names) improves consistency. Crowdsourcing + judge validation together provide confidence in rankings.
SECTION 06
Building a Custom Judge
Here's a practical example of building a judge for a specific task (code generation):
# Custom Judge for Code Generation
import json

import anthropic

JUDGE_PROMPT = """You are an expert code reviewer. Evaluate this code solution on:
1. CORRECTNESS (1-5): Does it solve the problem?
2. EFFICIENCY (1-5): Is it optimized?
3. READABILITY (1-5): Clear variable names, comments?
4. SAFETY (1-5): Error handling, edge cases?
Respond with JSON:
{
  "correctness": N,
  "efficiency": N,
  "readability": N,
  "safety": N,
  "reasoning": "...",
  "overall": N
}"""

def judge_code(problem: str, code: str) -> dict:
    """Judge a code solution."""
    client = anthropic.Anthropic()
    prompt = f"""Problem: {problem}
Solution:
{code}
{JUDGE_PROMPT}"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"error": "Failed to parse judge response"}

def batch_evaluate(problems_codes: list[tuple]) -> list[dict]:
    """Evaluate multiple solutions."""
    results = []
    for problem, code in problems_codes:
        judgment = judge_code(problem, code)
        results.append({
            "problem": problem[:50] + "...",
            "code": code[:50] + "...",
            "judgment": judgment,
        })
    return results

# Example
problems = [
    ("Write a function to check if a number is prime",
     "def is_prime(n):\n    for i in range(2, int(n**0.5) + 1):\n        if n % i == 0:\n            return False\n    return n > 1"),
    ("Reverse a list in-place",
     "def reverse(arr):\n    arr = arr[::-1]"),  # Bug: doesn't modify in-place
]
results = batch_evaluate(problems)
for result in results:
    print(f"Problem: {result['problem']}")
    print(f"Overall: {result['judgment'].get('overall', 'N/A')}")
    print(f"Reasoning: {result['judgment'].get('reasoning', '')}")
    print()
Building Blocks
- Judge Model: GPT-4 or Claude (stronger = better quality)
- Rubric: Clear criteria (correctness, efficiency, etc.)
- Prompt: Task description + rubric + output format
- Error Handling: Fallback if judge fails to parse or respond
- Batch Evaluation: Process many examples; track statistics
Optimization Tips
- Use cheaper models for coarse filtering: GPT-3.5 for first pass, GPT-4 only for borderline cases
- Cache judge prompts: Reuse rubrics across evaluations
- Parallelize: Run multiple judge calls simultaneously
- Constrain output: Force JSON format to avoid parsing errors
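Judge calls are I/O-bound API requests, so the parallelization tip above usually means a thread pool; a sketch that works with any judge function:

```python
from concurrent.futures import ThreadPoolExecutor

def judge_parallel(judge_fn, items, max_workers: int = 8) -> list:
    """Apply judge_fn to every item concurrently, preserving input
    order. Keep max_workers below the API provider's rate limit."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(judge_fn, items))
```

For example, `judge_parallel(lambda pc: judge_code(*pc), problems)` would fan out the batch from the code above.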
Template Pattern: 1) Define rubric, 2) Write prompt, 3) Create judge function, 4) Batch evaluate, 5) Analyze results. This pattern is reusable across domains.
SECTION 07
Judge Calibration
Before deploying a judge, verify it aligns with human judgement. Calibration is crucial for confidence.
Validation Process
1. Collect Human Labels
Sample 50-100 examples. Have humans (2-3 per example) score or compare. This is your ground truth.
2. Run Judge on Same Examples
Get judge scores for the same 50-100 examples.
3. Compute Correlation
from scipy.stats import spearmanr, pearsonr

human_scores = [8, 7, 9, 6, 5, 8, 9, 7, 8, 6]  # Human average per example
judge_scores = [8, 6, 9, 6, 4, 9, 8, 7, 8, 5]  # Judge score per example

# Correlation
pearson_r, pearson_p = pearsonr(human_scores, judge_scores)
spearman_r, spearman_p = spearmanr(human_scores, judge_scores)
print(f"Pearson r: {pearson_r:.3f}")    # Linear correlation
print(f"Spearman r: {spearman_r:.3f}")  # Rank correlation

# Interpretation:
# r > 0.8: Strong agreement
# r = 0.6-0.8: Good agreement
# r < 0.6: Weak agreement (don't use judge)
4. Analyze Disagreements
Where does the judge disagree with humans?
- Off-by-one errors: Judge scores 7 when human says 8. Acceptable; both in same ballpark.
- Systematic bias: Judge always scores 2 points lower. Recalibrate (adjust judge prompt or use different model).
- Outliers: Judge scores 3 when human says 9. Investigate: Is this a human error or judge failure?
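A quick check for the systematic-bias case is the mean (human - judge) difference over the validation set: a stable nonzero offset means the judge is uniformly harsh or lenient and can be corrected for. A sketch:

```python
from statistics import mean

def calibration_offset(human_scores, judge_scores) -> float:
    """Mean (human - judge) difference. Near 0: no systematic bias.
    Consistently positive: the judge is harsher than humans;
    consistently negative: more lenient."""
    return mean(h - j for h, j in zip(human_scores, judge_scores))
```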
5. Calibration Plot
Visualize judge vs human agreement:
Plot: X-axis = Human score, Y-axis = Judge score
Points along diagonal y=x: Perfect agreement
Points above diagonal: Judge over-scores
Points below diagonal: Judge under-scores
Ideally: Tight cluster around diagonal
Problem pattern: Judge consistently above/below diagonal
Fix: Retune judge prompt or choose different judge model
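The same diagonal check can be computed numerically before plotting anything; counting points above and below the diagonal (within a tolerance, assumed 0.5 here) surfaces the over/under-scoring pattern directly:

```python
def calibration_summary(human_scores, judge_scores, tol: float = 0.5) -> dict:
    """Count examples where the judge over-scores, under-scores, or
    agrees with humans (within tol). A lopsided over/under count is
    the 'consistently above/below diagonal' problem pattern."""
    over = sum(1 for h, j in zip(human_scores, judge_scores) if j - h > tol)
    under = sum(1 for h, j in zip(human_scores, judge_scores) if h - j > tol)
    return {"over": over, "under": under, "agree": len(human_scores) - over - under}
```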
When to Trust vs Distrust LLM Judges
Scenario | Correlation | Decision
Factual QA (single right answer) | > 0.85 | Use judge confidently; spot-check 5%
Creative writing, subjective tasks | 0.70-0.80 | Use judge with caution; validate 10-20%
Technical evaluation (code, math) | > 0.80 | Use judge; requires good rubric
Safety/toxicity detection | 0.75-0.85 | Use for screening; humans validate flagged
Any task | < 0.65 | Don't use; task too subjective or rubric poor
Continuous Monitoring
Even after deployment, monitor judge performance:
- Quarterly: Re-validate on fresh human labels. Has correlation drifted?
- Weekly: Check for outliers. Are there patterns in judge failures?
- Per-update: If you update the judge prompt or model, re-calibrate
Golden Rule: Never deploy a judge without calibration. Invest in 50-100 human labels. The cost (100 × $5 = $500) is tiny compared to the cost of wrong decisions based on uncalibrated judges (wrong model selections, missed safety issues).