Safety & Evaluation

LLM-as-Judge

Using language models to automatically evaluate and score other model outputs, enabling scalable evaluation without human annotation.

Key Advantage: Scalable
Common Judge: GPT-4
Human Correlation: >80%


SECTION 01

The Evaluation Bottleneck

Evaluating LLM quality is hard. Manual human annotation is the gold standard but is expensive and slow: hiring annotators, defining rubrics, ensuring consistency across thousands of examples. For most labs, human evaluation costs $5-50 per example depending on task complexity.

Scale: If you evaluate 10k model outputs annually (a typical baseline for a new model), manual evaluation costs $50k-500k. With budget constraints, you can only evaluate 1-5% of your outputs, missing important failure modes.

The Solution: LLM-as-Judge

Use a strong, reliable LLM (GPT-4, Claude) to automatically evaluate outputs from other models (including weaker versions of itself). A single GPT-4 API call costs ~$0.001-0.01 per evaluation, roughly 1000x cheaper than human annotation.

Why it Works

Limitations

Key Insight: LLM judges are not replacements for human evaluation; they complement it. Use judges to screen 100% of outputs (cheap), then use humans to spot-check contentious or borderline cases (~5%) for ground truth.
SECTION 02

Judging Formats

There are several ways to structure LLM judgement; the right choice depends on your task:

1. Pointwise Scoring (Absolute)

Rate a single output on a rubric, independent of other outputs:

Task: Evaluate this response to "How does photosynthesis work?"

Response: "Photosynthesis is a process where plants convert sunlight into chemical energy. It happens in two stages: light-dependent reactions in the thylakoid membrane, and the Calvin cycle in the stroma..."

Rubric:
- Accuracy (0-10): Does it explain the mechanism correctly?
- Completeness (0-10): Does it cover key concepts?
- Clarity (0-10): Is it understandable?
- Depth (0-10): Is it appropriately detailed?

Judge output:
{
  "accuracy": 9,
  "completeness": 8,
  "clarity": 9,
  "depth": 8,
  "overall": 8.5,
  "reasoning": "Response accurately explains both stages. Minor: could mention chlorophyll role."
}

2. Pairwise Comparison (Relative)

Compare two outputs and decide which is better:

Advantage: Easier for judges (binary choice) and robust to judge bias (both A and B get the same bias, so differences stand out). Used by Chatbot Arena.
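A pairwise setup can be sketched in Python as follows; the prompt wording and the `parse_verdict` fallback are illustrative choices, not a fixed recipe:

```python
def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise comparison prompt for an LLM judge."""
    return (
        f"Question: {question}\n\n"
        f"Response A: {answer_a}\n\n"
        f"Response B: {answer_b}\n\n"
        "Which response answers the question better? "
        "Reply with exactly one letter: A or B."
    )

def parse_verdict(judge_reply: str) -> str:
    """Extract the judge's verdict, defaulting to a tie on unclear output."""
    reply = judge_reply.strip().upper()
    if reply.startswith("A"):
        return "A"
    if reply.startswith("B"):
        return "B"
    return "TIE"
```

Treating unparseable replies as ties, rather than guessing, keeps noisy judge outputs from silently skewing win rates.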

3. Reference-Based Evaluation

Compare output to a gold standard reference answer:

Question: "What's the capital of France?"
Reference Answer: "Paris"
Model Output: "The capital of France is Paris, the largest city by population in the country."
Evaluation: Output matches reference (CORRECT) and adds helpful context. Score: 10/10.

4. Reference-Free Evaluation

Judge output quality without a reference. Relies on rubric and judge knowledge:
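A minimal reference-free prompt might look like the following sketch; the rubric text and function name here are hypothetical, not a standard:

```python
# Reference-free judging: the judge sees only the question, the output,
# and a rubric -- no gold answer is available.
RUBRIC = """Rate the response on HELPFULNESS (1-5):
1 = Does not address the question
3 = Partially addresses the question
5 = Fully and clearly addresses the question"""

def build_reference_free_prompt(question: str, output: str) -> str:
    """Build a judge prompt from question, output, and rubric alone."""
    return f"Question: {question}\n\nResponse: {output}\n\n{RUBRIC}\n\nScore (1-5):"
```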

5. G-Eval Framework

A structured approach combining form-filling and chain-of-thought:

G-Eval steps:
1. Aspect (what to evaluate): Coherence, Relevance, etc.
2. Evaluation Criteria (rubric): Clear definitions for each point on a 1-5 or 1-10 scale
3. Evaluation Steps (chain-of-thought): Judge breaks down its reasoning
4. Output Format (JSON): Structured score + explanation

Example:
{
  "aspect": "helpfulness",
  "criteria": {
    "1": "Not helpful at all",
    "3": "Somewhat helpful, missing key info",
    "5": "Directly addresses question"
  },
  "reasoning": "Output answers the core question and provides examples, meeting criteria 5.",
  "score": 5
}
Format Choice: Pairwise is easiest for judges (fewer errors). Pointwise gives richer feedback. Reference-based works for factual tasks. G-Eval is best for detailed analysis with reasoning.
SECTION 03

Prompt Design for Judges

The judge prompt is critical. Poor prompts lead to noisy, biased judgements. Here's how to design good judge prompts:

1. Clear Rubric

Bad rubric: "Rate how good this answer is. 1-5 scale."
Problem: Too vague. What makes something "good"?

Good rubric: "Rate on CORRECTNESS (1-5):
1 = Factually wrong
2 = Partially correct, significant errors
3 = Mostly correct with minor errors
4 = Correct, complete, well-explained
5 = Correct, comprehensive, excellent clarity"

2. Chain-of-Thought Reasoning

Ask the judge to explain before scoring:

Prompt structure:
"First, analyze the response:
1. Is the core claim accurate?
2. Are there omissions or errors?
3. Is the explanation clear?
Then, assign a score 1-5."

Benefit: The judge's reasoning is visible. If you disagree with a score, you can see where the judge went wrong.

3. Positive & Negative Examples

Provide exemplars to calibrate the judge:

"Here are examples of scores:

Example 1 (Score 5):
Question: 'How does photosynthesis work?'
Response: 'Photosynthesis converts light energy to chemical energy via light reactions and Calvin cycle...'
Reason: Accurate, complete, clear.

Example 2 (Score 2):
Question: 'How does photosynthesis work?'
Response: 'Plants make energy from sun.'
Reason: Oversimplified, missing key mechanisms.

Now evaluate this response: ..."

4. JSON Output Schema

Structure the output for programmatic parsing:

"Respond with JSON:
{
  "analysis": "Detailed reasoning",
  "score": 4,
  "confidence": 0.85,
  "strengths": ["..."],
  "weaknesses": ["..."],
  "suggestions": "How to improve"
}"
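Judges sometimes wrap their JSON in prose or markdown fences, so parsing should be defensive. A sketch (the `parse_judge_json` helper is hypothetical):

```python
import json
import re

def parse_judge_json(text: str) -> dict:
    """Extract the first JSON object from a judge reply, tolerating
    surrounding prose or markdown code fences."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return {"error": "no JSON object found"}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"error": "malformed JSON"}
```

Returning an explicit error dict (rather than raising) lets batch pipelines skip and log bad judgements instead of crashing.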

5. Avoid Bias Signals

Don't include information that biases the judge, such as model names, provider identity, or a fixed presentation order.

Prompt Template: Always use: clear rubric + exemplars + CoT reasoning + JSON output + bias mitigation. This 5-part formula dramatically improves judge consistency.
SECTION 04

Biases in LLM Judges

LLM judges are not objective. They have systematic biases that skew evaluations:

1. Verbosity Bias

Judges prefer longer responses, even when the extra length adds no substance.

2. Position Bias (A/B Preference)

In pairwise comparisons, the option presented first tends to score higher.

3. Self-Enhancement Bias

If the judge is GPT-4, it tends to score GPT-4's outputs higher than competitors' outputs.

4. Sycophancy

Judges give higher scores to responses that agree with the judge's own trained opinions.

5. Coherence Over Correctness

Judges reward well-written but factually wrong answers over awkwardly phrased correct ones.

Mitigation Strategies
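For position bias in pairwise comparisons, a common mitigation is to run each comparison twice with the answers swapped and only count consistent wins. A sketch, where `judge_once` stands in for any judge call returning "A" or "B":

```python
def debiased_pairwise(judge_once, question, answer_1, answer_2):
    """Run the judge twice with the answers in swapped positions.
    Only count a win if the same underlying answer wins both times;
    otherwise call it a tie, which cancels out position bias."""
    first = judge_once(question, answer_1, answer_2)   # answer_1 shown as "A"
    second = judge_once(question, answer_2, answer_1)  # answer_1 shown as "B"
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"
```

A purely position-biased judge (always picking "A") produces only ties under this scheme, so its bias never shows up as a win.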

Critical: LLM judges are biased. Never trust a single judge's score. Always use multiple judges and human spot-checks to validate.
SECTION 05

MT-Bench & Chatbot Arena

Two landmark projects demonstrating LLM judging at scale:

MT-Bench (Multi-Turn Benchmark)

A benchmark of multi-turn conversations with GPT-4 as judge. Evaluates models' ability to handle follow-up questions and maintain context.

Chatbot Arena (LMSYS)

A crowdsourced competition where users vote on which model gives the better response. GPT-4 judges are also used as a supplementary ranking signal.

Findings

MT-Bench leaderboard (simplified):

Model         | GPT-4 Judge Score | Human Eval
GPT-4-Turbo   | 9.1               | 9.0
Claude-3-Opus | 8.9               | 8.8
Llama-2-70B   | 7.1               | 7.3
Mistral-Large | 7.5               | 7.4
GPT-3.5-Turbo | 6.9               | 6.8

Judge scores align well with human evaluation.
Lesson from Arena: Pairwise evaluation (A vs B) is robust. Blind evaluation (hide model names) improves consistency. Crowdsourcing + judge validation together provide confidence in rankings.
SECTION 06

Building a Custom Judge

Here's a practical example of building a judge for a specific task (code generation):

# Custom Judge for Code Generation
import json
import anthropic

JUDGE_PROMPT = """You are an expert code reviewer. Evaluate this code solution on:
1. CORRECTNESS (1-5): Does it solve the problem?
2. EFFICIENCY (1-5): Is it optimized?
3. READABILITY (1-5): Clear variable names, comments?
4. SAFETY (1-5): Error handling, edge cases?

Respond with JSON:
{ "correctness": N, "efficiency": N, "readability": N, "safety": N, "reasoning": "...", "overall": N }"""

def judge_code(problem: str, code: str) -> dict:
    """Judge a code solution."""
    client = anthropic.Anthropic()
    prompt = f"""Problem: {problem}

Solution:
{code}

{JUDGE_PROMPT}"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"error": "Failed to parse judge response"}

def batch_evaluate(problems_codes: list[tuple]) -> list[dict]:
    """Evaluate multiple solutions."""
    results = []
    for problem, code in problems_codes:
        judgment = judge_code(problem, code)
        results.append({
            "problem": problem[:50] + "...",
            "code": code[:50] + "...",
            "judgment": judgment,
        })
    return results

# Example
problems = [
    ("Write a function to check if a number is prime",
     "def is_prime(n):\n    for i in range(2, int(n**0.5) + 1):\n        if n % i == 0:\n            return False\n    return n > 1"),
    ("Reverse a list in-place",
     "def reverse(arr):\n    arr = arr[::-1]"),  # Bug: doesn't modify in-place
]

results = batch_evaluate(problems)
for result in results:
    print(f"Problem: {result['problem']}")
    print(f"Overall: {result['judgment'].get('overall', 'N/A')}")
    print(f"Reasoning: {result['judgment'].get('reasoning', '')}")
    print()

Building Blocks

Optimization Tips

Template Pattern: 1) Define rubric, 2) Write prompt, 3) Create judge function, 4) Batch evaluate, 5) Analyze results. This pattern is reusable across domains.
SECTION 07

Judge Calibration

Before deploying a judge, verify it aligns with human judgement. Calibration is crucial for confidence.

Validation Process

1. Collect Human Labels

Sample 50-100 examples. Have humans (2-3 per example) score or compare. This is your ground truth.

2. Run Judge on Same Examples

Get judge scores for the same 50-100 examples.

3. Compute Correlation

from scipy.stats import spearmanr, pearsonr

human_scores = [8, 7, 9, 6, 5, 8, 9, 7, 8, 6, ...]  # Human avg
judge_scores = [8, 6, 9, 6, 4, 9, 8, 7, 8, 5, ...]  # Judge

# Correlation
pearson_r, p_value = pearsonr(human_scores, judge_scores)
spearman_r, p_value = spearmanr(human_scores, judge_scores)

print(f"Pearson r: {pearson_r:.3f}")    # Linear correlation
print(f"Spearman r: {spearman_r:.3f}")  # Rank correlation

# Interpretation:
# r > 0.8: Strong agreement
# r = 0.6-0.8: Good agreement
# r < 0.6: Weak agreement (don't use judge)

4. Analyze Disagreements

Where does the judge disagree with humans?
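A simple first pass is to flag the examples with the largest human-judge gaps for manual review; a sketch:

```python
def find_disagreements(human, judge, threshold=2.0):
    """Return indices where judge and human scores differ by more than
    `threshold` points -- these examples deserve manual review."""
    return [i for i, (h, j) in enumerate(zip(human, judge))
            if abs(h - j) > threshold]
```

Reading the judge's chain-of-thought on just these flagged items usually reveals whether the rubric, the judge model, or the human labels are at fault.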

5. Calibration Plot

Visualize judge vs human agreement:

Plot: X-axis = Human score, Y-axis = Judge score

Points along diagonal y=x: Perfect agreement
Points above diagonal: Judge over-scores
Points below diagonal: Judge under-scores

Ideally: Tight cluster around the diagonal
Problem pattern: Judge consistently above/below the diagonal
Fix: Retune the judge prompt or choose a different judge model
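The diagonal check can also be quantified: the mean signed error tells you whether the judge systematically over- or under-scores. A sketch:

```python
def calibration_offset(human, judge):
    """Mean signed error between judge and human scores.
    Positive: judge over-scores relative to humans.
    Negative: judge under-scores."""
    return sum(j - h for h, j in zip(human, judge)) / len(human)
```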

When to Trust vs Distrust LLM Judges

Scenario                           | Correlation | Decision
Factual QA (single right answer)   | > 0.85      | Use judge confidently; spot-check 5%
Creative writing, subjective tasks | 0.70-0.80   | Use judge with caution; validate 10-20%
Technical evaluation (code, math)  | > 0.80      | Use judge; requires a good rubric
Safety/toxicity detection          | 0.75-0.85   | Use for screening; humans validate flagged items
Any task with low correlation      | < 0.65      | Don't use; task too subjective or rubric poor

Continuous Monitoring

Even after deployment, monitor judge performance:

Golden Rule: Never deploy a judge without calibration. Invest in 50-100 human labels. The cost (100 × $5 = $500) is tiny compared to the cost of wrong decisions based on uncalibrated judges (wrong model selections, missed safety issues).
SECTION 08

Productionising Your Judge

Moving from a prototype judge to a production eval pipeline requires solving three operational challenges: throughput, consistency, and drift detection. Throughput: LLM judgements are slow (~1–3 seconds each); for large eval sets, run judgements in parallel with a bounded async pool. Aim for 500–1000 judgements per minute using Haiku; this lets you evaluate a 10,000-item test set in under 30 minutes at minimal cost.
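The bounded-pool idea can be sketched with asyncio; `judge_one` here is a stand-in stub, not a real API client:

```python
import asyncio

async def judge_one(item: str) -> dict:
    """Stand-in for a real judge API call (replace with your client)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"item": item, "score": 5}

async def judge_all(items, max_concurrent=50):
    """Run judgements in parallel, bounded by a semaphore so at most
    `max_concurrent` requests are in flight at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(item):
        async with sem:
            return await judge_one(item)

    return await asyncio.gather(*(bounded(i) for i in items))

# results = asyncio.run(judge_all(["output 1", "output 2"]))
```

The semaphore caps concurrency below the provider's rate limit while still saturating throughput; tune `max_concurrent` to your quota.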

Consistency: non-determinism in the judge model means the same item may receive different scores on different runs. Fix this by setting temperature=0 and caching judge prompts. For high-stakes evaluations, run each item through the judge twice with swapped position (a/b then b/a) and average the scores to cancel out position bias. Log the raw judge output alongside the score so you can audit disagreements.

Drift detection: your judge's calibration can shift when you upgrade the judge model or when your task distribution changes. Maintain a "golden set" of 200–300 items with known human-verified scores. Run the judge on this set after every model upgrade and alert if the judge's mean absolute error against human scores increases by more than 0.5 points. Treat the judge itself as a component with regression tests, not a static oracle.
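The golden-set regression test described above can be sketched as follows (the 0.5-point threshold matches the alerting rule in the text):

```python
def check_drift(golden_human, judge_scores, threshold=0.5):
    """Compare judge scores on the golden set against human-verified
    scores. Returns (mae, alert): alert=True means mean absolute error
    exceeds the threshold and the judge needs recalibration."""
    mae = sum(abs(h - j) for h, j in zip(golden_human, judge_scores)) / len(golden_human)
    return mae, mae > threshold
```

Run this after every judge-model upgrade; a failing check blocks the new judge the same way a failing regression test blocks a code deploy.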