Benchmarks & Leaderboards

LLM Benchmarks Explained

What MMLU, HumanEval, HELM, and Chatbot Arena actually measure — and where they mislead

60+: active benchmarks in circulation
Contamination risk: the core problem
Arena > static: the lesson
Contents
  1. Why benchmarks matter
  2. Knowledge & reasoning
  3. Coding benchmarks
  4. Math benchmarks
  5. Holistic & safety
  6. Chatbot Arena
  7. Selection guide
01 — Context

Why Benchmarks Matter and Where They Fail

Benchmarks provide standardized comparison across models and over time. Without them, every team uses different test sets, results aren't comparable, and progress is subjective. They're essential infrastructure for the field.

But benchmarks fail in predictable ways. Train-test contamination: benchmark data leaks into training sets (via web crawl or intentional inclusion), inflating scores. Goodhart's Law: when a metric becomes the target, it ceases to measure what matters — models optimize the benchmark, not the underlying capability. Benchmark saturation: top models all cluster near ceiling, losing discriminative power. Narrow coverage: one benchmark doesn't equal general intelligence.

⚠️ Core risk: When a benchmark becomes widely used, models start being trained on its data — intentionally or via web crawl. Treat any model scoring >90% on a popular benchmark with skepticism. Models clustered at the top of a saturated benchmark are not all equally capable; saturation hides real differences.
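Contamination checks in model cards usually reduce to n-gram overlap between benchmark items and the training corpus (the GPT-3 paper's decontamination analysis used 13-grams). A minimal sketch, with illustrative function names; real pipelines hash the n-grams and stream over shards rather than holding everything in memory:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Word-level n-grams; 13-grams are a common contamination heuristic."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_overlap(benchmark_items: list[str],
                          training_docs: list[str],
                          n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

A high overlap rate doesn't prove the model memorized answers, but it does mean the benchmark score can no longer be read at face value.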
02 — Broad Coverage

Knowledge and Reasoning Benchmarks

MMLU, GPQA, HellaSwag, and WinoGrande are the workhorse benchmarks for general capability. Each measures different aspects: breadth of knowledge, common sense, and reasoning.

| Benchmark | What it tests | Format | Saturation risk |
| --- | --- | --- | --- |
| MMLU | 57 subjects, academic knowledge | 4-choice MCQ | High (frontier >85%) |
| MMLU-Pro | Harder MMLU with 10 choices | 10-choice MCQ | Medium |
| ARC (Challenge) | Grade-school science, needs reasoning | 4-choice MCQ | High |
| HellaSwag | Commonsense completion | 4-choice | High |
| WinoGrande | Pronoun disambiguation | Binary | High |
| GPQA | Expert-level PhD Q&A | 4-choice | Low (frontier ~60%) |

MMLU Details

About 14,000 test questions across 57 subjects, from elementary to professional level. Originally hard (GPT-3 scored 43.9%); frontier models now exceed 85%. At that ceiling, a point or two of difference no longer reflects a meaningful capability gap.

GPQA: The Durable Benchmark

Graduate-level Google-Proof Q&A: questions written by PhD-level domain experts, who themselves answer only ~65% correctly, while skilled non-experts score ~34% even with unrestricted web access. Current frontier models score around 60%, so it remains genuinely hard. GPQA is contamination-resistant by design: the questions are newly written and validated to be "Google-proof" rather than scraped from existing sources.

⚠️ Signal interpretation: MMLU scores above 85% are largely indistinguishable in practical terms. When comparing frontier models, focus on GPQA and domain-specific evaluations. MMLU is a table-stakes check, not a differentiator.
Python · Run a custom benchmark using lm-evaluation-harness
# Install: pip install lm-eval
# Evaluate a model on MMLU and HellaSwag from the command line:
# lm_eval --model openai-chat-completions \
#     --model_args model=gpt-4o \
#     --tasks mmlu,hellaswag \
#     --num_fewshot 5 \
#     --output_path ./results

# Programmatic evaluation (lm-eval v0.4 API):
from lm_eval import simple_evaluate

def benchmark_model(model_name: str, task_names: list[str]) -> dict:
    results = simple_evaluate(
        model="openai-chat-completions",
        model_args=f"model={model_name}",
        tasks=task_names,
        num_fewshot=5,
        limit=100   # cap samples per task for a quick run
    )
    return results["results"]

results = benchmark_model("gpt-4o-mini", ["mmlu", "arc_challenge"])
for task, metrics in results.items():
    acc = metrics.get("acc_norm,none", metrics.get("acc,none", 0))
    print(f"{task}: {acc:.3f}")
# Illustrative output:
# mmlu:          0.821
# arc_challenge: 0.886
03 — Programming Tasks

Coding Benchmarks

Coding benchmarks have evolved from simple function completion to real-world issue fixing. HumanEval was the baseline; SWE-Bench is now the gold standard.

| Benchmark | Tasks | Eval method | Limitation |
| --- | --- | --- | --- |
| HumanEval | 164 Python functions | Unit tests (pass@k) | Too easy (frontier >90%) |
| MBPP | 974 crowd-sourced problems | Unit tests | Easy |
| SWE-Bench | 2,294 real GitHub issues from 12 repos | Patch applies + tests pass | Hard, realistic |
| SWE-Bench Verified | Human-verified subset of SWE-Bench | Same | Gold standard |
| LiveCodeBench | Ongoing scrape of new LeetCode problems | Unit tests | Contamination-resistant |
| BigCodeBench | Function-call / library usage | Unit tests | Tests real API knowledge |
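The pass@k column refers to the unbiased estimator from the HumanEval (Codex) paper: sample n completions per problem, count the c that pass the unit tests, and estimate the probability that at least one of k random samples passes. A direct implementation of that formula, 1 − C(n−c, k)/C(n, k), computed as a running product for numerical stability:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), for c passing samples out of n."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: some draw must pass
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

Naively reporting the fraction of problems solved in k tries overestimates pass@k; this estimator is why benchmark papers sample many more completions (n) than the k they report.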

SWE-Bench: The Realistic Standard

Fix a real GitHub issue in a real repository with real dependencies. This is the current gold standard for agentic coding. GPT-4 scored 1.7% initially; best agents now exceed 50%. SWE-Bench measures actual engineering capability, not toy problem-solving.

LiveCodeBench: Contamination-Resistant

Scrapes new LeetCode problems monthly, keeping the test set fresh. By design, no model can be trained on future problems. LiveCodeBench is the best signal for trending model capability over time without worrying about data contamination.
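The mechanism behind that contamination resistance is just a date cutoff: every problem carries its release date, and you evaluate only on problems released after the model's training-data cutoff. A sketch with hypothetical problem records (the field names are illustrative, not LiveCodeBench's schema):

```python
from datetime import date

# Hypothetical problem records; each problem is tagged with its release date.
PROBLEMS = [
    {"id": "two-sum-variant",    "release_date": date(2024, 1, 15)},
    {"id": "graph-coloring-ext", "release_date": date(2024, 6, 2)},
    {"id": "old-classic",        "release_date": date(2022, 3, 10)},
]

def post_cutoff_problems(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems released after the model's training cutoff,
    so the model cannot have memorized them from training data."""
    return [p for p in problems if p["release_date"] > training_cutoff]

fresh = post_cutoff_problems(PROBLEMS, training_cutoff=date(2023, 12, 31))
```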

Reliable signal: For production coding assistants, SWE-Bench Verified is the benchmark that correlates best with developer satisfaction. Use LiveCodeBench for trend analysis (monthly updates, zero contamination risk).
04 — Mathematical Reasoning

Math Benchmarks

Math benchmarks range from elementary to research-level, each revealing different reasoning depths.

| Benchmark | Difficulty | Eval method | Saturation |
| --- | --- | --- | --- |
| GSM8K | Grade-school math (2–8 steps) | Exact match | High (frontier >95%) |
| MATH | Competition math (AMC, AIME) | Exact match | Medium (frontier ~80%) |
| AIME 2024 | Actual 2024 AIME problems | Exact match | Low (frontier ~20–50%) |
| FrontierMath | Research-level, novel problems | Exact match | Very low (<5%) |
| OlympiadBench | 8,476 bilingual olympiad problems | Exact match | Low |

Saturation Progression

GSM8K: saturated; frontier models solve 95%+ of its grade-school problems. MATH: nearly saturated by o1-class models with chain-of-thought. AIME 2024: the emerging battleground, where frontier models solve roughly 20–50% of the 30 problems (across both 2024 exams), depending on search budget and reasoning depth.

FrontierMath: The Unsolved Frontier

An Epoch AI benchmark of problems posed by research mathematicians and designed to resist contamination. No public model has solved even 5% of them. This is the true frontier of mathematical reasoning capability.

⚠️ Watch for gaming: AIME 2025 problems will be the next battleground as they become public. Monitor contamination risk carefully — models trained on 2024 AIME may appear better on 2025 problems through overlap and pattern matching rather than genuine improvement.
05 — Broad & Safety

Holistic and Safety Benchmarks

Some benchmarks go beyond narrow metrics to measure robustness, fairness, and safety across diverse scenarios.

HELM (Holistic Evaluation of Language Models): 42 scenarios with 7 metrics per scenario — accuracy, calibration, robustness, fairness, bias, toxicity, efficiency. Comprehensive but expensive to compute.

BIG-Bench Hard (BBH): 23 tasks that still challenge frontier models — algorithmic, world knowledge, language understanding.

TruthfulQA: 817 questions where humans often have false beliefs; tests whether models propagate misconceptions.

MT-Bench: 80 multi-turn chat questions judged by GPT-4, covering math, coding, reasoning, roleplay, writing, STEM.

| Benchmark | Coverage | Eval method | Best for |
| --- | --- | --- | --- |
| HELM | Broad (42 scenarios) | Multi-metric | Comprehensive comparison |
| BIG-Bench Hard | Reasoning-focused | Exact match | Reasoning capability |
| TruthfulQA | Factuality | Judge or MCQ | Truthfulness |
| MT-Bench | Chat quality | LLM judge | Chat assistant quality |

These benchmarks provide breadth but sacrifice depth. A single HELM score hides variation across scenarios. Use holistic benchmarks for baseline health checks, not for final model selection.
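MT-Bench's "LLM judge" method is easy to sketch: show a judge model the question plus two candidate answers and ask for a verdict. The prompt and function below are illustrative, not MT-Bench's actual prompt; production judges also swap the A/B order and average the two verdicts to cancel the judge's known position bias:

```python
JUDGE_PROMPT = """You are an impartial judge. Compare the two assistant responses
to the user question below. Reply with exactly one word: "A", "B", or "TIE".

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str,
               judge_model: str = "gpt-4o") -> str:
    """One pairwise verdict from an LLM judge (assumes an OpenAI API key)."""
    from openai import OpenAI  # imported lazily so the module loads without a key
    client = OpenAI()
    verdict = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0.0,
        max_tokens=4,
    ).choices[0].message.content.strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```

LLM judges inherit the judge model's biases (verbosity, self-preference), which is exactly the gap that human-vote systems like Chatbot Arena, below, try to close.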

06 — Human Signal

Chatbot Arena and Human Preference

Chatbot Arena (LMSYS): Real users ask questions, see two anonymous responses, and pick the better one; an Elo-style rating is updated after each battle. Currently the most reliable ranking for overall model quality: contamination is effectively impossible (queries are fresh), coverage is real (millions of votes), and results update in real time.

Why Arena is Better Than Static Benchmarks

Static benchmarks are fixed forever; Arena evolves. Users reveal true preferences; benchmarks measure narrow capabilities. Arena is large (millions of votes); benchmarks are small (thousands of examples). This scale difference matters.

| Dimension | Arena | Static benchmarks |
| --- | --- | --- |
| Contamination risk | Zero | High |
| Coverage | Real queries (biased) | Fixed test set (narrow) |
| Speed of update | Real-time | Months |
| Reproducibility | Low (stochastic) | High |
| Cost to game | Hard | Easy |

Arena Limitations

English bias: Skewed toward English queries. Chat bias: Skewed toward chat-style questions (less technical). Gameable: Can optimize for superficial features that humans prefer (verbosity, disclaimers) without improving actual reasoning.

Single most trustworthy signal: Chatbot Arena is the best ranking available, but look at confidence intervals, not just Elo point estimates. Models within ~50 points of each other are often statistically indistinguishable.
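The rating mechanics are simple to sketch. Below is plain sequential Elo over battle records; the live leaderboard actually fits a Bradley-Terry model with bootstrap confidence intervals, but the intuition is the same, and a low K-factor is what smooths noisy human votes:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo's logistic win expectancy: P(A beats B)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_from_battles(battles: list[tuple[str, str, str]],
                     k: float = 4.0, base: float = 1000.0) -> dict[str, float]:
    """Sequential Elo over (model_a, model_b, winner) records;
    winner is a model name or "tie"."""
    ratings: dict[str, float] = {}
    for a, b, winner in battles:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        ea = expected_score(ra, rb)  # expected score for model A
        sa = 1.0 if winner == a else 0.5 if winner == "tie" else 0.0
        ratings[a] = ra + k * (sa - ea)
        ratings[b] = rb + k * ((1.0 - sa) - (1.0 - ea))
    return ratings
```

One useful consequence of the logistic form: a 100-point gap implies roughly a 64% win rate, which is why small Elo differences mean so little without confidence intervals.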
Python · Track benchmark scores over time and alert on regression
import json, statistics
from pathlib import Path
from datetime import datetime

HISTORY_FILE = Path("benchmark_history.jsonl")

def save_benchmark_run(model: str, task: str, score: float, metadata: dict = None):
    entry = {
        "timestamp": datetime.now().isoformat(),
        "model": model,
        "task": task,
        "score": score,
        **(metadata or {})
    }
    with HISTORY_FILE.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def check_regression(model: str, task: str, new_score: float,
                      threshold: float = 0.03) -> dict:
    """Alert if new score is more than threshold below historical mean."""
    if not HISTORY_FILE.exists():
        return {"status": "no_history"}

    history = [
        entry for entry in map(json.loads, HISTORY_FILE.read_text().splitlines())
        if entry["model"] == model and entry["task"] == task
    ]
    if len(history) < 3:
        return {"status": "insufficient_history", "n": len(history)}

    historical_scores = [h["score"] for h in history[-10:]]  # last 10 runs
    mean_score = statistics.mean(historical_scores)
    drop = mean_score - new_score

    if drop > threshold:
        return {
            "status": "REGRESSION",
            "new_score": new_score,
            "historical_mean": round(mean_score, 4),
            "drop": round(drop, 4),
            "alert": f"Score dropped {drop:.1%} below historical mean"
        }
    return {"status": "ok", "new_score": new_score, "mean": round(mean_score, 4)}

# Example workflow
save_benchmark_run("gpt-4o", "mmlu", 0.887)
result = check_regression("gpt-4o", "mmlu", new_score=0.841)
print(result)
07 — How to Choose

Benchmark Selection Guide

Different contexts demand different benchmarks. Here's a framework for choosing wisely.

1

For general capability comparison — broad signal

Use MMLU-Pro + GPQA + MATH + SWE-Bench Verified + Arena Elo. This suite covers reasoning, knowledge, coding, and human preference, spanning both breadth and depth.

Python · Build a task-specific benchmark from your own golden set
import json, statistics
from openai import OpenAI

client = OpenAI()

# Your domain-specific golden set
GOLDEN_SET = [
    {"input": "What is the capital of France?", "expected": "Paris", "category": "factual"},
    {"input": "What is 15% of 200?",            "expected": "30",    "category": "math"},
    {"input": "Translate 'hello' to Spanish.",  "expected": "hola",  "category": "translation"},
]

def exact_match(response: str, expected: str) -> bool:
    # Lenient match: the expected answer need only appear in the response.
    return expected.lower().strip() in response.lower().strip()

def run_domain_benchmark(model: str) -> dict:
    results_by_category: dict = {}

    for item in GOLDEN_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}],
            max_tokens=64,
            temperature=0.0
        ).choices[0].message.content

        cat = item["category"]
        if cat not in results_by_category:
            results_by_category[cat] = []
        results_by_category[cat].append(exact_match(resp, item["expected"]))

    overall = [v for vals in results_by_category.values() for v in vals]
    return {
        "model": model,
        "overall_accuracy": round(statistics.mean(overall), 3),
        "by_category": {
            cat: round(statistics.mean(vals), 3)
            for cat, vals in results_by_category.items()
        }
    }

for model in ["gpt-4o-mini", "gpt-4o"]:
    print(json.dumps(run_domain_benchmark(model), indent=2))
2

For coding products specifically — deep signal

HumanEval as a baseline (fast, for quick iteration), SWE-Bench Verified as gold standard (real-world), LiveCodeBench for contamination-resistant trending. Skip MMLU for pure coding comparisons.

3

For safety-critical deployments — safety focus

TruthfulQA + MT-Bench safety categories + custom red-team suite. Public benchmarks alone are insufficient. Build your own safety evals for domain-specific harms.

4

For your own product — task-specific signal

Build a task-specific golden set (see evals-practice.html). Public benchmarks measure general capability; your users care about your specific task. Benchmark scores don't translate to product quality.

Reporting Best Practices

Always report benchmark scores with model size, training compute, and inference compute. A small model with 64× test-time search can outperform a large model on math benchmarks — these are not the same capability. Fairness demands full context.

⚠️ Context is essential: "Model X scored 90% on MMLU" is meaningless without knowing model size, inference strategy, and whether test-time search was used. Always publish full details.
08 — Further Reading

References
