What MMLU, HumanEval, HELM, and Chatbot Arena actually measure — and where they mislead
Benchmarks provide standardized comparison across models and over time. Without them, every team uses different test sets, results aren't comparable, and progress is subjective. They're essential infrastructure for the field.
But benchmarks fail in predictable ways. Train-test contamination: benchmark data leaks into training sets (via web crawl or intentional inclusion), inflating scores. Goodhart's Law: when a metric becomes the target, it ceases to measure what matters — models optimize the benchmark, not the underlying capability. Benchmark saturation: top models all cluster near ceiling, losing discriminative power. Narrow coverage: one benchmark doesn't equal general intelligence.
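A crude but common contamination check is n-gram overlap between benchmark items and the training corpus. A minimal sketch, assuming in-memory text (the 13-gram window follows common GPT-3-era decontamination practice; real pipelines use hashed n-grams at scale):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items: list[str],
                       training_docs: list[str], n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

A nonzero rate does not prove memorization, but flagged items should be excluded or reported separately.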
MMLU, GPQA, HellaSwag, and WinoGrande are the workhorse benchmarks for general capability. Each measures different aspects: breadth of knowledge, common sense, and reasoning.
| Benchmark | What it tests | Format | Saturation risk |
|---|---|---|---|
| MMLU | 57 subjects, academic knowledge | 4-choice MCQ | High (frontier >85%) |
| MMLU-Pro | Harder MMLU with 10 choices | 10-choice MCQ | Medium |
| ARC (Challenge) | Grade-school science, needs reasoning | 4-choice MCQ | High |
| HellaSwag | Commonsense completion | 4-choice | High |
| WinoGrande | Pronoun disambiguation | Binary | High |
| GPQA | Graduate-level science Q&A | 4-choice | Low (frontier ~60%) |
MMLU: roughly 14,000 questions across 57 subjects, from elementary to advanced professional level. Originally hard (GPT-3 scored 43.9%); frontier models now exceed 85%. At this ceiling, single-point differences no longer translate into meaningful capability differences: treat MMLU scores above ~85% as largely indistinguishable.
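Whether small gaps at the ceiling are even statistically meaningful can be checked with a binomial standard error. A back-of-the-envelope sketch:

```python
import math

def accuracy_ci(acc: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for accuracy on n questions."""
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

# On MMLU's ~14,000 questions, 85% accuracy carries roughly a +/-0.6-point CI,
# so sub-point gaps between frontier models are within sampling noise.
lo, hi = accuracy_ci(0.85, 14_000)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")  # 95% CI: [0.844, 0.856]
```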
GPQA (Graduate-Level Google-Proof Q&A): questions that PhD-level domain experts answer correctly only ~65% of the time, while skilled non-experts score ~34% even with unrestricted internet access. Current frontier models score around 60%, so this remains a genuinely hard benchmark. GPQA is contamination-resistant by design: the questions were written fresh by domain experts rather than taken from existing sources, making them hard to find or memorize.
Install and run from the command line:

```bash
pip install lm-eval

# Evaluate a model on MMLU and HellaSwag:
lm_eval --model openai-chat-completions \
    --model_args model=gpt-4o \
    --tasks mmlu,hellaswag \
    --num_fewshot 5 \
    --output_path ./results
```
Programmatic evaluation:

```python
from lm_eval import evaluator

def benchmark_model(model_name: str, task_names: list[str]) -> dict:
    # simple_evaluate handles task loading, few-shot sampling, and scoring
    results = evaluator.simple_evaluate(
        model="openai-chat-completions",
        model_args=f"model={model_name}",
        tasks=task_names,
        num_fewshot=5,
        limit=100,  # cap samples per task for a quick smoke test
    )
    return results["results"]

results = benchmark_model("gpt-4o-mini", ["mmlu", "arc_challenge"])
for task, metrics in results.items():
    acc = metrics.get("acc_norm,none", metrics.get("acc,none", 0))
    print(f"{task}: {acc:.3f}")
# Example output:
# mmlu: 0.821
# arc_challenge: 0.886
```
Coding benchmarks have evolved from simple function completion to real-world issue fixing. HumanEval was the early baseline; SWE-Bench Verified is now the gold standard.
| Benchmark | Tasks | Eval method | Notes |
|---|---|---|---|
| HumanEval | 164 Python functions | Unit tests (pass@k) | Saturated (frontier >90%) |
| MBPP | 974 crowd-sourced problems | Unit tests | Easy |
| SWE-Bench | 2,294 real GitHub issues (12 Python repos) | Patch applies + tests pass | Hard, realistic |
| SWE-Bench Verified | 500-issue human-verified subset | Same | Gold standard |
| LiveCodeBench | Rolling scrape of new contest problems | Unit tests | Contamination-resistant |
| BigCodeBench | Function-call / library usage | Unit tests | Tests real API knowledge |
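The pass@k in the table above is usually computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count c correct, and estimate the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this per-problem estimate over the benchmark gives the reported score; naively taking k independent tries instead produces a biased estimate.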
Fix a real GitHub issue in a real repository with real dependencies. This is the current gold standard for agentic coding. GPT-4 scored 1.7% initially; best agents now exceed 50%. SWE-Bench measures actual engineering capability, not toy problem-solving.
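Under the hood, SWE-Bench-style evaluation is a checkout/apply/test loop. A minimal sketch, with illustrative function name and arguments (the official harness runs each repo inside a pinned Docker image with the issue's designated fail-to-pass tests):

```python
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str,
                   patch_file: str, test_cmd: list[str]) -> bool:
    """True iff the model's patch applies at the issue's base commit
    and the repo's tests then pass."""
    def run(cmd: list[str]) -> bool:
        return subprocess.run(cmd, cwd=repo_dir,
                              capture_output=True).returncode == 0
    if not run(["git", "checkout", base_commit]):
        return False
    if not run(["git", "apply", patch_file]):
        return False  # patch doesn't apply: automatic failure
    return run(test_cmd)  # e.g. ["pytest", "tests/test_for_issue.py"]
```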
Continuously collects newly published contest problems from LeetCode, AtCoder, and Codeforces, keeping the test set fresh: by construction, no model can have trained on problems released after its cutoff. Filtering results by problem release date makes LiveCodeBench the best contamination-free signal of coding capability over time.
Math benchmarks range from elementary to research-level, each revealing different reasoning depths.
| Benchmark | Difficulty | Eval method | Saturation |
|---|---|---|---|
| GSM8K | Grade-school math (2–8 steps) | Exact match | High (frontier >95%) |
| MATH | Competition math (AMC, AIME) | Exact match | High for reasoning models |
| AIME 2024 | Actual 2024 AIME problems | Exact match | Low (frontier ~20–50%) |
| FrontierMath | Research-level, novel problems | Exact match | Very low (<5%) |
| OlympiadBench | 8,476 bilingual olympiad problems | Exact match | Low |
GSM8K: saturated; frontier models solve 95%+ of these grade-school problems. MATH: nearly saturated by o1-class models using long chain-of-thought. AIME 2024: the emerging battleground, where frontier models solve roughly 20–50% of the 30 problems depending on inference budget and reasoning depth.
FrontierMath (Epoch AI): problems posed by research mathematicians and designed to resist contamination. At launch, no public model solved even 5%. This is the true frontier of mathematical reasoning capability.
Some benchmarks go beyond narrow metrics to measure robustness, fairness, and safety across diverse scenarios.
HELM (Holistic Evaluation of Language Models): 42 scenarios, each scored on 7 metrics — accuracy, calibration, robustness, fairness, bias, toxicity, efficiency. Comprehensive but expensive to run. BIG-Bench Hard (BBH): 23 tasks that still challenge frontier models, spanning algorithmic reasoning, world knowledge, and language understanding. TruthfulQA: 817 questions built around common human misconceptions; tests whether models repeat them. MT-Bench: 80 multi-turn chat questions judged by GPT-4, covering math, coding, reasoning, roleplay, writing, and STEM.
| Benchmark | Coverage | Eval method | Best for |
|---|---|---|---|
| HELM | Broad (42 scenarios) | Multi-metric | Comprehensive comparison |
| BIG-Bench Hard | Reasoning-focused | Exact match | Reasoning capability |
| TruthfulQA | Factuality | Judge or MCQ | Truthfulness |
| MT-Bench | Chat quality | LLM judge | Chat assistant quality |
These benchmarks provide breadth but sacrifice depth. A single HELM score hides variation across scenarios. Use holistic benchmarks for baseline health checks, not for final model selection.
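MT-Bench-style LLM-as-judge boils down to a pairwise prompt. A sketch with illustrative prompt wording, where `call_llm` stands in for any model API (the real MT-Bench rubric is more detailed and also swaps answer order to cancel position bias):

```python
from typing import Callable

# Hypothetical judge prompt, kept deliberately simple.
JUDGE_PROMPT = """You are an impartial judge. Compare two answers to the same
question and reply with exactly one token: A, B, or tie.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str,
               call_llm: Callable[[str], str]) -> str:
    """call_llm is any prompt -> completion function, e.g. a GPT-4 wrapper."""
    prompt = JUDGE_PROMPT.format(question=question,
                                 answer_a=answer_a, answer_b=answer_b)
    verdict = call_llm(prompt).strip()
    # Any off-format reply counts as a tie rather than a vote
    return verdict if verdict in {"A", "B", "tie"} else "tie"
```

Injecting the model call keeps the judging logic testable and judge-model-agnostic.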
Chatbot Arena (LMSYS): real users submit questions, see two anonymous responses side by side, and vote for the better one. An Elo-style rating is updated after each battle. It is currently the most reliable ranking of overall model quality: contamination is a non-issue because queries are novel, coverage reflects real usage, and rankings update in real time from millions of votes.
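The Elo-style rating works like chess Elo: each vote nudges both models' ratings toward the observed outcome. A textbook Elo sketch (Arena's published leaderboard actually fits a Bradley-Terry model over the full vote history, which makes ratings independent of vote order):

```python
def elo_update(r_a: float, r_b: float, winner: str,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update for a single battle; winner is "a", "b", or "tie"."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models; "a" wins and gains rating at "b"'s expense.
print(elo_update(1000.0, 1000.0, "a"))  # (1016.0, 984.0)
```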
Static benchmarks are fixed forever; Arena evolves. Users reveal true preferences; benchmarks measure narrow capabilities. Arena is large (millions of votes); benchmarks are small (thousands of examples). This scale difference matters.
| Dimension | Arena | Static benchmarks |
|---|---|---|
| Contamination risk | Minimal (novel queries) | High |
| Coverage | Real queries (biased) | Fixed test set (narrow) |
| Speed of update | Real-time | Months |
| Reproducibility | Low (stochastic) | High |
| Cost to game | High | Low |
English bias: skewed toward English queries. Chat bias: skewed toward casual chat-style questions rather than technical ones. Gameability: models can be tuned for superficial features voters prefer, such as verbosity, confident tone, and formatting, without any gain in actual reasoning.
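The verbosity bias is measurable directly from vote logs: check how often the longer answer wins. A sketch over hypothetical `(response_a, response_b, winner)` records:

```python
def longer_wins_rate(battles: list[tuple[str, str, str]]) -> float:
    """Fraction of decisive battles won by the longer response.
    Each battle is (response_a, response_b, winner), winner in {"a","b","tie"}.
    A rate well above 0.5 suggests length is confounding the ranking."""
    decisive = [(a, b, w) for a, b, w in battles
                if w in ("a", "b") and len(a) != len(b)]
    if not decisive:
        return 0.5
    wins = sum(1 for a, b, w in decisive
               if (len(a) > len(b)) == (w == "a"))
    return wins / len(decisive)
```

Length-controlled variants of the Arena leaderboard exist for exactly this reason.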
```python
import json, statistics
from datetime import datetime
from pathlib import Path

HISTORY_FILE = Path("benchmark_history.jsonl")

def save_benchmark_run(model: str, task: str, score: float,
                       metadata: dict | None = None) -> None:
    entry = {
        "timestamp": datetime.now().isoformat(),
        "model": model,
        "task": task,
        "score": score,
        **(metadata or {}),
    }
    with HISTORY_FILE.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def check_regression(model: str, task: str, new_score: float,
                     threshold: float = 0.03) -> dict:
    """Alert if new score is more than threshold below the historical mean."""
    if not HISTORY_FILE.exists():
        return {"status": "no_history"}
    history = [
        entry for entry in map(json.loads, HISTORY_FILE.read_text().splitlines())
        if entry["model"] == model and entry["task"] == task
    ]
    if len(history) < 3:
        return {"status": "insufficient_history", "n": len(history)}
    historical_scores = [h["score"] for h in history[-10:]]  # last 10 runs
    mean_score = statistics.mean(historical_scores)
    drop = mean_score - new_score
    if drop > threshold:
        return {
            "status": "REGRESSION",
            "new_score": new_score,
            "historical_mean": round(mean_score, 4),
            "drop": round(drop, 4),
            "alert": f"Score dropped {drop:.1%} below historical mean",
        }
    return {"status": "ok", "new_score": new_score, "mean": round(mean_score, 4)}

# Example workflow
save_benchmark_run("gpt-4o", "mmlu", 0.887)
result = check_regression("gpt-4o", "mmlu", new_score=0.841)
print(result)
```
Different contexts demand different benchmarks. Here's a framework for choosing wisely.
For general capability comparison, use MMLU-Pro + GPQA + MATH + SWE-Bench Verified + Arena Elo. Together these cover knowledge, reasoning, coding, and human preference, spanning both breadth and depth.
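If you need one headline number from such a suite, a weighted average over normalized scores is a reasonable sketch. The weights and the Elo scaling range below are illustrative, not a standard:

```python
# Hypothetical weights; tune for your use case.
SUITE_WEIGHTS = {
    "mmlu_pro": 0.2, "gpqa": 0.2, "math": 0.2,
    "swe_bench_verified": 0.2, "arena_elo": 0.2,
}

def composite_score(scores: dict[str, float],
                    elo_range: tuple[float, float] = (800.0, 1400.0)) -> float:
    """Weighted average; Arena Elo is min-max scaled to [0, 1] first,
    since the other benchmarks already report accuracies in [0, 1]."""
    lo, hi = elo_range
    normalized = dict(scores)
    if "arena_elo" in normalized:
        normalized["arena_elo"] = (normalized["arena_elo"] - lo) / (hi - lo)
    return sum(SUITE_WEIGHTS[k] * v for k, v in normalized.items())
```

Report the per-benchmark scores alongside any composite; a single number hides exactly the variation that matters.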
```python
import json, statistics
from openai import OpenAI

client = OpenAI()

# Your domain-specific golden set
GOLDEN_SET = [
    {"input": "What is the capital of France?", "expected": "Paris", "category": "factual"},
    {"input": "What is 15% of 200?", "expected": "30", "category": "math"},
    {"input": "Translate 'hello' to Spanish.", "expected": "hola", "category": "translation"},
]

def exact_match(response: str, expected: str) -> bool:
    # Lenient match: the expected answer just has to appear in the response
    return expected.lower().strip() in response.lower().strip()

def run_domain_benchmark(model: str) -> dict:
    results_by_category: dict[str, list[bool]] = {}
    for item in GOLDEN_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}],
            max_tokens=64,
            temperature=0.0,
        ).choices[0].message.content
        results_by_category.setdefault(item["category"], []).append(
            exact_match(resp, item["expected"])
        )
    overall = [v for vals in results_by_category.values() for v in vals]
    return {
        "model": model,
        "overall_accuracy": round(statistics.mean(overall), 3),
        "by_category": {
            cat: round(statistics.mean(vals), 3)
            for cat, vals in results_by_category.items()
        },
    }

for model in ["gpt-4o-mini", "gpt-4o"]:
    print(json.dumps(run_domain_benchmark(model), indent=2))
```
Use HumanEval as a fast baseline for quick iteration, SWE-Bench Verified as the real-world gold standard, and LiveCodeBench for contamination-resistant trend tracking. Skip MMLU for pure coding comparisons.
Build a task-specific golden set (see evals-practice.html). Public benchmarks measure general capability; your users care about your specific task. Benchmark scores don't translate to product quality.
Always report benchmark scores with model size, training compute, and inference compute. A small model with 64× test-time search can outperform a large model on math benchmarks — these are not the same capability. Fairness demands full context.