What MMLU, HumanEval, HELM, and Chatbot Arena actually measure — and where they mislead
Benchmarks provide standardized comparison across models and over time. Without them, every team uses different test sets, results aren't comparable, and progress is subjective. They're essential infrastructure for the field.
But benchmarks fail in predictable ways. Train-test contamination: benchmark data leaks into training sets (via web crawl or intentional inclusion), inflating scores. Goodhart's Law: when a metric becomes the target, it ceases to measure what matters — models optimize the benchmark, not the underlying capability. Benchmark saturation: top models all cluster near ceiling, losing discriminative power. Narrow coverage: one benchmark doesn't equal general intelligence.
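A crude but common contamination check is n-gram overlap between benchmark items and the training corpus. A minimal sketch, assuming in-memory text (the 13-gram window follows common GPT-3-era decontamination practice; real pipelines use hashed n-grams at scale):

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items: list[str],
                       training_docs: list[str], n: int = 13) -> float:
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    hits = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return hits / len(benchmark_items) if benchmark_items else 0.0
```

A nonzero rate does not prove memorization, but flagged items should be excluded or reported separately.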
MMLU, GPQA, HellaSwag, and WinoGrande are the workhorse benchmarks for general capability. Each measures different aspects: breadth of knowledge, common sense, and reasoning.
| Benchmark | What it tests | Format | Saturation risk |
|---|---|---|---|
| MMLU | 57 subjects, academic knowledge | 4-choice MCQ | High (frontier >85%) |
| MMLU-Pro | Harder MMLU with 10 choices | 10-choice MCQ | Medium |
| ARC (Challenge) | Grade-school science, needs reasoning | 4-choice MCQ | High |
| HellaSwag | Commonsense completion | 4-choice | High |
| WinoGrande | Pronoun disambiguation | Binary | High |
| GPQA | Graduate-level science Q&A | 4-choice | Low (frontier ~60%) |
MMLU: roughly 14,000 questions across 57 subjects, from elementary to advanced professional level. Originally hard (GPT-3 scored 43.9%); frontier models now exceed 85%. At this ceiling, single-point differences no longer translate into meaningful capability differences: treat MMLU scores above ~85% as largely indistinguishable.
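Whether small gaps at the ceiling are even statistically meaningful can be checked with a binomial standard error. A back-of-the-envelope sketch:

```python
import math

def accuracy_ci(acc: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for accuracy on n questions."""
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

# On MMLU's ~14,000 questions, 85% accuracy carries roughly a +/-0.6-point CI,
# so sub-point gaps between frontier models are within sampling noise.
lo, hi = accuracy_ci(0.85, 14_000)
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")  # 95% CI: [0.844, 0.856]
```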
GPQA (Graduate-Level Google-Proof Q&A): questions that PhD-level domain experts answer correctly only ~65% of the time, while skilled non-experts score ~34% even with unrestricted internet access. Current frontier models score around 60%, so this remains a genuinely hard benchmark. GPQA is contamination-resistant by design: the questions were written fresh by domain experts rather than taken from existing sources, making them hard to find or memorize.
Install and run from the command line:

```bash
pip install lm-eval

# Evaluate a model on MMLU and HellaSwag:
lm_eval --model openai-chat-completions \
    --model_args model=gpt-4o \
    --tasks mmlu,hellaswag \
    --num_fewshot 5 \
    --output_path ./results
```
Programmatic evaluation:

```python
from lm_eval import evaluator

def benchmark_model(model_name: str, task_names: list[str]) -> dict:
    # simple_evaluate handles task loading, few-shot sampling, and scoring
    results = evaluator.simple_evaluate(
        model="openai-chat-completions",
        model_args=f"model={model_name}",
        tasks=task_names,
        num_fewshot=5,
        limit=100,  # cap samples per task for a quick smoke test
    )
    return results["results"]

results = benchmark_model("gpt-4o-mini", ["mmlu", "arc_challenge"])
for task, metrics in results.items():
    acc = metrics.get("acc_norm,none", metrics.get("acc,none", 0))
    print(f"{task}: {acc:.3f}")
# Example output:
# mmlu: 0.821
# arc_challenge: 0.886
```
Coding benchmarks have evolved from simple function completion to real-world issue fixing. HumanEval was the early baseline; SWE-Bench Verified is now the gold standard.
| Benchmark | Tasks | Eval method | Notes |
|---|---|---|---|
| HumanEval | 164 Python functions | Unit tests (pass@k) | Saturated (frontier >90%) |
| MBPP | 974 crowd-sourced problems | Unit tests | Easy |
| SWE-Bench | 2,294 real GitHub issues (12 Python repos) | Patch applies + tests pass | Hard, realistic |
| SWE-Bench Verified | 500-issue human-verified subset | Same | Gold standard |
| LiveCodeBench | Rolling scrape of new contest problems | Unit tests | Contamination-resistant |
| BigCodeBench | Function-call / library usage | Unit tests | Tests real API knowledge |
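The pass@k in the table above is usually computed with the unbiased estimator from the HumanEval paper: generate n samples per problem, count c correct, and estimate the probability that at least one of k drawn samples passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill k samples: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this per-problem estimate over the benchmark gives the reported score; naively taking k independent tries instead produces a biased estimate.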
Fix a real GitHub issue in a real repository with real dependencies. This is the current gold standard for agentic coding. GPT-4 scored 1.7% initially; best agents now exceed 50%. SWE-Bench measures actual engineering capability, not toy problem-solving.
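Under the hood, SWE-Bench-style evaluation is a checkout/apply/test loop. A minimal sketch, with illustrative function name and arguments (the official harness runs each repo inside a pinned Docker image with the issue's designated fail-to-pass tests):

```python
import subprocess

def evaluate_patch(repo_dir: str, base_commit: str,
                   patch_file: str, test_cmd: list[str]) -> bool:
    """True iff the model's patch applies at the issue's base commit
    and the repo's tests then pass."""
    def run(cmd: list[str]) -> bool:
        return subprocess.run(cmd, cwd=repo_dir,
                              capture_output=True).returncode == 0
    if not run(["git", "checkout", base_commit]):
        return False
    if not run(["git", "apply", patch_file]):
        return False  # patch doesn't apply: automatic failure
    return run(test_cmd)  # e.g. ["pytest", "tests/test_for_issue.py"]
```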
Continuously collects newly published contest problems from LeetCode, AtCoder, and Codeforces, keeping the test set fresh: by construction, no model can have trained on problems released after its cutoff. Filtering results by problem release date makes LiveCodeBench the best contamination-free signal of coding capability over time.
Math benchmarks range from elementary to research-level, each revealing different reasoning depths.
| Benchmark | Difficulty | Eval method | Saturation |
|---|---|---|---|
| GSM8K | Grade-school math (2–8 steps) | Exact match | High (frontier >95%) |
| MATH | Competition math (AMC, AIME) | Exact match | High for reasoning models |
| AIME 2024 | Actual 2024 AIME problems | Exact match | Low (frontier ~20–50%) |
| FrontierMath | Research-level, novel problems | Exact match | Very low (<5%) |
| OlympiadBench | 8,476 bilingual olympiad problems | Exact match | Low |
GSM8K: saturated; frontier models solve 95%+ of these grade-school problems. MATH: nearly saturated by o1-class models using long chain-of-thought. AIME 2024: the emerging battleground, where frontier models solve roughly 20–50% of the 30 problems depending on inference budget and reasoning depth.
FrontierMath (Epoch AI): problems posed by research mathematicians and designed to resist contamination. At launch, no public model solved even 5%. This is the true frontier of mathematical reasoning capability.
Some benchmarks go beyond narrow metrics to measure robustness, fairness, and safety across diverse scenarios.
HELM (Holistic Evaluation of Language Models): 42 scenarios, each scored on 7 metrics — accuracy, calibration, robustness, fairness, bias, toxicity, efficiency. Comprehensive but expensive to run. BIG-Bench Hard (BBH): 23 tasks that still challenge frontier models, spanning algorithmic reasoning, world knowledge, and language understanding. TruthfulQA: 817 questions built around common human misconceptions; tests whether models repeat them. MT-Bench: 80 multi-turn chat questions judged by GPT-4, covering math, coding, reasoning, roleplay, writing, and STEM.
| Benchmark | Coverage | Eval method | Best for |
|---|---|---|---|
| HELM | Broad (42 scenarios) | Multi-metric | Comprehensive comparison |
| BIG-Bench Hard | Reasoning-focused | Exact match | Reasoning capability |
| TruthfulQA | Factuality | Judge or MCQ | Truthfulness |
| MT-Bench | Chat quality | LLM judge | Chat assistant quality |
These benchmarks provide breadth but sacrifice depth. A single HELM score hides variation across scenarios. Use holistic benchmarks for baseline health checks, not for final model selection.
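MT-Bench-style LLM-as-judge boils down to a pairwise prompt. A sketch with illustrative prompt wording, where `call_llm` stands in for any model API (the real MT-Bench rubric is more detailed and also swaps answer order to cancel position bias):

```python
from typing import Callable

# Hypothetical judge prompt, kept deliberately simple.
JUDGE_PROMPT = """You are an impartial judge. Compare two answers to the same
question and reply with exactly one token: A, B, or tie.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}"""

def judge_pair(question: str, answer_a: str, answer_b: str,
               call_llm: Callable[[str], str]) -> str:
    """call_llm is any prompt -> completion function, e.g. a GPT-4 wrapper."""
    prompt = JUDGE_PROMPT.format(question=question,
                                 answer_a=answer_a, answer_b=answer_b)
    verdict = call_llm(prompt).strip()
    # Any off-format reply counts as a tie rather than a vote
    return verdict if verdict in {"A", "B", "tie"} else "tie"
```

Injecting the model call keeps the judging logic testable and judge-model-agnostic.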
Chatbot Arena (LMSYS): real users submit questions, see two anonymous responses side by side, and vote for the better one. An Elo-style rating is updated after each battle. It is currently the most reliable ranking of overall model quality: contamination is a non-issue because queries are novel, coverage reflects real usage, and rankings update in real time from millions of votes.
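The Elo-style rating works like chess Elo: each vote nudges both models' ratings toward the observed outcome. A textbook Elo sketch (Arena's published leaderboard actually fits a Bradley-Terry model over the full vote history, which makes ratings independent of vote order):

```python
def elo_update(r_a: float, r_b: float, winner: str,
               k: float = 32.0) -> tuple[float, float]:
    """One Elo update for a single battle; winner is "a", "b", or "tie"."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two equally rated models; "a" wins and gains rating at "b"'s expense.
print(elo_update(1000.0, 1000.0, "a"))  # (1016.0, 984.0)
```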
Static benchmarks are fixed forever; Arena evolves. Users reveal true preferences; benchmarks measure narrow capabilities. Arena is large (millions of votes); benchmarks are small (thousands of examples). This scale difference matters.
| Dimension | Arena | Static benchmarks |
|---|---|---|
| Contamination risk | Minimal (novel queries) | High |
| Coverage | Real queries (biased) | Fixed test set (narrow) |
| Speed of update | Real-time | Months |
| Reproducibility | Low (stochastic) | High |
| Cost to game | High | Low |
English bias: skewed toward English queries. Chat bias: skewed toward casual chat-style questions rather than technical ones. Gameability: models can be tuned for superficial features voters prefer, such as verbosity, confident tone, and formatting, without any gain in actual reasoning.
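The verbosity bias is measurable directly from vote logs: check how often the longer answer wins. A sketch over hypothetical `(response_a, response_b, winner)` records:

```python
def longer_wins_rate(battles: list[tuple[str, str, str]]) -> float:
    """Fraction of decisive battles won by the longer response.
    Each battle is (response_a, response_b, winner), winner in {"a","b","tie"}.
    A rate well above 0.5 suggests length is confounding the ranking."""
    decisive = [(a, b, w) for a, b, w in battles
                if w in ("a", "b") and len(a) != len(b)]
    if not decisive:
        return 0.5
    wins = sum(1 for a, b, w in decisive
               if (len(a) > len(b)) == (w == "a"))
    return wins / len(decisive)
```

Length-controlled variants of the Arena leaderboard exist for exactly this reason.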
```python
import json, statistics
from datetime import datetime
from pathlib import Path

HISTORY_FILE = Path("benchmark_history.jsonl")

def save_benchmark_run(model: str, task: str, score: float,
                       metadata: dict | None = None) -> None:
    entry = {
        "timestamp": datetime.now().isoformat(),
        "model": model,
        "task": task,
        "score": score,
        **(metadata or {}),
    }
    with HISTORY_FILE.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def check_regression(model: str, task: str, new_score: float,
                     threshold: float = 0.03) -> dict:
    """Alert if new score is more than threshold below the historical mean."""
    if not HISTORY_FILE.exists():
        return {"status": "no_history"}
    history = [
        entry for entry in map(json.loads, HISTORY_FILE.read_text().splitlines())
        if entry["model"] == model and entry["task"] == task
    ]
    if len(history) < 3:
        return {"status": "insufficient_history", "n": len(history)}
    historical_scores = [h["score"] for h in history[-10:]]  # last 10 runs
    mean_score = statistics.mean(historical_scores)
    drop = mean_score - new_score
    if drop > threshold:
        return {
            "status": "REGRESSION",
            "new_score": new_score,
            "historical_mean": round(mean_score, 4),
            "drop": round(drop, 4),
            "alert": f"Score dropped {drop:.1%} below historical mean",
        }
    return {"status": "ok", "new_score": new_score, "mean": round(mean_score, 4)}

# Example workflow
save_benchmark_run("gpt-4o", "mmlu", 0.887)
result = check_regression("gpt-4o", "mmlu", new_score=0.841)
print(result)
```
Different contexts demand different benchmarks. Here's a framework for choosing wisely.
For general capability comparison, use MMLU-Pro + GPQA + MATH + SWE-Bench Verified + Arena Elo. Together these cover knowledge, reasoning, coding, and human preference, spanning both breadth and depth.
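If you need one headline number from such a suite, a weighted average over normalized scores is a reasonable sketch. The weights and the Elo scaling range below are illustrative, not a standard:

```python
# Hypothetical weights; tune for your use case.
SUITE_WEIGHTS = {
    "mmlu_pro": 0.2, "gpqa": 0.2, "math": 0.2,
    "swe_bench_verified": 0.2, "arena_elo": 0.2,
}

def composite_score(scores: dict[str, float],
                    elo_range: tuple[float, float] = (800.0, 1400.0)) -> float:
    """Weighted average; Arena Elo is min-max scaled to [0, 1] first,
    since the other benchmarks already report accuracies in [0, 1]."""
    lo, hi = elo_range
    normalized = dict(scores)
    if "arena_elo" in normalized:
        normalized["arena_elo"] = (normalized["arena_elo"] - lo) / (hi - lo)
    return sum(SUITE_WEIGHTS[k] * v for k, v in normalized.items())
```

Report the per-benchmark scores alongside any composite; a single number hides exactly the variation that matters.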
```python
import json, statistics
from openai import OpenAI

client = OpenAI()

# Your domain-specific golden set
GOLDEN_SET = [
    {"input": "What is the capital of France?", "expected": "Paris", "category": "factual"},
    {"input": "What is 15% of 200?", "expected": "30", "category": "math"},
    {"input": "Translate 'hello' to Spanish.", "expected": "hola", "category": "translation"},
]

def exact_match(response: str, expected: str) -> bool:
    # Lenient match: the expected answer just has to appear in the response
    return expected.lower().strip() in response.lower().strip()

def run_domain_benchmark(model: str) -> dict:
    results_by_category: dict[str, list[bool]] = {}
    for item in GOLDEN_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": item["input"]}],
            max_tokens=64,
            temperature=0.0,
        ).choices[0].message.content
        results_by_category.setdefault(item["category"], []).append(
            exact_match(resp, item["expected"])
        )
    overall = [v for vals in results_by_category.values() for v in vals]
    return {
        "model": model,
        "overall_accuracy": round(statistics.mean(overall), 3),
        "by_category": {
            cat: round(statistics.mean(vals), 3)
            for cat, vals in results_by_category.items()
        },
    }

for model in ["gpt-4o-mini", "gpt-4o"]:
    print(json.dumps(run_domain_benchmark(model), indent=2))
```
Use HumanEval as a fast baseline for quick iteration, SWE-Bench Verified as the real-world gold standard, and LiveCodeBench for contamination-resistant trend tracking. Skip MMLU for pure coding comparisons.
Build a task-specific golden set (see evals-practice.html). Public benchmarks measure general capability; your users care about your specific task. Benchmark scores don't translate to product quality.
Always report benchmark scores with model size, training compute, and inference compute. A small model with 64× test-time search can outperform a large model on math benchmarks — these are not the same capability. Fairness demands full context.