Evaluation

MMLU Benchmark

Massive Multitask Language Understanding: 57-subject multiple-choice benchmark covering STEM, humanities, and social sciences. The de facto standard for measuring LLM knowledge breadth.

- 57 subjects (STEM, humanities, social sciences, other)
- 14,079 test questions
- 5-shot prompt format
- Random baseline: 25% (4 choices)

SECTION 01

What MMLU measures

MMLU (Massive Multitask Language Understanding, Hendrycks et al. 2020) tests a model's world knowledge and problem-solving ability across 57 subjects. Questions range from high-school level (US history, basic maths) to professional level (medical licensing, legal bar exam, college physics). The benchmark is designed to test knowledge that a model can only acquire through extensive pretraining — not reasoning skills alone.

It became the standard comparison point after GPT-4's 86.4% score in 2023 approached the ~89.8% accuracy that Hendrycks et al. estimated for human domain experts, making headlines and establishing MMLU as the go-to benchmark for new model releases.

SECTION 02

Benchmark structure

Each of 57 subjects has a development set (5 examples used for few-shot prompting) and a test set (question + 4 answer choices, one correct). The standard evaluation uses 5-shot prompting: prepend the 5 development examples to each test question. The model must output A, B, C, or D.

Subject categories: STEM (abstract algebra, college chemistry, high school physics...), humanities (philosophy, world history, jurisprudence...), social sciences (econometrics, political science, sociology...), other (clinical knowledge, medical genetics, nutrition...).

SECTION 03

Running MMLU evaluation in Python

from datasets import load_dataset
import openai, re

client = openai.OpenAI()

def evaluate_mmlu_subject(subject: str, n_examples: int = 100) -> float:
    dataset = load_dataset("cais/mmlu", subject)
    dev = list(dataset["dev"])    # 5 few-shot examples
    test = list(dataset["test"])  # test questions

    # Build few-shot prompt
    def format_question(item, include_answer=False):
        choices = "ABCD"
        q = f"Question: {item['question']}\n"
        for i, choice in enumerate(item["choices"]):
            q += f"{choices[i]}. {choice}\n"
        if include_answer:
            q += f"Answer: {choices[item['answer']]}\n\n"
        else:
            q += "Answer:"
        return q

    few_shot = "".join(format_question(ex, include_answer=True) for ex in dev)

    batch = test[:n_examples]
    correct = 0
    for item in batch:
        prompt = few_shot + format_question(item)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1, temperature=0.0,
        )
        # Extract the first A-D letter; the model may emit a leading space or a
        # lowercase letter, so a regex is more robust than comparing the raw
        # completion directly.
        m = re.search(r"[ABCD]", (resp.choices[0].message.content or "").upper())
        if m and m.group(0) == "ABCD"[item["answer"]]:
            correct += 1

    # Divide by the number of questions actually evaluated, not n_examples,
    # in case the subject has fewer than n_examples test questions.
    return correct / len(batch)

score = evaluate_mmlu_subject("abstract_algebra")
print(f"Abstract algebra: {score:.1%}")
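
Subsampling 100 of a subject's test questions keeps API costs down but adds sampling noise worth quantifying before comparing models. A minimal sketch of a binomial confidence interval for a measured accuracy (`accuracy_ci` is an illustrative helper, not part of any harness):

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for an accuracy estimate."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)

# With 100 questions, a measured 65% accuracy is only pinned down
# to within about nine points either way:
lo, hi = accuracy_ci(65, 100)
print(f"65/100 correct -> 95% CI ({lo:.1%}, {hi:.1%})")
```

At this sample size, two models a few points apart on one subject are statistically indistinguishable; evaluate the full test split before drawing conclusions.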
SECTION 04

Score interpretation

Reference scores (5-shot, as of early 2025):

Random baseline   25.0%
Mistral-7B        64.2%
Llama-3-8B        66.6%
GPT-4 (2023)      86.4%
Claude 3.5        88.3%
GPT-4o            88.7%

An MMLU score is not really a single number: aggregate scores hide large variance across subjects. A model can score 95% on high-school history while scoring 55% on formal logic. Always report per-category breakdowns for meaningful comparisons.
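
One way to produce such a breakdown is to average per-subject accuracies within each category. A sketch, with a deliberately partial subject-to-category mapping and illustrative scores (the real benchmark has 57 subjects):

```python
from collections import defaultdict

# Illustrative subset of the 57-subject category mapping.
SUBJECT_CATEGORY = {
    "abstract_algebra": "STEM", "high_school_physics": "STEM",
    "philosophy": "humanities", "world_history": "humanities",
    "econometrics": "social sciences", "sociology": "social sciences",
}

def category_breakdown(subject_scores: dict) -> dict:
    """Average per-subject accuracies within each category."""
    buckets = defaultdict(list)
    for subject, acc in subject_scores.items():
        buckets[SUBJECT_CATEGORY.get(subject, "other")].append(acc)
    return {cat: sum(v) / len(v) for cat, v in buckets.items()}

scores = {"abstract_algebra": 0.55, "high_school_physics": 0.62,
          "philosophy": 0.81, "world_history": 0.95, "sociology": 0.78}
print(category_breakdown(scores))
```

With these illustrative numbers the STEM average (58.5%) sits thirty points below the humanities average (88.0%), exactly the kind of gap a single aggregate score would hide.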

SECTION 05

Limitations and criticisms

MMLU has attracted significant criticism as frontier models approached human-level scores:

- Data contamination: questions and answer keys circulate on exam-prep sites that end up in pretraining crawls, so high scores may reflect memorization rather than capability (see Section 07).
- Question errors: manual re-annotation efforts such as MMLU-Redux found a meaningful fraction of questions with wrong answer keys, ambiguous wording, or missing context.
- Saturation: frontier models now cluster in the high 80s, leaving little headroom to distinguish them.
- Format sensitivity: reported scores shift with prompt template, answer-choice ordering, and the 0-shot vs 5-shot configuration.

SECTION 06

MMLU variants

Several successors address MMLU's limitations:

- MMLU-Pro: harder, more reasoning-heavy questions with ten answer choices instead of four, lowering the guessing baseline and pushing back saturation.
- MMLU-Redux: a manually re-verified subset of MMLU with corrected answer keys, filtering out erroneous questions.
- Multilingual variants (e.g. Global-MMLU): translated and localized versions for testing knowledge beyond English.

# Compare model scores across MMLU variants using lm-evaluation-harness
# pip install lm-eval

# CLI: evaluate a model on MMLU and MMLU-Pro
# lm_eval --model hf --model_args pretrained=mistralai/Mistral-7B-v0.1 \
#         --tasks mmlu,mmlu_pro --num_fewshot 5 --output_path results/

import json, subprocess

def run_eval(model_name, tasks=("mmlu", "mmlu_pro"), shots=5):
    cmd = [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model_name}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(shots),
        "--output_path", "results/",
        "--log_samples",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"lm_eval failed:\n{result.stderr}")
    return result

# Parse results
def parse_mmlu_results(results_dir):
    import glob
    results = {}
    for path in glob.glob(f"{results_dir}/**/*.json", recursive=True):
        with open(path) as fh:
            data = json.load(fh)
        for task, metrics in data.get("results", {}).items():
            if "mmlu" in task:
                # lm-eval >= 0.4 keys metrics as "acc,none"; older versions use "acc"
                results[task] = metrics.get("acc,none", metrics.get("acc"))
    return results

# Typical score ranges (5-shot, 2024 models):
# GPT-4o:        MMLU=88.7%  MMLU-Pro=72.6%
# Claude 3.5:    MMLU=88.3%  MMLU-Pro=~70%
# Mistral-7B:    MMLU=64.2%  MMLU-Pro=~36%
# Llama-3-8B:    MMLU=66.6%  MMLU-Pro=~41%

SECTION 07

Gotchas


MMLU (Massive Multitask Language Understanding) evaluates language models across 57 academic subjects spanning STEM, humanities, social sciences, and professional domains. Each subject contains multiple-choice questions derived from practice exams, textbooks, and standardized tests, making MMLU a broad measure of the factual knowledge and reasoning capability encoded in a model's parameters.

Score Range   Interpretation                  Comparable Human Level
25–40%        Near-random (chance = 25%)      Below layperson
40–60%        Basic comprehension             General public
60–75%        Solid knowledge                 College graduate
75–85%        Expert-level in many subjects   Domain expert average
85%+          Near-expert across domains      Top domain experts
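
The banding above is easy to encode for dashboards or regression tests; a minimal sketch (`interpret_mmlu` is a hypothetical helper, and the labels are taken directly from the table):

```python
def interpret_mmlu(score: float) -> str:
    """Map an aggregate MMLU accuracy to the interpretation bands above."""
    bands = [
        (0.85, "Near-expert across domains"),
        (0.75, "Expert-level in many subjects"),
        (0.60, "Solid knowledge"),
        (0.40, "Basic comprehension"),
        (0.25, "Near-random"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "Below chance: check the harness for answer-parsing bugs"

print(interpret_mmlu(0.887))  # a GPT-4o-level aggregate score
```

A sustained score below the 25% chance floor almost always signals a broken answer extractor rather than a weak model, hence the final branch.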

MMLU score inflation from data contamination is a significant concern when comparing models. If a model's pre-training corpus includes MMLU questions or their answer keys — even indirectly through web crawls of exam prep sites — the benchmark scores may reflect memorization rather than generalization. Newer contamination-controlled benchmarks like MMLU-Pro (harder questions) and MMLU-Redux (manually verified questions) are increasingly used alongside standard MMLU to provide more reliable capability estimates.
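
A common first-pass contamination screen is word n-gram overlap between benchmark questions and the pretraining corpus. A minimal sketch, assuming the corpus has been reduced to a set of n-grams (the strings and helper names here are illustrative):

```python
def ngram_set(text: str, n: int = 8) -> set:
    """All lowercase word n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_ngrams: set, n: int = 8) -> bool:
    """Flag a benchmark question if any of its n-grams appears in the corpus."""
    return bool(ngram_set(question, n) & corpus_ngrams)

# Toy corpus standing in for a crawled exam-prep page.
corpus = ("practice exam: the ring Z/7Z is a field because 7 is prime "
          "and every nonzero element has an inverse")
q_leaked = ("the ring Z/7Z is a field because 7 is prime "
            "and every nonzero element has an inverse")
q_clean = "which of the following groups is cyclic of order twelve under addition modulo twelve"

corpus_grams = ngram_set(corpus)
print(is_contaminated(q_leaked, corpus_grams))  # True
print(is_contaminated(q_clean, corpus_grams))   # False
```

Real contamination audits scale this up with hashing or Bloom filters over terabyte-scale corpora, but the detection principle is the same.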

Subject-level MMLU scores reveal capability profiles that aggregate scores obscure. A model might score 85% on Clinical Knowledge and 60% on Formal Logic, or 90% on High School Mathematics and 55% on Moral Scenarios. For applications targeting a specific domain, subject-level analysis identifies whether the model has sufficient knowledge depth for the task. Models with high aggregate MMLU scores can have significant gaps in specific subfields that matter for particular deployment contexts.

# Evaluate model on MMLU using lm-evaluation-harness
# pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B \
  --tasks mmlu \
  --num_fewshot 5 \
  --output_path ./results/llama3-8b-mmlu.json

Few-shot prompting for MMLU evaluation uses 5 in-context examples per subject to guide the model's response format. These examples are drawn from a separate development set (not the test set) and are prepended to each evaluation question. The 5-shot format is the standard configuration reported in most papers, but performance at 0-shot (no examples) has become increasingly relevant as instruction-tuned models improve at following the multiple-choice format without explicit examples. Reporting both 0-shot and 5-shot scores provides a more complete picture of model capability and instruction-following quality.
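
The only difference between the two configurations is whether answered development examples are prepended; a sketch of a prompt builder covering both (`build_prompt` is illustrative, not a harness API):

```python
def build_prompt(item, dev_examples=None):
    """Format an MMLU question 0-shot (dev_examples=None) or k-shot."""
    letters = "ABCD"

    def fmt(q, with_answer):
        lines = [f"Question: {q['question']}"]
        lines += [f"{letters[i]}. {c}" for i, c in enumerate(q["choices"])]
        lines.append(f"Answer: {letters[q['answer']]}" if with_answer else "Answer:")
        return "\n".join(lines)

    shots = [fmt(ex, True) for ex in (dev_examples or [])]
    return "\n\n".join(shots + [fmt(item, False)])

q = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}
dev = [{"question": "1 + 1 = ?", "choices": ["1", "2", "3", "4"], "answer": 1}]
print(build_prompt(q))        # 0-shot: just the bare question
print(build_prompt(q, dev))   # 1-shot here; the MMLU standard uses 5
```

Base models usually need the shots to learn the "Answer: X" format, while instruction-tuned models often follow it 0-shot, which is exactly why reporting both settings is informative.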

MMLU subject groupings reveal which knowledge domains are most and least represented. The professional categories (Professional Medicine, Professional Law, Professional Accounting, Professional Psychology) have more questions than general STEM or humanities subjects, giving them disproportionate weight in subject-averaged scores. A model that excels at professional knowledge may appear stronger on MMLU than a model that excels at fundamental STEM reasoning, even if the latter is more useful for most production applications. Weighted scoring that adjusts for subject representation provides a more calibrated capability estimate for specific deployment contexts.

MMLU macro-average versus micro-average scoring produces different rankings for models with uneven subject performance. Macro-average weights each subject equally; micro-average weights each question equally, which gives more weight to subjects with more questions. Since professional categories have more questions than some academic subjects, micro-average scores tend to favor models strong on professional knowledge. Reporting both averages alongside the standard subject-grouped breakdown provides the most complete picture of a model's academic knowledge distribution.
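
The two averages are easy to compute side by side from per-subject counts; a sketch with illustrative question counts (Professional Law is genuinely MMLU's largest subject, though the exact numbers below are made up):

```python
def macro_micro(per_subject: dict):
    """per_subject maps subject -> (correct, total). Returns (macro, micro)."""
    accs = [c / t for c, t in per_subject.values()]
    macro = sum(accs) / len(accs)            # each subject weighted equally
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(t for _, t in per_subject.values())
    micro = total_correct / total_questions  # each question weighted equally
    return macro, micro

per_subject = {
    "professional_law": (1000, 1500),  # 66.7% on a large subject
    "abstract_algebra": (90, 100),     # 90.0% on a small subject
}
macro, micro = macro_micro(per_subject)
print(f"macro={macro:.1%} micro={micro:.1%}")  # micro is pulled toward professional_law
```

Here macro lands at 78.3% while micro lands at 68.1%: the 1,500-question subject dominates the micro average, illustrating why the two can rank models differently.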