Evaluation

MMLU Benchmark

Massive Multitask Language Understanding: 57-subject multiple-choice benchmark covering STEM, humanities, and social sciences. The de facto standard for measuring LLM knowledge breadth.

- 57 subjects (STEM, humanities, social sciences, other)
- 14,079 test questions
- 5-shot prompt format
- Random baseline: 25% (4 choices)

SECTION 01

What MMLU measures

MMLU (Massive Multitask Language Understanding, Hendrycks et al. 2020) tests a model's world knowledge and problem-solving ability across 57 subjects. Questions range from high-school level (US history, basic maths) to professional level (medical licensing, legal bar exam, college physics). The benchmark is designed to test knowledge that a model can only acquire through extensive pretraining — not reasoning skills alone.

It became the standard comparison point after GPT-4's 86.4% score in 2023 approached the ~89.8% accuracy that Hendrycks et al. estimated for human domain experts, making headlines and establishing MMLU as the go-to benchmark for new model releases.

SECTION 02

Benchmark structure

Each of 57 subjects has a development set (5 examples used for few-shot prompting) and a test set (question + 4 answer choices, one correct). The standard evaluation uses 5-shot prompting: prepend the 5 development examples to each test question. The model must output A, B, C, or D.

Subject categories: STEM (abstract algebra, college chemistry, high school physics...), humanities (philosophy, world history, jurisprudence...), social sciences (econometrics, political science, sociology...), other (clinical knowledge, medical genetics, nutrition...).

SECTION 03

Running MMLU evaluation in Python

from datasets import load_dataset
import openai, re

client = openai.OpenAI()

def evaluate_mmlu_subject(subject: str, n_examples: int = 100) -> float:
    dataset = load_dataset("cais/mmlu", subject)
    dev = list(dataset["dev"])    # 5 few-shot examples
    test = list(dataset["test"])  # test questions

    # Build few-shot prompt
    def format_question(item, include_answer=False):
        choices = "ABCD"
        q = f"Question: {item['question']}\n"
        for i, choice in enumerate(item["choices"]):
            q += f"{choices[i]}. {choice}\n"
        if include_answer:
            q += f"Answer: {choices[item['answer']]}\n\n"
        else:
            q += "Answer:"
        return q

    few_shot = "".join(format_question(ex, include_answer=True) for ex in dev)

    batch = test[:n_examples]
    correct = 0
    for item in batch:
        prompt = few_shot + format_question(item)
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1, temperature=0.0,
        )
        # Extract the first A-D letter; the model may emit a leading space or a
        # lowercase letter, so a regex is more robust than comparing the raw
        # completion directly.
        m = re.search(r"[ABCD]", (resp.choices[0].message.content or "").upper())
        if m and m.group(0) == "ABCD"[item["answer"]]:
            correct += 1

    # Divide by the number of questions actually evaluated, not n_examples,
    # in case the subject has fewer than n_examples test questions.
    return correct / len(batch)

score = evaluate_mmlu_subject("abstract_algebra")
print(f"Abstract algebra: {score:.1%}")
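
Subsampling 100 of a subject's test questions keeps API costs down but adds sampling noise worth quantifying before comparing models. A minimal sketch of a binomial confidence interval for a measured accuracy (`accuracy_ci` is an illustrative helper, not part of any harness):

```python
import math

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    """Normal-approximation 95% confidence interval for an accuracy estimate."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)

# With 100 questions, a measured 65% accuracy is only pinned down
# to within about nine points either way:
lo, hi = accuracy_ci(65, 100)
print(f"65/100 correct -> 95% CI ({lo:.1%}, {hi:.1%})")
```

At this sample size, two models a few points apart on one subject are statistically indistinguishable; evaluate the full test split before drawing conclusions.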
SECTION 04

Score interpretation

Reference scores (5-shot, as of early 2025):

Random baseline   25.0%
Mistral-7B        64.2%
Llama-3-8B        66.6%
GPT-4 (2023)      86.4%
Claude 3.5        88.3%
GPT-4o            88.7%

An MMLU score is not really a single number: aggregate scores hide large variance across subjects. A model can score 95% on high-school history while scoring 55% on formal logic. Always report per-category breakdowns for meaningful comparisons.
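
One way to produce such a breakdown is to average per-subject accuracies within each category. A sketch, with a deliberately partial subject-to-category mapping and illustrative scores (the real benchmark has 57 subjects):

```python
from collections import defaultdict

# Illustrative subset of the 57-subject category mapping.
SUBJECT_CATEGORY = {
    "abstract_algebra": "STEM", "high_school_physics": "STEM",
    "philosophy": "humanities", "world_history": "humanities",
    "econometrics": "social sciences", "sociology": "social sciences",
}

def category_breakdown(subject_scores: dict) -> dict:
    """Average per-subject accuracies within each category."""
    buckets = defaultdict(list)
    for subject, acc in subject_scores.items():
        buckets[SUBJECT_CATEGORY.get(subject, "other")].append(acc)
    return {cat: sum(v) / len(v) for cat, v in buckets.items()}

scores = {"abstract_algebra": 0.55, "high_school_physics": 0.62,
          "philosophy": 0.81, "world_history": 0.95, "sociology": 0.78}
print(category_breakdown(scores))
```

With these illustrative numbers the STEM average (58.5%) sits thirty points below the humanities average (88.0%), exactly the kind of gap a single aggregate score would hide.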

SECTION 05

Limitations and criticisms

MMLU has attracted significant criticism as frontier models approached human-level scores:

- Data contamination: questions and answer keys circulate on exam-prep sites that end up in pretraining crawls, so high scores may reflect memorization rather than capability (see Section 07).
- Question errors: manual re-annotation efforts such as MMLU-Redux found a meaningful fraction of questions with wrong answer keys, ambiguous wording, or missing context.
- Saturation: frontier models now cluster in the high 80s, leaving little headroom to distinguish them.
- Format sensitivity: reported scores shift with prompt template, answer-choice ordering, and the 0-shot vs 5-shot configuration.

SECTION 06

MMLU variants

Several successors address MMLU's limitations:

- MMLU-Pro: harder, more reasoning-heavy questions with ten answer choices instead of four, lowering the guessing baseline and pushing back saturation.
- MMLU-Redux: a manually re-verified subset of MMLU with corrected answer keys, filtering out erroneous questions.
- Multilingual variants (e.g. Global-MMLU): translated and localized versions for testing knowledge beyond English.

# Compare model scores across MMLU variants using lm-evaluation-harness
# pip install lm-eval

# CLI: evaluate a model on MMLU and MMLU-Pro
# lm_eval --model hf --model_args pretrained=mistralai/Mistral-7B-v0.1 \
#         --tasks mmlu,mmlu_pro --num_fewshot 5 --output_path results/

import json, subprocess

def run_eval(model_name, tasks=("mmlu", "mmlu_pro"), shots=5):
    cmd = [
        "lm_eval", "--model", "hf",
        "--model_args", f"pretrained={model_name}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(shots),
        "--output_path", "results/",
        "--log_samples",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"lm_eval failed:\n{result.stderr}")
    return result

# Parse results
def parse_mmlu_results(results_dir):
    import glob
    results = {}
    for path in glob.glob(f"{results_dir}/**/*.json", recursive=True):
        with open(path) as fh:
            data = json.load(fh)
        for task, metrics in data.get("results", {}).items():
            if "mmlu" in task:
                # lm-eval >= 0.4 keys metrics as "acc,none"; older versions use "acc"
                results[task] = metrics.get("acc,none", metrics.get("acc"))
    return results

# Typical score ranges (5-shot, 2024 models):
# GPT-4o:        MMLU=88.7%  MMLU-Pro=72.6%
# Claude 3.5:    MMLU=88.3%  MMLU-Pro=~70%
# Mistral-7B:    MMLU=64.2%  MMLU-Pro=~36%
# Llama-3-8B:    MMLU=66.6%  MMLU-Pro=~41%

SECTION 07

Gotchas


MMLU (Massive Multitask Language Understanding) evaluates language models across 57 academic subjects spanning STEM, humanities, social sciences, and professional domains. Each subject contains multiple-choice questions derived from practice exams, textbooks, and standardized tests, making MMLU a broad measure of the factual knowledge and reasoning capability encoded in a model's parameters.

Score Range   Interpretation                  Comparable Human Level
25–40%        Near-random (chance = 25%)      Below layperson
40–60%        Basic comprehension             General public
60–75%        Solid knowledge                 College graduate
75–85%        Expert-level in many subjects   Domain expert average
85%+          Near-expert across domains      Top domain experts
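
The banding above is easy to encode for dashboards or regression tests; a minimal sketch (`interpret_mmlu` is a hypothetical helper, and the labels are taken directly from the table):

```python
def interpret_mmlu(score: float) -> str:
    """Map an aggregate MMLU accuracy to the interpretation bands above."""
    bands = [
        (0.85, "Near-expert across domains"),
        (0.75, "Expert-level in many subjects"),
        (0.60, "Solid knowledge"),
        (0.40, "Basic comprehension"),
        (0.25, "Near-random"),
    ]
    for threshold, label in bands:
        if score >= threshold:
            return label
    return "Below chance: check the harness for answer-parsing bugs"

print(interpret_mmlu(0.887))  # a GPT-4o-level aggregate score
```

A sustained score below the 25% chance floor almost always signals a broken answer extractor rather than a weak model, hence the final branch.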

MMLU score inflation from data contamination is a significant concern when comparing models. If a model's pre-training corpus includes MMLU questions or their answer keys — even indirectly through web crawls of exam prep sites — the benchmark scores may reflect memorization rather than generalization. Newer contamination-controlled benchmarks like MMLU-Pro (harder questions) and MMLU-Redux (manually verified questions) are increasingly used alongside standard MMLU to provide more reliable capability estimates.
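
A common first-pass contamination screen is word n-gram overlap between benchmark questions and the pretraining corpus. A minimal sketch, assuming the corpus has been reduced to a set of n-grams (the strings and helper names here are illustrative):

```python
def ngram_set(text: str, n: int = 8) -> set:
    """All lowercase word n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_ngrams: set, n: int = 8) -> bool:
    """Flag a benchmark question if any of its n-grams appears in the corpus."""
    return bool(ngram_set(question, n) & corpus_ngrams)

# Toy corpus standing in for a crawled exam-prep page.
corpus = ("practice exam: the ring Z/7Z is a field because 7 is prime "
          "and every nonzero element has an inverse")
q_leaked = ("the ring Z/7Z is a field because 7 is prime "
            "and every nonzero element has an inverse")
q_clean = "which of the following groups is cyclic of order twelve under addition modulo twelve"

corpus_grams = ngram_set(corpus)
print(is_contaminated(q_leaked, corpus_grams))  # True
print(is_contaminated(q_clean, corpus_grams))   # False
```

Real contamination audits scale this up with hashing or Bloom filters over terabyte-scale corpora, but the detection principle is the same.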

Subject-level MMLU scores reveal capability profiles that aggregate scores obscure. A model might score 85% on Clinical Knowledge and 60% on Formal Logic, or 90% on High School Mathematics and 55% on Moral Scenarios. For applications targeting a specific domain, subject-level analysis identifies whether the model has sufficient knowledge depth for the task. Models with high aggregate MMLU scores can have significant gaps in specific subfields that matter for particular deployment contexts.

# Evaluate model on MMLU using lm-evaluation-harness
# pip install lm-eval

lm_eval --model hf \
  --model_args pretrained=meta-llama/Meta-Llama-3-8B \
  --tasks mmlu \
  --num_fewshot 5 \
  --output_path ./results/llama3-8b-mmlu.json

Few-shot prompting for MMLU evaluation uses 5 in-context examples per subject to guide the model's response format. These examples are drawn from a separate development set (not the test set) and are prepended to each evaluation question. The 5-shot format is the standard configuration reported in most papers, but performance at 0-shot (no examples) has become increasingly relevant as instruction-tuned models improve at following the multiple-choice format without explicit examples. Reporting both 0-shot and 5-shot scores provides a more complete picture of model capability and instruction-following quality.
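
The only difference between the two configurations is whether answered development examples are prepended; a sketch of a prompt builder covering both (`build_prompt` is illustrative, not a harness API):

```python
def build_prompt(item, dev_examples=None):
    """Format an MMLU question 0-shot (dev_examples=None) or k-shot."""
    letters = "ABCD"

    def fmt(q, with_answer):
        lines = [f"Question: {q['question']}"]
        lines += [f"{letters[i]}. {c}" for i, c in enumerate(q["choices"])]
        lines.append(f"Answer: {letters[q['answer']]}" if with_answer else "Answer:")
        return "\n".join(lines)

    shots = [fmt(ex, True) for ex in (dev_examples or [])]
    return "\n\n".join(shots + [fmt(item, False)])

q = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}
dev = [{"question": "1 + 1 = ?", "choices": ["1", "2", "3", "4"], "answer": 1}]
print(build_prompt(q))        # 0-shot: just the bare question
print(build_prompt(q, dev))   # 1-shot here; the MMLU standard uses 5
```

Base models usually need the shots to learn the "Answer: X" format, while instruction-tuned models often follow it 0-shot, which is exactly why reporting both settings is informative.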

MMLU subject groupings reveal which knowledge domains are most and least represented. The professional categories (Professional Medicine, Professional Law, Professional Accounting, Professional Psychology) have more questions than general STEM or humanities subjects, giving them disproportionate weight in subject-averaged scores. A model that excels at professional knowledge may appear stronger on MMLU than a model that excels at fundamental STEM reasoning, even if the latter is more useful for most production applications. Weighted scoring that adjusts for subject representation provides a more calibrated capability estimate for specific deployment contexts.

MMLU macro-average versus micro-average scoring produces different rankings for models with uneven subject performance. Macro-average weights each subject equally; micro-average weights each question equally, which gives more weight to subjects with more questions. Since professional categories have more questions than some academic subjects, micro-average scores tend to favor models strong on professional knowledge. Reporting both averages alongside the standard subject-grouped breakdown provides the most complete picture of a model's academic knowledge distribution.
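
The two averages are easy to compute side by side from per-subject counts; a sketch with illustrative question counts (Professional Law is genuinely MMLU's largest subject, though the exact numbers below are made up):

```python
def macro_micro(per_subject: dict):
    """per_subject maps subject -> (correct, total). Returns (macro, micro)."""
    accs = [c / t for c, t in per_subject.values()]
    macro = sum(accs) / len(accs)            # each subject weighted equally
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(t for _, t in per_subject.values())
    micro = total_correct / total_questions  # each question weighted equally
    return macro, micro

per_subject = {
    "professional_law": (1000, 1500),  # 66.7% on a large subject
    "abstract_algebra": (90, 100),     # 90.0% on a small subject
}
macro, micro = macro_micro(per_subject)
print(f"macro={macro:.1%} micro={micro:.1%}")  # micro is pulled toward professional_law
```

Here macro lands at 78.3% while micro lands at 68.1%: the 1,500-question subject dominates the micro average, illustrating why the two can rank models differently.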