164 hand-written Python programming problems with unit tests. Measures functional code correctness with the pass@k metric. A standard benchmark for LLM coding ability since its introduction alongside Codex in 2021.
HumanEval (Chen et al. 2021, OpenAI) consists of 164 hand-written Python programming problems. Each problem has a function signature, docstring, and a set of held-out unit tests. The model must complete the function body. Problems range from simple string manipulation to algorithmic tasks (sorting, search, dynamic programming). It tests whether a model can write functionally correct code — not just code that looks plausible.
Unlike MMLU (which tests recall), HumanEval is graded by execution: the model's code is run against the unit tests, and it passes or fails deterministically. This makes it one of the cleaner benchmarks in terms of ground truth.
pass@k measures the probability that at least one of k independent samples from the model passes all unit tests. For pass@1, you generate one completion and check if it passes. For pass@10, you generate 10 samples and the problem "passes" if any of them do.
The unbiased estimator used in the original paper avoids running all n samples for every problem:
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = total samples, c = correct samples, k = budget
    # P(at least one of k passes) = 1 - P(all k fail)
    if n - c < k:
        return 1.0
    return 1.0 - np.prod([(n - c - i) / (n - i) for i in range(k)])

# Example: generated 20 samples, 8 passed
print(pass_at_k(n=20, c=8, k=1))   # 0.40
print(pass_at_k(n=20, c=8, k=10))  # ≈ 0.9996 — nearly certain with 10 tries
```
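The product form above is algebraically equal to the closed form 1 - C(n-c, k)/C(n, k). A quick sanity check with `math.comb` (the function name here is ours, not from the paper):

```python
from math import comb

def pass_at_k_closed_form(n: int, c: int, k: int) -> float:
    # pass@k = 1 - C(n-c, k) / C(n, k); comb(m, k) is 0 when m < k,
    # which automatically yields 1.0 when fewer than k samples can fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

for n, c, k in [(20, 8, 1), (20, 2, 10), (50, 5, 25)]:
    print(f"n={n}, c={c}, k={k}: {pass_at_k_closed_form(n, c, k):.4f}")
```

The closed form is exact but can hit huge binomial coefficients for large n; the product form sidesteps that, which is why the paper uses it.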
```python
from datasets import load_dataset
import openai, subprocess, tempfile, os

client = openai.OpenAI()
dataset = load_dataset("openai_humaneval")["test"]

def generate_completion(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Complete this Python function:\n\n{prompt}"}],
        max_tokens=512, temperature=0.2,
        stop=["\ndef ", "\nclass "],
    )
    # Note: chat models often wrap code in markdown fences or restate the
    # signature; a robust harness should strip those before grading.
    return resp.choices[0].message.content

def run_tests(full_code: str, test_code: str, entry_point: str) -> bool:
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        # HumanEval's test code defines check(candidate); call it with the
        # problem's entry-point function name.
        f.write(full_code + "\n\n" + test_code + f"\ncheck({entry_point})")
        fname = f.name
    try:
        result = subprocess.run(["python", fname], timeout=5,
                                capture_output=True, text=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(fname)

# Evaluate on first 10 problems
correct = 0
for problem in list(dataset)[:10]:
    completion = generate_completion(problem["prompt"])
    full_code = problem["prompt"] + completion
    passed = run_tests(full_code, problem["test"], problem["entry_point"])
    correct += int(passed)

print(f"pass@1 estimate: {correct / 10:.1%}")
```
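The loop above draws a single completion per problem, which only supports a pass@1 estimate. To report pass@k for larger k, generate several completions per problem, count how many pass, and feed the counts through the unbiased estimator. A self-contained sketch, with hypothetical per-problem counts standing in for real generation runs:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator from the HumanEval paper
    if n - c < k:
        return 1.0
    return 1.0 - np.prod([(n - c - i) / (n - i) for i in range(k)])

# Hypothetical per-problem results: out of n_samples completions per problem,
# how many passed the tests. In practice these counts come from a generation
# loop like the one above, run with several samples per problem.
n_samples = 20
correct_counts = [20, 14, 0, 3, 20, 8, 0, 1, 17, 5]

for k in (1, 5, 10):
    score = float(np.mean([pass_at_k(n_samples, c, k) for c in correct_counts]))
    print(f"pass@{k}: {score:.3f}")
```

Averaging the per-problem estimates gives the benchmark-level score; pass@1 under this scheme is simply the mean fraction of passing samples.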
Reference pass@1 scores (zero-shot or with chain-of-thought) are summarised in the table below.
Scores above ~85% are approaching saturation — differences between frontier models at this level may not reflect real-world coding ability meaningfully.
HumanEval has well-known contamination problems. The 164 problems are publicly available, and models trained after 2021 have almost certainly seen them. Pass@1 scores of 90%+ on frontier models likely reflect memorisation as much as generalisation.
The benchmark is also narrow: 164 Python problems covering basic algorithms. It doesn't test: multi-file projects, debugging, reading existing code, API usage, or the kinds of programming tasks that actually matter in production. A model that scores 90% on HumanEval may still struggle with real-world coding tasks.
HumanEval scores are not directly comparable between models evaluated with different sampling parameters. At temperature 0.8, each individual sample is less likely to pass than a greedy (temperature 0) completion, but drawing many samples enables pass@k estimates for k > 1, where diversity pays off. Most published pass@1 figures are unbiased estimates computed from n temperature-sampled completions rather than from a single greedy run, so comparing one model's greedy pass@1 against another model's sampled pass@1 conflates decoding strategy with model capability.
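For intuition about why sampling helps: if each completion passed independently with per-sample probability p, pass@k would be 1 - (1 - p)^k. Real samples from one model are correlated, so this is an idealisation rather than the reported estimator, but it shows how a modest per-sample pass rate compounds:

```python
def independent_pass_at_k(p: float, k: int) -> float:
    # Idealised pass@k if each of k samples passed independently with prob p
    return 1.0 - (1.0 - p) ** k

for p in (0.2, 0.4):
    print(f"p={p}: pass@1={independent_pass_at_k(p, 1):.2f}, "
          f"pass@10={independent_pass_at_k(p, 10):.2f}")
```

Even p = 0.2 compounds to roughly 0.89 at k = 10 under independence, which is why pass@10 headlines can look far stronger than single-shot reliability.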
| Model tier | HumanEval pass@1 | Example models |
|---|---|---|
| State of the art | >85% | GPT-4, Claude 3.5, Gemini 1.5 Pro |
| Strong | 70–85% | GPT-3.5-turbo, Codestral |
| Capable | 50–70% | CodeLlama-34B, DeepSeek-Coder |
| Limited | <50% | Small base models, no code tuning |
```python
from datasets import load_dataset
from human_eval.data import write_jsonl
from human_eval.evaluation import evaluate_functional_correctness

# Load HumanEval dataset
dataset = load_dataset("openai_humaneval")["test"]

# Generate completions for each problem
def generate_completion(prompt: str) -> str:
    # Replace with your model call
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Complete this Python function:\n\n{prompt}"}],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Save completions in the JSONL format the official harness expects
samples = [{"task_id": ex["task_id"],
            "completion": generate_completion(ex["prompt"])}
           for ex in dataset]
write_jsonl("samples.jsonl", samples)

# Evaluate pass@1 (the harness takes a sample file path, not a list)
results = evaluate_functional_correctness("samples.jsonl", k=[1])
print(f"pass@1: {results['pass@1']:.3f}")
```
HumanEval's 164 problems span algorithmic tasks (list reversal, prime checking, Fibonacci), string manipulation, and mathematics, intentionally avoiding problems that require external libraries. This scope tests fundamental programming ability but misses practical challenges: handling edge cases in real APIs, writing production-grade error handling, or integrating with complex data structures. Extensions such as HumanEval+ (which augments each problem with many additional test cases, catching solutions that slip past the original sparse tests) and custom domain-specific benchmarks fill some of the gaps. When evaluating code models for production, teams construct supplementary benchmarks aligned with their use cases: if the model will write SQL, benchmark SQL generation explicitly; if it will write test cases, include a benchmark of test quality (coverage, mutation killing). The pass@k metric (whether any of k independent generations solves the problem) also reveals variance in output quality: a model might pass 30% of the time at pass@1 but 90% at pass@10, indicating useful sampling diversity but low single-shot reliability. Understanding this distribution informs deployment decisions: should the application return a single generation, or multiple samples for human review?
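As a sketch of one such supplementary benchmark, the execution-based grading idea transfers directly to SQL: run the generated query against an in-memory fixture database and compare result sets. The schema, fixture rows, and queries below are hypothetical:

```python
import sqlite3

def grade_sql(generated_sql: str, reference_sql: str, fixture_rows) -> bool:
    # Execution-based grading: two queries are "equal" if they return the
    # same rows against the same fixture data (order-insensitive here).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", fixture_rows)
    try:
        got = sorted(conn.execute(generated_sql).fetchall())
    except sqlite3.Error:
        return False  # syntax errors and bad column names count as failures
    want = sorted(conn.execute(reference_sql).fetchall())
    return got == want

fixture = [(1, 9.5, "paid"), (2, 20.0, "paid"), (3, 5.0, "refunded")]
print(grade_sql(
    "SELECT id FROM orders WHERE status = 'paid'",  # model output (hypothetical)
    "SELECT id FROM orders WHERE status = 'paid'",  # reference answer
    fixture,
))  # → True
```

Comparing result sets rather than SQL text tolerates stylistic differences (aliasing, join order) while still grading on what the query actually returns.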
HumanEval evaluation requires executing generated Python code, which is inherently risky when that code contains bugs or infinite loops. The standard evaluation harness uses subprocess isolation with strict timeouts (typically 5 seconds) to kill runaway processes. For production deployments evaluating user- or model-generated code, sandboxing becomes critical: a malicious or pathological input can make an exponential-time algorithm consume all system resources. Common mitigations include lightweight containers (Docker) for stronger isolation, restrictive syscall filters (seccomp) preventing file and network access, and resource limits (cgroups) capping CPU and memory per execution. Even with these measures, crafted denial-of-service inputs can still burn the full timeout on every run, slowing evaluation. More advanced evaluation systems add fuzzing-style checks: generating diverse inputs for each submission automatically and watching for crashes, timeouts, or incorrect answers. This catches bugs that a single test case misses and reveals performance characteristics (does the algorithm degrade on large inputs?).
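On POSIX systems, a timeout-based runner like the one shown earlier can be hardened one step further with per-process resource limits from Python's standard `resource` module. This is a sketch with illustrative limit values, not a substitute for container- or seccomp-level isolation:

```python
import resource
import subprocess
import sys

def limit_resources():
    # Runs in the child process just before exec: cap CPU time and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                     # 5 s CPU
    resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))  # 512 MiB

def run_limited(code: str, wall_timeout: float = 10.0) -> bool:
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            preexec_fn=limit_resources,  # POSIX only
            capture_output=True, timeout=wall_timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

print(run_limited("print('ok')"))       # passes normally
print(run_limited("while True: pass"))  # killed by the CPU cap → False
```

The CPU limit catches busy loops even when the wall-clock timeout is generous, and the address-space cap stops memory bombs before the host starts swapping.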
HumanEval pass rate correlates imperfectly with real-world code quality: passing HumanEval requires correct logic but guarantees nothing about readability, maintainability, or efficiency. Production systems therefore measure additional signals: code length (correct solutions tend toward reasonable concision), cyclomatic complexity (simpler control flow correlates with correctness), and compliance with style guides. Some teams augment HumanEval with rubrics: reviewers manually grade a random sample of generated solutions on clarity, efficiency, and adherence to requirements. Rubrics identify failure modes that automated metrics miss (e.g., a correct algorithm with variable names so cryptic the code is unmaintainable). Evaluation studies such as the Codex paper and its follow-ups report how these metrics relate to human preference, suggesting that a combination of pass rate (correctness), code length, and human review captures code quality better than any single metric. This multi-signal approach enables more confident model selection without manually reviewing every output.
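Two of those automated signals, length and a crude branch count standing in for cyclomatic complexity, are cheap to compute with Python's standard `ast` module. The node set counted here is our choice for illustration, not an established definition:

```python
import ast

# Control-flow constructs counted as "branches" (a rough complexity proxy)
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.BoolOp)

def code_signals(source: str) -> dict:
    tree = ast.parse(source)
    branches = sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
    lines = len([ln for ln in source.splitlines() if ln.strip()])
    return {"lines": lines, "branches": branches}

sample = """
def first_even(xs):
    for x in xs:
        if x % 2 == 0:
            return x
    return None
"""
print(code_signals(sample))  # {'lines': 5, 'branches': 2}
```

Tracked across a batch of generations, these signals flag outliers (suspiciously long or convoluted solutions) worth routing to human review even when the tests pass.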