164 hand-written Python programming problems with unit tests. Measures functional code correctness with the pass@k metric. A standard benchmark for LLM coding ability since its introduction alongside Codex in 2021.
HumanEval (Chen et al. 2021, OpenAI) consists of 164 hand-written Python programming problems. Each problem has a function signature, docstring, and a set of held-out unit tests. The model must complete the function body. Problems range from simple string manipulation to algorithmic tasks (sorting, search, dynamic programming). It tests whether a model can write functionally correct code — not just code that looks plausible.
Unlike MMLU (which tests recall), HumanEval is graded by execution: the model's code is run against the unit tests, and it passes or fails deterministically. This makes it one of the cleaner benchmarks in terms of ground truth.
pass@k measures the probability that at least one of k independent samples from the model passes all unit tests. For pass@1, you generate one completion and check if it passes. For pass@10, you generate 10 samples and the problem "passes" if any of them do.
The unbiased estimator used in the original paper avoids running all n samples for every problem:
```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # n = total samples, c = correct samples, k = budget
    # P(at least one of k passes) = 1 - P(all k fail)
    if n - c < k:
        return 1.0
    return 1.0 - np.prod([(n - c - i) / (n - i) for i in range(k)])

# Example: generated 20 samples, 8 passed
print(pass_at_k(n=20, c=8, k=1))   # 0.40
print(pass_at_k(n=20, c=8, k=10))  # ≈ 0.9996 — nearly certain with 10 tries
```
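The product form above is algebraically equal to the closed form 1 - C(n-c, k)/C(n, k). A quick sanity check with `math.comb` (the function name here is ours, not from the paper):

```python
from math import comb

def pass_at_k_closed_form(n: int, c: int, k: int) -> float:
    # pass@k = 1 - C(n-c, k) / C(n, k); comb(m, k) is 0 when m < k,
    # which automatically yields 1.0 when fewer than k samples can fail.
    return 1.0 - comb(n - c, k) / comb(n, k)

for n, c, k in [(20, 8, 1), (20, 2, 10), (50, 5, 25)]:
    print(f"n={n}, c={c}, k={k}: {pass_at_k_closed_form(n, c, k):.4f}")
```

The closed form is exact but can hit huge binomial coefficients for large n; the product form sidesteps that, which is why the paper uses it.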
```python
from datasets import load_dataset
import openai, subprocess, tempfile, os

client = openai.OpenAI()
dataset = load_dataset("openai_humaneval")["test"]

def generate_completion(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Complete this Python function:\n\n{prompt}"}],
        max_tokens=512, temperature=0.2,
        stop=["\ndef ", "\nclass "],
    )
    # Note: chat models often wrap code in markdown fences or restate the
    # signature; a robust harness should strip those before grading.
    return resp.choices[0].message.content

def run_tests(full_code: str, test_code: str, entry_point: str) -> bool:
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        # HumanEval's test code defines check(candidate); call it with the
        # problem's entry-point function name.
        f.write(full_code + "\n\n" + test_code + f"\ncheck({entry_point})")
        fname = f.name
    try:
        result = subprocess.run(["python", fname], timeout=5,
                                capture_output=True, text=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(fname)

# Evaluate on first 10 problems
correct = 0
for problem in list(dataset)[:10]:
    completion = generate_completion(problem["prompt"])
    full_code = problem["prompt"] + completion
    passed = run_tests(full_code, problem["test"], problem["entry_point"])
    correct += int(passed)

print(f"pass@1 estimate: {correct / 10:.1%}")
```
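The loop above draws a single completion per problem, which only supports a pass@1 estimate. To report pass@k for larger k, generate several completions per problem, count how many pass, and feed the counts through the unbiased estimator. A self-contained sketch, with hypothetical per-problem counts standing in for real generation runs:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator from the HumanEval paper
    if n - c < k:
        return 1.0
    return 1.0 - np.prod([(n - c - i) / (n - i) for i in range(k)])

# Hypothetical per-problem results: out of n_samples completions per problem,
# how many passed the tests. In practice these counts come from a generation
# loop like the one above, run with several samples per problem.
n_samples = 20
correct_counts = [20, 14, 0, 3, 20, 8, 0, 1, 17, 5]

for k in (1, 5, 10):
    score = float(np.mean([pass_at_k(n_samples, c, k) for c in correct_counts]))
    print(f"pass@{k}: {score:.3f}")
```

Averaging the per-problem estimates gives the benchmark-level score; pass@1 under this scheme is simply the mean fraction of passing samples.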
Reference pass@1 scores (zero-shot or with chain-of-thought) are summarised in the table below.
Scores above ~85% are approaching saturation — differences between frontier models at this level may not reflect real-world coding ability meaningfully.
HumanEval has well-known contamination problems. The 164 problems are publicly available, and models trained after 2021 have almost certainly seen them. Pass@1 scores of 90%+ on frontier models likely reflect memorisation as much as generalisation.
The benchmark is also narrow: 164 Python problems covering basic algorithms. It doesn't test: multi-file projects, debugging, reading existing code, API usage, or the kinds of programming tasks that actually matter in production. A model that scores 90% on HumanEval may still struggle with real-world coding tasks.
HumanEval scores are not directly comparable between models evaluated with different sampling parameters. At temperature 0.8, each individual sample is less likely to pass than a greedy (temperature 0) completion, but drawing many samples enables pass@k estimates for k > 1, where diversity pays off. Most published pass@1 figures are unbiased estimates computed from n temperature-sampled completions rather than from a single greedy run, so comparing one model's greedy pass@1 against another model's sampled pass@1 conflates decoding strategy with model capability.
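For intuition about why sampling helps: if each completion passed independently with per-sample probability p, pass@k would be 1 - (1 - p)^k. Real samples from one model are correlated, so this is an idealisation rather than the reported estimator, but it shows how a modest per-sample pass rate compounds:

```python
def independent_pass_at_k(p: float, k: int) -> float:
    # Idealised pass@k if each of k samples passed independently with prob p
    return 1.0 - (1.0 - p) ** k

for p in (0.2, 0.4):
    print(f"p={p}: pass@1={independent_pass_at_k(p, 1):.2f}, "
          f"pass@10={independent_pass_at_k(p, 10):.2f}")
```

Even p = 0.2 compounds to roughly 0.89 at k = 10 under independence, which is why pass@10 headlines can look far stronger than single-shot reliability.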
| Model tier | HumanEval pass@1 | Example models |
|---|---|---|
| State of the art | >85% | GPT-4, Claude 3.5, Gemini 1.5 Pro |
| Strong | 70–85% | GPT-3.5-turbo, Codestral |
| Capable | 50–70% | CodeLlama-34B, DeepSeek-Coder |
| Limited | <50% | Small base models, no code tuning |
```python
from datasets import load_dataset
from human_eval.data import write_jsonl
from human_eval.evaluation import evaluate_functional_correctness

# Load HumanEval dataset
dataset = load_dataset("openai_humaneval")["test"]

# Generate completions for each problem
def generate_completion(prompt: str) -> str:
    # Replace with your model call
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Complete this Python function:\n\n{prompt}"}],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Save completions in the JSONL format the official harness expects
samples = [{"task_id": ex["task_id"],
            "completion": generate_completion(ex["prompt"])}
           for ex in dataset]
write_jsonl("samples.jsonl", samples)

# Evaluate pass@1 (the harness takes a sample file path, not a list)
results = evaluate_functional_correctness("samples.jsonl", k=[1])
print(f"pass@1: {results['pass@1']:.3f}")
```
HumanEval's 164 problems span algorithmic tasks (list reversal, prime checking, Fibonacci), string manipulation, and mathematics, intentionally avoiding problems that require external libraries. This scope tests fundamental programming ability but misses practical challenges: handling edge cases in real APIs, writing production-grade error handling, or integrating with complex data structures. Extensions such as HumanEval+ (which augments each problem with many additional test cases, catching solutions that slip past the original sparse tests) and custom domain-specific benchmarks fill some of the gaps. When evaluating code models for production, teams construct supplementary benchmarks aligned with their use cases: if the model will write SQL, benchmark SQL generation explicitly; if it will write test cases, include a benchmark of test quality (coverage, mutation killing). The pass@k metric (whether any of k independent generations solves the problem) also reveals variance in output quality: a model might pass 30% of the time at pass@1 but 90% at pass@10, indicating useful sampling diversity but low single-shot reliability. Understanding this distribution informs deployment decisions: should the application return a single generation, or multiple samples for human review?
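As a sketch of one such supplementary benchmark, the execution-based grading idea transfers directly to SQL: run the generated query against an in-memory fixture database and compare result sets. The schema, fixture rows, and queries below are hypothetical:

```python
import sqlite3

def grade_sql(generated_sql: str, reference_sql: str, fixture_rows) -> bool:
    # Execution-based grading: two queries are "equal" if they return the
    # same rows against the same fixture data (order-insensitive here).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", fixture_rows)
    try:
        got = sorted(conn.execute(generated_sql).fetchall())
    except sqlite3.Error:
        return False  # syntax errors and bad column names count as failures
    want = sorted(conn.execute(reference_sql).fetchall())
    return got == want

fixture = [(1, 9.5, "paid"), (2, 20.0, "paid"), (3, 5.0, "refunded")]
print(grade_sql(
    "SELECT id FROM orders WHERE status = 'paid'",  # model output (hypothetical)
    "SELECT id FROM orders WHERE status = 'paid'",  # reference answer
    fixture,
))  # → True
```

Comparing result sets rather than SQL text tolerates stylistic differences (aliasing, join order) while still grading on what the query actually returns.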
HumanEval evaluation requires executing generated Python code, which is inherently risky when that code contains bugs or infinite loops. The standard evaluation harness uses subprocess isolation with strict timeouts (typically 5 seconds) to kill runaway processes. For production deployments evaluating user- or model-generated code, sandboxing becomes critical: a malicious or pathological input can make an exponential-time algorithm consume all system resources. Common mitigations include lightweight containers (Docker) for stronger isolation, restrictive syscall filters (seccomp) preventing file and network access, and resource limits (cgroups) capping CPU and memory per execution. Even with these measures, crafted denial-of-service inputs can still burn the full timeout on every run, slowing evaluation. More advanced evaluation systems add fuzzing-style checks: generating diverse inputs for each submission automatically and watching for crashes, timeouts, or incorrect answers. This catches bugs that a single test case misses and reveals performance characteristics (does the algorithm degrade on large inputs?).
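On POSIX systems, a timeout-based runner like the one shown earlier can be hardened one step further with per-process resource limits from Python's standard `resource` module. This is a sketch with illustrative limit values, not a substitute for container- or seccomp-level isolation:

```python
import resource
import subprocess
import sys

def limit_resources():
    # Runs in the child process just before exec: cap CPU time and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                     # 5 s CPU
    resource.setrlimit(resource.RLIMIT_AS, (512 * 2**20, 512 * 2**20))  # 512 MiB

def run_limited(code: str, wall_timeout: float = 10.0) -> bool:
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            preexec_fn=limit_resources,  # POSIX only
            capture_output=True, timeout=wall_timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

print(run_limited("print('ok')"))       # passes normally
print(run_limited("while True: pass"))  # killed by the CPU cap → False
```

The CPU limit catches busy loops even when the wall-clock timeout is generous, and the address-space cap stops memory bombs before the host starts swapping.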
HumanEval pass rate correlates imperfectly with real-world code quality: passing HumanEval requires correct logic but guarantees nothing about readability, maintainability, or efficiency. Production systems therefore measure additional signals: code length (correct solutions tend toward reasonable concision), cyclomatic complexity (simpler control flow correlates with correctness), and compliance with style guides. Some teams augment HumanEval with rubrics: reviewers manually grade a random sample of generated solutions on clarity, efficiency, and adherence to requirements. Rubrics identify failure modes that automated metrics miss (e.g., a correct algorithm with variable names so cryptic the code is unmaintainable). Evaluation studies such as the Codex paper and its follow-ups report how these metrics relate to human preference, suggesting that a combination of pass rate (correctness), code length, and human review captures code quality better than any single metric. This multi-signal approach enables more confident model selection without manually reviewing every output.
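Two of those automated signals, length and a crude branch count standing in for cyclomatic complexity, are cheap to compute with Python's standard `ast` module. The node set counted here is our choice for illustration, not an established definition:

```python
import ast

# Control-flow constructs counted as "branches" (a rough complexity proxy)
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.BoolOp)

def code_signals(source: str) -> dict:
    tree = ast.parse(source)
    branches = sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
    lines = len([ln for ln in source.splitlines() if ln.strip()])
    return {"lines": lines, "branches": branches}

sample = """
def first_even(xs):
    for x in xs:
        if x % 2 == 0:
            return x
    return None
"""
print(code_signals(sample))  # {'lines': 5, 'branches': 2}
```

Tracked across a batch of generations, these signals flag outliers (suspiciously long or convoluted solutions) worth routing to human review even when the tests pass.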