80 multi-turn dialogue questions judged by GPT-4. Tests instruction following in realistic chat scenarios across writing, reasoning, math, and coding. The standard benchmark for chat model quality.
MT-Bench (Zheng et al. 2023, LMSYS) is an evaluation benchmark for chat and instruction-following models. Unlike MMLU (knowledge recall) or HumanEval (code execution), MT-Bench asks GPT-4 to judge the quality of a model's responses on a 1–10 scale. It evaluates conversational ability: multi-turn coherence, instruction compliance, and response quality across diverse domains.
MT-Bench popularised the LLM-as-judge paradigm: using a capable model (GPT-4) to evaluate another model's outputs at scale, correlating well with human preference ratings while being far cheaper and faster than human annotation.
Each of the eight categories (writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities) has 10 questions, and each question has a first turn and a follow-up second turn (testing multi-turn coherence).
For each response, GPT-4 receives: the question, the model's response, and a scoring rubric. It rates the response 1–10 and provides a brief justification. The final MT-Bench score is the average across all 80 first-turn and 80 second-turn scores.
Three judge variants are used: single-answer grading (rate one response in isolation), pairwise comparison (prefer A or B), and reference-guided grading (when a reference answer exists, e.g. for math). The LMSYS team found that GPT-4 judgments agree with human annotator preferences more than 80% of the time, on par with agreement between human annotators: high enough for benchmarking but not for safety-critical evaluation.
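As a sketch of the reference-guided variant, the judge prompt embeds a reference answer alongside the model's response so the judge can check correctness rather than guess. The prompt wording below is illustrative, not the exact FastChat template:

```python
def reference_guided_prompt(question: str, reference: str, response: str) -> str:
    """Build a reference-guided judge prompt (illustrative wording only).

    Used for categories with verifiable answers (math, coding), where the
    judge compares against a known-good reference instead of judging blind.
    """
    return (
        "You are evaluating an answer to a question with a known solution.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Assistant's answer: {response}\n"
        "Compare the assistant's answer against the reference, then rate it "
        "on a scale of 1-10. Output only a single integer."
    )

prompt = reference_guided_prompt(
    "What is 15% of 240?", "36", "15% of 240 is 0.15 * 240 = 36."
)
```

The resulting string would be sent to the judge model the same way as the single-answer prompt shown later in this section.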
```python
import openai

client = openai.OpenAI()

# MT-Bench questions (subset example)
questions = [
    {"id": "writing_1", "category": "writing",
     "turns": [
         "Compose a haiku about the passage of time.",
         "Now write it from the perspective of an AI.",
     ]},
    {"id": "math_1", "category": "math",
     "turns": [
         "What is 15% of 240? Show your work.",
         "If I add 30 to that result, what percentage of 500 is the new number?",
     ]},
]

def evaluate_question(question: dict, model: str = "gpt-4o-mini") -> dict:
    """Run both turns of a question, carrying conversation history forward."""
    messages = []
    responses = []
    for turn in question["turns"]:
        messages.append({"role": "user", "content": turn})
        resp = client.chat.completions.create(
            model=model, messages=messages, max_tokens=1024
        )
        answer = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        responses.append(answer)
    return {"id": question["id"], "responses": responses}

def judge_response(question: str, response: str) -> int:
    """Single-answer grading: ask the judge for a bare 1-10 integer score."""
    judge_prompt = (
        "Rate the following response to this question on a scale of 1-10:\n"
        f"Question: {question}\nResponse: {response}\n"
        "Output only a single integer from 1 to 10."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=5, temperature=0.0,
    )
    # Assumes the judge complies with the format and emits a bare integer
    return int(resp.choices[0].message.content.strip())

# Run and score (first turns only, for brevity)
results = [evaluate_question(q) for q in questions]
scores = [judge_response(q["turns"][0], r["responses"][0])
          for q, r in zip(questions, results)]
print(f"Average MT-Bench score: {sum(scores)/len(scores):.2f}/10")
```
Reference MT-Bench scores (from the LMSYS leaderboard):

| Score range | Tier | Example models | Use case fit |
|---|---|---|---|
| 9.0+ | Frontier | GPT-4, Claude 3 Opus | Complex reasoning, agentic tasks |
| 8.0–8.9 | Strong | GPT-3.5-turbo, Claude Haiku | General assistant, chat |
| 7.0–7.9 | Capable | Mistral-7B-Instruct | Simple tasks, cost-sensitive |
| <7.0 | Limited | Small instruct models | Narrow, constrained applications |

Differences of less than 0.2 are not statistically meaningful given the variance in GPT-4 judge scoring. Category-level breakdowns reveal model strengths: some models excel at coding but struggle with multi-turn reasoning.

MT-Bench scores are calibrated against GPT-4, which typically scores 8.9–9.0 on the benchmark. Scores above 8.5 indicate models competitive with GPT-4 on multi-turn instruction following; scores in the 7.0–8.5 range represent capable assistant models suitable for most production applications; scores below 6.0 indicate significant deficiencies in instruction following or multi-turn coherence that would degrade user experience. The score distribution is not uniform: most frontier models cluster between 8.0 and 9.2, making fine-grained discrimination between top models difficult with MT-Bench alone.
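To check whether a score gap between two models exceeds judge noise, one can bootstrap per-question scores and look at the confidence interval for the difference in means. A minimal sketch, with invented per-question scores standing in for real judgments:

```python
import random

def bootstrap_mean_diff(scores_a, scores_b, n_boot=2000, seed=0):
    """95% CI for the difference in mean per-question scores (A minus B).

    Resamples questions with replacement; if the interval straddles zero,
    the observed gap is within judge-scoring noise.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample question indices
        da = sum(scores_a[i] for i in idx) / n
        db = sum(scores_b[i] for i in idx) / n
        diffs.append(da - db)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Hypothetical per-question 1-10 scores for two models on 80 questions
rng = random.Random(42)
model_a = [min(10, max(1, round(rng.gauss(8.6, 1.2)))) for _ in range(80)]
model_b = [min(10, max(1, round(rng.gauss(8.5, 1.2)))) for _ in range(80)]
lo, hi = bootstrap_mean_diff(model_a, model_b)
print(f"95% CI for score difference: [{lo:.2f}, {hi:.2f}]")
```

With gaps on the order of 0.1, the interval typically includes zero, which is the statistical content of the "differences under 0.2 are not meaningful" caveat above.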
```python
import json
from collections import defaultdict

# Run MT-Bench with the official FastChat harness:
#   pip install "fschat[model_worker,llm_judge]"
#   python -m fastchat.llm_judge.gen_model_answer --model-path your-model --model-id your-model-id
# After generating answers, run the judge:
#   python -m fastchat.llm_judge.gen_judgment --judge-model gpt-4 --model-list your-model-id

# Parse the judge's output
with open("data/mt_bench/model_judgment/gpt-4_single.jsonl") as f:
    scores = [json.loads(line) for line in f]

# Average score per category
by_cat = defaultdict(list)
for s in scores:
    by_cat[s["category"]].append(s["score"])
for cat, vals in sorted(by_cat.items()):
    print(f"{cat}: {sum(vals)/len(vals):.2f}")
```
MT-Bench's LLM-as-judge methodology introduces a position bias that affects single-model absolute scoring. GPT-4 as a judge tends to prefer the first response in pairwise comparisons and assigns slightly higher scores to verbose responses that use structured formatting. The FastChat MT-Bench implementation applies a position swap correction for pairwise comparisons (evaluating each pair in both orders) but the absolute scoring mode (assigning scores from 1–10) cannot easily correct for the judge's formatting preferences. Teams comparing models with different response styles should normalize MT-Bench scores against the known judge biases or use human evaluation as a validation check for close-scoring model comparisons.
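The position-swap correction can be sketched in a few lines: judge each pair in both orders, and count a win only when the verdict survives the swap. The prompt wording and the `ask_judge` helper below are illustrative stand-ins, not the FastChat implementation:

```python
def pairwise_verdict(ask_judge, question: str, resp_a: str, resp_b: str) -> str:
    """Judge a pair in both orders; inconsistent verdicts score as a tie.

    `ask_judge(prompt) -> "A" | "B"` is a stand-in for a judge-model call.
    """
    prompt = (
        "Which response better answers the question? Reply with only A or B.\n"
        "Question: {q}\nResponse A: {a}\nResponse B: {b}"
    )
    first = ask_judge(prompt.format(q=question, a=resp_a, b=resp_b))
    second = ask_judge(prompt.format(q=question, a=resp_b, b=resp_a))  # swapped
    # The swap flips labels, so a consistent judge answers A then B (or B then A)
    if first == "A" and second == "B":
        return "A wins"
    if first == "B" and second == "A":
        return "B wins"
    return "tie"  # verdict flipped with position: position bias, scored as tie

# A degenerate judge that always prefers the first-listed response
biased_judge = lambda prompt: "A"
print(pairwise_verdict(biased_judge, "2+2?", "4", "four"))  # -> tie
```

A judge with pure position bias picks "A" in both orders, which the swap exposes and neutralizes as a tie.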
MT-Bench's second-turn questions are specifically designed to test whether models can maintain instruction-following coherence across turns rather than treating each turn independently. Models that answer the first turn correctly but ignore the second-turn constraint (for example, answering in the wrong format after being asked to change formats) receive lower second-turn scores. This second-turn coherence measurement distinguishes models with genuine multi-turn understanding from models that independently handle each message without true conversation state tracking, making MT-Bench a better proxy for production chatbot quality than single-turn benchmarks.
Category-level MT-Bench analysis often reveals specialization patterns that aggregate scores obscure. A model that scores 9.2 on the coding category but 6.8 on the roleplay category is a different deployment fit than a model with 8.0 across all categories, even if their aggregate scores are similar. For teams building specialized applications, running the full MT-Bench evaluation and analyzing category breakdowns before making model selection decisions provides materially better guidance than relying on aggregate scores reported in papers or leaderboards.
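A crude but useful summary of this specialist-versus-generalist distinction is the spread between a model's best and worst category averages. The numbers below are invented for illustration:

```python
def specialization_spread(cat_scores: dict[str, float]) -> float:
    """Max minus min category average: high spread = specialist, low = generalist."""
    return round(max(cat_scores.values()) - min(cat_scores.values()), 1)

# Hypothetical category averages for two models with similar aggregate scores
specialist = {"coding": 9.2, "roleplay": 6.8, "writing": 8.1}
generalist = {"coding": 8.0, "roleplay": 8.0, "writing": 8.1}
print(specialization_spread(specialist))  # -> 2.4
print(specialization_spread(generalist))  # -> 0.1
```

Two models with the same aggregate can thus have very different deployment fits, which the spread makes visible at a glance.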
MT-Bench's fixed set of 80 questions across 8 categories means the benchmark is vulnerable to overfitting once model developers are aware of the specific questions. Unlike dynamic benchmarks that sample questions from a distribution, MT-Bench's public questions enable targeted optimization — training on data similar to the specific questions. The LMSYS team has noted this limitation and developed dynamic evaluation approaches, but MT-Bench's fixed format remains valuable for historical comparisons and for teams who want reproducible evaluation results without managing a question generation pipeline.
MT-Bench question categories are intentionally diverse to prevent models from gaming the benchmark through narrow task specialization. The eight categories — writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities — were selected to require different capability profiles, making it difficult to achieve high overall scores without broad competence. Teams using MT-Bench for model selection should pay particular attention to the coding and math categories, which show the highest variance between models and are often the most predictive of performance on structured reasoning tasks common in enterprise applications.
MT-Bench's multi-turn evaluation mirrors real chatbot interaction patterns by testing whether models can maintain coherent, relevant responses when conversation builds on previous turns. The second turn in each MT-Bench question is designed to require referencing or modifying the first answer — for example, asking the model to rewrite its explanation at a different level of difficulty, or to apply the same reasoning method to a different example. Models that treat each turn independently rather than tracking conversation state consistently underperform on MT-Bench second turns, making the benchmark an effective discriminator for multi-turn conversation capability.