80 multi-turn dialogue questions judged by GPT-4. Tests instruction following in realistic chat scenarios across writing, reasoning, math, and coding. The standard benchmark for chat model quality.
MT-Bench (Zheng et al. 2023, LMSYS) is an evaluation benchmark for chat and instruction-following models. Unlike MMLU (knowledge recall) or HumanEval (code execution), MT-Bench asks GPT-4 to judge the quality of a model's responses on a 1–10 scale. It evaluates conversational ability: multi-turn coherence, instruction compliance, and response quality across diverse domains.
MT-Bench popularised the LLM-as-judge paradigm: using a capable model (GPT-4) to evaluate another model's outputs at scale, correlating well with human preference ratings while being far cheaper and faster than human annotation.
Each of the eight categories (writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities) has 10 questions, and each question has a first turn and a follow-up second turn (testing multi-turn coherence).
For each response, GPT-4 receives: the question, the model's response, and a scoring rubric. It rates the response 1–10 and provides a brief justification. The final MT-Bench score is the average across all 80 first-turn and 80 second-turn scores.
Three judge variants are used: single-answer grading (rate one response in isolation), pairwise comparison (prefer A or B), and reference-guided grading (when a reference answer exists, e.g. for math). The LMSYS team found that GPT-4 judgments agree with human annotator preferences more than 80% of the time, on par with agreement between human annotators: high enough for benchmarking but not for safety-critical evaluation.
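As a sketch of the reference-guided variant, the judge prompt embeds a reference answer alongside the model's response so the judge can check correctness rather than guess. The prompt wording below is illustrative, not the exact FastChat template:

```python
def reference_guided_prompt(question: str, reference: str, response: str) -> str:
    """Build a reference-guided judge prompt (illustrative wording only).

    Used for categories with verifiable answers (math, coding), where the
    judge compares against a known-good reference instead of judging blind.
    """
    return (
        "You are evaluating an answer to a question with a known solution.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Assistant's answer: {response}\n"
        "Compare the assistant's answer against the reference, then rate it "
        "on a scale of 1-10. Output only a single integer."
    )

prompt = reference_guided_prompt(
    "What is 15% of 240?", "36", "15% of 240 is 0.15 * 240 = 36."
)
```

The resulting string would be sent to the judge model the same way as the single-answer prompt shown later in this section.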
```python
import openai

client = openai.OpenAI()

# MT-Bench questions (subset example)
questions = [
    {"id": "writing_1", "category": "writing",
     "turns": [
         "Compose a haiku about the passage of time.",
         "Now write it from the perspective of an AI.",
     ]},
    {"id": "math_1", "category": "math",
     "turns": [
         "What is 15% of 240? Show your work.",
         "If I add 30 to that result, what percentage of 500 is the new number?",
     ]},
]

def evaluate_question(question: dict, model: str = "gpt-4o-mini") -> dict:
    """Run both turns of a question, carrying conversation history forward."""
    messages = []
    responses = []
    for turn in question["turns"]:
        messages.append({"role": "user", "content": turn})
        resp = client.chat.completions.create(
            model=model, messages=messages, max_tokens=1024
        )
        answer = resp.choices[0].message.content
        messages.append({"role": "assistant", "content": answer})
        responses.append(answer)
    return {"id": question["id"], "responses": responses}

def judge_response(question: str, response: str) -> int:
    """Single-answer grading: ask the judge for a bare 1-10 integer score."""
    judge_prompt = (
        "Rate the following response to this question on a scale of 1-10:\n"
        f"Question: {question}\nResponse: {response}\n"
        "Output only a single integer from 1 to 10."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=5, temperature=0.0,
    )
    # Assumes the judge complies with the format and emits a bare integer
    return int(resp.choices[0].message.content.strip())

# Run and score (first turns only, for brevity)
results = [evaluate_question(q) for q in questions]
scores = [judge_response(q["turns"][0], r["responses"][0])
          for q, r in zip(questions, results)]
print(f"Average MT-Bench score: {sum(scores)/len(scores):.2f}/10")
```
Reference MT-Bench scores (from the LMSYS leaderboard):

| Score range | Tier | Example models | Use case fit |
|---|---|---|---|
| 9.0+ | Frontier | GPT-4, Claude 3 Opus | Complex reasoning, agentic tasks |
| 8.0–8.9 | Strong | GPT-3.5-turbo, Claude Haiku | General assistant, chat |
| 7.0–7.9 | Capable | Mistral-7B-Instruct | Simple tasks, cost-sensitive |
| <7.0 | Limited | Small instruct models | Narrow, constrained applications |

Differences of less than 0.2 are not statistically meaningful given the variance in GPT-4 judge scoring. Category-level breakdowns reveal model strengths: some models excel at coding but struggle with multi-turn reasoning.

MT-Bench scores are calibrated against GPT-4, which typically scores 8.9–9.0 on the benchmark. Scores above 8.5 indicate models competitive with GPT-4 on multi-turn instruction following; scores in the 7.0–8.5 range represent capable assistant models suitable for most production applications; scores below 6.0 indicate significant deficiencies in instruction following or multi-turn coherence that would degrade user experience. The score distribution is not uniform: most frontier models cluster between 8.0 and 9.2, making fine-grained discrimination between top models difficult with MT-Bench alone.
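To check whether a score gap between two models exceeds judge noise, one can bootstrap per-question scores and look at the confidence interval for the difference in means. A minimal sketch, with invented per-question scores standing in for real judgments:

```python
import random

def bootstrap_mean_diff(scores_a, scores_b, n_boot=2000, seed=0):
    """95% CI for the difference in mean per-question scores (A minus B).

    Resamples questions with replacement; if the interval straddles zero,
    the observed gap is within judge-scoring noise.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample question indices
        da = sum(scores_a[i] for i in idx) / n
        db = sum(scores_b[i] for i in idx) / n
        diffs.append(da - db)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Hypothetical per-question 1-10 scores for two models on 80 questions
rng = random.Random(42)
model_a = [min(10, max(1, round(rng.gauss(8.6, 1.2)))) for _ in range(80)]
model_b = [min(10, max(1, round(rng.gauss(8.5, 1.2)))) for _ in range(80)]
lo, hi = bootstrap_mean_diff(model_a, model_b)
print(f"95% CI for score difference: [{lo:.2f}, {hi:.2f}]")
```

With gaps on the order of 0.1, the interval typically includes zero, which is the statistical content of the "differences under 0.2 are not meaningful" caveat above.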
```python
import json
from collections import defaultdict

# Run MT-Bench with the official FastChat harness:
#   pip install "fschat[model_worker,llm_judge]"
#   python -m fastchat.llm_judge.gen_model_answer --model-path your-model --model-id your-model-id
# After generating answers, run the judge:
#   python -m fastchat.llm_judge.gen_judgment --judge-model gpt-4 --model-list your-model-id

# Parse the judge's output
with open("data/mt_bench/model_judgment/gpt-4_single.jsonl") as f:
    scores = [json.loads(line) for line in f]

# Average score per category
by_cat = defaultdict(list)
for s in scores:
    by_cat[s["category"]].append(s["score"])
for cat, vals in sorted(by_cat.items()):
    print(f"{cat}: {sum(vals)/len(vals):.2f}")
```
MT-Bench's LLM-as-judge methodology introduces a position bias that affects single-model absolute scoring. GPT-4 as a judge tends to prefer the first response in pairwise comparisons and assigns slightly higher scores to verbose responses that use structured formatting. The FastChat MT-Bench implementation applies a position swap correction for pairwise comparisons (evaluating each pair in both orders) but the absolute scoring mode (assigning scores from 1–10) cannot easily correct for the judge's formatting preferences. Teams comparing models with different response styles should normalize MT-Bench scores against the known judge biases or use human evaluation as a validation check for close-scoring model comparisons.
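The position-swap correction can be sketched in a few lines: judge each pair in both orders, and count a win only when the verdict survives the swap. The prompt wording and the `ask_judge` helper below are illustrative stand-ins, not the FastChat implementation:

```python
def pairwise_verdict(ask_judge, question: str, resp_a: str, resp_b: str) -> str:
    """Judge a pair in both orders; inconsistent verdicts score as a tie.

    `ask_judge(prompt) -> "A" | "B"` is a stand-in for a judge-model call.
    """
    prompt = (
        "Which response better answers the question? Reply with only A or B.\n"
        "Question: {q}\nResponse A: {a}\nResponse B: {b}"
    )
    first = ask_judge(prompt.format(q=question, a=resp_a, b=resp_b))
    second = ask_judge(prompt.format(q=question, a=resp_b, b=resp_a))  # swapped
    # The swap flips labels, so a consistent judge answers A then B (or B then A)
    if first == "A" and second == "B":
        return "A wins"
    if first == "B" and second == "A":
        return "B wins"
    return "tie"  # verdict flipped with position: position bias, scored as tie

# A degenerate judge that always prefers the first-listed response
biased_judge = lambda prompt: "A"
print(pairwise_verdict(biased_judge, "2+2?", "4", "four"))  # -> tie
```

A judge with pure position bias picks "A" in both orders, which the swap exposes and neutralizes as a tie.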
MT-Bench's second-turn questions are specifically designed to test whether models can maintain instruction-following coherence across turns rather than treating each turn independently. Models that answer the first turn correctly but ignore the second-turn constraint (for example, answering in the wrong format after being asked to change formats) receive lower second-turn scores. This second-turn coherence measurement distinguishes models with genuine multi-turn understanding from models that independently handle each message without true conversation state tracking, making MT-Bench a better proxy for production chatbot quality than single-turn benchmarks.
Category-level MT-Bench analysis often reveals specialization patterns that aggregate scores obscure. A model that scores 9.2 on the coding category but 6.8 on the roleplay category is a different deployment fit than a model with 8.0 across all categories, even if their aggregate scores are similar. For teams building specialized applications, running the full MT-Bench evaluation and analyzing category breakdowns before making model selection decisions provides materially better guidance than relying on aggregate scores reported in papers or leaderboards.
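A crude but useful summary of this specialist-versus-generalist distinction is the spread between a model's best and worst category averages. The numbers below are invented for illustration:

```python
def specialization_spread(cat_scores: dict[str, float]) -> float:
    """Max minus min category average: high spread = specialist, low = generalist."""
    return round(max(cat_scores.values()) - min(cat_scores.values()), 1)

# Hypothetical category averages for two models with similar aggregate scores
specialist = {"coding": 9.2, "roleplay": 6.8, "writing": 8.1}
generalist = {"coding": 8.0, "roleplay": 8.0, "writing": 8.1}
print(specialization_spread(specialist))  # -> 2.4
print(specialization_spread(generalist))  # -> 0.1
```

Two models with the same aggregate can thus have very different deployment fits, which the spread makes visible at a glance.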
MT-Bench's fixed set of 80 questions across 8 categories means the benchmark is vulnerable to overfitting once model developers are aware of the specific questions. Unlike dynamic benchmarks that sample questions from a distribution, MT-Bench's public questions enable targeted optimization — training on data similar to the specific questions. The LMSYS team has noted this limitation and developed dynamic evaluation approaches, but MT-Bench's fixed format remains valuable for historical comparisons and for teams who want reproducible evaluation results without managing a question generation pipeline.
MT-Bench question categories are intentionally diverse to prevent models from gaming the benchmark through narrow task specialization. The eight categories — writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities — were selected to require different capability profiles, making it difficult to achieve high overall scores without broad competence. Teams using MT-Bench for model selection should pay particular attention to the coding and math categories, which show the highest variance between models and are often the most predictive of performance on structured reasoning tasks common in enterprise applications.
MT-Bench's multi-turn evaluation mirrors real chatbot interaction patterns by testing whether models can maintain coherent, relevant responses when conversation builds on previous turns. The second turn in each MT-Bench question is designed to require referencing or modifying the first answer — for example, asking the model to rewrite its explanation at a different level of difficulty, or to apply the same reasoning method to a different example. Models that treat each turn independently rather than tracking conversation state consistently underperform on MT-Bench second turns, making the benchmark an effective discriminator for multi-turn conversation capability.