Evaluation

Chatbot Arena

LMSYS Chatbot Arena: Elo-rated human preference leaderboard. Users rate blind A/B model comparisons in live chat. The most trusted ranking for overall chat quality because it measures real user preferences at scale.

At a glance: Elo-style ratings fit with the Bradley-Terry model · 1M+ human preference votes · blind A/B battles with no model identity shown.

SECTION 01

How Chatbot Arena works

Chatbot Arena (LMSYS, launched May 2023) is a human preference evaluation platform. Users submit prompts to a chat interface that simultaneously queries two anonymous models. After seeing both responses, users click "Model A is better", "Model B is better", "Tie", or "Both are bad". Only after voting are the model identities revealed. This blind setup prevents model popularity bias from influencing ratings. Votes are aggregated via a Bradley-Terry statistical model to produce Elo-style rankings.
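The unit of data this workflow produces is one blind A/B judgment per battle. A minimal sketch of such a record (field names are illustrative, not LMSYS's internal schema):

```python
from dataclasses import dataclass

@dataclass
class ArenaVote:
    # One blind A/B battle; identities are revealed to the user only after voting
    model_a: str  # hidden during rating
    model_b: str  # hidden during rating
    outcome: str  # "model_a", "model_b", "tie", or "tie (bothbad)"

vote = ArenaVote("model-x", "model-y", "model_a")
print(vote.outcome)  # model_a
```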

SECTION 02

Bradley-Terry Elo model

The Arena uses the Bradley-Terry model (the static model that Elo approximates online) to estimate each model's skill from pairwise comparisons. Model i's strength is θ_i; the probability that model i beats model j is P(i > j) = 1 / (1 + exp(θ_j - θ_i)). Given all pairwise vote outcomes, maximum likelihood estimation finds the θ values that best explain the observed comparisons. The resulting scores are converted to an Elo-like scale (e.g. anchored at a 1000 baseline) for interpretability.

import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(outcomes: list[tuple[int, int, float]]) -> np.ndarray:
    # outcomes: list of (model_i, model_j, fraction_i_wins)
    n_models = max(max(i, j) for i, j, _ in outcomes) + 1
    theta = np.zeros(n_models)

    def neg_log_likelihood(theta):
        nll = 0
        for i, j, frac in outcomes:
            p_i_wins = 1 / (1 + np.exp(theta[j] - theta[i]))
            nll -= frac * np.log(p_i_wins + 1e-10) + (1 - frac) * np.log(1 - p_i_wins + 1e-10)
        return nll

    result = minimize(neg_log_likelihood, theta, method="L-BFGS-B")
    scores = result.x
    # Normalise: set mean to 0, scale to Elo (400 per 10× odds)
    scores -= scores.mean()
    return scores * (400 / np.log(10))
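On the resulting scale, a 400-point gap corresponds to 10× odds of winning. A self-contained check of that interpretation (standard library only):

```python
def elo_win_prob(elo_a: float, elo_b: float) -> float:
    # Win probability implied by an Elo gap: 400 points = 10x odds
    return 1 / (1 + 10 ** ((elo_b - elo_a) / 400))

print(round(elo_win_prob(400, 0), 3))  # 0.909 (10:1 odds)
print(round(elo_win_prob(0, 0), 3))    # 0.5 (even odds)
```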

SECTION 03

What Arena measures

Arena rankings capture real human preference in open-ended chat — the closest proxy to "which model do users actually prefer". This is distinct from static academic benchmarks (e.g. MMLU, HumanEval), which score models on fixed datasets, and from automated LLM-as-judge evaluations (e.g. MT-Bench), which substitute a grader model for human judgment.

Arena correlates well with deployment satisfaction for consumer chat use cases. It's less useful for specialized domains (coding, science) unless filtered to domain-specific prompts.

SECTION 04

Accessing Arena data

from datasets import load_dataset

# LMSYS releases anonymized vote data
dataset = load_dataset("lmsys/chatbot_arena_conversations")
print(dataset["train"][0].keys())
# ['question_id', 'model_a', 'model_b', 'winner', 'judge',
#  'conversation_a', 'conversation_b', 'turn', 'language', 'tstamp']

# Filter to English single-turn conversations
english_single = dataset["train"].filter(
    lambda x: x["language"] == "English" and x["turn"] == 1
)
print(f"{len(english_single)} conversations")

# Analyse win rates
win_counts = {}
for row in english_single:
    winner = row["winner"]  # "model_a", "model_b", "tie", "tie (bothbad)"
    model = row[winner] if winner in ("model_a", "model_b") else None
    if model:
        win_counts[model] = win_counts.get(model, 0) + 1

for model, wins in sorted(win_counts.items(), key=lambda x: -x[1])[:5]:
    print(f"{model}: {wins} wins")
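The per-battle rows can be reduced to the (model_i, model_j, fraction_i_wins) triples that a Bradley-Terry fit consumes. A sketch using toy rows in the same schema (values are illustrative; ties counted as half a win for each side):

```python
from collections import defaultdict

# Toy battle rows in the dataset's schema (illustrative values)
rows = [
    {"model_a": "gpt-4", "model_b": "vicuna-13b", "winner": "model_a"},
    {"model_a": "vicuna-13b", "model_b": "gpt-4", "winner": "model_b"},
    {"model_a": "gpt-4", "model_b": "vicuna-13b", "winner": "tie"},
]

wins = defaultdict(float)
totals = defaultdict(int)
for r in rows:
    # Canonical ordering so (A, B) and (B, A) battles aggregate together
    pair = tuple(sorted((r["model_a"], r["model_b"])))
    totals[pair] += 1
    if r["winner"] in ("model_a", "model_b"):
        if r[r["winner"]] == pair[0]:
            wins[pair] += 1.0
    elif r["winner"].startswith("tie"):
        wins[pair] += 0.5

outcomes = [(a, b, wins[(a, b)] / totals[(a, b)]) for (a, b) in totals]
print(outcomes)  # gpt-4 wins ~0.83 of battles in this toy sample
```

Mapping model names to integer indices turns these triples into the input expected by a fit like fit_bradley_terry above.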

SECTION 05

Building your own arena

import random

import anthropic
import openai

def get_model_response(model_id: str, prompt: str) -> str:
    if "gpt" in model_id:
        client = openai.OpenAI()
        resp = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=1024,
        )
        return resp.choices[0].message.content
    elif "claude" in model_id:
        client = anthropic.Anthropic()
        resp = client.messages.create(
            model=model_id,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    raise ValueError(f"Unknown model: {model_id}")

def arena_battle(prompt: str, models: list[str]) -> dict:
    # random.sample picks two distinct models in random order,
    # so the blind presentation order is already randomised
    model_a, model_b = random.sample(models, 2)
    resp_a = get_model_response(model_a, prompt)
    resp_b = get_model_response(model_b, prompt)
    return {
        "prompt": prompt,
        "responses": [resp_a, resp_b],
        "model_ids": [model_a, model_b],  # hidden from the rater until after voting
    }
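Votes collected from arena_battle can be folded into running ratings with a standard online Elo update (a sketch; LMSYS itself refits the Bradley-Terry model over all votes rather than updating incrementally):

```python
def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    # score_a: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    # Zero-sum update: A gains exactly what B loses
    return r_a + delta, r_b - delta

print(update_elo(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```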

SECTION 06

Limitations

Arena's main limitations are drawn out in the gotchas below: a rater population skewed toward English-speaking, technically sophisticated users; verbosity bias, where longer responses win disproportionately; wide confidence intervals for models with few battles; and tie votes that carry no positive ranking signal.

SECTION 07

Gotchas

Arena results and model benchmarking

Chatbot Arena's Elo ratings have become the most trusted real-world capability benchmark for LLMs because they measure human preference on actual user-generated conversations rather than curated benchmark datasets. Models that score highly on traditional NLP benchmarks sometimes rank unexpectedly on Arena due to the distribution shift between benchmark tasks and genuine user needs. Conversely, models optimized for chat interaction and instruction following sometimes outperform technically superior models in Arena rankings, reflecting the gap between benchmark performance and user-perceived quality.

Metric              | What it measures                                  | Limitation
Elo rating          | Relative win probability against other models     | Rating depends on the comparison pool
Win rate            | Fraction of battles where the model is preferred  | Confounded by opponent quality
Confidence interval | Statistical uncertainty in the Elo estimate       | Wide CIs for newer models with few battles
Category score      | Performance within a topic (coding, math, etc.)   | Category definitions vary over time
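Those confidence intervals are typically obtained by bootstrapping over battles. A minimal sketch for a single model's win rate, using the percentile method (standard library only):

```python
import random

def bootstrap_winrate_ci(outcomes: list[int], n_boot: int = 1000, seed: int = 0) -> tuple[float, float]:
    # outcomes: 1 = win, 0 = loss; returns a 95% percentile interval
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_boot)
    )
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]

lo, hi = bootstrap_winrate_ci([1] * 60 + [0] * 40)
print(f"win rate 0.60, 95% CI ({lo:.2f}, {hi:.2f})")
```

Arena's published Elo CIs are computed similarly, resampling whole battles and refitting the Bradley-Terry model on each resample.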

Arena data is periodically released as public datasets, enabling external analysis of model quality trends and failure modes. Researchers have used the battle datasets to study mode collapse in LLM preferences (certain response styles being consistently preferred regardless of quality), verbosity bias (longer responses winning disproportionately), and category-specific capability gaps. Running custom analysis on the Arena dataset provides richer insights than reading top-line Elo rankings, particularly for understanding which failure modes are most prevalent for specific use cases.

Chatbot Arena's evaluation methodology has important selection biases that consumers of Arena rankings should account for. Arena users are predominantly English-speaking, technically sophisticated individuals who evaluate responses based on their own preferences and expertise. This user population over-represents queries about programming, mathematics, and technical domains compared to the full distribution of LLM use cases. Models optimized for the coding and reasoning tasks that Arena users disproportionately submit perform better in Arena rankings than their overall capability distribution would predict, making Arena less predictive for non-technical applications.

Arena's category breakdown provides more actionable quality signals than the aggregate Elo for teams with specific use cases. Filtering to "Coding" category battles when selecting a model for a software development assistant, or "Creative Writing" for a content generation application, reveals capability differences between models that the aggregate ranking obscures. Models that rank similarly in aggregate may differ by 50–100 Elo points in specific categories, making category-specific Arena scores the most relevant decision input for use-case-matched model selection.

Building an internal arena for private model evaluation follows the same statistical methodology as Chatbot Arena but applied to proprietary prompts and evaluation panels. A minimum of 500–1,000 battles between each pair of model versions is typically needed to achieve Elo confidence intervals narrow enough to make reliable deployment decisions. Using a stratified sampling approach — ensuring battles cover the full distribution of query types rather than sampling uniformly from available prompts — produces more representative Elo estimates for complex real-world workloads than convenience sampling from a readily available prompt pool.
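Stratified battle sampling can be sketched as drawing a fixed quota per query type rather than sampling from one pooled list (names and quotas here are illustrative):

```python
import random

def stratified_prompts(prompts_by_type: dict[str, list[str]], per_type: int, seed: int = 0) -> list[str]:
    # Draw up to per_type prompts from every stratum so battles
    # cover the full query-type distribution, not just the largest pool
    rng = random.Random(seed)
    sample = []
    for qtype, prompts in sorted(prompts_by_type.items()):
        sample.extend(rng.sample(prompts, min(per_type, len(prompts))))
    return sample

pool = {"coding": ["p1", "p2", "p3"], "writing": ["p4"], "math": ["p5", "p6"]}
print(stratified_prompts(pool, 2))
```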

Arena's tie resolution methodology affects ranking interpretation. When both models in a battle receive a tie vote, neither model receives a positive ranking signal. The frequency of ties varies significantly by model quality and query difficulty — well-matched model pairs produce more ties, while quality-mismatched pairs produce clear winners. High tie rates for a specific model often indicate that it performs consistently but without distinctive strengths, rather than indicating poor quality. Treating tie rate as a separate quality signal alongside win rate provides a more complete picture of model consistency than win rate alone.
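Tracking tie rate alongside win rate is a small computation over the same battle logs (a sketch; outcome labels are illustrative, simplified from the dataset's convention):

```python
def win_and_tie_rates(results: list[str]) -> tuple[float, float]:
    # results: per-battle outcomes for one model: "win", "loss", or "tie"
    n = len(results)
    return results.count("win") / n, results.count("tie") / n

print(win_and_tie_rates(["win", "tie", "tie", "loss"]))  # (0.25, 0.5)
```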

Arena's open-source battle dataset enables reproducibility research and meta-analysis of LLM evaluation methodology. Published battle logs with anonymized model identifiers allow external researchers to recompute Elo ratings, test alternative statistical models, and analyze systematic preferences in human evaluation. This dataset transparency distinguishes Arena from proprietary benchmark providers and has enabled peer-reviewed research on evaluation biases, verbosity preferences, and cultural variation in quality judgments across different user populations who contribute battles to the platform.