How to choose the right LLM for your production system. A structured framework covering task requirements, quality-cost-latency trade-offs, API vs self-hosted, frontier vs open models, and how to run your own benchmarks.
Model selection should be driven by systematic evaluation, not brand preference or leaderboard rankings. The framework below works through the model landscape, cost modeling, latency, and running your own benchmarks.
The current frontier model landscape (as of early 2025):
| Model | Provider | Strengths | Input $/1M |
|---|---|---|---|
| GPT-4o | OpenAI | Balanced, vision, function calling | $2.50 |
| Claude 3.5 Sonnet | Anthropic | Coding, long context, instruction following | $3.00 |
| Gemini 1.5 Pro | Google | 1M context, multimodal, cost-efficient at scale | $3.50 |
| GPT-4o-mini | OpenAI | Fast, cheap, surprisingly capable | $0.15 |
| Claude 3 Haiku | Anthropic | Fastest Anthropic model, cheap | $0.25 |
| Gemini 1.5 Flash | Google | Multimodal, long context at low cost | $0.35 |
For most production use cases: start with GPT-4o-mini or Claude Haiku for high-volume/low-stakes tasks, GPT-4o or Claude 3.5 Sonnet for quality-sensitive tasks.
Self-hosted open models are cost-effective at high volume and necessary for data privacy requirements:
| Model | Size | VRAM (4-bit) | Best for |
|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | 5 GB | General tasks, budget |
| Llama 3.1 70B Instruct | 70B | 40 GB | Quality + privacy |
| Mistral 7B v0.3 | 7B | 4 GB | Apache 2.0 license |
| Qwen 2.5 72B | 72B | 40 GB | Multilingual, coding |
| Phi-4 14B | 14B | 9 GB | Reasoning, on-device |
Break-even point for self-hosting: typically 1–5M tokens/day, where hosting cost (GPU cloud) becomes cheaper than API calls. Below this, API is almost always more economical when you include engineering overhead.
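The break-even arithmetic can be sketched directly; the prices here (blended API $/1M tokens, hourly GPU rate) are illustrative placeholders, not quotes:

```python
def self_host_breakeven_tokens_per_day(
    api_price_per_m: float,    # blended API $/1M tokens (input + output mix)
    gpu_cost_per_hour: float,  # cloud GPU hourly rate for your serving setup
    gpus: int = 1,
) -> float:
    """Daily token volume above which self-hosting beats the API.

    Ignores engineering overhead, which raises the real threshold.
    """
    daily_gpu_cost = gpu_cost_per_hour * 24 * gpus
    return daily_gpu_cost / api_price_per_m * 1_000_000


# Example: $6/1M blended API rate vs one ~$1/hr GPU serving a quantized 8B
print(self_host_breakeven_tokens_per_day(6.00, 1.00))  # 4,000,000 tokens/day
```

At roughly 4M tokens/day the raw compute costs cross over, which lands inside the 1–5M range cited above; engineering time pushes the practical threshold higher.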
Run your own benchmark across candidate models:

```python
import asyncio

import openai                    # used inside call_model (not shown)
from anthropic import Anthropic  # used inside call_model (not shown)


async def benchmark_models(test_cases: list, models: list) -> dict:
    """Score each candidate model over the test set.

    call_model and llm_judge are project-specific helpers.
    """
    results = {}
    for model_config in models:
        scores = []
        for tc in test_cases:
            # Run the model
            output = await call_model(model_config, tc["input"])
            # Score the output
            score = await llm_judge(tc["input"], output, tc.get("expected"))
            scores.append({"input": tc["input"], "output": output, "score": score})
        results[model_config["name"]] = {
            "scores": scores,
            "mean_score": sum(s["score"] for s in scores) / len(scores),
            "pass_rate": sum(1 for s in scores if s["score"] >= 0.7) / len(scores),
            "avg_latency_ms": ...,  # measure around the call_model await
            "avg_cost_usd": ...,    # derive from token counts and the price table
        }
    return results


# Sort by quality/cost ratio
results = asyncio.run(benchmark_models(test_cases, models))
ranked = sorted(results.items(),
                key=lambda x: x[1]["mean_score"] / x[1]["avg_cost_usd"],
                reverse=True)
for name, metrics in ranked:
    print(f"{name}: score={metrics['mean_score']:.2f}, cost=${metrics['avg_cost_usd']:.4f}")
```
Estimate monthly cost from traffic volume and per-token pricing:

```python
def monthly_cost_estimate(
    daily_queries: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    monthly_queries = daily_queries * 30
    input_cost = monthly_queries * avg_input_tokens / 1_000_000 * input_price_per_m
    output_cost = monthly_queries * avg_output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Example: customer support bot
daily_queries = 10_000
avg_input = 1500   # system prompt + conversation history + user message
avg_output = 300   # typical response

print("GPT-4o:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 2.50, 10.00))
print("GPT-4o-mini:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 0.15, 0.60))
print("Claude Haiku:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 0.25, 1.25))

# GPT-4o:       $2,025/month
# GPT-4o-mini:  $121.50/month — if quality is acceptable, a ~17x saving
# Claude Haiku: $225/month
```
Latency is often the deciding factor that overrides quality considerations: an interactive product can rarely tolerate multi-second waits for a first token, even from a stronger model.
Streaming is critical for user-facing applications: passing `stream=True` in API calls means users see tokens as they're generated, dramatically improving perceived responsiveness even when total latency is unchanged.
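As a sketch, the chunk-handling side of streaming can be separated from the API call itself; the OpenAI Python SDK (v1+) usage in the comments is illustrative and assumes an `OPENAI_API_KEY` is configured:

```python
from typing import Iterable, Iterator


def stream_text(chunks: Iterable) -> Iterator[str]:
    """Yield text deltas from streamed chat-completion chunks."""
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # final chunk carries delta.content = None
            yield delta


# Usage against the real API:
# from openai import OpenAI
# client = OpenAI()
# stream = client.chat.completions.create(
#     model="gpt-4o-mini", stream=True,
#     messages=[{"role": "user", "content": "Hello"}],
# )
# for piece in stream_text(stream):
#     print(piece, end="", flush=True)  # render tokens as they arrive
```

Factoring the consumer out this way also lets you unit-test rendering logic without an API key.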
Public benchmarks don't predict your task performance: MMLU, HumanEval, and MT-Bench measure specific capabilities. A model that ranks 5th on MMLU may rank 1st on your classification task. Always evaluate on your domain data.
Model versions change silently: OpenAI and Anthropic update models behind the same API endpoint. `gpt-4o` in March 2024 is not the same model as `gpt-4o` in December 2024. Pin to specific model versions (e.g., `gpt-4o-2024-08-06`) in production to avoid silent quality regressions.
Rate limits matter at scale: Enterprise tier rate limits are 10–100× higher than free/standard tiers. Factor in the lead time to upgrade limits — OpenAI rate limit increases can take days or weeks to approve.
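A minimal retry-with-backoff wrapper helps absorb transient 429s while a limit increase is pending; the exception class to catch is provider-specific (e.g. the SDK's rate-limit error), so it is taken as a parameter here:

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Retry fn() with exponential backoff plus jitter.

    `retry_on` should be your SDK's rate-limit exception
    (e.g. openai.RateLimitError) rather than bare Exception.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # exhausted retries; surface the error
            # 1x, 2x, 4x, ... the base delay, with jitter to avoid
            # synchronized retry storms across workers
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Most official SDKs now ship built-in retries; a wrapper like this is mainly useful for uniform behavior across multiple providers.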
Choosing the right model for a production application involves balancing capability, cost, latency, and deployment constraints. No single model is optimal for all use cases — the right choice depends on the specific task requirements, traffic volume, and acceptable quality thresholds for the application.
| Criterion | Small Model (≤8B) | Mid Model (8–70B) | Large Model (70B+) |
|---|---|---|---|
| Cost per 1M tokens | $0.10–$0.50 | $0.50–$3.00 | $3–$15 |
| TTFT (API) | 100–300ms | 300ms–1s | 1–3s |
| Reasoning quality | Good for simple tasks | Strong | Best |
| Self-hosting cost | 1–2 GPUs | 2–4 GPUs | 4–8+ GPUs |
| Context length | 8K–128K | 32K–128K | 128K–1M+ |
Task complexity is the primary determinant of required model size. Simple classification, extraction, and templated generation tasks often reach production-quality accuracy with 7–8B parameter models, especially when fine-tuned on domain-specific data. Complex reasoning chains, multi-step planning, nuanced instruction following, and tasks requiring broad world knowledge are where larger models provide clear advantages that fine-tuning a smaller model cannot fully bridge.
Routing architectures allow mixing model tiers dynamically. A lightweight classifier routes simple queries to a cheap small model and complex queries to a more capable model, achieving better cost-quality trade-offs than either model alone. The classifier itself can be a small model or even a rule-based system based on query length, complexity heuristics, or user tier. Measuring the quality-cost curve of the routing strategy against both pure-small and pure-large baselines quantifies the actual benefit of the routing overhead.
```python
# Model router: select model tier based on query complexity
import re


def classify_complexity(query: str) -> str:
    word_count = len(query.split())
    has_reasoning = any(k in query.lower() for k in
                        ["compare", "analyze", "explain why", "evaluate", "trade-off"])
    has_code = bool(re.search(r"```|def |class |function", query))
    if word_count > 100 or has_reasoning:
        return "large"    # e.g. claude-opus or gpt-4o
    elif has_code or word_count > 40:
        return "medium"   # e.g. claude-sonnet or gpt-4o-mini
    else:
        return "small"    # e.g. claude-haiku or gpt-3.5-turbo
```
Benchmark-to-production quality correlation is imperfect, and over-relying on public benchmarks for model selection leads to suboptimal choices. MMLU and HumanEval measure academic reasoning and code generation; they may poorly predict performance on your specific task — customer support triage, document summarization, structured extraction from PDFs — which has different linguistic characteristics. Maintaining a domain-specific evaluation dataset and running new model releases against it before switching providers is the most reliable way to make model selection decisions grounded in actual use-case performance.
Total cost of ownership for model selection extends beyond per-token API pricing to include evaluation cost, integration engineering time, prompt migration effort, and monitoring overhead when switching providers. A model that is 20% cheaper per token but requires a week of prompt re-engineering and new evaluation suite runs may not deliver net savings unless the traffic volume is high enough for the per-token savings to amortize the switching cost. Modeling the full TCO over a 6–12 month horizon produces more accurate build-vs-buy cost comparisons than token price alone.
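The amortization argument can be made concrete with a small payback calculation; the dollar figures below are hypothetical:

```python
def switch_payback_months(
    monthly_token_savings: float,         # $ saved per month at current volume
    switching_cost: float,                # one-off: eval runs + prompt migration
    monthly_overhead_delta: float = 0.0,  # extra monitoring/ops cost per month
) -> float:
    """Months until per-token savings recoup the one-off switching cost.

    Returns inf when net monthly savings are non-positive.
    """
    net = monthly_token_savings - monthly_overhead_delta
    return float("inf") if net <= 0 else switching_cost / net


# Example: $400/month cheaper tokens, $6,000 of migration work
# -> 15-month payback, longer than a 6-12 month horizon, so don't switch
print(switch_payback_months(400, 6_000))
```

If the payback period exceeds the planning horizon (or the model's likely lifetime before the next release), the cheaper per-token price does not justify the switch.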