Design Decisions

Model Selection

How to choose the right LLM for your production system. A structured framework covering task requirements, quality-cost-latency trade-offs, API vs self-hosted, frontier vs open models, and how to run your own benchmarks.

Quality/Cost/Latency: a three-way trade-off
Benchmark first: the key principle
TCO matters: look beyond API price

SECTION 01

The selection framework

Model selection should be driven by systematic evaluation, not brand preference or benchmark rankings. The framework:

  1. Define requirements first: What task (classification, generation, reasoning, code)? What quality bar (human parity, 90% of GPT-4, etc.)? What latency budget (real-time <1s, interactive <5s, batch <60s)? What cost budget (per-query, per-month)?
  2. Build a golden test set: 100–500 representative inputs with expected outputs or quality criteria. This is non-negotiable — you cannot select a model without a domain-specific eval.
  3. Shortlist candidates: 3–5 models that plausibly meet your requirements. Include the obvious frontrunners and one or two cheaper alternatives.
  4. Benchmark systematically: Run all candidates on your golden test set. Score automatically where possible, sample-human-review otherwise.
  5. Calculate total cost of ownership: API cost + infrastructure + maintenance + model update costs.
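Step 2 above can be as lightweight as a JSONL file of reviewed input/expected pairs. A minimal sketch (the cases, file name, and helper names are illustrative, not a prescribed format):

```python
import json

# Illustrative golden test cases: representative inputs with expected labels.
# In practice these come from real production traffic, reviewed by humans.
golden_cases = [
    {"input": "My card was charged twice for one order.", "expected": "billing"},
    {"input": "How do I reset my password?", "expected": "account"},
    {"input": "The app crashes when I open settings.", "expected": "bug_report"},
]

def save_golden_set(cases: list, path: str) -> None:
    # One JSON object per line (JSONL) keeps the set easy to diff and append to
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(case) + "\n")

def load_golden_set(path: str) -> list:
    with open(path) as f:
        return [json.loads(line) for line in f]

save_golden_set(golden_cases, "golden.jsonl")
assert load_golden_set("golden.jsonl") == golden_cases
```

Versioning this file alongside the application code means every model comparison, now and after future releases, runs against the same fixed yardstick.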
SECTION 02

Frontier API models

The current frontier model landscape (as of early 2025):

Model | Provider | Strengths | Input $/1M
GPT-4o | OpenAI | Balanced, vision, function calling | $2.50
Claude 3.5 Sonnet | Anthropic | Coding, long context, instruction following | $3.00
Gemini 1.5 Pro | Google | 1M context, multimodal, cost-efficient at scale | $3.50
GPT-4o-mini | OpenAI | Fast, cheap, surprisingly capable | $0.15
Claude 3 Haiku | Anthropic | Fastest Anthropic model, cheap | $0.25
Gemini 1.5 Flash | Google | Multimodal, long context at low cost | $0.35

For most production use cases: start with GPT-4o-mini or Claude 3 Haiku for high-volume, low-stakes tasks, and GPT-4o or Claude 3.5 Sonnet for quality-sensitive tasks.

SECTION 03

Open-weight self-hosted models

Self-hosted open models are cost-effective at high volume and necessary for data privacy requirements:

Model | Size | VRAM (4-bit) | Best for
Llama 3.1 8B Instruct | 8B | 5 GB | General tasks, budget
Llama 3.1 70B Instruct | 70B | 40 GB | Quality + privacy
Mistral 7B v0.3 | 7B | 4 GB | Apache 2.0 licence
Qwen 2.5 72B | 72B | 40 GB | Multilingual, coding
Phi-4 14B | 14B | 9 GB | Reasoning, on-device

Break-even point for self-hosting: typically 1–5M tokens/day, where hosting cost (GPU cloud) becomes cheaper than API calls. Below this, API is almost always more economical when you include engineering overhead.
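That break-even can be sanity-checked with a rough calculation. The GPU rental and blended API prices below are placeholder assumptions, not quotes:

```python
def self_host_breakeven_tokens_per_day(
    gpu_monthly_usd: float,     # assumed GPU rental cost (e.g. one A100-class card)
    api_price_per_m_usd: float, # blended API price per 1M tokens (input + output)
) -> float:
    # Tokens/day at which daily API spend equals daily GPU rental.
    # Engineering overhead is deliberately excluded here, which is why the API
    # stays more economical in practice until well past this raw break-even.
    daily_gpu_usd = gpu_monthly_usd / 30
    return daily_gpu_usd / api_price_per_m_usd * 1_000_000

# Placeholder assumptions: $1,200/month for a GPU, $12/1M blended frontier pricing
breakeven = self_host_breakeven_tokens_per_day(1200, 12.0)
print(f"break-even ~ {breakeven / 1e6:.1f}M tokens/day")
```

With these assumptions the raw break-even lands around 3.3M tokens/day, inside the 1–5M range quoted above; a cheaper API model or added engineering cost pushes it much higher.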

SECTION 04

Running your own benchmarks

import asyncio

async def benchmark_models(test_cases: list, models: list) -> dict:
    # Assumes two helpers defined elsewhere: call_model() invokes the provider
    # SDK for the given config; llm_judge() returns a 0-1 quality score.
    results = {}
    for model_config in models:
        scores = []
        for tc in test_cases:
            # Run the candidate model on this test case
            output = await call_model(model_config, tc["input"])
            # Score the output, optionally against an expected answer
            score = await llm_judge(tc["input"], output, tc.get("expected"))
            scores.append({"input": tc["input"], "output": output, "score": score})
        results[model_config["name"]] = {
            "scores": scores,
            "mean_score": sum(s["score"] for s in scores) / len(scores),
            "pass_rate": sum(1 for s in scores if s["score"] >= 0.7) / len(scores),
            "avg_latency_ms": ...,  # fill in from timings recorded around call_model()
            "avg_cost_usd": ...,    # fill in from token usage and per-token pricing
        }
    return results

results = asyncio.run(benchmark_models(test_cases, candidate_models))

# Rank candidates by quality per dollar
ranked = sorted(results.items(), key=lambda x: x[1]["mean_score"] / x[1]["avg_cost_usd"], reverse=True)
for name, metrics in ranked:
    print(f"{name}: score={metrics['mean_score']:.2f}, cost=${metrics['avg_cost_usd']:.4f}")
SECTION 05

Cost modelling

def monthly_cost_estimate(
    daily_queries: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    monthly_queries = daily_queries * 30
    input_cost = monthly_queries * avg_input_tokens / 1_000_000 * input_price_per_m
    output_cost = monthly_queries * avg_output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost

# Example: customer support bot
daily_queries = 10_000
avg_input = 1500   # system prompt + conversation history + user message
avg_output = 300   # typical response

print("GPT-4o:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 2.50, 10.00))
print("GPT-4o-mini:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 0.15, 0.60))
print("Claude Haiku:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 0.25, 1.25))

# GPT-4o:        $2,025/month
# GPT-4o-mini:     $121.50/month — if quality is acceptable, huge savings
# Claude Haiku:    $225/month
SECTION 06

Latency requirements

Latency is often the deciding factor that overrides quality considerations.

Streaming is critical for user-facing applications: stream=True in API calls means users see tokens as they're generated, dramatically improving perceived responsiveness even when total latency is unchanged.
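To make the perceived-latency point concrete, here is a small simulation. The token count, time-to-first-token, and decode speed are made-up numbers, and fake_token_stream stands in for a real streaming API response:

```python
import asyncio
import time

async def fake_token_stream(n_tokens: int, ttft_s: float, per_token_s: float):
    # Simulated model stream: first token arrives after ttft_s, then steady decode
    await asyncio.sleep(ttft_s)
    yield "Hello"
    for _ in range(n_tokens - 1):
        await asyncio.sleep(per_token_s)
        yield " token"

async def measure(stream):
    # Record when the first token arrives vs. when the full response completes
    start = time.monotonic()
    first = None
    async for _token in stream:
        if first is None:
            first = time.monotonic() - start
    total = time.monotonic() - start
    return first, total

first, total = asyncio.run(measure(fake_token_stream(50, ttft_s=0.3, per_token_s=0.02)))
print(f"first token after {first:.2f}s, full response after {total:.2f}s")
```

With streaming, the user starts reading at roughly 0.3s even though the full answer takes over a second; without it, the user stares at a spinner for the entire total.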

SECTION 07

Gotchas

Public benchmarks don't predict your task performance: MMLU, HumanEval, and MT-Bench measure specific capabilities. A model that ranks 5th on MMLU may rank 1st on your classification task. Always evaluate on your domain data.

Model versions change silently: OpenAI and Anthropic update models behind the same API endpoint. gpt-4o in March 2024 is not the same model as gpt-4o in December 2024. Pin to specific model versions (e.g., gpt-4o-2024-08-06) in production to avoid silent quality regressions.
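One low-effort safeguard is to centralize pinned model ids in config and reject floating aliases. The snapshot ids below are the ones current at the time of writing (gpt-4o-2024-08-06 as above, and the analogous gpt-4o-mini snapshot); verify them against your provider's model list:

```python
# Pin exact dated snapshots in config, never floating aliases like "gpt-4o"
MODELS = {
    "default": "gpt-4o-2024-08-06",
    "cheap":   "gpt-4o-mini-2024-07-18",
}

def resolve_model(tier: str) -> str:
    model_id = MODELS[tier]
    # Crude guard: a pinned OpenAI snapshot id contains a date suffix ("-20xx-")
    if "-20" not in model_id:
        raise ValueError(f"unpinned model id in config: {model_id}")
    return model_id

print(resolve_model("default"))
```

The date-suffix check is a heuristic specific to OpenAI's naming scheme; adapt it to whichever provider's versioning convention you use.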

Rate limits matter at scale: Enterprise tier rate limits are 10–100× higher than free/standard tiers. Factor in the lead time to upgrade limits — OpenAI rate limit increases can take days or weeks to approve.
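Until higher limits are approved, the standard client-side mitigation is retry with exponential backoff and jitter. This sketch simulates a rate-limited call rather than hitting a real API (real SDKs raise their own typed errors in place of the RuntimeError used here):

```python
import random
import time

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 0.01):
    # Exponential backoff with jitter: retry on rate-limit errors, doubling the
    # delay each attempt and adding randomness so clients don't retry in lockstep.
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)

# Demo: a flaky call that is "rate limited" twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate_limited")
    return "ok"

print(call_with_backoff(flaky))  # prints "ok" after two retries
```

In production, prefer catching the provider SDK's specific rate-limit exception and honoring any server-supplied retry-after hint over a fixed schedule.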

Model Selection Decision Framework

Choosing the right model for a production application involves balancing capability, cost, latency, and deployment constraints. No single model is optimal for all use cases — the right choice depends on the specific task requirements, traffic volume, and acceptable quality thresholds for the application.

Criterion | Small Model (≤8B) | Mid Model (8–70B) | Large Model (70B+)
Cost per 1M tokens | $0.10–$0.50 | $0.50–$3.00 | $3–$15
TTFT (API) | 100–300ms | 300ms–1s | 1–3s
Reasoning quality | Good for simple tasks | Strong | Best
Self-hosting cost | 1–2 GPUs | 2–4 GPUs | 4–8+ GPUs
Context length | 8K–128K | 32K–128K | 128K–1M+

Task complexity is the primary determinant of required model size. Simple classification, extraction, and templated generation tasks often reach production-quality accuracy with 7–8B parameter models, especially when fine-tuned on domain-specific data. Complex reasoning chains, multi-step planning, nuanced instruction following, and tasks requiring broad world knowledge are where larger models provide clear advantages that fine-tuning a smaller model cannot fully bridge.

Routing architectures allow mixing model tiers dynamically. A lightweight classifier routes simple queries to a cheap small model and complex queries to a more capable model, achieving better cost-quality trade-offs than either model alone. The classifier itself can be a small model or even a rule-based system based on query length, complexity heuristics, or user tier. Measuring the quality-cost curve of the routing strategy against both pure-small and pure-large baselines quantifies the actual benefit of the routing overhead.

# Model router: select model tier based on query complexity
import re

def classify_complexity(query: str) -> str:
    word_count = len(query.split())
    has_reasoning = any(k in query.lower() for k in
        ["compare", "analyze", "explain why", "evaluate", "trade-off"])
    has_code = bool(re.search(r"```|def |class |function", query))

    if word_count > 100 or has_reasoning:
        return "large"   # e.g. claude-opus or gpt-4o
    elif has_code or word_count > 40:
        return "medium"  # e.g. claude-sonnet or gpt-4o-mini
    else:
        return "small"   # e.g. claude-haiku or gpt-3.5-turbo
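Plugging illustrative numbers into a traffic mix shows the cost side of the comparison against a single-model baseline. Per-query prices and the mix below are assumptions for the sketch, not measurements:

```python
# Assumed per-query cost of each tier and the share of traffic the router sends there
tier_cost_usd = {"small": 0.0005, "medium": 0.003, "large": 0.02}
traffic_mix = {"small": 0.6, "medium": 0.3, "large": 0.1}

# Blended cost per query under routing vs. sending everything to the large model
routed_cost = sum(tier_cost_usd[t] * share for t, share in traffic_mix.items())
all_large = tier_cost_usd["large"]

print(f"routed:    ${routed_cost:.4f}/query")
print(f"all-large: ${all_large:.4f}/query ({all_large / routed_cost:.1f}x routed)")
```

The cost half is easy to compute; the quality half requires scoring the routed outputs on the golden test set against both pure-small and pure-large baselines, as the preceding paragraph describes.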

Benchmark-to-production quality correlation is imperfect, and over-relying on public benchmarks for model selection leads to suboptimal choices. MMLU and HumanEval measure academic reasoning and code generation; they may poorly predict performance on your specific task — customer support triage, document summarization, structured extraction from PDFs — which has different linguistic characteristics. Maintaining a domain-specific evaluation dataset and running new model releases against it before switching providers is the most reliable way to make model selection decisions grounded in actual use-case performance.

Total cost of ownership for model selection extends beyond per-token API pricing to include evaluation cost, integration engineering time, prompt migration effort, and monitoring overhead when switching providers. A model that is 20% cheaper per token but requires a week of prompt re-engineering and new evaluation suite runs may not deliver net savings unless the traffic volume is high enough for the per-token savings to amortize the switching cost. Modeling the full TCO over a 6–12 month horizon produces more accurate build-vs-buy cost comparisons than token price alone.
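The switching-cost argument can be sketched as a small TCO model. All dollar figures below are illustrative:

```python
def tco_usd(
    monthly_api_usd: float,
    switching_cost_usd: float,    # one-off: prompt migration, eval reruns, integration
    monthly_overhead_usd: float,  # monitoring and maintenance attributable to this model
    horizon_months: int = 12,
) -> float:
    # Total cost of ownership over the planning horizon
    return switching_cost_usd + horizon_months * (monthly_api_usd + monthly_overhead_usd)

# Incumbent at $2,000/month API spend vs. a challenger that is 20% cheaper per
# token but costs an assumed $12,000 one-off to migrate to
incumbent = tco_usd(2000, switching_cost_usd=0, monthly_overhead_usd=200)
challenger = tco_usd(1600, switching_cost_usd=12000, monthly_overhead_usd=200)
print(incumbent, challenger)
# Break-even horizon: 12_000 / (2000 - 1600) = 30 months of savings needed
```

Under these assumptions the "cheaper" model costs more over 12 months and only pays off past a 30-month horizon, which is exactly the trap the paragraph above describes.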