How to choose the right LLM for your production system. A structured framework covering task requirements, quality-cost-latency trade-offs, API vs self-hosted, frontier vs open models, and how to run your own benchmarks.
Model selection should be driven by systematic evaluation, not brand preference or leaderboard rankings. The framework below works through the model landscape, cost modeling, latency, and running your own benchmarks.
The current frontier model landscape (as of early 2025):
| Model | Provider | Strengths | Input $/1M |
|---|---|---|---|
| GPT-4o | OpenAI | Balanced, vision, function calling | $2.50 |
| Claude 3.5 Sonnet | Anthropic | Coding, long context, instruction following | $3.00 |
| Gemini 1.5 Pro | Google | 1M context, multimodal, cost-efficient at scale | $3.50 |
| GPT-4o-mini | OpenAI | Fast, cheap, surprisingly capable | $0.15 |
| Claude 3 Haiku | Anthropic | Fastest Anthropic model, cheap | $0.25 |
| Gemini 1.5 Flash | Google | Multimodal, long context at low cost | $0.35 |
For most production use cases: start with GPT-4o-mini or Claude Haiku for high-volume/low-stakes tasks, GPT-4o or Claude 3.5 Sonnet for quality-sensitive tasks.
Self-hosted open models are cost-effective at high volume and necessary for data privacy requirements:
| Model | Size | VRAM (4-bit) | Best for |
|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | 5 GB | General tasks, budget |
| Llama 3.1 70B Instruct | 70B | 40 GB | Quality + privacy |
| Mistral 7B v0.3 | 7B | 4 GB | Apache 2.0 license |
| Qwen 2.5 72B | 72B | 40 GB | Multilingual, coding |
| Phi-4 14B | 14B | 9 GB | Reasoning, on-device |
Break-even point for self-hosting: typically 1–5M tokens/day, where hosting cost (GPU cloud) becomes cheaper than API calls. Below this, API is almost always more economical when you include engineering overhead.
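The break-even arithmetic can be sketched directly; the prices here (blended API $/1M tokens, hourly GPU rate) are illustrative placeholders, not quotes:

```python
def self_host_breakeven_tokens_per_day(
    api_price_per_m: float,    # blended API $/1M tokens (input + output mix)
    gpu_cost_per_hour: float,  # cloud GPU hourly rate for your serving setup
    gpus: int = 1,
) -> float:
    """Daily token volume above which self-hosting beats the API.

    Ignores engineering overhead, which raises the real threshold.
    """
    daily_gpu_cost = gpu_cost_per_hour * 24 * gpus
    return daily_gpu_cost / api_price_per_m * 1_000_000


# Example: $6/1M blended API rate vs one ~$1/hr GPU serving a quantized 8B
print(self_host_breakeven_tokens_per_day(6.00, 1.00))  # 4,000,000 tokens/day
```

At roughly 4M tokens/day the raw compute costs cross over, which lands inside the 1–5M range cited above; engineering time pushes the practical threshold higher.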
Run your own benchmark across candidate models:

```python
import asyncio

import openai                    # used inside call_model (not shown)
from anthropic import Anthropic  # used inside call_model (not shown)


async def benchmark_models(test_cases: list, models: list) -> dict:
    """Score each candidate model over the test set.

    call_model and llm_judge are project-specific helpers.
    """
    results = {}
    for model_config in models:
        scores = []
        for tc in test_cases:
            # Run the model
            output = await call_model(model_config, tc["input"])
            # Score the output
            score = await llm_judge(tc["input"], output, tc.get("expected"))
            scores.append({"input": tc["input"], "output": output, "score": score})
        results[model_config["name"]] = {
            "scores": scores,
            "mean_score": sum(s["score"] for s in scores) / len(scores),
            "pass_rate": sum(1 for s in scores if s["score"] >= 0.7) / len(scores),
            "avg_latency_ms": ...,  # measure around the call_model await
            "avg_cost_usd": ...,    # derive from token counts and the price table
        }
    return results


# Sort by quality/cost ratio
results = asyncio.run(benchmark_models(test_cases, models))
ranked = sorted(results.items(),
                key=lambda x: x[1]["mean_score"] / x[1]["avg_cost_usd"],
                reverse=True)
for name, metrics in ranked:
    print(f"{name}: score={metrics['mean_score']:.2f}, cost=${metrics['avg_cost_usd']:.4f}")
```
Estimate monthly cost from traffic volume and per-token pricing:

```python
def monthly_cost_estimate(
    daily_queries: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    monthly_queries = daily_queries * 30
    input_cost = monthly_queries * avg_input_tokens / 1_000_000 * input_price_per_m
    output_cost = monthly_queries * avg_output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Example: customer support bot
daily_queries = 10_000
avg_input = 1500   # system prompt + conversation history + user message
avg_output = 300   # typical response

print("GPT-4o:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 2.50, 10.00))
print("GPT-4o-mini:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 0.15, 0.60))
print("Claude Haiku:", monthly_cost_estimate(daily_queries, avg_input, avg_output, 0.25, 1.25))

# GPT-4o:       $2,025/month
# GPT-4o-mini:  $121.50/month — if quality is acceptable, a ~17x saving
# Claude Haiku: $225/month
```
Latency is often the deciding factor that overrides quality considerations: an interactive product can rarely tolerate multi-second waits for a first token, even from a stronger model.
Streaming is critical for user-facing applications: passing `stream=True` in API calls means users see tokens as they're generated, dramatically improving perceived responsiveness even when total latency is unchanged.
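As a sketch, the chunk-handling side of streaming can be separated from the API call itself; the OpenAI Python SDK (v1+) usage in the comments is illustrative and assumes an `OPENAI_API_KEY` is configured:

```python
from typing import Iterable, Iterator


def stream_text(chunks: Iterable) -> Iterator[str]:
    """Yield text deltas from streamed chat-completion chunks."""
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # final chunk carries delta.content = None
            yield delta


# Usage against the real API:
# from openai import OpenAI
# client = OpenAI()
# stream = client.chat.completions.create(
#     model="gpt-4o-mini", stream=True,
#     messages=[{"role": "user", "content": "Hello"}],
# )
# for piece in stream_text(stream):
#     print(piece, end="", flush=True)  # render tokens as they arrive
```

Factoring the consumer out this way also lets you unit-test rendering logic without an API key.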
Public benchmarks don't predict your task performance: MMLU, HumanEval, and MT-Bench measure specific capabilities. A model that ranks 5th on MMLU may rank 1st on your classification task. Always evaluate on your domain data.
Model versions change silently: OpenAI and Anthropic update models behind the same API endpoint. `gpt-4o` in March 2024 is not the same model as `gpt-4o` in December 2024. Pin to specific model versions (e.g., `gpt-4o-2024-08-06`) in production to avoid silent quality regressions.
Rate limits matter at scale: Enterprise tier rate limits are 10–100× higher than free/standard tiers. Factor in the lead time to upgrade limits — OpenAI rate limit increases can take days or weeks to approve.
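A minimal retry-with-backoff wrapper helps absorb transient 429s while a limit increase is pending; the exception class to catch is provider-specific (e.g. the SDK's rate-limit error), so it is taken as a parameter here:

```python
import random
import time


def call_with_backoff(fn, max_retries=5, base_delay=1.0, retry_on=(Exception,)):
    """Retry fn() with exponential backoff plus jitter.

    `retry_on` should be your SDK's rate-limit exception
    (e.g. openai.RateLimitError) rather than bare Exception.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # exhausted retries; surface the error
            # 1x, 2x, 4x, ... the base delay, with jitter to avoid
            # synchronized retry storms across workers
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

Most official SDKs now ship built-in retries; a wrapper like this is mainly useful for uniform behavior across multiple providers.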
Choosing the right model for a production application involves balancing capability, cost, latency, and deployment constraints. No single model is optimal for all use cases — the right choice depends on the specific task requirements, traffic volume, and acceptable quality thresholds for the application.
| Criterion | Small Model (≤8B) | Mid Model (8–70B) | Large Model (70B+) |
|---|---|---|---|
| Cost per 1M tokens | $0.10–$0.50 | $0.50–$3.00 | $3–$15 |
| TTFT (API) | 100–300ms | 300ms–1s | 1–3s |
| Reasoning quality | Good for simple tasks | Strong | Best |
| Self-hosting cost | 1–2 GPUs | 2–4 GPUs | 4–8+ GPUs |
| Context length | 8K–128K | 32K–128K | 128K–1M+ |
Task complexity is the primary determinant of required model size. Simple classification, extraction, and templated generation tasks often reach production-quality accuracy with 7–8B parameter models, especially when fine-tuned on domain-specific data. Complex reasoning chains, multi-step planning, nuanced instruction following, and tasks requiring broad world knowledge are where larger models provide clear advantages that fine-tuning a smaller model cannot fully bridge.
Routing architectures allow mixing model tiers dynamically. A lightweight classifier routes simple queries to a cheap small model and complex queries to a more capable model, achieving better cost-quality trade-offs than either model alone. The classifier itself can be a small model or even a rule-based system based on query length, complexity heuristics, or user tier. Measuring the quality-cost curve of the routing strategy against both pure-small and pure-large baselines quantifies the actual benefit of the routing overhead.
```python
# Model router: select model tier based on query complexity
import re


def classify_complexity(query: str) -> str:
    word_count = len(query.split())
    has_reasoning = any(k in query.lower() for k in
                        ["compare", "analyze", "explain why", "evaluate", "trade-off"])
    has_code = bool(re.search(r"```|def |class |function", query))
    if word_count > 100 or has_reasoning:
        return "large"    # e.g. claude-opus or gpt-4o
    elif has_code or word_count > 40:
        return "medium"   # e.g. claude-sonnet or gpt-4o-mini
    else:
        return "small"    # e.g. claude-haiku or gpt-3.5-turbo
```
Benchmark-to-production quality correlation is imperfect, and over-relying on public benchmarks for model selection leads to suboptimal choices. MMLU and HumanEval measure academic reasoning and code generation; they may poorly predict performance on your specific task — customer support triage, document summarization, structured extraction from PDFs — which has different linguistic characteristics. Maintaining a domain-specific evaluation dataset and running new model releases against it before switching providers is the most reliable way to make model selection decisions grounded in actual use-case performance.
Total cost of ownership for model selection extends beyond per-token API pricing to include evaluation cost, integration engineering time, prompt migration effort, and monitoring overhead when switching providers. A model that is 20% cheaper per token but requires a week of prompt re-engineering and new evaluation suite runs may not deliver net savings unless the traffic volume is high enough for the per-token savings to amortize the switching cost. Modeling the full TCO over a 6–12 month horizon produces more accurate build-vs-buy cost comparisons than token price alone.
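The amortization argument can be made concrete with a small payback calculation; the dollar figures below are hypothetical:

```python
def switch_payback_months(
    monthly_token_savings: float,         # $ saved per month at current volume
    switching_cost: float,                # one-off: eval runs + prompt migration
    monthly_overhead_delta: float = 0.0,  # extra monitoring/ops cost per month
) -> float:
    """Months until per-token savings recoup the one-off switching cost.

    Returns inf when net monthly savings are non-positive.
    """
    net = monthly_token_savings - monthly_overhead_delta
    return float("inf") if net <= 0 else switching_cost / net


# Example: $400/month cheaper tokens, $6,000 of migration work
# -> 15-month payback, longer than a 6-12 month horizon, so don't switch
print(switch_payback_months(400, 6_000))
```

If the payback period exceeds the planning horizon (or the model's likely lifetime before the next release), the cheaper per-token price does not justify the switch.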