Route queries to cheap vs expensive models based on complexity. Simple queries go to GPT-4o-mini; complex ones go to GPT-4o. RouteLLM, LiteLLM routing, and custom classifiers all implement this pattern.
In most LLM applications, not all queries are equally hard. "What is the capital of France?" can be answered correctly by gpt-4o-mini ($0.00015/1K tokens). "Analyse the legal implications of this 50-page contract" needs GPT-4o ($0.0025/1K). If you send all queries to GPT-4o, you're paying 16× more than necessary for the easy queries. LLM routing classifies each query by complexity and sends it to the cheapest model that can handle it correctly. Typical savings: 40–70% cost reduction with <2% quality degradation on overall traffic.
```bash
pip install routellm
```

```python
from routellm.controller import Controller

# RouteLLM routes between a "strong" and a "weak" model
controller = Controller(
    routers=["mf"],  # "mf" = matrix factorisation router (recommended)
    strong_model="gpt-4o",
    weak_model="gpt-4o-mini",
)

# The threshold encoded in the model name controls the fraction of queries
# sent to the strong model; 0.11618 sends ~40% there (from the RouteLLM paper).
response = controller.chat.completions.create(
    model="router-mf-0.11618",
    messages=[{"role": "user", "content": "What is the integral of x^2?"}],
)
print(response.choices[0].message.content)

# Check which model was used:
# response.model == "gpt-4o" or "gpt-4o-mini"

# Calibrate the threshold on your own data:
from routellm.evals.calibration import calibrate_threshold

threshold = calibrate_threshold(
    data=your_eval_data,
    target_strong_model_calls=0.3,  # want 30% of queries to use the strong model
)
```
```python
import openai

client = openai.OpenAI()

ROUTER_PROMPT = '''Classify this user query as "simple" or "complex".
Simple: factual lookups, basic calculations, short responses, greetings.
Complex: multi-step reasoning, code generation, analysis, synthesis, long content.
Respond with only one word: simple or complex.
Query: {query}'''

def route_query(query: str) -> str:
    # Use a very cheap model for routing (tiny cost per classification)
    classification = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(query=query)}],
        max_tokens=5,
        temperature=0.0,
    ).choices[0].message.content.strip().lower()
    return "gpt-4o" if classification == "complex" else "gpt-4o-mini"

def routed_completion(query: str, **kwargs) -> tuple[str, str]:
    model = route_query(query)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        **kwargs,
    )
    return resp.choices[0].message.content, model

answer, used_model = routed_completion("What is 2+2?")
print(f"Answer: {answer} (used: {used_model})")
# Answer: 4 (used: gpt-4o-mini)

answer, used_model = routed_completion("Explain the proof of Fermat's Last Theorem")
print(f"Answer: {answer[:100]}... (used: {used_model})")
# ... (used: gpt-4o)
```
```python
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "expensive", "litellm_params": {"model": "gpt-4o"}},
    ],
    # Routing strategies: "simple-shuffle", "latency-based-routing", "usage-based-routing"
    routing_strategy="usage-based-routing",  # route to the model with the most capacity
)

# Or use fallback routing (try cheap first, fall back to expensive on error)
response = router.completion(
    model="cheap",
    messages=[{"role": "user", "content": "Hello"}],
    fallbacks=[{"model": "expensive"}],  # fall back if cheap fails
)
```
```python
def measure_router_roi(queries: list[str], ground_truth: list[str]) -> dict:
    # Baseline: all queries to the expensive model (~$0.01 per query on average)
    baseline_cost = len(queries) * 0.01
    routed_cost = 0.0
    quality_maintained = 0
    cheap_calls = 0
    for query, expected in zip(queries, ground_truth):
        answer, model = routed_completion(query)
        # Track the model actually used, rather than re-calling the router
        if model == "gpt-4o-mini":
            cheap_calls += 1
            routed_cost += 0.001
        else:
            routed_cost += 0.01
        # Check quality (simplified substring match)
        if expected.lower() in answer.lower():
            quality_maintained += 1
    return {
        "cost_savings_pct": (baseline_cost - routed_cost) / baseline_cost * 100,
        "quality_maintained_pct": quality_maintained / len(queries) * 100,
        "cheap_model_pct": cheap_calls / len(queries) * 100,
    }
```
LLM router evaluation requires a labeled dataset of queries paired with the minimum model tier needed to answer them correctly. Building this calibration dataset typically starts with routing all queries to the strongest model, collecting responses, and then testing each query against cheaper models to determine the minimum acceptable tier. The fraction of queries routable to cheaper models without quality loss is the key metric for estimating cost savings before deploying a router, and the calibration dataset becomes the training data for learned routing classifiers.
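The labeling loop described above can be sketched in a few lines. The `cheap_answer_fn` and `quality_check` callables here are hypothetical stand-ins for your cheap-tier model call and your judge, not real APIs:

```python
# Sketch: label each query with the minimum model tier that answers it
# acceptably, given strong-model reference answers already collected.

def build_calibration_set(queries, strong_answers, cheap_answer_fn, quality_check):
    """Test each query against the cheap tier; record the minimum passing tier."""
    dataset = []
    for query, gold in zip(queries, strong_answers):
        cheap_answer = cheap_answer_fn(query)
        tier = "weak" if quality_check(cheap_answer, gold) else "strong"
        dataset.append({"query": query, "min_tier": tier})
    return dataset

def routable_fraction(dataset):
    """Fraction of traffic the cheap tier can serve: the headline savings metric."""
    return sum(d["min_tier"] == "weak" for d in dataset) / len(dataset)
```

The resulting `min_tier` labels double as training data for a learned routing classifier.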
| Router type | Latency overhead | Accuracy | Maintenance |
|---|---|---|---|
| Rule-based (length/keyword) | ~0ms | Low | Manual rule updates |
| Embedding classifier | 5–20ms | Medium | Periodic retraining |
| Small LLM judge | 50–200ms | High | Prompt maintenance |
| RouteLLM (trained) | 10–50ms | High | Dataset required |
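The table's first row is the simplest to make concrete. A rule-based router is a few lines with effectively zero latency; the length cutoff and keyword list below are illustrative assumptions you would tune by hand:

```python
# Minimal rule-based router: route on length and keyword heuristics.
# Thresholds and keywords are illustrative, not recommended values.

COMPLEX_KEYWORDS = {"analyze", "analyse", "prove", "refactor", "summarize", "compare"}

def rule_based_route(query: str, length_cutoff: int = 200) -> str:
    q = query.lower()
    if len(query) > length_cutoff or any(k in q for k in COMPLEX_KEYWORDS):
        return "gpt-4o"
    return "gpt-4o-mini"
```

The trade-off is exactly what the table shows: no latency or inference cost, but low accuracy and manual maintenance as query patterns change.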
Router confidence thresholds require separate calibration for different error costs. A router that incorrectly routes a complex query to a weak model produces a wrong answer, while a router that incorrectly routes a simple query to a strong model wastes money but produces a correct answer. The cost asymmetry means the decision boundary should be set conservatively — routing to the stronger model on uncertainty — with the threshold adjusted based on empirical quality measurements at different confidence levels. Regularly auditing routed queries by sampling from each routing decision and comparing outputs validates that the router continues to make correct decisions as query distributions shift.
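One way to encode that asymmetry is a grid search over the threshold on labeled validation data, charging more for a complex query sent to the weak model than for an overspent simple query. The score/label format and the cost weights below are illustrative assumptions:

```python
# Sketch: choose a routing confidence threshold under asymmetric error costs.
# scores: router confidence that a query is simple; labels: True if simple.

def pick_threshold(scores, labels, wrong_answer_cost=10.0, overspend_cost=1.0):
    best_t, best_cost = 0.0, float("inf")
    for t in [i / 100 for i in range(101)]:
        cost = 0.0
        for score, is_simple in zip(scores, labels):
            routed_weak = score >= t
            if routed_weak and not is_simple:
                cost += wrong_answer_cost   # complex query got the weak model
            elif not routed_weak and is_simple:
                cost += overspend_cost      # simple query got the strong model
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

Because misrouting a complex query costs 10x an overspend here, the search naturally lands on a conservative boundary that prefers the strong model under uncertainty.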
Cost modeling for LLM routing requires tracking actual per-query costs across model tiers and measuring the quality differential between tiers. A routing system that sends 60% of queries to a model costing 10x less than the premium tier, while maintaining 95% of quality on those queries, reduces blended cost to 0.4 + 0.6 × 0.1 = 46% of baseline: roughly a 2x overall reduction, with the routed queries themselves costing 10x less. Tracking routing decisions alongside quality metrics in production enables continuous calibration of the quality threshold — adjusting the classifier confidence boundary to route more queries when quality headroom exists, or fewer when recent misrouting incidents indicate the threshold is too aggressive.
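The blended-cost arithmetic is worth keeping as an explicit helper. The 60% routed fraction and 10x price ratio below are the worked-example assumptions, not measured values:

```python
# Blended routed cost as a fraction of sending everything to the premium tier.

def blended_cost_ratio(cheap_fraction: float, cheap_price_ratio: float) -> float:
    """cheap_fraction: share of queries routed cheap;
    cheap_price_ratio: cheap tier price as a fraction of premium price."""
    return (1 - cheap_fraction) * 1.0 + cheap_fraction * cheap_price_ratio

ratio = blended_cost_ratio(cheap_fraction=0.6, cheap_price_ratio=0.1)
# 0.4 + 0.06 = 0.46 of baseline, i.e. a bit over 2x cheaper overall
```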
Fallback routing handles cases where the primary routing decision proves incorrect. When a query routed to a cheaper model produces a response that fails a quality check — either an automated judge score below threshold or an explicit user dissatisfaction signal — the fallback path routes the same query to the premium model and returns the higher-quality response. Logging fallback events provides a continuous stream of "hard cases" that the classifier got wrong, forming a natural training signal for router retraining. The fallback latency penalty (two model calls) is acceptable for applications where most queries are correctly classified and fallback is rare.
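A minimal sketch of that quality-gated fallback path, where `ask` and `judge` are hypothetical stand-ins for your model call and quality check:

```python
# Sketch: try the cheap model first, escalate to the premium model when the
# judge scores the answer below threshold. Returns (answer, model, fell_back).

def completion_with_fallback(query, ask, judge, threshold=0.7):
    answer = ask("gpt-4o-mini", query)
    if judge(query, answer) >= threshold:
        return answer, "gpt-4o-mini", False
    # Fallback: these events are the router's hard cases / retraining signal
    answer = ask("gpt-4o", query)
    return answer, "gpt-4o", True
```

Logging every `fell_back=True` event gives you the misclassified-hard-query stream the paragraph above describes.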
LLM routing interacts with prompt caching in non-obvious ways. If the primary router sends similar queries to different model tiers, prompt cache hit rates on each tier may be lower than a single-model deployment because the cached context is split across models. Architectures that cache the system prompt and conversation history on the premium model independently from the cheap model need separate cache warming strategies for each tier. Consistent routing of similar queries to the same model tier maximizes cache utilization but may conflict with routing decisions based purely on query complexity scores.
Monitoring for router distribution shift detects when the query distribution has changed enough that the router's training data is no longer representative. An embedding-based distribution monitor computes the centroid of query embeddings over a rolling window and flags when the average cosine distance from the training centroid exceeds a threshold. Distribution shifts often precede quality degradation — when user behavior changes or new use cases emerge, the router may misclassify the new query patterns because they fall outside its training distribution. Triggering router retraining when distribution shift is detected maintains routing accuracy as the application evolves.
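The centroid-distance monitor can be sketched with plain lists standing in for embedding vectors; the 0.2 distance threshold is an illustrative assumption to tune against your own drift history:

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

def drift_detected(train_embeddings, window_embeddings, threshold=0.2):
    """Flag when recent queries drift from the router's training distribution."""
    c_train = centroid(train_embeddings)
    avg_dist = sum(cosine_distance(e, c_train) for e in window_embeddings)
    return avg_dist / len(window_embeddings) > threshold
```

A drift flag is a trigger for retraining, not proof of degradation, so pairing it with sampled quality audits avoids retraining on noise.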
Router interpretability tools help identify systematic misrouting patterns that degrade overall quality. Visualizing the embedding space of routed queries — plotting easy and hard query embeddings with routing decision labels — reveals whether the classifier has learned a meaningful decision boundary or is routing based on superficial features like query length or keyword presence. Queries near the decision boundary that are incorrectly routed to cheaper models are the highest-priority candidates for adding to the router training dataset, as they represent the cases where the classifier is least confident and most prone to error.
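Mining those near-boundary misroutes is mechanical once routing decisions are logged. The sketch below assumes a hypothetical log format of `(query, confidence, was_misrouted)` tuples, where confidence is the router's probability that the query is simple:

```python
# Sketch: surface near-boundary misroutes as the highest-priority candidates
# for the router's retraining dataset. The 0.15 margin is illustrative.

def boundary_misroutes(decisions, margin=0.15):
    """Misrouted queries near the 0.5 decision boundary, least confident first."""
    hits = [
        (query, conf) for query, conf, misrouted in decisions
        if misrouted and abs(conf - 0.5) <= margin
    ]
    return sorted(hits, key=lambda item: abs(item[1] - 0.5))
```

Confident misroutes outside the margin matter too, but they usually signal a systematic feature gap rather than boundary noise, so they are better handled by inspecting what the classifier attends to.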