Route queries to cheap vs expensive models based on complexity. Simple queries go to GPT-4o-mini; complex ones go to GPT-4o. RouteLLM, LiteLLM routing, and custom classifiers all implement this pattern.
In most LLM applications, not all queries are equally hard. "What is the capital of France?" can be answered correctly by gpt-4o-mini ($0.00015/1K tokens). "Analyse the legal implications of this 50-page contract" needs GPT-4o ($0.0025/1K). If you send all queries to GPT-4o, you're paying 16× more than necessary for the easy queries. LLM routing classifies each query by complexity and sends it to the cheapest model that can handle it correctly. Typical savings: 40–70% cost reduction with <2% quality degradation on overall traffic.
```bash
pip install routellm
```

```python
from routellm.controller import Controller

# RouteLLM routes between a "strong" and a "weak" model
controller = Controller(
    routers=["mf"],  # "mf" = matrix factorisation router (recommended)
    strong_model="gpt-4o",
    weak_model="gpt-4o-mini",
)

# The threshold encoded in the model name controls the fraction of queries
# sent to the strong model; 0.11618 sends ~40% there (from the RouteLLM paper).
response = controller.chat.completions.create(
    model="router-mf-0.11618",
    messages=[{"role": "user", "content": "What is the integral of x^2?"}],
)
print(response.choices[0].message.content)

# Check which model was used:
# response.model == "gpt-4o" or "gpt-4o-mini"

# Calibrate the threshold on your own data:
from routellm.evals.calibration import calibrate_threshold

threshold = calibrate_threshold(
    data=your_eval_data,
    target_strong_model_calls=0.3,  # want 30% of queries to use the strong model
)
```
```python
import openai

client = openai.OpenAI()

ROUTER_PROMPT = '''Classify this user query as "simple" or "complex".
Simple: factual lookups, basic calculations, short responses, greetings.
Complex: multi-step reasoning, code generation, analysis, synthesis, long content.
Respond with only one word: simple or complex.
Query: {query}'''

def route_query(query: str) -> str:
    # Use a very cheap model for routing (tiny cost per classification)
    classification = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(query=query)}],
        max_tokens=5,
        temperature=0.0,
    ).choices[0].message.content.strip().lower()
    return "gpt-4o" if classification == "complex" else "gpt-4o-mini"

def routed_completion(query: str, **kwargs) -> tuple[str, str]:
    model = route_query(query)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        **kwargs,
    )
    return resp.choices[0].message.content, model

answer, used_model = routed_completion("What is 2+2?")
print(f"Answer: {answer} (used: {used_model})")
# Answer: 4 (used: gpt-4o-mini)

answer, used_model = routed_completion("Explain the proof of Fermat's Last Theorem")
print(f"Answer: {answer[:100]}... (used: {used_model})")
# ... (used: gpt-4o)
```
```python
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "cheap", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "expensive", "litellm_params": {"model": "gpt-4o"}},
    ],
    # Routing strategies: "simple-shuffle", "latency-based-routing", "usage-based-routing"
    routing_strategy="usage-based-routing",  # route to the model with the most capacity
)

# Or use fallback routing (try cheap first, fall back to expensive on error)
response = router.completion(
    model="cheap",
    messages=[{"role": "user", "content": "Hello"}],
    fallbacks=[{"model": "expensive"}],  # fall back if cheap fails
)
```
```python
def measure_router_roi(queries: list[str], ground_truth: list[str]) -> dict:
    # Baseline: all queries to the expensive model (~$0.01 per query on average)
    baseline_cost = len(queries) * 0.01
    routed_cost = 0.0
    quality_maintained = 0
    cheap_calls = 0
    for query, expected in zip(queries, ground_truth):
        answer, model = routed_completion(query)
        # Track the model actually used, rather than re-calling the router
        if model == "gpt-4o-mini":
            cheap_calls += 1
            routed_cost += 0.001
        else:
            routed_cost += 0.01
        # Check quality (simplified substring match)
        if expected.lower() in answer.lower():
            quality_maintained += 1
    return {
        "cost_savings_pct": (baseline_cost - routed_cost) / baseline_cost * 100,
        "quality_maintained_pct": quality_maintained / len(queries) * 100,
        "cheap_model_pct": cheap_calls / len(queries) * 100,
    }
```
LLM router evaluation requires a labeled dataset of queries paired with the minimum model tier needed to answer them correctly. Building this calibration dataset typically starts with routing all queries to the strongest model, collecting responses, and then testing each query against cheaper models to determine the minimum acceptable tier. The fraction of queries routable to cheaper models without quality loss is the key metric for estimating cost savings before deploying a router, and the calibration dataset becomes the training data for learned routing classifiers.
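The labeling loop described above can be sketched in a few lines. The `cheap_answer_fn` and `quality_check` callables here are hypothetical stand-ins for your cheap-tier model call and your judge, not real APIs:

```python
# Sketch: label each query with the minimum model tier that answers it
# acceptably, given strong-model reference answers already collected.

def build_calibration_set(queries, strong_answers, cheap_answer_fn, quality_check):
    """Test each query against the cheap tier; record the minimum passing tier."""
    dataset = []
    for query, gold in zip(queries, strong_answers):
        cheap_answer = cheap_answer_fn(query)
        tier = "weak" if quality_check(cheap_answer, gold) else "strong"
        dataset.append({"query": query, "min_tier": tier})
    return dataset

def routable_fraction(dataset):
    """Fraction of traffic the cheap tier can serve: the headline savings metric."""
    return sum(d["min_tier"] == "weak" for d in dataset) / len(dataset)
```

The resulting `min_tier` labels double as training data for a learned routing classifier.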
| Router type | Latency overhead | Accuracy | Maintenance |
|---|---|---|---|
| Rule-based (length/keyword) | ~0ms | Low | Manual rule updates |
| Embedding classifier | 5–20ms | Medium | Periodic retraining |
| Small LLM judge | 50–200ms | High | Prompt maintenance |
| RouteLLM (trained) | 10–50ms | High | Dataset required |
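The table's first row is the simplest to make concrete. A rule-based router is a few lines with effectively zero latency; the length cutoff and keyword list below are illustrative assumptions you would tune by hand:

```python
# Minimal rule-based router: route on length and keyword heuristics.
# Thresholds and keywords are illustrative, not recommended values.

COMPLEX_KEYWORDS = {"analyze", "analyse", "prove", "refactor", "summarize", "compare"}

def rule_based_route(query: str, length_cutoff: int = 200) -> str:
    q = query.lower()
    if len(query) > length_cutoff or any(k in q for k in COMPLEX_KEYWORDS):
        return "gpt-4o"
    return "gpt-4o-mini"
```

The trade-off is exactly what the table shows: no latency or inference cost, but low accuracy and manual maintenance as query patterns change.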
Router confidence thresholds require separate calibration for different error costs. A router that incorrectly routes a complex query to a weak model produces a wrong answer, while a router that incorrectly routes a simple query to a strong model wastes money but produces a correct answer. The cost asymmetry means the decision boundary should be set conservatively — routing to the stronger model on uncertainty — with the threshold adjusted based on empirical quality measurements at different confidence levels. Regularly auditing routed queries by sampling from each routing decision and comparing outputs validates that the router continues to make correct decisions as query distributions shift.
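One way to encode that asymmetry is a grid search over the threshold on labeled validation data, charging more for a complex query sent to the weak model than for an overspent simple query. The score/label format and the cost weights below are illustrative assumptions:

```python
# Sketch: choose a routing confidence threshold under asymmetric error costs.
# scores: router confidence that a query is simple; labels: True if simple.

def pick_threshold(scores, labels, wrong_answer_cost=10.0, overspend_cost=1.0):
    best_t, best_cost = 0.0, float("inf")
    for t in [i / 100 for i in range(101)]:
        cost = 0.0
        for score, is_simple in zip(scores, labels):
            routed_weak = score >= t
            if routed_weak and not is_simple:
                cost += wrong_answer_cost   # complex query got the weak model
            elif not routed_weak and is_simple:
                cost += overspend_cost      # simple query got the strong model
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

Because misrouting a complex query costs 10x an overspend here, the search naturally lands on a conservative boundary that prefers the strong model under uncertainty.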
Cost modeling for LLM routing requires tracking actual per-query costs across model tiers and measuring the quality differential between tiers. A routing system that sends 60% of queries to a model costing 10x less than the premium tier, while maintaining 95% of quality on those queries, reduces blended cost to 0.4 + 0.6 × 0.1 = 46% of baseline: roughly a 2x overall reduction, with the routed queries themselves costing 10x less. Tracking routing decisions alongside quality metrics in production enables continuous calibration of the quality threshold — adjusting the classifier confidence boundary to route more queries when quality headroom exists, or fewer when recent misrouting incidents indicate the threshold is too aggressive.
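The blended-cost arithmetic is worth keeping as an explicit helper. The 60% routed fraction and 10x price ratio below are the worked-example assumptions, not measured values:

```python
# Blended routed cost as a fraction of sending everything to the premium tier.

def blended_cost_ratio(cheap_fraction: float, cheap_price_ratio: float) -> float:
    """cheap_fraction: share of queries routed cheap;
    cheap_price_ratio: cheap tier price as a fraction of premium price."""
    return (1 - cheap_fraction) * 1.0 + cheap_fraction * cheap_price_ratio

ratio = blended_cost_ratio(cheap_fraction=0.6, cheap_price_ratio=0.1)
# 0.4 + 0.06 = 0.46 of baseline, i.e. a bit over 2x cheaper overall
```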
Fallback routing handles cases where the primary routing decision proves incorrect. When a query routed to a cheaper model produces a response that fails a quality check — either an automated judge score below threshold or an explicit user dissatisfaction signal — the fallback path routes the same query to the premium model and returns the higher-quality response. Logging fallback events provides a continuous stream of "hard cases" that the classifier got wrong, forming a natural training signal for router retraining. The fallback latency penalty (two model calls) is acceptable for applications where most queries are correctly classified and fallback is rare.
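A minimal sketch of that quality-gated fallback path, where `ask` and `judge` are hypothetical stand-ins for your model call and quality check:

```python
# Sketch: try the cheap model first, escalate to the premium model when the
# judge scores the answer below threshold. Returns (answer, model, fell_back).

def completion_with_fallback(query, ask, judge, threshold=0.7):
    answer = ask("gpt-4o-mini", query)
    if judge(query, answer) >= threshold:
        return answer, "gpt-4o-mini", False
    # Fallback: these events are the router's hard cases / retraining signal
    answer = ask("gpt-4o", query)
    return answer, "gpt-4o", True
```

Logging every `fell_back=True` event gives you the misclassified-hard-query stream the paragraph above describes.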
LLM routing interacts with prompt caching in non-obvious ways. If the primary router sends similar queries to different model tiers, prompt cache hit rates on each tier may be lower than a single-model deployment because the cached context is split across models. Architectures that cache the system prompt and conversation history on the premium model independently from the cheap model need separate cache warming strategies for each tier. Consistent routing of similar queries to the same model tier maximizes cache utilization but may conflict with routing decisions based purely on query complexity scores.
Monitoring for router distribution shift detects when the query distribution has changed enough that the router's training data is no longer representative. An embedding-based distribution monitor computes the centroid of query embeddings over a rolling window and flags when the average cosine distance from the training centroid exceeds a threshold. Distribution shifts often precede quality degradation — when user behavior changes or new use cases emerge, the router may misclassify the new query patterns because they fall outside its training distribution. Triggering router retraining when distribution shift is detected maintains routing accuracy as the application evolves.
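The centroid-distance monitor can be sketched with plain lists standing in for embedding vectors; the 0.2 distance threshold is an illustrative assumption to tune against your own drift history:

```python
import math

def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

def drift_detected(train_embeddings, window_embeddings, threshold=0.2):
    """Flag when recent queries drift from the router's training distribution."""
    c_train = centroid(train_embeddings)
    avg_dist = sum(cosine_distance(e, c_train) for e in window_embeddings)
    return avg_dist / len(window_embeddings) > threshold
```

A drift flag is a trigger for retraining, not proof of degradation, so pairing it with sampled quality audits avoids retraining on noise.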
Router interpretability tools help identify systematic misrouting patterns that degrade overall quality. Visualizing the embedding space of routed queries — plotting easy and hard query embeddings with routing decision labels — reveals whether the classifier has learned a meaningful decision boundary or is routing based on superficial features like query length or keyword presence. Queries near the decision boundary that are incorrectly routed to cheaper models are the highest-priority candidates for adding to the router training dataset, as they represent the cases where the classifier is least confident and most prone to error.
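Mining those near-boundary misroutes is mechanical once routing decisions are logged. The sketch below assumes a hypothetical log format of `(query, confidence, was_misrouted)` tuples, where confidence is the router's probability that the query is simple:

```python
# Sketch: surface near-boundary misroutes as the highest-priority candidates
# for the router's retraining dataset. The 0.15 margin is illustrative.

def boundary_misroutes(decisions, margin=0.15):
    """Misrouted queries near the 0.5 decision boundary, least confident first."""
    hits = [
        (query, conf) for query, conf, misrouted in decisions
        if misrouted and abs(conf - 0.5) <= margin
    ]
    return sorted(hits, key=lambda item: abs(item[1] - 0.5))
```

Confident misroutes outside the margin matter too, but they usually signal a systematic feature gap rather than boundary noise, so they are better handled by inspecting what the classifier attends to.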