Dynamically routing requests to cheaper models when they are sufficient, and expensive models when quality demands it — automatically optimising the cost/quality tradeoff at scale.
At scale, every unnecessary GPT-4o call is waste. A 7B local model costs 50–200× less per token than GPT-4o. The challenge: not all queries need GPT-4o. Simple factual lookups, classification tasks, and short reformulations can be handled by smaller models. Cost-aware routing captures these savings without degrading quality on hard queries.
Query complexity proxies: token count (long = harder), presence of reasoning keywords ('why', 'explain', 'compare'), detected task type (classification = easy, open-ended = hard), user tier (premium users always get the best model), and historical accuracy of small model on similar queries. Combine multiple signals for a robust routing score.
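The proxies above can be combined into a single score. A minimal sketch, with purely illustrative weights and thresholds (the function names, keyword list, and 0.25 weights are assumptions, not tuned values):

```python
# Sketch: combine complexity proxies into one routing score (weights illustrative).
REASONING_KEYWORDS = {"why", "explain", "compare", "analyze"}
EASY_TASKS = {"classification", "sentiment", "extraction"}

def routing_score(query: str, task_type: str, user_tier: str,
                  small_model_accuracy: float) -> float:
    """Return a score in [0, 1]; higher means the query looks harder."""
    words = query.lower().split()
    score = 0.0
    score += 0.25 * min(len(words) / 100, 1.0)            # long = harder
    score += 0.25 * any(kw in words for kw in REASONING_KEYWORDS)
    score += 0.25 * (task_type not in EASY_TASKS)          # open-ended = hard
    score += 0.25 * (1.0 - small_model_accuracy)           # history on similar queries
    return 1.0 if user_tier == "premium" else score        # premium: always best model

def pick_model(score: float, threshold: float = 0.5) -> str:
    return "large" if score >= threshold else "small"
```

In practice each weight would be fit against labelled data rather than set by hand.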
A lightweight classifier (logistic regression or a small BERT) predicts whether the small model will produce acceptable quality. Train it on labelled examples where human raters compared small vs large model outputs.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Labels: 1 = small model sufficient, 0 = needs large model
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
classifier = LogisticRegression(C=1.0)

def train_router(queries: list[str], labels: list[int]):
    X = vectorizer.fit_transform(queries)
    classifier.fit(X, labels)

def route(query: str, threshold: float = 0.75) -> str:
    X = vectorizer.transform([query])
    prob_small_ok = classifier.predict_proba(X)[0][1]
    if prob_small_ok >= threshold:
        return "small"  # e.g. gpt-4o-mini or a local 7B
    return "large"      # e.g. gpt-4o
```
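The threshold directly sets the cost/quality operating point: raising it routes fewer queries to the small model. A quick way to see the cost side on a traffic sample (the scores and per-million-token prices below are illustrative, not real measurements):

```python
# Sketch: expected cost per request at a given routing threshold.
def expected_cost(prob_small_ok: list[float], threshold: float,
                  cost_small: float = 0.5, cost_large: float = 15.0) -> float:
    """Average per-request cost (units are whatever the cost args use)."""
    costs = [cost_small if p >= threshold else cost_large for p in prob_small_ok]
    return sum(costs) / len(costs)

# Router scores on a sample of traffic (illustrative values)
scores = [0.9, 0.8, 0.95, 0.3, 0.6, 0.85, 0.2, 0.7]
# A higher threshold routes fewer queries to the small model, raising cost.
```

Sweeping the threshold over held-out scores like this, alongside a quality metric, gives the tradeoff curve to pick from.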
Simple rules often capture 80% of the savings with zero training overhead. Route to small if: the query is under 50 tokens, the task type is classification/sentiment/extraction, or the user is on the free tier. Always use large for: multi-step reasoning, code generation, medical/legal domains, or when the user explicitly requests it.
```python
def rule_router(query: str, task_type: str, user_tier: str) -> str:
    # Hard overrides
    if user_tier == "premium":
        return "large"
    if task_type in ("classification", "sentiment", "extraction", "translation"):
        return "small"
    # Reasoning keywords trump length: "explain X" is hard even when short
    keywords = {"explain", "compare", "analyze", "why", "how does", "what would happen"}
    if any(kw in query.lower() for kw in keywords):
        return "large"
    # Heuristic complexity: short declarative queries are safe for the small model
    if len(query.split()) < 30 and "?" not in query:
        return "small"
    return "small"  # default to cheap
```
Before deploying a new routing policy, run a shadow A/B test: route 5% of traffic to the new policy and compare quality scores (LLM judge + user signals). Only cut over if quality delta is within your tolerance (typically <5% CSAT drop). Keep the old policy as a fallback for 2 weeks after cutover.
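A shadow test can be as simple as serving every request from the incumbent policy while logging what the candidate would have done for a 5% sample. A minimal sketch (`incumbent_route` and `candidate_route` are placeholders for real policy functions):

```python
# Sketch: shadow-test a candidate routing policy on a fraction of traffic.
import random

def shadow_route(query: str, incumbent_route, candidate_route,
                 shadow_log: list, shadow_pct: float = 0.05) -> str:
    serving_model = incumbent_route(query)        # users always see this
    if random.random() < shadow_pct:
        shadow_log.append({
            "query": query,
            "incumbent": serving_model,
            "candidate": candidate_route(query),  # scored offline, never served
        })
    return serving_model
```

The logged pairs are then compared offline (LLM judge + user signals) before any cutover.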
Track: cost per request by model, routing distribution (% to each model), quality score by routed model, and cost savings vs always-large baseline. Alert if: cost per request increases >20% (routing shifted to large), or quality score for small-routed requests drops >5% (router over-routing).
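These metrics and alert rules can be computed from routing logs. A sketch, assuming each log record carries the routed model, realised cost, and a quality score (the record shape is an assumption):

```python
# Sketch: routing metrics vs an always-large baseline, plus the two alerts.
def routing_report(requests: list[dict], cost_large: float) -> dict:
    total_cost = sum(r["cost"] for r in requests)
    baseline = cost_large * len(requests)
    small = [r for r in requests if r["model"] == "small"]
    return {
        "cost_per_request": total_cost / len(requests),
        "pct_small": len(small) / len(requests),
        "savings_vs_baseline": 1 - total_cost / baseline,
        "small_quality": (sum(r["quality"] for r in small) / len(small)) if small else None,
    }

def should_alert(report: dict, prev: dict) -> bool:
    # Alert on >20% cost-per-request increase or >5% small-routed quality drop
    cost_up = report["cost_per_request"] > 1.2 * prev["cost_per_request"]
    quality_down = (report["small_quality"] is not None
                    and prev["small_quality"] is not None
                    and report["small_quality"] < 0.95 * prev["small_quality"])
    return cost_up or quality_down
```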
Not all requests need the most capable model. Simple classification tasks work fine with a 7B model, while complex reasoning requires 70B or larger. A smart router evaluates the request complexity (token count, task type) and selects the cheapest model that can handle it, saving 80–90% on costs for simple queries.
```python
# Cost-based routing logic
def route_request(user_query):
    # estimate_complexity and count_tokens are placeholders for your own
    # implementations (e.g. a classifier score and a tokenizer count)
    complexity = estimate_complexity(user_query)  # 0.0 to 1.0
    token_count = count_tokens(user_query)
    # Routing decision table
    if complexity < 0.3 and token_count < 500:
        return "claude-3-haiku"   # Cheapest, ~$0.25 per 1M input tokens
    elif complexity < 0.6 and token_count < 2000:
        return "claude-3-sonnet"  # Balanced, ~$3 per 1M input tokens
    else:
        return "claude-3-opus"    # Most capable, ~$15 per 1M input tokens

# Example usage
user_queries = [
    "What's the capital of France?",   # Haiku sufficient
    "Summarize this 5-page report",    # Sonnet recommended
    "Design a novel ML architecture",  # Opus required
]
for query in user_queries:
    model = route_request(query)
    response = call_model(model, query)
    print(f"Query: '{query[:30]}...' → {model}")
```

Intelligent routing trades latency for cost. Validate that downgraded models still produce acceptable output quality. A/B test by routing 5% of traffic to the upgraded model and comparing latency, cost, and quality metrics (accuracy on evals, user satisfaction).
```python
# A/B testing framework for model routing
import random

def route_with_testing(query, control_model="claude-3-opus", test_pct=0.05):
    # 95% normal routing, 5% forced to the best model as a quality check
    if random.random() < test_pct:
        selected_model = control_model
        is_test = True
    else:
        selected_model = route_request(query)
        is_test = False
    response = call_model(selected_model, query)
    # Log for analysis
    log_routing_decision({
        "query": query,
        "selected_model": selected_model,
        "is_test": is_test,
        "latency_ms": response.latency_ms,
        "cost": response.cost,
    })
    return response
```

More sophisticated routers incorporate user context (VIP users always get premium models), request history (cache hits on previous queries), and real-time model availability (failover if a model is rate-limited). Some routers even learn from outcomes: if Haiku fails on a query, mark the query type and always upgrade future similar queries.
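That learn-from-outcomes idea can be sketched with a small wrapper that remembers which task types the cheap model has failed on (the in-memory set is an illustration; production would persist this in a shared store):

```python
# Sketch: upgrade future queries of a task type after the cheap model fails on it.
class OutcomeRouter:
    def __init__(self, base_router):
        self.base_router = base_router           # any query -> model function
        self.upgraded_types: set[str] = set()

    def route(self, query: str, task_type: str) -> str:
        if task_type in self.upgraded_types:
            return "large"                       # learned: this type needs the big model
        return self.base_router(query)

    def record_failure(self, task_type: str, model: str) -> None:
        if model == "small":                     # only failures of the cheap path count
            self.upgraded_types.add(task_type)
```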
| Routing Strategy | Cost Saving | Complexity | Best For |
|---|---|---|---|
| Naive (always use Opus) | 0% | Low | Baseline, high quality |
| Token-based | 40–50% | Low | Simple complexity detection |
| Learned classifier | 60–75% | Medium | Production, proven reliable |
| Dynamic with fallback | 70–85% | High | Mixed workloads, SLA critical |
| User-segmented + caching | 80–90% | Very High | Mature platforms, high volume |
Common Pitfall: Routing that's too aggressive (using cheap models too often) may degrade user experience. Monitor quality metrics closely: if user satisfaction drops even 2–3%, the cost savings are often not worth the reputation damage. Implement a feedback loop where users can report low-quality responses, which should trigger an immediate upgrade for similar future queries.
Consider time-of-day based routing: use cheaper models during off-peak hours when latency is less critical, and reserve expensive models for peak times. This smooths load and reduces costs without sacrificing peak experience.
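One simple way to implement this is to make the small-model confidence threshold hour-dependent: stricter at peak (reserving the expensive model for peak traffic), looser off-peak. The peak window and thresholds below are assumptions:

```python
# Sketch: time-of-day routing via an hour-dependent confidence threshold.
from datetime import datetime

PEAK_HOURS = range(9, 18)  # assumed business-hours window, local time

def time_aware_route(query_score: float, now: datetime) -> str:
    """query_score: probability the small model suffices, from any router."""
    threshold = 0.9 if now.hour in PEAK_HOURS else 0.6
    return "small" if query_score >= threshold else "large"
```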
Multi-Modal Routing & Fallback Chains: Routing isn't limited to LLMs; route across modalities too. If the user's query is simple text, use a text LLM. If they upload an image with text, route to a vision model. If they ask for code generation, route to a code-tuned model. Implement fallback chains: if Haiku fails (e.g., timeout), escalate to Sonnet. Log all fallback instances; high fallback rates indicate your thresholds are too aggressive. Balance cost against reliability: a slightly more expensive model with a higher success rate is preferable to a cheaper model with frequent fallbacks.
Implement A/B testing frameworks: route 1% of queries to different model combinations and measure latency/cost/quality. Data from A/B tests informs routing policy improvements. For long-term trends, use contextual bandits (multi-armed bandit algorithms) to dynamically optimize routing probabilities in real-time based on feedback, without requiring manual policy updates.
Monitoring and observability are essential for production systems. Set up comprehensive logging at every layer: API requests, model predictions, database queries, cache hits/misses. Use structured logging (JSON) to enable filtering and aggregation across thousands of servers. For production deployments, track not just errors but also latency percentiles (p50, p95, p99); if p99 latency suddenly doubles, something is wrong even if error rates are normal. Set up alerting based on SLO violations: if a service is supposed to have 99.9% availability and it drops to 99.5%, alert immediately. Use distributed tracing (Jaeger, Lightstep) to track requests across multiple services; a slow end-to-end latency might be hidden in one deep service call, invisible in aggregate metrics.
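Latency percentiles are easy to compute from raw samples; a sketch using the nearest-rank method (production systems typically use streaming sketches like t-digest instead of sorting every sample):

```python
# Sketch: p50/p95/p99 from raw latency samples, nearest-rank method.
import math

def percentile(samples: list[float], pct: float) -> float:
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # nearest-rank
    return ordered[rank - 1]

def latency_summary(samples_ms: list[float]) -> dict:
    return {p: percentile(samples_ms, p) for p in (50, 95, 99)}
```

The p99 is what catches tail regressions that a healthy error rate hides.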
For long-running ML jobs (training, batch inference), implement checkpoint recovery and graceful degradation. If a training job crashes after 2 weeks, you want to resume from the last checkpoint, not restart from scratch. Implement job orchestration with Kubernetes or Airflow to handle retries, resource allocation, and dependency management. Use feature flags for safe deployment: deploy new model versions behind a flag that's off by default, gradually roll out to 1% of users, 10%, then 100%, monitoring metrics at each step. If something goes wrong, flip the flag back instantly. This approach reduces risk and enables fast rollback.
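The gradual-rollout flag can be implemented with a stable hash, so a given user is consistently in or out of the rollout and raising the percentage only ever adds users (the flag name and bucketing scheme here are illustrative):

```python
# Sketch: deterministic percentage rollout via a stable hash of (flag, user).
import hashlib

def in_rollout(user_id: str, flag: str, rollout_pct: float) -> bool:
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # deterministic value in [0, 1]
    return bucket < rollout_pct / 100
```

Flipping `rollout_pct` back to 0 is the instant rollback the text describes; no state needs to be migrated.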
Finally, build a culture of incident response and post-mortems. When something breaks (and it will), document the incident: timeline, root cause, mitigation steps, and preventive measures. Use incidents as learning opportunities; blameless post-mortems focus on systems, not people. Share findings across teams to prevent repeat incidents. A well-documented incident history is an organization's institutional knowledge about system failures and how to avoid them.
The rapid evolution of AI infrastructure requires continuous learning and adaptation. Teams should establish regular tech talks and knowledge-sharing sessions where engineers present lessons learned from production deployments, performance optimization work, and incident postmortems. Create internal wiki pages documenting best practices specific to your organization: how to debug common failure modes, performance tuning guides for your hardware, and checklists for safe deployments. This prevents repeating mistakes and accelerates onboarding of new team members.
Build relationships with vendors and open-source communities. If you encounter bugs in frameworks (PyTorch, JAX), file detailed reports. If you have questions, ask on forums; community members often have encountered similar issues. For mission-critical infrastructure, consider purchasing support contracts with vendors (PyTorch, HuggingFace, cloud providers). Support gives you direct access to engineers who understand your system and can prioritize fixes. This is insurance against production outages caused by third-party software bugs.
Finally, remember that optimization is a journey, not a destination. Today's cutting-edge technique becomes tomorrow's baseline. Allocate 10-15% of engineering time to exploration and experimentation. Some experiments will fail, but successful ones compound into significant efficiency gains. Foster a culture of continuous improvement: measure, analyze, iterate, and share results. The teams that stay ahead are those that invest in understanding their systems deeply and adapting proactively to new technologies and changing demands.
Key Takeaway: Success in GenAI infrastructure depends on mastering fundamentals: understand your hardware constraints, profile your workloads, measure everything, and iterate. The most sophisticated techniques (dynamic batching, mixed precision, distributed training) build on solid foundations of clear thinking and empirical validation. Avoid cargo-cult engineering: if you don't understand why a technique helps your specific use case, it probably won't. Invest time in understanding root causes, not just applying trendy solutions. Over time, this rigor will compound into significant competitive advantage.