Production Engineering

Cost-Aware Routing

Dynamically routing requests to cheaper models when they are sufficient, and expensive models when quality demands it — automatically optimising the cost/quality tradeoff at scale.

Typical savings
40–70%
Quality delta
<3%
Key signal
query complexity

Table of Contents

SECTION 01

The Cost Problem

At scale, every unnecessary GPT-4o call is waste. A 7B local model costs 50–200× less per token than GPT-4o. The challenge: not all queries need GPT-4o. Simple factual lookups, classification tasks, and short reformulations can be handled by smaller models. Cost-aware routing captures these savings without degrading quality on hard queries.
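A quick back-of-envelope makes the savings concrete. The prices and the routing split below are illustrative assumptions, not real quotes:

```python
# Illustrative blended-cost calculation for a routed fleet.
# Per-request prices and the 70% small-model split are assumptions.

def expected_cost(route_fraction_small: float,
                  cost_small: float, cost_large: float) -> float:
    """Blended cost per request given the fraction routed to the small model."""
    return route_fraction_small * cost_small + (1 - route_fraction_small) * cost_large

# Assume a large-model call costs $0.01 and a small-model call $0.0001 (100x cheaper).
baseline = expected_cost(0.0, 0.0001, 0.01)   # always-large policy
routed = expected_cost(0.7, 0.0001, 0.01)     # 70% of traffic on the small model
savings = 1 - routed / baseline
print(f"Blended cost: ${routed:.5f}/req, savings vs always-large: {savings:.0%}")
```

Even a modest 70% small-model share lands in the savings band quoted above.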

SECTION 02

Routing Signals

Query complexity proxies: token count (long = harder), presence of reasoning keywords ('why', 'explain', 'compare'), detected task type (classification = easy, open-ended = hard), user tier (premium users always get the best model), and historical accuracy of small model on similar queries. Combine multiple signals for a robust routing score.
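One way to combine these signals is a simple weighted score; the weights, keyword list, and threshold below are illustrative assumptions, not tuned values:

```python
# Hypothetical sketch: combine several complexity proxies into one routing score.
# Weights and the 0.5 threshold are illustrative, not tuned.

REASONING_KEYWORDS = {"why", "explain", "compare", "analyze"}
EASY_TASKS = {"classification", "sentiment", "extraction"}

def routing_score(query: str, task_type: str, small_model_hist_acc: float) -> float:
    """Higher score = more likely the small model suffices (roughly 0..1)."""
    score = 0.0
    score += 0.3 if len(query.split()) < 30 else 0.0   # short queries tend to be easy
    score += 0.0 if any(k in query.lower() for k in REASONING_KEYWORDS) else 0.2
    score += 0.3 if task_type in EASY_TASKS else 0.0   # task type is a strong signal
    score += 0.2 * small_model_hist_acc                # history on similar queries
    return score

def route(query: str, task_type: str, hist_acc: float, threshold: float = 0.5) -> str:
    return "small" if routing_score(query, task_type, hist_acc) >= threshold else "large"
```

In practice you would fit these weights on labelled data, which is what the classifier-based router in the next section does.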

SECTION 03

Classifier-Based Router

A lightweight classifier (logistic regression or small BERT) predicts whether the small model will produce acceptable quality. Train on labelled examples where human raters compared small vs large model outputs.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
# Assume labels: 1 = small model sufficient, 0 = needs large model
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
classifier = LogisticRegression(C=1.0)
def train_router(queries: list[str], labels: list[int]):
    X = vectorizer.fit_transform(queries)
    classifier.fit(X, labels)
def route(query: str, threshold: float = 0.75) -> str:
    X = vectorizer.transform([query])
    prob_small_ok = classifier.predict_proba(X)[0][1]
    if prob_small_ok >= threshold:
        return "small"   # e.g. gpt-4o-mini or local 7B
    return "large"       # e.g. gpt-4o

SECTION 04

Rule-Based Routing

Simple rules often capture 80% of the savings with zero training overhead. Route to small if: query < 50 tokens, task type is classification/sentiment/extraction, or user is on free tier. Always use large for: multi-step reasoning, code generation, medical/legal domains, or when the user explicitly requests it.

def rule_router(query: str, task_type: str, user_tier: str) -> str:
    # Hard overrides
    if user_tier == "premium":
        return "large"
    if task_type in ("classification", "sentiment", "extraction", "translation"):
        return "small"
    # Heuristic complexity
    if len(query.split()) < 30 and "?" not in query:
        return "small"
    keywords = {"explain", "compare", "analyze", "why", "how does", "what would happen"}
    if any(kw in query.lower() for kw in keywords):
        return "large"
    return "small"  # default to cheap

SECTION 05

A/B Testing Routes

Before deploying a new routing policy, run a shadow A/B test: route 5% of traffic to the new policy and compare quality scores (LLM judge + user signals). Only cut over if quality delta is within your tolerance (typically <5% CSAT drop). Keep the old policy as a fallback for 2 weeks after cutover.
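The cutover check can be made mechanical. A minimal sketch, assuming judge scores on a 0 to 1 scale and a 5% relative-drop tolerance:

```python
# Sketch of the cutover decision: compare mean quality scores for the old and
# new routing policies and cut over only if the drop is within tolerance.
# The 0..1 score scale and 5% tolerance are assumptions.

def should_cut_over(old_scores: list[float], new_scores: list[float],
                    max_drop: float = 0.05) -> bool:
    """Allow cutover if the new policy's mean quality drops < max_drop (relative)."""
    old_mean = sum(old_scores) / len(old_scores)
    new_mean = sum(new_scores) / len(new_scores)
    relative_drop = (old_mean - new_mean) / old_mean
    return relative_drop < max_drop
```

With real traffic you would also check per-segment deltas, since an aggregate mean can hide a large regression on one query type.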

SECTION 06

Monitoring Costs

Track: cost per request by model, routing distribution (% to each model), quality score by routed model, and cost savings vs always-large baseline. Alert if: cost per request increases >20% (routing shifted to large), or quality score for small-routed requests drops >5% (router over-routing).
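The two alert conditions above translate directly into a check; the function and threshold constants mirror the text but the name is hypothetical:

```python
# Illustrative monitor for the alerts above: flag cost-per-request regressions
# and quality drops on small-routed traffic. Thresholds mirror the text.

def check_alerts(cost_now: float, cost_baseline: float,
                 small_quality_now: float, small_quality_baseline: float) -> list[str]:
    alerts = []
    if cost_now > cost_baseline * 1.20:                    # >20% cost increase
        alerts.append("cost_per_request_up")
    if small_quality_now < small_quality_baseline * 0.95:  # >5% quality drop
        alerts.append("small_route_quality_down")
    return alerts
```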

SECTION 07

Cost-Based Model Selection Strategy

Not all requests need the most capable model. Simple classification tasks work fine with a 7B model, while complex reasoning requires 70B or larger. A smart router evaluates the request complexity (token count, task type) and selects the cheapest model that can handle it, saving 80–90% on costs for simple queries.

# Cost-based routing logic
def route_request(user_query):
    # Estimate task complexity
    complexity = estimate_complexity(user_query)  # 0.0 to 1.0
    token_count = count_tokens(user_query)

    # Routing decision table
    if complexity < 0.3 and token_count < 500:
        return "claude-3-haiku"  # Cheapest, ~$0.25 per 1M input tokens
    elif complexity < 0.6 and token_count < 2000:
        return "claude-3-sonnet"  # Balanced, ~$3 per 1M input tokens
    else:
        return "claude-3-opus"  # Most capable, ~$15 per 1M input tokens

# Example usage
user_queries = [
    "What's the capital of France?",  # Haiku sufficient
    "Summarize this 5-page report",   # Sonnet recommended
    "Design a novel ML architecture"  # Opus required
]

for query in user_queries:
    model = route_request(query)
    response = call_model(model, query)
    print(f"Query: '{query[:30]}...' → {model}")

A/B Testing & Quality Verification

Intelligent routing trades a small quality risk for cost savings. Validate that downgraded models still produce acceptable output quality. A/B test by forcing 5% of traffic to the most capable model as a control and comparing latency, cost, and quality metrics (accuracy on evals, user satisfaction) against normally routed traffic.

# A/B testing framework for model routing
import random

def route_with_testing(query, control_model="claude-3-opus", test_pct=0.05):
    # 95% normal routing, 5% forced to best model for quality check
    if random.random() < test_pct:
        selected_model = control_model
        is_test = True
    else:
        selected_model = route_request(query)
        is_test = False

    # Log for analysis
    response = call_model(selected_model, query)
    log_routing_decision({
        "query": query,
        "selected_model": selected_model,
        "is_test": is_test,
        "latency_ms": response.latency_ms,
        "cost": response.cost
    })

    return response

SECTION 08

Advanced Routing Techniques

More sophisticated routers incorporate user context (VIP users always get premium models), request history (cache hits on previous queries), and real-time model availability (failover if a model is rate-limited). Some routers even learn from outcomes: if Haiku fails on a query, mark the query type and always upgrade future similar queries.
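The learn-from-outcomes idea can be sketched as a small failure-count memory; the class name and threshold are hypothetical:

```python
# Sketch of outcome learning: record failures of the cheap model per query
# type and permanently upgrade that type once failures pass a limit.
from collections import defaultdict

class OutcomeRouter:
    def __init__(self, fail_limit: int = 3):
        self.fail_counts = defaultdict(int)
        self.fail_limit = fail_limit

    def record_failure(self, query_type: str):
        """Call when the small model's answer was judged unacceptable."""
        self.fail_counts[query_type] += 1

    def route(self, query_type: str) -> str:
        # Once a query type has failed too often on the small model, always upgrade.
        if self.fail_counts[query_type] >= self.fail_limit:
            return "large"
        return "small"
```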

Routing Strategy         | Cost Saving | Complexity | Best For
Naive (always use Opus)  | 0%          | Low        | Baseline, high quality
Token-based              | 40–50%      | Low        | Simple complexity detection
Learned classifier       | 60–75%      | Medium     | Production, proven reliable
Dynamic with fallback    | 70–85%      | High       | Mixed workloads, SLA critical
User-segmented + caching | 80–90%      | Very High  | Mature platforms, high volume

Common Pitfall: Routing that's too aggressive (using cheap models too often) may degrade user experience. Monitor quality metrics closely: if user satisfaction drops even 2–3%, the cost savings are often not worth the reputation damage. Implement a feedback loop where users can report low-quality responses, which should trigger an immediate upgrade for similar future queries.

Consider time-of-day based routing: use cheaper models during off-peak hours when latency is less critical, and reserve expensive models for peak times. This smooths load and reduces costs without sacrificing peak experience.
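A minimal sketch of time-of-day routing, assuming a 9:00 to 21:00 local peak window (the window itself is an assumption, not a recommendation):

```python
# Off-peak, a "large" decision can be downgraded to "small" because latency
# and quality pressure are lower. Peak window is an assumed 9:00-21:00.
from datetime import time

def time_of_day_model(now: time, base_choice: str) -> str:
    peak = time(9, 0) <= now < time(21, 0)
    if not peak and base_choice == "large":
        return "small"
    return base_choice
```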

Multi-Modal Routing & Fallback Chains: Routing isn't limited to LLMs; route across modalities too. If the user's query is simple text, use a text LLM. If they upload an image with text, route to a vision model. If they ask for code generation, route to a code-tuned model. Implement fallback chains: if Haiku fails (e.g., a timeout), escalate to Sonnet. Log all fallback instances; high fallback rates indicate your thresholds are too aggressive. Balance cost against reliability: a slightly more expensive model with a higher success rate is preferable to a cheaper model with frequent fallbacks.
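The fallback chain described above can be sketched as a loop over an ordered model list; `call` here stands in for a real model client and is an assumption:

```python
# Fallback chain: try the cheapest model first, escalate on timeout, and log
# every fallback so overly aggressive thresholds show up in metrics.
# `call(model, query)` is a stand-in for a real client, assumed here.

FALLBACK_CHAIN = ["haiku", "sonnet", "opus"]

def call_with_fallback(query: str, call, fallback_log: list) -> str:
    for model in FALLBACK_CHAIN:
        try:
            return call(model, query)
        except TimeoutError:
            fallback_log.append(model)  # high rates => thresholds too aggressive
    raise RuntimeError("all models in the chain failed")
```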

Implement A/B testing frameworks: route 1% of queries to different model combinations and measure latency/cost/quality. Data from A/B tests informs routing policy improvements. For long-term trends, use contextual bandits (multi-armed bandit algorithms) to dynamically optimize routing probabilities in real-time based on feedback, without requiring manual policy updates.
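A hedged sketch of the bandit idea: epsilon-greedy over model arms using a running average of observed reward (e.g. quality minus a cost penalty). A true contextual bandit would condition on query features; this keeps only per-arm averages:

```python
# Epsilon-greedy router over model "arms". Reward is assumed to be a scalar
# such as judge quality minus a cost penalty; this is a non-contextual sketch.
import random

class EpsilonGreedyRouter:
    def __init__(self, arms, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(list(self.counts))   # explore
        return max(self.values, key=self.values.get)  # exploit best average reward

    def update(self, arm: str, reward: float):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean
```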

Monitoring and observability are essential for production systems. Set up comprehensive logging at every layer: API requests, model predictions, database queries, cache hits/misses. Use structured logging (JSON) to enable filtering and aggregation across thousands of servers. For production deployments, track not just errors but also latency percentiles (p50, p95, p99); if p99 latency suddenly doubles, something is wrong even if error rates are normal. Set up alerting based on SLO violations: if a service is supposed to have 99.9% availability and it drops to 99.5%, alert immediately. Use distributed tracing (Jaeger, Lightstep) to track requests across multiple services; a slow end-to-end latency might be hidden in one deep service call, invisible in aggregate metrics.
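To make the structured-logging and p99 points concrete, here is an illustrative sketch; the JSON field names and the nearest-rank percentile are assumptions:

```python
# Structured (JSON) request logs plus a simple p99 regression check.
# Field names are illustrative; percentile uses nearest-rank, which is fine
# for monitoring dashboards.
import json

def log_request(model: str, latency_ms: float, cache_hit: bool) -> str:
    return json.dumps({"model": model, "latency_ms": latency_ms, "cache_hit": cache_hit})

def percentile(samples: list[float], p: float) -> float:
    s = sorted(samples)
    idx = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[idx]

def p99_doubled(current: list[float], baseline_p99: float) -> bool:
    """The 'p99 suddenly doubles' alert condition described above."""
    return percentile(current, 99) > 2 * baseline_p99
```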

For long-running ML jobs (training, batch inference), implement checkpoint recovery and graceful degradation. If a training job crashes after 2 weeks, you want to resume from the last checkpoint, not restart from scratch. Implement job orchestration with Kubernetes or Airflow to handle retries, resource allocation, and dependency management. Use feature flags for safe deployment: deploy new model versions behind a flag that's off by default, gradually roll out to 1% of users, 10%, then 100%, monitoring metrics at each step. If something goes wrong, flip the flag back instantly. This approach reduces risk and enables fast rollback.
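The percentage-based feature flag can be sketched with a stable hash, so each user lands in the same bucket at every request during a rollout stage (function names are hypothetical):

```python
# Deterministic percentage rollout: hash the user id into one of 100 buckets,
# then enable the flag for buckets below the rollout percentage.
import hashlib

def in_rollout(user_id: str, rollout_pct: float) -> bool:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct

def model_for(user_id: str, rollout_pct: float) -> str:
    return "new-model" if in_rollout(user_id, rollout_pct) else "old-model"
```

Raising `rollout_pct` from 1 to 10 to 100 only ever adds users to the new version; flipping it back to 0 instantly reverts everyone.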

Finally, build a culture of incident response and post-mortems. When something breaks (and it will), document the incident: timeline, root cause, mitigation steps, and preventive measures. Use incidents as learning opportunities; blameless post-mortems focus on systems, not people. Share findings across teams to prevent repeat incidents. A well-documented incident history is an organization's institutional knowledge about system failures and how to avoid them.

The rapid evolution of AI infrastructure requires continuous learning and adaptation. Teams should establish regular tech talks and knowledge-sharing sessions where engineers present lessons learned from production deployments, performance optimization work, and incident postmortems. Create internal wiki pages documenting best practices specific to your organization: how to debug common failure modes, performance tuning guides for your hardware, and checklists for safe deployments. This prevents repeating mistakes and accelerates onboarding of new team members.

Build relationships with vendors and open-source communities. If you encounter bugs in frameworks (PyTorch, JAX), file detailed reports. If you have questions, ask on forums; community members often have encountered similar issues. For mission-critical infrastructure, consider purchasing support contracts with vendors (PyTorch, HuggingFace, cloud providers). Support gives you direct access to engineers who understand your system and can prioritize fixes. This is insurance against production outages caused by third-party software bugs.

Finally, remember that optimization is a journey, not a destination. Today's cutting-edge technique becomes tomorrow's baseline. Allocate 10-15% of engineering time to exploration and experimentation. Some experiments will fail, but successful ones compound into significant efficiency gains. Foster a culture of continuous improvement: measure, analyze, iterate, and share results. The teams that stay ahead are those that invest in understanding their systems deeply and adapting proactively to new technologies and changing demands.

Key Takeaway: Success in GenAI infrastructure depends on mastering fundamentals: understand your hardware constraints, profile your workloads, measure everything, and iterate. The most sophisticated techniques (dynamic batching, mixed precision, distributed training) build on solid foundations of clear thinking and empirical validation. Avoid cargo-cult engineering: if you don't understand why a technique helps your specific use case, it probably won't. Invest time in understanding root causes, not just applying trendy solutions. Over time, this rigor will compound into significant competitive advantage.