Production & Infra

Cost–Quality–Speed Triangle

The fundamental tension between cost, output quality, and speed in LLM systems. Optimize one and you trade off the others. Strategy beats brute force.

3 tradeoff dimensions · Cascade as the key pattern · 10–100× cost variance between models

SECTION 01

The Iron Triangle

Like the project management iron triangle (scope, time, cost), LLM systems face a fundamental trade-off: cost, quality, and speed are in tension. You can optimize for any two, but not all three at once.

The Three Dimensions

Cost: API spend per request (input and output tokens, or self-hosted compute). Quality: accuracy, reasoning depth, and reliability of outputs. Speed: latency per request and overall throughput.

The Tensions

Each pair pulls against the third: higher quality usually means a larger, slower, more expensive model; lower cost usually means a smaller model or aggressive context trimming, which risks quality; lower latency rules out the multi-step pipelines that improve quality.

Visual: The Triangle

Imagine a triangle with Cost, Quality, and Speed at the corners. You can optimize for two, not all three: cheap and fast means lower quality; cheap and high quality means slower batch or cascaded processing; fast and high quality means paying for frontier models.

Core Strategy: You don't optimize one—you optimize the whole system. Use the right model for each request type. Route cheap tasks to small models, complex tasks to GPT-4, batch analysis to local inference.
SECTION 02

Model Tiering Strategy

Instead of one model for everything, use a three-tier system: frontier (best), mid-tier (balanced), cheap (volume).

Tier 1: Frontier Models (Complex Tasks)

Examples: GPT-4o, Claude Sonnet. Use for code generation, deep reasoning, and high-stakes outputs.

Tier 2: Mid-Tier Models (Balanced)

Examples: GPT-4o mini. Use for summarization, categorization, and routine drafting at a fraction of frontier cost.

Tier 3: Cheap Models (Volume)

Examples: small local models such as Llama 3. Use for spam filtering and simple classification at high volume, where per-request cost approaches zero.

Routing Logic

```python
def choose_model(task_complexity: str) -> str:
    """Route to the appropriate model tier."""
    if task_complexity == "complex":
        # Code generation, deep reasoning
        return "gpt-4o"
    elif task_complexity == "medium":
        # Summarization, categorization
        return "gpt-4o-mini"
    else:
        # Spam filter, simple classification
        return "local-llama-3"

# Example: email spam classification is a low-complexity task
model = choose_model("low")               # → "local-llama-3"
response = call_model(model, email_text)  # call_model: your API wrapper
```

Cost Breakdown Example

Consider 1M API requests/month distributed across the three tiers.
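As a back-of-envelope illustration, the tiered distribution can be costed like this. The per-request prices and the 10/30/60 split are assumptions for the sketch, not measured production numbers:

```python
# Illustrative monthly cost for 1M requests split across tiers.
# The split and per-request costs below are assumed, not measured.
TIER_SPLIT = {"frontier": 0.10, "mid": 0.30, "cheap": 0.60}
COST_PER_REQUEST = {"frontier": 0.027, "mid": 0.000195, "cheap": 0.0}

def tiered_monthly_cost(total_requests: int) -> float:
    """Sum per-tier cost: requests in tier times cost per request."""
    return sum(
        total_requests * share * COST_PER_REQUEST[tier]
        for tier, share in TIER_SPLIT.items()
    )

all_frontier = 1_000_000 * COST_PER_REQUEST["frontier"]
tiered = tiered_monthly_cost(1_000_000)
print(f"All frontier: ${all_frontier:,.0f}   Tiered: ${tiered:,.0f}")
```

Under these assumptions, routing 90% of traffic off the frontier tier cuts the bill by roughly an order of magnitude.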

Best Practice: Start with all requests on the mid-tier (a good cost/quality balance). Monitor error rates. Escalate failing task types to frontier; demote consistently successful ones to cheap. Let data guide routing.
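The escalate/demote loop described above can be sketched as a small router that tracks per-task success rates. The tier names, thresholds, and `min_samples` window here are illustrative assumptions, not a prescribed implementation:

```python
# Sketch of data-driven tier routing: track per-task success rates,
# escalate failing tasks up a tier, demote easy tasks down a tier.
# Thresholds and the sample window are illustrative.
from collections import defaultdict

class TierRouter:
    TIERS = ["cheap", "mid", "frontier"]  # ordered by capability

    def __init__(self, promote_below=0.90, demote_above=0.99, min_samples=100):
        self.tier = defaultdict(lambda: "mid")     # start everything mid-tier
        self.stats = defaultdict(lambda: [0, 0])   # task -> [successes, total]
        self.promote_below = promote_below
        self.demote_above = demote_above
        self.min_samples = min_samples

    def record(self, task: str, success: bool) -> None:
        s = self.stats[task]
        s[0] += int(success)
        s[1] += 1
        if s[1] < self.min_samples:
            return                                  # not enough data yet
        rate = s[0] / s[1]
        idx = self.TIERS.index(self.tier[task])
        if rate < self.promote_below and idx < len(self.TIERS) - 1:
            self.tier[task] = self.TIERS[idx + 1]   # escalate failures up
            self.stats[task] = [0, 0]               # reset after the move
        elif rate > self.demote_above and idx > 0:
            self.tier[task] = self.TIERS[idx - 1]   # demote easy tasks down
            self.stats[task] = [0, 0]
```

Resetting the stats after each move gives the new tier a clean evaluation window instead of inheriting the old tier's history.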
SECTION 03

Prompt Efficiency

Every token costs money. Optimize prompts to reduce tokens without losing quality.

Token Reduction Techniques

Example: Before vs After

```python
# BEFORE (verbose): roughly 60 tokens of instructions
prompt = """
Please provide a detailed and comprehensive analysis of the following
customer feedback. You should carefully consider all aspects, including
but not limited to sentiment, key themes, actionable insights, and
recommendations for improvement. Please be thorough and detailed in
your response.

Customer feedback: {feedback}
"""

# AFTER (optimized): roughly 15 tokens (~75% reduction)
prompt = """Analyze customer feedback: {feedback}

Output: sentiment, themes, insights, recommendations"""
```

RAG Token Optimization

In RAG systems, context is expensive. Optimize retrieval:

```python
# Bad: retrieve 10 full documents and insert them all
docs = retrieve_documents(query, top_k=10)   # retrieval helper
prompt = f"Context:\n{docs}\n\nQuestion: {query}"
# Cost: ~1K tokens just for context

# Good: retrieve only the most relevant documents, then compress
docs = retrieve_documents(query, top_k=3)
summaries = [compress_doc(doc, 100) for doc in docs]  # compress to ~100 tokens each
prompt = f"Context:\n{summaries}\n\nQuestion: {query}"
# Cost: ~300 tokens for context
```

Cost Calculation per Task

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  model: str = "gpt-4o") -> float:
    """Estimate the cost of a single API call (example rates; check current pricing)."""
    if model == "gpt-4o":
        input_cost = prompt_tokens * (0.03 / 1000)
        output_cost = completion_tokens * (0.06 / 1000)
    elif model == "gpt-4o-mini":
        input_cost = prompt_tokens * (0.00015 / 1000)
        output_cost = completion_tokens * (0.0006 / 1000)
    else:
        raise ValueError(f"Unknown model: {model}")
    return input_cost + output_cost

# Example: 500-token prompt + 200-token response
cost_gpt4 = estimate_cost(500, 200, "gpt-4o")       # $0.027
cost_mini = estimate_cost(500, 200, "gpt-4o-mini")  # $0.000195
# Using mini saves ~99% on this call
```
ROI of Optimization: Spend 1 hour optimizing a prompt, save $1/day in API costs. Over a year, that's $365. Worth it for any production system.
SECTION 04

LLM Cascade Pattern

The most powerful pattern: try a cheap model first, escalate to an expensive one on failure. Most requests stay cheap and fast, while hard cases still get frontier quality.

The Logic

  1. Request comes in
  2. Try cheap model (fast, cheap)
  3. Check confidence/quality
  4. If good enough → return (saved money!)
  5. If not → try expensive model
  6. Return best result

Implementation with Confidence Scoring

```python
def classify_with_cascade(text: str) -> str:
    """Cascade: cheap model first, expensive fallback."""
    # Tier 1: fast, cheap model; ask it to report a confidence score
    response_cheap = call_model(
        model="gpt-4o-mini",
        prompt=f'Classify the text. Reply as JSON with "label" and "confidence" (0-1): {text}',
    )
    classification = parse_json(response_cheap)  # parse_json: your JSON parser
    confidence = classification.get("confidence", 0)

    # High confidence: use the cheap result and stop here (saved money!)
    if confidence >= 0.9:
        return classification["label"]

    # Tier 2: expensive, more accurate model
    response_expensive = call_model(
        model="gpt-4o",
        prompt=f'Carefully classify the text. Reply as JSON with "label": {text}',
    )
    classification = parse_json(response_expensive)
    return classification["label"]
```

Cost-Benefit Analysis

Assume 1M classification requests/month.
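A quick arithmetic sketch of the cascade's savings. The per-request prices and the 80% cheap-success rate below are assumptions for illustration, not measured figures:

```python
# Back-of-envelope cascade cost for 1M requests/month.
# Prices and the 80% cheap-success rate are assumptions.
REQUESTS = 1_000_000
CHEAP, EXPENSIVE = 0.000195, 0.027  # $/request, illustrative
SUCCESS_RATE = 0.80                 # fraction resolved by the cheap model

expensive_only = REQUESTS * EXPENSIVE
cascade = (REQUESTS * CHEAP                              # every request tries cheap
           + REQUESTS * (1 - SUCCESS_RATE) * EXPENSIVE)  # only 20% escalate
savings = 1 - cascade / expensive_only

print(f"Expensive only: ${expensive_only:,.0f}")
print(f"Cascade:        ${cascade:,.0f}")
print(f"Savings:        {savings:.0%}")
```

Note that every request pays the cheap-model cost, even the ones that escalate; the savings come from only 20% of traffic reaching the expensive tier.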

Cascading for Different Task Types

```python
def route_to_model(task_type: str) -> dict:
    """Pick a cascade strategy per task type."""
    if task_type == "spam_detection":
        # Cheap classifier is ~95% accurate: cascade aggressively
        return {"primary": "cheap", "threshold": 0.85}
    elif task_type == "legal_analysis":
        # High stakes: always use the expensive model
        return {"primary": "expensive", "threshold": None}
    elif task_type == "summarization":
        # Quality matters but cheap is usually fine: cascade
        return {"primary": "cheap", "threshold": 0.7}
    # Default: cascade with a conservative threshold
    return {"primary": "cheap", "threshold": 0.9}
```
The Cascade Mindset: Don't assume expensive = always better. Measure. For 80% of requests, the cheap model works fine. Only pay for expensive intelligence when needed.
SECTION 05

Caching Strategies

Avoid API calls entirely with caching. The cheapest request is the one you don't make.

Exact-Match Cache

Cache identical prompts:

```python
import hashlib

cache = {}  # in-memory; use Redis in production

def cached_completion(prompt: str, model: str) -> str:
    # Hash the prompt for a compact cache key
    key = f"{model}:{hashlib.md5(prompt.encode()).hexdigest()}"

    # Check the cache first
    if key in cache:
        print("Cache hit!")
        return cache[key]

    # Cache miss: call the API (client: your OpenAI-style client)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    cache[key] = result
    return result

# Usage
response = cached_completion("What is photosynthesis?", "gpt-4o")
response = cached_completion("What is photosynthesis?", "gpt-4o")
# Second call hits the cache: no API charge
```

Prompt Prefix Caching (OpenAI/Anthropic Native)

Cache common prompt prefixes, such as the system message. OpenAI applies prefix caching automatically when requests share a long prefix; Anthropic requires explicit cache_control markers:

system_prompt = """You are an expert assistant. Answer questions accurately and concisely.""" response = client.chat.completions.create( model="gpt-4o", messages=[ {"role": "system", "content": system_prompt, "cache_control": {"type": "ephemeral"}}, {"role": "user", "content": "What is AI?"} ] ) # 2nd request with same system prefix hits cache response = client.chat.completions.create( model="gpt-4o", messages=[ {"role": "system", "content": system_prompt, "cache_control": {"type": "ephemeral"}}, {"role": "user", "content": "What is ML?"} ] ) # system_prompt cached, only user query charged

Semantic Caching (Vector-Based)

Cache responses to semantically similar prompts:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")
semantic_cache = []  # list of (embedding, response) pairs

def cosine_similarity(a, b) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cached_completion(prompt: str) -> str:
    # Embed the incoming prompt
    embedding = embedder.encode(prompt)

    # Look for a semantically similar cached prompt
    for cached_embedding, cached_response in semantic_cache:
        if cosine_similarity(embedding, cached_embedding) > 0.95:
            print("Semantic cache hit!")
            return cached_response

    # No match: call the API (client: your OpenAI-style client)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    semantic_cache.append((embedding, result))
    return result

# Usage
response1 = semantic_cached_completion("What is photosynthesis?")
response2 = semantic_cached_completion("Explain photosynthesis")
# If the second query embeds within the 0.95 threshold, it hits the cache
```
Cache Strategy: Use exact-match for identical requests. Use semantic for similar questions. Use prompt prefix caching for system messages. Combined, reduce API calls by 60–80%.
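One way to wire the layers together is a single lookup that tries exact match, then semantic match, then the API. This is a sketch: `embed_fn`, `similarity_fn`, and `call_api` are injected placeholders for a real embedding model, similarity metric, and LLM client:

```python
# Layered cache lookup: exact match first, then semantic, then API.
# embed_fn, similarity_fn, and call_api are injected stand-ins for a
# real embedding model, similarity metric, and LLM client.
import hashlib

def make_cached_completion(embed_fn, similarity_fn, call_api, threshold=0.95):
    exact = {}     # prompt-hash -> response
    semantic = []  # (embedding, response) pairs

    def completion(prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in exact:                              # layer 1: exact match
            return exact[key]
        emb = embed_fn(prompt)
        for cached_emb, cached_resp in semantic:      # layer 2: semantic match
            if similarity_fn(emb, cached_emb) >= threshold:
                return cached_resp
        result = call_api(prompt)                     # layer 3: real API call
        exact[key] = result
        semantic.append((emb, result))
        return result

    return completion
```

The exact-match layer runs first because it is a constant-time dictionary lookup, while the semantic layer scans stored embeddings; in production the scan would be replaced by a vector index.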
SECTION 06

Evaluation-Driven Optimization

Measure quality before/after model changes. Don't degrade quality blindly to save cost.

Build a Benchmark

```python
# Example: customer support classification (in/out of scope)
benchmark = [
    {"text": "How do I reset my password?", "expected": "in-scope"},
    {"text": "Can you tell me the meaning of life?", "expected": "out-of-scope"},
    {"text": "My order hasn't arrived", "expected": "in-scope"},
    # ... 100+ examples
]

def evaluate_model(model_name: str, benchmark: list) -> dict:
    """Test a model on the benchmark and measure accuracy."""
    correct = 0
    for item in benchmark:
        response = call_model(model_name, item["text"])  # call_model: your API wrapper
        classification = parse_json(response)
        if classification["label"] == item["expected"]:
            correct += 1
    accuracy = correct / len(benchmark)
    return {"model": model_name, "accuracy": accuracy}

# Compare models
eval_gpt4 = evaluate_model("gpt-4o", benchmark)
eval_mini = evaluate_model("gpt-4o-mini", benchmark)
print(f"GPT-4o: {eval_gpt4['accuracy']:.1%}")
print(f"Mini:   {eval_mini['accuracy']:.1%}")
# If mini is 98% accurate (vs GPT-4o's 99%), use mini and save money
```

Regression Testing

Before switching models in production, run A/B tests:

```python
import hashlib

# Deploy the cheaper model to 10% of traffic
def get_model_for_request(user_id: str) -> str:
    # Stable bucketing: Python's built-in hash() is randomized per process,
    # so hash the user ID explicitly to keep cohorts consistent
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10
    if bucket < 1:                # 10% of users
        return "gpt-4o-mini"      # cheaper model
    return "gpt-4o"               # current baseline

# Monitor: is the error rate of the 10% cohort above baseline?
# After 1 week: mini users at a 2% error rate (vs 1.5% baseline).
# If that 0.5% degradation is acceptable, roll out everywhere and pocket the savings.
```
Measurement Drives Decisions: Never optimize blindly. Measure quality first. Only switch models if you have data showing acceptable quality trade-off.
SECTION 07

Cost Estimation Framework

Calculate total cost of ownership for your LLM pipeline.

Python Cost Calculator

```python
class CostCalculator:
    # Example rates in $/token; check current provider pricing
    PRICING = {
        "gpt-4o": {"input": 0.03 / 1000, "output": 0.06 / 1000},
        "gpt-4o-mini": {"input": 0.00015 / 1000, "output": 0.0006 / 1000},
        "claude-3.5-sonnet": {"input": 3 / 1_000_000, "output": 15 / 1_000_000},
        "local-llama": {"input": 0, "output": 0},
    }

    def estimate_per_request(self, prompt_tokens: int,
                             completion_tokens: int, model: str) -> float:
        """Cost of a single request."""
        pricing = self.PRICING[model]
        return (prompt_tokens * pricing["input"]
                + completion_tokens * pricing["output"])

    def monthly_cost(self, requests_per_month: int, model: str,
                     avg_prompt: int = 500, avg_completion: int = 200) -> float:
        """Estimated monthly cost."""
        cost_per_req = self.estimate_per_request(avg_prompt, avg_completion, model)
        return requests_per_month * cost_per_req

    def cascade_cost(self, requests: int, cheap_model: str,
                     expensive_model: str,
                     cheap_success_rate: float = 0.8) -> float:
        """Cost of the cascade strategy."""
        # Every request tries the cheap model; only failures escalate
        cheap_cost = self.monthly_cost(requests, cheap_model)
        expensive_cost = self.monthly_cost(
            requests * (1 - cheap_success_rate), expensive_model)
        return cheap_cost + expensive_cost

# Usage
calc = CostCalculator()
single_model = calc.monthly_cost(1_000_000, "gpt-4o")
cascade = calc.cascade_cost(1_000_000, "gpt-4o-mini", "gpt-4o",
                            cheap_success_rate=0.8)
print(f"Single GPT-4o: ${single_model:,.0f}/month")
print(f"Cascade:       ${cascade:,.0f}/month")
print(f"Savings: {(1 - cascade / single_model) * 100:.0f}%")
```

Monitoring Dashboard

Track cost over time:

```python
from datetime import datetime

def log_api_call(model: str, prompt_tokens: int,
                 completion_tokens: int, cost: float):
    """Log every API call (db: your database client)."""
    db.insert("api_calls", {
        "timestamp": datetime.now(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost": cost,
    })

# Daily report (MySQL syntax)
daily_cost = db.query("""
    SELECT model, SUM(cost) AS total, COUNT(*) AS calls
    FROM api_calls
    WHERE DATE(timestamp) = CURDATE()
    GROUP BY model
""")

print("Daily Cost Report:")
for row in daily_cost:
    print(f"{row['model']}: ${row['total']:.2f} ({row['calls']} calls)")
```
Forecast & Budget: Use historical cost data to forecast monthly spend. Set budget alerts. If spend exceeds the projection by more than 20%, investigate what changed (more traffic? a worse cheap-model success rate?).
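The 20%-over-projection alert can be expressed as a simple pacing check. This is a minimal sketch; the function name, parameters, and tolerance are illustrative:

```python
# Budget alert sketch: flag when month-to-date spend is pacing more
# than 20% above the monthly budget. Names and numbers are illustrative.
def budget_alert(spend_to_date: float, day_of_month: int,
                 days_in_month: int, monthly_budget: float,
                 tolerance: float = 0.20) -> bool:
    """Return True if projected month-end spend exceeds budget by more than tolerance."""
    projected = spend_to_date / day_of_month * days_in_month
    return projected > monthly_budget * (1 + tolerance)

# $500 spent by day 10 of 30 projects to $1,500 by month-end:
# 50% over a $1,000 budget, so this should fire
over = budget_alert(spend_to_date=500, day_of_month=10,
                    days_in_month=30, monthly_budget=1000)
print("Alert!" if over else "On track")
```

A linear projection like this is crude (it ignores weekday/weekend traffic patterns), but it is enough to catch runaway spend early.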
SECTION 08

Model Selection Matrix

Picking the right model tier for each task type is the highest-leverage cost optimization. The matrix below maps common GenAI task types to the recommended model tier, the expected quality retention when downgrading, and the primary failure mode to watch for.

| Task Type | Recommended Tier | Quality at Lower Tier | Primary Failure Mode |
|---|---|---|---|
| Simple classification / routing | Nano / Haiku | 95%+ with good prompts | Edge-case misclassification |
| Structured extraction (JSON) | Haiku / Flash | 90%+ with schema hints | Schema non-compliance on complex fields |
| RAG answer synthesis | Sonnet / Pro | 80%: noticeable quality drop | Hallucination on gaps in context |
| Multi-step reasoning / math | Sonnet / Pro | 60–70%: significant drop | Logical errors in reasoning chain |
| Long-document summarization | Haiku (with map-reduce) | 85%+ if chunked well | Losing key details in reduce step |
| Code generation / review | Sonnet / Pro | 70% for simple code only | Subtle logic errors, wrong APIs |

Use this matrix as a starting point, then validate on your actual data distribution — quality retention percentages vary significantly by domain. For regulated applications where quality failure has real consequences (medical, legal, financial), always benchmark the downgraded model on a representative sample before committing to a tier change.