SECTION 01
The Iron Triangle
Like the project management iron triangle (scope, time, cost), LLM systems face an impossible trade-off: cost, quality, and speed are in fundamental tension. You can't optimize all three simultaneously.
The Three Dimensions
- Cost: $ per request. GPT-4 ($0.03/1K input, i.e. $30/1M) vs Claude 3 Haiku ($0.80/1M input) is roughly a 40× difference
- Quality: Accuracy on your task. Correctness, nuance, reasoning capability
- Speed: Latency. GPT-4 takes 2s, local Llama takes 10s, batched inference takes 60s
The Tensions
- Expensive = Better Quality: GPT-4 > GPT-4o-mini. But higher cost per request.
- Cheap = Slower: Local Llama is free but inference slow. SaaS has better speed but costs money.
- Speed = High Cost: Real-time streaming requires API calls (expensive). Batch processing is cheaper but slow.
- Quality = Speed Sacrifice: Complex reasoning takes longer. Simple heuristics are fast but low quality.
Visual: The Triangle
Imagine a triangle with Cost, Quality, Speed at corners. You can optimize for two, not all three:
- Quality + Speed: Use GPT-4 in real-time. Expensive.
- Quality + Cost: Use local fine-tuned models. Slow inference.
- Speed + Cost: Use cheap small models. Low quality.
Core Strategy: You don't optimize one dimension; you optimize the whole system. Use the right model for each request type. Route cheap tasks to small models, complex tasks to GPT-4, batch analysis to local inference.
SECTION 02
Model Tiering Strategy
Instead of one model for everything, use a three-tier system: frontier (best), mid-tier (balanced), cheap (volume).
Tier 1: Frontier Models (Complex Tasks)
- Examples: GPT-4o ($0.03/1K input), Claude 3.5 Sonnet ($3/1M input)
- Cost: High ($0.01 to $0.10 per request depending on input size)
- Quality: Excellent reasoning, nuance, code, long context
- Use For: High-value tasks where errors are costly. Code generation, complex analysis, reasoning
Tier 2: Mid-Tier Models (Balanced)
- Examples: GPT-4o-mini ($0.00015/1K input), Claude 3 Haiku ($0.80/1M input)
- Cost: 10 to 100× cheaper than frontier
- Quality: Good for routine tasks, classification, moderate-complexity reasoning
- Use For: Volume tasks where cost matters. Customer support, content generation, summarization
Tier 3: Cheap Models (Volume)
- Examples: Llama 2/3 local ($0, time cost only), Mixtral MoE ($0.27/1M)
- Cost: Essentially free (local) or pennies
- Quality: Adequate for simple tasks, low reasoning
- Use For: High-volume, low-cost operations. Spam filtering, keyword extraction, simple classification
Routing Logic
def choose_model(task_complexity: str) -> str:
    """Route to the appropriate model tier."""
    if task_complexity == "complex":
        # Code generation, deep reasoning
        return "gpt-4o"
    elif task_complexity == "medium":
        # Summarization, categorization
        return "gpt-4o-mini"
    else:
        # Spam filter, simple classification
        return "local-llama-3"

# Example: email classification
task = "classify_email_spam"
model = choose_model("low")  # → local-llama-3
response = call_model(model, email_text)
Cost Breakdown Example
1M API requests/month distributed by tier (assuming ~1K tokens per request):
- 5% (50K) complex → GPT-4o @ $0.03/1K = $1,500
- 35% (350K) medium → mini @ $0.00015/1K = $52.50
- 60% (600K) simple → local (free) = $0
- Total: ~$1,550/month (vs $30,000+ if all GPT-4o)
Best Practice: Start with all requests on mid-tier (good cost/quality balance). Monitor error rates. Escalate failures to frontier, demote success to cheap. Let data guide routing.
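The escalate/demote loop described above can be sketched as a per-task success tracker. This is a minimal sketch: the tier names, window size, and thresholds are all illustrative assumptions, not fixed values.

```python
from collections import defaultdict

TIERS = ["local-llama-3", "gpt-4o-mini", "gpt-4o"]  # cheap → frontier
stats = defaultdict(lambda: {"tier": 1, "ok": 0, "total": 0})  # start mid-tier

def record_outcome(task_type: str, success: bool, window: int = 100):
    """Track outcomes per task type; escalate failing tasks, demote easy ones."""
    s = stats[task_type]
    s["total"] += 1
    s["ok"] += int(success)
    if s["total"] >= window:
        rate = s["ok"] / s["total"]
        if rate < 0.90 and s["tier"] < len(TIERS) - 1:
            s["tier"] += 1   # too many failures: escalate to a better model
        elif rate > 0.99 and s["tier"] > 0:
            s["tier"] -= 1   # near-perfect: demote to a cheaper tier
        s["ok"] = s["total"] = 0  # reset the measurement window

def model_for(task_type: str) -> str:
    """Current model tier for a task type."""
    return TIERS[stats[task_type]["tier"]]
```

Tasks start on the mid-tier and drift toward the cheapest tier that still meets the success threshold, which is exactly the "let data guide routing" idea.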
SECTION 03
Prompt Efficiency
Every token costs money. Optimize prompts to reduce tokens without losing quality.
Token Reduction Techniques
- Remove filler: "Please provide a comprehensive analysis" → "Analyze"
- Use examples sparingly: 1 example >> 5 examples. Include only if critical.
- Compress context: Summarize documents before inserting. RAG: retrieve only relevant chunks.
- Template prompts: Pre-compile system message into single token-efficient block.
- Structured input: JSON or YAML is more token-efficient than prose descriptions.
Example: Before vs After
# BEFORE (inefficient): ~60 tokens
prompt = """
Please provide a detailed and comprehensive analysis of the following
customer feedback. You should carefully consider all aspects, including
but not limited to sentiment, key themes, actionable insights, and
recommendations for improvement. Please be thorough and detailed in
your response.
Customer feedback: {feedback}
"""

# AFTER (optimized): ~15 tokens (roughly a 75% reduction)
prompt = """Analyze customer feedback:
{feedback}
Output: sentiment, themes, insights, recommendations"""
RAG Token Optimization
In RAG systems, context is expensive. Optimize retrieval:
# Bad: retrieve 10 full documents, insert them all
docs = retrieve_documents(query, top_k=10)
prompt = f"Context:\n{docs}\n\nQuestion: {query}"
# Cost: thousands of tokens just for context

# Good: retrieve only the most relevant, then compress
docs = retrieve_documents(query, top_k=3)
summaries = [compress_doc(doc, 100) for doc in docs]
prompt = f"Context:\n{summaries}\n\nQuestion: {query}"
# Cost: ~300 tokens for context
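compress_doc is left abstract above. A minimal sketch, assuming a word budget as a crude token proxy (a real system would use an extractive or LLM-based summarizer instead of truncation):

```python
def compress_doc(doc: str, max_words: int) -> str:
    """Crude compression: keep only the first max_words words.
    Placeholder for a proper summarizer; words approximate tokens."""
    words = doc.split()
    if len(words) <= max_words:
        return doc
    return " ".join(words[:max_words]) + " ..."
```

Truncation preserves the opening of each chunk, which often carries the key facts in retrieved passages, but a summarizer preserves meaning far better at the same budget.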
Cost Calculation per Task
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  model: str = "gpt-4o") -> float:
    """Estimate the cost of a single API call."""
    if model == "gpt-4o":
        input_cost = prompt_tokens * (0.03 / 1000)
        output_cost = completion_tokens * (0.06 / 1000)
    elif model == "gpt-4o-mini":
        input_cost = prompt_tokens * (0.00015 / 1000)
        output_cost = completion_tokens * (0.0006 / 1000)
    else:
        raise ValueError(f"Unknown model: {model}")
    return input_cost + output_cost

# Example: 500-token prompt + 200-token response
cost_gpt4 = estimate_cost(500, 200, "gpt-4o")       # $0.027
cost_mini = estimate_cost(500, 200, "gpt-4o-mini")  # $0.000195
# Using mini saves over 99% on this call
ROI of Optimization: Spend 1 hour optimizing a prompt, save $1/day in API costs. Over a year, that's $365. Worth it for any production system.
SECTION 04
LLM Cascade Pattern
The most powerful pattern: try a cheap model first, escalate to an expensive one on failure. It balances cost, quality, and speed.
The Logic
- Request comes in
- Try cheap model (fast, cheap)
- Check confidence/quality
- If good enough → return the result (money saved)
- If not → try the expensive model
- Return best result
Implementation with Confidence Scoring
def classify_with_cascade(text: str) -> str:
    """Cascade: cheap first, expensive fallback."""
    # Tier 1: fast, cheap model (ask for a JSON label + confidence)
    response_cheap = call_model(
        model="gpt-4o-mini",
        prompt=f'Classify. Return JSON {{"label": ..., "confidence": ...}}: {text}'
    )
    classification = parse_json(response_cheap)
    confidence = classification.get("confidence", 0)

    # High confidence: use the cheap result
    if confidence >= 0.9:
        return classification["label"]

    # Tier 2: expensive, accurate model
    response_expensive = call_model(
        model="gpt-4o",
        prompt=f"Carefully classify: {text}"
    )
    classification = parse_json(response_expensive)
    return classification["label"]
Cost-Benefit Analysis
Assume 1M classification requests/month, with short prompts costing ~$0.0015/req on gpt-4o and ~$0.00015/req on mini:
- No cascade: all on gpt-4o = $0.0015/req = $1,500/month
- With cascade: every request pays mini, and the ~20% that escalate also pay gpt-4o: $0.00015 + 0.2 × $0.0015 = $0.00045/req = $450/month
- Savings: ~$1,050/month (a 70% reduction), with minimal quality loss as long as escalation catches the cheap model's failures
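A simple way to model cascade cost, assuming every request pays for the cheap call and escalated requests additionally pay for the expensive one (numbers illustrative):

```python
def cascade_cost_per_request(cheap_cost: float, expensive_cost: float,
                             escalation_rate: float) -> float:
    """Every request pays the cheap model; escalated ones also pay the expensive one."""
    return cheap_cost + escalation_rate * expensive_cost

per_req = cascade_cost_per_request(0.00015, 0.0015, 0.2)
print(f"${per_req * 1_000_000:,.0f}/month for 1M requests")  # $450/month
```

Note that escalated requests are double-billed (cheap attempt plus expensive retry), so a cascade only pays off when the escalation rate is low relative to the price gap.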
Cascading for Different Task Types
def route_to_model(task_type: str) -> dict:
    """Pick a cascade strategy per task type."""
    if task_type == "spam_detection":
        # Cheap classifier is ~95% accurate: use cascade
        return {"primary": "cheap", "threshold": 0.85}
    elif task_type == "legal_analysis":
        # High stakes: always use the expensive model
        return {"primary": "expensive", "threshold": None}
    elif task_type == "summarization":
        # Quality matters but cheap is usually OK: cascade
        return {"primary": "cheap", "threshold": 0.7}
The Cascade Mindset: Don't assume the expensive model is always better for your task. Measure. For 80% of requests, the cheap model works fine. Pay for expensive intelligence only when needed.
SECTION 05
Caching Strategies
Avoid API calls entirely with caching. The cheapest request is the one you don't make.
Exact-Match Cache
Cache identical prompts:
import hashlib
from openai import OpenAI

client = OpenAI()
cache = {}  # Or Redis for prod

def cached_completion(prompt: str, model: str) -> str:
    # Hash the prompt for a cache key (MD5 is fine for cache keys)
    key = f"{model}:{hashlib.md5(prompt.encode()).hexdigest()}"
    # Check the cache first
    if key in cache:
        print("Cache hit!")
        return cache[key]
    # Cache miss: call the API
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    cache[key] = result
    return result
# Usage
response = cached_completion("What is photosynthesis?", "gpt-4o")
response = cached_completion("What is photosynthesis?", "gpt-4o")
# 2nd call hits cache, no API charge
Prompt Prefix Caching (OpenAI/Anthropic Native)
Cache common prompt prefixes (like the system message). OpenAI applies prefix caching automatically for sufficiently long repeated prefixes; Anthropic requires explicit cache_control markers:
import anthropic

client = anthropic.Anthropic()
system_prompt = """You are an expert assistant.
Answer questions accurately and concisely."""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "What is AI?"}],
)

# A second request with the same system prefix reads it from cache
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "What is ML?"}],
)
# The cached prefix is billed at a steep discount; only new tokens pay full price
Semantic Caching (Vector-Based)
Cache responses to semantically similar prompts:
from sentence_transformers import SentenceTransformer
import numpy as np
from openai import OpenAI

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # List of (embedding, response) pairs; use a vector DB in production

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cached_completion(prompt: str) -> str:
    # Embed the prompt
    embedding = embedder.encode(prompt)
    # Look for a semantically similar cached prompt
    for cached_embedding, cached_response in cache:
        if cosine_similarity(embedding, cached_embedding) > 0.95:
            print("Semantic cache hit!")
            return cached_response
    # No match: call the API
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    cache.append((embedding, result))
    return result
# Usage
response1 = semantic_cached_completion("What is photosynthesis?")
response2 = semantic_cached_completion("Explain photosynthesis")
# The 2nd query is semantically close, so it will likely hit the cache
Cache Strategy: Use exact-match for identical requests, semantic caching for similar questions, and prefix caching for system messages. Combined, these can cut API calls substantially (often 60 to 80% in read-heavy workloads).
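The combined lookup order can be sketched as follows. semantic_lookup and call_api are caller-supplied stand-ins (not real library APIs); the ordering goes from cheapest check to most expensive:

```python
import hashlib

exact_cache = {}  # In production: Redis with a TTL

def lookup(prompt: str, semantic_lookup, call_api) -> str:
    """Tiered lookup: exact-match cache, then semantic cache, then the API."""
    key = hashlib.md5(prompt.encode()).hexdigest()
    if key in exact_cache:           # 1. Exact hit: free
        return exact_cache[key]
    hit = semantic_lookup(prompt)    # 2. Semantic hit: embedding cost only
    if hit is not None:
        return hit
    result = call_api(prompt)        # 3. Miss: full API price
    exact_cache[key] = result
    return result
```

Checking the exact cache first matters because a hash lookup is effectively free, while the semantic check pays for an embedding on every miss.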
SECTION 06
Evaluation-Driven Optimization
Measure quality before/after model changes. Don't degrade quality blindly to save cost.
Build a Benchmark
# Example: Customer support classification (in/out of scope)
benchmark = [
{"text": "How do I reset my password?", "expected": "in-scope"},
{"text": "Can you tell me the meaning of life?", "expected": "out-of-scope"},
{"text": "My order hasn't arrived", "expected": "in-scope"},
# ... 100+ examples
]
def evaluate_model(model_name: str, benchmark: list) -> dict:
    """Test a model on the benchmark and measure accuracy."""
    correct = 0
    for item in benchmark:
        response = call_model(model_name, item["text"])
        classification = parse_json(response)
        if classification["label"] == item["expected"]:
            correct += 1
    accuracy = correct / len(benchmark)
    return {"model": model_name, "accuracy": accuracy}
# Compare models
eval_gpt4 = evaluate_model("gpt-4o", benchmark)
eval_mini = evaluate_model("gpt-4o-mini", benchmark)
print(f"GPT-4: {eval_gpt4['accuracy']:.1%}")
print(f"Mini: {eval_mini['accuracy']:.1%}")
# If mini is 98% accurate (vs GPT-4's 99%), use mini + save money
Regression Testing
Before switching models in production, run A/B tests:
# Deploy the new cheap model to 10% of traffic
import hashlib

def get_model_for_request(user_id: str) -> str:
    # Stable bucketing: Python's built-in hash() varies between runs
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10
    if bucket < 1:  # 10% of users
        return "gpt-4o-mini"  # Cheaper model
    else:
        return "gpt-4o"  # Current baseline
# Monitor: is error rate of 10% cohort > baseline?
# After 1 week: mini users have 2% error rate (vs 1.5% baseline)
# 0.5% degradation acceptable? Use everywhere and save 66%
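The week-one readout above can be turned into an explicit go/no-go rule. A minimal sketch, where the 0.5% absolute degradation budget is an illustrative choice, not a standard:

```python
def acceptable_switch(baseline_errors: int, baseline_total: int,
                      cohort_errors: int, cohort_total: int,
                      max_degradation: float = 0.005) -> bool:
    """Approve the cheaper model if its error rate is within
    max_degradation (absolute) of the baseline's."""
    baseline_rate = baseline_errors / baseline_total
    cohort_rate = cohort_errors / cohort_total
    return cohort_rate - baseline_rate <= max_degradation

print(acceptable_switch(150, 10_000, 18, 1_000))  # True: 1.8% vs 1.5% baseline
print(acceptable_switch(150, 10_000, 30, 1_000))  # False: 3.0% is over budget
```

With real traffic you would also check that the cohort is large enough for the difference to be statistically meaningful before acting on it.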
Measurement Drives Decisions: Never optimize blindly. Measure quality first. Only switch models if you have data showing acceptable quality trade-off.
SECTION 07
Cost Estimation Framework
Calculate total cost of ownership for your LLM pipeline.
Python Cost Calculator
class CostCalculator:
    PRICING = {
        "gpt-4o": {"input": 0.03 / 1000, "output": 0.06 / 1000},
        "gpt-4o-mini": {"input": 0.00015 / 1000, "output": 0.0006 / 1000},
        "claude-3.5-sonnet": {"input": 3 / 1_000_000, "output": 15 / 1_000_000},
        "local-llama": {"input": 0, "output": 0},
    }

    def estimate_per_request(self, prompt_tokens: int,
                             completion_tokens: int, model: str) -> float:
        """Cost of a single request."""
        pricing = self.PRICING[model]
        return (prompt_tokens * pricing["input"] +
                completion_tokens * pricing["output"])

    def monthly_cost(self, requests_per_month: int,
                     model: str, avg_prompt: int = 500,
                     avg_completion: int = 200) -> float:
        """Estimated monthly cost."""
        cost_per_req = self.estimate_per_request(avg_prompt, avg_completion, model)
        return requests_per_month * cost_per_req

    def cascade_cost(self, requests: int, cheap_model: str,
                     expensive_model: str, cheap_success_rate: float = 0.8) -> float:
        """Cost of the cascade strategy: every request pays the cheap model,
        and failures additionally pay the expensive one."""
        cheap_cost = self.monthly_cost(requests, cheap_model)
        expensive_cost = self.monthly_cost(requests * (1 - cheap_success_rate),
                                           expensive_model)
        return cheap_cost + expensive_cost
# Usage
calc = CostCalculator()
single_model = calc.monthly_cost(1_000_000, "gpt-4o")
cascade = calc.cascade_cost(1_000_000, "gpt-4o-mini", "gpt-4o", cheap_success_rate=0.8)
print(f"Single GPT-4: ${single_model:,.0f}/month")
print(f"Cascade: ${cascade:,.0f}/month")
print(f"Savings: {(1 - cascade/single_model)*100:.0f}%")
Monitoring Dashboard
Track cost over time:
from datetime import datetime

def log_api_call(model: str, prompt_tokens: int,
                 completion_tokens: int, cost: float):
    """Log every API call (db is your application's database handle)."""
    db.insert("api_calls", {
        "timestamp": datetime.now(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost": cost
    })

# Daily report (MySQL syntax; adjust CURDATE() for your database)
daily_cost = db.query("""
    SELECT SUM(cost) AS total, model, COUNT(*) AS calls
    FROM api_calls
    WHERE DATE(timestamp) = CURDATE()
    GROUP BY model
""")
print("Daily Cost Report:")
for row in daily_cost:
    print(f"{row['model']}: ${row['total']:.2f} ({row['calls']} calls)")
Forecast & Budget: Use historical cost data to forecast monthly spend and set budget alerts. If spend exceeds the projection by 20%, investigate what changed (more traffic? a worse cheap-model success rate?).
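The alert rule above can be sketched as a one-line check (the 20% tolerance is the illustrative threshold from the text):

```python
def budget_alert(actual_spend: float, projected_spend: float,
                 tolerance: float = 0.20) -> bool:
    """Fire an alert when actual spend exceeds projection by more than tolerance."""
    return actual_spend > projected_spend * (1 + tolerance)

print(budget_alert(1300.0, 1000.0))  # True: 30% over projection
print(budget_alert(1100.0, 1000.0))  # False: within the 20% budget
```

Run this daily against the cost log so a routing regression or traffic spike surfaces within a day rather than on the monthly invoice.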