SECTION 01
The Iron Triangle
Like the project management iron triangle (scope, time, cost), LLM systems face an impossible trade-off: cost, quality, and speed are in fundamental tension. You can't optimize all three simultaneously.
The Three Dimensions
- Cost: $ per request. GPT-4 ($0.03/1K input, i.e. $30/1M) vs Claude 3 Haiku ($0.80/1M input) is roughly a 40× difference
- Quality: Accuracy on your task. Correctness, nuance, reasoning capability
- Speed: Latency. GPT-4 takes 2s, local Llama takes 10s, batched inference takes 60s
The Tensions
- Expensive = Better Quality: GPT-4 > GPT-4o-mini. But higher cost per request.
- Cheap = Slower: Local Llama is free but inference slow. SaaS has better speed but costs money.
- Speed = High Cost: Real-time streaming requires API calls (expensive). Batch processing is cheaper but slow.
- Quality = Speed Sacrifice: Complex reasoning takes longer. Simple heuristics are fast but low quality.
Visual: The Triangle
Imagine a triangle with Cost, Quality, Speed at corners. You can optimize for two, not all three:
- Quality + Speed: Use GPT-4 in real-time. Expensive.
- Quality + Cost: Use local fine-tuned models. Slow inference.
- Speed + Cost: Use cheap small models. Low quality.
Core Strategy: You don't optimize one dimension; you optimize the whole system. Use the right model for each request type. Route cheap tasks to small models, complex tasks to GPT-4, batch analysis to local inference.
SECTION 02
Model Tiering Strategy
Instead of one model for everything, use a three-tier system: frontier (best), mid-tier (balanced), cheap (volume).
Tier 1: Frontier Models (Complex Tasks)
- Examples: GPT-4o ($0.03/1K input), Claude 3.5 Sonnet ($3/1M input)
- Cost: High ($0.01 to $0.10 per request depending on input size)
- Quality: Excellent reasoning, nuance, code, long context
- Use For: High-value tasks where errors are costly. Code generation, complex analysis, reasoning
Tier 2: Mid-Tier Models (Balanced)
- Examples: GPT-4o-mini ($0.00015/1K input), Claude 3 Haiku ($0.80/1M input)
- Cost: 10 to 100× cheaper than frontier
- Quality: Good for routine tasks, classification, moderate-complexity reasoning
- Use For: Volume tasks where cost matters. Customer support, content generation, summarization
Tier 3: Cheap Models (Volume)
- Examples: Llama 2/3 local ($0, time cost only), Mixtral MoE ($0.27/1M)
- Cost: Essentially free (local) or pennies
- Quality: Adequate for simple tasks, low reasoning
- Use For: High-volume, low-cost operations. Spam filtering, keyword extraction, simple classification
Routing Logic
def choose_model(task_complexity: str) -> str:
    """Route to the appropriate model tier."""
    if task_complexity == "complex":
        # Code generation, deep reasoning
        return "gpt-4o"
    elif task_complexity == "medium":
        # Summarization, categorization
        return "gpt-4o-mini"
    else:
        # Spam filter, simple classification
        return "local-llama-3"

# Example: email classification
task = "classify_email_spam"
model = choose_model("low")  # → local-llama-3
response = call_model(model, email_text)
Cost Breakdown Example
1M API requests/month distributed by tier (assuming ~1K tokens per request):
- 5% (50K) complex → GPT-4o @ $0.03/1K = $1,500
- 35% (350K) medium → mini @ $0.00015/1K = $52.50
- 60% (600K) simple → local (free) = $0
- Total: ~$1,550/month (vs $30,000+ if all GPT-4o)
Best Practice: Start with all requests on mid-tier (good cost/quality balance). Monitor error rates. Escalate failures to frontier, demote success to cheap. Let data guide routing.
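The escalate/demote loop described above can be sketched as a per-task success tracker. This is a minimal sketch: the tier names, window size, and thresholds are all illustrative assumptions, not fixed values.

```python
from collections import defaultdict

TIERS = ["local-llama-3", "gpt-4o-mini", "gpt-4o"]  # cheap → frontier
stats = defaultdict(lambda: {"tier": 1, "ok": 0, "total": 0})  # start mid-tier

def record_outcome(task_type: str, success: bool, window: int = 100):
    """Track outcomes per task type; escalate failing tasks, demote easy ones."""
    s = stats[task_type]
    s["total"] += 1
    s["ok"] += int(success)
    if s["total"] >= window:
        rate = s["ok"] / s["total"]
        if rate < 0.90 and s["tier"] < len(TIERS) - 1:
            s["tier"] += 1   # too many failures: escalate to a better model
        elif rate > 0.99 and s["tier"] > 0:
            s["tier"] -= 1   # near-perfect: demote to a cheaper tier
        s["ok"] = s["total"] = 0  # reset the measurement window

def model_for(task_type: str) -> str:
    """Current model tier for a task type."""
    return TIERS[stats[task_type]["tier"]]
```

Tasks start on the mid-tier and drift toward the cheapest tier that still meets the success threshold, which is exactly the "let data guide routing" idea.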
SECTION 03
Prompt Efficiency
Every token costs money. Optimize prompts to reduce tokens without losing quality.
Token Reduction Techniques
- Remove filler: "Please provide a comprehensive analysis" → "Analyze"
- Use examples sparingly: 1 example >> 5 examples. Include only if critical.
- Compress context: Summarize documents before inserting. RAG: retrieve only relevant chunks.
- Template prompts: Pre-compile system message into single token-efficient block.
- Structured input: JSON or YAML is more token-efficient than prose descriptions.
Example: Before vs After
# BEFORE (inefficient): ~60 tokens
prompt = """
Please provide a detailed and comprehensive analysis of the following
customer feedback. You should carefully consider all aspects, including
but not limited to sentiment, key themes, actionable insights, and
recommendations for improvement. Please be thorough and detailed in
your response.
Customer feedback: {feedback}
"""

# AFTER (optimized): ~15 tokens (roughly a 75% reduction)
prompt = """Analyze customer feedback:
{feedback}
Output: sentiment, themes, insights, recommendations"""
RAG Token Optimization
In RAG systems, context is expensive. Optimize retrieval:
# Bad: retrieve 10 full documents, insert them all
docs = retrieve_documents(query, top_k=10)
prompt = f"Context:\n{docs}\n\nQuestion: {query}"
# Cost: thousands of tokens just for context

# Good: retrieve only the most relevant, then compress
docs = retrieve_documents(query, top_k=3)
summaries = [compress_doc(doc, 100) for doc in docs]
prompt = f"Context:\n{summaries}\n\nQuestion: {query}"
# Cost: ~300 tokens for context
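compress_doc is left abstract above. A minimal sketch, assuming a word budget as a crude token proxy (a real system would use an extractive or LLM-based summarizer instead of truncation):

```python
def compress_doc(doc: str, max_words: int) -> str:
    """Crude compression: keep only the first max_words words.
    Placeholder for a proper summarizer; words approximate tokens."""
    words = doc.split()
    if len(words) <= max_words:
        return doc
    return " ".join(words[:max_words]) + " ..."
```

Truncation preserves the opening of each chunk, which often carries the key facts in retrieved passages, but a summarizer preserves meaning far better at the same budget.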
Cost Calculation per Task
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  model: str = "gpt-4o") -> float:
    """Estimate the cost of a single API call."""
    if model == "gpt-4o":
        input_cost = prompt_tokens * (0.03 / 1000)
        output_cost = completion_tokens * (0.06 / 1000)
    elif model == "gpt-4o-mini":
        input_cost = prompt_tokens * (0.00015 / 1000)
        output_cost = completion_tokens * (0.0006 / 1000)
    else:
        raise ValueError(f"Unknown model: {model}")
    return input_cost + output_cost

# Example: 500-token prompt + 200-token response
cost_gpt4 = estimate_cost(500, 200, "gpt-4o")       # $0.027
cost_mini = estimate_cost(500, 200, "gpt-4o-mini")  # $0.000195
# Using mini saves over 99% on this call
ROI of Optimization: Spend 1 hour optimizing a prompt, save $1/day in API costs. Over a year, that's $365. Worth it for any production system.
SECTION 04
LLM Cascade Pattern
The most powerful pattern: try a cheap model first, escalate to an expensive one on failure. It balances cost, quality, and speed.
The Logic
- Request comes in
- Try cheap model (fast, cheap)
- Check confidence/quality
- If good enough → return the result (money saved)
- If not → try the expensive model
- Return best result
Implementation with Confidence Scoring
def classify_with_cascade(text: str) -> str:
    """Cascade: cheap first, expensive fallback."""
    # Tier 1: fast, cheap model (ask for a JSON label + confidence)
    response_cheap = call_model(
        model="gpt-4o-mini",
        prompt=f'Classify. Return JSON {{"label": ..., "confidence": ...}}: {text}'
    )
    classification = parse_json(response_cheap)
    confidence = classification.get("confidence", 0)

    # High confidence: use the cheap result
    if confidence >= 0.9:
        return classification["label"]

    # Tier 2: expensive, accurate model
    response_expensive = call_model(
        model="gpt-4o",
        prompt=f"Carefully classify: {text}"
    )
    classification = parse_json(response_expensive)
    return classification["label"]
Cost-Benefit Analysis
Assume 1M classification requests/month, with short prompts costing ~$0.0015/req on gpt-4o and ~$0.00015/req on mini:
- No cascade: all on gpt-4o = $0.0015/req = $1,500/month
- With cascade: every request pays mini, and the ~20% that escalate also pay gpt-4o: $0.00015 + 0.2 × $0.0015 = $0.00045/req = $450/month
- Savings: ~$1,050/month (a 70% reduction), with minimal quality loss as long as escalation catches the cheap model's failures
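A simple way to model cascade cost, assuming every request pays for the cheap call and escalated requests additionally pay for the expensive one (numbers illustrative):

```python
def cascade_cost_per_request(cheap_cost: float, expensive_cost: float,
                             escalation_rate: float) -> float:
    """Every request pays the cheap model; escalated ones also pay the expensive one."""
    return cheap_cost + escalation_rate * expensive_cost

per_req = cascade_cost_per_request(0.00015, 0.0015, 0.2)
print(f"${per_req * 1_000_000:,.0f}/month for 1M requests")  # $450/month
```

Note that escalated requests are double-billed (cheap attempt plus expensive retry), so a cascade only pays off when the escalation rate is low relative to the price gap.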
Cascading for Different Task Types
def route_to_model(task_type: str) -> dict:
    """Pick a cascade strategy per task type."""
    if task_type == "spam_detection":
        # Cheap classifier is ~95% accurate: use cascade
        return {"primary": "cheap", "threshold": 0.85}
    elif task_type == "legal_analysis":
        # High stakes: always use the expensive model
        return {"primary": "expensive", "threshold": None}
    elif task_type == "summarization":
        # Quality matters but cheap is usually OK: cascade
        return {"primary": "cheap", "threshold": 0.7}
The Cascade Mindset: Don't assume the expensive model is always better for your task. Measure. For 80% of requests, the cheap model works fine. Pay for expensive intelligence only when needed.
SECTION 05
Caching Strategies
Avoid API calls entirely with caching. The cheapest request is the one you don't make.
Exact-Match Cache
Cache identical prompts:
import hashlib
from openai import OpenAI

client = OpenAI()
cache = {}  # Or Redis for prod

def cached_completion(prompt: str, model: str) -> str:
    # Hash the prompt for a cache key (MD5 is fine for cache keys)
    key = f"{model}:{hashlib.md5(prompt.encode()).hexdigest()}"
    # Check the cache first
    if key in cache:
        print("Cache hit!")
        return cache[key]
    # Cache miss: call the API
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    cache[key] = result
    return result
# Usage
response = cached_completion("What is photosynthesis?", "gpt-4o")
response = cached_completion("What is photosynthesis?", "gpt-4o")
# 2nd call hits cache, no API charge
Prompt Prefix Caching (OpenAI/Anthropic Native)
Cache common prompt prefixes (like the system message). OpenAI applies prefix caching automatically for sufficiently long repeated prefixes; Anthropic requires explicit cache_control markers:
import anthropic

client = anthropic.Anthropic()
system_prompt = """You are an expert assistant.
Answer questions accurately and concisely."""

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "What is AI?"}],
)

# A second request with the same system prefix reads it from cache
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}}],
    messages=[{"role": "user", "content": "What is ML?"}],
)
# The cached prefix is billed at a steep discount; only new tokens pay full price
Semantic Caching (Vector-Based)
Cache responses to semantically similar prompts:
from sentence_transformers import SentenceTransformer
import numpy as np
from openai import OpenAI

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache = []  # List of (embedding, response) pairs; use a vector DB in production

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_cached_completion(prompt: str) -> str:
    # Embed the prompt
    embedding = embedder.encode(prompt)
    # Look for a semantically similar cached prompt
    for cached_embedding, cached_response in cache:
        if cosine_similarity(embedding, cached_embedding) > 0.95:
            print("Semantic cache hit!")
            return cached_response
    # No match: call the API
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.choices[0].message.content
    cache.append((embedding, result))
    return result
# Usage
response1 = semantic_cached_completion("What is photosynthesis?")
response2 = semantic_cached_completion("Explain photosynthesis")
# The 2nd query is semantically close, so it will likely hit the cache
Cache Strategy: Use exact-match for identical requests, semantic caching for similar questions, and prefix caching for system messages. Combined, these can cut API calls substantially (often 60 to 80% in read-heavy workloads).
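The combined lookup order can be sketched as follows. semantic_lookup and call_api are caller-supplied stand-ins (not real library APIs); the ordering goes from cheapest check to most expensive:

```python
import hashlib

exact_cache = {}  # In production: Redis with a TTL

def lookup(prompt: str, semantic_lookup, call_api) -> str:
    """Tiered lookup: exact-match cache, then semantic cache, then the API."""
    key = hashlib.md5(prompt.encode()).hexdigest()
    if key in exact_cache:           # 1. Exact hit: free
        return exact_cache[key]
    hit = semantic_lookup(prompt)    # 2. Semantic hit: embedding cost only
    if hit is not None:
        return hit
    result = call_api(prompt)        # 3. Miss: full API price
    exact_cache[key] = result
    return result
```

Checking the exact cache first matters because a hash lookup is effectively free, while the semantic check pays for an embedding on every miss.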
SECTION 06
Evaluation-Driven Optimization
Measure quality before/after model changes. Don't degrade quality blindly to save cost.
Build a Benchmark
# Example: Customer support classification (in/out of scope)
benchmark = [
{"text": "How do I reset my password?", "expected": "in-scope"},
{"text": "Can you tell me the meaning of life?", "expected": "out-of-scope"},
{"text": "My order hasn't arrived", "expected": "in-scope"},
# ... 100+ examples
]
def evaluate_model(model_name: str, benchmark: list) -> dict:
    """Test a model on the benchmark and measure accuracy."""
    correct = 0
    for item in benchmark:
        response = call_model(model_name, item["text"])
        classification = parse_json(response)
        if classification["label"] == item["expected"]:
            correct += 1
    accuracy = correct / len(benchmark)
    return {"model": model_name, "accuracy": accuracy}
# Compare models
eval_gpt4 = evaluate_model("gpt-4o", benchmark)
eval_mini = evaluate_model("gpt-4o-mini", benchmark)
print(f"GPT-4: {eval_gpt4['accuracy']:.1%}")
print(f"Mini: {eval_mini['accuracy']:.1%}")
# If mini is 98% accurate (vs GPT-4's 99%), use mini + save money
Regression Testing
Before switching models in production, run A/B tests:
# Deploy the new cheap model to 10% of traffic
import hashlib

def get_model_for_request(user_id: str) -> str:
    # Stable bucketing: Python's built-in hash() varies between runs
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10
    if bucket < 1:  # 10% of users
        return "gpt-4o-mini"  # Cheaper model
    else:
        return "gpt-4o"  # Current baseline
# Monitor: is error rate of 10% cohort > baseline?
# After 1 week: mini users have 2% error rate (vs 1.5% baseline)
# 0.5% degradation acceptable? Use everywhere and save 66%
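The week-one readout above can be turned into an explicit go/no-go rule. A minimal sketch, where the 0.5% absolute degradation budget is an illustrative choice, not a standard:

```python
def acceptable_switch(baseline_errors: int, baseline_total: int,
                      cohort_errors: int, cohort_total: int,
                      max_degradation: float = 0.005) -> bool:
    """Approve the cheaper model if its error rate is within
    max_degradation (absolute) of the baseline's."""
    baseline_rate = baseline_errors / baseline_total
    cohort_rate = cohort_errors / cohort_total
    return cohort_rate - baseline_rate <= max_degradation

print(acceptable_switch(150, 10_000, 18, 1_000))  # True: 1.8% vs 1.5% baseline
print(acceptable_switch(150, 10_000, 30, 1_000))  # False: 3.0% is over budget
```

With real traffic you would also check that the cohort is large enough for the difference to be statistically meaningful before acting on it.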
Measurement Drives Decisions: Never optimize blindly. Measure quality first. Only switch models if you have data showing acceptable quality trade-off.
SECTION 07
Cost Estimation Framework
Calculate total cost of ownership for your LLM pipeline.
Python Cost Calculator
class CostCalculator:
    PRICING = {
        "gpt-4o": {"input": 0.03 / 1000, "output": 0.06 / 1000},
        "gpt-4o-mini": {"input": 0.00015 / 1000, "output": 0.0006 / 1000},
        "claude-3.5-sonnet": {"input": 3 / 1_000_000, "output": 15 / 1_000_000},
        "local-llama": {"input": 0, "output": 0},
    }

    def estimate_per_request(self, prompt_tokens: int,
                             completion_tokens: int, model: str) -> float:
        """Cost of a single request."""
        pricing = self.PRICING[model]
        return (prompt_tokens * pricing["input"] +
                completion_tokens * pricing["output"])

    def monthly_cost(self, requests_per_month: int,
                     model: str, avg_prompt: int = 500,
                     avg_completion: int = 200) -> float:
        """Estimated monthly cost."""
        cost_per_req = self.estimate_per_request(avg_prompt, avg_completion, model)
        return requests_per_month * cost_per_req

    def cascade_cost(self, requests: int, cheap_model: str,
                     expensive_model: str, cheap_success_rate: float = 0.8) -> float:
        """Cost of the cascade strategy: every request pays the cheap model,
        and failures additionally pay the expensive one."""
        cheap_cost = self.monthly_cost(requests, cheap_model)
        expensive_cost = self.monthly_cost(requests * (1 - cheap_success_rate),
                                           expensive_model)
        return cheap_cost + expensive_cost
# Usage
calc = CostCalculator()
single_model = calc.monthly_cost(1_000_000, "gpt-4o")
cascade = calc.cascade_cost(1_000_000, "gpt-4o-mini", "gpt-4o", cheap_success_rate=0.8)
print(f"Single GPT-4: ${single_model:,.0f}/month")
print(f"Cascade: ${cascade:,.0f}/month")
print(f"Savings: {(1 - cascade/single_model)*100:.0f}%")
Monitoring Dashboard
Track cost over time:
from datetime import datetime

def log_api_call(model: str, prompt_tokens: int,
                 completion_tokens: int, cost: float):
    """Log every API call (db is your application's database handle)."""
    db.insert("api_calls", {
        "timestamp": datetime.now(),
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "cost": cost
    })

# Daily report (MySQL syntax; adjust CURDATE() for your database)
daily_cost = db.query("""
    SELECT SUM(cost) AS total, model, COUNT(*) AS calls
    FROM api_calls
    WHERE DATE(timestamp) = CURDATE()
    GROUP BY model
""")
print("Daily Cost Report:")
for row in daily_cost:
    print(f"{row['model']}: ${row['total']:.2f} ({row['calls']} calls)")
Forecast & Budget: Use historical cost data to forecast monthly spend and set budget alerts. If spend exceeds the projection by 20%, investigate what changed (more traffic? a worse cheap-model success rate?).
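The alert rule above can be sketched as a one-line check (the 20% tolerance is the illustrative threshold from the text):

```python
def budget_alert(actual_spend: float, projected_spend: float,
                 tolerance: float = 0.20) -> bool:
    """Fire an alert when actual spend exceeds projection by more than tolerance."""
    return actual_spend > projected_spend * (1 + tolerance)

print(budget_alert(1300.0, 1000.0))  # True: 30% over projection
print(budget_alert(1100.0, 1000.0))  # False: within the 20% budget
```

Run this daily against the cost log so a routing regression or traffic spike surfaces within a day rather than on the monthly invoice.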