Every AI system exists at the intersection of three competing constraints. Optimize for two; the third suffers. Know your tradeoffs explicitly.
| Optimize For | Sacrifice | Technology Choice | Example Use Case |
|---|---|---|---|
| Quality + Cost | Latency (batch) | Batch API, overnight processing | Daily email summaries, report generation |
| Quality + Latency | Cost (expensive model) | GPT-4o with streaming | Real-time customer support, code review |
| Cost + Latency | Quality (degraded) | Smaller models, quantization | Classification, fast filtering |
# Decision framework: cost-quality-latency
class SystemOptimizer:
    def __init__(self, constraints):
        self.max_cost_per_request = constraints["max_cost"]
        self.max_latency_ms = constraints["max_latency"]
        self.min_quality = constraints["min_quality"]

    def choose_architecture(self):
        # If latency is flexible, use the batch API
        if self.max_latency_ms > 3_600_000:  # > 1 hour
            return {
                "model": "gpt-4o",
                "mode": "batch",
                "cost_per_token": 0.000003,
                "latency": "12-24 hours",
            }
        # If cost is flexible, maximize quality
        if self.max_cost_per_request > 0.10:
            return {
                "model": "gpt-4o",
                "mode": "streaming",
                "cost_per_token": 0.000015,
                "latency": "2-5 seconds",
            }
        # Cost-constrained: use a smaller model
        return {
            "model": "gpt-4o-mini",
            "mode": "streaming",
            "cost_per_token": 0.00000015,
            "latency": "1-2 seconds",
        }
02 — Knowledge Integration
RAG vs Fine-tuning Decision Matrix
Two approaches for incorporating domain knowledge. RAG is faster to implement and cheaper to start. Fine-tuning can deliver higher quality on specialized tasks but requires data preparation and ongoing maintenance.
| Dimension | RAG | Fine-tuning |
|---|---|---|
| Time to production | Weeks (build retrieval system) | Months (collect, label, train) |
| Latency impact | +500ms (vector search) | None (model behavior change) |
| Cost per inference | Baseline + retrieval | Baseline (potentially cheaper with smaller model) |
| Knowledge currency | Real-time (update documents) | Static (retrain for updates) |
| Quality ceiling | Limited by retrieval accuracy | Limited by training data quality |
| Best for | Rapidly changing domain, large knowledge base | Stable domain, specialized reasoning |
Hybrid Approach: RAG + Fine-tuning
Use RAG to inject recent knowledge. Use fine-tuning for reasoning patterns. Combine both for optimal quality.
# Hybrid RAG + fine-tuned model
class HybridKnowledgeSystem:
    def __init__(self, retriever, finetuned_model):
        self.retriever = retriever
        self.finetuned_model = finetuned_model

    def answer(self, question):
        # 1. Retrieve relevant context
        docs = self.retriever.search(question, top_k=5)
        context = "\n".join(d["text"] for d in docs)

        # 2. The fine-tuned model has learned reasoning patterns;
        #    use the same prompt format it was trained on
        prompt = f"""Question: {question}
Context: {context}
Answer:"""
        response = self.finetuned_model.generate(prompt)
        return {
            "answer": response,
            "sources": [d["source"] for d in docs],
            # estimate_confidence is assumed to be defined elsewhere
            "confidence": self.estimate_confidence(response, context),
        }
03 — Selection
Model Selection Framework
Evaluate models on capability, cost, speed, and fit for your specific task. Don't default to the largest model. Benchmark on your actual data.
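"Benchmark on your actual data" can be as simple as running each candidate over a held-out sample of real requests and comparing quality and latency side by side. A minimal sketch, where the stub "models", the toy samples, and `score_fn` are all invented for illustration:

```python
import time


def benchmark_models(models, samples, score_fn):
    """Compare candidate models on your own data, not public leaderboards.

    models: dict of name -> callable(prompt) -> response
    samples: list of {"prompt", "expected"} dicts
    score_fn: compares a response to the expected output, returns 0..1
    """
    report = {}
    for name, model in models.items():
        start = time.perf_counter()
        scores = [score_fn(model(s["prompt"]), s["expected"]) for s in samples]
        elapsed = time.perf_counter() - start
        report[name] = {
            "mean_score": sum(scores) / len(scores),
            "avg_latency_s": elapsed / len(samples),
        }
    return report


# Toy example with stub callables standing in for real API clients
samples = [{"prompt": "2+2", "expected": "4"}, {"prompt": "3+3", "expected": "6"}]
report = benchmark_models(
    {"echo": lambda p: "4", "solver": lambda p: str(eval(p))},
    samples,
    score_fn=lambda out, exp: float(out == exp),
)
```

The same harness works unchanged when the callables wrap real model endpoints, which is the point: one fixed sample set, identical scoring, per-model numbers you can defend.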
Building custom models is expensive. Buying API access is simple. Choose based on scale, differentiation, and control requirements.
| Factor | Build | Buy |
|---|---|---|
| Initial investment | High ($100K-1M) | Low ($5K-50K) |
| Time to production | 6-18 months | Weeks |
| Operational cost at scale | Low (amortized) | High (per-request) |
| Control and IP | Complete ownership | Vendor dependent |
| Differentiation value | High (unique capability) | Commodity (everyone has access) |
Break-Even Analysis
# Build vs buy cost comparison
def calculate_breakeven(api_cost_per_token, inference_volume_per_year,
                        training_cost, serving_cost_per_year):
    api_annual_cost = api_cost_per_token * inference_volume_per_year
    # Amortize training over 5 years
    custom_annual_cost = training_cost / 5 + serving_cost_per_year

    # Break-even when cumulative API savings repay the training investment
    annual_savings = api_annual_cost - serving_cost_per_year
    years_to_breakeven = (
        training_cost / annual_savings if annual_savings > 0 else float("inf")
    )
    return {
        "api_annual_cost": api_annual_cost,
        "custom_annual_cost": custom_annual_cost,
        "years_to_breakeven": years_to_breakeven,
        "recommendation": "build" if years_to_breakeven < 3 else "buy",
    }

# Example: 500B tokens per year
breakeven = calculate_breakeven(
    api_cost_per_token=0.000005,
    inference_volume_per_year=500_000_000_000,  # 500B tokens
    training_cost=500_000,
    serving_cost_per_year=100_000,
)
05 — Patterns
Agent Use Cases and Patterns
Agents excel when tasks require planning, tool use, and iterative refinement. But not all AI tasks need agents. Start simple; add agentic patterns only when needed.
When to Use Agents
Good fit: Multi-step workflows (research, code generation), dynamic planning (unknown steps), tool use (API calls, file operations), adaptive reasoning (learn from mistakes).
# Simple agentic loop with the ReAct pattern
class SimpleAgent:
    def __init__(self, model_client, tools):
        self.client = model_client
        self.tools = {t.name: t for t in tools}
        self.max_iterations = 10

    async def run(self, task):
        messages = [{"role": "user", "content": task}]
        for _ in range(self.max_iterations):
            # Think: the LLM decides the next action
            response = await self.client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
            )
            content = response.choices[0].message.content
            # Check if done
            if "done" in content.lower():
                return content
            # Act: execute the requested tool
            # (parse_tool_call is assumed to extract a {"name", "args"} dict)
            tool_call = self.parse_tool_call(response)
            result = await self.tools[tool_call["name"]].execute(
                **tool_call["args"]
            )
            # Observe: add the exchange to context for the next iteration
            messages.append({"role": "assistant", "content": content})
            messages.append({"role": "user", "content": f"Tool result: {result}"})
        return "Max iterations reached"
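SimpleAgent assumes each tool exposes a `name` attribute and an async `execute` method. A minimal sketch of that interface (the class and the calculator example are illustrative, not from any library):

```python
import asyncio


class Tool:
    """Minimal tool interface assumed by SimpleAgent: a name plus async execute()."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    async def execute(self, **kwargs):
        # Wrap a plain function; real tools would call APIs or touch files here
        return self.fn(**kwargs)


# Example: a calculator tool the agent can invoke by name
calculator = Tool("calculator", lambda expression: eval(expression))
result = asyncio.run(calculator.execute(expression="2 + 3"))
```

Keeping the interface this small makes tools trivial to register (`{t.name: t for t in tools}`) and to swap out in tests.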
06 — Architecture
Architecture Patterns and Scaling
Three common patterns: monolithic (single model), modular (specialized models), and hierarchical (routing). Choose based on complexity, latency, and operational constraints.
Monolithic Pattern
Single large model handles all tasks. Simple to operate. Works well for general tasks. Limited scaling flexibility.
Modular Pattern
Specialized models for different tasks. Route based on input classification. Better cost control. Higher operational complexity.
# Modular architecture with routing
class ModularPipeline:
    def __init__(self):
        self.classifier = load_classifier()  # lightweight task classifier
        self.models = {
            "summarization": GPTClient(model="gpt-4o-mini"),
            "coding": GPTClient(model="gpt-4o"),
            "translation": GPTClient(model="gpt-4o-mini"),
            "reasoning": GPTClient(model="gpt-4o"),
        }

    async def process(self, request):
        # Classify the task type
        task_type = await self.classifier.predict(request["content"])
        # Route to the appropriate model, defaulting to the strongest
        model = self.models.get(task_type, self.models["reasoning"])
        response = await model.generate(request["content"])
        return response
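The hierarchical pattern mentioned above routes by escalation rather than by input classification: try the cheapest model first and escalate only when confidence is low. A minimal sketch, where the class name, the confidence heuristic, and the stub models are all assumptions for illustration:

```python
class HierarchicalPipeline:
    """Escalation routing: cheap model first, escalate when confidence is low."""

    def __init__(self, tiers, confidence_fn, threshold=0.7):
        self.tiers = tiers                  # models ordered cheapest to strongest
        self.confidence_fn = confidence_fn  # scores a response in [0, 1]
        self.threshold = threshold

    def process(self, content):
        for model in self.tiers[:-1]:
            response = model.generate(content)
            if self.confidence_fn(response) >= self.threshold:
                return response  # cheap tier was good enough
        # Fall through to the strongest (most expensive) tier
        return self.tiers[-1].generate(content)


# Hypothetical usage with stub models in place of real clients
class _Stub:
    def __init__(self, answer):
        self.answer = answer

    def generate(self, content):
        return self.answer


pipeline = HierarchicalPipeline(
    tiers=[_Stub("maybe"), _Stub("thorough answer")],
    confidence_fn=lambda r: len(r) / 10,  # toy heuristic: longer = more confident
)
result = pipeline.process("How does X work?")
```

This trades one extra inference on hard inputs for a cheap model handling the easy majority, which is often a net cost win when most traffic is simple.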
07 — Validation
Evaluation-Driven Design
Start with metrics and benchmarks. Design systems to maximize the metrics that matter. Measure continuously. Let data guide decisions, not intuition.
Defining Success Metrics
- Task accuracy: BLEU, F1, exact match for classification/generation.
- User satisfaction: thumbs up/down, ratings, NPS.
- Business impact: conversion rate, retention, revenue.
- Operational: latency, cost, error rate.
# Evaluation framework
class EvaluationFramework:
    def __init__(self, baseline_fn, baseline_metrics):
        self.baseline_fn = baseline_fn
        self.baseline = baseline_metrics
        self.experiments = []

    def run_experiment(self, variant_name, variant_fn, test_data):
        """A/B test a change against the baseline"""
        results = []
        for sample in test_data:
            baseline_output = self.baseline_fn(sample)
            variant_output = variant_fn(sample)
            results.append({
                "baseline": baseline_output,
                "variant": variant_output,
                "correct_baseline": baseline_output == sample["expected"],
                "correct_variant": variant_output == sample["expected"],
            })
        improvement = {
            "accuracy_lift": (
                sum(r["correct_variant"] for r in results) / len(results)
                - self.baseline["accuracy"]
            ),
            # measure_latency and measure_cost are assumed helpers
            "latency": self.measure_latency(variant_fn, test_data),
            "cost": self.measure_cost(variant_fn, test_data),
        }
        self.experiments.append({
            "variant": variant_name,
            "improvement": improvement,
            "results": results,
        })
        return improvement
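As a toy illustration of the accuracy-lift idea on its own, here is a self-contained comparison of a baseline and a variant on labeled samples (the data, the sentiment task, and both predictors are invented for the example):

```python
def exact_match_accuracy(predict_fn, test_data):
    """Fraction of samples where the prediction equals the expected label."""
    correct = sum(predict_fn(s["input"]) == s["expected"] for s in test_data)
    return correct / len(test_data)


# Invented toy data: classify sentiment of short phrases
test_data = [
    {"input": "great product", "expected": "positive"},
    {"input": "terrible support", "expected": "negative"},
    {"input": "works fine", "expected": "positive"},
    {"input": "never again", "expected": "negative"},
]

baseline = lambda text: "positive"  # naive majority-class guesser
variant = lambda text: (
    "negative" if "terrible" in text or "never" in text else "positive"
)

accuracy_lift = (
    exact_match_accuracy(variant, test_data)
    - exact_match_accuracy(baseline, test_data)
)
```

The lift, not the raw score, is what justifies shipping a change: a variant that scores 0.9 against a 0.92 baseline is a regression regardless of how good 0.9 sounds.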
Tools
Decision Support Stack
Tools for model evaluation, benchmarking, and architectural decisions.
HELM Benchmark
Comprehensive LLM evaluation across diverse tasks. Compare models on accuracy, efficiency, and bias metrics.
LangSmith
Debug and monitor LLM applications. Evaluate prompt changes before deployment. Track costs per feature.
DeepEval
Evaluation framework for LLM outputs. Metrics for hallucination, relevance, faithfulness, and bias.
Weights & Biases
Experiment tracking with decision logging. Compare model versions and architectures systematically.
MLflow
Model registry and versioning. Track experiments, parameters, and metrics. Compare variants.
LiteLLM
Multi-provider LLM router. A/B test models. Track costs and performance across providers.