Every AI system exists at the intersection of three competing constraints. Optimize for two; the third suffers. Know your tradeoffs explicitly.
| Optimize For | Sacrifice | Technology Choice | Example Use Case |
|---|---|---|---|
| Quality + Cost | Latency (batch) | Batch API, overnight processing | Daily email summaries, report generation |
| Quality + Latency | Cost (expensive model) | GPT-4o with streaming | Real-time customer support, code review |
| Cost + Latency | Quality (degraded) | Smaller models, quantization | Classification, fast filtering |
# Decision framework: cost-quality-latency
class SystemOptimizer:
    def __init__(self, constraints):
        self.max_cost_per_request = constraints["max_cost"]
        self.max_latency_ms = constraints["max_latency"]
        self.min_quality = constraints["min_quality"]

    def choose_architecture(self):
        # If latency is flexible, use the batch API
        if self.max_latency_ms > 3_600_000:  # > 1 hour
            return {
                "model": "gpt-4o",
                "mode": "batch",
                "cost_per_token": 0.000003,
                "latency": "12-24 hours",
            }
        # If cost is flexible, maximize quality
        if self.max_cost_per_request > 0.10:
            return {
                "model": "gpt-4o",
                "mode": "streaming",
                "cost_per_token": 0.000015,
                "latency": "2-5 seconds",
            }
        # Cost-constrained: use a smaller model
        return {
            "model": "gpt-4o-mini",
            "mode": "streaming",
            "cost_per_token": 0.00000015,
            "latency": "1-2 seconds",
        }
02 — Knowledge Integration
RAG vs Fine-tuning Decision Matrix
Two approaches for incorporating domain knowledge. RAG is faster to implement and cheaper to start. Fine-tuning can deliver higher quality on specialized tasks but requires data preparation and ongoing maintenance.
| Dimension | RAG | Fine-tuning |
|---|---|---|
| Time to production | Weeks (build retrieval system) | Months (collect, label, train) |
| Latency impact | +500ms (vector search) | None (model behavior change) |
| Cost per inference | Baseline + retrieval | Baseline (potentially cheaper with smaller model) |
| Knowledge currency | Real-time (update documents) | Static (retrain for updates) |
| Quality ceiling | Limited by retrieval accuracy | Limited by training data quality |
| Best for | Rapidly changing domain, large knowledge base | Stable domain, specialized reasoning |
Hybrid Approach: RAG + Fine-tuning
Use RAG to inject recent knowledge. Use fine-tuning for reasoning patterns. Combine both for optimal quality.
# Hybrid RAG + fine-tuned model
class HybridKnowledgeSystem:
    def __init__(self, retriever, finetuned_model):
        self.retriever = retriever
        self.finetuned_model = finetuned_model

    def answer(self, question):
        # 1. Retrieve relevant context
        docs = self.retriever.search(question, top_k=5)
        context = "\n".join(d["text"] for d in docs)

        # 2. The fine-tuned model has learned reasoning patterns;
        #    use the same prompt format it was trained on
        prompt = f"""Question: {question}
Context: {context}
Answer:"""
        response = self.finetuned_model.generate(prompt)
        return {
            "answer": response,
            "sources": [d["source"] for d in docs],
            # estimate_confidence is assumed to be defined elsewhere
            "confidence": self.estimate_confidence(response, context),
        }
03 — Selection
Model Selection Framework
Evaluate models on capability, cost, speed, and fit for your specific task. Don't default to the largest model. Benchmark on your actual data.
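"Benchmark on your actual data" can be as simple as running each candidate over a held-out sample of real requests and comparing quality and latency side by side. A minimal sketch, where the stub "models", the toy samples, and `score_fn` are all invented for illustration:

```python
import time


def benchmark_models(models, samples, score_fn):
    """Compare candidate models on your own data, not public leaderboards.

    models: dict of name -> callable(prompt) -> response
    samples: list of {"prompt", "expected"} dicts
    score_fn: compares a response to the expected output, returns 0..1
    """
    report = {}
    for name, model in models.items():
        start = time.perf_counter()
        scores = [score_fn(model(s["prompt"]), s["expected"]) for s in samples]
        elapsed = time.perf_counter() - start
        report[name] = {
            "mean_score": sum(scores) / len(scores),
            "avg_latency_s": elapsed / len(samples),
        }
    return report


# Toy example with stub callables standing in for real API clients
samples = [{"prompt": "2+2", "expected": "4"}, {"prompt": "3+3", "expected": "6"}]
report = benchmark_models(
    {"echo": lambda p: "4", "solver": lambda p: str(eval(p))},
    samples,
    score_fn=lambda out, exp: float(out == exp),
)
```

The same harness works unchanged when the callables wrap real model endpoints, which is the point: one fixed sample set, identical scoring, per-model numbers you can defend.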
Building custom models is expensive. Buying API access is simple. Choose based on scale, differentiation, and control requirements.
| Factor | Build | Buy |
|---|---|---|
| Initial investment | High ($100K-1M) | Low ($5K-50K) |
| Time to production | 6-18 months | Weeks |
| Operational cost at scale | Low (amortized) | High (per-request) |
| Control and IP | Complete ownership | Vendor dependent |
| Differentiation value | High (unique capability) | Commodity (everyone has access) |
Break-Even Analysis
# Build vs buy cost comparison
def calculate_breakeven(api_cost_per_token, inference_volume_per_year,
                        training_cost, serving_cost_per_year):
    api_annual_cost = api_cost_per_token * inference_volume_per_year
    # Amortize training over 5 years
    custom_annual_cost = training_cost / 5 + serving_cost_per_year

    # Break-even when cumulative API savings repay the training investment
    annual_savings = api_annual_cost - serving_cost_per_year
    years_to_breakeven = (
        training_cost / annual_savings if annual_savings > 0 else float("inf")
    )
    return {
        "api_annual_cost": api_annual_cost,
        "custom_annual_cost": custom_annual_cost,
        "years_to_breakeven": years_to_breakeven,
        "recommendation": "build" if years_to_breakeven < 3 else "buy",
    }

# Example: 500B tokens per year
breakeven = calculate_breakeven(
    api_cost_per_token=0.000005,
    inference_volume_per_year=500_000_000_000,  # 500B tokens
    training_cost=500_000,
    serving_cost_per_year=100_000,
)
05 — Patterns
Agent Use Cases and Patterns
Agents excel when tasks require planning, tool use, and iterative refinement. But not all AI tasks need agents. Start simple; add agentic patterns only when needed.
When to Use Agents
Good fit: Multi-step workflows (research, code generation), dynamic planning (unknown steps), tool use (API calls, file operations), adaptive reasoning (learn from mistakes).
# Simple agentic loop with the ReAct pattern
class SimpleAgent:
    def __init__(self, model_client, tools):
        self.client = model_client
        self.tools = {t.name: t for t in tools}
        self.max_iterations = 10

    async def run(self, task):
        messages = [{"role": "user", "content": task}]
        for _ in range(self.max_iterations):
            # Think: the LLM decides the next action
            response = await self.client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
            )
            content = response.choices[0].message.content
            # Check if done
            if "done" in content.lower():
                return content
            # Act: execute the requested tool
            # (parse_tool_call is assumed to extract a {"name", "args"} dict)
            tool_call = self.parse_tool_call(response)
            result = await self.tools[tool_call["name"]].execute(
                **tool_call["args"]
            )
            # Observe: add the exchange to context for the next iteration
            messages.append({"role": "assistant", "content": content})
            messages.append({"role": "user", "content": f"Tool result: {result}"})
        return "Max iterations reached"
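SimpleAgent assumes each tool exposes a `name` attribute and an async `execute` method. A minimal sketch of that interface (the class and the calculator example are illustrative, not from any library):

```python
import asyncio


class Tool:
    """Minimal tool interface assumed by SimpleAgent: a name plus async execute()."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    async def execute(self, **kwargs):
        # Wrap a plain function; real tools would call APIs or touch files here
        return self.fn(**kwargs)


# Example: a calculator tool the agent can invoke by name
calculator = Tool("calculator", lambda expression: eval(expression))
result = asyncio.run(calculator.execute(expression="2 + 3"))
```

Keeping the interface this small makes tools trivial to register (`{t.name: t for t in tools}`) and to swap out in tests.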
06 — Architecture
Architecture Patterns and Scaling
Three common patterns: monolithic (single model), modular (specialized models), and hierarchical (routing). Choose based on complexity, latency, and operational constraints.
Monolithic Pattern
Single large model handles all tasks. Simple to operate. Works well for general tasks. Limited scaling flexibility.
Modular Pattern
Specialized models for different tasks. Route based on input classification. Better cost control. Higher operational complexity.
# Modular architecture with routing
class ModularPipeline:
    def __init__(self):
        self.classifier = load_classifier()  # lightweight task classifier
        self.models = {
            "summarization": GPTClient(model="gpt-4o-mini"),
            "coding": GPTClient(model="gpt-4o"),
            "translation": GPTClient(model="gpt-4o-mini"),
            "reasoning": GPTClient(model="gpt-4o"),
        }

    async def process(self, request):
        # Classify the task type
        task_type = await self.classifier.predict(request["content"])
        # Route to the appropriate model, defaulting to the strongest
        model = self.models.get(task_type, self.models["reasoning"])
        response = await model.generate(request["content"])
        return response
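The hierarchical pattern mentioned above routes by escalation rather than by input classification: try the cheapest model first and escalate only when confidence is low. A minimal sketch, where the class name, the confidence heuristic, and the stub models are all assumptions for illustration:

```python
class HierarchicalPipeline:
    """Escalation routing: cheap model first, escalate when confidence is low."""

    def __init__(self, tiers, confidence_fn, threshold=0.7):
        self.tiers = tiers                  # models ordered cheapest to strongest
        self.confidence_fn = confidence_fn  # scores a response in [0, 1]
        self.threshold = threshold

    def process(self, content):
        for model in self.tiers[:-1]:
            response = model.generate(content)
            if self.confidence_fn(response) >= self.threshold:
                return response  # cheap tier was good enough
        # Fall through to the strongest (most expensive) tier
        return self.tiers[-1].generate(content)


# Hypothetical usage with stub models in place of real clients
class _Stub:
    def __init__(self, answer):
        self.answer = answer

    def generate(self, content):
        return self.answer


pipeline = HierarchicalPipeline(
    tiers=[_Stub("maybe"), _Stub("thorough answer")],
    confidence_fn=lambda r: len(r) / 10,  # toy heuristic: longer = more confident
)
result = pipeline.process("How does X work?")
```

This trades one extra inference on hard inputs for a cheap model handling the easy majority, which is often a net cost win when most traffic is simple.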
07 — Validation
Evaluation-Driven Design
Start with metrics and benchmarks. Design systems to maximize the metrics that matter. Measure continuously. Let data guide decisions, not intuition.
Defining Success Metrics
- Task accuracy: BLEU, F1, exact match for classification/generation.
- User satisfaction: thumbs up/down, ratings, NPS.
- Business impact: conversion rate, retention, revenue.
- Operational: latency, cost, error rate.
# Evaluation framework
class EvaluationFramework:
    def __init__(self, baseline_fn, baseline_metrics):
        self.baseline_fn = baseline_fn
        self.baseline = baseline_metrics
        self.experiments = []

    def run_experiment(self, variant_name, variant_fn, test_data):
        """A/B test a change against the baseline"""
        results = []
        for sample in test_data:
            baseline_output = self.baseline_fn(sample)
            variant_output = variant_fn(sample)
            results.append({
                "baseline": baseline_output,
                "variant": variant_output,
                "correct_baseline": baseline_output == sample["expected"],
                "correct_variant": variant_output == sample["expected"],
            })
        improvement = {
            "accuracy_lift": (
                sum(r["correct_variant"] for r in results) / len(results)
                - self.baseline["accuracy"]
            ),
            # measure_latency and measure_cost are assumed helpers
            "latency": self.measure_latency(variant_fn, test_data),
            "cost": self.measure_cost(variant_fn, test_data),
        }
        self.experiments.append({
            "variant": variant_name,
            "improvement": improvement,
            "results": results,
        })
        return improvement
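As a toy illustration of the accuracy-lift idea on its own, here is a self-contained comparison of a baseline and a variant on labeled samples (the data, the sentiment task, and both predictors are invented for the example):

```python
def exact_match_accuracy(predict_fn, test_data):
    """Fraction of samples where the prediction equals the expected label."""
    correct = sum(predict_fn(s["input"]) == s["expected"] for s in test_data)
    return correct / len(test_data)


# Invented toy data: classify sentiment of short phrases
test_data = [
    {"input": "great product", "expected": "positive"},
    {"input": "terrible support", "expected": "negative"},
    {"input": "works fine", "expected": "positive"},
    {"input": "never again", "expected": "negative"},
]

baseline = lambda text: "positive"  # naive majority-class guesser
variant = lambda text: (
    "negative" if "terrible" in text or "never" in text else "positive"
)

accuracy_lift = (
    exact_match_accuracy(variant, test_data)
    - exact_match_accuracy(baseline, test_data)
)
```

The lift, not the raw score, is what justifies shipping a change: a variant that scores 0.9 against a 0.92 baseline is a regression regardless of how good 0.9 sounds.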
Tools
Decision Support Stack
Tools for model evaluation, benchmarking, and architectural decisions.
HELM Benchmark
Comprehensive LLM evaluation across diverse tasks. Compare models on accuracy, efficiency, and bias metrics.
LangSmith
Debug and monitor LLM applications. Evaluate prompt changes before deployment. Track costs per feature.
DeepEval
Evaluation framework for LLM outputs. Metrics for hallucination, relevance, faithfulness, and bias.
Weights & Biases
Experiment tracking with decision logging. Compare model versions and architectures systematically.
MLflow
Model registry and versioning. Track experiments, parameters, and metrics. Compare variants.
LiteLLM
Multi-provider LLM router. A/B test models. Track costs and performance across providers.