
Decision Frameworks for AI Systems

Strategic frameworks for model selection, build vs buy decisions, RAG vs fine-tuning tradeoffs, and evaluation-driven architecture patterns.
5 decision frameworks · 7 sections · Python-first implementation

01 — Tradeoffs

The Cost-Quality-Latency Triangle

Every AI system exists at the intersection of three competing constraints. Optimize for two; the third suffers. Know your tradeoffs explicitly.

| Optimize For | Sacrifice | Technology Choice | Example Use Case |
| --- | --- | --- | --- |
| Quality + Cost | Latency (batch) | Batch API, overnight processing | Daily email summaries, report generation |
| Quality + Latency | Cost (expensive model) | GPT-4o with streaming | Real-time customer support, code review |
| Cost + Latency | Quality (degraded) | Smaller models, quantization | Classification, fast filtering |
```python
# Decision framework: cost-quality-latency
class SystemOptimizer:
    def __init__(self, constraints):
        self.max_cost_per_request = constraints["max_cost"]
        self.max_latency_ms = constraints["max_latency"]
        self.min_quality = constraints["min_quality"]

    def choose_architecture(self):
        # If latency is flexible, use the batch API
        if self.max_latency_ms > 3_600_000:  # > 1 hour
            return {
                "model": "gpt-4o",
                "mode": "batch",
                "cost_per_token": 0.000003,
                "latency": "12-24 hours",
            }
        # If cost is flexible, maximize quality
        if self.max_cost_per_request > 0.10:
            return {
                "model": "gpt-4o",
                "mode": "streaming",
                "cost_per_token": 0.000015,
                "latency": "2-5 seconds",
            }
        # Cost-constrained: use a smaller model
        return {
            "model": "gpt-4o-mini",
            "mode": "streaming",
            "cost_per_token": 0.00000015,
            "latency": "1-2 seconds",
        }
```
02 — Knowledge Integration

RAG vs Fine-tuning Decision Matrix

Two approaches for incorporating domain knowledge. RAG is faster to implement and cheaper to start. Fine-tuning can reach higher quality on stable, specialized tasks, but requires training-data preparation and ongoing maintenance.

| Dimension | RAG | Fine-tuning |
| --- | --- | --- |
| Time to production | Weeks (build retrieval system) | Months (collect, label, train) |
| Latency impact | +500 ms (vector search) | None (model behavior change) |
| Cost per inference | Baseline + retrieval | Baseline (potentially cheaper with smaller model) |
| Knowledge currency | Real-time (update documents) | Static (retrain for updates) |
| Quality ceiling | Limited by retrieval accuracy | Limited by training data quality |
| Best for | Rapidly changing domain, large knowledge base | Stable domain, specialized reasoning |
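The matrix above can be folded into a quick scoring heuristic. This is a minimal sketch: the input flags, weights, and thresholds are illustrative assumptions, not an established method.

```python
# Rough RAG-vs-fine-tuning chooser encoding the decision matrix above.
# All weights and thresholds here are illustrative, not empirical.
def choose_knowledge_strategy(knowledge_changes_often: bool,
                              needs_specialized_reasoning: bool,
                              has_labeled_training_data: bool,
                              months_until_launch: int) -> str:
    rag_score = ft_score = 0
    if knowledge_changes_often:
        rag_score += 2   # RAG: update documents instead of retraining
    if needs_specialized_reasoning:
        ft_score += 2    # fine-tuning changes model behavior directly
    if has_labeled_training_data:
        ft_score += 1    # labeled data is the main fine-tuning prerequisite
    if months_until_launch < 3:
        rag_score += 1   # RAG ships in weeks; fine-tuning takes months
    if rag_score and ft_score and abs(rag_score - ft_score) <= 1:
        return "hybrid"  # strong signals both ways: combine the approaches
    return "rag" if rag_score >= ft_score else "fine_tune"
```

A tied or near-tied score is exactly the situation where the hybrid approach pays off.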

Hybrid Approach: RAG + Fine-tuning

Use RAG to inject recent knowledge. Use fine-tuning for reasoning patterns. Combine both for optimal quality.

```python
# Hybrid: RAG retrieval feeding a fine-tuned model
class HybridKnowledgeSystem:
    def __init__(self, retriever, finetuned_model):
        self.retriever = retriever
        self.finetuned_model = finetuned_model

    def answer(self, question):
        # 1. Retrieve relevant context
        docs = self.retriever.search(question, top_k=5)
        context = "\n".join(d["text"] for d in docs)

        # 2. The fine-tuned model has learned the reasoning patterns;
        #    use the same prompt format it was trained on
        prompt = f"""Question: {question}
Context: {context}
Answer:"""
        response = self.finetuned_model.generate(prompt)

        return {
            "answer": response,
            "sources": [d["source"] for d in docs],
            "confidence": self.estimate_confidence(response, context),
        }
```
03 — Selection

Model Selection Framework

Evaluate models on capability, cost, speed, and fit for your specific task. Don't default to the largest model. Benchmark on your actual data.

```python
# Model selection scorecard
class ModelEvaluator:
    def score_model(self, model, benchmark_data, weights=None):
        # Default weighting: quality > latency > cost
        weights = weights or {"quality": 0.50, "latency": 0.30, "cost": 0.20}
        results = self.benchmark(model, benchmark_data)

        # Normalize each dimension so 1.0 means "hits the target"
        scores = {
            "quality": results["accuracy"] / 0.95,        # target: 95% accuracy
            "latency": 2000 / results["avg_latency_ms"],  # target: 2s latency
            "cost": 0.001 / results["cost_per_token"],    # target: $0.001/token
        }
        final_score = sum(scores[key] * weights[key] for key in scores)
        return {
            "model": model,
            "final_score": final_score,
            "component_scores": scores,
            "benchmark_results": results,
        }

    def recommend(self, candidates, benchmark_data):
        evaluations = [
            self.score_model(model, benchmark_data) for model in candidates
        ]
        return max(evaluations, key=lambda x: x["final_score"])
```
04 — Make-or-Buy

Build vs Buy Decision Analysis

Building custom models is expensive. Buying API access is simple. Choose based on scale, differentiation, and control requirements.

| Factor | Build | Buy |
| --- | --- | --- |
| Initial investment | High ($100K-1M) | Low ($5K-50K) |
| Time to production | 6-18 months | Weeks |
| Operational cost at scale | Low (amortized) | High (per-request) |
| Control and IP | Complete ownership | Vendor dependent |
| Differentiation value | High (unique capability) | Commodity (everyone has access) |

Break-Even Analysis

```python
# Build vs buy cost comparison
def calculate_breakeven(api_cost_per_token, inference_volume_per_year,
                        training_cost, serving_cost_per_year):
    api_annual_cost = api_cost_per_token * inference_volume_per_year
    # Amortize training over 5 years
    custom_annual_cost = training_cost / 5 + serving_cost_per_year
    # Break-even: years of API savings needed to pay back training
    # (assumes annual API cost exceeds custom serving cost)
    years_to_breakeven = training_cost / (api_annual_cost - serving_cost_per_year)
    return {
        "api_annual_cost": api_annual_cost,
        "custom_annual_cost": custom_annual_cost,
        "years_to_breakeven": years_to_breakeven,
        "recommendation": "build" if years_to_breakeven < 3 else "buy",
    }

# Example: 500B tokens per year
breakeven = calculate_breakeven(
    api_cost_per_token=0.000005,
    inference_volume_per_year=500_000_000_000,  # 500B tokens
    training_cost=500_000,
    serving_cost_per_year=100_000,
)
```
05 — Patterns

Agent Use Cases and Patterns

Agents excel when tasks require planning, tool use, and iterative refinement. But not all AI tasks need agents. Start simple; add agentic patterns only when needed.

When to Use Agents

Good fit: Multi-step workflows (research, code generation), dynamic planning (unknown steps), tool use (API calls, file operations), adaptive reasoning (learn from mistakes).

Poor fit: Single-turn classification, simple generation, low-latency requirements (<1 second), cost-critical operations.

```python
# Simple agentic loop with the ReAct pattern
class SimpleAgent:
    def __init__(self, model_client, tools):
        self.client = model_client
        self.tools = {t.name: t for t in tools}
        self.max_iterations = 10

    async def run(self, task):
        messages = [{"role": "user", "content": task}]
        for _ in range(self.max_iterations):
            # Think: the LLM decides the next action
            response = await self.client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
            )
            content = response.choices[0].message.content

            # Check if the agent considers the task complete
            if "done" in content.lower():
                return content

            # Act: parse and execute the requested tool
            tool_call = self.parse_tool_call(response)
            result = await self.tools[tool_call["name"]].execute(
                **tool_call["args"]
            )

            # Observe: feed the tool result back into the context
            messages.append({"role": "assistant", "content": content})
            messages.append(
                {"role": "user", "content": f"Tool result: {result}"}
            )
        return "Max iterations reached"
```
06 — Architecture

Architecture Patterns and Scaling

Three common patterns: monolithic (single model), modular (specialized models), and hierarchical (routing). Choose based on complexity, latency, and operational constraints.

Monolithic Pattern

Single large model handles all tasks. Simple to operate. Works well for general tasks. Limited scaling flexibility.
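As a sketch, the monolithic pattern is a single client behind one entry point. The `GPTClient`-style interface mirrors the modular example later in this section; the names are illustrative, not a real library.

```python
# Monolithic pattern: every request goes to one general-purpose model.
class MonolithicPipeline:
    def __init__(self, client):
        self.client = client  # single model client, e.g. a GPT-4o wrapper

    async def process(self, request):
        # No classifier, no routing table: one model handles everything
        return await self.client.generate(request["content"])
```

Scaling means scaling this one model for all traffic, which is the pattern's main limitation.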

Modular Pattern

Specialized models for different tasks, with a router that classifies each input and dispatches it. Better cost control. Higher operational complexity. The hierarchical pattern extends this idea: a routing model delegates to specialists, which may themselves route further.

```python
# Modular architecture with routing
class ModularPipeline:
    def __init__(self):
        self.classifier = load_classifier()
        self.models = {
            "summarization": GPTClient(model="gpt-4o-mini"),
            "coding": GPTClient(model="gpt-4o"),
            "translation": GPTClient(model="gpt-4o-mini"),
            "reasoning": GPTClient(model="gpt-4o"),
        }

    async def process(self, request):
        # Classify the task type
        task_type = await self.classifier.predict(request["content"])
        # Route to the appropriate model (default: general reasoning)
        model = self.models.get(task_type, self.models["reasoning"])
        return await model.generate(request["content"])
```
07 — Validation

Evaluation-Driven Design

Start with metrics and benchmarks. Design systems to maximize the metrics that matter. Measure continuously. Let data guide decisions, not intuition.

Defining Success Metrics

Task accuracy: BLEU, F1, exact match for classification/generation. User satisfaction: thumbs up/down, ratings, NPS. Business impact: conversion rate, retention, revenue. Operational: latency, cost, error rate.
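To make the task-accuracy metrics concrete, exact match and token-level F1 fit in a few lines. This sketch uses lowercased whitespace tokenization; production evaluation scripts also normalize punctuation and articles.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the strings match after trimming and lowercasing, else 0.0
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # Harmonic mean of precision and recall over overlapping tokens
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the cat sat", "the cat sat down"))  # 6/7 ≈ 0.857
```

Exact match is strict and suits classification; token F1 gives partial credit for generation tasks.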

```python
# Evaluation framework
class EvaluationFramework:
    def __init__(self, baseline_fn, baseline_metrics):
        self.baseline_fn = baseline_fn
        self.baseline = baseline_metrics
        self.experiments = []

    def run_experiment(self, variant_name, variant_fn, test_data):
        """A/B test a change against the baseline."""
        results = []
        for sample in test_data:
            baseline_output = self.baseline_fn(sample)
            variant_output = variant_fn(sample)
            results.append({
                "baseline": baseline_output,
                "variant": variant_output,
                "correct_baseline": baseline_output == sample["expected"],
                "correct_variant": variant_output == sample["expected"],
            })

        improvement = {
            "accuracy_lift": (
                sum(r["correct_variant"] for r in results) / len(results)
                - self.baseline["accuracy"]
            ),
            "latency": self.measure_latency(variant_fn, test_data),
            "cost": self.measure_cost(variant_fn, test_data),
        }
        self.experiments.append({
            "variant": variant_name,
            "improvement": improvement,
            "results": results,
        })
        return improvement
```
Tools

Decision Support Stack

Tools for model evaluation, benchmarking, and architectural decisions.

HELM Benchmark

Comprehensive LLM evaluation across diverse tasks. Compare models on accuracy, efficiency, and bias metrics.

LangSmith

Debug and monitor LLM applications. Evaluate prompt changes before deployment. Track costs per feature.

DeepEval

Evaluation framework for LLM outputs. Metrics for hallucination, relevance, faithfulness, and bias.

Weights & Biases

Experiment tracking with decision logging. Compare model versions and architectures systematically.

MLflow

Model registry and versioning. Track experiments, parameters, and metrics. Compare variants.

LiteLLM

Multi-provider LLM router. A/B test models. Track costs and performance across providers.
