The practitioner layer — from prototype to production system
Building a production AI system takes far more than calling an LLM API. It requires six layers: an interface, orchestration logic, LLM calls and tool integrations, a data layer, reliability wrappers, and observability. Skipping any layer leads to fragile, unpredictable systems.
1. User-facing API/Interface: REST endpoint, CLI, chat UI, or webhook. This is how users interact with the system.
2. Orchestration: Agent loops (ReAct, tool-use loops) or deterministic pipelines (chain A → B → C). Controls whether LLM drives decisions or flow is pre-defined.
3. LLM + Tools: Claude, GPT-4, or open-source. Tools are functions: retrieval, calculation, code execution, external APIs.
4. Data Layer: Vector database, relational DB, file storage. Handles retrieval, metadata filtering, embeddings.
5. Reliability: Retry logic with exponential backoff, fallback endpoints, caching (semantic or exact-match), rate limiting.
6. Observability: Logging, tracing, eval metrics. If you cannot measure it, you cannot improve it.
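The six layers above can be wired together in one request path. The sketch below uses hypothetical `Retriever`, `LLMClient`, and `Cache` interfaces (these names are illustrative, not a standard API) to show where each layer sits in a single call:

```python
from typing import Protocol, Optional

class Retriever(Protocol):            # data layer
    def retrieve(self, query: str) -> list[str]: ...

class LLMClient(Protocol):            # LLM layer
    def generate(self, prompt: str) -> str: ...

class Cache(Protocol):                # reliability layer
    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...

def handle_request(query: str, cache: Cache, retriever: Retriever,
                   llm: LLMClient) -> str:
    """One request through the stack: cache check, then retrieval, then generation."""
    if (hit := cache.get(query)) is not None:   # reliability: serve from cache
        return hit
    context = "\n".join(retriever.retrieve(query))          # data layer
    answer = llm.generate(f"Context:\n{context}\n\nQuestion: {query}")  # LLM call
    cache.put(query, answer)                                # populate cache
    return answer
```

The interface and observability layers would wrap `handle_request` (an HTTP handler on the outside, logging and tracing around each step).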
A compact decision tree makes the pattern choice explicit:

```python
from enum import Enum
from dataclasses import dataclass

class PatternType(Enum):
    RAG = "rag"
    AGENT = "agent"
    STRUCTURED_OUTPUT = "structured_output"
    FINE_TUNED = "fine_tuned"
    SIMPLE_PROMPT = "simple_prompt"

@dataclass
class Requirements:
    needs_realtime_data: bool = False
    needs_multi_step_reasoning: bool = False
    output_is_structured: bool = False
    high_volume: bool = False
    domain_specific: bool = False
    latency_sensitive: bool = False

def choose_pattern(req: Requirements) -> PatternType:
    """Decision tree for selecting the right GenAI architecture."""
    # Check fine-tuning first: high volume plus domain data justifies training
    # cost, and would otherwise be shadowed by the RAG branch below.
    if req.high_volume and req.domain_specific:
        return PatternType.FINE_TUNED
    if req.needs_realtime_data or req.domain_specific:
        return PatternType.RAG
    if req.needs_multi_step_reasoning:
        return PatternType.AGENT
    if req.output_is_structured:
        return PatternType.STRUCTURED_OUTPUT
    return PatternType.SIMPLE_PROMPT

# Examples
cases = [
    Requirements(needs_realtime_data=True),
    Requirements(needs_multi_step_reasoning=True),
    Requirements(output_is_structured=True, high_volume=True),
    Requirements(domain_specific=True, high_volume=True),
]
for r in cases:
    print(choose_pattern(r).value)
# rag, agent, structured_output, fine_tuned
```
Every production AI system makes three architecture decisions: how to supply knowledge (prompting, RAG, or fine-tuning), how to control flow (agent loop or deterministic pipeline), and whether to build, buy, or adopt open source. Each has profound cost, quality, and latency trade-offs.
Prompting alone: Fast, cheap, works for general knowledge. Fails when you have proprietary data or need up-to-date facts.
RAG (Retrieval-Augmented Generation): Retrieve relevant chunks, inject into context, generate. Best for knowledge bases, documentation, FAQ. Flexible (update documents without retraining), transparent (grounding is visible). Cost: retrieval latency + embedding overhead.
Fine-tuning: Train on your data to bake knowledge into weights. Best when you need reasoning over proprietary data or want to enforce style/format. Cost: training time, inference cost (usually higher), data preparation. Risk: outdated knowledge (static weights).
| Approach | Speed | Cost | Flexibility | Best for |
|---|---|---|---|---|
| Prompting alone | Fast | Low | High (change prompt) | General Q&A, general knowledge |
| RAG | Medium | Medium | High (update docs) | Knowledge bases, docs, FAQ |
| Fine-tuning | Depends | High | Low (retrain to update) | Proprietary reasoning, style control |
| RAG + FT | Medium | High | Medium | Domain knowledge + reasoning |
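The cost column of this table can be made concrete with back-of-envelope arithmetic. All prices below are illustrative placeholders, not current provider rates; RAG's extra cost comes mostly from the retrieved chunks it injects into the prompt:

```python
# Hypothetical per-1K-token prices (illustrative only, not real rates).
PRICE_PER_1K_INPUT = 0.0025   # USD, LLM input tokens
PRICE_PER_1K_OUTPUT = 0.01    # USD, LLM output tokens
PRICE_PER_1K_EMBED = 0.0001   # USD, embedding tokens

def cost_per_query(input_tokens: int, output_tokens: int,
                   embed_tokens: int = 0) -> float:
    """Sum the token-priced components of one query."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT
            + embed_tokens / 1000 * PRICE_PER_1K_EMBED)

# Prompting alone: short prompt, no retrieval.
prompt_only = cost_per_query(input_tokens=300, output_tokens=400)
# RAG: query embedding plus ~3 retrieved chunks of ~500 tokens in context.
rag = cost_per_query(input_tokens=300 + 3 * 500, output_tokens=400,
                     embed_tokens=30)
print(f"prompt-only ~${prompt_only:.4f}, RAG ~${rag:.4f} per query")
```

At these assumed prices the RAG query costs roughly twice the bare prompt, dominated by the injected context tokens; the trade is paying that overhead for grounding.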
Agent loop (ReAct): LLM decides when to use tools, what to query, whether to iterate. Flexible, can handle novel cases. Risk: unpredictable latency, can hallucinate about tool availability, can loop infinitely.
Deterministic pipeline: Fixed sequence: retrieve → rerank → generate. Fast, predictable, debuggable. Risk: brittle (breaks on edge cases the pipeline wasn't designed for).
Most production systems use a hybrid: 80% deterministic pipeline (fast path), 20% agent fallback for edge cases.
Build: Custom implementation. Full control, longest time-to-market, ongoing maintenance.
Buy (SaaS): Hosted RAG/agent platforms (e.g., Anthropic's APIs, LangSmith, Vectara). Fast deployment, vendor lock-in risk.
Open-source: LlamaIndex, LangChain, DSPy. Flexible, but you own operations and updates.
The single most important practice: design your eval harness before you design the system. Without metrics, all decisions are opinions.
Correctness: Is the final answer right? Measured by comparing against ground truth (for factual Q&A) or human rating (for subjective tasks).
Grounding: Is the answer supported by retrieved context? Measure: # claims supported by context / total claims.
Latency: How fast is the response? P50, P99, tail latency matter for interactive systems.
Cost per query: Sum of API calls (LLM, embedding, reranking). For high-volume systems, this dominates.
Coverage: What % of queries does the system handle (vs. fallback)? Coverage vs accuracy trade-off is crucial.
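The grounding ratio above can be sketched as a function. The support check here is a toy substring match; production systems typically use an NLI model or LLM judge to decide whether a claim is entailed by the context:

```python
def grounding_score(claims: list[str], context: str) -> float:
    """Fraction of answer claims supported by retrieved context:
    supported claims / total claims. Substring match is a stand-in
    for a real entailment check."""
    if not claims:
        return 1.0   # vacuously grounded: nothing asserted
    supported = sum(1 for c in claims if c.lower() in context.lower())
    return supported / len(claims)
```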
A minimal eval harness over a golden set:

```python
from openai import OpenAI

client = OpenAI()

GOLDEN_SET = [
    {"input": "What is RAG?",
     "must_contain": ["retrieval", "generation"],
     "must_not_contain": ["hallucination is fine"]},
    {"input": "Explain fine-tuning in one sentence.",
     "must_contain": ["training", "weights"],
     "must_not_contain": []},
]

def evaluate_system(system_prompt: str, model: str = "gpt-4o") -> dict:
    results = []
    for item in GOLDEN_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": item["input"]},
            ],
        ).choices[0].message.content.lower()
        passed = (
            all(kw in resp for kw in item["must_contain"]) and
            all(kw not in resp for kw in item["must_not_contain"])
        )
        results.append({"input": item["input"], "passed": passed,
                        "response": resp[:100]})
    score = sum(r["passed"] for r in results) / len(results)
    return {"score": score, "details": results, "passed": score >= 0.8}

# Gate deploys on eval results
result = evaluate_system("You are a concise GenAI expert. Answer clearly.")
print(f"Score: {result['score']:.0%} → {'DEPLOY OK' if result['passed'] else 'BLOCK'}")
```
1. Create a test set (100–500 examples). For proprietary data, sample real user queries.
2. Define metrics (correctness, latency, cost).
3. Baseline the current system.
4. Run evals before every architecture change.
5. Track metrics over time in a dashboard.
Production systems fail. Networks are flaky. APIs time out. Your job is to handle failures gracefully.
On transient failure (timeout, 429, 503), retry with increasing wait: 1s, 2s, 4s, 8s, 16s. Jitter (add random noise) prevents thundering herd. Max retries: 3–5. Timeout per attempt: 10–30s depending on operation.
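The backoff-with-jitter schedule can be sketched as a small wrapper; the function name and parameters here are illustrative:

```python
import random
import time

def retry_with_backoff(fn, *, max_retries: int = 4, base: float = 1.0,
                       retryable=(TimeoutError, ConnectionError)):
    """Call fn(); on a transient error, wait base*2^attempt seconds
    (1s, 2s, 4s, 8s with defaults) plus random jitter, then retry.
    Non-retryable errors and the final failure propagate to the caller."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise                       # out of retries: surface the error
            wait = base * (2 ** attempt) + random.uniform(0, base)  # full jitter
            time.sleep(wait)
```

The jitter term spreads retries from many clients over time, so a recovering backend is not hit by a synchronized thundering herd.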
Model fallback: If Claude fails, try GPT-4. If both fail, return cached result. Rank by cost and latency.
Endpoint fallback: Multiple regions. If us-east fails, try eu-west.
Feature fallback: If retrieval fails, still generate from prompt alone. Quality drops, but service is up.
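The three fallback styles share one shape: an ordered chain tried until something answers. A minimal sketch, with hypothetical provider callables ranked by the caller in cost/latency order:

```python
def call_with_fallbacks(prompt: str, providers, cache_lookup):
    """Try each (name, callable) provider in order; fall back to a cached
    answer as the last resort. Providers raise on failure."""
    for name, provider in providers:
        try:
            return {"answer": provider(prompt), "source": name}
        except Exception:
            continue                     # this provider is down; try the next
    cached = cache_lookup(prompt)        # degrade gracefully to stale data
    if cached is not None:
        return {"answer": cached, "source": "cache"}
    raise RuntimeError("all providers and cache failed")
```

Returning the `source` alongside the answer makes degraded responses visible in logs and to downstream consumers.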
Exact-match cache: Hash query, return cached result if query seen before. Simple, high hit rate on repeated questions.
Semantic cache: Embed query, find similar cached queries (cosine similarity > threshold), return their results. More flexible, captures paraphrases.
TTL (time-to-live): Invalidate cache after N hours or on data update. Balance freshness vs speed.
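An exact-match cache with TTL invalidation is a few lines; this is a sketch of the idea, not a production cache (no size bound, no locking):

```python
import hashlib
import time
from typing import Optional

class ExactMatchCache:
    """Exact-match cache keyed by a hash of the normalized query,
    with time-to-live invalidation."""
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, query: str) -> str:
        # Light normalization so trivial variants of a query still hit.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str) -> Optional[str]:
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        stored_at, value = entry
        if time.time() - stored_at > self.ttl:   # expired: evict and miss
            del self._store[self._key(query)]
            return None
        return value

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (time.time(), answer)
```

A semantic cache replaces `_key` with an embedding lookup over stored queries, matching on cosine similarity above a threshold instead of hash equality.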
Protect your system and wallet: per-user request limits, concurrent request limits, cost budgets. Transparent errors ("You've hit your monthly limit") are better than silent failures.
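Per-user limits are commonly implemented as a token bucket; a minimal single-process sketch (a real deployment would keep buckets in shared storage such as Redis):

```python
import time

class TokenBucket:
    """Rate limiter allowing `rate` requests/second with bursts up to
    `capacity`. One bucket per user."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller should return a transparent limit error, not fail silently
```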
A circuit breaker plus basic observability around each LLM call:

```python
import time
import logging
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()
logger = logging.getLogger(__name__)

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_timeout: float = 60.0
    failures: int = 0
    last_failure: float = 0.0
    state: str = "closed"  # closed, open, half-open

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"   # probe whether the service recovered
            else:
                raise RuntimeError("Circuit breaker OPEN — service unavailable")
        try:
            result = fn(*args, **kwargs)
            # Any success closes the circuit and clears the failure count.
            self.state = "closed"
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
                logger.error(f"Circuit breaker opened after {self.failures} failures")
            raise

cb = CircuitBreaker()

def call_with_observability(prompt: str, **kwargs) -> str:
    start = time.perf_counter()
    try:
        def _call():
            return client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                **kwargs,
            ).choices[0].message.content
        result = cb.call(_call)
        latency = time.perf_counter() - start
        logger.info(f"LLM call ok latency={latency:.3f}s tokens={len(result.split())}")
        return result
    except Exception as e:
        latency = time.perf_counter() - start
        logger.error(f"LLM call failed latency={latency:.3f}s error={e}")
        raise
```
A minimal production RAG system skeleton with caching, retry logic, and structured output:
```python
# Production AI system skeleton
import time
import logging
from typing import Optional

import anthropic
import chromadb
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

class ProductionRAGSystem:
    def __init__(self, collection_name: str = "docs"):
        self.client = anthropic.Anthropic()
        self.embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
        self.db = chromadb.Client()
        self.collection = self.db.get_or_create_collection(collection_name)
        self.semantic_cache: dict = {}

    def _get_cached(self, query: str) -> Optional[str]:
        # Simple semantic cache (production: use cosine similarity)
        return self.semantic_cache.get(query)

    def query(self, question: str, max_retries: int = 3) -> dict:
        # Check cache
        cached = self._get_cached(question)
        if cached:
            return {"answer": cached, "source": "cache"}
        # Retrieve with fallback
        for attempt in range(max_retries):
            try:
                q_emb = self.embed_model.encode([question]).tolist()
                results = self.collection.query(query_embeddings=q_emb, n_results=3)
                context = "\n".join(results["documents"][0]) if results["documents"][0] else ""
                resp = self.client.messages.create(
                    model="claude-haiku-4-5-20251001",
                    max_tokens=512,
                    messages=[{"role": "user", "content":
                               f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"}],
                )
                answer = resp.content[0].text
                self.semantic_cache[question] = answer
                return {"answer": answer, "source": "rag", "context_used": len(context)}
            except Exception as e:
                logger.warning(f"Attempt {attempt+1} failed: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # exponential backoff
        return {"answer": "Service temporarily unavailable", "source": "fallback"}

system = ProductionRAGSystem()
result = system.query("What is the refund policy?")
print(result)
```

Key features: caching before retrieval, exponential backoff on failure, structured return (source tracking), and graceful fallback.
Every production AI system lives inside a triangle: cost, latency, and quality. You can optimise any two, but improving all three simultaneously requires architectural changes. Understanding this triangle prevents the most common mistake — chasing quality in development, then discovering the latency and cost are unacceptable at scale.
Typical levers: model size (quality vs cost/latency), quantization (cost/latency vs slight quality drop), caching (latency/cost vs freshness), retrieval (quality vs latency), streaming (perceived latency vs complexity). Map your requirements first, then choose the point in the triangle your use case demands.
| Lever | Improves | Trade-off | Typical Savings |
|---|---|---|---|
| Smaller model (GPT-4o → 4o-mini) | Cost, latency | Quality drop on hard tasks | 10–30× cheaper |
| Quantization (fp16 → int4) | Cost, latency | Slight accuracy loss | 2–4× faster serving |
| Response caching | Cost, latency | Stale results on dynamic queries | Up to 60% cache hit |
| Streaming output | Perceived latency | More complex client code | Time-to-first-token -80% |
| Prompt compression | Cost, latency | Possible information loss | 30–50% token reduction |
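Combining levers multiplies savings. A back-of-envelope blended-cost calculation, under the assumption that cache hits cost roughly nothing and some traffic can be routed to a cheaper model (the function and its parameters are illustrative):

```python
def blended_cost(base_cost: float, cache_hit_rate: float,
                 small_model_share: float, small_model_ratio: float) -> float:
    """Per-query cost after two levers: cache hits cost ~0, and a share of
    cache misses goes to a model `small_model_ratio` times the base price."""
    miss = 1.0 - cache_hit_rate
    routed = base_cost * small_model_ratio * small_model_share  # cheap-model traffic
    full = base_cost * (1.0 - small_model_share)                # big-model traffic
    return miss * (routed + full)

# Illustrative: $0.01 baseline, 60% cache hits, 70% of misses to a 20x-cheaper model
print(blended_cost(0.01, 0.60, 0.70, 1 / 20))
```

Under these assumed numbers the blended cost drops to well under a fifth of the baseline, without touching the quality-critical 30% of miss traffic.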
This overview covered decision frameworks and architecture. For deeper dives: