
AI System Design

The practitioner layer — from prototype to production system

- Evals before architecture: the #1 rule
- RAG · FT · agents: the decision stack
- 6 production layers: what you're building
⚠️ System design is the layer most tutorials skip. Individual tools are well documented; how to combine them into something reliable, cost-efficient, and maintainable in production is not.
01 — Architecture

The 6 Layers of a Production AI System

Building a production AI system is not just calling an LLM API. It requires six layers: an interface, orchestration logic, LLM calls and tool integrations, a data layer, reliability wrappers, and observability. Skipping any layer leads to fragile, unpredictable systems.

User-facing API / interface
        ↓
Orchestration (agent loop / chain)
        ↓
LLM calls  |  Tools & retrieval
        ↓
Data layer (chunks, metadata, vDB)
        ↓
Reliability (retry, fallback, cache)
        ↓
Eval & observability (Langfuse, Arize, etc.)

Layer Details

1. User-facing API/Interface: REST endpoint, CLI, chat UI, or webhook. This is how users interact with the system.

2. Orchestration: Agent loops (ReAct, tool-use loops) or deterministic pipelines (chain A → B → C). Controls whether LLM drives decisions or flow is pre-defined.

3. LLM + Tools: Claude, GPT-4, or open-source. Tools are functions: retrieval, calculation, code execution, external APIs.

4. Data Layer: Vector database, relational DB, file storage. Handles retrieval, metadata filtering, embeddings.

5. Reliability: Retry logic with exponential backoff, fallback endpoints, caching (semantic or exact-match), rate limiting.

6. Observability: Logging, tracing, eval metrics. If you cannot measure it, you cannot improve it.

Design your eval harness before your system architecture. If you cannot measure quality, you cannot make reliable design decisions.
Python · System design: decision tree implementation
from enum import Enum
from dataclasses import dataclass

class PatternType(Enum):
    RAG = "rag"
    AGENT = "agent"
    STRUCTURED_OUTPUT = "structured_output"
    FINE_TUNED = "fine_tuned"
    SIMPLE_PROMPT = "simple_prompt"

@dataclass
class Requirements:
    needs_realtime_data: bool = False
    needs_multi_step_reasoning: bool = False
    output_is_structured: bool = False
    high_volume: bool = False
    domain_specific: bool = False
    latency_sensitive: bool = False

def choose_pattern(req: Requirements) -> PatternType:
    """Decision tree for selecting the right GenAI architecture."""
    # Check fine-tuning before RAG: otherwise domain_specific alone
    # would always route to RAG and this branch would be unreachable.
    if req.high_volume and req.domain_specific:
        return PatternType.FINE_TUNED
    if req.needs_realtime_data or req.domain_specific:
        return PatternType.RAG
    if req.needs_multi_step_reasoning:
        return PatternType.AGENT
    if req.output_is_structured:
        return PatternType.STRUCTURED_OUTPUT
    return PatternType.SIMPLE_PROMPT

# Examples
cases = [
    Requirements(needs_realtime_data=True),
    Requirements(needs_multi_step_reasoning=True),
    Requirements(output_is_structured=True, high_volume=True),
    Requirements(domain_specific=True, high_volume=True),
]
for r in cases:
    print(f"{choose_pattern(r).value}")
# rag, agent, structured_output, fine_tuned
02 — Decisions

The Three Core Decisions

Every production AI system makes three architecture decisions. Each has profound cost, quality, and latency trade-offs.

Decision 1: RAG vs Fine-Tuning vs Prompting Alone

Prompting alone: Fast, cheap, works for general knowledge. Fails when you have proprietary data or need up-to-date facts.

RAG (Retrieval-Augmented Generation): Retrieve relevant chunks, inject into context, generate. Best for knowledge bases, documentation, FAQ. Flexible (update documents without retraining), transparent (grounding is visible). Cost: retrieval latency + embedding overhead.

Fine-tuning: Train on your data to bake knowledge into weights. Best when you need reasoning over proprietary data or want to enforce style/format. Cost: training time, inference cost (usually higher), data preparation. Risk: outdated knowledge (static weights).

| Approach | Speed | Cost | Flexibility | Best for |
|---|---|---|---|---|
| Prompting alone | Fast | Low | High (change prompt) | General Q&A, general knowledge |
| RAG | Medium | Medium | High (update docs) | Knowledge bases, docs, FAQ |
| Fine-tuning | Depends | High | Low (retrain to update) | Proprietary reasoning, style control |
| RAG + FT | Medium | High | Medium | Domain knowledge + reasoning |

Decision 2: Agent Loop vs Deterministic Pipeline

Agent loop (ReAct): LLM decides when to use tools, what to query, whether to iterate. Flexible, can handle novel cases. Risk: unpredictable latency, can hallucinate about tool availability, can loop infinitely.

Deterministic pipeline: Fixed sequence: retrieve → rerank → generate. Fast, predictable, debuggable. Risk: brittle (breaks on edge cases the pipeline wasn't designed for).

Most production systems use a hybrid: 80% deterministic pipeline (fast path), 20% agent fallback for edge cases.
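The hybrid can be sketched as a router: a deterministic fast path answers the well-understood cases, and anything it declines falls through to an agent loop. Everything below (`fast_path`, `agent_fallback`) is an illustrative stub, not a real API:

```python
import re
from typing import Optional

def fast_path(query: str) -> Optional[str]:
    """Deterministic pipeline: handles the common, well-understood cases."""
    if re.fullmatch(r"(hi|hello|hey)[.!?]*", query.strip(), re.IGNORECASE):
        return "Hello! How can I help?"
    return None  # decline: not a case this pipeline was designed for

def agent_fallback(query: str) -> str:
    """Stub standing in for a full agent loop (tool use, iteration)."""
    return f"[agent] reasoning about: {query}"

def route(query: str) -> str:
    answer = fast_path(query)
    return answer if answer is not None else agent_fallback(query)
```

The routing predicate is where the 80/20 split lives: the tighter your fast-path conditions, the more predictable the system, at the cost of sending more traffic to the slower agent.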

Decision 3: Build vs Buy vs Open-Source

Build: Custom implementation. Full control, longest time-to-market, ongoing maintenance.

Buy (SaaS): Hosted RAG/agent platforms (e.g., Anthropic's APIs, LangSmith, Vectara). Fast deployment, vendor lock-in risk.

Open-source: LlamaIndex, LangChain, DSPy. Flexible, but you own operations and updates.

03 — Quality

Evals-First Development

The single most important practice: design your eval harness before you design the system. Without metrics, all decisions are opinions.

Eval Dimensions

Correctness: Is the final answer right? Measured by comparing against ground truth (for factual Q&A) or human rating (for subjective tasks).

Grounding: Is the answer supported by retrieved context? Measure: # claims supported by context / total claims.

Latency: How fast is the response? P50, P99, tail latency matter for interactive systems.

Cost per query: Sum of API calls (LLM, embedding, reranking). For high-volume systems, this dominates.

Coverage: What % of queries does the system handle (vs. fallback)? Coverage vs accuracy trade-off is crucial.
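As one concrete example, the grounding metric above can be computed in a few lines. Real systems use an NLI model or an LLM judge to decide whether a claim is supported; naive substring matching here is only a stand-in:

```python
def grounding_score(claims: list, context: str) -> float:
    """Fraction of claims supported by the retrieved context.

    Support is approximated by substring matching; a production
    system would use an NLI model or an LLM judge instead.
    """
    if not claims:
        return 0.0
    ctx = context.lower()
    supported = sum(1 for claim in claims if claim.lower() in ctx)
    return supported / len(claims)
```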

Eval Tools

- LangSmith (framework): Tracing and evaluation for LLM chains. Integrates with LangChain.
- Langfuse (framework): Open-source LLM observability. Evals, traces, cost tracking.
- RAGAS (framework): Reference-free metrics for RAG: faithfulness, relevancy, precision, recall.
- DeepEval (framework): Python framework for synthetic test generation and LLM eval.
- Arize Phoenix (observability): Production monitoring, drift detection, performance tracking.
Python · Evaluation-first development: golden set + regression gate
from openai import OpenAI

client = OpenAI()
GOLDEN_SET = [
    {"input": "What is RAG?",
     "must_contain": ["retrieval", "generation"],
     "must_not_contain": ["hallucination is fine"]},
    {"input": "Explain fine-tuning in one sentence.",
     "must_contain": ["training", "weights"],
     "must_not_contain": []},
]

def evaluate_system(system_prompt: str, model: str = "gpt-4o") -> dict:
    results = []
    for item in GOLDEN_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": item["input"]}
            ]
        ).choices[0].message.content.lower()

        passed = (
            all(kw in resp for kw in item["must_contain"]) and
            all(kw not in resp for kw in item["must_not_contain"])
        )
        results.append({"input": item["input"], "passed": passed, "response": resp[:100]})

    score = sum(r["passed"] for r in results) / len(results)
    return {"score": score, "details": results, "passed": score >= 0.8}

# Gate deploys on eval results
result = evaluate_system("You are a concise GenAI expert. Answer clearly.")
print(f"Score: {result['score']:.0%} → {'DEPLOY OK' if result['passed'] else 'BLOCK'}")

The Eval Harness Workflow

1. Create a test set (100–500 examples). For proprietary data, sample real user queries.
2. Define metrics (correctness, latency, cost).
3. Baseline the current system.
4. Run evals before every architecture change.
5. Track metrics over time in a dashboard.

Best practice: Run evals on every commit. If a change improves correctness but doubles latency, you'll know. If a change cuts cost by 20% but drops accuracy by 5%, that's a data-driven decision to make explicitly.
04 — Resilience

Reliability Patterns

Production systems fail. Networks are flaky. APIs timeout. Your job is to handle failures gracefully.

Retry with Exponential Backoff

On transient failure (timeout, 429, 503), retry with increasing wait: 1s, 2s, 4s, 8s, 16s. Jitter (add random noise) prevents thundering herd. Max retries: 3–5. Timeout per attempt: 10–30s depending on operation.
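A minimal sketch of that policy with full jitter. The exception type and the `fn` signature are placeholders for whatever your client actually raises:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for timeouts / HTTP 429 / 503 from a real client."""

def retry_with_backoff(fn, max_retries=4, base=1.0, cap=16.0):
    """Call fn(); on transient failure wait up to base * 2^attempt seconds (capped)."""
    for attempt in range(max_retries):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter avoids thundering herd
```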

Fallback Strategies

Model fallback: If Claude fails, try GPT-4. If both fail, return cached result. Rank by cost and latency.

Endpoint fallback: Multiple regions. If us-east fails, try eu-west.

Feature fallback: If retrieval fails, still generate from prompt alone. Quality drops, but service is up.
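The three fallback tiers compose naturally into an ordered chain. The provider callables below are stubs; in practice each would wrap an SDK call (Anthropic, OpenAI, a secondary region, and so on), ranked by cost and latency as described above:

```python
def call_with_fallback(prompt: str, providers, cache: dict) -> dict:
    """Try providers in preference order; last resort is the cache.

    providers: list of (name, callable) pairs, best option first.
    """
    for name, call in providers:
        try:
            return {"answer": call(prompt), "source": name}
        except Exception:
            continue  # transient or hard failure: move down the chain
    if prompt in cache:
        return {"answer": cache[prompt], "source": "cache"}
    raise RuntimeError("all providers failed and no cached result available")
```

Returning the `source` alongside the answer makes degraded responses visible in your logs and evals instead of silently blending in.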

Caching

Exact-match cache: Hash query, return cached result if query seen before. Simple, high hit rate on repeated questions.

Semantic cache: Embed query, find similar cached queries (cosine similarity > threshold), return their results. More flexible, captures paraphrases.

TTL (time-to-live): Invalidate cache after N hours or on data update. Balance freshness vs speed.
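A toy semantic cache with a TTL, assuming you supply an `embed` function (a real one would call an embedding model; the linear scan would become a vector index):

```python
import math
import time

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, embed, threshold=0.9, ttl=3600.0):
        self.embed, self.threshold, self.ttl = embed, threshold, ttl
        self.entries = []  # (embedding, answer, stored_at)

    def get(self, query):
        now = time.time()
        self.entries = [e for e in self.entries if now - e[2] < self.ttl]  # expire old
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None  # miss: caller computes the answer and calls put()

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer, time.time()))
```

The threshold is the main tuning knob: too low and paraphrases of different questions collide; too high and the cache degrades to exact match.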

Rate Limiting & Quotas

Protect your system and wallet: per-user request limits, concurrent request limits, cost budgets. Transparent errors ("You've hit your monthly limit") are better than silent failures.
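Per-user limits are commonly enforced with a token bucket. An in-process sketch (a production deployment would back this with Redis or similar shared state):

```python
import time
from collections import defaultdict

class TokenBucketLimiter:
    """Each user gets `capacity` requests, refilled at `refill_per_sec`."""
    def __init__(self, capacity=10, refill_per_sec=1.0):
        self.capacity, self.refill = capacity, refill_per_sec
        self._state = defaultdict(lambda: (float(capacity), time.monotonic()))

    def allow(self, user: str) -> bool:
        tokens, last = self._state[user]
        now = time.monotonic()
        # Refill tokens proportionally to elapsed time, capped at capacity
        tokens = min(self.capacity, tokens + (now - last) * self.refill)
        if tokens < 1.0:
            self._state[user] = (tokens, now)
            return False  # reject: return a transparent error, not a silent failure
        self._state[user] = (tokens - 1.0, now)
        return True
```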

⚠️ Cost runaway: An agent that loops infinitely can drain your budget in minutes. Always set max_iterations, max_tokens, and cost budgets.
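Those hard stops can be enforced in the loop itself. `step` here is a hypothetical callable standing in for one agent iteration, returning `(output, tokens_used, done)`:

```python
def run_agent(step, max_iterations: int = 8, token_budget: int = 20_000):
    """Run an agent loop guarded by an iteration cap and a token budget."""
    tokens_used = 0
    for i in range(max_iterations):
        output, tokens, done = step(i)
        tokens_used += tokens
        if done:
            return output
        if tokens_used >= token_budget:
            raise RuntimeError(f"token budget exhausted after {i + 1} steps")
    raise RuntimeError("max_iterations reached without a final answer")
```

Raising on budget exhaustion, rather than returning a partial answer, forces the caller to decide explicitly how a runaway loop should degrade.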
Python · Production AI system with observability and circuit breaker
import time, logging
from openai import OpenAI
from dataclasses import dataclass, field

client = OpenAI()
logger = logging.getLogger(__name__)

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_timeout: float = 60.0
    failures: int = 0
    last_failure: float = 0
    state: str = "closed"  # closed, open, half-open

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("Circuit breaker OPEN — service unavailable")
        try:
            result = fn(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
                self.failures = 0
            return result
        except Exception as e:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
                logger.error(f"Circuit breaker opened after {self.failures} failures")
            raise

cb = CircuitBreaker()

def call_with_observability(prompt: str, **kwargs) -> str:
    start = time.perf_counter()
    try:
        def _call():
            return client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                **kwargs
            ).choices[0].message.content
        result = cb.call(_call)
        latency = time.perf_counter() - start
        logger.info(f"LLM call ok latency={latency:.3f}s tokens={len(result.split())}")
        return result
    except Exception as e:
        latency = time.perf_counter() - start
        logger.error(f"LLM call failed latency={latency:.3f}s error={e}")
        raise
05 — Implementation

Working Code Example

A minimal production RAG system skeleton with caching, retry logic, and structured output:

# Production AI system skeleton
import anthropic
from sentence_transformers import SentenceTransformer
import chromadb
from typing import Optional
import logging
import time

logger = logging.getLogger(__name__)

class ProductionRAGSystem:
    def __init__(self, collection_name: str = "docs"):
        self.client = anthropic.Anthropic()
        self.embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
        self.db = chromadb.Client()
        self.collection = self.db.get_or_create_collection(collection_name)
        self.semantic_cache: dict = {}

    def _get_cached(self, query: str) -> Optional[str]:
        # Simple semantic cache (production: use cosine similarity)
        return self.semantic_cache.get(query)

    def query(self, question: str, max_retries: int = 3) -> dict:
        # Check cache
        cached = self._get_cached(question)
        if cached:
            return {"answer": cached, "source": "cache"}
        # Retrieve with fallback
        for attempt in range(max_retries):
            try:
                q_emb = self.embed_model.encode([question]).tolist()
                results = self.collection.query(query_embeddings=q_emb, n_results=3)
                context = "\n".join(results["documents"][0]) if results["documents"][0] else ""
                resp = self.client.messages.create(
                    model="claude-haiku-4-5-20251001",
                    max_tokens=512,
                    messages=[{"role": "user",
                               "content": f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"}]
                )
                answer = resp.content[0].text
                self.semantic_cache[question] = answer
                return {"answer": answer, "source": "rag", "context_used": len(context)}
            except Exception as e:
                logger.warning(f"Attempt {attempt+1} failed: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # exponential backoff
        return {"answer": "Service temporarily unavailable", "source": "fallback"}

system = ProductionRAGSystem()
result = system.query("What is the refund policy?")
print(result)

Key features: caching before retrieval, exponential backoff on failure, structured return (source tracking), and graceful fallback.

06 — Trade-offs

Cost, Latency & Quality Triangle

Every production AI system lives inside a triangle: cost, latency, and quality. You can optimise any two, but improving all three simultaneously requires architectural changes. Understanding this triangle prevents the most common mistake — chasing quality in development, then discovering the latency and cost are unacceptable at scale.

Typical levers: model size (quality vs cost/latency), quantization (cost/latency vs slight quality drop), caching (latency/cost vs freshness), retrieval (quality vs latency), streaming (perceived latency vs complexity). Map your requirements first, then choose the point in the triangle your use case demands.

| Lever | Improves | Trade-off | Typical savings |
|---|---|---|---|
| Smaller model (GPT-4o → 4o-mini) | Cost, latency | Quality drop on hard tasks | 10–30× cheaper |
| Quantization (fp16 → int4) | Cost, latency | Slight accuracy loss | 2–4× faster serving |
| Response caching | Cost, latency | Stale results on dynamic queries | Up to 60% cache hit |
| Streaming output | Perceived latency | More complex client code | Time-to-first-token -80% |
| Prompt compression | Cost, latency | Possible information loss | 30–50% token reduction |
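A back-of-the-envelope cost model makes these levers concrete. The per-million-token prices below are placeholders for illustration, not current list prices:

```python
def monthly_cost(queries: int, in_tokens: int, out_tokens: int,
                 price_in: float, price_out: float,
                 cache_hit_rate: float = 0.0) -> float:
    """USD per month; prices are per million tokens. Cache hits skip the LLM call."""
    effective_queries = queries * (1.0 - cache_hit_rate)
    per_query = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    return effective_queries * per_query

# Hypothetical comparison at 1M queries/month, 1k in / 300 out tokens:
big = monthly_cost(1_000_000, 1_000, 300, price_in=2.5, price_out=10.0)
small_cached = monthly_cost(1_000_000, 1_000, 300, price_in=0.15,
                            price_out=0.6, cache_hit_rate=0.6)
```

Running the two scenarios side by side is how you turn "smaller model plus caching" from an intuition into a number you can defend, before re-running the eval harness to price the quality side of the trade.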
07 — Go Deeper

What to Explore Next

This overview covered decision frameworks and architecture. For deeper dives:

- Decision Frameworks: Structured thinking for RAG vs FT vs agent vs pipeline choices.
- Evals-First Dev: Build your eval harness before your system. The #1 practice for production quality.
- Compound AI Systems: Pipelines of models, retrievers, routers, caches. How to orchestrate complex systems.
- Frontier Implications: How test-time compute and long context change system design.