The practitioner layer — from prototype to production system
Building a production AI system takes far more than calling an LLM API. It requires six layers: an interface, orchestration logic, LLM calls and tool integrations, a data layer, reliability wrappers, and observability. Skipping any layer leads to fragile, unpredictable systems.
1. User-facing API/Interface: REST endpoint, CLI, chat UI, or webhook. This is how users interact with the system.
2. Orchestration: Agent loops (ReAct, tool-use loops) or deterministic pipelines (chain A → B → C). Controls whether LLM drives decisions or flow is pre-defined.
3. LLM + Tools: Claude, GPT-4, or open-source. Tools are functions: retrieval, calculation, code execution, external APIs.
4. Data Layer: Vector database, relational DB, file storage. Handles retrieval, metadata filtering, embeddings.
5. Reliability: Retry logic with exponential backoff, fallback endpoints, caching (semantic or exact-match), rate limiting.
6. Observability: Logging, tracing, eval metrics. If you cannot measure it, you cannot improve it.
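The six layers above can be wired together in one request path. The sketch below uses hypothetical `Retriever`, `LLMClient`, and `Cache` interfaces (these names are illustrative, not a standard API) to show where each layer sits in a single call:

```python
from typing import Protocol, Optional

class Retriever(Protocol):            # data layer
    def retrieve(self, query: str) -> list[str]: ...

class LLMClient(Protocol):            # LLM layer
    def generate(self, prompt: str) -> str: ...

class Cache(Protocol):                # reliability layer
    def get(self, key: str) -> Optional[str]: ...
    def put(self, key: str, value: str) -> None: ...

def handle_request(query: str, cache: Cache, retriever: Retriever,
                   llm: LLMClient) -> str:
    """One request through the stack: cache check, then retrieval, then generation."""
    if (hit := cache.get(query)) is not None:   # reliability: serve from cache
        return hit
    context = "\n".join(retriever.retrieve(query))          # data layer
    answer = llm.generate(f"Context:\n{context}\n\nQuestion: {query}")  # LLM call
    cache.put(query, answer)                                # populate cache
    return answer
```

The interface and observability layers would wrap `handle_request` (an HTTP handler on the outside, logging and tracing around each step).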
A compact decision tree makes the pattern choice explicit:

```python
from enum import Enum
from dataclasses import dataclass

class PatternType(Enum):
    RAG = "rag"
    AGENT = "agent"
    STRUCTURED_OUTPUT = "structured_output"
    FINE_TUNED = "fine_tuned"
    SIMPLE_PROMPT = "simple_prompt"

@dataclass
class Requirements:
    needs_realtime_data: bool = False
    needs_multi_step_reasoning: bool = False
    output_is_structured: bool = False
    high_volume: bool = False
    domain_specific: bool = False
    latency_sensitive: bool = False

def choose_pattern(req: Requirements) -> PatternType:
    """Decision tree for selecting the right GenAI architecture."""
    # Check fine-tuning first: high volume plus domain data justifies training
    # cost, and would otherwise be shadowed by the RAG branch below.
    if req.high_volume and req.domain_specific:
        return PatternType.FINE_TUNED
    if req.needs_realtime_data or req.domain_specific:
        return PatternType.RAG
    if req.needs_multi_step_reasoning:
        return PatternType.AGENT
    if req.output_is_structured:
        return PatternType.STRUCTURED_OUTPUT
    return PatternType.SIMPLE_PROMPT

# Examples
cases = [
    Requirements(needs_realtime_data=True),
    Requirements(needs_multi_step_reasoning=True),
    Requirements(output_is_structured=True, high_volume=True),
    Requirements(domain_specific=True, high_volume=True),
]
for r in cases:
    print(choose_pattern(r).value)
# rag, agent, structured_output, fine_tuned
```
Every production AI system makes three architecture decisions: how to supply knowledge (prompting, RAG, or fine-tuning), how to control flow (agent loop or deterministic pipeline), and whether to build, buy, or adopt open source. Each has profound cost, quality, and latency trade-offs.
Prompting alone: Fast, cheap, works for general knowledge. Fails when you have proprietary data or need up-to-date facts.
RAG (Retrieval-Augmented Generation): Retrieve relevant chunks, inject into context, generate. Best for knowledge bases, documentation, FAQ. Flexible (update documents without retraining), transparent (grounding is visible). Cost: retrieval latency + embedding overhead.
Fine-tuning: Train on your data to bake knowledge into weights. Best when you need reasoning over proprietary data or want to enforce style/format. Cost: training time, inference cost (usually higher), data preparation. Risk: outdated knowledge (static weights).
| Approach | Speed | Cost | Flexibility | Best for |
|---|---|---|---|---|
| Prompting alone | Fast | Low | High (change prompt) | General Q&A, general knowledge |
| RAG | Medium | Medium | High (update docs) | Knowledge bases, docs, FAQ |
| Fine-tuning | Depends | High | Low (retrain to update) | Proprietary reasoning, style control |
| RAG + FT | Medium | High | Medium | Domain knowledge + reasoning |
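The cost column of this table can be made concrete with back-of-envelope arithmetic. All prices below are illustrative placeholders, not current provider rates; RAG's extra cost comes mostly from the retrieved chunks it injects into the prompt:

```python
# Hypothetical per-1K-token prices (illustrative only, not real rates).
PRICE_PER_1K_INPUT = 0.0025   # USD, LLM input tokens
PRICE_PER_1K_OUTPUT = 0.01    # USD, LLM output tokens
PRICE_PER_1K_EMBED = 0.0001   # USD, embedding tokens

def cost_per_query(input_tokens: int, output_tokens: int,
                   embed_tokens: int = 0) -> float:
    """Sum the token-priced components of one query."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT
            + embed_tokens / 1000 * PRICE_PER_1K_EMBED)

# Prompting alone: short prompt, no retrieval.
prompt_only = cost_per_query(input_tokens=300, output_tokens=400)
# RAG: query embedding plus ~3 retrieved chunks of ~500 tokens in context.
rag = cost_per_query(input_tokens=300 + 3 * 500, output_tokens=400,
                     embed_tokens=30)
print(f"prompt-only ~${prompt_only:.4f}, RAG ~${rag:.4f} per query")
```

At these assumed prices the RAG query costs roughly twice the bare prompt, dominated by the injected context tokens; the trade is paying that overhead for grounding.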
Agent loop (ReAct): LLM decides when to use tools, what to query, whether to iterate. Flexible, can handle novel cases. Risk: unpredictable latency, can hallucinate about tool availability, can loop infinitely.
Deterministic pipeline: Fixed sequence: retrieve → rerank → generate. Fast, predictable, debuggable. Risk: brittle (breaks on edge cases the pipeline wasn't designed for).
Most production systems use a hybrid: 80% deterministic pipeline (fast path), 20% agent fallback for edge cases.
Build: Custom implementation. Full control, longest time-to-market, ongoing maintenance.
Buy (SaaS): Hosted RAG/agent platforms (e.g., Anthropic's APIs, LangSmith, Vectara). Fast deployment, vendor lock-in risk.
Open-source: LlamaIndex, LangChain, DSPy. Flexible, but you own operations and updates.
The single most important practice: design your eval harness before you design the system. Without metrics, all decisions are opinions.
Correctness: Is the final answer right? Measured by comparing against ground truth (for factual Q&A) or human rating (for subjective tasks).
Grounding: Is the answer supported by retrieved context? Measure: # claims supported by context / total claims.
Latency: How fast is the response? P50, P99, tail latency matter for interactive systems.
Cost per query: Sum of API calls (LLM, embedding, reranking). For high-volume systems, this dominates.
Coverage: What % of queries does the system handle (vs. fallback)? Coverage vs accuracy trade-off is crucial.
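The grounding ratio above can be sketched as a function. The support check here is a toy substring match; production systems typically use an NLI model or LLM judge to decide whether a claim is entailed by the context:

```python
def grounding_score(claims: list[str], context: str) -> float:
    """Fraction of answer claims supported by retrieved context:
    supported claims / total claims. Substring match is a stand-in
    for a real entailment check."""
    if not claims:
        return 1.0   # vacuously grounded: nothing asserted
    supported = sum(1 for c in claims if c.lower() in context.lower())
    return supported / len(claims)
```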
A minimal eval harness over a golden set:

```python
from openai import OpenAI

client = OpenAI()

GOLDEN_SET = [
    {"input": "What is RAG?",
     "must_contain": ["retrieval", "generation"],
     "must_not_contain": ["hallucination is fine"]},
    {"input": "Explain fine-tuning in one sentence.",
     "must_contain": ["training", "weights"],
     "must_not_contain": []},
]

def evaluate_system(system_prompt: str, model: str = "gpt-4o") -> dict:
    results = []
    for item in GOLDEN_SET:
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": item["input"]},
            ],
        ).choices[0].message.content.lower()
        passed = (
            all(kw in resp for kw in item["must_contain"]) and
            all(kw not in resp for kw in item["must_not_contain"])
        )
        results.append({"input": item["input"], "passed": passed,
                        "response": resp[:100]})
    score = sum(r["passed"] for r in results) / len(results)
    return {"score": score, "details": results, "passed": score >= 0.8}

# Gate deploys on eval results
result = evaluate_system("You are a concise GenAI expert. Answer clearly.")
print(f"Score: {result['score']:.0%} → {'DEPLOY OK' if result['passed'] else 'BLOCK'}")
```
1. Create a test set (100–500 examples). For proprietary data, sample real user queries.
2. Define metrics (correctness, latency, cost).
3. Baseline the current system.
4. Run evals before every architecture change.
5. Track metrics over time in a dashboard.
Production systems fail. Networks are flaky. APIs time out. Your job is to handle failures gracefully.
On transient failure (timeout, 429, 503), retry with increasing wait: 1s, 2s, 4s, 8s, 16s. Jitter (add random noise) prevents thundering herd. Max retries: 3–5. Timeout per attempt: 10–30s depending on operation.
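The backoff-with-jitter schedule can be sketched as a small wrapper; the function name and parameters here are illustrative:

```python
import random
import time

def retry_with_backoff(fn, *, max_retries: int = 4, base: float = 1.0,
                       retryable=(TimeoutError, ConnectionError)):
    """Call fn(); on a transient error, wait base*2^attempt seconds
    (1s, 2s, 4s, 8s with defaults) plus random jitter, then retry.
    Non-retryable errors and the final failure propagate to the caller."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise                       # out of retries: surface the error
            wait = base * (2 ** attempt) + random.uniform(0, base)  # full jitter
            time.sleep(wait)
```

The jitter term spreads retries from many clients over time, so a recovering backend is not hit by a synchronized thundering herd.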
Model fallback: If Claude fails, try GPT-4. If both fail, return cached result. Rank by cost and latency.
Endpoint fallback: Multiple regions. If us-east fails, try eu-west.
Feature fallback: If retrieval fails, still generate from prompt alone. Quality drops, but service is up.
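The three fallback styles share one shape: an ordered chain tried until something answers. A minimal sketch, with hypothetical provider callables ranked by the caller in cost/latency order:

```python
def call_with_fallbacks(prompt: str, providers, cache_lookup):
    """Try each (name, callable) provider in order; fall back to a cached
    answer as the last resort. Providers raise on failure."""
    for name, provider in providers:
        try:
            return {"answer": provider(prompt), "source": name}
        except Exception:
            continue                     # this provider is down; try the next
    cached = cache_lookup(prompt)        # degrade gracefully to stale data
    if cached is not None:
        return {"answer": cached, "source": "cache"}
    raise RuntimeError("all providers and cache failed")
```

Returning the `source` alongside the answer makes degraded responses visible in logs and to downstream consumers.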
Exact-match cache: Hash query, return cached result if query seen before. Simple, high hit rate on repeated questions.
Semantic cache: Embed query, find similar cached queries (cosine similarity > threshold), return their results. More flexible, captures paraphrases.
TTL (time-to-live): Invalidate cache after N hours or on data update. Balance freshness vs speed.
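An exact-match cache with TTL invalidation is a few lines; this is a sketch of the idea, not a production cache (no size bound, no locking):

```python
import hashlib
import time
from typing import Optional

class ExactMatchCache:
    """Exact-match cache keyed by a hash of the normalized query,
    with time-to-live invalidation."""
    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    def _key(self, query: str) -> str:
        # Light normalization so trivial variants of a query still hit.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str) -> Optional[str]:
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        stored_at, value = entry
        if time.time() - stored_at > self.ttl:   # expired: evict and miss
            del self._store[self._key(query)]
            return None
        return value

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (time.time(), answer)
```

A semantic cache replaces `_key` with an embedding lookup over stored queries, matching on cosine similarity above a threshold instead of hash equality.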
Protect your system and wallet: per-user request limits, concurrent request limits, cost budgets. Transparent errors ("You've hit your monthly limit") are better than silent failures.
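Per-user limits are commonly implemented as a token bucket; a minimal single-process sketch (a real deployment would keep buckets in shared storage such as Redis):

```python
import time

class TokenBucket:
    """Rate limiter allowing `rate` requests/second with bursts up to
    `capacity`. One bucket per user."""
    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False   # caller should return a transparent limit error, not fail silently
```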
A circuit breaker plus basic observability around each LLM call:

```python
import time
import logging
from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()
logger = logging.getLogger(__name__)

@dataclass
class CircuitBreaker:
    failure_threshold: int = 5
    reset_timeout: float = 60.0
    failures: int = 0
    last_failure: float = 0.0
    state: str = "closed"  # closed, open, half-open

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"   # probe whether the service recovered
            else:
                raise RuntimeError("Circuit breaker OPEN — service unavailable")
        try:
            result = fn(*args, **kwargs)
            # Any success closes the circuit and clears the failure count.
            self.state = "closed"
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            self.last_failure = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
                logger.error(f"Circuit breaker opened after {self.failures} failures")
            raise

cb = CircuitBreaker()

def call_with_observability(prompt: str, **kwargs) -> str:
    start = time.perf_counter()
    try:
        def _call():
            return client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                **kwargs,
            ).choices[0].message.content
        result = cb.call(_call)
        latency = time.perf_counter() - start
        logger.info(f"LLM call ok latency={latency:.3f}s tokens={len(result.split())}")
        return result
    except Exception as e:
        latency = time.perf_counter() - start
        logger.error(f"LLM call failed latency={latency:.3f}s error={e}")
        raise
```
A minimal production RAG system skeleton with caching, retry logic, and structured output:
```python
# Production AI system skeleton
import time
import logging
from typing import Optional

import anthropic
import chromadb
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)

class ProductionRAGSystem:
    def __init__(self, collection_name: str = "docs"):
        self.client = anthropic.Anthropic()
        self.embed_model = SentenceTransformer("BAAI/bge-small-en-v1.5")
        self.db = chromadb.Client()
        self.collection = self.db.get_or_create_collection(collection_name)
        self.semantic_cache: dict = {}

    def _get_cached(self, query: str) -> Optional[str]:
        # Simple semantic cache (production: use cosine similarity)
        return self.semantic_cache.get(query)

    def query(self, question: str, max_retries: int = 3) -> dict:
        # Check cache
        cached = self._get_cached(question)
        if cached:
            return {"answer": cached, "source": "cache"}
        # Retrieve with fallback
        for attempt in range(max_retries):
            try:
                q_emb = self.embed_model.encode([question]).tolist()
                results = self.collection.query(query_embeddings=q_emb, n_results=3)
                context = "\n".join(results["documents"][0]) if results["documents"][0] else ""
                resp = self.client.messages.create(
                    model="claude-haiku-4-5-20251001",
                    max_tokens=512,
                    messages=[{"role": "user", "content":
                               f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"}],
                )
                answer = resp.content[0].text
                self.semantic_cache[question] = answer
                return {"answer": answer, "source": "rag", "context_used": len(context)}
            except Exception as e:
                logger.warning(f"Attempt {attempt+1} failed: {e}")
                if attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # exponential backoff
        return {"answer": "Service temporarily unavailable", "source": "fallback"}

system = ProductionRAGSystem()
result = system.query("What is the refund policy?")
print(result)
```

Key features: caching before retrieval, exponential backoff on failure, structured return (source tracking), and graceful fallback.
Every production AI system lives inside a triangle: cost, latency, and quality. You can optimise any two, but improving all three simultaneously requires architectural changes. Understanding this triangle prevents the most common mistake — chasing quality in development, then discovering the latency and cost are unacceptable at scale.
Typical levers: model size (quality vs cost/latency), quantization (cost/latency vs slight quality drop), caching (latency/cost vs freshness), retrieval (quality vs latency), streaming (perceived latency vs complexity). Map your requirements first, then choose the point in the triangle your use case demands.
| Lever | Improves | Trade-off | Typical Savings |
|---|---|---|---|
| Smaller model (GPT-4o → 4o-mini) | Cost, latency | Quality drop on hard tasks | 10–30× cheaper |
| Quantization (fp16 → int4) | Cost, latency | Slight accuracy loss | 2–4× faster serving |
| Response caching | Cost, latency | Stale results on dynamic queries | Up to 60% cache hit |
| Streaming output | Perceived latency | More complex client code | Time-to-first-token -80% |
| Prompt compression | Cost, latency | Possible information loss | 30–50% token reduction |
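Combining levers multiplies savings. A back-of-envelope blended-cost calculation, under the assumption that cache hits cost roughly nothing and some traffic can be routed to a cheaper model (the function and its parameters are illustrative):

```python
def blended_cost(base_cost: float, cache_hit_rate: float,
                 small_model_share: float, small_model_ratio: float) -> float:
    """Per-query cost after two levers: cache hits cost ~0, and a share of
    cache misses goes to a model `small_model_ratio` times the base price."""
    miss = 1.0 - cache_hit_rate
    routed = base_cost * small_model_ratio * small_model_share  # cheap-model traffic
    full = base_cost * (1.0 - small_model_share)                # big-model traffic
    return miss * (routed + full)

# Illustrative: $0.01 baseline, 60% cache hits, 70% of misses to a 20x-cheaper model
print(blended_cost(0.01, 0.60, 0.70, 1 / 20))
```

Under these assumed numbers the blended cost drops to well under a fifth of the baseline, without touching the quality-critical 30% of miss traffic.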
This overview covered decision frameworks and architecture. For deeper dives: