System Architecture · Production

Prod Engineering

The engineering layer between prototype and production: execution models, reliability patterns, traffic management, session state, and human oversight.

4 Execution models
99.9% Uptime target
Human-in-loop Critical
Contents
  1. Prototype to production
  2. Execution models
  3. Reliability patterns
  4. Traffic & cost
  5. State & sessions
  6. Human oversight
  7. Patterns & examples
  8. References
01 — Challenge

From Prototype to Production

The journey from notebook to production is where AI systems fail most often. A working prototype is not a production system. Production requires: reliability under load, graceful failure handling, cost management, stateless design, observability, and human oversight.

Key Differences

Dimension      | Prototype                | Production
Availability   | Works most of the time   | 99.9%+ SLA required
Failure modes  | Crash = restart          | Graceful degradation, fallbacks
Latency        | Any latency acceptable   | P99 latency budget (typically 200-2000 ms)
Throughput     | Handle current requests  | Scale to 10x+ under load
Cost           | Who cares?               | Budget-aware scaling, cost per request
Observability  | Print statements         | Structured logging, metrics, traces, alerts
Testing        | Manual spot checks       | Automated testing, canary rollouts

The biggest shock for ML engineers moving to production is discovering that model accuracy isn't the main problem. Reliability is.

02 — Architecture

Execution Models: Sync, Async, Streaming, Batch

Different use cases require different execution patterns. Each has tradeoffs in latency, throughput, and complexity.

Synchronous (Blocking)

User waits for response. Request → Model inference → Response. Simple but scales poorly. Suitable for: chatbots, real-time classification, low-latency requirements.

Asynchronous (Queue-based)

Request is queued, user is notified later. Request → Queue → Worker pool → Callback. Decouples producer and consumer. Good for: document processing, batch recommendations, report generation.
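The queue-based flow above can be sketched with the standard library alone; `process_document` stands in for real model inference, and the pool size is arbitrary:

```python
import queue
import threading

jobs = queue.Queue(maxsize=100)   # bounded: backpressure instead of unbounded growth
results = {}

def process_document(doc_id: str) -> str:
    # stand-in for real model inference
    return f"processed:{doc_id}"

def worker():
    while True:
        doc_id = jobs.get()
        if doc_id is None:          # sentinel shuts the worker down
            break
        results[doc_id] = process_document(doc_id)
        jobs.task_done()

# a small worker pool decouples producers from consumers
threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()

for i in range(10):
    jobs.put(f"doc-{i}")           # producer returns immediately
jobs.join()                        # wait for all queued work to finish
```

In production the in-process queue becomes SQS, RabbitMQ, or Kafka, and the callback replaces the shared `results` dict, but the producer/consumer split is the same.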

Streaming

Continuous token-by-token output. Client receives partial results as they're generated. Essential for: LLM chat, any generative task. Dramatically improves perceived latency.

Batch Processing

Process many items at once on schedule. Suitable for: periodic model scoring, feature generation, nightly reranking. Lowest cost per request.
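Batch jobs usually reduce to chunking a dataset and scoring one chunk per call; a minimal chunking helper:

```python
def batched(items, size):
    """Yield successive fixed-size chunks of a list (last chunk may be short)."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```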

Example: Retry Pattern

Production systems must handle transient failures (network hiccups, rate limits, temporary service outages). Implement exponential backoff retries:

import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def llm_call_with_retry(max_retries=3, backoff=2.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                start = time.time()
                try:
                    result = func(*args, **kwargs)
                    latency_ms = (time.time() - start) * 1000
                    logger.info(f"llm_call_success latency_ms={latency_ms:.0f} attempt={attempt}")
                    return result
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise
                    wait = backoff ** attempt
                    logger.warning(f"llm_call_retry attempt={attempt} error={e} wait={wait}s")
                    time.sleep(wait)
        return wrapper
    return decorator
03 — Resilience

Reliability Patterns

Production systems fail. The goal is graceful degradation, not the pursuit of failure-free operation.

Retry with Backoff

Transient failures often recover within milliseconds. Retry with exponential backoff: 100ms, 200ms, 400ms, 800ms, etc. Prevents cascading failures when upstream services flake.

Circuit Breaker

If an external service is failing, stop calling it and fail fast instead of waiting out timeouts on every request. Circuits have three states: closed (normal), open (failing, fast-fail), half-open (probe for recovery).
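A minimal sketch of the three-state breaker described above (thresholds are illustrative, and a production implementation also needs thread safety):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closed -> open after N consecutive failures,
    half-open after a cooldown, closed again once a probe call succeeds."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None      # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # trip the breaker
            raise
        self.failures = 0          # success: close the circuit
        self.opened_at = None
        return result
```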

Fallback Responses

Always have a plan when a service fails. For recommendations: return popular items. For translations: return original text. For moderation: default to safe (block). Don't just error out.
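The fallback idea generalizes to a tiny helper: try each provider in order and return a safe default if all of them fail (the providers and default here are placeholders):

```python
def first_success(callables, default):
    """Return the first provider's result that doesn't raise; else the default."""
    for fn in callables:
        try:
            return fn()
        except Exception:
            continue               # degraded, keep trying the next option
    return default                 # e.g. popular items, original text, "block"
```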

Bulkheads

Isolate critical resources. If one model inference is slow, don't let it block other requests. Use separate thread pools, queues, or machines for different tasks.
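A bulkhead sketch using separate thread pools, so slow batch work cannot starve interactive traffic (pool sizes and the handler bodies are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Separate pools: a flood of slow report jobs can exhaust batch_pool
# while interactive_pool keeps serving chat traffic.
interactive_pool = ThreadPoolExecutor(max_workers=16, thread_name_prefix="interactive")
batch_pool = ThreadPoolExecutor(max_workers=4, thread_name_prefix="batch")

def handle_chat(prompt: str) -> str:
    return interactive_pool.submit(lambda: f"reply:{prompt}").result()

def handle_report(job_id: str) -> str:
    return batch_pool.submit(lambda: f"report:{job_id}").result()
```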

Timeouts

Never wait forever. Set aggressive timeouts (100-2000 ms); failing fast beats hanging. Cascading timeouts are the enemy: if a request flows through A→B→C and each stage is allowed 30 s, the worst case is 90 s before the client sees anything. Propagate a deadline down the chain instead.
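A timeout wrapper with a fallback, using only the standard library:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

pool = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, fallback):
    """Run fn in the pool; return fallback if it doesn't finish in time."""
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        future.cancel()   # best effort; an already-running thread still completes
        return fallback
```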

Load Shedding

When overloaded, reject low-priority requests instead of degrading everything. Queue capacity is a feature, not a bug.
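Load shedding can be as simple as a bounded queue plus an admission check; the `admit` function and priority labels here are illustrative:

```python
import queue

request_queue = queue.Queue(maxsize=50)   # bounded on purpose: full queue = shed load

def admit(request, priority: str = "low") -> bool:
    """Admit a request, shedding low-priority work when the queue is full.
    A False return maps to an HTTP 503 / "try again later" at the edge."""
    if priority != "high" and request_queue.full():
        return False                      # shed early, before doing any work
    try:
        request_queue.put_nowait(request)
        return True
    except queue.Full:
        return False
```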

04 — Efficiency

Traffic Management & Cost Control

As traffic grows, costs grow. LLM inference is expensive. A single call to Claude costs $0.0015-$0.10 depending on model. With millions of requests, this adds up fast.

Cost Per Request

Track cost per request and optimize its main drivers: prompt and completion token counts, model choice, cache hit rate, and retry amplification.

Traffic Scaling

Design for horizontal scaling: stateless workers behind a load balancer, autoscaled on queue depth or request rate rather than CPU alone.

Cost vs. Quality Tradeoff

Not every request needs the strongest model. Route intelligently: send short, simple queries to a small, cheap model and reserve the large model for complex requests.
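A routing heuristic can be as simple as a length and keyword check; the model names, threshold, and marker words below are placeholders:

```python
CHEAP = "small-model"        # hypothetical cheap model
EXPENSIVE = "large-model"    # hypothetical strong model

def route(prompt: str) -> str:
    """Pick a model tier from crude complexity signals."""
    complex_markers = ("analyze", "compare", "step by step", "prove")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in complex_markers):
        return EXPENSIVE
    return CHEAP
```

Real routers add a learned classifier or cascade (try cheap first, escalate on low confidence), but even this crude rule cuts spend when most traffic is simple.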

05 — Statefulness

State & Session Management

Production systems must be stateless for horizontal scaling. Session data (conversation history, user preferences) must live outside the application server.

Conversation State

For stateful interactions (chatbots, multi-turn conversations), store session history in a database or cache, not in application memory. Common options are Redis (fast, natural TTL-based expiry), PostgreSQL (durable and queryable), or a managed key-value store such as DynamoDB.

Session Size

Conversation history grows with each turn. Long conversations cause rising latency, rising cost per turn, and eventual context-window overflow.

Solution: Summarization. Periodically compress old messages into a summary, discarding originals. Trade memory for one model call.
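A minimal sketch of this compaction, where `summarize` stands in for a model call that condenses old messages into a short string:

```python
def compact_history(messages, keep_recent=4, summarize=None):
    """Compress old turns into one summary message; keep the recent tail.

    `summarize` is a stand-in for a model call: list of messages -> str.
    Without it, a placeholder summary is used.
    """
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old) if summarize else f"[summary of {len(old)} messages]"
    return [{"role": "system", "content": summary}] + recent
```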

Idempotency

If a request is retried (due to network failure), it should produce the same result. Use idempotency keys: client provides a unique ID, server deduplicates.
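The server side of idempotency keys is a keyed result cache; a sketch using an in-process dict (production would use Redis with a TTL):

```python
_processed = {}   # idempotency_key -> cached result

def handle_request(idempotency_key: str, do_work) -> str:
    """Return the cached result for a retried request instead of redoing work."""
    if idempotency_key in _processed:
        return _processed[idempotency_key]   # duplicate: no side effects re-run
    result = do_work()
    _processed[idempotency_key] = result
    return result
```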

06 — Control

Human Oversight Loops

AI systems make mistakes. Building human oversight into the loop is not a limitation—it's essential for safety and quality.

Common Patterns

Approval queues (the model drafts, a human approves), confidence-threshold escalation, random spot-check sampling of automated decisions, and user-facing appeal loops.

Building Effective Workflows

Make human oversight easy and attractive: surface the model's confidence and reasoning, pre-fill a suggested decision so reviewers confirm rather than compose, keep review queues short, and feed reviewer corrections back into evaluation data.
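Confidence-threshold escalation, the most common of these patterns, reduces to a small routing function (the threshold and field names here are illustrative):

```python
def triage(prediction: str, confidence: float, auto_threshold: float = 0.9):
    """Auto-approve confident predictions; queue the rest for human review."""
    if confidence >= auto_threshold:
        return {"decision": prediction, "route": "auto"}
    # Low confidence: pre-fill the model's suggestion so the reviewer
    # confirms or overrides rather than deciding from scratch.
    return {"decision": None, "route": "human_review",
            "suggested": prediction, "confidence": confidence}
```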

07 — Ecosystem

Production Tools & Infrastructure

Modal
Serverless GPU
Deploy Python functions as serverless endpoints. Auto-scales, pay-per-use. Great for models.
Ray Serve
Model Serving
Serve models at scale. Batching, versioning, A/B testing built-in. Open source.
vLLM
LLM Inference
Fast LLM serving. Paged attention, KV cache optimization. SOTA performance.
BentoML
Model Packaging
Package models as versioned services. Deploy anywhere (cloud, edge, on-prem).
Kubernetes
Orchestration
Industry standard for container orchestration. Complex but powerful.
Docker
Containerization
Package models + dependencies as images. Reproducible, portable.
Prometheus
Metrics
Time-series metrics database. Scrape endpoints, set alerts.
Datadog
Monitoring
Full-stack monitoring. Logs, traces, metrics, APM. Comprehensive.
Temporal
Workflows
Durable async workflows. Survives restarts, handles retries, auditable.
OpenTelemetry
Observability
Standard instrumentation for logs, traces, metrics. Vendor-agnostic.
Python · Async streaming LLM endpoint with SSE
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI, APITimeoutError

app = FastAPI()
client = AsyncOpenAI()

@app.get("/stream")   # GET, so browser EventSource clients can connect
async def stream_completion(prompt: str):
    """Server-Sent Events endpoint for streaming LLM output."""
    async def generate():
        try:
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                stream=True,
                timeout=30.0
            )
            async for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                if delta:
                    # SSE framing: each event is "data: <payload>\n\n"
                    yield f"data: {delta}\n\n"
        except APITimeoutError:
            yield "data: [TIMEOUT]\n\n"
        except Exception as e:
            yield f"data: [ERROR: {e}]\n\n"
        finally:
            yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",   # disable nginx buffering
            "Connection": "keep-alive"
        }
    )

# JavaScript client:
# const es = new EventSource('/stream?prompt=Hello');
# es.onmessage = (e) => { if (e.data === '[DONE]') es.close();
#                          else document.body.innerText += e.data; };
Python · Retry with exponential backoff and fallback model
import time, random, logging
from functools import wraps
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()
PRIMARY   = "gpt-4o"
FALLBACK  = "gpt-4o-mini"

def with_retry(max_attempts=3, base_delay=1.0, use_fallback=True):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            last_err = None
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RateLimitError as e:
                    last_err = e
                    wait = base_delay * (2 ** attempt) + random.random()
                    logging.warning(f"Rate limit — retry in {wait:.1f}s (attempt {attempt+1})")
                    time.sleep(wait)
                except APIError as e:
                    if getattr(e, "status_code", None) and e.status_code >= 500:
                        last_err = e
                        time.sleep(base_delay * (attempt + 1))
                    else:
                        raise  # 4xx = client error, don't retry
            # All retries exhausted — try cheaper fallback
            if use_fallback and kwargs.get("model", PRIMARY) == PRIMARY:
                logging.warning(f"Falling back to {FALLBACK}")
                kwargs["model"] = FALLBACK
                return fn(*args, **kwargs)
            raise last_err
        return wrapper
    return decorator

@with_retry(max_attempts=3, use_fallback=True)
def call_llm(prompt: str, model: str = PRIMARY) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    ).choices[0].message.content
08 — Related Topics

Deep Dive into Subclusters

Production engineering breaks into specialized domains: model serving, observability and monitoring, cost optimization, and workflow orchestration.

09 — Further Reading

References
