The engineering layer between prototype and production: execution models, reliability patterns, traffic management, session state, and human oversight.
The journey from notebook to production is where AI systems fail most often. A working prototype is not a production system. Production requires: reliability under load, graceful failure handling, cost management, stateless design, observability, and human oversight.
| Dimension | Prototype | Production |
|---|---|---|
| Availability | Works most of the time | 99.9%+ SLA required |
| Failure modes | Crash = restart | Graceful degradation, fallbacks |
| Latency | Any latency acceptable | P99 latency budget (typically 200-2000ms) |
| Throughput | Handle current requests | Scale to 10x+ under load |
| Cost | Who cares? | Budget-aware scaling, cost per request |
| Observability | Print statements | Structured logging, metrics, traces, alerts |
| Testing | Manual spot checks | Automated testing, canary rollouts |
The biggest shock for ML engineers moving to production is discovering that model accuracy isn't the main problem. Reliability is.
Different use cases require different execution patterns. Each has tradeoffs in latency, throughput, and complexity.
User waits for response. Request → Model inference → Response. Simple but scales poorly. Suitable for: chatbots, real-time classification, low-latency requirements.
Request is queued, user is notified later. Request → Queue → Worker pool → Callback. Decouples producer and consumer. Good for: document processing, batch recommendations, report generation.
Continuous token-by-token output. Client receives partial results as they're generated. Essential for: LLM chat, any generative task. Dramatically improves perceived latency.
Process many items at once on schedule. Suitable for: periodic model scoring, feature generation, nightly reranking. Lowest cost per request.
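The async pattern above (request → queue → worker pool) can be sketched with the standard library alone; the worker names and job labels here are illustrative, and a real worker would run model inference instead of formatting a string:

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: list):
    # Pull jobs off the queue until cancelled; this is the consumer side.
    while True:
        job = await queue.get()
        # Stand-in for model inference or a callback to the client.
        results.append(f"{name} processed {job}")
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    results = []
    # Producer: enqueue requests without waiting for them to finish.
    for i in range(5):
        queue.put_nowait(f"doc-{i}")
    # Worker pool: two consumers drain the queue concurrently.
    workers = [asyncio.create_task(worker(f"w{n}", queue, results))
               for n in range(2)]
    await queue.join()  # block until every job is marked done
    for w in workers:
        w.cancel()
    return results

results = asyncio.run(main())
```

The queue decouples producer and consumer: the producer returns immediately, and throughput scales by adding workers.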
Production systems must handle transient failures (network hiccups, rate limits, temporary service outages). Implement retries with exponential backoff.
Production systems fail. Graceful degradation beats pretending failures won't happen.
Transient failures often recover within milliseconds. Retry with exponential backoff: 100ms, 200ms, 400ms, 800ms, etc. Prevents cascading failures when upstream services flake.
If an external service is failing, stop calling it and fail fast instead of burning up timeouts. Circuits have three states: closed (normal), open (failing, fast-fail), half-open (probe for recovery).
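A minimal sketch of the three-state breaker described above; thresholds and timeouts are illustrative, and a production version would also need thread safety and per-dependency instances:

```python
import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after a
    cooldown, letting one probe through; a success closes it again."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one probe request through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success closes the circuit
            return result
```

While open, callers fail in microseconds instead of burning a full timeout per request, which is what stops one failing dependency from cascading.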
Always have a plan when a service fails. For recommendations: return popular items. For translations: return original text. For moderation: default to safe (block). Don't just error out.
Isolate critical resources. If one model inference is slow, don't let it block other requests. Use separate thread pools, queues, or machines for different tasks.
Never wait forever. Set aggressive timeouts (100-2000ms). Better to fail fast than hang. Cascading timeouts are your enemy: if A→B→C each waits 30s, total is 90s.
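Bounding an await is a one-liner with `asyncio.wait_for`; the wrapper below pairs it with a fallback value so a timeout degrades instead of erroring (timeout values are illustrative):

```python
import asyncio

async def call_with_timeout(coro, timeout: float = 2.0, fallback=None):
    """Bound every await: return a fallback instead of hanging forever."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        return fallback

async def slow_model():
    await asyncio.sleep(10)  # simulates a hung upstream call
    return "done"

result = asyncio.run(
    call_with_timeout(slow_model(), timeout=0.05, fallback="[timed out]"))
```

Budget timeouts top-down: if the caller has 2s, each downstream hop must get strictly less, so the chain can never exceed the caller's budget.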
When overloaded, reject low-priority requests instead of degrading everything. Queue capacity is a feature, not a bug.
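A bounded admission queue makes the shedding decision explicit; this sketch rejects low-priority work when full and evicts to make room for high-priority work (the eviction policy is an illustrative choice):

```python
import queue

class LoadShedder:
    """Bounded admission queue: when full, reject low-priority requests
    immediately instead of letting every request degrade."""

    def __init__(self, capacity: int = 100):
        self.q = queue.Queue(maxsize=capacity)

    def admit(self, request, priority: str = "low") -> bool:
        try:
            self.q.put_nowait(request)
            return True
        except queue.Full:
            if priority == "high":
                # Evict one queued item to make room for priority work.
                try:
                    self.q.get_nowait()
                except queue.Empty:
                    pass
                self.q.put_nowait(request)
                return True
            return False  # shed: caller returns 429 or a fallback
```

Rejected requests should get an immediate 429 with a retry hint, which is far kinder to clients than a 30-second hang followed by a timeout.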
As traffic grows, costs grow. LLM inference is expensive. A single call to Claude costs $0.0015-$0.10 depending on model. With millions of requests, this adds up fast.
Track and optimize cost per request, token usage per call (prompt and completion), cache hit rate, and spend per model tier.
Not every request needs the best model. Route intelligently: send short, simple requests to a small, cheap model and reserve the flagship model for complex ones.
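A routing function can be as simple as a heuristic gate; the model names, length threshold, and `needs_reasoning` flag below are illustrative stand-ins for whatever classifier or rules fit your traffic:

```python
CHEAP = "gpt-4o-mini"      # illustrative model tiers
FLAGSHIP = "gpt-4o"

def pick_model(prompt: str, needs_reasoning: bool = False) -> str:
    """Send short, simple requests to the cheap tier; reserve the
    flagship model for long or reasoning-heavy requests."""
    if needs_reasoning or len(prompt) > 2000:
        return FLAGSHIP
    return CHEAP
```

Even a crude router like this can cut spend dramatically when most traffic is simple, since the cheap tier typically costs an order of magnitude less per token.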
Production systems must be stateless for horizontal scaling. Session data (conversation history, user preferences) must live outside the application server.
For stateful interactions (chatbots, multi-turn conversations), store session history in a database or cache, not in memory. Options include an in-memory store such as Redis (fast, with built-in TTL expiry) or a durable database such as PostgreSQL or DynamoDB.
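The session-store interface can be sketched independently of the backend; here a plain dict stands in for Redis, and the TTL and serialization mirror what a Redis `SETEX`-style store would give you:

```python
import json
import time

class SessionStore:
    """Session history kept outside the app server. Backed by a dict
    for illustration; production would swap in Redis or DynamoDB."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._data = {}  # session_id -> (expires_at, serialized history)

    def append(self, session_id: str, role: str, content: str):
        history = self.get(session_id)
        history.append({"role": role, "content": content})
        # Serialize on every write so any server replica can read it.
        self._data[session_id] = (time.time() + self.ttl,
                                  json.dumps(history))

    def get(self, session_id: str) -> list:
        entry = self._data.get(session_id)
        if entry is None or entry[0] < time.time():
            return []  # missing or expired session
        return json.loads(entry[1])
```

Because every replica reads and writes the same external store, any server can handle any turn of the conversation, which is exactly what stateless horizontal scaling requires.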
Conversation history grows with each turn. Long conversations cause rising token costs, increasing latency, and eventually context-window overflow.
Solution: Summarization. Periodically compress old messages into a summary, discarding originals. Trade memory for one model call.
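The compaction step can be sketched as below; `summarize` would be a model call in production, and the trivial truncating default here exists only so the sketch is self-contained:

```python
def compact_history(messages: list, keep_recent: int = 4,
                    summarize=None) -> list:
    """Compress old turns into one summary message, keeping the most
    recent turns verbatim."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if summarize is None:
        # Placeholder summarizer; production calls a cheap model here.
        summarize = lambda msgs: "Earlier conversation: " + "; ".join(
            m["content"][:40] for m in msgs)
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```

Run this whenever the history crosses a token budget: one extra model call per compaction buys a bounded context size for every subsequent turn.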
If a request is retried (due to network failure), it should produce the same result. Use idempotency keys: client provides a unique ID, server deduplicates.
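The deduplication side of idempotency keys fits in a few lines; the in-memory cache here is illustrative (production would use Redis with a TTL so replays work across replicas):

```python
class IdempotentHandler:
    """Deduplicate retried requests by a client-supplied key: the first
    call computes the result, replays return the stored result."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = {}  # key -> cached result

    def handle(self, idempotency_key: str, payload):
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]  # replay: no side effects
        result = self.handler(payload)
        self.seen[idempotency_key] = result
        return result
```

The client generates the key once (e.g. a UUID per logical request) and reuses it on every retry, so a network failure after the server responded can never double-charge or double-process.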
AI systems make mistakes. Building human oversight into the loop is not a limitation—it's essential for safety and quality.
Make human oversight easy and attractive: surface model confidence, route low-confidence outputs to a review queue, and make approving or correcting a result a one-click action.
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI, APITimeoutError

app = FastAPI()
client = AsyncOpenAI()

@app.get("/stream")  # EventSource can only issue GET requests
async def stream_completion(prompt: str):
    """Server-Sent Events endpoint for streaming LLM output."""
    async def generate():
        try:
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                stream=True,
                timeout=30.0,
            )
            async for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                if delta:
                    # SSE framing is "data: <payload>\n\n"; JSON-encode the
                    # payload in production so embedded newlines in the
                    # model output don't break the frame.
                    yield f"data: {delta}\n\n"
        except APITimeoutError:
            yield "data: [TIMEOUT]\n\n"
        except Exception as e:
            yield f"data: [ERROR: {e}]\n\n"
        finally:
            yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
            "Connection": "keep-alive",
        },
    )

# JavaScript client:
# const es = new EventSource('/stream?prompt=Hello');
# es.onmessage = (e) => { if (e.data === '[DONE]') es.close();
#                         else document.body.innerText += e.data; };
```
```python
import logging
import random
import time
from functools import wraps

from openai import OpenAI, RateLimitError, APIError

client = OpenAI()
PRIMARY = "gpt-4o"
FALLBACK = "gpt-4o-mini"

def with_retry(max_attempts=3, base_delay=1.0, use_fallback=True):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            last_err = None
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RateLimitError as e:
                    last_err = e
                    # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
                    wait = base_delay * (2 ** attempt) + random.random()
                    logging.warning(
                        f"Rate limit — retry in {wait:.1f}s (attempt {attempt + 1})")
                    time.sleep(wait)
                except APIError as e:
                    # Only status errors carry status_code; guard with getattr.
                    status = getattr(e, "status_code", None)
                    if status is not None and status >= 500:
                        last_err = e
                        time.sleep(base_delay * (attempt + 1))
                    else:
                        raise  # 4xx = client error, don't retry
            # All retries exhausted — try the cheaper fallback model.
            # Assumes `model` is passed by keyword (or left at its default).
            if use_fallback and kwargs.get("model", PRIMARY) == PRIMARY:
                logging.warning(f"Falling back to {FALLBACK}")
                kwargs["model"] = FALLBACK
                return fn(*args, **kwargs)
            raise last_err

        return wrapper
    return decorator

@with_retry(max_attempts=3, use_fallback=True)
def call_llm(prompt: str, model: str = PRIMARY) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```
Production engineering breaks into specialized domains: