The engineering layer between prototype and production: execution models, reliability patterns, traffic management, session state, and human oversight.
The journey from notebook to production is where AI systems fail most often. A working prototype is not a production system. Production requires: reliability under load, graceful failure handling, cost management, stateless design, observability, and human oversight.
| Dimension | Prototype | Production |
|---|---|---|
| Availability | Works most of the time | 99.9%+ SLA required |
| Failure modes | Crash = restart | Graceful degradation, fallbacks |
| Latency | Any latency acceptable | P99 latency budget (typically 200-2000ms) |
| Throughput | Handle current requests | Scale to 10x+ under load |
| Cost | Who cares? | Budget-aware scaling, cost per request |
| Observability | Print statements | Structured logging, metrics, traces, alerts |
| Testing | Manual spot checks | Automated testing, canary rollouts |
The biggest shock for ML engineers moving to production is discovering that model accuracy isn't the main problem. Reliability is.
Different use cases require different execution patterns. Each has tradeoffs in latency, throughput, and complexity.
User waits for response. Request → Model inference → Response. Simple but scales poorly. Suitable for: chatbots, real-time classification, low-latency requirements.
Request is queued, user is notified later. Request → Queue → Worker pool → Callback. Decouples producer and consumer. Good for: document processing, batch recommendations, report generation.
Continuous token-by-token output. Client receives partial results as they're generated. Essential for: LLM chat, any generative task. Dramatically improves perceived latency.
Process many items at once on schedule. Suitable for: periodic model scoring, feature generation, nightly reranking. Lowest cost per request.
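The async pattern above (request → queue → worker pool) can be sketched with the standard library alone; the worker names and job labels here are illustrative, and a real worker would run model inference instead of formatting a string:

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: list):
    # Pull jobs off the queue until cancelled; this is the consumer side.
    while True:
        job = await queue.get()
        # Stand-in for model inference or a callback to the client.
        results.append(f"{name} processed {job}")
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    results = []
    # Producer: enqueue requests without waiting for them to finish.
    for i in range(5):
        queue.put_nowait(f"doc-{i}")
    # Worker pool: two consumers drain the queue concurrently.
    workers = [asyncio.create_task(worker(f"w{n}", queue, results))
               for n in range(2)]
    await queue.join()  # block until every job is marked done
    for w in workers:
        w.cancel()
    return results

results = asyncio.run(main())
```

The queue decouples producer and consumer: the producer returns immediately, and throughput scales by adding workers.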
Production systems must handle transient failures (network hiccups, rate limits, temporary service outages). Implement retries with exponential backoff.
Production systems fail. Graceful degradation beats pretending failures won't happen.
Transient failures often recover within milliseconds. Retry with exponential backoff: 100ms, 200ms, 400ms, 800ms, etc. Prevents cascading failures when upstream services flake.
If an external service is failing, stop calling it and fail fast instead of burning up timeouts. Circuits have three states: closed (normal), open (failing, fast-fail), half-open (probe for recovery).
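A minimal sketch of the three-state breaker described above; thresholds and timeouts are illustrative, and a production version would also need thread safety and per-dependency instances:

```python
import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures; half-open after a
    cooldown, letting one probe through; a success closes it again."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one probe request through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # success closes the circuit
            return result
```

While open, callers fail in microseconds instead of burning a full timeout per request, which is what stops one failing dependency from cascading.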
Always have a plan when a service fails. For recommendations: return popular items. For translations: return original text. For moderation: default to safe (block). Don't just error out.
Isolate critical resources. If one model inference is slow, don't let it block other requests. Use separate thread pools, queues, or machines for different tasks.
Never wait forever. Set aggressive timeouts (100-2000ms). Better to fail fast than hang. Cascading timeouts are your enemy: if A→B→C each waits 30s, total is 90s.
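Bounding an await is a one-liner with `asyncio.wait_for`; the wrapper below pairs it with a fallback value so a timeout degrades instead of erroring (timeout values are illustrative):

```python
import asyncio

async def call_with_timeout(coro, timeout: float = 2.0, fallback=None):
    """Bound every await: return a fallback instead of hanging forever."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        return fallback

async def slow_model():
    await asyncio.sleep(10)  # simulates a hung upstream call
    return "done"

result = asyncio.run(
    call_with_timeout(slow_model(), timeout=0.05, fallback="[timed out]"))
```

Budget timeouts top-down: if the caller has 2s, each downstream hop must get strictly less, so the chain can never exceed the caller's budget.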
When overloaded, reject low-priority requests instead of degrading everything. Queue capacity is a feature, not a bug.
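A bounded admission queue makes the shedding decision explicit; this sketch rejects low-priority work when full and evicts to make room for high-priority work (the eviction policy is an illustrative choice):

```python
import queue

class LoadShedder:
    """Bounded admission queue: when full, reject low-priority requests
    immediately instead of letting every request degrade."""

    def __init__(self, capacity: int = 100):
        self.q = queue.Queue(maxsize=capacity)

    def admit(self, request, priority: str = "low") -> bool:
        try:
            self.q.put_nowait(request)
            return True
        except queue.Full:
            if priority == "high":
                # Evict one queued item to make room for priority work.
                try:
                    self.q.get_nowait()
                except queue.Empty:
                    pass
                self.q.put_nowait(request)
                return True
            return False  # shed: caller returns 429 or a fallback
```

Rejected requests should get an immediate 429 with a retry hint, which is far kinder to clients than a 30-second hang followed by a timeout.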
As traffic grows, costs grow. LLM inference is expensive. A single call to Claude costs $0.0015-$0.10 depending on model. With millions of requests, this adds up fast.
Track and optimize cost per request, token usage per call (prompt and completion), cache hit rate, and spend per model tier.
Not every request needs the best model. Route intelligently: send short, simple requests to a small, cheap model and reserve the flagship model for complex ones.
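A routing function can be as simple as a heuristic gate; the model names, length threshold, and `needs_reasoning` flag below are illustrative stand-ins for whatever classifier or rules fit your traffic:

```python
CHEAP = "gpt-4o-mini"      # illustrative model tiers
FLAGSHIP = "gpt-4o"

def pick_model(prompt: str, needs_reasoning: bool = False) -> str:
    """Send short, simple requests to the cheap tier; reserve the
    flagship model for long or reasoning-heavy requests."""
    if needs_reasoning or len(prompt) > 2000:
        return FLAGSHIP
    return CHEAP
```

Even a crude router like this can cut spend dramatically when most traffic is simple, since the cheap tier typically costs an order of magnitude less per token.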
Production systems must be stateless for horizontal scaling. Session data (conversation history, user preferences) must live outside the application server.
For stateful interactions (chatbots, multi-turn conversations), store session history in a database or cache, not in memory. Options include an in-memory store such as Redis (fast, with built-in TTL expiry) or a durable database such as PostgreSQL or DynamoDB.
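The session-store interface can be sketched independently of the backend; here a plain dict stands in for Redis, and the TTL and serialization mirror what a Redis `SETEX`-style store would give you:

```python
import json
import time

class SessionStore:
    """Session history kept outside the app server. Backed by a dict
    for illustration; production would swap in Redis or DynamoDB."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._data = {}  # session_id -> (expires_at, serialized history)

    def append(self, session_id: str, role: str, content: str):
        history = self.get(session_id)
        history.append({"role": role, "content": content})
        # Serialize on every write so any server replica can read it.
        self._data[session_id] = (time.time() + self.ttl,
                                  json.dumps(history))

    def get(self, session_id: str) -> list:
        entry = self._data.get(session_id)
        if entry is None or entry[0] < time.time():
            return []  # missing or expired session
        return json.loads(entry[1])
```

Because every replica reads and writes the same external store, any server can handle any turn of the conversation, which is exactly what stateless horizontal scaling requires.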
Conversation history grows with each turn. Long conversations cause rising token costs, increasing latency, and eventually context-window overflow.
Solution: Summarization. Periodically compress old messages into a summary, discarding originals. Trade memory for one model call.
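The compaction step can be sketched as below; `summarize` would be a model call in production, and the trivial truncating default here exists only so the sketch is self-contained:

```python
def compact_history(messages: list, keep_recent: int = 4,
                    summarize=None) -> list:
    """Compress old turns into one summary message, keeping the most
    recent turns verbatim."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if summarize is None:
        # Placeholder summarizer; production calls a cheap model here.
        summarize = lambda msgs: "Earlier conversation: " + "; ".join(
            m["content"][:40] for m in msgs)
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```

Run this whenever the history crosses a token budget: one extra model call per compaction buys a bounded context size for every subsequent turn.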
If a request is retried (due to network failure), it should produce the same result. Use idempotency keys: client provides a unique ID, server deduplicates.
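The deduplication side of idempotency keys fits in a few lines; the in-memory cache here is illustrative (production would use Redis with a TTL so replays work across replicas):

```python
class IdempotentHandler:
    """Deduplicate retried requests by a client-supplied key: the first
    call computes the result, replays return the stored result."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = {}  # key -> cached result

    def handle(self, idempotency_key: str, payload):
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]  # replay: no side effects
        result = self.handler(payload)
        self.seen[idempotency_key] = result
        return result
```

The client generates the key once (e.g. a UUID per logical request) and reuses it on every retry, so a network failure after the server responded can never double-charge or double-process.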
AI systems make mistakes. Building human oversight into the loop is not a limitation—it's essential for safety and quality.
Make human oversight easy and attractive: surface model confidence, route low-confidence outputs to a review queue, and make approving or correcting a result a one-click action.
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI, APITimeoutError

app = FastAPI()
client = AsyncOpenAI()

@app.get("/stream")  # EventSource can only issue GET requests
async def stream_completion(prompt: str):
    """Server-Sent Events endpoint for streaming LLM output."""
    async def generate():
        try:
            stream = await client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
                stream=True,
                timeout=30.0,
            )
            async for chunk in stream:
                delta = chunk.choices[0].delta.content or ""
                if delta:
                    # SSE framing is "data: <payload>\n\n"; JSON-encode the
                    # payload in production so embedded newlines in the
                    # model output don't break the frame.
                    yield f"data: {delta}\n\n"
        except APITimeoutError:
            yield "data: [TIMEOUT]\n\n"
        except Exception as e:
            yield f"data: [ERROR: {e}]\n\n"
        finally:
            yield "data: [DONE]\n\n"

    return StreamingResponse(
        generate(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # disable nginx buffering
            "Connection": "keep-alive",
        },
    )

# JavaScript client:
# const es = new EventSource('/stream?prompt=Hello');
# es.onmessage = (e) => { if (e.data === '[DONE]') es.close();
#                         else document.body.innerText += e.data; };
```
```python
import logging
import random
import time
from functools import wraps

from openai import OpenAI, RateLimitError, APIError

client = OpenAI()
PRIMARY = "gpt-4o"
FALLBACK = "gpt-4o-mini"

def with_retry(max_attempts=3, base_delay=1.0, use_fallback=True):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            last_err = None
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RateLimitError as e:
                    last_err = e
                    # Exponential backoff with jitter: ~1s, ~2s, ~4s, ...
                    wait = base_delay * (2 ** attempt) + random.random()
                    logging.warning(
                        f"Rate limit — retry in {wait:.1f}s (attempt {attempt + 1})")
                    time.sleep(wait)
                except APIError as e:
                    # Only status errors carry status_code; guard with getattr.
                    status = getattr(e, "status_code", None)
                    if status is not None and status >= 500:
                        last_err = e
                        time.sleep(base_delay * (attempt + 1))
                    else:
                        raise  # 4xx = client error, don't retry
            # All retries exhausted — try the cheaper fallback model.
            # Assumes `model` is passed by keyword (or left at its default).
            if use_fallback and kwargs.get("model", PRIMARY) == PRIMARY:
                logging.warning(f"Falling back to {FALLBACK}")
                kwargs["model"] = FALLBACK
                return fn(*args, **kwargs)
            raise last_err

        return wrapper
    return decorator

@with_retry(max_attempts=3, use_fallback=True)
def call_llm(prompt: str, model: str = PRIMARY) -> str:
    return client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
```
Production engineering breaks into specialized domains: