Production Patterns

GenAI Applications

Six canonical application patterns that repeatedly work in production. Learn which one fits your use case and how to avoid common pitfalls.

01 — Overview

Six Canonical Application Patterns

GenAI applications fall into a small number of repeatable patterns. Each has different complexity, cost, and latency characteristics. Know the six, match your use case to one, and adapt the proven architecture.

The Six Patterns

1. RAG Systems: User asks a question, system retrieves relevant documents, LLM answers from context. Use for Q&A over documents, knowledge bases, product docs. Latency: 500ms-2s. Complexity: Medium. Maturity: High.

2. Code Assistants: User writes code, AI provides completions, suggestions, or analysis. Use for IDE plugins, code review, auto-fix. Latency: 100ms-500ms. Complexity: Low-Medium. Maturity: High.

3. Structured Output Apps: User provides text/image, AI extracts structured data (JSON, tables, labels). Use for form extraction, data classification, entity recognition. Latency: 200ms-1s. Complexity: Low. Maturity: Very High.

4. Text-to-Data (Text-to-SQL): User asks a natural language question, system converts to query (SQL, API call), executes, returns results. Use for business intelligence, database exploration. Latency: 500ms-3s. Complexity: High. Maturity: Medium.

5. Voice Agents: User speaks, system transcribes, reasons, acts, then speaks back. Use for customer support, hands-free control, conversational interfaces. Latency: 1-5s. Complexity: Very High. Maturity: Medium.

6. Document Processing: User uploads document, system extracts text, chunks, analyzes, summarizes, or processes at scale. Use for invoice processing, contract review, knowledge extraction. Latency: 1-10s per document. Complexity: Medium-High. Maturity: High.

💡 Pattern selection: Match your use case to the pattern, not vice versa. If you're forcing a use case into the wrong pattern, the architecture will fight you. RAG for knowledge lookup, agents for multi-step tasks, structured output for extraction.
02 — Analysis

Pattern Comparison

Use this table to compare patterns across key dimensions.

| Pattern | Best For | Latency | Complexity | Cost/Request |
| --- | --- | --- | --- | --- |
| RAG Systems | Q&A, knowledge lookup | 500ms-2s | Medium | Low-Medium |
| Code Assistants | Completions, analysis | 100-500ms | Low | Low |
| Structured Output | Extraction, classification | 200ms-1s | Low | Very Low |
| Text-to-SQL | DB queries, analytics | 500ms-3s | High | Medium |
| Voice Agents | Conversational, hands-free | 1-5s | Very High | High |
| Document Processing | Batch analysis, extraction | 1-10s/doc | Medium | Medium |
Production rule: Structured output and code assistants are safest first products (low complexity, proven). RAG is next (higher complexity but well-understood). Text-to-SQL and voice agents need experienced teams. Document processing is good for internal tools.
03 — Pattern 1

RAG Systems

Retrieval-augmented generation is the most common production pattern. The user asks a question, the system retrieves relevant documents from a knowledge base or vector database, then feeds those documents to an LLM, which grounds its answer in the retrieved facts rather than inventing them.

When to Use

Customer support knowledge bases. Product documentation Q&A. Internal wiki search. Legal document search. Financial reports Q&A. Medical literature review. Any domain where up-to-date, accurate facts are critical and hallucination is unacceptable.

Key Decisions

Chunking: How to split documents (by paragraph, by token, by semantic boundary). Retrieval: Hybrid search (keyword + semantic), BM25, dense vectors, or traditional SQL. Ranking: Re-rank results by relevance. Prompting: How to format retrieved documents in the prompt.
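As one concrete starting point for the chunking decision, here is a minimal sketch of a fixed-size splitter with overlap. Word counts stand in for tokens; a real tokenizer (e.g. tiktoken) would give exact token budgets, and the 500/50 defaults are illustrative, not recommendations.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    Words approximate tokens here; swap in a real tokenizer
    for exact token budgets. Overlap preserves context that
    would otherwise be cut at chunk boundaries.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Paragraph and semantic splitters are the usual upgrades once a fixed-size baseline is measured.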

Common Pitfalls

Stale documents: If knowledge base isn't updated, answers are outdated. Retriever failure: Wrong documents retrieved = wrong answer. Context overload: Too many documents in prompt confuses the LLM. No evaluation: Can't measure retrieval quality without test set.

⚠️ RAG debugging: Build three things in parallel: the retriever (measure recall/precision), the generation prompt (measure answer quality via BLEU or exact match), and the evaluation set (measure end-to-end). Most RAG failures are retriever failures, not LLM failures.
Python · Production-ready RAG with ChromaDB and OpenAI
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.PersistentClient("./chroma_db")
collection = chroma.get_or_create_collection(
    "docs", metadata={"hnsw:space": "cosine"}
)

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return [r.embedding for r in resp.data]

def index_documents(docs: list[dict]):
    """Index documents. Each doc: {id, content, metadata}."""
    batch_size = 100
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i+batch_size]
        collection.add(
            ids=[d["id"] for d in batch],
            embeddings=embed([d["content"] for d in batch]),
            documents=[d["content"] for d in batch],
            metadatas=[d.get("metadata", {}) for d in batch]
        )

def rag_query(question: str, top_k: int = 5) -> str:
    # 1. Retrieve
    results = collection.query(
        query_embeddings=embed([question]),
        n_results=top_k
    )
    context = "\n\n---\n\n".join(results["documents"][0])

    # 2. Generate grounded answer
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content":
                "Answer using ONLY the provided context. "
                "If the answer isn't there, say 'I don't have that information.'"},
            {"role": "user", "content":
                f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    ).choices[0].message.content
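The "measure recall/precision" advice above can be made concrete with a tiny retrieval-eval harness. This is a sketch: `retrieve` stands in for any retriever (for example, the retrieval step of `rag_query`), and the test set of question/relevant-ID pairs is something you build by hand from real queries.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of relevant documents that appear in the retrieved set."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)

def evaluate_retriever(retrieve, test_set: list[dict], k: int = 5) -> float:
    """Average recall@k over a test set of
    {"question": str, "relevant_ids": set[str]} items.

    `retrieve(question, k)` must return a list of document IDs.
    """
    scores = [
        recall_at_k(retrieve(item["question"], k), item["relevant_ids"])
        for item in test_set
    ]
    return sum(scores) / len(scores)
```

Run this on every retriever change; a drop in recall@k is usually visible long before users complain about wrong answers.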
04 — Pattern 2

Code Assistants

AI that understands and generates code, used in IDEs (Copilot, Cursor), as code review tools, and for auto-fix. The LLM sees the current file and project context, then generates the next lines or explains existing code.

When to Use

Code completion in IDEs. Pull request review and suggestions. Refactoring and migration. Bug detection and fixes. Documentation generation from code. Test generation.

Key Decisions

Context window: How much file and project context to give the LLM (more = better understanding, slower). Model: code-specialized models (e.g., Codestral) vs. general frontier models (Claude, GPT-4). Interaction pattern: Real-time streaming vs. batch. Integration: IDE plugin, API, or web UI.

Common Pitfalls

Context pollution: Including too much boilerplate context confuses the model. Security: Sending proprietary code to external APIs. False suggestions: AI suggests syntactically correct but semantically wrong code. No feedback loop: Can't improve without measuring suggestion quality.

💡 Code context matters: Include method signatures, type hints, recent similar functions, and test examples. Avoid including large dependencies or build artifacts. Models with strong coding ability (e.g., Claude) are worth the cost.
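One way to act on this tip is to assemble the prompt from extracted signatures plus the code immediately before the cursor, rather than whole files. The sketch below is an assumption about how such a packer might look, not any specific tool's API; the character budget approximates a token budget.

```python
def build_completion_context(
    signatures: list[str],      # e.g. extracted from sibling files via ast
    current_file_tail: str,     # the code immediately before the cursor
    max_chars: int = 4000,      # crude stand-in for a token budget
) -> str:
    """Pack high-signal context into a bounded prompt.

    Signatures go first (cheap, dense information); the local code
    window is truncated from the front if the budget is exceeded.
    """
    header = "\n".join(signatures)
    budget = max_chars - len(header) - 2  # reserve 2 chars for separator
    local = current_file_tail[-budget:] if budget > 0 else ""
    return f"{header}\n\n{local}"
```

A production assistant would also deduplicate signatures and rank them by relevance to the cursor position; this sketch only shows the budgeting idea.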
05 — Pattern 3

Structured Output Apps

Extract structured data from text or images. User provides input, LLM outputs JSON, table, or labeled fields. Used for classification, extraction, entity recognition, and data enrichment.

When to Use

Form extraction from documents. Product classification and tagging. NER (named entity recognition). Sentiment analysis and emotion detection. Data augmentation and enrichment. Invoice/receipt processing. Resume parsing.

Key Decisions

Output schema: Define JSON schema or table structure. Validation: Constrain LLM output to schema (use Instructor, JSON mode, or regex). Fallback: What to do if LLM can't extract field. Batch vs real-time.

Common Pitfalls

Invalid JSON: LLM violates the schema despite instructions; use a tool like Instructor (or the SDK's structured-output mode) to enforce it. Missing fields: LLM skips optional fields or hallucinates values. No validation: Extracted data isn't checked for type or range. Bad schema: Schema doesn't match the actual data, causing mismatches.

Best practice: Use Instructor library (Python) or similar to enforce schema. Define schema in code, not prose. Test on diverse examples. Always validate output before using it downstream.
Python · Structured output extraction with Pydantic + OpenAI
from pydantic import BaseModel, Field
from typing import Optional
from openai import OpenAI

client = OpenAI()

class JobPosting(BaseModel):
    title: str
    company: str
    salary_min: Optional[int] = Field(None, description="Min salary USD")
    salary_max: Optional[int] = Field(None, description="Max salary USD")
    remote: bool
    required_skills: list[str]
    years_experience: Optional[int] = None

def extract_job(raw_text: str) -> JobPosting:
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content":
                "Extract structured data from the job posting. "
                "Be precise about salary ranges and skill requirements."},
            {"role": "user", "content": raw_text}
        ],
        response_format=JobPosting,
        temperature=0.0
    )
    return result.choices[0].message.parsed

# Example
raw = """
Senior ML Engineer at Acme Corp. Fully remote. 
Compensation: $160,000–$200,000. Requires 5+ years Python experience, 
strong PyTorch background, and experience with distributed training.
"""
job = extract_job(raw)
print(f"{job.title} @ {job.company}")
print(f"Salary: ${job.salary_min:,}–${job.salary_max:,} | Remote: {job.remote}")
print(f"Skills: {', '.join(job.required_skills)}")
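Schema enforcement guarantees shape, not plausibility, so the "always validate" advice above still applies after parsing. A sketch of downstream checks (the ranges are illustrative assumptions, not domain rules):

```python
def validate_job(job) -> list[str]:
    """Return a list of validation problems; empty means OK.

    `job` is any object with salary_min, salary_max, and
    years_experience attributes (e.g. the JobPosting model above).
    Range bounds here are illustrative; tune them to your data.
    """
    problems = []
    if (job.salary_min is not None and job.salary_max is not None
            and job.salary_min > job.salary_max):
        problems.append("salary_min exceeds salary_max")
    if job.salary_min is not None and not 10_000 <= job.salary_min <= 2_000_000:
        problems.append("salary_min outside plausible range")
    if job.years_experience is not None and not 0 <= job.years_experience <= 50:
        problems.append("implausible years_experience")
    return problems
```

Records that fail validation can be routed to a retry with a corrective prompt, or to human review, instead of flowing silently downstream.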
06 — Lessons

Common Production Pitfalls (All Patterns)

Regardless of pattern, these failures repeat across applications. Know them, avoid them.

Latency Problems

Issue: Application is slow; users see spinners for seconds. Root causes: LLM inference latency, retrieval latency (unoptimized vector search), extra network hops. Fix: Stream responses to users while inference runs, use smaller or faster models for latency-sensitive paths, cache common queries.
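Streaming is the highest-leverage fix because it cuts perceived latency to the time-to-first-token. A sketch against the OpenAI chat completions streaming interface; the client is passed in untyped so the function itself has no dependencies:

```python
def stream_answer(client, prompt: str, model: str = "gpt-4o") -> str:
    """Stream a chat completion, printing tokens as they arrive.

    `client` is an openai.OpenAI instance (not imported here so
    the sketch stays dependency-free); returns the full answer
    text once the stream is exhausted.
    """
    parts = []
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # some chunks carry no text (role headers, stop signals)
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)
```

In a web app the `print` becomes a server-sent event or WebSocket message per delta.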

Cost Explosions

Issue: API bills are higher than expected. Root causes: Excessive context (e.g., RAG stuffing far too many documents into each call), retry storms, large models on tasks a small model could handle. Fix: Monitor per-request token usage, cap retrieval results, route simple tasks to smaller models, cache responses.
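Per-request token monitoring is cheap because every chat completion response carries a `usage` field. A sketch of turning those counts into dollars; the prices here are assumptions for illustration, so verify against your provider's current price sheet.

```python
# Assumed USD prices per 1M tokens (input, output) -- verify before use.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of one request from its token counts."""
    in_price, out_price = PRICES[model]
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000

# Usage with a completion response:
# resp = client.chat.completions.create(...)
# cost = request_cost(resp.model, resp.usage.prompt_tokens,
#                     resp.usage.completion_tokens)
```

Logging this per request, tagged by feature, is usually enough to find the one endpoint responsible for most of the bill.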

Quality Drift

Issue: Quality was good in testing but degraded in production. Root causes: Different data distribution in production, model updates, upstream data changes (e.g., docs updated). Fix: Continuous evaluation on real data, version control prompts, monitor quality metrics, re-baseline after model updates.
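Continuous evaluation can start as simply as re-running a fixed test set on a schedule and alerting when the score drops below the stored baseline. In this sketch, `grade` is a stand-in for whatever metric fits your pattern (exact match for structured output, an LLM judge for RAG answers):

```python
def drift_check(
    grade,                          # fn(question, expected) -> score in [0, 1]
    test_set: list[tuple[str, str]],
    baseline: float,                # score recorded at last release
    tolerance: float = 0.05,        # allowed dip before alerting
) -> tuple[float, bool]:
    """Return (current average score, drifted?) against a baseline."""
    scores = [grade(question, expected) for question, expected in test_set]
    current = sum(scores) / len(scores)
    return current, current < baseline - tolerance
```

Re-record the baseline deliberately after intentional changes (new prompt, new model) so the alert only fires on unintended drift.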

Hallucinations & False Positives

Issue: LLM confidently gives wrong answers. Root cause: the model doesn't know the answer but invents one anyway. Fix: Use retrieval (RAG) to inject facts, constrain outputs with a schema, apply confidence thresholds, add human review for high-stakes decisions.

Integration Friction

Issue: Hard to integrate GenAI into existing systems. Root causes: API incompatibilities, latency mismatch, stateless vs stateful design, data privacy. Fix: Design for async/queued processing, plan data privacy early, version your API, use inference services with enterprise support.

⚠️ The production reality: Most GenAI problems aren't algorithmic — they're operational. Latency, cost, monitoring, error handling, and data freshness matter more than model size. Plan for ops early.
Python · Tool-calling agent loop (OpenAI function calling)
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [
    {"type": "function", "function": {
        "name": "search_docs",
        "description": "Search internal documentation",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]}
    }},
    {"type": "function", "function": {
        "name": "create_ticket",
        "description": "Create a support ticket",
        "parameters": {"type": "object",
                       "properties": {"title": {"type": "string"},
                                      "priority": {"type": "string",
                                                   "enum": ["low","medium","high"]}},
                       "required": ["title", "priority"]}
    }}
]

def dispatch(name: str, args: dict) -> str:
    if name == "search_docs":
        return f"[Docs for '{args['query']}': See https://docs.example.com]"
    if name == "create_ticket":
        return f"[Ticket created: #{hash(args['title']) % 9999} — {args['title']}]"
    return "Unknown tool"

def run_agent(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(10):   # max turns
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content  # final answer
        for call in msg.tool_calls:
            result = dispatch(call.function.name,
                              json.loads(call.function.arguments))
            messages.append({"role": "tool",
                              "tool_call_id": call.id, "content": result})
    return "Max turns reached"

print(run_agent("Search for RAG documentation and create a high-priority ticket to review it."))
07 — Explore

Deep Dives: Each Application Pattern

Each pattern deserves a detailed deep dive. Start with the pattern that matches your use case.

Application Patterns

1. RAG Systems: indexing, retrieval, re-ranking, and fact-grounding for question answering.

2. Code Assistants: code understanding and generation — completion, review, refactoring, and test generation.

3. Structured Output: extraction and classification — enforcing schemas, validation, and structured data generation.

4. Text-to-Data: natural language to SQL/API — converting questions to queries, execution, and result formatting.

5. Voice Agents: conversational AI — speech recognition, reasoning, action, and speech synthesis at scale.

6. Document Processing: batch and streaming analysis — chunking, extraction, summarization, and scale processing.

Implementation order: Start with structured output (lowest risk, fastest feedback), then RAG (highest ROI), then code assistants. Graduate to voice and text-to-SQL once you have ops experience.