Six canonical application patterns that repeatedly work in production. Learn which one fits your use case and how to avoid common pitfalls.
GenAI applications fall into a small number of repeatable patterns. Each has different complexity, cost, and latency characteristics. Know the six, match your use case to one, and adapt the proven architecture.
1. RAG Systems: User asks a question, system retrieves relevant documents, LLM answers from context. Use for Q&A over documents, knowledge bases, product docs. Latency: 500ms-2s. Complexity: Medium. Maturity: High.
2. Code Assistants: User writes code, AI provides completions, suggestions, or analysis. Use for IDE plugins, code review, auto-fix. Latency: 100ms-500ms. Complexity: Low-Medium. Maturity: High.
3. Structured Output Apps: User provides text/image, AI extracts structured data (JSON, tables, labels). Use for form extraction, data classification, entity recognition. Latency: 200ms-1s. Complexity: Low. Maturity: Very High.
4. Text-to-Data (Text-to-SQL): User asks a natural language question, system converts to query (SQL, API call), executes, returns results. Use for business intelligence, database exploration. Latency: 500ms-3s. Complexity: High. Maturity: Medium.
5. Voice Agents: User speaks, system transcribes, reasons, acts, then speaks back. Use for customer support, hands-free control, conversational interfaces. Latency: 1-5s. Complexity: Very High. Maturity: Medium.
6. Document Processing: User uploads document, system extracts text, chunks, analyzes, summarizes, or processes at scale. Use for invoice processing, contract review, knowledge extraction. Latency: 1-10s per document. Complexity: Medium-High. Maturity: High.
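Text-to-SQL is the only pattern of the six without a dedicated code sample later, so here is a minimal sketch. The prompt wording, the `gpt-4o` choice, and the `orders` table are illustrative assumptions; the read-only guard before execution is the part that matters.

```python
import sqlite3

def execute_readonly(sql: str, conn: sqlite3.Connection) -> list[tuple]:
    """Run a query only if it is a single read-only SELECT statement."""
    stripped = sql.strip().rstrip(";")
    if not stripped.lower().startswith("select") or ";" in stripped:
        raise ValueError(f"Refusing non-SELECT or multi-statement SQL: {sql!r}")
    return conn.execute(stripped).fetchall()

def generate_sql(question: str, schema: str) -> str:
    """Hypothetical LLM call: convert a natural-language question to SQL."""
    from openai import OpenAI  # deferred so the guard above works without the SDK
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content":
             f"Write one SQLite SELECT answering the question. Schema:\n{schema}\n"
             "Return only SQL, no prose."},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return resp.choices[0].message.content.strip()

# Demo of the guard with an in-memory database (no API call needed)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
rows = execute_readonly("SELECT COUNT(*) FROM orders", conn)
```

Never execute model-generated SQL with write permissions; a read-only connection or an allowlist of statements is the cheapest insurance this pattern has.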
Use this table to compare patterns across key dimensions.
| Pattern | Best For | Latency | Complexity | Cost/Request |
|---|---|---|---|---|
| RAG Systems | Q&A, knowledge lookup | 500ms-2s | Medium | Low-Medium |
| Code Assistants | Completions, analysis | 100-500ms | Low-Medium | Low |
| Structured Output | Extraction, classification | 200ms-1s | Low | Very Low |
| Text-to-SQL | DB queries, analytics | 500ms-3s | High | Medium |
| Voice Agents | Conversational, hands-free | 1-5s | Very High | High |
| Document Processing | Batch analysis, extraction | 1-10s/doc | Medium | Medium |
Retrieval-augmented generation is the most common production pattern. The user asks a question, the system retrieves relevant documents from a knowledge base or vector database, then feeds those documents to an LLM to answer. Grounding the answer in retrieved facts sharply reduces (but does not eliminate) hallucination.
Customer support knowledge bases. Product documentation Q&A. Internal wiki search. Legal document search. Financial reports Q&A. Medical literature review. Any domain where up-to-date, accurate facts are critical and hallucination is unacceptable.
Chunking: How to split documents (by paragraph, by token, by semantic boundary). Retrieval: Hybrid search (keyword + semantic), BM25, dense vectors, or traditional SQL. Ranking: Re-rank results by relevance. Prompting: How to format retrieved documents in the prompt.
Stale documents: If knowledge base isn't updated, answers are outdated. Retriever failure: Wrong documents retrieved = wrong answer. Context overload: Too many documents in prompt confuses the LLM. No evaluation: Can't measure retrieval quality without test set.
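The chunking decision above can be sketched as a greedy splitter that packs paragraphs into a word budget. The 300-word budget is an illustrative choice, not a recommendation; production systems usually count tokens and respect semantic boundaries.

```python
def chunk_by_paragraph(text: str, max_words: int = 300) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_words words.

    A paragraph longer than max_words becomes its own oversized chunk;
    a real system would split it further (by sentence or by token).
    """
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Five ~122-word paragraphs pack into chunks of at most 300 words each
doc = "\n\n".join(f"Paragraph {i}. " + "word " * 120 for i in range(5))
chunks = chunk_by_paragraph(doc, max_words=300)
```

Whatever splitter you choose, keep it deterministic and versioned: changing chunking silently invalidates every stored embedding.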
from openai import OpenAI
import chromadb
client = OpenAI()
chroma = chromadb.PersistentClient("./chroma_db")
collection = chroma.get_or_create_collection(
"docs", metadata={"hnsw:space": "cosine"}
)
def embed(texts: list[str]) -> list[list[float]]:
resp = client.embeddings.create(
model="text-embedding-3-small", input=texts
)
return [r.embedding for r in resp.data]
def index_documents(docs: list[dict]):
"""Index documents. Each doc: {id, content, metadata}."""
batch_size = 100
for i in range(0, len(docs), batch_size):
batch = docs[i:i+batch_size]
collection.add(
ids=[d["id"] for d in batch],
embeddings=embed([d["content"] for d in batch]),
documents=[d["content"] for d in batch],
metadatas=[d.get("metadata", {}) for d in batch]
)
def rag_query(question: str, top_k: int = 5) -> str:
# 1. Retrieve
results = collection.query(
query_embeddings=embed([question]),
n_results=top_k
)
    context = "\n---\n".join(results["documents"][0])
# 2. Generate grounded answer
return client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content":
"Answer using ONLY the provided context. "
"If the answer isn't there, say 'I don't have that information.'"},
{"role": "user", "content":
             f"Context:\n{context}\n\nQuestion: {question}"}
]
).choices[0].message.content
AI that understands and generates code. Used in IDEs (Copilot, Cursor), as code review tools, and for auto-fix. The LLM sees the current file, project context, and generates the next line(s) or explains code.
Code completion in IDEs. Pull request review and suggestions. Refactoring and migration. Bug detection and fixes. Documentation generation from code. Test generation.
Context window: How much file and project context to give the LLM (more = better understanding, slower). Model: specialized code models (e.g., Codestral) vs strong general models (Claude, GPT-4). Interaction pattern: Real-time streaming vs batch. Integration: IDE plugin, API, or web UI.
Context pollution: Including too much boilerplate context confuses the model. Security: Sending proprietary code to external APIs. False suggestions: AI suggests syntactically correct but semantically wrong code. No feedback loop: Can't improve without measuring suggestion quality.
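A key piece of the context-window decision above is trimming project context to a budget before calling the model, which also limits the context pollution just described. A minimal sketch; the character budget, model name, and prompt wording are illustrative assumptions.

```python
def build_context(files: dict[str, str], budget_chars: int = 8000) -> str:
    """Pack files into a prompt context, in the given order, until the budget runs out."""
    parts, used = [], 0
    for path, source in files.items():  # caller orders files by relevance
        snippet = f"# File: {path}\n{source}\n"
        if used + len(snippet) > budget_chars:
            break
        parts.append(snippet)
        used += len(snippet)
    return "".join(parts)

def suggest_completion(files: dict[str, str], cursor_code: str) -> str:
    """Hypothetical completion call: continue the code at the cursor."""
    from openai import OpenAI  # deferred so build_context works without the SDK
    return OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content":
             "Continue the user's code. Return only code, no explanation."},
            {"role": "user", "content":
             build_context(files) + "\n# Continue this:\n" + cursor_code},
        ],
        max_tokens=128,
    ).choices[0].message.content

context = build_context({"utils.py": "def add(a, b):\n    return a + b\n"}, 100)
```

Ranking files by relevance (open buffers first, then imports, then neighbors) matters more than the exact budget; a big budget full of boilerplate performs worse than a small, well-chosen one.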
Extract structured data from text or images. User provides input, LLM outputs JSON, table, or labeled fields. Used for classification, extraction, entity recognition, and data enrichment.
Form extraction from documents. Product classification and tagging. NER (named entity recognition). Sentiment analysis and emotion detection. Data augmentation and enrichment. Invoice/receipt processing. Resume parsing.
Output schema: Define JSON schema or table structure. Validation: Constrain LLM output to schema (use Instructor, JSON mode, or regex). Fallback: What to do if LLM can't extract field. Batch vs real-time.
Invalid JSON: LLM violates the schema despite instructions; use a tool like Instructor or the API's structured output mode to enforce it. Missing fields: LLM skips optional fields or hallucinates values. No validation: extracted data isn't checked for type or range. Bad schema: schema doesn't match the actual data, causing mismatches.
from pydantic import BaseModel, Field
from typing import Optional
from openai import OpenAI
client = OpenAI()
class JobPosting(BaseModel):
title: str
company: str
salary_min: Optional[int] = Field(None, description="Min salary USD")
salary_max: Optional[int] = Field(None, description="Max salary USD")
remote: bool
required_skills: list[str]
years_experience: Optional[int] = None
def extract_job(raw_text: str) -> JobPosting:
result = client.beta.chat.completions.parse(
model="gpt-4o",
messages=[
{"role": "system", "content":
"Extract structured data from the job posting. "
"Be precise about salary ranges and skill requirements."},
{"role": "user", "content": raw_text}
],
response_format=JobPosting,
temperature=0.0
)
return result.choices[0].message.parsed
# Example
raw = """
Senior ML Engineer at Acme Corp. Fully remote.
Compensation: $160,000–$200,000. Requires 5+ years Python experience,
strong PyTorch background, and experience with distributed training.
"""
job = extract_job(raw)
print(f"{job.title} @ {job.company}")
print(f"Salary: ${job.salary_min:,}–${job.salary_max:,} | Remote: {job.remote}")
print(f"Skills: {', '.join(job.required_skills)}")
Regardless of pattern, these failures repeat across applications. Know them, avoid them.
Issue: Application is slow, users see spinners for seconds. Root causes: LLM inference latency, slow retrieval (unoptimized vector search), extra network hops. Fix: Stream responses to users while inference runs, use smaller or faster models on hot paths, optimize vector search, cache common queries.
Issue: API bills are higher than expected. Root causes: Excessive context (RAG with 10K docs per call), many retries, large models on cheap-to-implement features. Fix: Monitor per-request token usage, cap retrieval results, use smaller models for simple tasks, cache responses.
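Per-request cost monitoring is a few lines once you read token counts off each response (the `usage` field in the OpenAI SDK). The prices below are placeholders, not current rates; substitute your provider's published pricing.

```python
# Placeholder $ per 1M tokens (input, output); substitute current rates.
PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Dollar cost of one request, computed from its token counts."""
    in_price, out_price = PRICES[model]
    return (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000

# After a call: request_cost(resp.model, resp.usage.prompt_tokens,
#                            resp.usage.completion_tokens)
cost = request_cost("gpt-4o", prompt_tokens=12_000, completion_tokens=500)
```

Log this per request with a tag for the calling feature; the surprise in most bills is one feature quietly sending huge contexts, and per-feature attribution finds it in minutes.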
Issue: Quality was good in testing but degraded in production. Root causes: Different data distribution in production, model updates, upstream data changes (e.g., docs updated). Fix: Continuous evaluation on real data, version control prompts, monitor quality metrics, re-baseline after model updates.
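Continuous evaluation can start as small as a fixed test set replayed after every prompt or model change. Exact match is a deliberately blunt metric here; swap in whatever scoring fits your task. The tiny test set and the dictionary stand-in for the model are demo assumptions.

```python
def run_eval(predict, test_set: list[dict]) -> float:
    """Fraction of test cases where predict(input) exactly matches expected."""
    hits = sum(1 for case in test_set
               if predict(case["input"]).strip() == case["expected"])
    return hits / len(test_set)

test_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Stand-in model for the demo; in production predict() wraps the real LLM call.
baseline = {"2+2": "4", "capital of France": "Paris"}
score = run_eval(lambda q: baseline.get(q, ""), test_set)
```

Run this in CI on every prompt edit and after every provider model update, and alert on score drops; "re-baseline after model updates" only works if you have a baseline number to compare against.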
Issue: LLM confidently gives wrong answers. Root cause: the LLM doesn't know the answer but invents one. Fix: Use retrieval (RAG) to inject facts, constrain outputs with a schema, abstain or escalate when confidence is low, add human review for high-stakes decisions.
Issue: Hard to integrate GenAI into existing systems. Root causes: API incompatibilities, latency mismatch, stateless vs stateful design, data privacy. Fix: Design for async/queued processing, plan data privacy early, version your API, use inference services with enterprise support.
A tool-calling agent loop ties several of these patterns together: the model requests a tool, the application executes it and appends the result, and the loop repeats until the model produces a final answer.
import json
from openai import OpenAI
client = OpenAI()
TOOLS = [
{"type": "function", "function": {
"name": "search_docs",
"description": "Search internal documentation",
"parameters": {"type": "object",
"properties": {"query": {"type": "string"}},
"required": ["query"]}
}},
{"type": "function", "function": {
"name": "create_ticket",
"description": "Create a support ticket",
"parameters": {"type": "object",
"properties": {"title": {"type": "string"},
"priority": {"type": "string",
"enum": ["low","medium","high"]}},
"required": ["title", "priority"]}
}}
]
def dispatch(name: str, args: dict) -> str:
if name == "search_docs":
return f"[Docs for '{args['query']}': See https://docs.example.com]"
if name == "create_ticket":
return f"[Ticket created: #{hash(args['title']) % 9999} — {args['title']}]"
return "Unknown tool"
def run_agent(task: str) -> str:
messages = [{"role": "user", "content": task}]
for _ in range(10): # max turns
resp = client.chat.completions.create(
model="gpt-4o", messages=messages, tools=TOOLS
)
msg = resp.choices[0].message
messages.append(msg)
if not msg.tool_calls:
return msg.content # final answer
for call in msg.tool_calls:
result = dispatch(call.function.name,
json.loads(call.function.arguments))
messages.append({"role": "tool",
"tool_call_id": call.id, "content": result})
return "Max turns reached"
print(run_agent("Search for RAG documentation and create a high-priority ticket to review it."))
Each pattern deserves a detailed deep dive. Start with the pattern that matches your use case.
Retrieval-augmented generation: indexing, retrieval, re-ranking, and fact-grounding for question answering.
Code understanding and generation: completion, review, refactoring, and test generation.
Extraction and classification: enforcing schemas, validation, and structured data generation.
Natural language to SQL/API: converting questions to queries, execution, and result formatting.
Conversational AI: speech recognition, reasoning, action, and speech synthesis at scale.
Batch and streaming analysis: chunking, extraction, summarization, and processing at scale.