Production document Q&A, enterprise search, and knowledge management pipelines that reduce hallucination and enable dynamic knowledge updates.
RAG (Retrieval-Augmented Generation) is a pattern where an LLM generates answers using external knowledge: retrieve relevant documents from a corpus → feed them to the LLM as context → the LLM generates grounded responses. This mitigates two critical problems: hallucination (LLMs making up facts) and knowledge cutoff (LLMs frozen at training time).
Unlike fine-tuning (expensive, permanent, requires retraining), RAG allows you to update knowledge by simply adding new documents to the corpus. The LLM sees fresh context at query time, making RAG ideal for domains where facts change frequently: legal documents, medical records, product catalogs, financial reports.
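The retrieve → context → generate loop can be sketched in a few lines. This is a toy illustration, not any framework's API: the word-overlap retriever stands in for semantic search, and the corpus, `retrieve`, and `build_prompt` names are invented for the example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for real tokenization."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    q = tokens(query)
    return sorted(corpus, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Feed retrieved documents to the LLM as grounding context."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The warranty period for model X is 24 months.",
    "Model X ships with a USB-C charger.",
    "Our office is closed on public holidays.",
]
question = "How long is the model X warranty?"
prompt = build_prompt(question, retrieve(question, corpus))
# `prompt` would now be sent to the LLM for grounded generation.
```

Because the answer comes from the retrieved context rather than model weights, updating knowledge means updating the corpus, not the model.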
| Use Case | Fit for RAG? | Reason |
|---|---|---|
| Document Q&A (PDFs, manuals) | Excellent | Answers grounded in docs |
| Enterprise search (Wiki, intranet) | Excellent | Dynamic content, access control |
| Customer support (FAQs) | Excellent | Consistent, traceable answers |
| Medical/legal domain | Excellent | High accuracy requirement, citations |
| General knowledge Q&A | Good | Helps with hallucination |
| Creative writing | Poor | Creativity shouldn't be grounded |
| Real-time analytics | Poor | Data changes faster than indexing |
A production RAG system has clear stages: offline indexing (happens once or in batch) and online serving (happens per-query). Separating these allows optimization: indexing can be slow/expensive, serving must be fast.
1. Document ingestion: load PDFs, HTML, databases, Slack dumps, etc.
2. Parsing: extract text, tables, and images; preserve structure.
3. Chunking: split into retrieval-sized pieces (typically 256–1024 tokens).
4. Enrichment: add metadata (source, date, author), summaries, keywords.
5. Embedding: convert chunks to dense vectors.
6. Storage: index in a vector DB and maintain a metadata store.
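A minimal sketch of the indexing stages composed into one function. The `embed()` stub and in-memory `store` stand in for a real embedding model and vector DB, and every name here is illustrative:

```python
import hashlib

def parse(raw: str) -> str:
    return raw.strip()  # real parsers handle PDF/HTML layout, tables, images

def chunk(text: str) -> list[str]:
    """Naive sentence split; see the chunking section for better strategies."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def embed(text: str) -> list[float]:
    """Deterministic fake vector so the sketch runs without a model."""
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def index_document(source: str, raw: str, store: list[dict]) -> None:
    for i, piece in enumerate(chunk(parse(raw))):
        store.append({
            "text": piece,
            "vector": embed(piece),
            "meta": {"source": source, "chunk_id": i},  # enrichment metadata
        })

store: list[dict] = []
index_document("manual.pdf", " Warranty lasts 24 months. Charger is USB-C. ", store)
```

Because this runs offline, each stage can be slow and thorough; the per-query path only ever reads the resulting store.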
1. Query embedding: convert the user query to a vector using the same embedding model used at index time.
2. Retrieval: semantic search in the vector DB → top-k candidates.
3. Optional re-ranking: a cross-encoder scores candidates by relevance → top-n.
4. Context assembly: grab full documents or passages.
5. Prompt construction: format context + query for the LLM.
6. Generation: the LLM reads the context and generates a grounded answer.
7. Citations: track which documents supported the answer.
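The serving path can be sketched end to end with a toy letter-frequency embedding standing in for a real model (all names here are illustrative; generation itself is left as the prompt that would be sent to the LLM):

```python
import math

def embed(text: str) -> list[float]:
    """Toy normalized letter-frequency vector; a real model goes here."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

index = [  # chunk text + source, as produced by offline indexing
    {"text": "Warranty: 24 months.", "source": "manual.pdf"},
    {"text": "Charger: USB-C, 65W.", "source": "manual.pdf"},
    {"text": "Holiday opening hours.", "source": "wiki"},
]
for item in index:
    item["vector"] = embed(item["text"])

def serve(query: str, k: int = 2) -> dict:
    q = embed(query)                                     # 1. query embedding
    top = sorted(index, key=lambda it: cosine(q, it["vector"]), reverse=True)[:k]
    context = "\n".join(it["text"] for it in top)        # 4. context assembly
    prompt = f"Context:\n{context}\n\nQuestion: {query}" # 5. prompt construction
    citations = [it["source"] for it in top]             # 7. track supporting docs
    return {"prompt": prompt, "citations": citations}

result = serve("How long is the warranty?")
```

Note that the query must be embedded with the same model used at index time; mixing models makes the similarity scores meaningless.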
Three mature frameworks dominate the RAG ecosystem. Each has different philosophy, maturity, and strengths. Choice depends on your team's background and deployment target.
| Framework | Philosophy | Best for | Learning curve |
|---|---|---|---|
| Haystack | Retrieval-first, pipeline-based | RAG specialists, production systems | Medium |
| LlamaIndex | Data indexing, query translation | Diverse data sources, structured queries | Low |
| LangChain | Chains + agents, wide integrations | Rapid prototyping, complex workflows | Medium |
Originally built at Deepset specifically for RAG. Strong document pipeline, excellent for multi-document workflows. Pipelines are explicit DAGs you can visualize. Great for teams committed to RAG as core feature. Good OSS community, commercial support available.
Focused on indexing diverse data sources: PDFs, databases, APIs, Notion. Excels at translating high-level queries into data-specific access patterns. Lightweight. Great for prototyping. Growing agent support. Simpler than Haystack for beginners.
Broadest tool ecosystem. Not RAG-specific — emphasizes chains (workflows), agents (tool use), memory. Better for applications that mix RAG with other LLM capabilities (code generation, arithmetic, APIs). Largest community. Heavy on abstractions, can be opinionated.
Chunking is arguably the single most important decision in RAG. Bad chunking breaks retrieval; good chunking makes everything else easy. Chunk size is a trade-off: smaller chunks match queries more precisely but carry less context, larger chunks preserve context but dilute the embedding and invite false positives, and more chunks mean slower, costlier embedding and search.
Split every N tokens, optionally with overlap. Simplest approach. Deterministic. Problem: breaks mid-sentence, splits concepts. Best for diverse corpora where structure isn't known.
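A minimal fixed-size chunker with overlap, splitting on whitespace "tokens" for illustration (real pipelines count tokens with the embedding model's tokenizer):

```python
def fixed_chunks(text: str, size: int = 5, overlap: int = 2) -> list[list[str]]:
    """Split into windows of `size` tokens, each repeating `overlap` tokens
    from the previous window so concepts cut at a boundary survive in one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    tokens = text.split()
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step) if tokens[i:i + size]]

chunks = fixed_chunks("one two three four five six seven eight", size=5, overlap=2)
# The last 2 tokens of each chunk reappear at the start of the next one.
```

The overlap is what softens the mid-sentence breakage this method otherwise suffers from.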
Split at sentence or paragraph boundaries, then merge until chunk reaches target size. Respects semantics. Better coherence. Slightly slower at index time. Recommended for structured documents: reports, articles, wikis.
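A sketch of the split-then-merge approach: break at sentence boundaries, then greedily merge sentences until a chunk reaches the target size (measured in words here; tokens in practice):

```python
import re

def semantic_chunks(text: str, target: int = 12) -> list[str]:
    """Merge whole sentences until adding the next one would exceed `target` words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        candidate = (current + " " + s).strip()
        if current and len(candidate.split()) > target:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = s
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = ("RAG retrieves documents. It feeds them to the model. "
        "The model answers. Citations are tracked.")
chunks = semantic_chunks(text, target=8)
```

No sentence is ever split mid-way, which is exactly the coherence benefit over fixed-size chunking.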
Index small chunks for precision, but fetch parent chunk for context. Example: 256-token chunks for retrieval, but show 1024-token parent to LLM. Combines precision and context. Requires document hierarchy awareness (sections, chapters).
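The small-to-big idea reduces to a parent pointer on each indexed child. A toy sketch with invented field names (`parent`, substring matching in place of vector search):

```python
# Parents hold the large context windows; children are what gets indexed.
parents = {
    "sec1": "Full 1024-token section text covering warranties and returns ...",
}
children = [
    {"text": "Warranty lasts 24 months.", "parent": "sec1"},
    {"text": "Returns accepted within 30 days.", "parent": "sec1"},
]

def retrieve_with_parent(query: str) -> str:
    """Match on the small child (toy substring match), return the parent for the LLM."""
    for child in children:
        if any(w in child["text"].lower() for w in query.lower().split()):
            return parents[child["parent"]]
    return ""

context = retrieve_with_parent("warranty length")
```

Retrieval precision comes from the 256-token child; the LLM still sees the 1024-token parent.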
| Document type | Recommended chunk size | Overlap | Method |
|---|---|---|---|
| PDFs, papers | 512 tokens | 64–128 tokens | Semantic |
| Web pages, blogs | 256–512 tokens | 32–64 tokens | Fixed or semantic |
| Code files | 256 tokens | 32 tokens | Language-aware splitting |
| Tables/structured | 128 tokens | 0 tokens | Table-aware parsing |
| Long documents | Small (256) primary, 1024 parent | 64 tokens | Hierarchical |
Your embedding model determines what "semantic similarity" means to your RAG system. Choice affects both quality and cost. General-purpose models (e.g., OpenAI, Cohere) are good defaults. Domain-specific models (legal, medical, code) can be better if you have the budget.
- High quality (proprietary): OpenAI text-embedding-3-large. Better generalization, higher cost.
- Balanced (open): BAAI/bge-large. Strong performance, free, can self-host.
- Fast (lightweight): e5-base-v2, all-MiniLM-L6-v2. Lower latency, smaller models, acceptable quality.
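Whichever model you pick, the invariant is that queries and documents use the same one, and similarity is then plain vector math. The model call below is sketched as a comment (so the example runs offline); the cosine computation is real:

```python
import math

# With an open model via sentence-transformers (not executed here):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# q_vec, d_vec = model.encode(["query text", "document text"])

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 2-d vectors standing in for model output:
sim = cosine([0.1, 0.9], [0.2, 0.8])
```

Many vector DBs normalize vectors on write, in which case cosine similarity collapses to a plain dot product.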
Vector DBs store high-dimensional embeddings and answer nearest-neighbor queries fast. Dozens exist; choice depends on scale, latency budget, and whether you need hybrid (dense+sparse) search.
| Database | Type | Scale | Hybrid search | Best for |
|---|---|---|---|---|
| Chroma | OSS/managed | Small-medium | No | Prototypes, demos |
| Weaviate | OSS/managed | Medium-large | Yes | GraphQL queries, hybrid |
| Pinecone | Managed | Very large | Yes (sparse-dense) | Production, serverless |
| Milvus | OSS | Medium-large | Yes (sparse, 2.4+) | Scalable OSS deployments |
| Qdrant | OSS/managed | Medium-large | Yes (sparse) | Hybrid, production-ready |
| Elasticsearch | OSS/managed | Large | Yes (native) | Hybrid BM25+dense |
- Hybrid search: if your corpus has domain-specific terminology (legal, medical, code), combine dense embeddings with BM25. Weaviate, Qdrant, and Elasticsearch all support this.
- Scale: small corpus (<100k docs) → Chroma is fine. Large corpus (millions) → Pinecone or self-hosted Qdrant/Milvus.
- Latency and ops: managed (Pinecone) is simpler operationally; self-hosted gives you control.
- Cost: Chroma/Milvus/Qdrant OSS are free; Pinecone charges per query.
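A common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF), which merges rankings without having to calibrate their incompatible score scales. A self-contained sketch (the document IDs and rankings are made up):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings: each doc scores sum(1 / (k + rank)) across lists.
    k=60 is the constant from the original RRF paper; it damps top-rank dominance."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d2"]   # exact-term matches first
dense_ranking = ["d1", "d2", "d3"]  # semantic matches first
fused = rrf([bm25_ranking, dense_ranking])
```

A document ranked well by both retrievers floats to the top even if neither ranked it first.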
Moving RAG from prototype to production means handling scale, reliability, and governance concerns that lab settings ignore. Enterprise RAG introduces access control, data freshness, multi-tenancy, and auditability.
In enterprise settings, different users see different documents, and a single shared vector DB doesn't enforce this on its own. Solutions: document-level filtering (tag chunks with user IDs, filter at retrieval time), per-user indexes (overkill for most cases), row-level security in the vector DB (Qdrant and Weaviate support metadata-based filtering). Recommended: tag chunks with accessible_users and filter before ranking.
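The filter-before-ranking pattern reduces to one list comprehension. A toy sketch (the `accessible_users` field name is illustrative; ranking is omitted):

```python
chunks = [
    {"text": "Q3 revenue figures", "accessible_users": {"alice"}},
    {"text": "Public product FAQ", "accessible_users": {"alice", "bob"}},
]

def retrieve_for_user(user: str, query: str) -> list[str]:
    """Apply the access filter BEFORE ranking, so restricted chunks can
    never leak into the candidate set, let alone the prompt."""
    visible = [c for c in chunks if user in c["accessible_users"]]
    # ... rank `visible` by vector similarity against `query` here ...
    return [c["text"] for c in visible]

bob_results = retrieve_for_user("bob", "quarterly numbers")
```

In a real vector DB this filter is expressed as a metadata condition pushed into the search call, not a post-hoc Python loop.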
How often do you re-index? Daily or weekly is typical; minute-level freshness requires streaming indexing (harder). Strategies: batch indexing (new docs → index offline → swap at midnight), incremental indexing (new docs → add to the existing index immediately), document versioning (keep old chunks, mark them as stale, weight newer chunks higher). For real-time data (stock prices, live scores), RAG alone isn't the right tool.
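Incremental indexing with versioning can be sketched as an upsert that marks older chunks of the same document stale rather than deleting them (the `stale`, `ts`, and `weight` names are invented for this example):

```python
import time

index: list[dict] = []

def upsert(doc_id: str, text: str) -> None:
    """Add the new chunk immediately; keep old chunks but mark them stale."""
    for chunk in index:
        if chunk["doc_id"] == doc_id:
            chunk["stale"] = True
    index.append({"doc_id": doc_id, "text": text,
                  "stale": False, "ts": time.time()})

def weighted(chunk: dict, base_score: float) -> float:
    """Down-weight stale chunks at retrieval time instead of hard-deleting them."""
    return base_score * (0.5 if chunk["stale"] else 1.0)

upsert("pricing", "Plan costs $10/month.")
upsert("pricing", "Plan costs $12/month.")  # supersedes the first chunk
```

Keeping stale chunks (at reduced weight) preserves an audit trail and lets you roll back a bad ingest.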
What was retrieved? What was generated? Did we hallucinate? Track: queries and results (audit trail), retrieval quality (MRR, NDCG metrics), answer quality (user feedback, LLM-based scoring), latency per stage (retrieve, rerank, generate). Log sources of answers for compliance.
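Of the retrieval metrics mentioned, MRR is the simplest to compute from logged results: for each query, take the reciprocal rank of the first relevant chunk, then average. A minimal sketch over hypothetical relevance judgments:

```python
def mrr(results: list[list[bool]]) -> float:
    """Mean reciprocal rank. results[i] flags the relevance of each
    retrieved chunk for query i, in rank order."""
    total = 0.0
    for flags in results:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break  # only the FIRST relevant hit counts
    return total / len(results)

# Three logged queries: hit at rank 1, hit at rank 2, no hit.
score = mrr([[True, False], [False, True], [False, False]])
```

Logged alongside per-stage latencies and answer sources, this gives a compliance-friendly audit trail and an early warning when retrieval quality drifts.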
| Aspect | Consideration | Implementation |
|---|---|---|
| Access Control | Different users, different documents | Metadata-based filtering + user context |
| Data Freshness | How old is the corpus? | Batch or incremental indexing cadence |
| Version Control | Track document updates | Timestamp, hash, version field in metadata |
| Monitoring | Quality metrics | RAGAS or similar evaluation framework |
| Latency | Per-stage timings | Instrument retrieval, ranking, generation |
| Fallback | What if retrieval fails? | Graceful degradation, fallback to LLM-only |