APPLICATIONS & SYSTEMS

RAG Systems

Production document Q&A, enterprise search, and knowledge management pipelines that reduce hallucination and enable dynamic knowledge updates.

retrieve → generate: the basic pattern
chunking strategy: quality foundation
access control: enterprise requirement
Contents
  1. What is RAG?
  2. Pipeline stages
  3. Framework comparison
  4. Chunking strategies
  5. Embedding models
  6. Vector database choice
  7. Enterprise considerations
01 — Foundation

What Is RAG?

RAG (Retrieval-Augmented Generation) is a pattern where an LLM generates answers using external knowledge: retrieve relevant documents from a corpus → feed them to the LLM as context → the LLM generates grounded responses. This solves two critical problems: hallucination (LLMs making up facts) and knowledge cutoff (LLMs frozen at training time).

Unlike fine-tuning (expensive, permanent, requires retraining), RAG allows you to update knowledge by simply adding new documents to the corpus. The LLM sees fresh context at query time, making RAG ideal for domains where facts change frequently: legal documents, medical records, product catalogs, financial reports.

When RAG Makes Sense

| Use case | Fit for RAG? | Reason |
| Document Q&A (PDFs, manuals) | Excellent | Answers grounded in docs |
| Enterprise search (wiki, intranet) | Excellent | Dynamic content, access control |
| Customer support (FAQs) | Excellent | Consistent, traceable answers |
| Medical/legal domain | Excellent | High accuracy requirement, citations |
| General knowledge Q&A | Good | Helps with hallucination |
| Creative writing | Poor | Creativity shouldn't be grounded |
| Real-time analytics | Poor | Data changes faster than indexing |
💡 Key insight: RAG is not a replacement for fine-tuning — it's orthogonal. You can combine both: fine-tune for style/format, use RAG for facts. For fact-heavy domains, RAG nearly always outperforms vanilla LLMs.
02 — Architecture

RAG Pipeline Stages

A production RAG system has clear stages: offline indexing (happens once or in batch) and online serving (happens per-query). Separating these allows optimization: indexing can be slow/expensive, serving must be fast.

Offline Indexing

  1. Document ingestion: Load PDFs, HTML, databases, Slack dumps, etc.
  2. Parsing: Extract text, tables, and images; preserve structure.
  3. Chunking: Split into retrieval-sized pieces (typically 256–1024 tokens).
  4. Enrichment: Add metadata (source, date, author), summaries, keywords.
  5. Embedding: Convert chunks to dense vectors.
  6. Storage: Index in the vector DB and maintain a metadata store.

Indexing pipeline pseudocode:

    docs = load_pdfs("knowledge_base/")
    chunks = []
    for doc in docs:
        parsed = parse_pdf(doc)  # text, tables, metadata
        doc_chunks = split_into_chunks(
            parsed.text,
            chunk_size=512,
            chunk_overlap=64,
            preserve_boundaries=True,
        )
        for chunk in doc_chunks:
            chunk.metadata = {
                "source": doc.filename,
                "page": parsed.current_page,
                "date": doc.created_date,
            }
        chunks.extend(doc_chunks)

    embeddings = embed_batch(chunks, model="BAAI/bge-large")
    vectorstore.index(chunks, embeddings)
    metadata_db.store(chunks)  # for re-ranking, filtering

Online Serving

  1. Query embedding: Convert the user query to a vector with the same embedding model used at index time.
  2. Retrieval: Semantic search in the vector DB returns the top-k candidates.
  3. Re-ranking (optional): A cross-encoder scores candidates by relevance and keeps the top-n.
  4. Context assembly: Fetch the full documents or passages for the surviving candidates.
  5. Prompt construction: Format context + query for the LLM.
  6. Generation: The LLM reads the context and generates a grounded answer.
  7. Citations: Track which documents supported the answer.
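The serving stages can be sketched end to end with a toy bag-of-words embedding standing in for a real embedding model; the corpus, helper names, and similarity function here are all illustrative, not from any framework:

```python
# Toy serving pass: embed query -> retrieve top-k -> assemble prompt.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a dense embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    {"id": "doc1", "text": "refunds are issued within 30 days of purchase"},
    {"id": "doc2", "text": "the API rate limit is 100 requests per minute"},
    {"id": "doc3", "text": "contact support to reset your password"},
]
index = [(d, embed(d["text"])) for d in corpus]            # offline indexing

query = "how long do refunds take"
q_vec = embed(query)                                       # 1. query embedding
ranked = sorted(index, key=lambda p: cosine(q_vec, p[1]), reverse=True)
top_k = [d for d, _ in ranked[:2]]                         # 2. retrieval
context = "\n".join(f"[{d['id']}] {d['text']}" for d in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # 6. generation would send this prompt to the LLM
```

The citation step falls out for free here: the `[doc1]` markers in the context let the LLM (and your audit log) point back at sources.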

Best practice: Index early and often. Small updates → incremental indexing. Large corpus changes → full re-index offline, swap indexes at serve time. Never block serving on re-indexing.
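A minimal sketch of the swap-at-serve-time pattern, assuming the index is any in-memory structure replaced behind a lock; the IndexHolder class and its fields are invented for illustration:

```python
# Queries always read through one reference; a rebuilt index replaces it
# atomically, so serving never blocks on re-indexing.
import threading

class IndexHolder:
    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def get(self):
        with self._lock:   # readers always see a complete, consistent index
            return self._index

    def swap(self, new_index):
        with self._lock:   # the build happened offline; the swap is instant
            self._index = new_index

holder = IndexHolder({"version": 1, "docs": ["old corpus"]})
new_index = {"version": 2, "docs": ["old corpus", "new doc"]}  # built in background
holder.swap(new_index)
print(holder.get()["version"])
```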
03 — Framework Ecosystem

Framework Comparison: Haystack vs LlamaIndex vs LangChain

Three mature frameworks dominate the RAG ecosystem. Each has different philosophy, maturity, and strengths. Choice depends on your team's background and deployment target.

| Framework | Philosophy | Best for | Learning curve |
| Haystack | Retrieval-first, pipeline-based | RAG specialists, production systems | Medium |
| LlamaIndex | Data indexing, query translation | Diverse data sources, structured queries | Low |
| LangChain | Chains + agents, wide integrations | Rapid prototyping, complex workflows | Medium |

Haystack (Production-Grade RAG)

Originally built at Deepset specifically for RAG. Strong document pipeline, excellent for multi-document workflows. Pipelines are explicit DAGs you can visualize. Great for teams committed to RAG as core feature. Good OSS community, commercial support available.

LlamaIndex (Query Translation)

Focused on indexing diverse data sources: PDFs, databases, APIs, Notion. Excels at translating high-level queries into data-specific access patterns. Lightweight. Great for prototyping. Growing agent support. Simpler than Haystack for beginners.

LangChain (General Purpose)

Broadest tool ecosystem. Not RAG-specific — emphasizes chains (workflows), agents (tool use), memory. Better for applications that mix RAG with other LLM capabilities (code generation, arithmetic, APIs). Largest community. Heavy on abstractions, can be opinionated.

⚠️ Framework lock-in risk: Choose frameworks early — switching later is costly. None are significantly better; pick based on team familiarity and your primary use case. RAG patterns are framework-agnostic; the code won't be.
04 — Retrieval Precision

Chunking Strategies: The Foundation

Chunking is arguably the single most important decision in RAG. Bad chunking breaks retrieval; good chunking makes everything else easy. Chunk size trades off precision (smaller chunks match queries more tightly), context (larger chunks give the LLM more surrounding information), and latency (more chunks mean more vectors to embed and search).

Fixed-Size Chunking

Split every N tokens, optionally with overlap. Simplest approach. Deterministic. Problem: breaks mid-sentence, splits concepts. Best for diverse corpora where structure isn't known.
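A minimal fixed-size chunker with overlap, splitting on whitespace tokens as a stand-in for a real tokenizer; the function name and sizes are illustrative:

```python
# Slide a chunk_size window over the token stream, stepping by
# chunk_size - overlap so adjacent chunks share `overlap` tokens.
def fixed_size_chunks(text, chunk_size=512, overlap=64):
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = fixed_size_chunks("word " * 1000, chunk_size=100, overlap=20)
print(len(chunks), len(chunks[0].split()))
```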

Semantic Chunking

Split at sentence or paragraph boundaries, then merge until chunk reaches target size. Respects semantics. Better coherence. Slightly slower at index time. Recommended for structured documents: reports, articles, wikis.
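A sketch of the split-then-merge approach, using a naive regex sentence splitter (a production pipeline would use a proper sentence segmenter):

```python
# Split on sentence boundaries, then greedily merge sentences until a
# chunk reaches the target token count; chunks never break mid-sentence.
import re

def semantic_chunks(text, target_tokens=512):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First point. " * 30 + "Second point! " * 30
chunks = semantic_chunks(doc, target_tokens=20)
print(len(chunks))
```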

Hierarchical (Parent-Child)

Index small chunks for precision, but fetch parent chunk for context. Example: 256-token chunks for retrieval, but show 1024-token parent to LLM. Combines precision and context. Requires document hierarchy awareness (sections, chapters).
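A toy parent-child lookup, with keyword overlap standing in for embedding similarity; the data structures and texts are illustrative:

```python
# Retrieve over small child chunks for precision, but hand the LLM the
# larger parent section for context.
parents = {
    "sec1": "Section 1 full text ... (the larger parent chunk)",
    "sec2": "Section 2 full text ... (the larger parent chunk)",
}
children = [
    {"text": "refund window is 30 days", "parent": "sec1"},
    {"text": "rate limit is 100 rpm", "parent": "sec2"},
]

def retrieve_with_parent(query):
    # Stand-in scorer: keyword overlap instead of embedding similarity.
    def score(child):
        return len(set(query.lower().split()) & set(child["text"].split()))
    best = max(children, key=score)   # precise match on the small chunk
    return parents[best["parent"]]    # broad context for the LLM

context = retrieve_with_parent("what is the refund window")
print(context)
```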

Chunk Size Guidelines

| Document type | Recommended chunk size | Overlap | Method |
| PDFs, papers | 512 tokens | 64–128 tokens | Semantic |
| Web pages, blogs | 256–512 tokens | 32–64 tokens | Fixed or semantic |
| Code files | 256 tokens | 32 tokens | Language-aware splitting |
| Tables/structured | 128 tokens | 0 tokens | Table-aware parsing |
| Long documents | Small (256) primary, 1024 parent | 64 tokens | Hierarchical |
💡 Pro tip: Start with semantic chunking at 512 tokens with 64 token overlap. Benchmark against your golden dataset. Tune down if precision suffers, up if recall drops. Different document types may need different settings.
05 — Vector Representation

Embedding Model Selection

Your embedding model determines what "semantic similarity" means to your RAG system. The choice affects both quality and cost. General-purpose models (OpenAI, Cohere) are good defaults. Domain-specific models (legal, medical, code) can be better if you have the budget.

Popular Embedding Models

| Type | Model | Notes |
| Dense (API) | OpenAI text-embedding-3 | High quality, 3072 dimensions, fast. $0.02 per 1M tokens. Proprietary. |
| Dense (OSS) | BAAI/bge-large-en-v1.5 | Strong general-purpose, 1024 dims. Free, can run on-device. |
| Dense (OSS) | e5-base-v2 | Lightweight, good for latency-sensitive apps. 768 dims. |
| Dense (API) | Cohere Embed v3 | Multilingual, 4096 dims, can compress to 128. API-based. |
| Sparse | BM25 (Elasticsearch) | Exact keyword matching. Fast, interpretable. Often used hybrid with dense. |
| Hybrid | ColBERT | Token-level dense embeddings. Better precision than pure dense. |

Example: a minimal RAG pipeline with LlamaIndex and OpenAI embeddings:

    # Minimal RAG pipeline with LlamaIndex
    # pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    # Configure models
    Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50

    # Index your documents
    documents = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # Query
    query_engine = index.as_query_engine(similarity_top_k=4)
    response = query_engine.query("What are the key SLA requirements?")
    print(response)

    # Persist index to disk (avoids re-embedding on restart)
    index.storage_context.persist("./storage")

Embedding Quality vs Cost

High quality (proprietary): OpenAI text-embedding-3-large. Better generalization, higher cost.
Balanced (open): BAAI/bge-large. Strong performance, free, can self-host.
Fast (lightweight): e5-base-v2, all-MiniLM-L6. Lower latency, smaller models, acceptable quality.

Recommendation: For production, test 2–3 models against your golden dataset. OpenAI's model is a reliable baseline if budget allows; BAAI/bge-large is an excellent OSS default. For proprietary data, consider fine-tuning embeddings on domain examples.
06 — Indexing & Retrieval

Vector Database Choice

Vector DBs store high-dimensional embeddings and answer nearest-neighbor queries fast. Dozens exist; choice depends on scale, latency budget, and whether you need hybrid (dense+sparse) search.

Vector Database Comparison

| Database | Type | Scale | Hybrid search | Best for |
| Chroma | OSS/managed | Small–medium | No | Prototypes, demos |
| Weaviate | OSS/managed | Medium–large | Yes | GraphQL queries, hybrid |
| Pinecone | Managed | Very large | No (dense only) | Production, serverless |
| Milvus | OSS | Medium–large | No | Scalable OSS deployments |
| Qdrant | OSS/managed | Medium–large | Yes (sparse) | Hybrid, production-ready |
| Elasticsearch | OSS/managed | Large | Yes (native) | Hybrid BM25+dense |

Key Considerations

Hybrid search: If your corpus has domain-specific terminology (legal, medical, code), combine dense embeddings with BM25. Weaviate, Qdrant, and Elasticsearch all support this.
Scale: Small corpus (<100k docs): Chroma is fine. Large corpus (millions): Pinecone or self-hosted Qdrant/Milvus.
Operations: Managed (Pinecone) means simpler ops; self-hosted gives you control.
Cost: Chroma, Milvus, and Qdrant OSS are free to run; Pinecone charges per query.
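Reciprocal rank fusion (RRF) is the usual way to merge dense and BM25 result lists without having to calibrate their incompatible score scales; a minimal sketch with made-up document IDs:

```python
# RRF: each list contributes 1/(k + rank) per document; documents ranked
# well by both retrievers rise to the top. k=60 is the conventional default.
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d3", "d1", "d7"]   # ranked by embedding similarity
bm25_hits = ["d1", "d9", "d3"]    # ranked by keyword match
fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)
```

Note that d1 wins despite topping only one list: appearing near the top of both beats a single first place.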

⚠️ Vector DB lock-in: Stored vectors are tied to the embedding model that produced them, so switching either the model or the DB means re-embedding and re-indexing the corpus. Start with Chroma for prototyping; plan your migration path to a production DB early.
    # Hybrid retrieval: dense + BM25 with reranking
    # pip install llama-index llama-index-retrievers-bm25 sentence-transformers
    from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
    from llama_index.core.query_engine import RetrieverQueryEngine
    from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
    from llama_index.retrievers.bm25 import BM25Retriever

    # Load persisted index
    storage_ctx = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_ctx)

    # Build hybrid retriever (dense + sparse)
    vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=6)
    bm25_retriever = BM25Retriever.from_defaults(index=index, similarity_top_k=6)
    hybrid_retriever = QueryFusionRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        similarity_top_k=4,
        num_queries=1,  # don't generate sub-queries
        mode="reciprocal_rerank",
    )

    query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)
    response = query_engine.query("Summarise the data retention policy")
    print(response)

    # Source nodes: response.source_nodes
    for node in response.source_nodes:
        print(f"  score={node.score:.3f} | {node.text[:80]}...")
07 — Production Maturity

Enterprise RAG Considerations

Moving RAG from prototype to production means handling scale, reliability, and governance concerns that lab settings never surface. Enterprise RAG introduces access control, data freshness, multi-tenancy, and auditability.

Access Control & Multi-Tenancy

In the enterprise, different users see different documents, and a single shared vector DB can't enforce this naively. Solutions:
Document-level filtering: tag chunks with user or group IDs and filter at retrieval time.
Per-user indexes: strong isolation, but overkill for most cases.
Row-level security in the vector DB: Qdrant and Weaviate support metadata-based filtering.
Recommended: tag chunks with an accessible_users field and filter before ranking.
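A sketch of document-level filtering before ranking, assuming each chunk carries an accessible_users set; the field names, group names, and "*" public tag are all illustrative conventions, not from any vector DB:

```python
# Filter chunks by group membership before any ranking happens, so
# restricted content never enters the candidate pool.
chunks = [
    {"text": "public pricing page", "accessible_users": {"*"}},
    {"text": "internal salary bands", "accessible_users": {"hr-team"}},
    {"text": "eng runbook", "accessible_users": {"eng-team", "sre"}},
]

def visible_chunks(user_groups):
    groups = set(user_groups) | {"*"}   # everyone matches the public tag
    return [c for c in chunks if c["accessible_users"] & groups]

results = visible_chunks({"eng-team"})
print([c["text"] for c in results])
```

In a real deployment the same predicate becomes a metadata filter pushed down into the vector DB query, not a post-hoc Python loop.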

Data Freshness

How often do you re-index? Daily or weekly is typical; minute-level freshness requires streaming indexing, which is harder. Strategies:
Batch indexing: new docs → index offline → swap at midnight.
Incremental indexing: new docs → add to the existing index immediately.
Document versioning: keep old chunks, mark them as stale, weight newer versions higher.
For real-time data (stock prices, live scores), RAG isn't the right tool alone.
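An incremental-indexing sketch that hashes document content so unchanged docs are skipped on each pass; the store names and corpus are made up:

```python
# Track a content hash per document; only documents whose hash changed
# (or which are new) need re-chunking and re-embedding.
import hashlib

indexed_hashes = {}   # doc_id -> content hash, persisted alongside the index

def docs_needing_reindex(docs):
    stale = []
    for doc_id, content in docs.items():
        h = hashlib.sha256(content.encode()).hexdigest()
        if indexed_hashes.get(doc_id) != h:
            stale.append(doc_id)
            indexed_hashes[doc_id] = h
    return stale

corpus = {"policy.md": "v1 text", "faq.md": "faq text"}
print(docs_needing_reindex(corpus))   # first run: everything is new
corpus["policy.md"] = "v2 text"
print(docs_needing_reindex(corpus))   # second run: only the changed doc
```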

Observability & Audit

What was retrieved? What was generated? Did the system hallucinate? Track:
Queries and results: the audit trail.
Retrieval quality: MRR, NDCG metrics.
Answer quality: user feedback, LLM-based scoring.
Latency per stage: retrieve, rerank, generate.
Log the sources behind each answer for compliance.
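MRR, one of the retrieval metrics above, is simple to compute against a golden dataset: for each query, take the reciprocal of the rank at which the first relevant document appears. The queries and relevance judgments below are made up:

```python
# Mean reciprocal rank over a set of queries: 1/rank of the first
# relevant hit, averaged across queries (0 if nothing relevant appears).
def mean_reciprocal_rank(results, relevant):
    total = 0.0
    for query, retrieved in results.items():
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant[query]:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(results)

results = {"q1": ["d2", "d5", "d1"], "q2": ["d9", "d3"]}
relevant = {"q1": {"d5"}, "q2": {"d3"}}
print(mean_reciprocal_rank(results, relevant))  # (1/2 + 1/2) / 2 = 0.5
```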

Best Practices Checklist

| Aspect | Consideration | Implementation |
| Access control | Different users, different documents | Metadata-based filtering + user context |
| Data freshness | How old is the corpus? | Batch or incremental indexing cadence |
| Version control | Track document updates | Timestamp, hash, version field in metadata |
| Monitoring | Quality metrics | RAGAS or similar evaluation framework |
| Latency | Per-stage timings | Instrument retrieval, ranking, generation |
| Fallback | What if retrieval fails? | Graceful degradation, fallback to LLM-only |
⚠️ Enterprise gotcha: Access control is often an afterthought. Design it in from the start. Document-level filtering is simpler than per-index solutions and scales better.
08 — Further Reading

References and Related Concepts
