APPLICATIONS & SYSTEMS

RAG Systems

Production document Q&A, enterprise search, and knowledge management pipelines that reduce hallucination and enable dynamic knowledge updates.

retrieve → generate: the basic pattern
chunking strategy: quality foundation
access control: enterprise requirement
Contents
  1. What is RAG?
  2. Pipeline stages
  3. Framework comparison
  4. Chunking strategies
  5. Embedding models
  6. Vector database choice
  7. Enterprise considerations
01 — Foundation

What Is RAG?

RAG (Retrieval-Augmented Generation) is a pattern where an LLM generates answers using external knowledge: retrieve relevant documents from a corpus → feed them to the LLM as context → the LLM generates grounded responses. This solves two critical problems: hallucination (LLMs making up facts) and knowledge cutoff (LLMs frozen at training time).

Unlike fine-tuning (expensive, permanent, requires retraining), RAG allows you to update knowledge by simply adding new documents to the corpus. The LLM sees fresh context at query time, making RAG ideal for domains where facts change frequently: legal documents, medical records, product catalogs, financial reports.

When RAG Makes Sense

| Use case | Fit for RAG? | Reason |
| Document Q&A (PDFs, manuals) | Excellent | Answers grounded in docs |
| Enterprise search (wiki, intranet) | Excellent | Dynamic content, access control |
| Customer support (FAQs) | Excellent | Consistent, traceable answers |
| Medical/legal domain | Excellent | High accuracy requirement, citations |
| General knowledge Q&A | Good | Helps with hallucination |
| Creative writing | Poor | Creativity shouldn't be grounded |
| Real-time analytics | Poor | Data changes faster than indexing |
💡 Key insight: RAG is not a replacement for fine-tuning — it's orthogonal. You can combine both: fine-tune for style/format, use RAG for facts. For fact-heavy domains, RAG nearly always outperforms vanilla LLMs.
02 — Architecture

RAG Pipeline Stages

A production RAG system has clear stages: offline indexing (happens once or in batch) and online serving (happens per-query). Separating these allows optimization: indexing can be slow/expensive, serving must be fast.

Offline Indexing

  1. Document ingestion: Load PDFs, HTML, databases, Slack dumps, etc.
  2. Parsing: Extract text, tables, and images; preserve structure.
  3. Chunking: Split into retrieval-sized pieces (typically 256–1024 tokens).
  4. Enrichment: Add metadata (source, date, author), summaries, keywords.
  5. Embedding: Convert chunks to dense vectors.
  6. Storage: Index in the vector DB and maintain a metadata store.

Indexing pipeline pseudocode:

    docs = load_pdfs("knowledge_base/")
    chunks = []
    for doc in docs:
        parsed = parse_pdf(doc)  # text, tables, metadata
        doc_chunks = split_into_chunks(
            parsed.text,
            chunk_size=512,
            chunk_overlap=64,
            preserve_boundaries=True,
        )
        for chunk in doc_chunks:
            chunk.metadata = {
                "source": doc.filename,
                "page": parsed.current_page,
                "date": doc.created_date,
            }
        chunks.extend(doc_chunks)

    embeddings = embed_batch(chunks, model="BAAI/bge-large")
    vectorstore.index(chunks, embeddings)
    metadata_db.store(chunks)  # for re-ranking, filtering

Online Serving

  1. Query embedding: Convert the user query to a vector with the same embedding model used at index time.
  2. Retrieval: Semantic search in the vector DB returns the top-k candidates.
  3. Re-ranking (optional): A cross-encoder scores candidates by relevance and keeps the top-n.
  4. Context assembly: Fetch the full documents or passages for the surviving candidates.
  5. Prompt construction: Format context + query for the LLM.
  6. Generation: The LLM reads the context and generates a grounded answer.
  7. Citations: Track which documents supported the answer.
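The serving stages can be sketched end to end with a toy bag-of-words embedding standing in for a real embedding model; the corpus, helper names, and similarity function here are all illustrative, not from any framework:

```python
# Toy serving pass: embed query -> retrieve top-k -> assemble prompt.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a dense embedding model: bag-of-words term counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    {"id": "doc1", "text": "refunds are issued within 30 days of purchase"},
    {"id": "doc2", "text": "the API rate limit is 100 requests per minute"},
    {"id": "doc3", "text": "contact support to reset your password"},
]
index = [(d, embed(d["text"])) for d in corpus]            # offline indexing

query = "how long do refunds take"
q_vec = embed(query)                                       # 1. query embedding
ranked = sorted(index, key=lambda p: cosine(q_vec, p[1]), reverse=True)
top_k = [d for d, _ in ranked[:2]]                         # 2. retrieval
context = "\n".join(f"[{d['id']}] {d['text']}" for d in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # 6. generation would send this prompt to the LLM
```

The citation step falls out for free here: the `[doc1]` markers in the context let the LLM (and your audit log) point back at sources.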

Best practice: Index early and often. Small updates → incremental indexing. Large corpus changes → full re-index offline, swap indexes at serve time. Never block serving on re-indexing.
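A minimal sketch of the swap-at-serve-time pattern, assuming the index is any in-memory structure replaced behind a lock; the IndexHolder class and its fields are invented for illustration:

```python
# Queries always read through one reference; a rebuilt index replaces it
# atomically, so serving never blocks on re-indexing.
import threading

class IndexHolder:
    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def get(self):
        with self._lock:   # readers always see a complete, consistent index
            return self._index

    def swap(self, new_index):
        with self._lock:   # the build happened offline; the swap is instant
            self._index = new_index

holder = IndexHolder({"version": 1, "docs": ["old corpus"]})
new_index = {"version": 2, "docs": ["old corpus", "new doc"]}  # built in background
holder.swap(new_index)
print(holder.get()["version"])
```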
03 — Framework Ecosystem

Framework Comparison: Haystack vs LlamaIndex vs LangChain

Three mature frameworks dominate the RAG ecosystem. Each has different philosophy, maturity, and strengths. Choice depends on your team's background and deployment target.

| Framework | Philosophy | Best for | Learning curve |
| Haystack | Retrieval-first, pipeline-based | RAG specialists, production systems | Medium |
| LlamaIndex | Data indexing, query translation | Diverse data sources, structured queries | Low |
| LangChain | Chains + agents, wide integrations | Rapid prototyping, complex workflows | Medium |

Haystack (Production-Grade RAG)

Originally built at Deepset specifically for RAG. Strong document pipeline, excellent for multi-document workflows. Pipelines are explicit DAGs you can visualize. Great for teams committed to RAG as core feature. Good OSS community, commercial support available.

LlamaIndex (Query Translation)

Focused on indexing diverse data sources: PDFs, databases, APIs, Notion. Excels at translating high-level queries into data-specific access patterns. Lightweight. Great for prototyping. Growing agent support. Simpler than Haystack for beginners.

LangChain (General Purpose)

Broadest tool ecosystem. Not RAG-specific — emphasizes chains (workflows), agents (tool use), memory. Better for applications that mix RAG with other LLM capabilities (code generation, arithmetic, APIs). Largest community. Heavy on abstractions, can be opinionated.

⚠️ Framework lock-in risk: Choose frameworks early — switching later is costly. None are significantly better; pick based on team familiarity and your primary use case. RAG patterns are framework-agnostic; the code won't be.
04 — Retrieval Precision

Chunking Strategies: The Foundation

Chunking is arguably the single most important decision in RAG. Bad chunking breaks retrieval; good chunking makes everything else easy. Chunk size trades off precision (smaller chunks match queries more tightly), context (larger chunks give the LLM more surrounding information), and latency (more chunks mean more vectors to embed and search).

Fixed-Size Chunking

Split every N tokens, optionally with overlap. Simplest approach. Deterministic. Problem: breaks mid-sentence, splits concepts. Best for diverse corpora where structure isn't known.
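A minimal fixed-size chunker with overlap, splitting on whitespace tokens as a stand-in for a real tokenizer; the function name and sizes are illustrative:

```python
# Slide a chunk_size window over the token stream, stepping by
# chunk_size - overlap so adjacent chunks share `overlap` tokens.
def fixed_size_chunks(text, chunk_size=512, overlap=64):
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

chunks = fixed_size_chunks("word " * 1000, chunk_size=100, overlap=20)
print(len(chunks), len(chunks[0].split()))
```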

Semantic Chunking

Split at sentence or paragraph boundaries, then merge until chunk reaches target size. Respects semantics. Better coherence. Slightly slower at index time. Recommended for structured documents: reports, articles, wikis.
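A sketch of the split-then-merge approach, using a naive regex sentence splitter (a production pipeline would use a proper sentence segmenter):

```python
# Split on sentence boundaries, then greedily merge sentences until a
# chunk reaches the target token count; chunks never break mid-sentence.
import re

def semantic_chunks(text, target_tokens=512):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = "First point. " * 30 + "Second point! " * 30
chunks = semantic_chunks(doc, target_tokens=20)
print(len(chunks))
```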

Hierarchical (Parent-Child)

Index small chunks for precision, but fetch parent chunk for context. Example: 256-token chunks for retrieval, but show 1024-token parent to LLM. Combines precision and context. Requires document hierarchy awareness (sections, chapters).
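A toy parent-child lookup, with keyword overlap standing in for embedding similarity; the data structures and texts are illustrative:

```python
# Retrieve over small child chunks for precision, but hand the LLM the
# larger parent section for context.
parents = {
    "sec1": "Section 1 full text ... (the larger parent chunk)",
    "sec2": "Section 2 full text ... (the larger parent chunk)",
}
children = [
    {"text": "refund window is 30 days", "parent": "sec1"},
    {"text": "rate limit is 100 rpm", "parent": "sec2"},
]

def retrieve_with_parent(query):
    # Stand-in scorer: keyword overlap instead of embedding similarity.
    def score(child):
        return len(set(query.lower().split()) & set(child["text"].split()))
    best = max(children, key=score)   # precise match on the small chunk
    return parents[best["parent"]]    # broad context for the LLM

context = retrieve_with_parent("what is the refund window")
print(context)
```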

Chunk Size Guidelines

| Document type | Recommended chunk size | Overlap | Method |
| PDFs, papers | 512 tokens | 64–128 tokens | Semantic |
| Web pages, blogs | 256–512 tokens | 32–64 tokens | Fixed or semantic |
| Code files | 256 tokens | 32 tokens | Language-aware splitting |
| Tables/structured | 128 tokens | 0 tokens | Table-aware parsing |
| Long documents | Small (256) primary, 1024 parent | 64 tokens | Hierarchical |
💡 Pro tip: Start with semantic chunking at 512 tokens with 64 token overlap. Benchmark against your golden dataset. Tune down if precision suffers, up if recall drops. Different document types may need different settings.
05 — Vector Representation

Embedding Model Selection

Your embedding model determines what "semantic similarity" means to your RAG system. The choice affects both quality and cost. General-purpose models (OpenAI, Cohere) are good defaults. Domain-specific models (legal, medical, code) can be better if you have the budget.

Popular Embedding Models

| Type | Model | Notes |
| Dense (API) | OpenAI text-embedding-3 | High quality, 3072 dimensions, fast. $0.02 per 1M tokens. Proprietary. |
| Dense (OSS) | BAAI/bge-large-en-v1.5 | Strong general-purpose, 1024 dims. Free, can run on-device. |
| Dense (OSS) | e5-base-v2 | Lightweight, good for latency-sensitive apps. 768 dims. |
| Dense (API) | Cohere Embed v3 | Multilingual, 4096 dims, can compress to 128. API-based. |
| Sparse | BM25 (Elasticsearch) | Exact keyword matching. Fast, interpretable. Often used hybrid with dense. |
| Hybrid | ColBERT | Token-level dense embeddings. Better precision than pure dense. |

Example: a minimal RAG pipeline with LlamaIndex and OpenAI embeddings:

    # Minimal RAG pipeline with LlamaIndex
    # pip install llama-index llama-index-llms-openai llama-index-embeddings-openai
    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding

    # Configure models
    Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.1)
    Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50

    # Index your documents
    documents = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # Query
    query_engine = index.as_query_engine(similarity_top_k=4)
    response = query_engine.query("What are the key SLA requirements?")
    print(response)

    # Persist index to disk (avoids re-embedding on restart)
    index.storage_context.persist("./storage")

Embedding Quality vs Cost

High quality (proprietary): OpenAI text-embedding-3-large. Better generalization, higher cost.
Balanced (open): BAAI/bge-large. Strong performance, free, can self-host.
Fast (lightweight): e5-base-v2, all-MiniLM-L6. Lower latency, smaller models, acceptable quality.

Recommendation: For production, test 2–3 models against your golden dataset. OpenAI's model is a reliable baseline if budget allows; BAAI/bge-large is an excellent OSS default. For proprietary data, consider fine-tuning embeddings on domain examples.
06 — Indexing & Retrieval

Vector Database Choice

Vector DBs store high-dimensional embeddings and answer nearest-neighbor queries fast. Dozens exist; choice depends on scale, latency budget, and whether you need hybrid (dense+sparse) search.

Vector Database Comparison

| Database | Type | Scale | Hybrid search | Best for |
| Chroma | OSS/managed | Small–medium | No | Prototypes, demos |
| Weaviate | OSS/managed | Medium–large | Yes | GraphQL queries, hybrid |
| Pinecone | Managed | Very large | No (dense only) | Production, serverless |
| Milvus | OSS | Medium–large | No | Scalable OSS deployments |
| Qdrant | OSS/managed | Medium–large | Yes (sparse) | Hybrid, production-ready |
| Elasticsearch | OSS/managed | Large | Yes (native) | Hybrid BM25+dense |

Key Considerations

Hybrid search: If your corpus has domain-specific terminology (legal, medical, code), combine dense embeddings with BM25. Weaviate, Qdrant, and Elasticsearch all support this.
Scale: Small corpus (<100k docs): Chroma is fine. Large corpus (millions): Pinecone or self-hosted Qdrant/Milvus.
Operations: Managed (Pinecone) means simpler ops; self-hosted gives you control.
Cost: Chroma, Milvus, and Qdrant OSS are free to run; Pinecone charges per query.
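Reciprocal rank fusion (RRF) is the usual way to merge dense and BM25 result lists without having to calibrate their incompatible score scales; a minimal sketch with made-up document IDs:

```python
# RRF: each list contributes 1/(k + rank) per document; documents ranked
# well by both retrievers rise to the top. k=60 is the conventional default.
def rrf_fuse(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d3", "d1", "d7"]   # ranked by embedding similarity
bm25_hits = ["d1", "d9", "d3"]    # ranked by keyword match
fused = rrf_fuse([dense_hits, bm25_hits])
print(fused)
```

Note that d1 wins despite topping only one list: appearing near the top of both beats a single first place.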

⚠️ Vector DB lock-in: Stored vectors are tied to the embedding model that produced them, so switching either the model or the DB means re-embedding and re-indexing the corpus. Start with Chroma for prototyping; plan your migration path to a production DB early.
    # Hybrid retrieval: dense + BM25 with reranking
    # pip install llama-index llama-index-retrievers-bm25 sentence-transformers
    from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
    from llama_index.core.query_engine import RetrieverQueryEngine
    from llama_index.core.retrievers import QueryFusionRetriever, VectorIndexRetriever
    from llama_index.retrievers.bm25 import BM25Retriever

    # Load persisted index
    storage_ctx = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_ctx)

    # Build hybrid retriever (dense + sparse)
    vector_retriever = VectorIndexRetriever(index=index, similarity_top_k=6)
    bm25_retriever = BM25Retriever.from_defaults(index=index, similarity_top_k=6)
    hybrid_retriever = QueryFusionRetriever(
        retrievers=[vector_retriever, bm25_retriever],
        similarity_top_k=4,
        num_queries=1,  # don't generate sub-queries
        mode="reciprocal_rerank",
    )

    query_engine = RetrieverQueryEngine.from_args(hybrid_retriever)
    response = query_engine.query("Summarise the data retention policy")
    print(response)

    # Source nodes: response.source_nodes
    for node in response.source_nodes:
        print(f"  score={node.score:.3f} | {node.text[:80]}...")
07 — Production Maturity

Enterprise RAG Considerations

Moving RAG from prototype to production means handling scale, reliability, and governance concerns that lab settings never surface. Enterprise RAG introduces access control, data freshness, multi-tenancy, and auditability.

Access Control & Multi-Tenancy

In the enterprise, different users see different documents, and a single shared vector DB can't enforce this naively. Solutions:
Document-level filtering: tag chunks with user or group IDs and filter at retrieval time.
Per-user indexes: strong isolation, but overkill for most cases.
Row-level security in the vector DB: Qdrant and Weaviate support metadata-based filtering.
Recommended: tag chunks with an accessible_users field and filter before ranking.
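A sketch of document-level filtering before ranking, assuming each chunk carries an accessible_users set; the field names, group names, and "*" public tag are all illustrative conventions, not from any vector DB:

```python
# Filter chunks by group membership before any ranking happens, so
# restricted content never enters the candidate pool.
chunks = [
    {"text": "public pricing page", "accessible_users": {"*"}},
    {"text": "internal salary bands", "accessible_users": {"hr-team"}},
    {"text": "eng runbook", "accessible_users": {"eng-team", "sre"}},
]

def visible_chunks(user_groups):
    groups = set(user_groups) | {"*"}   # everyone matches the public tag
    return [c for c in chunks if c["accessible_users"] & groups]

results = visible_chunks({"eng-team"})
print([c["text"] for c in results])
```

In a real deployment the same predicate becomes a metadata filter pushed down into the vector DB query, not a post-hoc Python loop.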

Data Freshness

How often do you re-index? Daily or weekly is typical; minute-level freshness requires streaming indexing, which is harder. Strategies:
Batch indexing: new docs → index offline → swap at midnight.
Incremental indexing: new docs → add to the existing index immediately.
Document versioning: keep old chunks, mark them as stale, weight newer versions higher.
For real-time data (stock prices, live scores), RAG isn't the right tool alone.
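An incremental-indexing sketch that hashes document content so unchanged docs are skipped on each pass; the store names and corpus are made up:

```python
# Track a content hash per document; only documents whose hash changed
# (or which are new) need re-chunking and re-embedding.
import hashlib

indexed_hashes = {}   # doc_id -> content hash, persisted alongside the index

def docs_needing_reindex(docs):
    stale = []
    for doc_id, content in docs.items():
        h = hashlib.sha256(content.encode()).hexdigest()
        if indexed_hashes.get(doc_id) != h:
            stale.append(doc_id)
            indexed_hashes[doc_id] = h
    return stale

corpus = {"policy.md": "v1 text", "faq.md": "faq text"}
print(docs_needing_reindex(corpus))   # first run: everything is new
corpus["policy.md"] = "v2 text"
print(docs_needing_reindex(corpus))   # second run: only the changed doc
```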

Observability & Audit

What was retrieved? What was generated? Did the system hallucinate? Track:
Queries and results: the audit trail.
Retrieval quality: MRR, NDCG metrics.
Answer quality: user feedback, LLM-based scoring.
Latency per stage: retrieve, rerank, generate.
Log the sources behind each answer for compliance.
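MRR, one of the retrieval metrics above, is simple to compute against a golden dataset: for each query, take the reciprocal of the rank at which the first relevant document appears. The queries and relevance judgments below are made up:

```python
# Mean reciprocal rank over a set of queries: 1/rank of the first
# relevant hit, averaged across queries (0 if nothing relevant appears).
def mean_reciprocal_rank(results, relevant):
    total = 0.0
    for query, retrieved in results.items():
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant[query]:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(results)

results = {"q1": ["d2", "d5", "d1"], "q2": ["d9", "d3"]}
relevant = {"q1": {"d5"}, "q2": {"d3"}}
print(mean_reciprocal_rank(results, relevant))  # (1/2 + 1/2) / 2 = 0.5
```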

Best Practices Checklist

| Aspect | Consideration | Implementation |
| Access control | Different users, different documents | Metadata-based filtering + user context |
| Data freshness | How old is the corpus? | Batch or incremental indexing cadence |
| Version control | Track document updates | Timestamp, hash, version field in metadata |
| Monitoring | Quality metrics | RAGAS or similar evaluation framework |
| Latency | Per-stage timings | Instrument retrieval, ranking, generation |
| Fallback | What if retrieval fails? | Graceful degradation, fallback to LLM-only |
⚠️ Enterprise gotcha: Access control is often an afterthought. Design it in from the start. Document-level filtering is simpler than per-index solutions and scales better.
08 — Further Reading

References and Related Concepts
