01 — Overview
Why LLMs Need RAG
LLMs have a fundamental knowledge problem: they only know what was in their training data, which has a hard cutoff. They don't know about your proprietary documents, your company database, live market data, or anything that happened after training. When you ask a plain LLM a question outside its training data, it hallucinates — it fabricates plausible-sounding but false answers.
RAG (Retrieval-Augmented Generation) solves this by attaching external knowledge to the LLM at query time. Instead of asking the LLM directly, you first retrieve the relevant chunks from your documents, then feed both the chunks and the question to the LLM. The LLM then generates a grounded answer based on what it retrieved, not its training data.
💡
RAG solves the core problem with plain LLMs: they only know what was in their training data. RAG attaches any external knowledge — your docs, your database, live data — to any LLM at query time, without retraining the model.
RAG vs. Hallucination
Without RAG:
Question: "What is our refund policy?"
LLM: "Your policy is 60 days..." [hallucinated, wrong]
With RAG:
Question: "What is our refund policy?"
Retrieve: "Our refund policy is 30 days, no questions asked."
LLM: "Your policy is 30 days, no questions asked." [grounded, correct]
When to use RAG
- You have external documents or data: Contracts, policies, knowledge bases, databases
- Knowledge is private or proprietary: Your internal docs were not in the training data
- Knowledge is live or frequently updated: Market data, APIs, real-time feeds
- You need to cite sources: Users want to know where the answer came from
- You want to update knowledge without retraining: Update the corpus, not the model
02 — Offline Preparation
The Indexing Pipeline
Indexing is the offline phase: you run it once when you have a new corpus, or periodically when your documents change. The goal is to make documents searchable.
The Four Steps
1. Chunk — Break documents into pieces
Split documents into small, semantically coherent chunks (typically 256–1024 tokens). You need chunks small enough to retrieve precisely but large enough to contain context.
- Fixed-size chunks: Fast, simple, but ignores semantics
- Semantic chunks: Split at sentence/paragraph boundaries, better coherence
- Hierarchical (parent-child): Small chunks for retrieval, fetch parent for context
2. Embed — Turn text into vectors
Use an embedding model to convert each chunk into a dense vector (e.g., 1536 dimensions). Semantically similar chunks will have similar vectors, enabling similarity search.
- Embedding models: BAAI/bge-small, OpenAI text-embedding-3-small, Voyage AI embeddings (Anthropic does not offer a first-party embedding model)
- Dimension: 256–1536 dims. Larger vectors generally buy retrieval quality at the cost of slower search and more storage
- Metric: Most systems use cosine similarity for retrieval
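Cosine similarity is just a normalized dot product. Here is a self-contained sketch with hand-made 3-dim "embeddings" (real models output hundreds of dimensions; these toy vectors are invented purely to show the mechanics):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranges over [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings" (illustrative only; a real model emits 256-1536 dims)
refund_q = [0.9, 0.1, 0.0]      # "what is the refund policy?"
refund_doc = [0.8, 0.2, 0.1]    # "refunds accepted within 30 days"
shipping_doc = [0.1, 0.2, 0.9]  # "shipping takes 3-5 business days"

# The semantically related pair scores higher, which is all retrieval needs
assert cosine(refund_q, refund_doc) > cosine(refund_q, shipping_doc)
```

Because retrieval only ranks candidates, the absolute similarity values matter less than their order.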
3. Store — Index in a vector database
Store the chunks + embeddings in a vector database. The DB builds an index (typically HNSW or IVF) for fast approximate nearest neighbor (ANN) search.
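Conceptually, a vector store is "add (chunk, vector) pairs, then return the k nearest vectors to a query". The brute-force version below makes that concrete; it is a teaching stand-in, not an ANN index — real databases replace the linear scan with HNSW or IVF so search stays fast at millions of vectors.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Exact (brute-force) nearest-neighbor search over stored chunks.
    O(n) per query; HNSW/IVF indexes exist precisely to avoid this scan."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, chunk: str, embedding: list[float]) -> None:
        self.items.append((chunk, embedding))

    def search(self, query_emb: list[float], k: int = 3) -> list[str]:
        ranked = sorted(self.items,
                        key=lambda it: cosine(query_emb, it[1]),
                        reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

store = TinyVectorStore()
store.add('Refunds accepted within 30 days.', [0.8, 0.2, 0.1])
store.add('Shipping takes 3-5 business days.', [0.1, 0.2, 0.9])
print(store.search([0.9, 0.1, 0.0], k=1))  # ['Refunds accepted within 30 days.']
```

Swapping this class for Chroma, Pinecone, or Weaviate changes the performance profile, not the interface: add vectors, query by vector, get top-k chunks back.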
4. Verify — Test the index
Query the index with test questions to ensure relevant chunks are retrievable. Bad chunking or embedding models lead to garbage retrieval, which no LLM can fix.
- Test retrieval quality before going live
- Identify edge cases (ambiguous queries, rare terms)
- Iterate on chunking strategy if needed
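A simple way to make "test the index" concrete is recall@k over a handful of labeled questions: for each test question, does the expected chunk appear in the top-k results? The sketch below uses a fake search function (the `doc…` ids and questions are made up) so the metric itself is clear; plug in your real index's search instead.

```python
def recall_at_k(test_cases, search_fn, k: int = 3) -> float:
    """Fraction of test questions whose expected chunk id shows up in the top-k."""
    hits = 0
    for question, expected_id in test_cases:
        retrieved_ids = search_fn(question, k)
        if expected_id in retrieved_ids:
            hits += 1
    return hits / len(test_cases)

# Stand-in for a real index query; always returns the same ranked ids.
def fake_search(question: str, k: int) -> list[str]:
    return ['doc1', 'doc7', 'doc3'][:k]

cases = [
    ('what is the refund policy?', 'doc1'),   # hit: doc1 is in the top-3
    ('how long does shipping take?', 'doc2'), # miss: doc2 never retrieved
]
print(recall_at_k(cases, fake_search, k=3))  # 0.5
```

If recall@k is low, the fix is usually upstream (chunking, embedding model, metadata), not in the prompt.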
⚠️
RAG quality is determined before any model call — by how you chunk, what metadata you store, and how you retrieve. Fix the data pipeline first, prompt engineering second.
03 — Online Query
The Retrieval Pipeline
Retrieval is the online phase: it runs on every user query. The goal is to find the top-k most relevant chunks and send them to the LLM.
The Core Flow
User query: "What is the return policy?"
STEP 1 — Embed the query
Query vector = embed("What is the return policy?")
STEP 2 — Search the index
Top-3 similar chunks = vector_db.search(query_vector, k=3)
Results might be:
1. "Returns are accepted within 30 days of purchase..."
2. "No questions asked 30-day return policy."
3. "Damaged items: full refund within 30 days."
STEP 3 — Format context
context = "\n".join([chunk1, chunk2, chunk3])
STEP 4 — Prompt the LLM
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
STEP 5 — Generate answer
LLM reads the context and question, outputs grounded answer
Key Design Decisions
How many chunks to retrieve (k)? Trade-off between context breadth and token budget. Typical: k=3 to k=10. More chunks = more context = slower + more cost, but lower risk of missing relevant info.
Embedding consistency: The query and corpus must be embedded with the same model. Mismatched models = broken retrieval.
Context window limits: Your LLM has a max token limit. Calculate: tokens(context) + tokens(query) + tokens(answer) must fit. For a 128k-token model with 10 chunks of 300 tokens each, the retrieved context uses only 3,000 tokens, leaving roughly 125k of headroom; the budget only becomes tight with smaller windows or much larger k.
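The budget arithmetic is worth scripting so it runs before every prompt is assembled. A minimal sketch (the token counts are assumed round numbers, not output from a real tokenizer):

```python
def remaining_budget(window: int, n_chunks: int, chunk_tokens: int,
                     query_tokens: int, answer_tokens: int) -> int:
    """Tokens left in the context window after retrieved context,
    the user query, and the reserved answer space are accounted for."""
    used = n_chunks * chunk_tokens + query_tokens + answer_tokens
    return window - used

# 128k-token window, 10 chunks of 300 tokens, ~50-token query, 500-token answer
print(remaining_budget(128_000, 10, 300, 50, 500))  # 124450
```

If the result goes negative, drop chunks (lower k) or shrink them before calling the model rather than letting the API truncate silently.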
✓
Best practice: Start with k=3, measure retrieval quality (are the top-3 relevant?), then tune k upward if recall is low. Most systems work well with k ≤ 10.
04 — Working Example
Minimal Code Example
Here is a complete, working RAG system in Python. Install dependencies with pip install anthropic chromadb sentence-transformers.
```python
# Minimal RAG pipeline: embed -> store -> retrieve -> answer
# pip install anthropic chromadb sentence-transformers
from sentence_transformers import SentenceTransformer
import chromadb
from anthropic import Anthropic

embed_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
col = chromadb.Client().get_or_create_collection('docs')
client = Anthropic()

# Indexing (offline, run once per corpus update)
docs = [
    'Our refund policy is 30 days, no questions asked.',
    'Shipping takes 3-5 business days to the US.',
    'Contact support at help@example.com for account issues.',
]
embeddings = embed_model.encode(docs).tolist()
col.add(documents=docs, embeddings=embeddings,
        ids=[f'doc{i}' for i in range(len(docs))])

# Retrieval + Generation (online, per query)
def rag(question: str, k: int = 2) -> str:
    q_emb = embed_model.encode([question]).tolist()
    results = col.query(query_embeddings=q_emb, n_results=k)
    context = '\n'.join(results['documents'][0])
    resp = client.messages.create(
        model='claude-haiku-4-5-20251001', max_tokens=256,
        messages=[{'role': 'user', 'content':
                   f'Context:\n{context}\n\nQuestion: {question}\nAnswer briefly:'}]
    )
    return resp.content[0].text

print(rag('What is the return policy?'))
# Output: "Our refund policy is 30 days, no questions asked."
```
What's happening here
SentenceTransformer: Embedding model that converts text to dense vectors
chromadb.Client(): In-memory vector database. For production, use Pinecone or Weaviate
embed_model.encode(docs): Convert each document to a 384-dim vector
col.add(...): Store documents and their vectors in the index
col.query(...): Find top-k documents most similar to the question vector
client.messages.create(...): Call Claude with context + question, get grounded answer
Scaling to production
- Replace Chroma with Pinecone or Weaviate for persistent, distributed storage
- Add reranking after retrieval: coarse-rank top-50 with embedding, fine-rank top-5 with cross-encoder
- Add metadata filtering: retrieve only documents matching date, author, category
- Implement hybrid search: combine dense (embedding) + sparse (BM25) retrieval
- Monitor retrieval quality with RAGAS metrics: context precision, context recall
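Of the production upgrades above, hybrid search is the easiest to sketch without extra infrastructure. One common way to combine a dense (embedding) ranking with a sparse (BM25) ranking is reciprocal rank fusion: each document scores the sum of 1 / (k + rank) across the lists it appears in. The doc ids and rankings below are invented for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank_d).
    Documents that rank well in multiple lists float to the top; k=60 is the
    conventional damping constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ['doc3', 'doc1', 'doc2']   # embedding-based ranking (invented)
sparse = ['doc1', 'doc4', 'doc3']  # BM25 ranking (invented)
print(reciprocal_rank_fusion([dense, sparse]))
# 'doc1' ranks first: it appears high in both lists
```

RRF needs no score calibration between the two retrievers, which is why it is a popular first pass before investing in a trained reranker.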
05 — Architecture Decision
RAG vs. Fine-Tuning: When to Use Which
Both RAG and fine-tuning add knowledge to an LLM. They're not mutually exclusive — you can do both. But they solve different problems and have different trade-offs.
RAG: Retrieval-Augmented Generation
You give the LLM external knowledge at query time via retrieved documents.
- Best for: Private/proprietary data, frequently updated knowledge, specific facts, citations
- Cost: Cheap — no training. Pay only for retrieval + generation
- Latency: One extra step per query: embed the question and search the index (~50–200 ms)
- Update speed: Add/remove documents instantly. No retraining
- Hallucination: Lower — LLM grounds answer in retrieved context
- Scaling: Scale retrieval independently of LLM. Use multiple vector DBs
- Drawback: Depends entirely on retrieval quality. Bad retrieval = bad answer
Fine-Tuning: Adapt the Model Weights
You update the LLM's weights on domain-specific examples. Knowledge becomes part of the model.
- Best for: Style/tone adaptation, domain-specific reasoning, specialized tasks
- Cost: Expensive — hours of GPU time, compute, storage
- Latency: No extra latency — generation is the same speed
- Update speed: Slow — need to retrain and deploy new model version
- Hallucination: Higher — model can confabulate facts it learned poorly
- Scaling: Each fine-tuned model is separate; can't share knowledge easily
- Advantage: Works even if you can't retrieve (e.g., rare edge cases)
| Criterion | RAG | Fine-Tuning | Use RAG if... |
|---|---|---|---|
| Knowledge updates | Real-time (add documents) | Slow (retrain model) | Knowledge changes weekly or daily |
| Cost | Low (retrieval + generation) | High (GPUs + storage) | Budget is tight |
| Hallucination | Low (grounded in docs) | Higher (model can confabulate) | Accuracy is critical |
| Source attribution | Easy (cite retrieved doc) | Impossible (weights are opaque) | Users need to verify sources |
| Rare/edge cases | Works if doc is in corpus | Works if model saw example | Your docs cover edge cases |
| Task complexity | Simple Q&A, summarization | Complex reasoning, style | Tasks are straightforward |
⚠️
Hybrid approach: The best systems often use both. Use RAG for facts and citations. Use fine-tuning for style, tone, and domain-specific reasoning patterns. Fine-tune on top of a retrieval-augmented prompt.
06 — Deeper Dives
What to Explore Next
The concepts on this page are the foundation. As you build RAG systems, you'll encounter more advanced topics such as reranking, hybrid search, and retrieval evaluation; the learning path below suggests an order for tackling them.
07 — Further Reading
References
Academic Papers
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
- Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.
Learning Resources
Learning Path
RAG touches embeddings, vector search, chunking, and LLM prompting. Here's the recommended sequence:
Embeddings (turn text → vectors) → Vector Search (FAISS / Weaviate) → Chunking (split documents) → RAG Pipeline (you are here) → Advanced RAG (rerank + hybrid) → RAG Eval (RAGAS / TruLens)
1. Understand embeddings first: RAG only makes sense once you know how text becomes a vector and why similar meaning → similar vector. Start with Embeddings.
2. Build a basic RAG pipeline: Use LangChain or LlamaIndex to index a PDF and answer questions over it. Get something working before optimising. Target: <1 hour to a demo.
3. Measure before you tune: Run RAGAS or LlamaIndex evaluations on your pipeline. Context recall and answer faithfulness tell you where it breaks before you guess at fixes.
4. Apply Advanced RAG techniques selectively: Hybrid search, reranking, and HyDE each help in specific failure modes. Fix the failure mode you measured, not the one you assume.