01 — Overview
Why LLMs Need RAG
LLMs have a fundamental knowledge problem: they only know what was in their training data, which has a hard cutoff. They don't know about your proprietary documents, your company database, live market data, or anything that happened after training. When you ask a plain LLM a question outside its training data, it hallucinates — it fabricates plausible-sounding but false answers.
RAG (Retrieval-Augmented Generation) solves this by attaching external knowledge to the LLM at query time. Instead of asking the LLM directly, you first retrieve the relevant chunks from your documents, then feed both the chunks and the question to the LLM. The LLM then generates a grounded answer based on what it retrieved, not its training data.
💡
RAG solves the core problem with plain LLMs: they only know what was in their training data. RAG attaches any external knowledge — your docs, your database, live data — to any LLM at query time, without retraining the model.
RAG vs. Hallucination
Without RAG:
Question: "What is our refund policy?"
LLM: "Your policy is 60 days..." [hallucinated, wrong]
With RAG:
Question: "What is our refund policy?"
Retrieve: "Our refund policy is 30 days, no questions asked."
LLM: "Your policy is 30 days, no questions asked." [grounded, correct]
When to use RAG
- You have external documents or data: Contracts, policies, knowledge bases, databases
- Knowledge is private or proprietary: Your internal docs were not in the training data
- Knowledge is live or frequently updated: Market data, APIs, real-time feeds
- You need to cite sources: Users want to know where the answer came from
- You want to update knowledge without retraining: Update the corpus, not the model
02 — Offline Preparation
The Indexing Pipeline
Indexing is the offline phase: you run it once when you have a new corpus, or periodically when your documents change. The goal is to make documents searchable.
The Four Steps
1. Chunk — Break documents into pieces
Split documents into small, semantically coherent chunks (typically 256–1024 tokens). You need chunks small enough to retrieve precisely but large enough to contain context.
- Fixed-size chunks: Fast, simple, but ignores semantics
- Semantic chunks: Split at sentence/paragraph boundaries, better coherence
- Hierarchical (parent-child): Small chunks for retrieval, fetch parent for context
2. Embed — Turn text into vectors
Use an embedding model to convert each chunk into a dense vector (e.g., 1536 dimensions). Semantically similar chunks will have similar vectors, enabling similarity search.
- Embedding models: BAAI/bge-small, OpenAI text-embedding-3-small, Voyage AI embeddings (Anthropic does not offer a first-party embedding model)
- Dimension: 256–1536 dims. Larger vectors generally buy retrieval quality at the cost of slower search and more storage
- Metric: Most systems use cosine similarity for retrieval
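Cosine similarity is just a normalized dot product. Here is a self-contained sketch with hand-made 3-dim "embeddings" (real models output hundreds of dimensions; these toy vectors are invented purely to show the mechanics):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), ranges over [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dim "embeddings" (illustrative only; a real model emits 256-1536 dims)
refund_q = [0.9, 0.1, 0.0]      # "what is the refund policy?"
refund_doc = [0.8, 0.2, 0.1]    # "refunds accepted within 30 days"
shipping_doc = [0.1, 0.2, 0.9]  # "shipping takes 3-5 business days"

# The semantically related pair scores higher, which is all retrieval needs
assert cosine(refund_q, refund_doc) > cosine(refund_q, shipping_doc)
```

Because retrieval only ranks candidates, the absolute similarity values matter less than their order.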
3. Store — Index in a vector database
Store the chunks + embeddings in a vector database. The DB builds an index (typically HNSW or IVF) for fast approximate nearest neighbor (ANN) search.
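Conceptually, a vector store is "add (chunk, vector) pairs, then return the k nearest vectors to a query". The brute-force version below makes that concrete; it is a teaching stand-in, not an ANN index — real databases replace the linear scan with HNSW or IVF so search stays fast at millions of vectors.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

class TinyVectorStore:
    """Exact (brute-force) nearest-neighbor search over stored chunks.
    O(n) per query; HNSW/IVF indexes exist precisely to avoid this scan."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, chunk: str, embedding: list[float]) -> None:
        self.items.append((chunk, embedding))

    def search(self, query_emb: list[float], k: int = 3) -> list[str]:
        ranked = sorted(self.items,
                        key=lambda it: cosine(query_emb, it[1]),
                        reverse=True)
        return [chunk for chunk, _ in ranked[:k]]

store = TinyVectorStore()
store.add('Refunds accepted within 30 days.', [0.8, 0.2, 0.1])
store.add('Shipping takes 3-5 business days.', [0.1, 0.2, 0.9])
print(store.search([0.9, 0.1, 0.0], k=1))  # ['Refunds accepted within 30 days.']
```

Swapping this class for Chroma, Pinecone, or Weaviate changes the performance profile, not the interface: add vectors, query by vector, get top-k chunks back.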
4. Verify — Test the index
Query the index with test questions to ensure relevant chunks are retrievable. Bad chunking or embedding models lead to garbage retrieval, which no LLM can fix.
- Test retrieval quality before going live
- Identify edge cases (ambiguous queries, rare terms)
- Iterate on chunking strategy if needed
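A simple way to make "test the index" concrete is recall@k over a handful of labeled questions: for each test question, does the expected chunk appear in the top-k results? The sketch below uses a fake search function (the `doc…` ids and questions are made up) so the metric itself is clear; plug in your real index's search instead.

```python
def recall_at_k(test_cases, search_fn, k: int = 3) -> float:
    """Fraction of test questions whose expected chunk id shows up in the top-k."""
    hits = 0
    for question, expected_id in test_cases:
        retrieved_ids = search_fn(question, k)
        if expected_id in retrieved_ids:
            hits += 1
    return hits / len(test_cases)

# Stand-in for a real index query; always returns the same ranked ids.
def fake_search(question: str, k: int) -> list[str]:
    return ['doc1', 'doc7', 'doc3'][:k]

cases = [
    ('what is the refund policy?', 'doc1'),   # hit: doc1 is in the top-3
    ('how long does shipping take?', 'doc2'), # miss: doc2 never retrieved
]
print(recall_at_k(cases, fake_search, k=3))  # 0.5
```

If recall@k is low, the fix is usually upstream (chunking, embedding model, metadata), not in the prompt.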
⚠️
RAG quality is determined before any model call — by how you chunk, what metadata you store, and how you retrieve. Fix the data pipeline first, prompt engineering second.
03 — Online Query
The Retrieval Pipeline
Retrieval is the online phase: it runs on every user query. The goal is to find the top-k most relevant chunks and send them to the LLM.
The Core Flow
User query: "What is the return policy?"
STEP 1 — Embed the query
Query vector = embed("What is the return policy?")
STEP 2 — Search the index
Top-3 similar chunks = vector_db.search(query_vector, k=3)
Results might be:
1. "Returns are accepted within 30 days of purchase..."
2. "No questions asked 30-day return policy."
3. "Damaged items: full refund within 30 days."
STEP 3 — Format context
context = "\n".join([chunk1, chunk2, chunk3])
STEP 4 — Prompt the LLM
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
STEP 5 — Generate answer
LLM reads the context and question, outputs grounded answer
Key Design Decisions
How many chunks to retrieve (k)? Trade-off between context breadth and token budget. Typical: k=3 to k=10. More chunks = more context = slower + more cost, but lower risk of missing relevant info.
Embedding consistency: The query and corpus must be embedded with the same model. Mismatched models = broken retrieval.
Context window limits: Your LLM has a max token limit. Calculate: tokens(context) + tokens(query) + tokens(answer) must fit. For a 128k-token model with 10 chunks of 300 tokens each, the retrieved context uses only 3,000 tokens, leaving roughly 125k of headroom; the budget only becomes tight with smaller windows or much larger k.
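The budget arithmetic is worth scripting so it runs before every prompt is assembled. A minimal sketch (the token counts are assumed round numbers, not output from a real tokenizer):

```python
def remaining_budget(window: int, n_chunks: int, chunk_tokens: int,
                     query_tokens: int, answer_tokens: int) -> int:
    """Tokens left in the context window after retrieved context,
    the user query, and the reserved answer space are accounted for."""
    used = n_chunks * chunk_tokens + query_tokens + answer_tokens
    return window - used

# 128k-token window, 10 chunks of 300 tokens, ~50-token query, 500-token answer
print(remaining_budget(128_000, 10, 300, 50, 500))  # 124450
```

If the result goes negative, drop chunks (lower k) or shrink them before calling the model rather than letting the API truncate silently.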
✓
Best practice: Start with k=3, measure retrieval quality (are the top-3 relevant?), then tune k upward if recall is low. Most systems work well with k ≤ 10.
04 — Working Example
Minimal Code Example
Here is a complete, working RAG system in Python. Install dependencies with pip install anthropic chromadb sentence-transformers.
```python
# Minimal RAG pipeline: embed -> store -> retrieve -> answer
# pip install anthropic chromadb sentence-transformers
from sentence_transformers import SentenceTransformer
import chromadb
from anthropic import Anthropic

embed_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
col = chromadb.Client().get_or_create_collection('docs')
client = Anthropic()

# Indexing (offline, run once per corpus update)
docs = [
    'Our refund policy is 30 days, no questions asked.',
    'Shipping takes 3-5 business days to the US.',
    'Contact support at help@example.com for account issues.',
]
embeddings = embed_model.encode(docs).tolist()
col.add(documents=docs, embeddings=embeddings,
        ids=[f'doc{i}' for i in range(len(docs))])

# Retrieval + Generation (online, per query)
def rag(question: str, k: int = 2) -> str:
    q_emb = embed_model.encode([question]).tolist()
    results = col.query(query_embeddings=q_emb, n_results=k)
    context = '\n'.join(results['documents'][0])
    resp = client.messages.create(
        model='claude-haiku-4-5-20251001', max_tokens=256,
        messages=[{'role': 'user', 'content':
                   f'Context:\n{context}\n\nQuestion: {question}\nAnswer briefly:'}]
    )
    return resp.content[0].text

print(rag('What is the return policy?'))
# Output: "Our refund policy is 30 days, no questions asked."
```
What's happening here
SentenceTransformer: Embedding model that converts text to dense vectors
chromadb.Client(): In-memory vector database. For production, use Pinecone or Weaviate
embed_model.encode(docs): Convert each document to a 384-dim vector
col.add(...): Store documents and their vectors in the index
col.query(...): Find top-k documents most similar to the question vector
client.messages.create(...): Call Claude with context + question, get grounded answer
Scaling to production
- Replace Chroma with Pinecone or Weaviate for persistent, distributed storage
- Add reranking after retrieval: coarse-rank top-50 with embedding, fine-rank top-5 with cross-encoder
- Add metadata filtering: retrieve only documents matching date, author, category
- Implement hybrid search: combine dense (embedding) + sparse (BM25) retrieval
- Monitor retrieval quality with RAGAS metrics: context precision, context recall
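Of the production upgrades above, hybrid search is the easiest to sketch without extra infrastructure. One common way to combine a dense (embedding) ranking with a sparse (BM25) ranking is reciprocal rank fusion: each document scores the sum of 1 / (k + rank) across the lists it appears in. The doc ids and rankings below are invented for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank_d).
    Documents that rank well in multiple lists float to the top; k=60 is the
    conventional damping constant from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ['doc3', 'doc1', 'doc2']   # embedding-based ranking (invented)
sparse = ['doc1', 'doc4', 'doc3']  # BM25 ranking (invented)
print(reciprocal_rank_fusion([dense, sparse]))
# 'doc1' ranks first: it appears high in both lists
```

RRF needs no score calibration between the two retrievers, which is why it is a popular first pass before investing in a trained reranker.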
05 — Architecture Decision
RAG vs. Fine-Tuning: When to Use Which
Both RAG and fine-tuning add knowledge to an LLM. They're not mutually exclusive — you can do both. But they solve different problems and have different trade-offs.
RAG: Retrieval-Augmented Generation
You give the LLM external knowledge at query time via retrieved documents.
- Best for: Private/proprietary data, frequently updated knowledge, specific facts, citations
- Cost: Cheap — no training. Pay only for retrieval + generation
- Latency: One extra step per query: embed the question and search the index (~50–200 ms)
- Update speed: Add/remove documents instantly. No retraining
- Hallucination: Lower — LLM grounds answer in retrieved context
- Scaling: Scale retrieval independently of LLM. Use multiple vector DBs
- Drawback: Depends entirely on retrieval quality. Bad retrieval = bad answer
Fine-Tuning: Adapt the Model Weights
You update the LLM's weights on domain-specific examples. Knowledge becomes part of the model.
- Best for: Style/tone adaptation, domain-specific reasoning, specialized tasks
- Cost: Expensive — hours of GPU time, compute, storage
- Latency: No extra latency — generation is the same speed
- Update speed: Slow — need to retrain and deploy new model version
- Hallucination: Higher — model can confabulate facts it learned poorly
- Scaling: Each fine-tuned model is separate; can't share knowledge easily
- Advantage: Works even if you can't retrieve (e.g., rare edge cases)
| Criterion | RAG | Fine-Tuning | Use RAG if... |
|---|---|---|---|
| Knowledge updates | Real-time (add documents) | Slow (retrain model) | Knowledge changes weekly or daily |
| Cost | Low (retrieval + generation) | High (GPUs + storage) | Budget is tight |
| Hallucination | Low (grounded in docs) | Higher (model can confabulate) | Accuracy is critical |
| Source attribution | Easy (cite retrieved doc) | Impossible (weights are opaque) | Users need to verify sources |
| Rare/edge cases | Works if doc is in corpus | Works if model saw example | Your docs cover edge cases |
| Task complexity | Simple Q&A, summarization | Complex reasoning, style | Tasks are straightforward |
⚠️
Hybrid approach: The best systems often use both. Use RAG for facts and citations. Use fine-tuning for style, tone, and domain-specific reasoning patterns. Fine-tune on top of a retrieval-augmented prompt.
06 — Deeper Dives
What to Explore Next
The concepts on this page are the foundation. As you build RAG systems, you'll encounter more advanced topics such as reranking, hybrid search, and retrieval evaluation; the learning path below suggests an order for tackling them.
07 — Further Reading
References
Academic Papers
- Lewis, P. et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
- Gao, Y. et al. (2023). Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.
Learning Resources
Learning Path
RAG touches embeddings, vector search, chunking, and LLM prompting. Here's the recommended sequence:
Embeddings (turn text → vectors) → Vector Search (FAISS / Weaviate) → Chunking (split documents) → RAG Pipeline (you are here) → Advanced RAG (rerank + hybrid) → RAG Eval (RAGAS / TruLens)
1. Understand embeddings first: RAG only makes sense once you know how text becomes a vector and why similar meaning → similar vector. Start with Embeddings.
2. Build a basic RAG pipeline: Use LangChain or LlamaIndex to index a PDF and answer questions over it. Get something working before optimising. Target: <1 hour to a demo.
3. Measure before you tune: Run RAGAS or LlamaIndex evaluations on your pipeline. Context recall and answer faithfulness tell you where it breaks before you guess at fixes.
4. Apply Advanced RAG techniques selectively: Hybrid search, reranking, and HyDE each help in specific failure modes. Fix the failure mode you measured, not the one you assume.