Retrieval-Augmented Generation


Ground any LLM in your documents and data — without retraining

  • The Pipeline: index → retrieve → generate
  • The Mechanism: embeddings + vector DB
  • The Key Decision: RAG vs fine-tuning
01 — Overview

Why LLMs Need RAG

LLMs have a fundamental knowledge problem: they only know what was in their training data, which has a hard cutoff. They don't know about your proprietary documents, your company database, live market data, or anything that happened after training. When you ask a plain LLM a question outside its training data, it hallucinates — it fabricates plausible-sounding but false answers.

RAG (Retrieval-Augmented Generation) solves this by attaching external knowledge to the LLM at query time. Instead of asking the LLM directly, you first retrieve the relevant chunks from your documents, then feed both the chunks and the question to the LLM. The LLM then generates a grounded answer based on what it retrieved, not its training data.

💡 RAG solves the core problem with plain LLMs: they only know what was in their training data. RAG attaches any external knowledge — your docs, your database, live data — to any LLM at query time, without retraining the model.

RAG vs. Hallucination

Without RAG:
  Question: "What is our refund policy?"
  LLM: "Your policy is 60 days..." [hallucinated, wrong]

With RAG:
  Question: "What is our refund policy?"
  Retrieve: "Our refund policy is 30 days, no questions asked."
  LLM: "Your policy is 30 days, no questions asked." [grounded, correct]


02 — Offline Preparation

The Indexing Pipeline

Indexing is the offline phase: you run it once when you have a new corpus, or periodically when your documents change. The goal is to make documents searchable.

The Four Steps

1. Chunk — Break documents into pieces

Split documents into small, semantically coherent chunks (typically 256–1024 tokens). You need chunks small enough to retrieve precisely but large enough to contain context.

  • Fixed-size chunks: Fast, simple, but ignores semantics
  • Semantic chunks: Split at sentence/paragraph boundaries, better coherence
  • Hierarchical (parent-child): Small chunks for retrieval, fetch parent for context
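A minimal sketch of the first two strategies, assuming whitespace words stand in for tokens (a real pipeline would count tokens with the embedding model's tokenizer):

```python
# Sketch of two chunking strategies; "tokens" are whitespace words here
# for simplicity. Function names are illustrative, not from a library.

def fixed_size_chunks(text: str, size: int = 64, overlap: int = 16) -> list[str]:
    """Fixed-size chunks with overlap: fast and simple, but may cut mid-sentence."""
    words = text.split()
    step = size - overlap
    return [' '.join(words[i:i + size]) for i in range(0, len(words), step)]

def paragraph_chunks(text: str, max_words: int = 128) -> list[str]:
    """Semantic chunks: split at paragraph breaks, merging short neighbors."""
    chunks: list[str] = []
    for para in text.split('\n\n'):
        para = para.strip()
        if not para:
            continue
        if chunks and len((chunks[-1] + ' ' + para).split()) <= max_words:
            chunks[-1] += ' ' + para   # merge while under the size budget
        else:
            chunks.append(para)
    return chunks
```

The overlap in the fixed-size variant ensures a sentence cut at a boundary still appears whole in at least one chunk.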
2. Embed — Turn text into vectors

Use an embedding model to convert each chunk into a dense vector (e.g., 1536 dimensions). Semantically similar chunks will have similar vectors, enabling similarity search.

  • Embedding models: BAAI/bge-small, OpenAI text-embedding-3-small, Voyage AI embeddings
  • Dimension: typically 256–1536. Larger models produce larger vectors: higher quality, but slower search and more storage
  • Metric: Most systems use cosine similarity for retrieval
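Cosine similarity itself is simple to compute; a sketch with toy 3-dimensional vectors (real embeddings come from the model and have hundreds of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar meaning -> similar direction
refund   = [0.9, 0.1, 0.0]
returns  = [0.8, 0.2, 0.1]
shipping = [0.0, 0.1, 0.9]

print(cosine(refund, returns) > cosine(refund, shipping))  # -> True
```

The semantically related pair scores near 1.0 while the unrelated pair scores near 0, which is exactly what makes similarity search work.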
3. Store — Index in a vector database

Store the chunks + embeddings in a vector database. The DB builds an index (typically HNSW or IVF) for fast approximate nearest neighbor (ANN) search.
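What HNSW and IVF approximate is exact nearest-neighbor search. For a small corpus, brute force is fine and makes the target behavior concrete; a sketch (the function name is illustrative):

```python
import math

def brute_force_search(query: list[float],
                       corpus: dict[str, list[float]],
                       k: int = 3) -> list[str]:
    """Exact k-NN by cosine similarity -- the result ANN indexes approximate."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    ranked = sorted(corpus, key=lambda doc_id: cosine(query, corpus[doc_id]),
                    reverse=True)
    return ranked[:k]

corpus = {'refund': [0.9, 0.1], 'shipping': [0.1, 0.9], 'support': [0.5, 0.5]}
print(brute_force_search([1.0, 0.0], corpus, k=2))  # -> ['refund', 'support']
```

Brute force is O(n) per query; ANN indexes trade a small amount of accuracy for sub-linear search over millions of vectors.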

4. Verify — Test the index

Query the index with test questions to ensure relevant chunks are retrievable. Bad chunking or embedding models lead to garbage retrieval, which no LLM can fix.

  • Test retrieval quality before going live
  • Identify edge cases (ambiguous queries, rare terms)
  • Iterate on chunking strategy if needed
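One concrete way to test retrieval, assuming a `retrieve(query, k)` function and a few hand-labeled (query, relevant chunk id) pairs (all names here are hypothetical), is hit rate at k — the fraction of queries whose gold chunk appears in the top-k results:

```python
# Hypothetical retrieval smoke test: each case pairs a query with the
# chunk id a human judged relevant.

def hit_rate_at_k(retrieve, test_cases: list[tuple[str, str]], k: int = 3) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k results."""
    hits = sum(1 for query, gold_id in test_cases
               if gold_id in retrieve(query, k))
    return hits / len(test_cases)

# Stand-in retriever over a toy keyword index (a real one queries the vector DB)
index = {'refund': 'doc0', 'shipping': 'doc1', 'support': 'doc2'}
def retrieve(query: str, k: int) -> list[str]:
    return [doc for word, doc in index.items() if word in query.lower()][:k]

cases = [('What is the refund policy?', 'doc0'),
         ('How long does shipping take?', 'doc1'),
         ('Who fixes login problems?', 'doc2')]
print(hit_rate_at_k(retrieve, cases))  # 2 of 3 hit -> ~0.67
```

The third query misses because its wording shares no terms with the index — exactly the kind of edge case (paraphrase, rare term) this test is meant to surface before going live.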
⚠️ RAG quality is determined before any model call — by how you chunk, what metadata you store, and how you retrieve. Fix the data pipeline first, prompt engineering second.
03 — Online Query

The Retrieval Pipeline

Retrieval is the online phase: it runs on every user query. The goal is to find the top-k most relevant chunks and send them to the LLM.

The Core Flow

User query: "What is the return policy?"

STEP 1 — Embed the query
  query_vector = embed("What is the return policy?")

STEP 2 — Search the index
  top_chunks = vector_db.search(query_vector, k=3)
  Results might be:
  1. "Returns are accepted within 30 days of purchase..."
  2. "No questions asked 30-day return policy."
  3. "Damaged items: full refund within 30 days."

STEP 3 — Format context
  context = "\n".join([chunk1, chunk2, chunk3])

STEP 4 — Prompt the LLM
  prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

STEP 5 — Generate answer
  The LLM reads the context and question and outputs a grounded answer.

Key Design Decisions

How many chunks to retrieve (k)? Trade-off between context breadth and token budget. Typical: k=3 to k=10. More chunks = more context = slower + more cost, but lower risk of missing relevant info.

Embedding consistency: The query and corpus must be embedded with the same model. Mismatched models = broken retrieval.

Context window limits: Your LLM has a max token limit. Calculate: tokens(context) + tokens(query) + tokens(answer) must fit. For GPT-4 with a 128k-token window and 10 chunks of 300 tokens each, the retrieved context uses only ~3k tokens, leaving roughly 125k for the question and the answer.
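The budget check is simple arithmetic; a sketch assuming a rough tokens-per-word estimate (a real system would count with the model's actual tokenizer, and the function name is illustrative):

```python
def fits_context(chunks: list[str], query: str, max_answer_tokens: int,
                 context_window: int, tokens_per_word: float = 1.3) -> bool:
    """Rough budget check: estimated context + query + answer must fit the window."""
    est = lambda text: int(len(text.split()) * tokens_per_word)
    used = sum(est(c) for c in chunks) + est(query) + max_answer_tokens
    return used <= context_window

chunks = ['word ' * 230] * 10          # ten chunks of roughly 300 tokens each
ok = fits_context(chunks, 'What is the return policy?', 1024, 128_000)
print(ok)  # ~4k of 128k tokens used -> True
```

Running the same check against a small 4k-token window would fail, which is when you reduce k or compress the retrieved context.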

Best practice: Start with k=3, measure retrieval quality (are the top-3 relevant?), then tune k upward if recall is low. Most systems work well with k ≤ 10.
04 — Working Example

Minimal Code Example

Here is a complete, working RAG system in Python. Install dependencies with pip install anthropic chromadb sentence-transformers.

# Minimal RAG pipeline: embed -> store -> retrieve -> answer
# pip install anthropic chromadb sentence-transformers
from sentence_transformers import SentenceTransformer
import chromadb
from anthropic import Anthropic

embed_model = SentenceTransformer('BAAI/bge-small-en-v1.5')
col = chromadb.Client().get_or_create_collection('docs')
client = Anthropic()

# Indexing (offline, run once per corpus update)
docs = [
    'Our refund policy is 30 days, no questions asked.',
    'Shipping takes 3-5 business days to the US.',
    'Contact support at help@example.com for account issues.',
]
embeddings = embed_model.encode(docs).tolist()
col.add(documents=docs, embeddings=embeddings,
        ids=[f'doc{i}' for i in range(len(docs))])

# Retrieval + Generation (online, per query)
def rag(question: str, k: int = 2) -> str:
    q_emb = embed_model.encode([question]).tolist()
    results = col.query(query_embeddings=q_emb, n_results=k)
    context = '\n'.join(results['documents'][0])
    resp = client.messages.create(
        model='claude-haiku-4-5-20251001',
        max_tokens=256,
        messages=[{'role': 'user',
                   'content': f'Context:\n{context}\n\nQuestion: {question}\nAnswer briefly:'}],
    )
    return resp.content[0].text

print(rag('What is the return policy?'))
# Output: "Our refund policy is 30 days, no questions asked."


05 — Architecture Decision

RAG vs. Fine-Tuning: When to Use Which

Both RAG and fine-tuning add knowledge to an LLM. They're not mutually exclusive — you can do both. But they solve different problems and have different trade-offs.

RAG: Retrieval-Augmented Generation

You give the LLM external knowledge at query time via retrieved documents.

Fine-Tuning: Adapt the Model Weights

You update the LLM's weights on domain-specific examples. Knowledge becomes part of the model.

Criterion | RAG | Fine-Tuning | Use RAG if...
Knowledge updates | Real-time (add documents) | Slow (retrain model) | Knowledge changes weekly or daily
Cost | Low (retrieval + generation) | High (GPUs + storage) | Budget is tight
Hallucination | Low (grounded in docs) | Higher (model can confabulate) | Accuracy is critical
Source attribution | Easy (cite retrieved doc) | Impossible (weights are opaque) | Users need to verify sources
Rare/edge cases | Works if doc is in corpus | Works if model saw example | Your docs cover edge cases
Task complexity | Simple Q&A, summarization | Complex reasoning, style | Tasks are straightforward
⚠️ Hybrid approach: The best systems often use both. Use RAG for facts and citations. Use fine-tuning for style, tone, and domain-specific reasoning patterns. Fine-tune on top of a retrieval-augmented prompt.
06 — Deeper Dives

What to Explore Next

The concepts on this page are the foundation. As you build RAG systems, you'll encounter more advanced topics:

  • Embeddings — Dense vector representations for semantic similarity. Learn how to choose embedding models, understand dimensions, and why embedding quality drives retrieval quality.
  • Vector Databases — Specialized databases for approximate nearest neighbor search. Understand HNSW, IVF, scaling, and when to use Pinecone vs. Weaviate vs. self-hosted FAISS.
  • Retrieval Techniques — Beyond simple embedding similarity: dense, sparse (BM25), and hybrid retrieval. Learn RRF fusion, metadata filtering, and query rewriting.
  • Post-Retrieval Processing — Improve retrieved context quality: reranking, context compression, and managing token limits while preserving relevance.
  • Advanced RAG — Multi-hop retrieval, agentic RAG, graph-based retrieval, and iterative reasoning loops, for complex research and reasoning tasks.
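Of these techniques, Reciprocal Rank Fusion (RRF) is simple enough to sketch right away: it merges the ranked lists from multiple retrievers (say, dense search and BM25) by scoring each document 1/(c + rank) in every list it appears in and summing. A minimal sketch with toy ranked lists:

```python
def rrf_fuse(rankings: list[list[str]], c: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each doc 1/(c + rank) per list, sum, sort."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ['doc2', 'doc1', 'doc5']   # from embedding similarity
sparse_hits = ['doc1', 'doc4', 'doc2']   # from BM25 keyword match
print(rrf_fuse([dense_hits, sparse_hits]))  # -> ['doc1', 'doc2', 'doc4', 'doc5']
```

doc1 wins because it ranks highly in both lists; the constant c (60 is the value from the original RRF paper) dampens the advantage of any single top rank.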
07 — Further Reading


Learning Path

RAG touches embeddings, vector search, chunking, and LLM prompting. Here's the recommended sequence:

  • Embeddings — turn text → vectors
  • Vector Search — FAISS / Weaviate
  • Chunking — split documents
  • RAG Pipeline — you are here
  • Advanced RAG — rerank + hybrid
  • RAG Eval — RAGAS / TruLens
1. Understand embeddings first

RAG only makes sense once you know how text becomes a vector and why similar meaning → similar vector. Start with Embeddings.

2. Build a basic RAG pipeline

Use LangChain or LlamaIndex to index a PDF and answer questions over it. Get something working before optimizing. Target: <1 hour to a demo.

3. Measure before you tune

Run RAGAS or LlamaIndex evaluations on your pipeline. Context recall and answer faithfulness tell you where it breaks before you guess at fixes.

4. Apply Advanced RAG techniques selectively

Hybrid search, reranking, and HyDE each help in specific failure modes. Fix the failure mode you measured, not the one you assume.

5. Consider alternatives at scale

If your whole corpus fits in a 200K context window, try prompt stuffing before maintaining a vector index. See Frontier Implications.