RETRIEVAL-AUGMENTED GENERATION

Advanced RAG Patterns

Beyond naive chunking — reranking, fusion, multi-hop, and agentic retrieval for production systems

  • retrieve → rerank → generate: the full pipeline
  • MRR & NDCG: retrieval quality metrics
  • naive vs. agentic: the spectrum
Contents
  1. Why naive RAG fails
  2. Query transformation
  3. Chunking strategies
  4. Reranking
  5. Fusion & hybrid search
  6. Multi-hop & agentic RAG
  7. Evaluation framework
01 — Foundation

Why Naive RAG Fails

Naive RAG follows a simple linear pattern: chunk document → embed → store in vector DB → retrieve top-k → send to LLM. This works for straightforward factual Q&A, but production systems expose critical failure modes.

The naive assumption is that one query → one retrieval → answer. But real-world tasks rarely work this way. Users ask multi-hop questions, the best answer spans multiple documents, or the user's phrasing drifts far from the actual concept in your documents.

Failure Modes

  • Poor recall: relevant chunks are missed because semantic drift in embeddings leaves them with low dot-product similarity to the query.
  • Poor precision: top-k retrieval floods the context with borderline-relevant chunks that confuse the LLM.
  • Lost context: fixed chunk sizes break paragraphs, splitting concepts across boundaries.
  • Context window overload: sending too many chunks to the LLM increases latency and confusion.

| Failure | Cause | Fix |
|---|---|---|
| Missed relevant chunks | Semantic drift in embedding | HyDE, query expansion |
| Irrelevant chunks in context | Top-k too large, low precision | Reranking |
| Split concepts | Fixed chunk size breaks paragraphs | Semantic chunking |
| Stale answers | Vector DB not updated | Incremental indexing |
| Multi-hop needed | One chunk can't answer the question | Multi-hop / agentic RAG |
💡 Key insight: Naive RAG treats retrieval as a solved problem. Production RAG treats it as an optimization problem with many levers: query transformation, chunking, ranking, fusion, and iteration.
02 — Retrieval

Query Transformation

The user's query is not always the best form of the question for retrieval. Your embedding model was trained on diverse text; the user's question is just one phrasing. Query transformation bridges that gap by regenerating the query in forms more similar to your documents.

Transformation Methods

1. HyDE — Hypothetical Document Embeddings

Generate a hypothetical answer to the query, embed that → closer to document embeddings than the question itself.

  • LLM generates a plausible 150-token answer to the query
  • Embed the hypothetical answer
  • Retrieve against document corpus
  • Documents about the topic appear more similar to hypothetical answers than to questions
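The steps above can be sketched end to end. Everything here is a toy stand-in: `fake_llm_hypothetical_answer` replaces the real LLM call, and `embed` is a bag-of-words embedder over a five-word vocabulary, so only the pipeline shape carries over to a real system.

```python
import math

# Toy embedder over a tiny vocabulary; a real system would use a
# sentence-embedding model. Vectors are L2-normalized so dot product = cosine.
VOCAB = ["attention", "context", "memory", "softmax", "failure"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def fake_llm_hypothetical_answer(query: str) -> str:
    # Placeholder for the LLM call that writes a ~150-token hypothetical answer.
    return "attention softmax failure at long context because memory"

def hyde_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Embed the hypothetical answer instead of the raw question.
    hypo_vec = embed(fake_llm_hypothetical_answer(query))
    return sorted(corpus, key=lambda d: cosine(hypo_vec, embed(d)), reverse=True)[:k]

corpus = [
    "softmax attention degrades at long context lengths",
    "what is attention in transformers",
    "memory usage of attention grows quadratically",
]
print(hyde_retrieve("Why does attention fail at long contexts?", corpus))
```

Because the hypothetical answer is statement-shaped, the statement-style documents score highest, while the question-style "what is attention" document drops in rank.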
2. Multi-Query — Diversity via rephrasing

Generate 3–5 rephrasings of the question, retrieve for each, union results.

  • LLM rewrites question in different styles and perspectives
  • Retrieve for each rephrasing independently
  • Merge result lists (union or RRF)
  • Higher recall at cost of multiple retrievals
3. Step-Back Prompting — Hierarchical retrieval

Ask a more general background question first, retrieve that, then answer the original.

  • Generate abstract version of question first
  • Retrieve background context on the general topic
  • Then retrieve specifics for the original question
  • Useful for complex technical questions
4. Query Decomposition — Sub-question fusion

Break complex question into sub-questions, answer each independently, synthesize.

  • Decompose query into atomic sub-questions
  • Retrieve and answer each sub-question
  • LLM synthesizes answers into final response
  • Best for relational and multi-aspect questions
User query: "What causes transformer attention to fail at long contexts?"
HyDE step: the LLM generates a hypothetical answer paragraph (~150 tokens): "Transformer attention fails at long contexts because..."
Embed the hypothetical answer.
Result: documents about attention and long context appear more similar to the hypothetical answer than to the question itself. (Embedding the question directly might instead retrieve "what is attention?" content.)
HyDE works best when: Your queries and documents live in different embedding spaces — questions vs. statements. If your corpus is all technical papers and users ask conversational questions, HyDE closes that gap.
03 — Indexing

Chunking Strategies

Chunk size and overlap are not hyperparameters to tune blindly — they depend entirely on document type, query length, and retrieval precision goals. Larger chunks preserve context but reduce precision; smaller chunks improve precision but lose coherence.

Chunking Methods

1. Fixed-Size Chunking — Simple baseline

Split every N tokens with M token overlap. Fast. Breaks mid-sentence. Good baseline.

  • Implementation: split_into_chunks(text, 512, 100)
  • Pros: Simple, deterministic, fast
  • Cons: Breaks mid-sentence, loses paragraph structure
  • Best for: Quick prototypes, diverse document types
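A minimal version of the `split_into_chunks(text, 512, 100)` baseline mentioned above. It counts whitespace-separated words rather than model tokens; a real implementation would count tokens with the tokenizer of your embedding model.

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    """Fixed-size chunking: every `chunk_size` tokens, with `overlap` tokens
    carried over between consecutive chunks."""
    assert overlap < chunk_size
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks

doc = " ".join(f"tok{i}" for i in range(1200))
print(len(split_into_chunks(doc)))  # → 3
```

Note the stride is `chunk_size - overlap`, so each chunk repeats the last 100 tokens of its predecessor; that redundancy is the price of not splitting a concept exactly at a boundary.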
2. Semantic Chunking — Respects boundaries

Split at sentence/paragraph boundaries, merge until chunk reaches target size. Better coherence.

  • Split on sentence/paragraph boundaries first
  • Merge sentences until the chunk approaches the target size
  • Preserves semantic units
  • Slightly more complex but cleaner chunks
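A sketch of the split-then-merge approach, assuming a naive regex sentence splitter (production systems would use a proper sentence segmenter, and would count model tokens rather than words):

```python
import re

def semantic_chunks(text: str, target_tokens: int = 100) -> list[str]:
    """Split on sentence boundaries first, then greedily merge whole sentences
    until a chunk approaches the target size. Sentences are never split."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))  # flush before overflowing the target
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

print(semantic_chunks("RAG is great. Chunking matters a lot. Overlap helps too.", target_tokens=8))
# → ['RAG is great. Chunking matters a lot.', 'Overlap helps too.']
```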
3. Hierarchical (Parent-Child) — Best of both

Index small chunks for retrieval precision; fetch parent chunk for context. High quality at retrieval + context time.

  • Small child chunks (e.g., 256 tokens) for retrieval
  • Retrieve child, fetch parent (e.g., 1024 tokens) for context
  • High precision on retrieval, full context in generation
  • Requires document structure awareness
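The parent-child mapping is just a lookup table built at index time. In this sketch a toy word-overlap scorer stands in for embedding retrieval over the children, and chunks are keyed by their text for simplicity:

```python
def build_parent_child_index(doc: str, parent_size: int = 1024, child_size: int = 256):
    """Index small child chunks for retrieval; map each child back to the larger
    parent chunk that gets sent to the LLM. Sizes are in whitespace tokens."""
    tokens = doc.split()
    children, child_to_parent = [], {}
    for p in range(0, len(tokens), parent_size):
        parent = " ".join(tokens[p:p + parent_size])
        for c in range(p, min(p + parent_size, len(tokens)), child_size):
            child = " ".join(tokens[c:c + child_size])
            children.append(child)
            child_to_parent[child] = parent
    return children, child_to_parent

def retrieve_with_parent(query: str, children: list[str], child_to_parent: dict):
    # Toy scorer: word overlap with the child; real systems embed the children.
    score = lambda chunk: len(set(query.lower().split()) & set(chunk.lower().split()))
    best_child = max(children, key=score)
    return child_to_parent[best_child]  # high-precision match, full-context return
```

The retrieval hit is scored against the small child, but the caller receives the whole parent, which is exactly the precision-at-retrieval / context-at-generation split the pattern promises.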
4. Late Chunking — Preserve cross-chunk context

Embed full document first, then pool embeddings per chunk. Preserves cross-chunk context in embeddings.

  • Embed full document as sequence
  • Pool embeddings by chunk boundary
  • Highest quality embeddings
  • Compute-intensive, requires full document in VRAM
| Method | Retrieval precision | Context coherence | Complexity | Best for |
|---|---|---|---|---|
| Fixed-size | Medium | Low | Low | Quick prototypes |
| Semantic | Medium-high | High | Medium | General docs |
| Parent-child | High | High | Medium | Long documents |
| Late chunking | High | Highest | High | Technical docs, code |
04 — Ranking

Reranking

Vector similarity is a fast but imprecise proxy for relevance. Reranking solves this: after retrieving top-k candidates (e.g., 50), score each against the query using a cross-encoder → re-sort → send top-n (e.g., 5) to LLM. This two-stage approach combines speed and quality.
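The two-stage shape can be shown with toy scorers. Both scoring functions here are deliberate stand-ins so the example runs anywhere: in production, stage 1 would be vector search (e.g. FAISS) and stage 2 a cross-encoder such as BGE-Reranker reading query and passage jointly.

```python
def bi_encoder_score(query: str, doc: str) -> float:
    # Stage 1 stand-in: cheap set-overlap score; coarse but fast.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def cross_encoder_score(query: str, doc: str) -> float:
    # Stage 2 stand-in: overlap normalized by passage length, so a focused
    # passage outranks a rambling one even when stage-1 scores tie.
    q, d = set(query.lower().split()), doc.lower().split()
    return sum(1 for w in d if w in q) / len(d)

def retrieve_then_rerank(query: str, corpus: list[str], k: int = 50, n: int = 5) -> list[str]:
    """Retrieve top-k cheaply, then re-sort those k with the expensive scorer."""
    candidates = sorted(corpus, key=lambda d: bi_encoder_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)[:n]

corpus = [
    "rlhf reward instability",
    "rlhf reward instability appears during ppo training with kl penalties",
    "cooking recipes for beginners",
]
print(retrieve_then_rerank("rlhf reward instability", corpus, k=2, n=1))
# → ['rlhf reward instability']
```

The expensive scorer only ever sees k candidates, never the whole corpus, which is why cross-encoders are viable at this stage despite being too slow for first-pass retrieval.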

Retrieval Models Comparison

  • Bi-encoder (FAISS): embeds query and passages separately, in parallel. Very fast initial retrieval but coarse similarity.
  • ColBERT: computes token-level embeddings for both, with max-sim pooling. Faster than a cross-encoder, better than a bi-encoder.
  • Cross-encoder: sees (query, passage) together, so it is much more accurate, but too slow for large corpora. Perfect for reranking.

| Model type | Speed | Quality | Can retrieve from large corpus? |
|---|---|---|---|
| Bi-encoder (FAISS) | Very fast | Good | Yes |
| ColBERT | Fast | Better | Yes (with PLAID) |
| Cross-encoder | Slow | Best | No (rerank only) |

Reranking Tools

  • Cohere Rerank (reranker): API-based cross-encoder reranker with semantic understanding.
  • BGE-Reranker (reranker): open-source multilingual cross-encoder from BAAI.
  • Jina Reranker (reranker): zero-shot reranker optimized for speed and quality.
  • FlashRank (reranker): lightweight reranker for edge deployment.
  • ColBERT / RAGatouille (dense retriever): token-level dense retrieval with late-interaction scoring.
⚠️ Reranking latency: Cross-encoder reranking adds ~100–500ms per query depending on corpus size. For interactive systems, test whether precision gains justify latency trade-off.
05 — Fusion

Fusion and Hybrid Search

Dense retrieval (embeddings) excels at semantic similarity but misses keyword-exact matches. Sparse retrieval (BM25) catches exact matches but misses semantic similarity. Hybrid search combines both rankings via fusion — typically Reciprocal Rank Fusion (RRF).

RRF (Reciprocal Rank Fusion) is a parameter-free fusion method: RRF_score = Σ 1/(k + rank_i). It's simple, robust, and works well in practice. No need to normalize scores or tune weights.

Query: "RLHF instability"
BM25 results: [doc_A rank 1, doc_C rank 2, doc_B rank 3]
Dense results: [doc_B rank 1, doc_A rank 2, doc_D rank 3]
RRF (k=60):
  doc_A: 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 = 0.0325
  doc_B: 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
  doc_C: 1/(60+2) = 0.0161
  doc_D: 1/(60+3) = 0.0159
Final order: doc_A > doc_B > doc_C > doc_D
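RRF is only a few lines of code; this minimal version reproduces the worked example above:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_A", "doc_C", "doc_B"]
dense = ["doc_B", "doc_A", "doc_D"]
print(rrf([bm25, dense]))  # → ['doc_A', 'doc_B', 'doc_C', 'doc_D']
```

Because only ranks matter, BM25 scores and cosine similarities never need to be put on a common scale, which is what makes RRF parameter-free in practice (k = 60 is the conventional default).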
⚠️ For domain-specific corpora: With precise terminology (legal, medical, code), BM25 often outperforms pure dense retrieval. Always test hybrid before committing to dense-only.
06 — Reasoning

Multi-Hop and Agentic RAG

Some questions require chaining multiple retrievals: "What did the author of X say about Y?" You must retrieve X, identify the author, then retrieve on the author's name. Multi-hop RAG handles this by iterating: retrieve → read → formulate follow-up query → retrieve again → synthesize.

Agentic RAG goes further: the LLM decides when to retrieve, what to query, whether to search again, whether to use a calculator or code execution. Retrieval is just one tool in a larger decision-making loop.
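The iterate loop can be sketched as below. `retrieve` and `llm_step` are caller-supplied stubs here, and the (done, payload) protocol is one possible design, not a standard API; the scripted "LLM" follows the author-of-X example from above over a two-entry knowledge base.

```python
def multi_hop_answer(question, retrieve, llm_step, max_iterations=4):
    """Iterate retrieve → reason → follow-up. `llm_step` returns (done, payload):
    payload is the final answer when done, else the next retrieval query."""
    context = []
    query = question
    for _ in range(max_iterations):
        context.extend(retrieve(query))
        done, payload = llm_step(question, context)
        if done:
            return payload
        query = payload
    return payload  # iteration budget exhausted: return best effort

# Toy two-hop knowledge base and a scripted "LLM" that requests the hops.
kb = {
    "book X": ["book X was written by Jane"],
    "Jane": ["Jane said topic Y is overrated"],
}
retrieve = lambda q: kb.get(q, [])

def llm_step(question, context):
    if any("said" in c for c in context):
        return True, "Jane said topic Y is overrated"
    if any("written by Jane" in c for c in context):
        return False, "Jane"          # hop 2: query the author by name
    return False, "book X"            # hop 1: find the book first

print(multi_hop_answer("What did the author of X say about Y?", retrieve, llm_step))
# → Jane said topic Y is overrated
```

The `max_iterations` cap is the safety valve the latency warning below this section insists on: without it, a confused model can loop on retrieval indefinitely.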

Architectural Patterns

| Architecture | Retrieval calls | Complexity | Best for |
|---|---|---|---|
| Naive RAG | 1 | Low | Simple factual Q&A |
| Query-expanded RAG | 3–5 | Low | Better recall |
| Multi-hop RAG | 2–5 | Medium | Relational questions |
| Agentic RAG | Dynamic | High | Complex research tasks |

IRCoT (Interleaving Retrieval with Chain-of-Thought)

IRCoT retrieves at each reasoning step, not just at the start. The model reasons, decides it needs information, retrieves, reasons again. This is closer to how humans research — iterative reasoning with strategic lookups.

⚠️ Agentic RAG latency is unpredictable: Multiple retrieval calls and LLM reasoning loops compound latency. Always set max_iterations and timeouts to prevent runaway loops.
07 — Quality

Evaluation Framework

RAG quality has multiple dimensions: retrieval quality (did we get relevant documents?), answer quality (is the final answer correct?), grounding (is the answer supported by context?). A single metric doesn't capture all three.

Evaluation Tools & Frameworks

  • RAGAS (framework): metrics for faithfulness, answer relevancy, context precision/recall.
  • TruLens (framework): instrumentation for tracing and evaluating RAG chains.
  • LangChain Eval (framework): built-in evaluators for retrieval and generation quality.
  • DeepEval (framework): Python framework for synthetic test generation and RAG metrics.
  • Arize Phoenix (observability): production monitoring for RAG systems.
  • LlamaIndex Eval (framework): evaluation suite for RAG retrieval and generation.

RAGAS Metrics

  • Faithfulness: is the answer grounded in the retrieved context? (# claims supported by context / total claims)
  • Answer Relevancy: does the answer address the query?
  • Context Precision: what fraction of retrieved chunks are actually relevant? (# relevant chunks / k)
  • Context Recall: what fraction of ground-truth facts are covered by the retrieved context?

RAGAS metric formulas:
  Faithfulness = (# claims in answer supported by context) / (total claims)
  Context Precision = (# relevant chunks in top-k) / k
  Context Recall = (# ground-truth facts covered by context) / (total facts)
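The retrieval-side metrics reduce to simple ratios once relevance judgments exist. In RAGAS those judgments come from an LLM judge; this sketch takes them as plain sets and substring checks instead, so only the formulas themselves are illustrated.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of retrieved chunks that are actually relevant.
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], ground_truth_facts: list[str]) -> float:
    # Fraction of ground-truth facts covered somewhere in the retrieved context.
    context = " ".join(retrieved)
    return sum(1 for f in ground_truth_facts if f in context) / len(ground_truth_facts)

print(context_precision(["a", "b", "c", "d"], {"a", "c"}))          # → 0.5
print(context_recall(["chunk about attention"], ["attention", "softmax"]))  # → 0.5
```

Tracking the two together makes the trade-off in the best-practice note below concrete: shrinking k raises precision but can only lower recall.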
Best practice: Use RAGAS in continuous evaluation. Track each metric separately — precision and recall often trade off. A system with high context precision but low recall is missing important information.
08 — Further Reading


Learning Path

Advanced RAG is for when basic RAG already works but you need better accuracy, recall, or latency. Always start with basic RAG and measure before applying these techniques.

Basic RAG (working baseline) → Measure (RAGAS scores) → Diagnose (retrieval vs. generation?) → Hybrid Search (fix recall) → Reranking (fix precision) → Re-eval (did it help?)
1. Diagnose before you optimise

Use RAGAS or LlamaIndex evals to separate retrieval problems (low context recall) from generation problems (low faithfulness). Each requires different fixes.

2. Try hybrid search first

Add BM25 to your dense retrieval (Weaviate / Elasticsearch hybrid mode). If recall goes up, you had keyword-matchable queries. This is often the highest-ROI fix.

3. Add a reranker for precision

Cohere Rerank or BGE-Reranker takes your top-20 chunks and reorders them. Common pattern: retrieve 20, rerank to 5. Precision usually jumps 10–20%.

4. Contextual retrieval for large corpora

Anthropic's contextual retrieval prepends a short document summary to each chunk before embedding. Dramatically improves recall for large corpora (>100K chunks).