RETRIEVAL-AUGMENTED GENERATION

Advanced RAG Patterns

Beyond naive chunking — reranking, fusion, multi-hop, and agentic retrieval for production systems

  • retrieve → rerank → generate: the full pipeline
  • MRR & NDCG: retrieval quality metrics
  • naive vs. agentic: the spectrum
Contents
  1. Why naive RAG fails
  2. Query transformation
  3. Chunking strategies
  4. Reranking
  5. Fusion & hybrid search
  6. Multi-hop & agentic RAG
  7. Evaluation framework
01 — Foundation

Why Naive RAG Fails

Naive RAG follows a simple linear pattern: chunk document → embed → store in vector DB → retrieve top-k → send to LLM. This works for straightforward factual Q&A, but production systems expose critical failure modes.

The naive assumption is that one query → one retrieval → answer. But real-world tasks rarely work this way. Users ask multi-hop questions, the best answer spans multiple documents, or the user's phrasing drifts far from the actual concept in your documents.

Failure Modes

  • Poor recall: relevant chunks are missed because semantic drift in embeddings leaves them with low dot-product similarity to the query.
  • Poor precision: top-k retrieval floods the context with borderline-relevant chunks that confuse the LLM.
  • Lost context: fixed chunk sizes break paragraphs, splitting concepts across boundaries.
  • Context window overload: sending too many chunks to the LLM increases latency and confusion.

| Failure | Cause | Fix |
|---|---|---|
| Missed relevant chunks | Semantic drift in embedding | HyDE, query expansion |
| Irrelevant chunks in context | Top-k too large, low precision | Reranking |
| Split concepts | Fixed chunk size breaks paragraphs | Semantic chunking |
| Stale answers | Vector DB not updated | Incremental indexing |
| Multi-hop needed | One chunk can't answer the question | Multi-hop / agentic RAG |
💡 Key insight: Naive RAG treats retrieval as a solved problem. Production RAG treats it as an optimization problem with many levers: query transformation, chunking, ranking, fusion, and iteration.
02 — Retrieval

Query Transformation

The user's query is not always the best form of the question for retrieval. Your embedding model was trained on diverse text; the user's question is just one phrasing. Query transformation bridges that gap by regenerating the query in forms more similar to your documents.

Transformation Methods

1. HyDE — Hypothetical Document Embeddings

Generate a hypothetical answer to the query, embed that → closer to document embeddings than the question itself.

  • LLM generates a plausible 150-token answer to the query
  • Embed the hypothetical answer
  • Retrieve against document corpus
  • Documents about the topic appear more similar to hypothetical answers than to questions
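The steps above can be sketched end to end. Everything here is a toy stand-in: `fake_llm_hypothetical_answer` replaces the real LLM call, and `embed` is a bag-of-words embedder over a five-word vocabulary, so only the pipeline shape carries over to a real system.

```python
import math

# Toy embedder over a tiny vocabulary; a real system would use a
# sentence-embedding model. Vectors are L2-normalized so dot product = cosine.
VOCAB = ["attention", "context", "memory", "softmax", "failure"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def fake_llm_hypothetical_answer(query: str) -> str:
    # Placeholder for the LLM call that writes a ~150-token hypothetical answer.
    return "attention softmax failure at long context because memory"

def hyde_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Embed the hypothetical answer instead of the raw question.
    hypo_vec = embed(fake_llm_hypothetical_answer(query))
    return sorted(corpus, key=lambda d: cosine(hypo_vec, embed(d)), reverse=True)[:k]

corpus = [
    "softmax attention degrades at long context lengths",
    "what is attention in transformers",
    "memory usage of attention grows quadratically",
]
print(hyde_retrieve("Why does attention fail at long contexts?", corpus))
```

Because the hypothetical answer is statement-shaped, the statement-style documents score highest, while the question-style "what is attention" document drops in rank.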
2. Multi-Query — Diversity via rephrasing

Generate 3–5 rephrasings of the question, retrieve for each, union results.

  • LLM rewrites question in different styles and perspectives
  • Retrieve for each rephrasing independently
  • Merge result lists (union or RRF)
  • Higher recall at cost of multiple retrievals
3. Step-Back Prompting — Hierarchical retrieval

Ask a more general background question first, retrieve that, then answer the original.

  • Generate abstract version of question first
  • Retrieve background context on the general topic
  • Then retrieve specifics for the original question
  • Useful for complex technical questions
4. Query Decomposition — Sub-question fusion

Break complex question into sub-questions, answer each independently, synthesize.

  • Decompose query into atomic sub-questions
  • Retrieve and answer each sub-question
  • LLM synthesizes answers into final response
  • Best for relational and multi-aspect questions
User query: "What causes transformer attention to fail at long contexts?"
HyDE step: the LLM generates a hypothetical answer paragraph (~150 tokens): "Transformer attention fails at long contexts because..."
Embed the hypothetical answer.
Result: documents about attention and long context appear more similar to the hypothetical answer than to the question itself. (Embedding the question directly might instead retrieve "what is attention?" content.)
HyDE works best when: Your queries and documents live in different embedding spaces — questions vs. statements. If your corpus is all technical papers and users ask conversational questions, HyDE closes that gap.
03 — Indexing

Chunking Strategies

Chunk size and overlap are not hyperparameters to tune blindly — they depend entirely on document type, query length, and retrieval precision goals. Larger chunks preserve context but reduce precision; smaller chunks improve precision but lose coherence.

Chunking Methods

1. Fixed-Size Chunking — Simple baseline

Split every N tokens with M token overlap. Fast. Breaks mid-sentence. Good baseline.

  • Implementation: split_into_chunks(text, 512, 100)
  • Pros: Simple, deterministic, fast
  • Cons: Breaks mid-sentence, loses paragraph structure
  • Best for: Quick prototypes, diverse document types
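A minimal version of the `split_into_chunks(text, 512, 100)` baseline mentioned above. It counts whitespace-separated words rather than model tokens; a real implementation would count tokens with the tokenizer of your embedding model.

```python
def split_into_chunks(text: str, chunk_size: int = 512, overlap: int = 100) -> list[str]:
    """Fixed-size chunking: every `chunk_size` tokens, with `overlap` tokens
    carried over between consecutive chunks."""
    assert overlap < chunk_size
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks

doc = " ".join(f"tok{i}" for i in range(1200))
print(len(split_into_chunks(doc)))  # → 3
```

Note the stride is `chunk_size - overlap`, so each chunk repeats the last 100 tokens of its predecessor; that redundancy is the price of not splitting a concept exactly at a boundary.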
2. Semantic Chunking — Respects boundaries

Split at sentence/paragraph boundaries, merge until chunk reaches target size. Better coherence.

  • Split on sentence/paragraph boundaries first
  • Merge sentences until the chunk approaches the target size
  • Preserves semantic units
  • Slightly more complex but cleaner chunks
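A sketch of the split-then-merge approach, assuming a naive regex sentence splitter (production systems would use a proper sentence segmenter, and would count model tokens rather than words):

```python
import re

def semantic_chunks(text: str, target_tokens: int = 100) -> list[str]:
    """Split on sentence boundaries first, then greedily merge whole sentences
    until a chunk approaches the target size. Sentences are never split."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > target_tokens:
            chunks.append(" ".join(current))  # flush before overflowing the target
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

print(semantic_chunks("RAG is great. Chunking matters a lot. Overlap helps too.", target_tokens=8))
# → ['RAG is great. Chunking matters a lot.', 'Overlap helps too.']
```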
3. Hierarchical (Parent-Child) — Best of both

Index small chunks for retrieval precision; fetch parent chunk for context. High quality at retrieval + context time.

  • Small child chunks (e.g., 256 tokens) for retrieval
  • Retrieve child, fetch parent (e.g., 1024 tokens) for context
  • High precision on retrieval, full context in generation
  • Requires document structure awareness
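The parent-child mapping is just a lookup table built at index time. In this sketch a toy word-overlap scorer stands in for embedding retrieval over the children, and chunks are keyed by their text for simplicity:

```python
def build_parent_child_index(doc: str, parent_size: int = 1024, child_size: int = 256):
    """Index small child chunks for retrieval; map each child back to the larger
    parent chunk that gets sent to the LLM. Sizes are in whitespace tokens."""
    tokens = doc.split()
    children, child_to_parent = [], {}
    for p in range(0, len(tokens), parent_size):
        parent = " ".join(tokens[p:p + parent_size])
        for c in range(p, min(p + parent_size, len(tokens)), child_size):
            child = " ".join(tokens[c:c + child_size])
            children.append(child)
            child_to_parent[child] = parent
    return children, child_to_parent

def retrieve_with_parent(query: str, children: list[str], child_to_parent: dict):
    # Toy scorer: word overlap with the child; real systems embed the children.
    score = lambda chunk: len(set(query.lower().split()) & set(chunk.lower().split()))
    best_child = max(children, key=score)
    return child_to_parent[best_child]  # high-precision match, full-context return
```

The retrieval hit is scored against the small child, but the caller receives the whole parent, which is exactly the precision-at-retrieval / context-at-generation split the pattern promises.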
4. Late Chunking — Preserve cross-chunk context

Embed full document first, then pool embeddings per chunk. Preserves cross-chunk context in embeddings.

  • Embed full document as sequence
  • Pool embeddings by chunk boundary
  • Highest quality embeddings
  • Compute-intensive, requires full document in VRAM
| Method | Retrieval precision | Context coherence | Complexity | Best for |
|---|---|---|---|---|
| Fixed-size | Medium | Low | Low | Quick prototypes |
| Semantic | Medium-high | High | Medium | General docs |
| Parent-child | High | High | Medium | Long documents |
| Late chunking | High | Highest | High | Technical docs, code |
04 — Ranking

Reranking

Vector similarity is a fast but imprecise proxy for relevance. Reranking solves this: after retrieving top-k candidates (e.g., 50), score each against the query using a cross-encoder → re-sort → send top-n (e.g., 5) to LLM. This two-stage approach combines speed and quality.
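The two-stage shape can be shown with toy scorers. Both scoring functions here are deliberate stand-ins so the example runs anywhere: in production, stage 1 would be vector search (e.g. FAISS) and stage 2 a cross-encoder such as BGE-Reranker reading query and passage jointly.

```python
def bi_encoder_score(query: str, doc: str) -> float:
    # Stage 1 stand-in: cheap set-overlap score; coarse but fast.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def cross_encoder_score(query: str, doc: str) -> float:
    # Stage 2 stand-in: overlap normalized by passage length, so a focused
    # passage outranks a rambling one even when stage-1 scores tie.
    q, d = set(query.lower().split()), doc.lower().split()
    return sum(1 for w in d if w in q) / len(d)

def retrieve_then_rerank(query: str, corpus: list[str], k: int = 50, n: int = 5) -> list[str]:
    """Retrieve top-k cheaply, then re-sort those k with the expensive scorer."""
    candidates = sorted(corpus, key=lambda d: bi_encoder_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: cross_encoder_score(query, d), reverse=True)[:n]

corpus = [
    "rlhf reward instability",
    "rlhf reward instability appears during ppo training with kl penalties",
    "cooking recipes for beginners",
]
print(retrieve_then_rerank("rlhf reward instability", corpus, k=2, n=1))
# → ['rlhf reward instability']
```

The expensive scorer only ever sees k candidates, never the whole corpus, which is why cross-encoders are viable at this stage despite being too slow for first-pass retrieval.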

Retrieval Models Comparison

  • Bi-encoder (FAISS): embeds query and passages separately, in parallel. Very fast initial retrieval but coarse similarity.
  • ColBERT: computes token-level embeddings for both, with max-sim pooling. Faster than a cross-encoder, better than a bi-encoder.
  • Cross-encoder: sees (query, passage) together, so it is much more accurate, but too slow for large corpora. Perfect for reranking.

| Model type | Speed | Quality | Can retrieve from large corpus? |
|---|---|---|---|
| Bi-encoder (FAISS) | Very fast | Good | Yes |
| ColBERT | Fast | Better | Yes (with PLAID) |
| Cross-encoder | Slow | Best | No (rerank only) |

Reranking Tools

  • Cohere Rerank (reranker): API-based cross-encoder reranker with semantic understanding.
  • BGE-Reranker (reranker): open-source multilingual cross-encoder from BAAI.
  • Jina Reranker (reranker): zero-shot reranker optimized for speed and quality.
  • FlashRank (reranker): lightweight reranker for edge deployment.
  • ColBERT / RAGatouille (dense retriever): token-level dense retrieval with late-interaction scoring.
⚠️ Reranking latency: Cross-encoder reranking adds ~100–500ms per query depending on corpus size. For interactive systems, test whether precision gains justify latency trade-off.
05 — Fusion

Fusion and Hybrid Search

Dense retrieval (embeddings) excels at semantic similarity but misses keyword-exact matches. Sparse retrieval (BM25) catches exact matches but misses semantic similarity. Hybrid search combines both rankings via fusion — typically Reciprocal Rank Fusion (RRF).

RRF (Reciprocal Rank Fusion) is a parameter-free fusion method: RRF_score = Σ 1/(k + rank_i). It's simple, robust, and works well in practice. No need to normalize scores or tune weights.

Query: "RLHF instability"
BM25 results: [doc_A rank 1, doc_C rank 2, doc_B rank 3]
Dense results: [doc_B rank 1, doc_A rank 2, doc_D rank 3]
RRF (k=60):
  doc_A: 1/(60+1) + 1/(60+2) = 0.0164 + 0.0161 = 0.0325
  doc_B: 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
  doc_C: 1/(60+2) = 0.0161
  doc_D: 1/(60+3) = 0.0159
Final order: doc_A > doc_B > doc_C > doc_D
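RRF is only a few lines of code; this minimal version reproduces the worked example above:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc_A", "doc_C", "doc_B"]
dense = ["doc_B", "doc_A", "doc_D"]
print(rrf([bm25, dense]))  # → ['doc_A', 'doc_B', 'doc_C', 'doc_D']
```

Because only ranks matter, BM25 scores and cosine similarities never need to be put on a common scale, which is what makes RRF parameter-free in practice (k = 60 is the conventional default).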
⚠️ For domain-specific corpora: With precise terminology (legal, medical, code), BM25 often outperforms pure dense retrieval. Always test hybrid before committing to dense-only.
06 — Reasoning

Multi-Hop and Agentic RAG

Some questions require chaining multiple retrievals: "What did the author of X say about Y?" You must retrieve X, identify the author, then retrieve on the author's name. Multi-hop RAG handles this by iterating: retrieve → read → formulate follow-up query → retrieve again → synthesize.

Agentic RAG goes further: the LLM decides when to retrieve, what to query, whether to search again, whether to use a calculator or code execution. Retrieval is just one tool in a larger decision-making loop.
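The iterate loop can be sketched as below. `retrieve` and `llm_step` are caller-supplied stubs here, and the (done, payload) protocol is one possible design, not a standard API; the scripted "LLM" follows the author-of-X example from above over a two-entry knowledge base.

```python
def multi_hop_answer(question, retrieve, llm_step, max_iterations=4):
    """Iterate retrieve → reason → follow-up. `llm_step` returns (done, payload):
    payload is the final answer when done, else the next retrieval query."""
    context = []
    query = question
    for _ in range(max_iterations):
        context.extend(retrieve(query))
        done, payload = llm_step(question, context)
        if done:
            return payload
        query = payload
    return payload  # iteration budget exhausted: return best effort

# Toy two-hop knowledge base and a scripted "LLM" that requests the hops.
kb = {
    "book X": ["book X was written by Jane"],
    "Jane": ["Jane said topic Y is overrated"],
}
retrieve = lambda q: kb.get(q, [])

def llm_step(question, context):
    if any("said" in c for c in context):
        return True, "Jane said topic Y is overrated"
    if any("written by Jane" in c for c in context):
        return False, "Jane"          # hop 2: query the author by name
    return False, "book X"            # hop 1: find the book first

print(multi_hop_answer("What did the author of X say about Y?", retrieve, llm_step))
# → Jane said topic Y is overrated
```

The `max_iterations` cap is the safety valve the latency warning below this section insists on: without it, a confused model can loop on retrieval indefinitely.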

Architectural Patterns

| Architecture | Retrieval calls | Complexity | Best for |
|---|---|---|---|
| Naive RAG | 1 | Low | Simple factual Q&A |
| Query-expanded RAG | 3–5 | Low | Better recall |
| Multi-hop RAG | 2–5 | Medium | Relational questions |
| Agentic RAG | Dynamic | High | Complex research tasks |

IRCoT (Interleaving Retrieval with Chain-of-Thought)

IRCoT retrieves at each reasoning step, not just at the start. The model reasons, decides it needs information, retrieves, reasons again. This is closer to how humans research — iterative reasoning with strategic lookups.

⚠️ Agentic RAG latency is unpredictable: Multiple retrieval calls and LLM reasoning loops compound latency. Always set max_iterations and timeouts to prevent runaway loops.
07 — Quality

Evaluation Framework

RAG quality has multiple dimensions: retrieval quality (did we get relevant documents?), answer quality (is the final answer correct?), grounding (is the answer supported by context?). A single metric doesn't capture all three.

Evaluation Tools & Frameworks

  • RAGAS (framework): metrics for faithfulness, answer relevancy, context precision/recall.
  • TruLens (framework): instrumentation for tracing and evaluating RAG chains.
  • LangChain Eval (framework): built-in evaluators for retrieval and generation quality.
  • DeepEval (framework): Python framework for synthetic test generation and RAG metrics.
  • Arize Phoenix (observability): production monitoring for RAG systems.
  • LlamaIndex Eval (framework): evaluation suite for RAG retrieval and generation.

RAGAS Metrics

  • Faithfulness: is the answer grounded in the retrieved context? (# claims supported by context / total claims)
  • Answer Relevancy: does the answer address the query?
  • Context Precision: what fraction of retrieved chunks are actually relevant? (# relevant chunks / k)
  • Context Recall: what fraction of ground-truth facts are covered by the retrieved context?

RAGAS metric formulas:
  Faithfulness = (# claims in answer supported by context) / (total claims)
  Context Precision = (# relevant chunks in top-k) / k
  Context Recall = (# ground-truth facts covered by context) / (total facts)
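The retrieval-side metrics reduce to simple ratios once relevance judgments exist. In RAGAS those judgments come from an LLM judge; this sketch takes them as plain sets and substring checks instead, so only the formulas themselves are illustrated.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of retrieved chunks that are actually relevant.
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], ground_truth_facts: list[str]) -> float:
    # Fraction of ground-truth facts covered somewhere in the retrieved context.
    context = " ".join(retrieved)
    return sum(1 for f in ground_truth_facts if f in context) / len(ground_truth_facts)

print(context_precision(["a", "b", "c", "d"], {"a", "c"}))          # → 0.5
print(context_recall(["chunk about attention"], ["attention", "softmax"]))  # → 0.5
```

Tracking the two together makes the trade-off in the best-practice note below concrete: shrinking k raises precision but can only lower recall.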
Best practice: Use RAGAS in continuous evaluation. Track each metric separately — precision and recall often trade off. A system with high context precision but low recall is missing important information.
08 — Further Reading


Learning Path

Advanced RAG is for when basic RAG already works but you need better accuracy, recall, or latency. Always start with basic RAG and measure before applying these techniques.

Basic RAG (working baseline) → Measure (RAGAS scores) → Diagnose (retrieval vs. generation?) → Hybrid Search (fix recall) → Reranking (fix precision) → Re-eval (did it help?)
1. Diagnose before you optimise

Use RAGAS or LlamaIndex evals to separate retrieval problems (low context recall) from generation problems (low faithfulness). Each requires different fixes.

2. Try hybrid search first

Add BM25 to your dense retrieval (Weaviate / Elasticsearch hybrid mode). If recall goes up, you had keyword-matchable queries. This is often the highest-ROI fix.

3. Add a reranker for precision

Cohere Rerank or BGE-Reranker takes your top-20 chunks and reorders them. Common pattern: retrieve 20, rerank to 5. Precision usually jumps 10–20%.

4. Contextual retrieval for large corpora

Anthropic's contextual retrieval prepends a short document summary to each chunk before embedding. Dramatically improves recall for large corpora (>100K chunks).