Beyond naive chunking — reranking, fusion, multi-hop, and agentic retrieval for production systems
Naive RAG follows a simple linear pattern: chunk document → embed → store in vector DB → retrieve top-k → send to LLM. This works for straightforward factual Q&A, but production systems expose critical failure modes.
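The linear pattern can be sketched end to end. This is a toy, not a library API: a bag-of-words `Counter` stands in for a real embedding model, and `embed`, `cosine`, and `retrieve` are illustrative names.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real system would call
    # a sentence-embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # chunk -> embed -> retrieve top-k: exactly the linear pattern above
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "The capital of France is Paris.",
    "Photosynthesis converts light into chemical energy.",
    "Paris hosted the 1900 and 1924 Olympic Games.",
]
top = retrieve("capital of France", chunks, k=1)
```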
The naive assumption is that one query → one retrieval → answer. But real-world tasks rarely work this way. Users ask multi-hop questions, the best answer spans multiple documents, or the user's phrasing drifts far from the actual concept in your documents.
Poor recall: Relevant chunks are missed because semantic drift in embeddings gives them low similarity with the query.
Poor precision: Top-k retrieval floods the context with borderline-relevant chunks that confuse the LLM.
Lost context: Fixed chunk sizes break paragraphs, splitting concepts across boundaries.
Context window overload: Sending too many chunks to the LLM increases latency and confusion.
| Failure | Cause | Fix |
|---|---|---|
| Missed relevant chunks | Semantic drift in embedding | HyDE, query expansion |
| Irrelevant chunks in context | Top-k too large, low precision | Reranking |
| Split concepts | Fixed chunk size breaks paragraphs | Semantic chunking |
| Stale answers | Vector DB not updated | Incremental indexing |
| Multi-hop needed | One chunk can't answer question | Multi-hop / agentic RAG |
The user's query is not always the best form of the question for retrieval. Your embedding model was trained on diverse text; the user's question is just one phrasing. Query transformation bridges that gap by regenerating the query in forms more similar to your documents.
HyDE: Generate a hypothetical answer to the query and embed that; the hypothetical answer lands closer to document embeddings than the question itself.
Multi-query expansion: Generate 3–5 rephrasings of the question, retrieve for each, and union the results.
Step-back prompting: Ask a more general background question first, retrieve for it, then answer the original.
Query decomposition: Break a complex question into sub-questions, answer each independently, and synthesize.
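The multi-query approach above is simple to sketch. In production `expand_query` would be an LLM call; here the rephrasings and the retriever (plain token overlap) are illustrative stand-ins.

```python
def expand_query(query: str) -> list[str]:
    # Stub for the LLM call that generates 3-5 rephrasings.
    return [
        query,
        f"explain {query}",
        f"what is meant by {query}",
    ]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Stand-in retriever: rank by shared lowercase tokens.
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def multi_query_retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Retrieve for every rephrasing, then union results in first-hit order.
    seen, results = set(), []
    for q in expand_query(query):
        for doc in retrieve(q, corpus, k):
            if doc not in seen:
                seen.add(doc)
                results.append(doc)
    return results

corpus = [
    "vector databases store embeddings",
    "BM25 is a sparse retrieval method",
    "chunk overlap preserves context",
]
hits = multi_query_retrieve("sparse retrieval", corpus, k=1)
```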
Chunk size and overlap are not hyperparameters to tune blindly — they depend entirely on document type, query length, and retrieval precision goals. Larger chunks preserve context but reduce precision; smaller chunks improve precision but lose coherence.
Fixed-size: Split every N tokens with M tokens of overlap. Fast, but breaks mid-sentence. A good baseline.
Semantic: Split at sentence/paragraph boundaries and merge until the chunk reaches a target size. Better coherence.
Parent-child: Index small chunks for retrieval precision; fetch the parent chunk for context. High quality at both retrieval and context time.
Late chunking: Embed the full document first, then pool embeddings per chunk. Preserves cross-chunk context in the embeddings.
| Method | Retrieval precision | Context coherence | Complexity | Best for |
|---|---|---|---|---|
| Fixed-size | Medium | Low | Low | Quick prototypes |
| Semantic | Medium-high | High | Medium | General docs |
| Parent-child | High | High | Medium | Long documents |
| Late chunking | High | Highest | High | Technical docs, code |
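The semantic strategy from the table is a few lines of code: split at sentence boundaries, then merge sentences until a target size. This sketch counts words for simplicity; real implementations usually count tokens.

```python
import re

def semantic_chunk(text: str, target_words: int = 40) -> list[str]:
    # Split at sentence boundaries so no sentence is ever cut in half,
    # then merge sentences until each chunk reaches the target size.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if sum(len(s.split()) for s in current) >= target_words:
            chunks.append(" ".join(current))
            current = []
    if current:  # flush the remainder
        chunks.append(" ".join(current))
    return chunks

text = "One two three. Four five six. Seven eight."
chunks = semantic_chunk(text, target_words=5)
```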
Vector similarity is a fast but imprecise proxy for relevance. Reranking solves this: after retrieving top-k candidates (e.g., 50), score each against the query using a cross-encoder → re-sort → send top-n (e.g., 5) to LLM. This two-stage approach combines speed and quality.
Bi-encoder (FAISS): Embeds the query and passages independently. Very fast initial retrieval, but the similarity signal is coarse.
ColBERT: Computes token-level embeddings for both and scores with max-sim pooling. Faster than a cross-encoder, more accurate than a bi-encoder.
Cross-encoder: Sees (query, passage) together, so it is much more accurate, but too slow to score a large corpus. Perfect for reranking.
| Model type | Speed | Quality | Can retrieve from large corpus? |
|---|---|---|---|
| Bi-encoder (FAISS) | Very fast | Good | Yes |
| ColBERT | Fast | Better | Yes (with PLAID) |
| Cross-encoder | Slow | Best | No (rerank only) |
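The two-stage pattern can be sketched with stand-ins: token overlap plays the bi-encoder, and length-normalized overlap plays the cross-encoder. A real system would use a dense index for stage 1 and a trained cross-encoder model for stage 2; all names here are illustrative.

```python
def bi_encoder_retrieve(query: str, corpus: list[str], k: int = 50) -> list[str]:
    # Stage 1: fast and coarse. Token overlap stands in for dense
    # vector similarity over an ANN index.
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def cross_encoder_score(query: str, passage: str) -> float:
    # Stage 2 stand-in: a real cross-encoder reads (query, passage)
    # jointly; we approximate with overlap normalized by passage length.
    q = set(query.lower().split())
    p = passage.lower().split()
    return len(q & set(p)) / max(len(p), 1)

def retrieve_and_rerank(query: str, corpus: list[str],
                        k: int = 50, n: int = 5) -> list[str]:
    # Retrieve a wide candidate set, rescore it, keep the best n.
    candidates = bi_encoder_retrieve(query, corpus, k)
    return sorted(candidates, key=lambda p: cross_encoder_score(query, p),
                  reverse=True)[:n]

corpus = [
    "reranking improves precision of retrieval",
    "reranking is slow",
    "chunking strategies vary",
]
top = retrieve_and_rerank("reranking precision", corpus, k=3, n=2)
```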
Dense retrieval (embeddings) excels at semantic similarity but misses keyword-exact matches. Sparse retrieval (BM25) catches exact matches but misses semantic similarity. Hybrid search combines both rankings via fusion — typically Reciprocal Rank Fusion (RRF).
RRF (Reciprocal Rank Fusion) is an effectively parameter-free fusion method: RRF_score(d) = Σ_r 1/(k + rank_r(d)), where rank_r(d) is the document's rank in retriever r's list and k is a constant conventionally fixed at 60. It's simple, robust, and works well in practice: no score normalization or weight tuning needed.
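RRF is small enough to write out in full; the only constant is the conventional k = 60.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # RRF_score(d) = sum over rankings of 1 / (k + rank(d)),
    # with rank starting at 1. Scores never need normalizing because
    # only ranks, not raw similarity values, enter the formula.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d1", "d2", "d3"]   # e.g. embedding-based ranking
sparse = ["d3", "d1", "d4"]   # e.g. BM25 ranking
fused = rrf_fuse([dense, sparse])
```

Documents ranked highly by both retrievers (d1, d3) float to the top, while documents seen by only one retriever fall behind.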
Some questions require chaining multiple retrievals: "What did the author of X say about Y?" You must retrieve X, identify the author, then retrieve on the author's name. Multi-hop RAG handles this by iterating: retrieve → read → formulate follow-up query → retrieve again → synthesize.
Agentic RAG goes further: the LLM decides when to retrieve, what to query, whether to search again, whether to use a calculator or code execution. Retrieval is just one tool in a larger decision-making loop.
| Architecture | Retrieval calls | Complexity | Best for |
|---|---|---|---|
| Naive RAG | 1 | Low | Simple factual Q&A |
| Query-expanded RAG | 3–5 | Low | Better recall |
| Multi-hop RAG | 2–5 | Medium | Relational questions |
| Agentic RAG | Dynamic | High | Complex research tasks |
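The multi-hop loop can be sketched with injected stubs. In production `retrieve` hits your index and `extract_followup` is an LLM reading the accumulated context; every name and the toy knowledge base below are illustrative.

```python
def multi_hop_answer(question, retrieve, extract_followup, max_hops=3):
    # Iterate: retrieve -> read -> formulate follow-up -> retrieve again.
    context, query = [], question
    for _ in range(max_hops):
        context.extend(retrieve(query))
        query = extract_followup(question, context)
        if query is None:  # the reader judges the context sufficient
            break
    return context

# Toy knowledge base for "What did the author of X say about Y?"
kb = {
    "author of X": ["X was written by Alice."],
    "Alice on Y": ["Alice argued Y is essential."],
}

def toy_retrieve(q):
    return kb.get(q, [])

def toy_followup(question, context):
    # Stub for the LLM: hop to the author once identified, then stop.
    if ("X was written by Alice." in context
            and "Alice argued Y is essential." not in context):
        return "Alice on Y"
    return None

context = multi_hop_answer("author of X", toy_retrieve, toy_followup)
```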
IRCoT (Interleaving Retrieval with Chain-of-Thought) retrieves at each reasoning step, not just at the start. The model reasons, decides it needs information, retrieves, reasons again. This is closer to how humans research — iterative reasoning with strategic lookups.
RAG quality has multiple dimensions: retrieval quality (did we get relevant documents?), answer quality (is the final answer correct?), grounding (is the answer supported by context?). A single metric doesn't capture all three.
Faithfulness: Is the answer grounded in the retrieved context? (# claims supported by context / total claims)
Answer Relevancy: Does the answer address the query?
Context Precision: What fraction of retrieved chunks are actually relevant? (# relevant chunks / k)
Context Recall: What fraction of ground-truth facts are covered by retrieved context?
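The context metrics reduce to simple ratios. This sketch uses substring matching as the "support" check where real evaluators such as RAGAS use an LLM judge; the function names are illustrative.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Fraction of retrieved chunks that are actually relevant (# relevant / k).
    return sum(c in relevant for c in retrieved) / len(retrieved)

def context_recall(retrieved: list[str], ground_truth_facts: list[str]) -> float:
    # Fraction of ground-truth facts covered by the retrieved context.
    blob = " ".join(retrieved)
    return sum(f in blob for f in ground_truth_facts) / len(ground_truth_facts)

def faithfulness(claims: list[str], context: str) -> float:
    # Fraction of answer claims supported by the context.
    return sum(c in context for c in claims) / len(claims)
```

A low context_recall points at a retrieval problem; a low faithfulness with high recall points at a generation problem.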
Advanced RAG is for when basic RAG already works but you need better accuracy, recall, or latency. Always start with basic RAG and measure before applying these techniques.
Use RAGAS or LlamaIndex evals to separate retrieval problems (low context recall) from generation problems (low faithfulness). Each requires different fixes.
Add BM25 to your dense retrieval (Weaviate / Elasticsearch hybrid mode). If recall goes up, you had keyword-matchable queries. This is often the highest-ROI fix.
Cohere Rerank or BGE-Reranker takes your top-20 chunks and reorders them. Common pattern: retrieve 20, rerank to 5. Precision usually jumps 10–20%.
Anthropic's contextual retrieval prepends a short document summary to each chunk before embedding. Dramatically improves recall for large corpora (>100K chunks).
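A minimal sketch of the prepend step, with one caveat: Anthropic's full recipe generates chunk-specific context with an LLM, whereas this stand-in prepends a single fixed document summary to every chunk.

```python
def contextualize_chunks(doc_summary: str, chunks: list[str]) -> list[str]:
    # Prepend document-level context to each chunk before embedding,
    # so an isolated chunk still carries global context into the index.
    return [f"{doc_summary}\n\n{chunk}" for chunk in chunks]

out = contextualize_chunks(
    "Q3 2023 earnings report for ACME Corp.",
    ["Revenue grew 3% over the previous quarter."],
)
```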