Retrieval & RAG

Agentic RAG

Multi-step retrieval and reasoning that adapts dynamically, combining the power of agents with retrieval-augmented generation.

Retrieval: Multi-Step · Architecture: Planning + Tools · Key Property: Self-Correcting


SECTION 01

Beyond Naive RAG

Traditional RAG (Retrieval-Augmented Generation) is simple: take a user question, retrieve documents, pass documents + question to an LLM, get an answer. This works for straightforward queries but fails on complex questions that require reasoning, multi-hop retrieval, or adaptive strategies.
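Concretely, the whole naive pipeline fits in a few lines. This is a minimal sketch: `retriever` and `llm` are placeholder callables, not any particular library's API.

```python
def naive_rag(question: str, retriever, llm) -> str:
    """Single-shot RAG: retrieve once, stuff docs into the prompt, answer.

    retriever: question -> list[str] of documents (placeholder)
    llm: prompt -> answer string (placeholder)
    """
    docs = retriever(question)
    context = "\n\n".join(docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```

Every limitation below follows from this shape: one query, one retrieval, no chance to react to what came back.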

Limitations of Naive RAG:

- Single-shot retrieval: one query, one chance to find the right documents
- No multi-hop reasoning: questions that require chaining facts across documents fail
- No adaptation: the retrieval strategy cannot change based on what comes back
- No self-correction: irrelevant or insufficient documents still get passed to the LLM

Agentic RAG solves these by adding planning, tool use, and iterative refinement. Instead of retrieve-once, agentic RAG decomposes questions, retrieves adaptively, evaluates results, and refines as needed.

Core Idea: Treat the LLM as an agent with tools. Tools include retrievers, web searchers, calculators, code executors, etc. The agent decides which tools to use and when, creating multi-step workflows that adapt to the question.

Why Agentic? An agent is an LLM that can take actions (use tools), observe outcomes, and decide next steps. Agentic RAG applies this to retrieval: the model doesn't passively read docs; it actively explores, evaluates, and refines.
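The act/observe/decide loop can be sketched generically. Here `llm_decide` and the `tools` dict are illustrative stand-ins, not a specific framework API.

```python
def agent_loop(question, llm_decide, tools, max_steps=5):
    """Generic act/observe loop (sketch, all names are assumptions).

    llm_decide: (question, observations) -> either
        ("tool", tool_name, tool_input) to act, or ("answer", text) to stop.
    tools: dict of tool_name -> callable.
    """
    observations = []
    for _ in range(max_steps):
        decision = llm_decide(question, observations)
        if decision[0] == "answer":
            return decision[1]
        _, name, tool_input = decision
        # Act, then record the observation for the next decision
        observations.append((name, tools[name](tool_input)))
    return "Step budget exhausted"
```

In agentic RAG, the tools are retrievers and searchers, and the decision function is the LLM itself.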
SECTION 02

Core Patterns

Several patterns enable agentic RAG:

1. Iterative Retrieval

After generating an initial answer, the model checks if it's grounded in the retrieved docs. If not, it reformulates the query and retrieves again:

Loop:
1. User asks: "How does photosynthesis relate to climate?"
2. Retrieve docs on photosynthesis
3. Generate answer based on docs
4. Check: Is answer grounded? (Self-evaluation)
5. If not: Reformulate query ("climate change carbon cycle")
6. Retrieve again
7. Go to step 3 with expanded context
8. Repeat until confident

Result: Multi-hop reasoning without explicit prompting.
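The loop can be sketched as a small driver function. All four injected callables (`retrieve`, `generate`, `is_grounded`, `reformulate`) are assumptions standing in for LLM- and retriever-backed implementations.

```python
def iterative_rag(question, retrieve, generate, is_grounded, reformulate,
                  max_rounds=3):
    """Sketch of the retrieve/generate/check/reformulate loop.

    retrieve: query -> list[str]; generate: (question, context) -> answer
    is_grounded: (answer, context) -> bool; reformulate: (question, answer) -> query
    """
    context, query = [], question
    answer = ""
    for _ in range(max_rounds):
        context += retrieve(query)            # accumulate context across rounds
        answer = generate(question, context)
        if is_grounded(answer, context):      # self-evaluation step
            return answer
        query = reformulate(question, answer)  # e.g. "climate change carbon cycle"
    return answer  # best effort after the retrieval budget is spent
```

The `max_rounds` cap matters in practice: without it, an unanswerable question loops forever.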

2. Query Decomposition

Complex questions are broken into sub-queries, each retrieved independently. For example, "Compare climate policies in the EU and US" splits into one sub-question per region plus comparison questions (Section 03 covers this in detail).

3. Self-Reflection (FLARE)

While generating, the model identifies missing information and retrieves on demand: when confidence in the next sentence is low, it issues a retrieval query built from the draft sentence, then regenerates that sentence with the new context.
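A rough sketch of the idea, with `is_confident` standing in for FLARE's token-probability threshold and `generate_next` for sentence-level decoding (both hypothetical callables):

```python
def flare_generate(question, generate_next, is_confident, retrieve,
                   max_sentences=10):
    """Active retrieval during generation (FLARE-style sketch).

    generate_next: (question, sentences_so_far, context) -> next sentence or None
    is_confident: sentence -> bool (stand-in for a token-probability threshold)
    retrieve: query -> list[str]
    """
    context, sentences = [], []
    for _ in range(max_sentences):
        sentence = generate_next(question, sentences, context)
        if sentence is None:
            break
        if not is_confident(sentence):
            # Use the low-confidence draft sentence itself as the query,
            # then regenerate it with the retrieved context
            context += retrieve(sentence)
            sentence = generate_next(question, sentences, context)
        sentences.append(sentence)
    return " ".join(sentences)
```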

4. Adaptive Retrieval

The model decides whether it needs retrieval at all. For factual questions, retrieve; for pure reasoning or arithmetic, skip retrieval and answer directly.
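The routing decision is a one-line gate. In this sketch, `needs_retrieval` could be an LLM judgment or a small trained classifier; all names are assumptions.

```python
def adaptive_answer(question, needs_retrieval, retrieve, generate):
    """Retrieve only when the router says the question needs external facts.

    needs_retrieval: question -> bool (LLM judgment or classifier, assumption)
    retrieve: question -> list[str]; generate: (question, docs) -> answer
    """
    docs = retrieve(question) if needs_retrieval(question) else []
    return generate(question, docs)
```

Skipping retrieval on questions the model can answer alone saves both latency and cost, which compounds quickly in multi-step pipelines.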

Pattern Advantage: These patterns are composable. Use query decomposition + iterative refinement + self-reflection for maximum coverage. Or use just one pattern if the question is simple.
SECTION 03

Query Decomposition

Breaking a complex question into logical sub-questions improves retrieval coverage.

Methods

1. Explicit Decomposition — Prompt the model to break down the question:

User question: "Compare climate policies in the EU and US"
System prompt: "Break this question into 3-5 sub-questions."

Model output:
1. What are current EU climate policies?
2. What are current US climate policies?
3. How do they differ in scope and ambition?
4. What has been the effectiveness of each approach?
5. What are future plans in both regions?

Action: Retrieve on each sub-question separately.

2. Step-Back Prompting — Ask a more abstract question first to capture broader context. For example, before retrieving on "Why did a specific policy fail in 2023?", first retrieve on the step-back question "What factors generally cause policies of this kind to fail?", then combine the results.
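A minimal sketch, assuming `llm` is any prompt-to-text callable and `retrieve` returns a doc list (both placeholders):

```python
def step_back_retrieve(question, llm, retrieve):
    """Step-back retrieval: ask for a broader question, retrieve on both.

    llm: prompt -> text (placeholder); retrieve: query -> list[str]
    """
    step_back_q = llm(f"Rewrite as a more general background question: {question}")
    # Broader docs first, specific docs second; downstream reranking can reorder
    return retrieve(step_back_q) + retrieve(question)
```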

3. HyDE (Hypothetical Document Embeddings)

Generate a hypothetical document answering the question, use its embedding to retrieve similar real documents:

Query: "Best practices for distributed training of LLMs"

Step 1: Generate hypothetical document: "Distributed training of LLMs involves data parallelism, model parallelism, and pipeline parallelism. Data parallelism divides the batch across GPUs. Model parallelism divides layers across devices..."
Step 2: Embed this hypothetical doc
Step 3: Search vector store for docs similar to this embedding
Step 4: Retrieve real docs that are semantically similar

Result: Retrieves practice-focused docs, not just theoretical ones.
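The four steps collapse into one function. In this sketch, `embed` and `vector_search` are placeholders for your embedding model and vector store, and `llm` for any prompt-to-text callable.

```python
def hyde_retrieve(question, llm, embed, vector_search, top_k=5):
    """HyDE sketch: embed a hypothetical answer instead of the question.

    llm: prompt -> text; embed: text -> vector;
    vector_search: (vector, top_k) -> list of docs (all placeholders)
    """
    # Step 1: hallucinate a plausible answer document
    hypothetical = llm(f"Write a short passage that answers: {question}")
    # Steps 2-4: embed it and search for real docs near it in vector space
    return vector_search(embed(hypothetical), top_k=top_k)
```

The trick is that a hypothetical answer sits closer to real answer documents in embedding space than the question itself does.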

Implementation Example

```python
from anthropic import Anthropic


def decompose_query(user_question: str,
                    model: str = "claude-3-5-sonnet-20241022") -> list[str]:
    """Decompose a complex question into sub-queries."""
    client = Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Break down this question into 3-5 specific sub-questions for retrieval.
Return as a numbered list.

Question: {user_question}"""
        }]
    )
    # Parse the numbered list into plain sub-query strings
    text = response.content[0].text
    sub_queries = [
        line.split('. ', 1)[1] if '. ' in line else line
        for line in text.split('\n')
        if line.strip()
    ]
    return sub_queries


def retrieve_for_decomposed_query(user_question: str, retriever) -> dict:
    """Retrieve for each sub-query and aggregate."""
    sub_queries = decompose_query(user_question)
    all_docs = {}
    for i, sub_q in enumerate(sub_queries):
        docs = retriever.retrieve(sub_q, top_k=5)
        all_docs[f"sub_query_{i}"] = {
            "query": sub_q,
            "docs": docs
        }
    return all_docs
```
Tradeoff: Decomposition increases API calls (one retrieval per sub-query). Use for complex questions. For simple questions, skip decomposition to save cost.
SECTION 04

Corrective RAG (CRAG)

Corrective RAG is a structured approach to handle retrieval failures. If retrieved documents don't answer the question, fall back to web search or knowledge expansion.

CRAG Flow

1. User question: "Latest breakthroughs in quantum computing 2026"
2. Retrieve from internal KB
3. Evaluator judges: Are docs relevant? (binary or scored)
4a. If YES: Pass to generation step
4b. If NO: Trigger web search (e.g., Google, Bing)
4c. If PARTIAL: Retrieve more + web search
5. Generate answer from best sources
6. Return with source attribution

Retrieval Evaluation

Key: How do we judge if retrieved docs are relevant? Three common approaches:

- LLM-as-judge: prompt a model to rate each document against the question (the approach used in the implementation below)
- Trained evaluator: a lightweight fine-tuned relevance classifier, as in the original CRAG paper
- Similarity threshold: reject retrievals whose query-document embedding scores fall below a cutoff

Implementation

```python
import json

from anthropic import Anthropic


def evaluate_retrieval(question: str, docs: list[str]) -> dict:
    """Evaluate if docs are relevant to the question."""
    client = Anthropic()
    docs_text = "\n\n".join(docs[:3])  # Top 3 docs
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Rate if these documents answer the question.

Question: {question}

Documents:
{docs_text}

Respond with JSON: {{"relevant": true/false, "reason": "..."}}"""
        }]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"relevant": False, "reason": "Parse error"}


def corrective_rag(question: str, retriever, web_searcher):
    """Corrective RAG pipeline."""
    # Step 1: Initial retrieval
    docs = retriever.retrieve(question, top_k=5)

    # Step 2: Evaluate
    eval_result = evaluate_retrieval(question, docs)

    # Step 3: Fallback if needed
    if not eval_result["relevant"]:
        print(f"Retrieved docs not relevant. Reason: {eval_result['reason']}")
        print("Falling back to web search...")
        web_docs = web_searcher.search(question, top_k=5)
        docs = docs + web_docs  # Combine

    # Step 4: Generate
    client = Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on these documents, answer: {question}

{chr(10).join([f"- {d}" for d in docs])}"""
        }]
    )
    return response.content[0].text
```
CRAG Advantage: Gracefully handles out-of-KB questions by falling back to web search. Prevents "I don't know" when external data could help.
SECTION 05

Self-RAG

Self-RAG extends the model with special tokens that control when to retrieve, evaluate relevance, and judge output quality. The model learns to use these tokens during fine-tuning.

Special Tokens in Self-RAG

- [Retrieve]: should documents be fetched at this point in generation?
- [IsRel]: is a retrieved passage relevant to the question?
- [IsSup]: is the generated segment supported by the retrieved passage?
- [IsUse]: is the overall response useful?

Generation with Self-RAG

Question: "What's the latest in quantum error correction?"

Model generates: "Recent advances in quantum error correction..."
[Retrieve: quantum error correction 2025 2026]
[Retrieved: 3 docs on quantum error correction]
[IsRel: yes] (docs are relevant)
"Surface codes have improved efficiency by..."
[IsRel: yes]
"The leading approach is topological codes, which..."
[IsSup: partial] (partially supported)
[IsUse: yes] (still useful)

Final answer synthesized from generation + retrieval steps.

Advantages Over Naive RAG

- Retrieves only when needed, saving cost and latency on questions the model can answer alone
- Critiques its own output, flagging unsupported claims instead of returning them silently
- Yields per-segment relevance and support judgments, enabling fine-grained source attribution

Implementation Note

Self-RAG requires fine-tuning to teach the model when to emit the special tokens. Off-the-shelf API models don't support this, but the paper (Asai et al. 2023) provides guidance. For production, you'd need to:

1. Build training data in which a critic model annotates generations with reflection tokens
2. Fine-tune an open-weights model on the token-augmented data
3. Add inference-time logic that acts on the tokens (trigger retrieval, filter unsupported segments)

Complexity Tradeoff: Self-RAG is powerful but requires fine-tuning. For many use cases, agentic patterns (decomposition, iteration) are sufficient without custom fine-tuning.
SECTION 06

Building Agentic RAG with LangGraph

LangGraph is a framework for building agentic systems. It models workflows as graphs with nodes (compute steps) and edges (transitions). Here's how to build agentic RAG:

Architecture

Implementation Example

```python
from typing import TypedDict

import anthropic
from langgraph.graph import END, StateGraph

MODEL = "claude-3-5-sonnet-20241022"


class RAGState(TypedDict):
    question: str
    docs: list[str]
    answer: str
    iterations: int
    max_iterations: int


def retrieve_node(state: RAGState) -> RAGState:
    """Retrieve documents based on the question."""
    retriever = VectorStore()  # Placeholder: use your actual vector store
    state["docs"] = retriever.search(state["question"], top_k=5)
    return state


def generate_node(state: RAGState) -> RAGState:
    """Generate answer from docs."""
    client = anthropic.Anthropic()
    prompt = f"""Based on these documents:
{chr(10).join([f"- {d}" for d in state["docs"]])}

Answer: {state["question"]}"""
    response = client.messages.create(
        model=MODEL, max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    state["answer"] = response.content[0].text
    return state


def evaluate_router(state: RAGState) -> str:
    """Route to END if the answer is grounded (or the budget is spent)."""
    if state["iterations"] >= state["max_iterations"]:
        return "done"
    client = anthropic.Anthropic()
    response = client.messages.create(
        model=MODEL, max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Is this answer well-grounded in the docs?
Docs: {state["docs"]}
Answer: {state["answer"]}
Reply with exactly one word: grounded or needs_refinement"""
        }]
    )
    verdict = response.content[0].text.strip().lower()
    return "done" if verdict.startswith("grounded") else "refine"


def refine_node(state: RAGState) -> RAGState:
    """Refine the question, then re-retrieve."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model=MODEL, max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Rephrase for better retrieval:
Original: {state["question"]}
The answer was not grounded. Suggest a new query."""
        }]
    )
    state["question"] = response.content[0].text
    state["iterations"] += 1
    return state


# Build graph: evaluation is a router on the edge out of "generate"
graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.add_node("refine", refine_node)
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", evaluate_router,
                            {"done": END, "refine": "refine"})
graph.add_edge("refine", "retrieve")
graph.set_entry_point("retrieve")
rag_app = graph.compile()

# Run
result = rag_app.invoke({
    "question": "How does quantum entanglement work?",
    "docs": [], "answer": "",
    "iterations": 0, "max_iterations": 3
})
print(result["answer"])
```

Graph Execution

LangGraph handles the loop: Retrieve → Generate → Evaluate → (Refine or Done). You define nodes and edges; the framework manages execution.

LangGraph Advantage: Explicit control flow. Easy to debug (see which node runs). Integrates with LangChain tools, models, and evaluation. Supports streaming and real-time updates.
SECTION 07

Evaluation & Tracing

Evaluating agentic RAG systems is more complex than naive RAG because multi-hop reasoning makes ground truth harder to define.

RAGAS Metrics (Adapted for Agentic)

| Metric | Definition | For Agentic RAG |
|---|---|---|
| Retrieval Precision | % of retrieved docs that are relevant | Measure per retrieval step; agentic may have 3+ steps |
| Context Precision | Ranking of relevant docs in retrieval list | Track across all retrieval iterations |
| Faithfulness | Answer is supported by retrieved docs | Judge on final answer + all intermediate docs |
| Answer Relevance | Answer is relevant to question | Standard metric; agentic decomposition improves this |
| Latency | Time to generate answer | Agentic adds overhead; monitor iterations vs quality |
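Per-step retrieval precision can be computed directly from a trace once relevance labels exist. This is a sketch; the trace shape (a list of steps with `retrieved` and `relevant` fields) is an assumption.

```python
def retrieval_precision_per_step(trace):
    """Compute retrieval precision for each retrieval step in a trace.

    trace: list of steps, each {"retrieved": [doc, ...], "relevant": set_of_docs}
    Returns one precision value per step, so regressions in later hops are visible.
    """
    precisions = []
    for step in trace:
        retrieved = step["retrieved"]
        hits = sum(1 for d in retrieved if d in step["relevant"])
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
    return precisions
```

Averaging across steps hides problems; a pipeline whose first hop is precise but whose third hop retrieves noise needs the per-step view.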

Tracing Multi-Hop Reasoning

Key: Log each step to understand where errors occur:

Trace output (structured log):

```json
{
  "question": "What are the implications of quantum computing on blockchain?",
  "steps": [
    {
      "step": 1,
      "action": "decompose_query",
      "sub_queries": [
        "How does quantum computing work?",
        "How does blockchain cryptography work?",
        "How does quantum computing threaten blockchain?"
      ]
    },
    {
      "step": 2,
      "action": "retrieve",
      "sub_query": "How does quantum computing work?",
      "results": [
        {"doc": "Quantum...", "score": 0.89},
        {"doc": "Superposition...", "score": 0.87}
      ]
    },
    {
      "step": 3,
      "action": "evaluate_retrieval",
      "relevant": true,
      "reason": "Docs explain quantum fundamentals"
    },
    ...
  ],
  "final_answer": "...",
  "total_steps": 8,
  "total_tokens": 4500,
  "latency_ms": 3200
}
```
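A trace in this shape can be produced with a small helper. This `RAGTracer` class is an illustrative sketch, not a library API.

```python
import json
import time


class RAGTracer:
    """Minimal structured tracer for multi-hop RAG pipelines (sketch)."""

    def __init__(self, question: str):
        self.trace = {"question": question, "steps": [], "started": time.time()}

    def log(self, action: str, **fields):
        """Record one step (e.g. retrieve, evaluate_retrieval) with its fields."""
        self.trace["steps"].append({"step": len(self.trace["steps"]) + 1,
                                    "action": action, **fields})

    def finish(self, answer: str) -> str:
        """Attach the final answer and totals; return the trace as JSON."""
        self.trace["final_answer"] = answer
        self.trace["total_steps"] = len(self.trace["steps"])
        started = self.trace.pop("started")
        self.trace["latency_ms"] = int((time.time() - started) * 1000)
        return json.dumps(self.trace, indent=2)
```

Call `log()` after every retrieval, evaluation, and generation decision, then `finish()` once to emit the record for storage or a tracing UI.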

Quality vs Latency Tradeoff

Scenario: Q&A system with agentic RAG

Config 1: No decomposition, 1 retrieval pass
- Latency: 500ms
- Accuracy: 72%
- Cost: $0.05 per query

Config 2: Query decomposition, iterative refinement
- Latency: 2500ms
- Accuracy: 88%
- Cost: $0.20 per query

Trade-off: 5x slower, 4x more expensive, 16pt accuracy gain. For high-stakes (medical, legal): worth it. For customer chat: maybe not.

Debugging Agentic Pipelines

Best Practice: Instrument your agentic RAG with detailed logging. Every retrieval, evaluation, and generation decision should be traceable. Use LangSmith or similar for visualization and debugging.
SECTION 08

Production Deployment Checklist

Agentic RAG systems introduce failure modes that do not exist in naive retrieval pipelines. Before deploying to production, verify each of the following.

Retrieval: test recall@5 on a domain-representative question set with and without query decomposition; decomposition should not reduce recall on simple queries. Add a "no documents found" fallback so the agent does not hallucinate sources when retrieval returns empty results. Set a maximum retrieval budget (e.g. 5 retrieval steps per query) to bound latency.

Agent loop: implement a hard turn limit (e.g. 15 agent steps) to prevent infinite loops on ambiguous queries. Log every retrieval call and its results for post-hoc auditing. Add a confidence gate: if the agent's self-assessed confidence after retrieval is below a threshold, route to human review rather than auto-responding.
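The turn limit and confidence gate combine naturally into one wrapper around the agent loop. In this sketch, `step_fn` and `confidence_fn` are hypothetical hooks into your agent, not a specific framework's API.

```python
def guarded_agent(question, step_fn, confidence_fn, max_steps=15, threshold=0.7):
    """Hard turn limit plus confidence gate (sketch, hook names are assumptions).

    step_fn: state -> (answer_or_None, new_state); returns an answer when done.
    confidence_fn: (answer, state) -> float in [0, 1], the agent's self-assessment.
    """
    state = {"question": question}
    for _ in range(max_steps):  # hard turn limit: bounds loops on ambiguous queries
        answer, state = step_fn(state)
        if answer is not None:
            if confidence_fn(answer, state) < threshold:
                # Confidence gate: below threshold, route to a human instead
                return {"route": "human_review", "draft": answer}
            return {"route": "auto", "answer": answer}
    return {"route": "human_review", "draft": None}  # budget exhausted
```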

Evaluation: use LLM-as-judge scoring on a held-out test set, with separate metrics for faithfulness (does the answer follow from the retrieved context?) and correctness (is the answer factually accurate?). Track citation accuracy — each claim in the final answer should map to a specific retrieved passage. A faithfulness score below 0.85 on your test set is a signal to tighten the generation prompt or add a verification step.