Retrieval & RAG

Agentic RAG

Multi-step retrieval and reasoning that adapts dynamically, combining the power of agents with retrieval-augmented generation.

Retrieval: Multi-Step · Architecture: Planning + Tools · Key Property: Self-Correcting


SECTION 01

Beyond Naive RAG

Traditional RAG (Retrieval-Augmented Generation) is simple: take a user question, retrieve documents, pass documents + question to an LLM, get an answer. This works for straightforward queries but fails on complex questions that require reasoning, multi-hop retrieval, or adaptive strategies.
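Concretely, the whole naive pipeline fits in a few lines. This is a minimal sketch: `retriever` and `llm` are placeholder callables, not any particular library's API.

```python
def naive_rag(question: str, retriever, llm) -> str:
    """Single-shot RAG: retrieve once, stuff docs into the prompt, answer.

    retriever: question -> list[str] of documents (placeholder)
    llm: prompt -> answer string (placeholder)
    """
    docs = retriever(question)
    context = "\n\n".join(docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)
```

Every limitation below follows from this shape: one query, one retrieval, no chance to react to what came back.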

Limitations of Naive RAG:

- Single-shot retrieval: one query, one chance to find the right documents
- No multi-hop reasoning: questions that require chaining facts across documents fail
- No adaptation: the retrieval strategy cannot change based on what comes back
- No self-correction: irrelevant or insufficient documents still get passed to the LLM

Agentic RAG solves these by adding planning, tool use, and iterative refinement. Instead of retrieve-once, agentic RAG decomposes questions, retrieves adaptively, evaluates results, and refines as needed.

Core Idea: Treat the LLM as an agent with tools. Tools include retrievers, web searchers, calculators, code executors, etc. The agent decides which tools to use and when, creating multi-step workflows that adapt to the question.

Why Agentic? An agent is an LLM that can take actions (use tools), observe outcomes, and decide next steps. Agentic RAG applies this to retrieval: the model doesn't passively read docs; it actively explores, evaluates, and refines.
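The act/observe/decide loop can be sketched generically. Here `llm_decide` and the `tools` dict are illustrative stand-ins, not a specific framework API.

```python
def agent_loop(question, llm_decide, tools, max_steps=5):
    """Generic act/observe loop (sketch, all names are assumptions).

    llm_decide: (question, observations) -> either
        ("tool", tool_name, tool_input) to act, or ("answer", text) to stop.
    tools: dict of tool_name -> callable.
    """
    observations = []
    for _ in range(max_steps):
        decision = llm_decide(question, observations)
        if decision[0] == "answer":
            return decision[1]
        _, name, tool_input = decision
        # Act, then record the observation for the next decision
        observations.append((name, tools[name](tool_input)))
    return "Step budget exhausted"
```

In agentic RAG, the tools are retrievers and searchers, and the decision function is the LLM itself.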
SECTION 02

Core Patterns

Several patterns enable agentic RAG:

1. Iterative Retrieval

After generating an initial answer, the model checks if it's grounded in the retrieved docs. If not, it reformulates the query and retrieves again:

Loop:
1. User asks: "How does photosynthesis relate to climate?"
2. Retrieve docs on photosynthesis
3. Generate answer based on docs
4. Check: Is answer grounded? (Self-evaluation)
5. If not: Reformulate query ("climate change carbon cycle")
6. Retrieve again
7. Go to step 3 with expanded context
8. Repeat until confident

Result: Multi-hop reasoning without explicit prompting.
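The loop can be sketched as a small driver function. All four injected callables (`retrieve`, `generate`, `is_grounded`, `reformulate`) are assumptions standing in for LLM- and retriever-backed implementations.

```python
def iterative_rag(question, retrieve, generate, is_grounded, reformulate,
                  max_rounds=3):
    """Sketch of the retrieve/generate/check/reformulate loop.

    retrieve: query -> list[str]; generate: (question, context) -> answer
    is_grounded: (answer, context) -> bool; reformulate: (question, answer) -> query
    """
    context, query = [], question
    answer = ""
    for _ in range(max_rounds):
        context += retrieve(query)            # accumulate context across rounds
        answer = generate(question, context)
        if is_grounded(answer, context):      # self-evaluation step
            return answer
        query = reformulate(question, answer)  # e.g. "climate change carbon cycle"
    return answer  # best effort after the retrieval budget is spent
```

The `max_rounds` cap matters in practice: without it, an unanswerable question loops forever.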

2. Query Decomposition

Complex questions are broken into sub-queries, each retrieved independently. For example, "Compare climate policies in the EU and US" splits into one sub-question per region plus comparison questions (Section 03 covers this in detail).

3. Self-Reflection (FLARE)

While generating, the model identifies missing information and retrieves on demand: when confidence in the next sentence is low, it issues a retrieval query built from the draft sentence, then regenerates that sentence with the new context.
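A rough sketch of the idea, with `is_confident` standing in for FLARE's token-probability threshold and `generate_next` for sentence-level decoding (both hypothetical callables):

```python
def flare_generate(question, generate_next, is_confident, retrieve,
                   max_sentences=10):
    """Active retrieval during generation (FLARE-style sketch).

    generate_next: (question, sentences_so_far, context) -> next sentence or None
    is_confident: sentence -> bool (stand-in for a token-probability threshold)
    retrieve: query -> list[str]
    """
    context, sentences = [], []
    for _ in range(max_sentences):
        sentence = generate_next(question, sentences, context)
        if sentence is None:
            break
        if not is_confident(sentence):
            # Use the low-confidence draft sentence itself as the query,
            # then regenerate it with the retrieved context
            context += retrieve(sentence)
            sentence = generate_next(question, sentences, context)
        sentences.append(sentence)
    return " ".join(sentences)
```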

4. Adaptive Retrieval

The model decides whether it needs retrieval at all. For factual questions, retrieve; for pure reasoning or arithmetic, skip retrieval and answer directly.
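The routing decision is a one-line gate. In this sketch, `needs_retrieval` could be an LLM judgment or a small trained classifier; all names are assumptions.

```python
def adaptive_answer(question, needs_retrieval, retrieve, generate):
    """Retrieve only when the router says the question needs external facts.

    needs_retrieval: question -> bool (LLM judgment or classifier, assumption)
    retrieve: question -> list[str]; generate: (question, docs) -> answer
    """
    docs = retrieve(question) if needs_retrieval(question) else []
    return generate(question, docs)
```

Skipping retrieval on questions the model can answer alone saves both latency and cost, which compounds quickly in multi-step pipelines.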

Pattern Advantage: These patterns are composable. Use query decomposition + iterative refinement + self-reflection for maximum coverage. Or use just one pattern if the question is simple.
SECTION 03

Query Decomposition

Breaking a complex question into logical sub-questions improves retrieval coverage.

Methods

1. Explicit Decomposition — Prompt the model to break down the question:

User question: "Compare climate policies in the EU and US"
System prompt: "Break this question into 3-5 sub-questions."

Model output:
1. What are current EU climate policies?
2. What are current US climate policies?
3. How do they differ in scope and ambition?
4. What has been the effectiveness of each approach?
5. What are future plans in both regions?

Action: Retrieve on each sub-question separately.

2. Step-Back Prompting — Ask a more abstract question first to capture broader context. For example, before retrieving on "Why did a specific policy fail in 2023?", first retrieve on the step-back question "What factors generally cause policies of this kind to fail?", then combine the results.
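A minimal sketch, assuming `llm` is any prompt-to-text callable and `retrieve` returns a doc list (both placeholders):

```python
def step_back_retrieve(question, llm, retrieve):
    """Step-back retrieval: ask for a broader question, retrieve on both.

    llm: prompt -> text (placeholder); retrieve: query -> list[str]
    """
    step_back_q = llm(f"Rewrite as a more general background question: {question}")
    # Broader docs first, specific docs second; downstream reranking can reorder
    return retrieve(step_back_q) + retrieve(question)
```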

3. HyDE (Hypothetical Document Embeddings)

Generate a hypothetical document answering the question, use its embedding to retrieve similar real documents:

Query: "Best practices for distributed training of LLMs"

Step 1: Generate hypothetical document: "Distributed training of LLMs involves data parallelism, model parallelism, and pipeline parallelism. Data parallelism divides the batch across GPUs. Model parallelism divides layers across devices..."
Step 2: Embed this hypothetical doc
Step 3: Search vector store for docs similar to this embedding
Step 4: Retrieve real docs that are semantically similar

Result: Retrieves practice-focused docs, not just theoretical ones.
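The four steps collapse into one function. In this sketch, `embed` and `vector_search` are placeholders for your embedding model and vector store, and `llm` for any prompt-to-text callable.

```python
def hyde_retrieve(question, llm, embed, vector_search, top_k=5):
    """HyDE sketch: embed a hypothetical answer instead of the question.

    llm: prompt -> text; embed: text -> vector;
    vector_search: (vector, top_k) -> list of docs (all placeholders)
    """
    # Step 1: hallucinate a plausible answer document
    hypothetical = llm(f"Write a short passage that answers: {question}")
    # Steps 2-4: embed it and search for real docs near it in vector space
    return vector_search(embed(hypothetical), top_k=top_k)
```

The trick is that a hypothetical answer sits closer to real answer documents in embedding space than the question itself does.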

Implementation Example

```python
from anthropic import Anthropic


def decompose_query(user_question: str,
                    model: str = "claude-3-5-sonnet-20241022") -> list[str]:
    """Decompose a complex question into sub-queries."""
    client = Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Break down this question into 3-5 specific sub-questions for retrieval.
Return as a numbered list.

Question: {user_question}"""
        }]
    )
    # Parse the numbered list into plain sub-query strings
    text = response.content[0].text
    sub_queries = [
        line.split('. ', 1)[1] if '. ' in line else line
        for line in text.split('\n')
        if line.strip()
    ]
    return sub_queries


def retrieve_for_decomposed_query(user_question: str, retriever) -> dict:
    """Retrieve for each sub-query and aggregate."""
    sub_queries = decompose_query(user_question)
    all_docs = {}
    for i, sub_q in enumerate(sub_queries):
        docs = retriever.retrieve(sub_q, top_k=5)
        all_docs[f"sub_query_{i}"] = {
            "query": sub_q,
            "docs": docs
        }
    return all_docs
```
Tradeoff: Decomposition increases API calls (one retrieval per sub-query). Use for complex questions. For simple questions, skip decomposition to save cost.
SECTION 04

Corrective RAG (CRAG)

Corrective RAG is a structured approach to handle retrieval failures. If retrieved documents don't answer the question, fall back to web search or knowledge expansion.

CRAG Flow

1. User question: "Latest breakthroughs in quantum computing 2026"
2. Retrieve from internal KB
3. Evaluator judges: Are docs relevant? (binary or scored)
4a. If YES: Pass to generation step
4b. If NO: Trigger web search (e.g., Google, Bing)
4c. If PARTIAL: Retrieve more + web search
5. Generate answer from best sources
6. Return with source attribution

Retrieval Evaluation

Key: How do we judge if retrieved docs are relevant? Three common approaches:

- LLM-as-judge: prompt a model to rate each document against the question (the approach used in the implementation below)
- Trained evaluator: a lightweight fine-tuned relevance classifier, as in the original CRAG paper
- Similarity threshold: reject retrievals whose query-document embedding scores fall below a cutoff

Implementation

```python
import json

from anthropic import Anthropic


def evaluate_retrieval(question: str, docs: list[str]) -> dict:
    """Evaluate if docs are relevant to the question."""
    client = Anthropic()
    docs_text = "\n\n".join(docs[:3])  # Top 3 docs
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Rate if these documents answer the question.

Question: {question}

Documents:
{docs_text}

Respond with JSON: {{"relevant": true/false, "reason": "..."}}"""
        }]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"relevant": False, "reason": "Parse error"}


def corrective_rag(question: str, retriever, web_searcher):
    """Corrective RAG pipeline."""
    # Step 1: Initial retrieval
    docs = retriever.retrieve(question, top_k=5)

    # Step 2: Evaluate
    eval_result = evaluate_retrieval(question, docs)

    # Step 3: Fallback if needed
    if not eval_result["relevant"]:
        print(f"Retrieved docs not relevant. Reason: {eval_result['reason']}")
        print("Falling back to web search...")
        web_docs = web_searcher.search(question, top_k=5)
        docs = docs + web_docs  # Combine

    # Step 4: Generate
    client = Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on these documents, answer: {question}

{chr(10).join([f"- {d}" for d in docs])}"""
        }]
    )
    return response.content[0].text
```
CRAG Advantage: Gracefully handles out-of-KB questions by falling back to web search. Prevents "I don't know" when external data could help.
SECTION 05

Self-RAG

Self-RAG extends the model with special tokens that control when to retrieve, evaluate relevance, and judge output quality. The model learns to use these tokens during fine-tuning.

Special Tokens in Self-RAG

- [Retrieve]: should documents be fetched at this point in generation?
- [IsRel]: is a retrieved passage relevant to the question?
- [IsSup]: is the generated segment supported by the retrieved passage?
- [IsUse]: is the overall response useful?

Generation with Self-RAG

Question: "What's the latest in quantum error correction?"

Model generates: "Recent advances in quantum error correction..."
[Retrieve: quantum error correction 2025 2026]
[Retrieved: 3 docs on quantum error correction]
[IsRel: yes] (docs are relevant)
"Surface codes have improved efficiency by..."
[IsRel: yes]
"The leading approach is topological codes, which..."
[IsSup: partial] (partially supported)
[IsUse: yes] (still useful)

Final answer synthesized from generation + retrieval steps.

Advantages Over Naive RAG

- Retrieves only when needed, saving cost and latency on questions the model can answer alone
- Critiques its own output, flagging unsupported claims instead of returning them silently
- Yields per-segment relevance and support judgments, enabling fine-grained source attribution

Implementation Note

Self-RAG requires fine-tuning to teach the model when to emit the special tokens. Off-the-shelf API models don't support this, but the paper (Asai et al. 2023) provides guidance. For production, you'd need to:

1. Build training data in which a critic model annotates generations with reflection tokens
2. Fine-tune an open-weights model on the token-augmented data
3. Add inference-time logic that acts on the tokens (trigger retrieval, filter unsupported segments)

Complexity Tradeoff: Self-RAG is powerful but requires fine-tuning. For many use cases, agentic patterns (decomposition, iteration) are sufficient without custom fine-tuning.
SECTION 06

Building Agentic RAG with LangGraph

LangGraph is a framework for building agentic systems. It models workflows as graphs with nodes (compute steps) and edges (transitions). Here's how to build agentic RAG:

Architecture

Implementation Example

```python
from typing import TypedDict

import anthropic
from langgraph.graph import END, StateGraph

MODEL = "claude-3-5-sonnet-20241022"


class RAGState(TypedDict):
    question: str
    docs: list[str]
    answer: str
    iterations: int
    max_iterations: int


def retrieve_node(state: RAGState) -> RAGState:
    """Retrieve documents based on the question."""
    retriever = VectorStore()  # Placeholder: use your actual vector store
    state["docs"] = retriever.search(state["question"], top_k=5)
    return state


def generate_node(state: RAGState) -> RAGState:
    """Generate answer from docs."""
    client = anthropic.Anthropic()
    prompt = f"""Based on these documents:
{chr(10).join([f"- {d}" for d in state["docs"]])}

Answer: {state["question"]}"""
    response = client.messages.create(
        model=MODEL, max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    state["answer"] = response.content[0].text
    return state


def evaluate_router(state: RAGState) -> str:
    """Route to END if the answer is grounded (or the budget is spent)."""
    if state["iterations"] >= state["max_iterations"]:
        return "done"
    client = anthropic.Anthropic()
    response = client.messages.create(
        model=MODEL, max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Is this answer well-grounded in the docs?
Docs: {state["docs"]}
Answer: {state["answer"]}
Reply with exactly one word: grounded or needs_refinement"""
        }]
    )
    verdict = response.content[0].text.strip().lower()
    return "done" if verdict.startswith("grounded") else "refine"


def refine_node(state: RAGState) -> RAGState:
    """Refine the question, then re-retrieve."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model=MODEL, max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Rephrase for better retrieval:
Original: {state["question"]}
The answer was not grounded. Suggest a new query."""
        }]
    )
    state["question"] = response.content[0].text
    state["iterations"] += 1
    return state


# Build graph: evaluation is a router on the edge out of "generate"
graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.add_node("refine", refine_node)
graph.add_edge("retrieve", "generate")
graph.add_conditional_edges("generate", evaluate_router,
                            {"done": END, "refine": "refine"})
graph.add_edge("refine", "retrieve")
graph.set_entry_point("retrieve")
rag_app = graph.compile()

# Run
result = rag_app.invoke({
    "question": "How does quantum entanglement work?",
    "docs": [], "answer": "",
    "iterations": 0, "max_iterations": 3
})
print(result["answer"])
```

Graph Execution

LangGraph handles the loop: Retrieve → Generate → Evaluate → (Refine or Done). You define nodes and edges; the framework manages execution.

LangGraph Advantage: Explicit control flow. Easy to debug (see which node runs). Integrates with LangChain tools, models, and evaluation. Supports streaming and real-time updates.
SECTION 07

Evaluation & Tracing

Evaluating agentic RAG systems is more complex than naive RAG because multi-hop reasoning makes ground truth harder to define.

RAGAS Metrics (Adapted for Agentic)

| Metric | Definition | For Agentic RAG |
|---|---|---|
| Retrieval Precision | % of retrieved docs that are relevant | Measure per retrieval step; agentic may have 3+ steps |
| Context Precision | Ranking of relevant docs in retrieval list | Track across all retrieval iterations |
| Faithfulness | Answer is supported by retrieved docs | Judge on final answer + all intermediate docs |
| Answer Relevance | Answer is relevant to question | Standard metric; agentic decomposition improves this |
| Latency | Time to generate answer | Agentic adds overhead; monitor iterations vs quality |
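Per-step retrieval precision can be computed directly from a trace once relevance labels exist. This is a sketch; the trace shape (a list of steps with `retrieved` and `relevant` fields) is an assumption.

```python
def retrieval_precision_per_step(trace):
    """Compute retrieval precision for each retrieval step in a trace.

    trace: list of steps, each {"retrieved": [doc, ...], "relevant": set_of_docs}
    Returns one precision value per step, so regressions in later hops are visible.
    """
    precisions = []
    for step in trace:
        retrieved = step["retrieved"]
        hits = sum(1 for d in retrieved if d in step["relevant"])
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
    return precisions
```

Averaging across steps hides problems; a pipeline whose first hop is precise but whose third hop retrieves noise needs the per-step view.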

Tracing Multi-Hop Reasoning

Key: Log each step to understand where errors occur:

Trace output (structured log):

```json
{
  "question": "What are the implications of quantum computing on blockchain?",
  "steps": [
    {
      "step": 1,
      "action": "decompose_query",
      "sub_queries": [
        "How does quantum computing work?",
        "How does blockchain cryptography work?",
        "How does quantum computing threaten blockchain?"
      ]
    },
    {
      "step": 2,
      "action": "retrieve",
      "sub_query": "How does quantum computing work?",
      "results": [
        {"doc": "Quantum...", "score": 0.89},
        {"doc": "Superposition...", "score": 0.87}
      ]
    },
    {
      "step": 3,
      "action": "evaluate_retrieval",
      "relevant": true,
      "reason": "Docs explain quantum fundamentals"
    },
    ...
  ],
  "final_answer": "...",
  "total_steps": 8,
  "total_tokens": 4500,
  "latency_ms": 3200
}
```
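A trace in this shape can be produced with a small helper. This `RAGTracer` class is an illustrative sketch, not a library API.

```python
import json
import time


class RAGTracer:
    """Minimal structured tracer for multi-hop RAG pipelines (sketch)."""

    def __init__(self, question: str):
        self.trace = {"question": question, "steps": [], "started": time.time()}

    def log(self, action: str, **fields):
        """Record one step (e.g. retrieve, evaluate_retrieval) with its fields."""
        self.trace["steps"].append({"step": len(self.trace["steps"]) + 1,
                                    "action": action, **fields})

    def finish(self, answer: str) -> str:
        """Attach the final answer and totals; return the trace as JSON."""
        self.trace["final_answer"] = answer
        self.trace["total_steps"] = len(self.trace["steps"])
        started = self.trace.pop("started")
        self.trace["latency_ms"] = int((time.time() - started) * 1000)
        return json.dumps(self.trace, indent=2)
```

Call `log()` after every retrieval, evaluation, and generation decision, then `finish()` once to emit the record for storage or a tracing UI.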

Quality vs Latency Tradeoff

Scenario: Q&A system with agentic RAG

Config 1: No decomposition, 1 retrieval pass
- Latency: 500ms
- Accuracy: 72%
- Cost: $0.05 per query

Config 2: Query decomposition, iterative refinement
- Latency: 2500ms
- Accuracy: 88%
- Cost: $0.20 per query

Trade-off: 5x slower, 4x more expensive, 16pt accuracy gain. For high-stakes (medical, legal): worth it. For customer chat: maybe not.

Debugging Agentic Pipelines

Best Practice: Instrument your agentic RAG with detailed logging. Every retrieval, evaluation, and generation decision should be traceable. Use LangSmith or similar for visualization and debugging.
SECTION 08

Production Deployment Checklist

Agentic RAG systems introduce failure modes that do not exist in naive retrieval pipelines. Before deploying to production, verify each of the following.

Retrieval: test recall@5 on a domain-representative question set with and without query decomposition; decomposition should not reduce recall on simple queries. Add a "no documents found" fallback so the agent does not hallucinate sources when retrieval returns empty results. Set a maximum retrieval budget (e.g. 5 retrieval steps per query) to bound latency.

Agent loop: implement a hard turn limit (e.g. 15 agent steps) to prevent infinite loops on ambiguous queries. Log every retrieval call and its results for post-hoc auditing. Add a confidence gate: if the agent's self-assessed confidence after retrieval is below a threshold, route to human review rather than auto-responding.
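The turn limit and confidence gate combine naturally into one wrapper around the agent loop. In this sketch, `step_fn` and `confidence_fn` are hypothetical hooks into your agent, not a specific framework's API.

```python
def guarded_agent(question, step_fn, confidence_fn, max_steps=15, threshold=0.7):
    """Hard turn limit plus confidence gate (sketch, hook names are assumptions).

    step_fn: state -> (answer_or_None, new_state); returns an answer when done.
    confidence_fn: (answer, state) -> float in [0, 1], the agent's self-assessment.
    """
    state = {"question": question}
    for _ in range(max_steps):  # hard turn limit: bounds loops on ambiguous queries
        answer, state = step_fn(state)
        if answer is not None:
            if confidence_fn(answer, state) < threshold:
                # Confidence gate: below threshold, route to a human instead
                return {"route": "human_review", "draft": answer}
            return {"route": "auto", "answer": answer}
    return {"route": "human_review", "draft": None}  # budget exhausted
```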

Evaluation: use LLM-as-judge scoring on a held-out test set, with separate metrics for faithfulness (does the answer follow from the retrieved context?) and correctness (is the answer factually accurate?). Track citation accuracy — each claim in the final answer should map to a specific retrieved passage. A faithfulness score below 0.85 on your test set is a signal to tighten the generation prompt or add a verification step.