SECTION 01
Beyond Naive RAG
Traditional RAG (Retrieval-Augmented Generation) is simple: take a user question, retrieve documents, pass documents + question to an LLM, get an answer. This works for straightforward queries but fails on complex questions that require reasoning, multi-hop retrieval, or adaptive strategies.
Limitations of Naive RAG:
- Single retrieval pass: If the first retrieval misses relevant docs, the model can't recover. No feedback loop.
- No query understanding: Complex questions with multiple sub-queries are treated as monolithic. "What are the causes, symptoms, and treatments of diabetes?" is a single retrieval, missing the fact that each part might need different docs.
- No self-correction: If the model realizes retrieved docs are irrelevant, it can't re-query. It has to work with what it got.
- Context window limits: Large document collections don't fit in context. Retrieving top-k docs is a bet that top-k is sufficient.
- No tool use: The model can only read; it can't run calculations, code, or access real-time data dynamically.
Agentic RAG solves these by adding planning, tool use, and iterative refinement. Instead of retrieve-once, agentic RAG decomposes questions, retrieves adaptively, evaluates results, and refines as needed.
Core Idea: Treat the LLM as an agent with tools. Tools include retrievers, web searchers, calculators, code executors, etc. The agent decides which tools to use and when, creating multi-step workflows that adapt to the question.
Why Agentic? An agent is an LLM that can take actions (use tools), observe outcomes, and decide next steps. Agentic RAG applies this to retrieval: the model doesn't passively read docs; it actively explores, evaluates, and refines.
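This act-observe-decide loop can be sketched in a few lines. The sketch below is illustrative, not a production agent: `llm_decide` stands in for an LLM call that picks a tool or a final answer, and `tools` is a plain dict of callables.

```python
def run_agent(question, tools, llm_decide, max_steps=5):
    """Minimal agent loop: each step, the LLM picks a tool or answers.

    tools: dict mapping tool names to callables.
    llm_decide: stand-in for an LLM call; returns ("tool", name, arg)
    to act, or ("answer", text) to finish.
    """
    observations = []
    for _ in range(max_steps):
        decision = llm_decide(question, observations)
        if decision[0] == "answer":
            return decision[1]
        _, name, arg = decision
        observations.append((name, tools[name](arg)))  # act, then observe
    return "No answer within max_steps"
```

Retrievers, web searchers, calculators, and code executors all fit the same `tools` interface; the loop doesn't care what a tool does, only that it returns an observation.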
SECTION 02
Core Patterns
Several patterns enable agentic RAG:
1. Iterative Retrieval
After generating an initial answer, the model checks if it's grounded in the retrieved docs. If not, it reformulates the query and retrieves again:
Loop:
1. User asks: "How does photosynthesis relate to climate?"
2. Retrieve docs on photosynthesis
3. Generate answer based on docs
4. Check: Is answer grounded? (Self-evaluation)
5. If not: Reformulate query ("climate change carbon cycle")
6. Retrieve again
7. Go to step 3 with expanded context
8. Repeat until confident
Result: Multi-hop reasoning without explicit prompting.
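The eight steps above can be sketched as a loop with pluggable parts. All four callables are stand-ins: back `retrieve` with a real index, and `generate`, `is_grounded`, and `reformulate` with LLM calls.

```python
def iterative_rag(question, retrieve, generate, is_grounded, reformulate,
                  max_rounds=3):
    """Retrieve-generate-check loop following the steps above."""
    context, query, answer = [], question, ""
    for _ in range(max_rounds):
        context += retrieve(query)               # steps 2 and 6: retrieve
        answer = generate(question, context)     # step 3: answer from docs
        if is_grounded(answer, context):         # step 4: self-evaluation
            return answer
        query = reformulate(question, context)   # step 5: reformulate query
    return answer                                # best effort after the cap
```

Note that context accumulates across rounds, so a second retrieval expands rather than replaces the evidence, which is what enables the multi-hop behavior.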
2. Query Decomposition
Complex questions are broken into sub-queries, each retrieved independently:
- "What is the history, current state, and future of AI?" → 3 sub-queries
- Retrieve docs for each sub-query
- Synthesize results into a cohesive answer
3. Self-Reflection (FLARE)
While generating, the model identifies missing information and retrieves on-demand:
- Model generates: "The Renaissance was a period of great innovation..."
- Model identifies: I need specific examples of Renaissance innovations
- Triggers retrieval: "Renaissance innovations art science"
- Continues generation with new context
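A rough sketch of this on-demand pattern: `generate_next` yields one sentence at a time, `confidence` stands in for the token-probability check used by FLARE, and `retrieve` uses the uncertain sentence itself as the search query.

```python
def flare_generate(question, generate_next, confidence, retrieve,
                   max_sentences=10, threshold=0.7):
    """FLARE-style active retrieval: fetch docs when confidence dips.

    generate_next(question, context, so_far) returns the next sentence,
    or None when generation is finished.
    """
    context, sentences = [], []
    for _ in range(max_sentences):
        sentence = generate_next(question, context, sentences)
        if sentence is None:               # generation finished
            break
        if confidence(sentence) < threshold:
            context += retrieve(sentence)  # fetch evidence on demand
            sentence = generate_next(question, context, sentences)  # redo
        sentences.append(sentence)
    return " ".join(sentences)
```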
4. Adaptive Retrieval
The model decides whether it needs retrieval at all: retrieve for factual questions, answer directly for pure reasoning, or mix the two:
- Question: "What's 2+2?" → Skip retrieval, answer directly
- Question: "Who was the first president of France?" → Retrieve
- Question: "Why is 2+2=4?" → Mix retrieval (math foundations) with reasoning
Pattern Advantage: These patterns are composable. Use query decomposition + iterative refinement + self-reflection for maximum coverage. Or use just one pattern if the question is simple.
SECTION 03
Query Decomposition
Breaking a complex question into logical sub-questions improves retrieval coverage.
Methods
1. Explicit Decomposition — Prompt the model to break down the question:
User question: "Compare climate policies in the EU and US"
System prompt: "Break this question into 3-5 sub-questions."
Model output:
1. What are current EU climate policies?
2. What are current US climate policies?
3. How do they differ in scope and ambition?
4. What has been the effectiveness of each approach?
5. What are future plans in both regions?
Action: Retrieve on each sub-question separately.
2. Step-Back Prompting — Ask more abstract questions first to capture broader context:
- Original: "What's the impact of federal interest rate decisions on housing prices in California?"
- Step-back questions:
- "How do federal interest rates affect housing markets?"
- "What economic factors drive California real estate?"
- Then retrieve on both abstract and original questions
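Retrieving on both the step-back and original queries can be sketched as a merge with de-duplication. Here `retrieve` is any `(query, top_k) -> [(doc_id, text)]` function you supply; the signature is an assumption of this sketch.

```python
def step_back_retrieve(original_query, step_back_queries, retrieve, top_k=3):
    """Merge results for abstract (step-back) queries and the original
    query, de-duplicating by document id."""
    seen, merged = set(), []
    for query in step_back_queries + [original_query]:
        for doc_id, text in retrieve(query, top_k):
            if doc_id not in seen:          # keep first occurrence only
                seen.add(doc_id)
                merged.append((doc_id, text))
    return merged
```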
3. HyDE (Hypothetical Document Embeddings)
Generate a hypothetical document answering the question, use its embedding to retrieve similar real documents:
Query: "Best practices for distributed training of LLMs"
Step 1: Generate hypothetical document:
"Distributed training of LLMs involves data parallelism, model
parallelism, and pipeline parallelism. Data parallelism divides
the batch across GPUs. Model parallelism divides layers across
devices..."
Step 2: Embed this hypothetical doc
Step 3: Search vector store for docs similar to this embedding
Step 4: Retrieve real docs that are semantically similar
Result: Retrieves practice-focused docs, not just theoretical ones.
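The four steps can be sketched with pluggable functions: `generate_hypothetical` stands in for an LLM call, `embed` for any embedding model, and `corpus` is an in-memory list of `(doc, vector)` pairs standing in for a vector store.

```python
import math

def hyde_search(question, generate_hypothetical, embed, corpus, top_k=3):
    """HyDE sketch: retrieve with the embedding of a hypothetical answer."""
    hypothetical = generate_hypothetical(question)   # Step 1: fake answer
    query_vec = embed(hypothetical)                  # Step 2: embed it

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    # Steps 3-4: rank real docs by similarity to the hypothetical doc
    ranked = sorted(corpus, key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```

The key design point is that the question never touches the vector store directly; only the hypothetical answer's embedding does.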
Implementation Example
from anthropic import Anthropic

def decompose_query(user_question: str, model: str = "claude-3-5-sonnet-20241022") -> list[str]:
    """Decompose a complex question into sub-queries."""
    client = Anthropic()
    response = client.messages.create(
        model=model,
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Break down this question into 3-5 specific sub-questions
for retrieval. Return as a numbered list.

Question: {user_question}"""
        }]
    )
    # Parse the numbered list into sub-queries, skipping any preamble lines
    text = response.content[0].text
    sub_queries = [
        line.split('. ', 1)[1]
        for line in text.split('\n')
        if line.strip() and line.strip()[0].isdigit() and '. ' in line
    ]
    return sub_queries

def retrieve_for_decomposed_query(user_question: str, retriever) -> dict:
    """Retrieve for each sub-query and aggregate."""
    sub_queries = decompose_query(user_question)
    all_docs = {}
    for i, sub_q in enumerate(sub_queries):
        docs = retriever.retrieve(sub_q, top_k=5)
        all_docs[f"sub_query_{i}"] = {
            "query": sub_q,
            "docs": docs
        }
    return all_docs
Tradeoff: Decomposition increases API calls (one retrieval per sub-query). Use for complex questions. For simple questions, skip decomposition to save cost.
SECTION 04
Corrective RAG (CRAG)
Corrective RAG is a structured approach to handle retrieval failures. If retrieved documents don't answer the question, fall back to web search or knowledge expansion.
CRAG Flow
1. User question: "Latest breakthroughs in quantum computing 2026"
2. Retrieve from internal KB
3. Evaluator judges: Are docs relevant? (binary or scored)
4a. If YES: Pass to generation step
4b. If NO: Trigger web search (e.g., Google, Bing)
4c. If PARTIAL: Retrieve more + web search
5. Generate answer from best sources
6. Return with source attribution
Retrieval Evaluation
Key: How do we judge if retrieved docs are relevant? Three approaches:
- LLM evaluator: Use Claude/GPT to rate relevance (slow but accurate)
- Heuristic evaluator: Check similarity score, keyword overlap, length (fast but brittle)
- Hybrid: Use heuristic first; if uncertain, fall back to LLM eval
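The hybrid strategy above can be sketched as a threshold check wrapped around an LLM judge. The `hi`/`lo` thresholds and the `llm_judge` callable are assumptions of this sketch, not fixed values.

```python
def hybrid_relevance(question, docs, similarity_scores, llm_judge,
                     hi=0.8, lo=0.4):
    """Cheap score check first; LLM judgment only in the uncertain band.

    llm_judge(question, docs) stands in for an LLM call returning a bool.
    """
    best = max(similarity_scores, default=0.0)
    if best >= hi:
        return True       # clearly relevant: skip the LLM call
    if best <= lo:
        return False      # clearly irrelevant: skip the LLM call
    return llm_judge(question, docs)   # uncertain: ask the LLM
```

In the common case the similarity score alone settles the question, so the expensive LLM evaluator only runs on the ambiguous middle band.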
Implementation
import json

from anthropic import Anthropic

def evaluate_retrieval(question: str, docs: list[str]) -> dict:
    """Evaluate if docs are relevant to the question."""
    client = Anthropic()
    docs_text = "\n\n".join(docs[:3])  # Top 3 docs
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Rate if these documents answer the question.

Question: {question}

Documents:
{docs_text}

Respond with only JSON: {{"relevant": true/false, "reason": "..."}}"""
        }]
    )
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"relevant": False, "reason": "Parse error"}

def corrective_rag(question: str, retriever, web_searcher):
    """Corrective RAG pipeline."""
    # Step 1: Initial retrieval
    docs = retriever.retrieve(question, top_k=5)
    # Step 2: Evaluate
    eval_result = evaluate_retrieval(question, docs)
    # Step 3: Fall back to web search if needed
    if not eval_result["relevant"]:
        print(f"Retrieved docs not relevant. Reason: {eval_result['reason']}")
        print("Falling back to web search...")
        web_docs = web_searcher.search(question, top_k=5)
        docs = docs + web_docs  # Combine both sources
    # Step 4: Generate
    doc_list = "\n".join(f"- {d}" for d in docs)
    client = Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Based on these documents, answer: {question}

{doc_list}"""
        }]
    )
    return response.content[0].text
CRAG Advantage: Gracefully handles out-of-KB questions by falling back to web search. Prevents "I don't know" when external data could help.
SECTION 05
Self-RAG
Self-RAG extends the model with special tokens that control when to retrieve, evaluate relevance, and judge output quality. The model learns to use these tokens during fine-tuning.
Special Tokens in Self-RAG
- [Retrieve]: Signals that retrieval is needed
- [IsRel]: Judges if a retrieved doc is relevant (yes/no)
- [IsSup]: Checks if generated text is supported by docs (yes/no/partial)
- [IsUse]: Evaluates utility (yes/no)
Generation with Self-RAG
Question: "What's the latest in quantum error correction?"
Model generates:
"Recent advances in quantum error correction..."
[Retrieve: quantum error correction 2025 2026]
[Retrieved: 3 docs on quantum error correction]
[IsRel: yes] (docs are relevant)
"Surface codes have improved efficiency by..."
[IsRel: yes]
"The leading approach is topological codes, which..."
[IsSup: partial] (partially supported)
[IsUse: yes] (still useful)
Final answer synthesized from generation + retrieval steps.
Advantages Over Naive RAG
- Selective retrieval: Model only retrieves when needed, not always
- Segment-level granularity: Can evaluate each sentence's support
- Beam search: Decoding keeps the top-k candidate segments, scored in part by the critique tokens and retrieval quality
- Interpretable: Special tokens show reasoning: when to retrieve, if docs are relevant
Implementation Note
Self-RAG requires fine-tuning to teach the model when to emit its special tokens. Off-the-shelf hosted models don't support them, but the paper (Asai et al. 2023) provides guidance. For production, you'd need to:
- Create training data with [Retrieve], [IsRel], etc. tokens annotated
- Fine-tune your base model (LLaMA, etc.) with this data
- Implement custom decoding to interpret special tokens
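As a rough illustration of the custom-decoding step, here is a post-hoc parser that separates Self-RAG-style control tokens from the answer text. A real implementation would intercept these tokens during decoding rather than parse them afterward; the token syntax below matches the trace shown above, not an official format.

```python
import re

def parse_self_rag(generation):
    """Split a generated string into answer text and control-token actions."""
    # Capture the token name and its optional ":" argument
    actions = re.findall(
        r"\[(Retrieve|IsRel|IsSup|IsUse)(?::\s*([^\]]*))?\]", generation)
    # Strip every bracketed annotation to recover the plain answer text
    text = re.sub(r"\s*\[[^\]]*\]\s*", " ", generation).strip()
    return text, [(name, arg.strip()) for name, arg in actions]
```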
Complexity Tradeoff: Self-RAG is powerful but requires fine-tuning. For many use cases, agentic patterns (decomposition, iteration) are sufficient without custom fine-tuning.
SECTION 06
Building Agentic RAG with LangGraph
LangGraph is a framework for building agentic systems. It models workflows as graphs with nodes (compute steps) and edges (transitions). Here's how to build agentic RAG:
Architecture
- Nodes: Retrieve, Generate, Evaluate, Refine
- Edges: Conditional transitions based on evaluation
- State: Shared context (question, docs, answer)
Implementation Example
from langgraph.graph import StateGraph, END
from typing import TypedDict
import anthropic

class RAGState(TypedDict):
    question: str
    docs: list[str]
    answer: str
    verdict: str
    iterations: int
    max_iterations: int

def retrieve_node(state: RAGState) -> RAGState:
    """Retrieve documents based on the current question."""
    retriever = VectorStore()  # Placeholder: use your actual vector store
    docs = retriever.search(state["question"], top_k=5)
    return {**state, "docs": docs}

def generate_node(state: RAGState) -> RAGState:
    """Generate an answer from the retrieved docs."""
    client = anthropic.Anthropic()
    doc_list = "\n".join(f"- {d}" for d in state["docs"])
    prompt = f"""Based on these documents:
{doc_list}

Answer: {state["question"]}"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return {**state, "answer": response.content[0].text}

def evaluate_node(state: RAGState) -> RAGState:
    """Ask the model whether the answer is grounded; store the verdict."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Is this answer well-grounded in the docs?
Docs: {state["docs"]}
Answer: {state["answer"]}
Reply with exactly one word: grounded or needs_refinement"""
        }]
    )
    return {**state, "verdict": response.content[0].text.strip().lower()}

def route_after_evaluate(state: RAGState) -> str:
    """Decide whether to finish or refine, capping the iteration count."""
    if state["verdict"].startswith("grounded"):
        return "done"
    if state["iterations"] >= state["max_iterations"]:
        return "done"  # give up gracefully rather than loop forever
    return "refine"

def refine_node(state: RAGState) -> RAGState:
    """Rephrase the question for better retrieval."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Rephrase for better retrieval:
Original: {state["question"]}
The previous answer was not grounded. Suggest one new query."""
        }]
    )
    return {**state,
            "question": response.content[0].text.strip(),
            "iterations": state["iterations"] + 1}

# Build graph
graph = StateGraph(RAGState)
graph.add_node("retrieve", retrieve_node)
graph.add_node("generate", generate_node)
graph.add_node("evaluate", evaluate_node)
graph.add_node("refine", refine_node)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "generate")
graph.add_edge("generate", "evaluate")
graph.add_conditional_edges("evaluate", route_after_evaluate,
                            {"done": END, "refine": "refine"})
graph.add_edge("refine", "retrieve")
rag_app = graph.compile()

# Run
result = rag_app.invoke({
    "question": "How does quantum entanglement work?",
    "docs": [],
    "answer": "",
    "verdict": "",
    "iterations": 0,
    "max_iterations": 3
})
print(result["answer"])
Graph Execution
LangGraph handles the loop: Retrieve → Generate → Evaluate → (Refine or Done). You define nodes and edges; the framework manages execution.
LangGraph Advantage: Explicit control flow. Easy to debug (see which node runs). Integrates with LangChain tools, models, and evaluation. Supports streaming and real-time updates.
SECTION 07
Evaluation & Tracing
Evaluating agentic RAG systems is more complex than naive RAG because multi-hop reasoning makes ground truth harder to define.
RAGAS Metrics (Adapted for Agentic)
| Metric | Definition | For Agentic RAG |
|---|---|---|
| Retrieval Precision | % of retrieved docs that are relevant | Measure per retrieval step; agentic may have 3+ steps |
| Context Precision | Ranking of relevant docs in the retrieval list | Track across all retrieval iterations |
| Faithfulness | Answer is supported by retrieved docs | Judge the final answer against all intermediate docs |
| Answer Relevance | Answer is relevant to the question | Standard metric; agentic decomposition improves it |
| Latency | Time to generate an answer | Agentic adds overhead; monitor iterations vs quality |
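Per-step retrieval precision can be computed directly from a trace whose steps look like `{"step": 2, "action": "retrieve", "results": [{"doc": ..., "score": ...}]}` (the shape used in the trace example in this section). The ground-truth set of relevant docs is assumed to be known, e.g. from an annotated eval set.

```python
def per_step_precision(steps, relevant_docs):
    """Retrieval precision for each retrieval step in an agentic trace."""
    precision = {}
    for step in steps:
        if step.get("action") != "retrieve":
            continue  # skip decompose/evaluate/generate steps
        retrieved = [r["doc"] for r in step["results"]]
        hits = sum(1 for doc in retrieved if doc in relevant_docs)
        precision[step["step"]] = hits / len(retrieved) if retrieved else 0.0
    return precision
```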
Tracing Multi-Hop Reasoning
Key: Log each step to understand where errors occur:
Trace output (structured log):
{
"question": "What are the implications of quantum computing on blockchain?",
"steps": [
{
"step": 1,
"action": "decompose_query",
"sub_queries": [
"How does quantum computing work?",
"How does blockchain cryptography work?",
"How does quantum computing threaten blockchain?"
]
},
{
"step": 2,
"action": "retrieve",
"sub_query": "How does quantum computing work?",
"results": [
{"doc": "Quantum...", "score": 0.89},
{"doc": "Superposition...", "score": 0.87}
]
},
{
"step": 3,
"action": "evaluate_retrieval",
"relevant": true,
"reason": "Docs explain quantum fundamentals"
},
...
],
"final_answer": "...",
"total_steps": 8,
"total_tokens": 4500,
"latency_ms": 3200
}
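A minimal tracer that emits logs in roughly this shape; the field names beyond those shown above, and the omission of token counts, are choices of this sketch rather than a standard format.

```python
import json
import time

class StepTracer:
    """Collects per-step records and serializes them as a structured log."""

    def __init__(self, question):
        self.trace = {"question": question, "steps": []}
        self.start = time.time()

    def log(self, action, **details):
        """Append one step record; extra keyword args become fields."""
        self.trace["steps"].append({
            "step": len(self.trace["steps"]) + 1,
            "action": action,
            **details,
        })

    def finish(self, final_answer):
        """Attach the answer and summary fields, return the JSON log."""
        self.trace["final_answer"] = final_answer
        self.trace["total_steps"] = len(self.trace["steps"])
        self.trace["latency_ms"] = int((time.time() - self.start) * 1000)
        return json.dumps(self.trace, indent=2)
```

Call `log` once per decomposition, retrieval, evaluation, or generation step, then `finish` when the final answer is ready.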
Quality vs Latency Tradeoff
Scenario: Q&A system with agentic RAG
Config 1: No decomposition, 1 retrieval pass
- Latency: 500ms
- Accuracy: 72%
- Cost: $0.05 per query
Config 2: Query decomposition, iterative refinement
- Latency: 2500ms
- Accuracy: 88%
- Cost: $0.20 per query
Trade-off: 5x slower, 4x more expensive, 16pt accuracy gain.
For high-stakes (medical, legal): worth it. For customer chat: maybe not.
Debugging Agentic Pipelines
- Trace logs: Which step failed? Was retrieval insufficient? Was generation poor?
- Human evaluation: Sample 100 outputs; have humans rate them on a 0-3 scale (incorrect, partially correct, correct, excellent)
- Failure analysis: Group failures by type (retrieval miss, reasoning error, hallucination)
- A/B tests: Compare agentic vs naive RAG on same questions
Best Practice: Instrument your agentic RAG with detailed logging. Every retrieval, evaluation, and generation decision should be traceable. Use LangSmith or similar for visualization and debugging.