
Reasoning RAG

RAG that uses reasoning models (o1, R1) to plan multi-step retrieval and synthesize answers requiring deep inference across multiple documents

Multi-Hop
Retrieval
Plan → Retrieve → Reason
Pipeline
Deep Synthesis
Key Capability


SECTION 01

Beyond Standard RAG

Standard RAG has a fundamental limitation: it treats retrieval as a single-shot lookup. Embed the query, retrieve the top-k chunks, generate an answer. This works well when the answer is explicitly stated in a single chunk. It fails on questions that require connecting evidence across multiple documents, applying domain knowledge not present in any single chunk, or reasoning about relationships between retrieved facts.

Consider a question like: "Given our Q3 contract terms and the regulatory changes in the March circular, what adjustments do we need to make to our SLA commitments?" Answering this requires finding the relevant Q3 contract clauses, finding the March circular, understanding what changed, mapping the changes to SLA obligations, and synthesizing a recommendation. No single retrieved chunk contains this answer.

Reasoning RAG solves this by placing a reasoning-capable model (o1, o3-mini, DeepSeek R1, Claude with extended thinking) as the orchestrator. Rather than generating directly from retrieved context, the model first thinks: what sub-questions do I need to answer? What information gaps do I have? It then issues targeted retrieval calls, reasons over the results, and iterates until it has sufficient evidence to synthesize a grounded, accurate answer.

The key shift: In standard RAG, retrieval feeds generation. In Reasoning RAG, reasoning guides retrieval. The model decides what to retrieve, not just how to use what was retrieved.
SECTION 02

The Reasoning Layer

The reasoning layer is the component that plans multi-step retrieval. It can be implemented in several ways, each with different tradeoffs:

o1/o3-mini as orchestrator: OpenAI's reasoning models are trained with extended chain-of-thought that happens internally before responding. They excel at breaking down complex questions and knowing when they lack information. Use them as the orchestrator that decides what to retrieve; they call a retrieve(query) tool as part of their reasoning. High accuracy, high cost, higher latency.

DeepSeek R1 / Qwen-QwQ: Open-weight alternatives to o1 with explicit chain-of-thought (<think> tags). Can be self-hosted for cost control. Competitive with o1-mini on many reasoning benchmarks.

Claude with extended thinking: Claude Sonnet with thinking blocks enabled. The model produces visible reasoning steps before responding. Can be prompted to use a retrieve tool within this reasoning phase.

Standard model + explicit CoT prompt: The simplest approach: instruct a standard Claude/GPT model to "first plan what you need to retrieve, then retrieve, then answer." Less reliable than true reasoning models but requires no special APIs.
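As a sketch of this prompt-driven variant (the PLANNING_PROMPT wording and the "RETRIEVE:" line convention are illustrative assumptions, not a standard API), the model's plan can be parsed into retrieval calls with a few lines of Python:

```python
# Sketch: a standard model is instructed to emit its retrieval plan as
# "RETRIEVE: <query>" lines; we parse those into targeted search queries.
PLANNING_PROMPT = """Before answering, list the searches you need, one per line:
RETRIEVE: <focused query>
Write PLAN_DONE when finished. After I return results, answer the question."""

def parse_retrieval_plan(model_output: str) -> list[str]:
    """Extract the sub-queries the model asked for from its plan text."""
    queries = []
    for line in model_output.splitlines():
        line = line.strip()
        if line.upper().startswith("RETRIEVE:"):
            queries.append(line.split(":", 1)[1].strip())
    return queries

# Example plan text as a standard model might produce it
plan = """I need to compare the two SLA versions and check the regulation.
RETRIEVE: Q3 SLA uptime terms
RETRIEVE: Q4 SLA uptime terms
RETRIEVE: March 2026 regulatory update
PLAN_DONE"""

print(parse_retrieval_plan(plan))
```

Each parsed query then goes to your retriever, and the results are fed back in a second prompt for the answer turn.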

Reasoning RAG flow (conceptual):

User: "What changed between our Q3 and Q4 SLAs and how does the March 2026 regulatory update affect compliance?"

Model thinks (internal):
- I need Q3 SLA terms
- I need Q4 SLA terms
- I need the March 2026 regulatory update
- Then I need to map changes to compliance impact

Model calls: retrieve("Q3 SLA terms") → gets 3 chunks
Model calls: retrieve("Q4 SLA terms") → gets 3 chunks
Model calls: retrieve("March 2026 regulatory update") → gets 2 chunks

Model synthesizes: "Q3 had 99.5% uptime guarantee; Q4 raised this to 99.9%. The March 2026 update requires monthly compliance reporting for SLAs above 99.8%. Therefore, starting Q4, you are subject to the monthly reporting requirement..."
SECTION 03

Architecture Patterns

Several architecture patterns implement Reasoning RAG, ranging from simple to sophisticated:

IRCoT (Interleaved Retrieval + Chain of Thought): The model alternates between generating reasoning steps and retrieving supporting evidence. After each reasoning step, if the model identifies an information gap, it retrieves. This is the most natural pattern for reasoning models with tool use.

ReAct (Reason + Act): A general agent framework where the model alternates between Thought (reasoning about what to do), Action (calling a tool, in this case retrieve), and Observation (reading the retrieved chunks). ReAct was one of the first patterns to show that LLMs could reliably plan multi-step retrieval.

Self-RAG with reasoning: The model generates a retrieval decision token first ("do I need to retrieve for this?"), retrieves if needed, then generates a relevance judgment, and finally generates the answer. Adding a reasoning model to this loop improves the accuracy of both the retrieval decision and the relevance judgment.

Hierarchical retrieval with reasoning: First retrieve at coarse granularity (documents), use the reasoning model to select which documents are relevant, then retrieve at fine granularity (passages within those documents). Dramatically reduces noise from irrelevant chunks.
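A minimal sketch of the coarse-to-fine idea, using a toy in-memory corpus and keyword overlap as a stand-in for a real vector store:

```python
# Sketch of hierarchical retrieval: rank whole documents first, then rank
# passages only within the selected documents, filtering out noise early.
DOCS = {
    "sla-q3": ["Q3 SLA: uptime guarantee 99.5%.", "Q3 credits apply below 99.0%."],
    "sla-q4": ["Q4 SLA: uptime guarantee 99.9%.", "Q4 credits apply below 99.5%."],
    "ml-notes": ["Transformers use self-attention.", "BERT uses masked language modeling."],
}

def score(text: str, query: str) -> int:
    # Crude keyword overlap; replace with embedding similarity
    return sum(w.lower() in text.lower() for w in query.split())

def hierarchical_retrieve(query: str, n_docs: int = 2, n_passages: int = 2) -> list[str]:
    # Stage 1 (coarse): rank documents by aggregate relevance
    doc_scores = {d: score(" ".join(ps), query) for d, ps in DOCS.items()}
    top_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:n_docs]
    # Stage 2 (fine): rank passages, but only inside the selected documents
    candidates = [p for d in top_docs for p in DOCS[d]]
    return sorted(candidates, key=lambda p: score(p, query), reverse=True)[:n_passages]

print(hierarchical_retrieve("SLA uptime guarantee"))
```

The noise reduction comes from stage 2 never seeing passages from documents that stage 1 rejected.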

IRCoT is the pragmatic starting point: Implement retrieval as a tool, give it to a reasoning-capable model, and let the model decide when to call it. This is the simplest implementation of Reasoning RAG and works surprisingly well out of the box.
SECTION 04

Implementation

A complete Reasoning RAG implementation with Claude using the tool use API:

import anthropic

client = anthropic.Anthropic()

# --- Mock vector store (replace with Pinecone, Weaviate, etc.) ---
CORPUS = [
    {"id": "q3-sla-1", "text": "Q3 SLA: Uptime guarantee 99.5%. Measured monthly. Credits apply if below 99.0%."},
    {"id": "q4-sla-1", "text": "Q4 SLA: Uptime guarantee raised to 99.9%. Credits apply if below 99.5%."},
    {"id": "reg-mar26", "text": "March 2026 regulatory update: SLAs exceeding 99.8% uptime require monthly compliance reports submitted to regulator."},
    {"id": "transformer-1", "text": "Transformers use self-attention mechanisms to process sequences in parallel."},
    {"id": "bert-1", "text": "BERT is pre-trained with masked language modeling on bidirectional context."},
]

def retrieve(query: str, top_k: int = 3) -> list[dict]:
    # Simple keyword match; replace with semantic search
    results = [d for d in CORPUS if any(w.lower() in d["text"].lower() for w in query.split())]
    return results[:top_k] if results else CORPUS[:top_k]

TOOLS = [{
    "name": "retrieve",
    "description": (
        "Retrieve relevant document chunks from the knowledge base. "
        "Use multiple targeted queries to gather all information needed to answer the question."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Focused search query"},
            "top_k": {"type": "integer", "default": 3}
        },
        "required": ["query"]
    }
}]

SYSTEM = """You are a research assistant with access to a knowledge base.
Before answering, use the retrieve tool to gather all relevant information.
Think carefully about what sub-questions you need to answer, and retrieve for each one.
After gathering sufficient evidence, synthesize a comprehensive, grounded answer.
If you are uncertain about something, say so explicitly."""

def reasoning_rag(question: str, max_turns: int = 8) -> str:
    messages = [{"role": "user", "content": question}]
    retrieved_context = []
    for turn in range(max_turns):
        resp = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=2048,
            system=SYSTEM,
            tools=TOOLS,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason == "end_turn":
            # Final answer reached: return the last text block
            for block in reversed(resp.content):
                if hasattr(block, "text"):
                    return block.text
            break
        tool_results = []
        for block in resp.content:
            if block.type == "tool_use" and block.name == "retrieve":
                chunks = retrieve(block.input["query"], block.input.get("top_k", 3))
                retrieved_context.extend(chunks)
                content = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
                print(f"  Retrieved {len(chunks)} chunks for: {block.input['query']}")
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": content or "No relevant documents found.",
                })
        if tool_results:
            messages.append({"role": "user", "content": tool_results})
    return "Could not generate a complete answer within the turn budget."

if __name__ == "__main__":
    answer = reasoning_rag(
        "What changed between Q3 and Q4 SLAs and how does the March 2026 "
        "regulatory update affect our compliance obligations?"
    )
    print(answer)
SECTION 05

Chunking for Reasoning RAG

The optimal chunking strategy for Reasoning RAG differs from standard RAG because the reasoning model can issue multiple targeted queries and integrate results across chunks.

Larger chunks are acceptable: Standard RAG needs small chunks (200–400 tokens) to fit within the retrieved context window while maintaining density. In Reasoning RAG, the model can pull multiple targeted chunks and reason across them. Larger chunks (500–1000 tokens) that preserve complete ideas reduce the "split document" problem where a single concept is split across multiple chunks that get retrieved separately.

Rich metadata is critical: The reasoning model uses metadata to formulate targeted queries. Document date, author, topic tags, document type, and section headers all help the model construct precise retrieval queries like "March 2026 regulatory update" rather than generic semantic queries.

Hierarchical indexing: Index at multiple granularities: document level (for coarse filtering), section level (for medium precision), and paragraph level (for fine-grained facts). The reasoning model can navigate this hierarchy, first identifying which documents are relevant, then drilling into specific sections.

Common mistake: Using the same chunking strategy for Reasoning RAG as for standard RAG. Standard RAG needs small, dense chunks optimized for single-shot retrieval. Reasoning RAG benefits from larger, coherent chunks with rich metadata because the model is doing the integration work.
SECTION 06

Evaluation

Evaluating Reasoning RAG requires measuring both retrieval quality and reasoning quality, and their interaction.

Multi-hop accuracy: Test specifically on questions requiring 2–3 retrieval hops. Standard QA benchmarks (SQuAD, TriviaQA) don't test multi-hop retrieval. Use MuSiQue, HotpotQA, or 2WikiMultiHopQA for multi-hop evaluation.

Retrieval coverage: For a given question, what percentage of the "gold" supporting documents does the model actually retrieve? Low coverage → incomplete evidence → hallucinated gaps in the answer. Measure recall@k against manually annotated supporting document sets.
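Coverage itself is a simple set computation once gold supporting-document sets have been annotated; a sketch:

```python
# Sketch of retrieval coverage: the fraction of gold supporting documents
# that appeared in any retrieval hop for a question. Gold sets come from
# manual annotation; the IDs below are illustrative.
def retrieval_coverage(retrieved_ids: list[str], gold_ids: set[str]) -> float:
    """Recall of gold supporting docs over all hops for one question."""
    if not gold_ids:
        return 1.0
    return len(gold_ids & set(retrieved_ids)) / len(gold_ids)

# Three hops retrieved three chunks, but one gold document was missed
retrieved = ["q3-sla-1", "q4-sla-1", "transformer-1"]
gold = {"q3-sla-1", "q4-sla-1", "reg-mar26"}
print(retrieval_coverage(retrieved, gold))  # 2 of 3 gold docs found
```

Averaging this over a labeled question set gives the corpus-level coverage number to track.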

Faithfulness: Does each claim in the final answer trace back to a specific retrieved chunk? Use LLM-as-judge to verify that every factual claim in the answer is explicitly supported by retrieved evidence, not inferred from training knowledge.

Reasoning quality: For questions with known answers, measure exact match and F1 on the final answer. For complex synthesis questions, use LLM-as-judge with a detailed rubric comparing the answer to a reference answer written by a domain expert.
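For the exact-match/F1 side, a standard SQuAD-style token-level F1 can be computed as:

```python
# SQuAD-style token-level F1 between a predicted and a gold answer:
# harmonic mean of token precision and recall.
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Count tokens common to both strings (multiset intersection)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("uptime raised to 99.9%", "uptime guarantee raised to 99.9%"))
```

Exact match is the stricter special case where F1 is counted only when prediction and gold are identical after normalization.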

SECTION 07

Cost & Latency

Reasoning RAG is substantially more expensive and slower than standard RAG. Understanding the cost structure helps you decide when it's justified and how to control costs.

Cost drivers: (1) Reasoning model inference is typically 5–15x more expensive per token than a standard model. (2) Each retrieval round adds latency and context tokens. A 5-hop Reasoning RAG pipeline might use 20,000 input tokens vs 3,000 for standard RAG. (3) The reasoning model's internal chain-of-thought adds output tokens.
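A back-of-envelope sketch using the token figures above; the per-million-token prices and output-token counts are illustrative placeholders, not any provider's actual rates:

```python
# Rough per-query cost comparison. Token counts follow the text above
# (3k vs 20k input); prices assume the reasoning model is 5x the standard
# model per token, and are placeholders only.
standard  = {"in_tokens": 3_000,  "out_tokens": 500,   "price_per_m_in": 1.0, "price_per_m_out": 4.0}
reasoning = {"in_tokens": 20_000, "out_tokens": 2_000, "price_per_m_in": 5.0, "price_per_m_out": 20.0}

def query_cost(p: dict) -> float:
    # Dollars per query = (input + output token spend) at per-million rates
    return (p["in_tokens"] * p["price_per_m_in"] + p["out_tokens"] * p["price_per_m_out"]) / 1_000_000

c_std, c_rsn = query_cost(standard), query_cost(reasoning)
print(f"standard: ${c_std:.4f}  reasoning: ${c_rsn:.4f}  ratio: {c_rsn / c_std:.0f}x")
```

Under these assumptions the ratio lands in the 10–30x range cited below; your own token counts and rates will move it.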

Cost reduction strategies: (1) Route by complexity: send single-hop questions to standard RAG and reserve the reasoning model for genuinely multi-hop questions. (2) Cap the retrieval loop with a turn budget so a runaway reasoning chain cannot compound cost. (3) Self-host an open-weight reasoning model (DeepSeek R1, Qwen-QwQ) for high-volume workloads. (4) Cache retrieved chunks and final answers for recurring questions.

When it's worth it: Reasoning RAG typically costs 10–30x more per query than standard RAG. It's justified when: (a) the answer genuinely requires multi-hop reasoning, (b) the cost of an incorrect answer (missed clause, wrong compliance interpretation) exceeds the additional inference cost, or (c) you can demonstrate a clear accuracy improvement on your specific domain.