Advanced Reasoning

ReAct Prompting

Interleave reasoning (Thought) and action (Act) steps with observed results: the foundational prompt pattern for tool-using agents.

Loop: Thought → Act → Observe
Primary use: Tool use
Paper: Yao et al., 2022


SECTION 01

What ReAct is

A calculator can compute, but it doesn't know when to compute. A search engine can retrieve, but it doesn't know what to search for. ReAct is the pattern that connects language models to tools: it lets the model decide what action to take based on reasoning, observe the result, then reason about the next action.

ReAct stands for Reasoning + Acting. The model alternates between "thinking out loud" (Thought) and "calling a tool" (Action), with the tool's result (Observation) fed back in before the next thought. This loop continues until the model has enough information to give a final answer.

SECTION 02

The Thought-Action-Observation loop

```
Question: What is the population of the capital of France, and how does it
compare to New York City?

Thought: I need to find the capital of France first, then its population,
then NYC's population.
Action: search("capital of France")
Observation: Paris is the capital of France.

Thought: Now I need the population of Paris.
Action: search("population of Paris 2024")
Observation: Paris city population: ~2.1 million. Greater Paris area: ~12 million.

Thought: Now I need NYC's population.
Action: search("population of New York City 2024")
Observation: NYC population: ~8.3 million. Greater NYC metro: ~20 million.

Thought: I have all the information I need.
Final Answer: Paris's city proper population (~2.1M) is about 1/4 of NYC's
(~8.3M), though the broader metro areas are closer (Paris ~12M vs NYC ~20M).
```
SECTION 03

Why interleaving works

Two things make ReAct better than pure retrieval or pure reasoning:

- Reasoning guides acting: the Thought step decides which tool to call and with what arguments, so each action is targeted rather than a blind retrieval.
- Acting grounds reasoning: the Observation step feeds real results back into the trace, anchoring the model's next thought in evidence instead of letting it invent facts.

This makes ReAct especially powerful for multi-hop questions ("who was the president when X happened and what did they do about Y"), where each answer creates the next question.

SECTION 04

Implementing a ReAct agent

```python
import re

import anthropic

client = anthropic.Anthropic()

# Tool definitions
def search(query: str) -> str:
    """Stub: replace with a real search API."""
    return f"[Search results for: {query}]"

def calculator(expression: str) -> str:
    """Calculator with builtins disabled (still not safe for untrusted input)."""
    try:
        return str(eval(expression, {"__builtins__": {}}, {}))
    except Exception as e:
        return f"Error: {e}"

TOOLS = {"search": search, "calculator": calculator}

SYSTEM = """You have access to tools. Use them to answer questions.

Available tools:
- search(query): web search
- calculator(expression): math

Format:
Thought: [your reasoning]
Action: tool_name(argument)

When ready:
Final Answer: [your answer]"""

def react_loop(question: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": question}]
    for step in range(max_steps):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=500,
            system=SYSTEM,
            messages=messages,
        ).content[0].text
        messages.append({"role": "assistant", "content": response})

        if "Final Answer:" in response:
            return response.split("Final Answer:")[-1].strip()

        # Parse and execute the requested action
        action_match = re.search(r'Action:\s*(\w+)\((.+?)\)', response)
        if action_match:
            tool_name = action_match.group(1)
            argument = action_match.group(2).strip('"\'')
            if tool_name in TOOLS:
                observation = TOOLS[tool_name](argument)
                messages.append({"role": "user", "content": f"Observation: {observation}"})

    return "Max steps reached without answer."

print(react_loop("What year was the Eiffel Tower built, and how old is it in 2025?"))
```
SECTION 05

Prompt structure

```
## Minimal ReAct system prompt template

You are an assistant with access to tools. Use them to answer questions accurately.

## Available Tools

{tool_list}

## Format

For each step, write:

Thought: [why you're doing what you're doing]
Action: [tool_name]([argument])

The tool result will be provided as:

Observation: [result]

Repeat Thought/Action/Observation until you have enough information. Then write:

Final Answer: [complete answer to the original question]

## Rules

- Always think before acting.
- One action per turn.
- If a tool returns an error, try a different approach.
- Stop and answer once you have sufficient information.
- Maximum {max_steps} steps.
```
Keep the system prompt short: unlike standard assistants, ReAct agents generate their own Thought traces, and a long system prompt crowds out the space for the model's reasoning. The format instructions plus tool list should be under 300 tokens.
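That 300-token budget is easy to blow past as tools accumulate, so it is worth checking mechanically when the prompt is assembled. A minimal sketch using a rough 4-characters-per-token heuristic (an assumption; use your provider's tokenizer for exact counts):

```python
def approx_token_count(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    Use your provider's tokenizer for an exact count."""
    return max(1, len(text) // 4)

# Hypothetical assembled prompt, mirroring the template above
TOOL_LIST = "- search(query): web search\n- calculator(expression): math"
SYSTEM_PROMPT = f"""You are an assistant with access to tools.

## Available Tools
{TOOL_LIST}

## Format
Thought: [reasoning]
Action: tool_name(argument)
Observation: [result]
Final Answer: [answer]
"""

BUDGET = 300
estimate = approx_token_count(SYSTEM_PROMPT)
assert estimate < BUDGET, f"System prompt too long: ~{estimate} tokens"
```

A check like this in CI catches prompt bloat before it silently degrades the agent's reasoning room.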
SECTION 06

Safety and loop limits

```python
import anthropic

def safe_react_agent(question: str, max_steps: int = 10) -> dict:
    """ReAct with circuit breakers."""
    steps_taken = []
    total_tokens = 0
    MAX_TOKENS = 20_000  # Hard limit

    for step in range(max_steps):
        # Check token budget
        if total_tokens > MAX_TOKENS:
            return {
                "answer": "Budget exceeded",
                "steps": steps_taken,
                "stopped_by": "token_limit",
            }

        # ... generate next step ...
        steps_taken.append({"step": step, "thought": "...", "action": "..."})

        # Detect loops: same action 3 times in a row
        if len(steps_taken) >= 3:
            last_3_actions = [s["action"] for s in steps_taken[-3:]]
            if len(set(last_3_actions)) == 1:
                return {
                    "answer": "Stuck in loop",
                    "steps": steps_taken,
                    "stopped_by": "loop_detection",
                }

    return {"answer": "Max steps", "steps": steps_taken, "stopped_by": "max_steps"}

# Always cap iterations: unbounded ReAct loops can burn your API budget.
```

ReAct vs alternative agent patterns

ReAct's interleaved reasoning-action pattern differs from pure chain-of-thought (reasoning only, no actions) and pure tool use (actions without explicit reasoning traces). The key advantage of ReAct over tool use without reasoning is interpretability: the thought trace before each action reveals why the agent chose it, which makes debugging significantly easier when the agent makes incorrect tool calls. Compared to AutoGPT-style agents that plan extensively before acting, ReAct's step-by-step alternation lets the agent update its plan based on actual observation results rather than committing to a full plan upfront.

| Pattern | Reasoning | Actions | Best for |
| --- | --- | --- | --- |
| Chain-of-Thought | Explicit step-by-step | None | Mathematical reasoning |
| Tool-use only | Implicit | Yes | Simple single-tool tasks |
| ReAct | Explicit per-action | Yes, interleaved | Multi-step tool use |
| Plan-and-Execute | Full upfront plan | Sequential execution | Well-defined complex tasks |

ReAct failure modes and mitigations

The most common ReAct failure mode is reasoning hallucination: the model generates a plausible-looking thought that does not correctly reflect the available information, leading to incorrect tool calls that compound across subsequent steps. Mitigations include limiting reasoning to one inference step per action (preventing long reasoning chains from drifting), requiring the model to quote relevant information from previous observations in its reasoning, and validating observations by flagging thoughts whose stated beliefs contradict what was actually observed in previous steps.
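The quote-from-observations mitigation can be enforced programmatically. A minimal sketch, assuming the agent is instructed to put copied evidence in double quotes (both the convention and the `validate_thought` helper are illustrative, not part of the original paper):

```python
import re

def validate_thought(thought: str, observations: list[str]) -> list[str]:
    """Flag quoted claims in a Thought that appear in no prior Observation.

    Assumes the agent has been prompted to wrap copied evidence in
    double quotes, e.g.: Thought: The search said "Paris is the capital
    of France", so I now need its population.
    """
    context = " ".join(observations).lower()
    problems = []
    for quote in re.findall(r'"([^"]+)"', thought):
        if quote.lower() not in context:
            problems.append(f"Unsupported quote: {quote!r}")
    return problems

obs = ["Paris is the capital of France."]
ok = validate_thought('The result said "Paris is the capital of France".', obs)
bad = validate_thought('The result said "Lyon is the capital of France".', obs)
```

Exact substring matching is deliberately strict; in practice a fuzzy or embedding-based match tolerates paraphrase at the cost of weaker guarantees.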

Tool design for ReAct agents significantly affects reliability. Tools with narrow, well-defined input schemas produce fewer malformed tool calls than tools with flexible or ambiguous parameters. Parameter validation with informative error messages (rather than silently accepting invalid inputs) helps the agent self-correct when it provides incorrect arguments, because the error message in the observation feeds back to the reasoning step, where the model can identify the problem. Each tool should have a single clear purpose; combining multiple actions in one tool increases the probability of partial-success states that confuse the agent's subsequent reasoning.
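As an illustration of these principles, a hypothetical `get_weather` tool with a narrow schema and informative validation errors (the tool and its messages are invented for this sketch):

```python
def get_weather(city: str, unit: str = "celsius") -> str:
    """Hypothetical weather tool with strict input validation.

    Error strings flow back to the agent as Observations, giving it
    enough detail to correct its next call.
    """
    if not isinstance(city, str) or not city.strip():
        return "Error: 'city' must be a non-empty string, e.g. get_weather('Paris')."
    if unit not in ("celsius", "fahrenheit"):
        return f"Error: 'unit' must be 'celsius' or 'fahrenheit', got {unit!r}."
    # Stub: call a real weather API here
    return f"[Weather for {city.strip()} in {unit}]"
```

Note that the error messages name the offending parameter and show a valid example, which is exactly what the model needs in the next Thought step.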

Maximum step limits are a critical safety mechanism for production ReAct agents: they prevent runaway loops from consuming unbounded tokens and incurring unlimited costs. Most ReAct frameworks accept a max_iterations or max_steps parameter that terminates the agent after a configurable number of thought-action-observation cycles. Setting this limit requires calibrating against the expected step count distribution for legitimate tasks; too low causes premature termination on complex tasks, too high allows runaway agents to consume excessive resources before stopping. Logging the distribution of step counts in production and setting the limit at the 99th percentile plus a buffer provides a data-driven approach to limit calibration.
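The percentile calibration described above can be automated from logged step counts. A sketch using only the standard library (the log data is invented for illustration):

```python
import statistics

def calibrate_step_limit(logged_step_counts: list[int], buffer: int = 2) -> int:
    """Set max_steps at the 99th percentile of observed step counts plus a buffer."""
    p99 = statistics.quantiles(logged_step_counts, n=100)[98]  # 99th percentile
    return int(p99) + buffer

# e.g. production logs where most legitimate tasks finish in 3 to 7 steps
logs = [3, 4, 4, 5, 3, 6, 4, 5, 7, 4] * 50 + [9, 10]
limit = calibrate_step_limit(logs)
```

Recomputing this periodically keeps the limit tracking real workload complexity instead of a guess made at launch.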

Prompt engineering for ReAct agents is substantially more complex than for single-turn prompts because the prompt must define the tool interface, reasoning format, and stopping criteria in addition to the task objective. The most reliable ReAct prompts use few-shot examples that demonstrate the exact thought-action-observation format for representative tasks, including examples of successful completion and examples of the agent recognizing when it cannot complete a task and gracefully stopping. Models that have been instruction-tuned for tool use (such as GPT-4 with function calling or Claude with tool use) require less prompt engineering than base models because the tool-use format is part of their instruction tuning.

ReAct agent evaluation requires task-specific success metrics beyond simple completion rate. For information-seeking tasks, precision and recall of the retrieved information relative to a reference answer measures quality. For action-taking tasks, task completion rate and the number of steps taken (efficiency) measure both success and cost. Trajectory evaluation, examining the full thought-action-observation sequence rather than just the final output, reveals systematic reasoning errors that completion metrics miss, such as consistently choosing suboptimal tools or making the same logical error on similar task types. Recording and analyzing agent trajectories in production is essential for identifying improvement opportunities.
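For information-seeking tasks, the precision/recall computation might look like this sketch, assuming retrieved facts have already been normalized into comparable strings (that normalization is the hard part in practice):

```python
def retrieval_metrics(retrieved: set[str], reference: set[str]) -> dict:
    """Precision/recall of an agent's retrieved facts against a reference answer key."""
    if not retrieved:
        return {"precision": 0.0, "recall": 0.0}
    true_positives = len(retrieved & reference)
    return {
        "precision": true_positives / len(retrieved),  # how much of what it found is right
        "recall": true_positives / len(reference),     # how much of the answer it found
    }

reference = {"paris is capital of france", "paris pop ~2.1m", "nyc pop ~8.3m"}
retrieved = {"paris is capital of france", "nyc pop ~8.3m", "eiffel tower built 1889"}
metrics = retrieval_metrics(retrieved, reference)
```

Here the agent found two of three reference facts plus one irrelevant one, so both precision and recall come out to 2/3.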

Structured output constraints can improve ReAct agent reliability by enforcing a specific JSON format for actions that is parsed programmatically rather than extracted from free text. Having the model output actions as {"tool": "search", "query": "..."} rather than "I will call the search tool with query ..." eliminates the free-text parsing step that introduces errors when the model deviates from the expected format. JSON-mode or function calling interfaces natively support this structured output approach, and models with native function calling support (GPT-4, Claude) produce more reliable structured outputs than models that must be prompted to produce JSON through the system prompt alone.
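A sketch of programmatic parsing for JSON-formatted actions (the `ALLOWED_TOOLS` schema and error messages are illustrative assumptions; native function calling APIs handle most of this for you):

```python
import json

# Hypothetical tool registry: tool name -> required argument keys
ALLOWED_TOOLS = {"search": {"query"}, "calculator": {"expression"}}

def parse_action(raw: str) -> dict:
    """Parse and validate a JSON action like {"tool": "search", "query": "..."}.

    Raises ValueError with a message suitable for feeding back to the
    agent as an Observation so it can retry with a corrected action.
    """
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Action is not valid JSON: {e}")
    tool = action.get("tool")
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"Unknown tool {tool!r}; expected one of {sorted(ALLOWED_TOOLS)}")
    missing = ALLOWED_TOOLS[tool] - action.keys()
    if missing:
        raise ValueError(f"Missing argument(s) for {tool}: {sorted(missing)}")
    return action
```

Routing the ValueError text back as an Observation turns malformed actions into a self-correction opportunity instead of a crash.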