Advanced Reasoning

Tree of Thoughts

Explore a tree of reasoning steps with evaluation and backtracking — for hard problems where the first approach reliably fails.

Mechanism: tree search
Key capability: backtracking
Cost: expensive; rare use

SECTION 01

CoT vs ToT: what's the difference

Chain-of-thought is a single path: the model walks forward, step by step, and commits to each step as it goes. If it takes a wrong turn early, it keeps walking down that wrong path.

Tree of Thoughts (ToT) is a search: the model generates multiple candidate next steps at each point, evaluates which look most promising, pursues the best ones, and backtracks if they lead nowhere. It's the difference between a hiker who walks confidently in one direction vs one who explores different trails and turns back when blocked.

When the distinction matters: Some problems have "cliffs" — a wrong early decision makes the rest of the solution impossible, no matter how well you reason afterward. CoT falls off the cliff and keeps going. ToT can backtrack.
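The cliff can be made concrete with a toy search space (the graph, scores, and names below are invented purely to illustrate the contrast; no LLM calls involved):

```python
# Toy illustration of the cliff: greedy CoT commits to the locally
# best-looking step and gets stuck; depth-first ToT backtracks out.
CHILDREN = {
    "start": ["trap", "detour"],
    "trap": [],                  # scores well, but it's a dead end
    "detour": ["goal"],
    "goal": [],
}
SCORE = {"trap": 9, "detour": 3, "goal": 10}

def cot(state="start"):
    """Single greedy chain: take the best-looking child at every step."""
    while CHILDREN[state]:
        state = max(CHILDREN[state], key=SCORE.get)
    return state  # wherever the chain happens to end

def tot(state="start"):
    """Depth-first search with backtracking over the same choices."""
    if state == "goal":
        return state
    for child in sorted(CHILDREN[state], key=SCORE.get, reverse=True):
        found = tot(child)
        if found:
            return found
    return None  # dead end: backtrack to the caller

print(cot())  # prints "trap": the greedy chain fell off the cliff
print(tot())  # prints "goal": the search backtracked out of the trap
```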
SECTION 02

How the tree works

Three components make ToT work: a generator that proposes candidate next steps, an evaluator that scores how promising each candidate looks, and a search strategy that decides which branches to pursue and when to backtrack.

Problem: write 4 words, each starting with the successive letters of the word "GAME".

ToT search:
State: []                  Candidates: [Gorilla, Garden, Galaxy, Grab]  Eval: all valid → choose Gorilla
State: [Gorilla]           Candidates: [Apple, Ancient, Argue, About]   Eval: all valid → choose Ancient
State: [Gorilla, Ancient]  Candidates: [Mountain, Moon, Metal, Music]   ...continue through E
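The walkthrough above can be run as a tiny deterministic search. The E-row candidates and the length-based scorer below are invented for this sketch (a real ToT would score candidates with an LLM call):

```python
# Deterministic toy version of the "GAME" walkthrough: at each depth,
# propose candidate words for the next letter, score them, keep the best.
CANDIDATES = {
    "G": ["Gorilla", "Garden", "Galaxy", "Grab"],
    "A": ["Apple", "Ancient", "Argue", "About"],
    "M": ["Mountain", "Moon", "Metal", "Music"],
    "E": ["Echo", "Eagle", "Ember", "Exit"],   # invented for illustration
}

def evaluate(state, word):
    # Toy evaluator: prefer longer words (stand-in for an LLM score)
    return len(word)

def solve(target="GAME"):
    state = []
    for letter in target:
        valid = [w for w in CANDIDATES[letter] if w.startswith(letter)]
        state.append(max(valid, key=lambda w: evaluate(state, w)))
    return state

print(solve())  # prints ['Gorilla', 'Ancient', 'Mountain', 'Eagle']
```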
SECTION 03

Practical ToT implementation

import anthropic
from typing import List

client = anthropic.Anthropic()

def generate_thoughts(problem: str, current_state: str, n: int = 3) -> List[str]:
    """Generate n candidate next reasoning steps."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=400,
        messages=[{"role": "user", "content": f"""
Problem: {problem}
Progress so far: {current_state if current_state else "Just starting."}

Generate {n} different possible next steps to take.
Each step should be a distinct approach or direction.
Format: one step per line, numbered.
"""}],
    ).content[0].text
    # Keep only the numbered lines the model returned
    lines = [l.strip() for l in response.split("\n") if l.strip() and l[0].isdigit()]
    return lines[:n]

def evaluate_thought(problem: str, state: str, thought: str) -> float:
    """Score a reasoning step 1-10."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=50,
        messages=[{"role": "user", "content": f"""
Problem: {problem}
Current state: {state}
Proposed next step: {thought}

Rate how promising this step is for solving the problem.
Reply with ONLY a number from 1 to 10. Nothing else.
"""}],
    ).content[0].text.strip()
    try:
        return float(response.split()[0])
    except (ValueError, IndexError):
        return 5.0  # unparsable score: treat as neutral

def tree_of_thoughts(problem: str, max_depth: int = 3, beam_width: int = 2) -> str:
    """Beam search over the reasoning tree."""
    # Each beam: (state_string, score)
    beams = [("", 10.0)]
    for depth in range(max_depth):
        candidates = []
        for state, _ in beams:
            thoughts = generate_thoughts(problem, state, n=3)
            for thought in thoughts:
                score = evaluate_thought(problem, state, thought)
                candidates.append((state + "\n" + thought, score))
        # Keep top beam_width candidates
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]
    # Final answer from the best beam
    best_state = beams[0][0]
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=300,
        messages=[{"role": "user", "content": f"""
Problem: {problem}
Reasoning steps taken:
{best_state}

Based on this reasoning, what is the final answer?
"""}],
    ).content[0].text
    return response
SECTION 04

When ToT is worth it

Reality check: Genuine ToT with backtracking requires 5–30× more API calls than CoT. For most tasks, CoT + self-consistency gets you 90% of the benefit at 5% of the cost. Use ToT for the 5% of problems where the first approach reliably fails and the domain allows meaningful branching.
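As a sanity check on that multiplier, here is a small call counter for a beam-search ToT of the shape shown in Section 03, assuming one generation call per active beam per level, one evaluation call per candidate, and one final-answer call:

```python
def tot_call_count(depth=3, beam_width=2, branching=3):
    """Estimate API calls for a beam-search ToT (vs 1 call for plain CoT)."""
    gen_calls = eval_calls = 0
    active = 1                              # search starts from one empty state
    for _ in range(depth):
        gen_calls += active                 # one generation call per active beam
        eval_calls += active * branching    # one evaluation call per candidate
        active = beam_width                 # prune back down to beam_width beams
    return gen_calls + eval_calls + 1       # plus one final-answer call

print(tot_call_count())  # prints 21: roughly 21x the cost of a single chain
```

With the defaults this comes to 21 calls against 1, consistent with the low end of the 5–30× range; deeper trees and wider beams push it toward the high end.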
SECTION 05

Simpler alternatives that often work

## Alternative 1: "Parallel CoT" (cheap ToT approximation)
# Generate N full reasoning chains, pick the best (or majority-vote).
# This is just self-consistency — captures some ToT benefit.

## Alternative 2: "Plan first, then execute"
# Step 1: Generate a plan / outline
# Step 2: Execute each step of the plan
# No backtracking, but forces the model to think about structure first.

prompt_plan_then_execute = """
Problem: [hard problem here]

First, write a step-by-step plan (don't solve yet — just plan).
Then execute each step of your plan in order.
If a step doesn't work, note it and try an alternative before continuing.
"""

## Alternative 3: "Devil's advocate" iteration
# Step 1: Generate a solution
# Step 2: Ask a separate call to find flaws in that solution
# Step 3: Revise based on flaws
# 2–3 iterations catches most issues without a full tree
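Alternative 1 fits in a few lines. The `ask` callable below is a stand-in for whatever produces one complete reasoning chain's final answer (one API call per sample in practice):

```python
from collections import Counter

def self_consistency(ask, problem, n=5):
    """Sample n independent chains and majority-vote their final answers.
    `ask(problem)` is any callable returning one answer string."""
    answers = [ask(problem) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stub to show the voting behavior:
samples = iter(["4", "5", "4", "4", "6"])
print(self_consistency(lambda p: next(samples), "2 + 2?"))  # prints "4"
```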
SECTION 06

LangGraph for real ToT

from langgraph.graph import StateGraph, END
from typing import TypedDict, List

# Real ToT with backtracking needs a proper graph framework

class ToTState(TypedDict):
    problem: str
    current_path: List[str]
    depth: int
    failed_paths: List[List[str]]
    final_answer: str

def expand_node(state: ToTState) -> ToTState:
    """Generate and evaluate candidate next steps."""
    # ... generate thoughts, score them, add the best to current_path,
    # and increment depth
    return state

def is_dead_end(state: ToTState) -> bool:
    """Stub: return True when no candidate scores above a threshold."""
    return False

def should_backtrack(state: ToTState) -> str:
    """Decide: continue, backtrack, or conclude."""
    if state["depth"] >= 4:
        return "conclude"
    if is_dead_end(state):
        return "backtrack"
    return "continue"

def backtrack(state: ToTState) -> ToTState:
    """Remove the last step from the path and mark it as failed."""
    state["failed_paths"].append(state["current_path"].copy())
    state["current_path"].pop()
    state["depth"] -= 1
    return state

# Build the graph
g = StateGraph(ToTState)
g.add_node("expand", expand_node)
g.add_node("backtrack", backtrack)
g.add_conditional_edges(
    "expand",
    should_backtrack,
    {"continue": "expand", "backtrack": "backtrack", "conclude": END},
)
g.add_edge("backtrack", "expand")
g.set_entry_point("expand")
graph = g.compile()

Tree of Thoughts Search Strategies

Tree of Thoughts (ToT) extends chain-of-thought reasoning by generating multiple candidate reasoning steps at each decision point, evaluating their promise, and searching through the resulting tree of possibilities using standard search algorithms. This explicit deliberative search process enables solving problems that require exploration and backtracking — game playing, creative writing with constraints, multi-step planning — where a single linear reasoning chain inevitably gets stuck in dead ends.

Search Strategy         | Algorithm                | Completeness       | Token Cost | Best For
Breadth-first search    | BFS across thought tree  | Complete (bounded) | High       | Short-horizon problems
Depth-first search      | DFS with backtracking    | Complete (bounded) | Medium     | Long reasoning chains
Beam search             | Keep top-K at each level | Approximate        | Moderate   | Quality/cost balance
Monte Carlo Tree Search | UCB + simulation         | Approximate        | Very high  | Complex multi-step games
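The depth-first row can be sketched as a recursive search with pruning and backtracking. `generate` and `evaluate` are passed in as stand-ins for the LLM calls, so the sketch stays runnable:

```python
def dfs_tot(state, generate, evaluate, depth=0, max_depth=3, threshold=5.0):
    """Depth-first ToT: explore the best-scoring child first, prune
    candidates below threshold, and backtrack out of dead ends."""
    if depth == max_depth:
        return [state]                      # depth budget reached: accept leaf
    scored = [(evaluate(state, t), t) for t in generate(state)]
    for score, thought in sorted(scored, reverse=True):
        if score < threshold:
            break                           # everything left scores lower: prune
        result = dfs_tot(thought, generate, evaluate,
                         depth + 1, max_depth, threshold)
        if result:
            return [state] + result
    return None                             # all branches failed: backtrack
```

A toy graph exercises both the pruning and the backtracking: a high-scoring branch that dead-ends is abandoned, and the search falls through to the branch that reaches full depth.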

The thought evaluation step is the most critical and expensive component of Tree of Thoughts. After generating candidate next thoughts, the evaluator — typically another LLM call with a scoring prompt — assigns a promise score to each candidate, guiding the search toward more productive branches. Evaluation quality directly bounds solution quality: a poor evaluator that assigns high scores to wrong directions will cause the search to waste compute exploring dead-end branches. Using the same model as both generator and evaluator is a convenient, self-consistent default, though a stronger dedicated evaluator can outperform it when the budget allows.

ToT is most beneficial for tasks where intermediate states can be objectively evaluated — crossword puzzle completion (can check remaining valid words), code generation with test cases (can run intermediate code snippets), mathematical proof construction (can verify individual proof steps). For tasks with purely subjective quality — creative writing without hard constraints, open-ended question answering — the evaluation step becomes the bottleneck and standard CoT or self-consistency approaches often provide better quality-cost trade-offs than full tree search.
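The code-generation case gives the most mechanical evaluator: score a candidate snippet by the fraction of unit tests it passes. A minimal sketch, assuming (purely as a convention for this example) that each snippet defines a function named `solve`:

```python
def objective_eval(code: str, tests) -> float:
    """Score a candidate code snippet by the fraction of tests it passes.
    `tests` is a list of ((args...), expected) pairs; the convention that
    the snippet defines `solve` is an assumption of this sketch."""
    namespace = {}
    try:
        exec(code, namespace)          # run the candidate snippet
    except Exception:
        return 0.0                     # doesn't even run: worst score
    passed = 0
    for args, expected in tests:
        try:
            if namespace["solve"](*args) == expected:
                passed += 1
        except Exception:
            pass                       # a crashing test counts as a failure
    return passed / len(tests)
```

Because the score is grounded in executable checks rather than another LLM's opinion, branches can be pruned with confidence.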

# Simplified ToT beam search
def tree_of_thoughts(problem, branching=3, depth=3, beam_width=2):
    thoughts = [{"state": problem, "path": []}]
    for _ in range(depth):
        candidates = []
        for t in thoughts:
            # Generate `branching` candidate next thoughts from this state
            next_steps = generate_thoughts(t["state"], n=branching)
            for step in next_steps:
                score = evaluate_thought(t["state"], step)
                candidates.append({
                    # Carry the full context forward, not just the last step
                    "state": t["state"] + "\n" + step,
                    "path": t["path"] + [step],
                    "score": score,
                })
        # Keep top beam_width candidates
        thoughts = sorted(candidates, key=lambda x: -x["score"])[:beam_width]
    return thoughts[0]["path"]  # Best path

ToT Implementation Considerations

Implementing Tree of Thoughts efficiently requires careful attention to LLM API costs, which scale multiplicatively with branching factor and search depth. A branching factor of 3 explored to depth 4 expands up to 3 + 9 + 27 + 81 = 120 thought nodes, each needing a generation and an evaluation call — two orders of magnitude more than standard CoT's single call. Caching identical intermediate states, pruning obviously dead branches early, and using a cheaper model for evaluation while reserving the full model for generation can reduce costs by 50–70% without significant quality loss.

State representation design is the key architectural decision in ToT implementations. The state must capture enough information for the LLM to continue reasoning from that point, but should be compact enough to fit in the context window alongside the task description and evaluation criteria. For complex multi-step tasks, states often include: the original problem, the reasoning steps taken so far, the current intermediate result, and a brief summary of attempted approaches that did not work — the last element being critical for avoiding repetitive exploration of dead-end branches.
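One way to encode the four state fields listed above is a small dataclass with a prompt renderer; the field names here are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReasoningState:
    """Compact ToT node state: everything the model needs to continue."""
    problem: str
    steps: List[str] = field(default_factory=list)   # reasoning steps so far
    current_result: str = ""                         # latest intermediate result
    failed_summary: str = ""                         # dead ends already tried

    def to_prompt(self) -> str:
        """Render the state as the context block for the next LLM call."""
        return (
            f"Problem: {self.problem}\n"
            f"Steps so far: {'; '.join(self.steps) or 'none'}\n"
            f"Current result: {self.current_result or 'none'}\n"
            f"Approaches that failed: {self.failed_summary or 'none'}"
        )
```

Keeping `failed_summary` as a short free-text note (rather than the full failed paths) is what keeps the state small enough to prepend to every generation and evaluation prompt.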

Failure analysis of ToT search reveals that most failures stem from evaluator calibration rather than the search strategy itself. When the evaluator consistently underrates branches that lead to correct solutions, the search converges on wrong answers regardless of the search algorithm used. Improving evaluation quality — through better evaluation prompts, few-shot evaluation examples, or using a stronger model as the evaluator — typically produces larger gains than switching between BFS, DFS, and beam search strategies, since the search strategy is only as good as the evaluation function it optimizes.
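A few-shot evaluation prompt is the cheapest of those calibration fixes. The sketch below shows the shape; the example problems, steps, and scores are invented purely to anchor the 1–10 scale:

```python
# Few-shot scoring prompt: anchor the low and high ends of the scale
# with worked examples before asking for the real score.
FEW_SHOT_EVAL = """Rate how promising the proposed next step is (1-10).

Problem: sort a million integers quickly.
Step: "Compare every pair of elements."
Score: 2

Problem: sort a million integers quickly.
Step: "Use a radix sort on fixed-width keys."
Score: 9

Problem: {problem}
Step: "{step}"
Score:"""

def build_eval_prompt(problem: str, step: str) -> str:
    """Fill the few-shot template for one candidate thought."""
    return FEW_SHOT_EVAL.format(problem=problem, step=step)
```

The anchored examples give the model a reference for what a 2 and a 9 look like, which tightens the score distribution compared with a bare "rate 1-10" instruction.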