SECTION 01
CoT vs ToT: what's the difference?
Chain-of-thought is a single path: the model walks forward, step by step, and commits to each step as it goes. If it takes a wrong turn early, it keeps walking down that wrong path.
Tree of Thoughts (ToT) is a search: the model generates multiple candidate next steps at each point, evaluates which look most promising, pursues the best ones, and backtracks if they lead nowhere. It's the difference between a hiker who walks confidently in one direction vs one who explores different trails and turns back when blocked.
When the distinction matters: Some problems have "cliffs" — a wrong early decision makes the rest of the solution impossible, no matter how well you reason afterward. CoT falls off the cliff and keeps going. ToT can backtrack.
SECTION 02
How the tree works
Three components make ToT work:
- Thought generator: From the current state, generate k candidate next steps (e.g., "Which variable to assign first?", "Which constraint to address?"). Use sampling at temperature > 0 for diversity.
- State evaluator: For each candidate, ask the model (or a separate prompt) to score it: "Is this a promising direction? Rate 1–10." Prune low-scoring branches.
- Search strategy: BFS (explore all options at each depth before going deeper) or DFS (go deep on the best option first, backtrack if stuck).
Problem: Write 4 words, each starting with the successive letters of the word "GAME" (G, A, M, E).
ToT search:
State: []
Candidates: [Gorilla, Garden, Galaxy, Grab]
Eval: All valid, choose Gorilla → State: [Gorilla]
State: [Gorilla]
Candidates: [Apple, Ancient, Argue, About]
Eval: All valid, choose Ancient → State: [Gorilla, Ancient]
State: [Gorilla, Ancient]
Candidates: [Mountain, Moon, Metal, Music]
...continue to E
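The walkthrough above can be mechanized without an LLM. Here is a minimal sketch of the same search, with a hardcoded candidate pool standing in for the thought generator (all word lists and names are illustrative, not part of the original example beyond the first three levels):

```python
# Minimal ToT-style search for the "GAME" word puzzle, with a hardcoded
# candidate pool standing in for the LLM thought generator.
CANDIDATES = {
    "G": ["Gorilla", "Garden", "Galaxy", "Grab"],
    "A": ["Apple", "Ancient", "Argue", "About"],
    "M": ["Mountain", "Moon", "Metal", "Music"],
    "E": ["Eagle", "Echo", "Energy", "Exit"],  # hypothetical E-candidates
}

def solve(word="GAME"):
    state = []  # words chosen so far
    for letter in word:
        # Thought generator: candidate words for the next letter.
        options = CANDIDATES[letter]
        # State evaluator: keep only candidates that start correctly.
        valid = [w for w in options if w.upper().startswith(letter)]
        if not valid:
            return None  # dead end; a real ToT would backtrack here
        state.append(valid[0])  # pursue the most promising branch
    return state

path = solve()
```

The evaluator here is a trivial string check; the interesting cases are problems where validity of a partial state is expensive or fuzzy to judge.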
SECTION 03
Practical ToT implementation
```python
import anthropic
from typing import List

client = anthropic.Anthropic()

def generate_thoughts(problem: str, current_state: str, n: int = 3) -> List[str]:
    """Generate n candidate next reasoning steps."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=400,
        messages=[{"role": "user", "content": f"""
Problem: {problem}
Progress so far: {current_state if current_state else "Just starting."}
Generate {n} different possible next steps to take.
Each step should be a distinct approach or direction.
Format: one step per line, numbered.
"""}]
    ).content[0].text
    # Keep only the numbered lines the prompt asked for
    lines = [l.strip() for l in response.split("\n") if l.strip() and l[0].isdigit()]
    return lines[:n]

def evaluate_thought(problem: str, state: str, thought: str) -> float:
    """Score a reasoning step 1-10."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=50,
        messages=[{"role": "user", "content": f"""
Problem: {problem}
Current state: {state}
Proposed next step: {thought}
Rate how promising this step is for solving the problem.
Reply with ONLY a number from 1 to 10. Nothing else.
"""}]
    ).content[0].text.strip()
    try:
        return float(response.split()[0])
    except (ValueError, IndexError):
        return 5.0  # unparseable score: fall back to a neutral default

def tree_of_thoughts(problem: str, max_depth: int = 3, beam_width: int = 2) -> str:
    """Beam search over reasoning tree."""
    # Each beam: (state_string, score)
    beams = [("", 10.0)]
    for depth in range(max_depth):
        candidates = []
        for state, _ in beams:
            thoughts = generate_thoughts(problem, state, n=3)
            for thought in thoughts:
                score = evaluate_thought(problem, state, thought)
                candidates.append((state + "\n" + thought, score))
        # Keep top beam_width candidates
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]
    # Final answer from best beam
    best_state = beams[0][0]
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=300,
        messages=[{"role": "user", "content": f"""
Problem: {problem}
Reasoning steps taken: {best_state}
Based on this reasoning, what is the final answer?
"""}]
    ).content[0].text
    return response
```
SECTION 04
When ToT is worth it
- Constraint satisfaction: Sudoku, scheduling, resource allocation — problems where partial solutions can be proven invalid early
- Mathematical proofs: Multiple proof strategies need to be tried and some abandoned
- Code architecture: "Design the data model first" vs "design the API first" — two fundamentally different starting points
- Strategy/planning: Game playing, multi-step business decisions with explicit trade-offs
Reality check: Genuine ToT with backtracking requires 5–30× more API calls than CoT. For most tasks, CoT + self-consistency gets you 90% of the benefit at 5% of the cost. Use ToT for the 5% of problems where the first approach reliably fails and the domain allows meaningful branching.
SECTION 05
Simpler alternatives that often work
## Alternative 1: "Parallel CoT" (cheap ToT approximation)
# Generate N full reasoning chains, pick the best (or majority-vote)
# This is just self-consistency — captures some ToT benefit
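Self-consistency fits in a few lines. Here the sampled chains are stubbed out with a canned sequence; in practice each call would be a full CoT completion sampled at temperature > 0:

```python
from collections import Counter

def self_consistency(sample_chain, n=5):
    """Sample n reasoning chains and majority-vote on their final answers."""
    answers = [sample_chain() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Stub: pretend 4 of 5 sampled chains converge on the same answer.
fake_chains = iter(["42", "42", "17", "42", "42"])
answer = self_consistency(lambda: next(fake_chains), n=5)
```

The vote is over final answers only, so this needs a task with a short, comparable answer format.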
## Alternative 2: "Plan first, then execute"
# Step 1: Generate a plan / outline
# Step 2: Execute each step of the plan
# No backtracking, but forces the model to think about structure first
prompt_plan_then_execute = """
Problem: [hard problem here]
First, write a step-by-step plan (don't solve yet — just plan).
Then execute each step of your plan in order.
If a step doesn't work, note it and try an alternative before continuing.
"""
## Alternative 3: "Devil's advocate" iteration
# Step 1: Generate a solution
# Step 2: Ask a separate call to find flaws in that solution
# Step 3: Revise based on flaws
# 2–3 iterations catches most issues without a full tree
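The three steps above can be sketched as a small loop. The stubs below stand in for three separate LLM calls (generate, critique, revise); the function names are illustrative:

```python
# Devil's-advocate loop: generate, critique, revise until the critic is quiet.
def devils_advocate(generate, critique, revise, rounds=2):
    solution = generate()
    for _ in range(rounds):
        flaws = critique(solution)
        if not flaws:
            break  # critic found nothing to fix
        solution = revise(solution, flaws)
    return solution

# Stubs standing in for the three LLM calls.
solution = devils_advocate(
    generate=lambda: "draft",
    critique=lambda s: "missing edge case" if s == "draft" else "",
    revise=lambda s, f: s + " (revised)",
)
```

Capping `rounds` matters: without it, an overzealous critic can keep the loop revising indefinitely.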
SECTION 06
LangGraph for real ToT
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, List

# Real ToT with backtracking needs a proper graph framework

class ToTState(TypedDict):
    problem: str
    current_path: List[str]
    depth: int
    failed_paths: List[List[str]]
    final_answer: str

def is_dead_end(state: ToTState) -> bool:
    """Placeholder: score the current path and flag unpromising ones."""
    return False  # ... replace with an LLM evaluation call

def expand_node(state: ToTState) -> ToTState:
    """Generate and evaluate candidate next steps."""
    # ... generate thoughts, score them, add the best to current_path,
    # and increment depth
    return state

def should_backtrack(state: ToTState) -> str:
    """Decide: continue, backtrack, or conclude."""
    if state["depth"] >= 4:
        return "conclude"
    if is_dead_end(state):
        return "backtrack"
    return "continue"

def backtrack(state: ToTState) -> ToTState:
    """Remove last step from path, mark as failed."""
    state["failed_paths"].append(state["current_path"].copy())
    state["current_path"].pop()
    state["depth"] -= 1
    return state

# Build graph
g = StateGraph(ToTState)
g.add_node("expand", expand_node)
g.add_node("backtrack", backtrack)
g.add_conditional_edges("expand", should_backtrack,
    {"continue": "expand", "backtrack": "backtrack", "conclude": END})
g.add_edge("backtrack", "expand")
g.set_entry_point("expand")
graph = g.compile()
```
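The continue/backtrack/conclude control flow the graph encodes can also be sketched framework-free as a recursive DFS. This toy version searches for a sequence of numbers summing to a target; the problem and all names are illustrative, not tied to the LangGraph example:

```python
# DFS with explicit backtracking: the control flow the ToT graph encodes.
def dfs_tot(candidates_for, is_goal, is_dead_end, path=None, depth=0, max_depth=4):
    path = path or []
    if is_goal(path):
        return path          # conclude
    if depth >= max_depth or is_dead_end(path):
        return None          # backtrack: the caller abandons this branch
    for step in candidates_for(path):
        result = dfs_tot(candidates_for, is_goal, is_dead_end,
                         path + [step], depth + 1, max_depth)
        if result is not None:
            return result    # a descendant reached the goal
    return None              # all branches failed

# Toy problem: pick numbers from {3, 2, 1} (repeats allowed) summing to 7.
found = dfs_tot(
    candidates_for=lambda p: [3, 2, 1],
    is_goal=lambda p: sum(p) == 7,
    is_dead_end=lambda p: sum(p) > 7,
)
```

In an LLM-backed version, `candidates_for` and `is_dead_end` would be the thought generator and state evaluator calls, and the recursion stack is the backtracking record.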
Tree of Thoughts Search Strategies
Tree of Thoughts (ToT) extends chain-of-thought reasoning by generating multiple candidate reasoning steps at each decision point, evaluating their promise, and searching through the resulting tree of possibilities using standard search algorithms. This explicit deliberative search process enables solving problems that require exploration and backtracking — game playing, creative writing with constraints, multi-step planning — where a single linear reasoning chain tends to get stuck in dead ends.
| Search Strategy | Algorithm | Completeness | Token Cost | Best For |
|---|---|---|---|---|
| Breadth-first search | BFS across thought tree | Complete (bounded) | High | Short-horizon problems |
| Depth-first search | DFS with backtracking | Complete (bounded) | Medium | Long reasoning chains |
| Beam search | Keep top-K at each level | Approximate | Moderate | Quality/cost balance |
| Monte Carlo Tree Search | UCB + simulation | Approximate | Very high | Complex multi-step games |
The thought evaluation step is the most critical and expensive component of Tree of Thoughts. After generating candidate next thoughts, the evaluator — typically another LLM call with a scoring prompt — assigns a promise score to each candidate, guiding the search toward more productive branches. Evaluation quality directly bounds solution quality: a poor evaluator that assigns high scores to wrong directions will cause the search to waste compute exploring dead-end branches. Whether to use the same model for generation and evaluation is a trade-off: a shared model keeps scores consistent with the generator's own notion of progress, while a separate, stronger evaluator can catch errors the generator systematically misses.
ToT is most beneficial for tasks where intermediate states can be objectively evaluated — crossword puzzle completion (can check remaining valid words), code generation with test cases (can run intermediate code snippets), mathematical proof construction (can verify individual proof steps). For tasks with purely subjective quality — creative writing without hard constraints, open-ended question answering — the evaluation step becomes the bottleneck and standard CoT or self-consistency approaches often provide better quality-cost trade-offs than full tree search.
```python
# Simplified ToT beam search
def tree_of_thoughts(problem, branching=3, depth=3, beam_width=2):
    thoughts = [{"state": problem, "path": []}]
    for _ in range(depth):
        candidates = []
        for t in thoughts:
            # Generate `branching` next thoughts from the current state
            next_steps = generate_thoughts(t["state"], n=branching)
            for step in next_steps:
                score = evaluate_thought(t["state"], step)
                candidates.append({
                    # Carry the accumulated state forward, not just the last step
                    "state": t["state"] + "\n" + step,
                    "path": t["path"] + [step],
                    "score": score,
                })
        # Keep top beam_width candidates
        thoughts = sorted(candidates, key=lambda x: -x["score"])[:beam_width]
    return thoughts[0]["path"]  # Best path
```
ToT Implementation Considerations
Implementing Tree of Thoughts efficiently requires careful attention to LLM API costs, which scale multiplicatively with branching factor and search depth. At branching factor 3 and depth 4, exploring the full tree means up to 1 + 3 + 9 + 27 = 40 generation calls (one per expanded node) and 3 + 9 + 27 + 81 = 120 evaluation calls (one per candidate thought) — significantly more than standard CoT. Caching identical intermediate states, pruning obviously dead branches early, and using a cheaper model for evaluation while reserving the full model for generation can reduce costs by 50–70% without significant quality loss.
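The call-count arithmetic is easy to sketch, under the assumption of full-tree exploration with one generation call per expanded node and one evaluation call per candidate:

```python
def tot_call_count(branching=3, depth=4):
    """Full-tree API call counts: one generation call per expanded node,
    one evaluation call per generated candidate."""
    nodes_per_level = [branching ** d for d in range(1, depth + 1)]  # 3, 9, 27, 81
    eval_calls = sum(nodes_per_level)          # every candidate is scored
    gen_calls = sum(nodes_per_level[:-1]) + 1  # expanded nodes, incl. the root
    return gen_calls, eval_calls

gen, ev = tot_call_count()  # branching 3, depth 4
```

Beam search replaces the exponential per-level node count with `beam_width * branching`, which is where most of the cost savings come from.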
State representation design is the key architectural decision in ToT implementations. The state must capture enough information for the LLM to continue reasoning from that point, but should be compact enough to fit in the context window alongside the task description and evaluation criteria. For complex multi-step tasks, states often include: the original problem, the reasoning steps taken so far, the current intermediate result, and a brief summary of attempted approaches that did not work — the last element being critical for avoiding repetitive exploration of dead-end branches.
Failure analysis of ToT search reveals that most failures stem from evaluator calibration rather than the search strategy itself. When the evaluator consistently underrates branches that lead to correct solutions, the search converges on wrong answers regardless of the search algorithm used. Improving evaluation quality — through better evaluation prompts, few-shot evaluation examples, or using a stronger model as the evaluator — typically produces larger gains than switching between BFS, DFS, and beam search strategies, since the search strategy is only as good as the evaluation function it optimizes.
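A few-shot evaluation prompt of the kind suggested above might look like this; the worked examples, scoring scale, and placeholder names are all illustrative:

```python
# Few-shot scoring prompt: worked examples anchor the 1-10 scale
# so the evaluator's numbers are better calibrated.
EVAL_PROMPT = """Rate how promising the proposed step is (1-10). Examples:

Problem: Sort a linked list in O(n log n).
Step: Convert to an array, sort, convert back.
Score: 4  (works, but uses O(n) extra space)

Problem: Sort a linked list in O(n log n).
Step: Merge sort, splitting at the slow/fast-pointer midpoint.
Score: 9  (standard approach, meets the complexity bound)

Problem: {problem}
Step: {step}
Score:"""

filled = EVAL_PROMPT.format(problem="Schedule 5 tasks on 2 machines",
                            step="Assign the longest task first")
```

Ending the prompt at "Score:" nudges the model to reply with the bare number the parser expects.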