
LLM Tree Search

Monte Carlo Tree Search applied to LLMs: explore and backtrack through reasoning paths to find optimal solutions for hard problems



SECTION 01

The Core Idea

Standard LLM inference is greedy: the model generates one token at a time, committing to each choice before seeing what follows. This works well for most tasks but fails on problems where the optimal path requires looking ahead: math competition problems, theorem proving, complex planning, and multi-step code synthesis.

LLM Tree Search decouples generation from selection. Instead of committing to a single reasoning path, the system expands multiple candidate continuations, scores them using a reward model, and selects the most promising branch to continue. Branches that score poorly are pruned or backtracked from. The result: inference compute is spent exploring the solution space rather than generating along a single (potentially wrong) trajectory.

This is the same insight behind AlphaGo and AlphaZero: systematic tree search is how you find optimal solutions in large, combinatorial spaces. Applied to LLMs, the "moves" are reasoning steps or tokens, and the "position evaluation" is a reward model that scores how likely a partial chain is to lead to a correct final answer.

Why this matters: OpenAI o1/o3 and DeepSeek R1 are believed to use variants of tree search internally during training and/or inference. The dramatic jump in accuracy on hard math and coding benchmarks from o1 to o3 is largely attributed to increased inference-time compute (more search) rather than a larger model.
SECTION 02

From Best-of-N to MCTS

Tree search methods for LLMs exist on a spectrum of complexity and compute cost:

Best-of-N (BoN): The simplest form. Generate N independent completions for the same prompt, score each with a reward model, return the best. No backtracking, no branching in the middle. Surprisingly effective: scaling N from 1 to 100 can match a model 10x larger.
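Stripped of API details, Best-of-N is a few lines. This is a minimal sketch; `generate` and `score` are hypothetical callables standing in for an LLM sampling call and an ORM:

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],      # hypothetical: samples one completion
              score: Callable[[str, str], float],  # hypothetical: ORM score for (prompt, completion)
              n: int = 5) -> str:
    """Sample n independent completions and return the highest-scoring one."""
    completions = [generate(prompt) for _ in range(n)]
    return max(completions, key=lambda c: score(prompt, c))
```

Because the completions are independent, they can be sampled in parallel, which keeps wall-clock latency close to that of a single generation.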

Beam search: Maintain a beam of K partial completions. At each step, expand each beam member, score all candidates, keep the top K. Better than BoN because it allocates compute to promising prefixes rather than complete sequences. Still no backtracking.

Greedy best-first search: Always expand the highest-scored partial path. Efficient but can get stuck in local optima.

Monte Carlo Tree Search (MCTS): The full algorithm. Four phases: Selection (use UCB to pick which node to expand), Expansion (sample next steps from the LLM), Simulation (score the leaf), Backpropagation (update all ancestor node values). Handles the exploration-exploitation tradeoff explicitly.

Best-of-N (simplest, most practical):

```text
Generate 5 solutions independently:
  Solution 1: [chain of thought] ... answer: 42  (ORM score: 0.3)
  Solution 2: [chain of thought] ... answer: 48  (ORM score: 0.8)  ← pick this
  Solution 3: [chain of thought] ... answer: 42  (ORM score: 0.4)
  Solution 4: [chain of thought] ... answer: 51  (ORM score: 0.2)
  Solution 5: [chain of thought] ... answer: 48  (ORM score: 0.7)

Return solution 2 (highest score).
```

MCTS (most powerful, most complex):

```text
           root
          /    \
     step A    step B        ← PRM scores: 0.6, 0.3
     /    \
step A1  step A2             ← PRM scores: 0.8, 0.2
   ✓        ✗ (prune)

Expand A1 further; backtrack from A2. Return best terminal node.
```
Start here: Implement Best-of-N first. It captures ~70% of MCTS accuracy gains with a fraction of the complexity. Add beam search if you have a reliable PRM. Use full MCTS only if you have a well-calibrated process reward model and latency budget to match.
SECTION 03

Reward Models

A reward model scores how promising a partial or complete reasoning chain is. It's the "position evaluator" that makes tree search possible. Without reliable scores, search degenerates into random exploration.

Outcome Reward Model (ORM): Scores only complete answers. Simple to train: you just need (problem, correct answer) pairs. Best for tasks with clear right/wrong answers. Used in Best-of-N. Limitation: provides no signal until the chain is complete, so it can't guide search at intermediate steps.

Process Reward Model (PRM): Scores each reasoning step, allowing bad branches to be pruned early in the search. Requires step-level annotations in training data, which are much more expensive to create. OpenAI released PRM800K, an 800K step-level annotated math dataset, to support PRM research.

LLM-as-judge scorer: Use a separate LLM call to rate the quality of a partial reasoning chain. No separate training required. Slower and more expensive but surprisingly effective, especially during prototyping. Can prompt for a 0–1 score with reasoning.

Self-consistency: Generate many solutions, count answer frequencies, return the majority answer. No explicit reward model needed. Works well for math where wrong paths lead to diverse wrong answers while correct paths converge to the same answer.
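Self-consistency reduces to a majority vote once final answers are extracted. A minimal sketch, assuming answer extraction has already happened:

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Return the most frequent final answer across independent samples."""
    return Counter(answers).most_common(1)[0][0]
```

No model scores anything here: agreement between independent samples is itself the reward signal.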

```python
# LLM-as-judge PRM example
import anthropic

client = anthropic.Anthropic()

def score_reasoning_step(problem: str, steps_so_far: str) -> float:
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=10,
        system=(
            "You are a math reasoning evaluator. "
            "Rate how likely the current reasoning chain leads to a correct final answer. "
            "Respond with ONLY a number between 0.0 and 1.0."
        ),
        messages=[{
            "role": "user",
            "content": f"Problem: {problem}\n\nReasoning so far:\n{steps_so_far}"
        }],
    )
    try:
        return float(resp.content[0].text.strip())
    except ValueError:
        return 0.5  # fall back to a neutral score on unparseable output
```
SECTION 04

MCTS for LLMs

Full Monte Carlo Tree Search applied to LLM reasoning has four phases that repeat until the compute budget is exhausted.

Selection: Starting from the root, traverse the tree by selecting the child node with the highest UCB1 score: value + C * sqrt(ln(parent_visits) / node_visits). This balances exploitation (nodes with high value) and exploration (nodes with few visits).
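The selection rule can be written down directly. A minimal sketch in which each child is represented as a (value, visits) tuple; this layout and `select_child` are illustrative, not a library API:

```python
import math

def ucb1(value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score: exploitation term plus exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return value + c * math.sqrt(math.log(parent_visits) / visits)

def select_child(children: list[tuple[float, int]]) -> int:
    """Index of the child with the highest UCB1 score."""
    parent_visits = sum(visits for _, visits in children)
    return max(range(len(children)),
               key=lambda i: ucb1(children[i][0], children[i][1], parent_visits))
```

With C around 1.4, a rarely visited low-value child can outrank a well-explored high-value one; that exploration pressure is what keeps the search from tunneling down one branch.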

Expansion: At the selected leaf node, sample one or more next reasoning steps from the LLM. These become new child nodes. The LLM is prompted with the problem and all reasoning steps on the path from root to the leaf.

Simulation (Rollout): From the new node, either score it immediately with a PRM or run a quick rollout to a terminal state and score with an ORM.

Backpropagation: Update the value and visit count of all nodes on the path from the new node back to the root. This propagates the score upward so future selection can use it.

```python
# MCTS pseudocode for LLM reasoning
tree = Node(state=problem, value=0, visits=0, children=[])

for iteration in range(budget):
    # 1. Selection: traverse from the root using UCB1
    node = select_ucb(tree)

    # 2. Expansion: sample next steps from the LLM
    next_steps = llm_expand(node.state, n=3)
    for step in next_steps:
        child = Node(state=node.state + step)
        node.children.append(child)

        # 3. Simulation: score the new child with a PRM
        score = prm_score(child.state)

        # 4. Backpropagation: update every node from child back to root
        n = child
        while n is not None:
            n.value = (n.value * n.visits + score) / (n.visits + 1)
            n.visits += 1
            n = n.parent

# Return path to highest-value terminal node
```
SECTION 05

Implementation

A practical greedy best-first search implementation using Claude:

```python
import anthropic
from dataclasses import dataclass, field
from typing import Optional

client = anthropic.Anthropic()

@dataclass
class Node:
    chain: str
    score: float
    depth: int
    children: list = field(default_factory=list)
    parent: Optional["Node"] = None

def llm_score(problem: str, chain: str) -> float:
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=5,
        system="Rate this reasoning chain 0.0-1.0. Reply with number only.",
        messages=[{"role": "user", "content": f"Problem: {problem}\n\nChain:\n{chain}"}],
    )
    try:
        return float(resp.content[0].text.strip())
    except ValueError:
        return 0.5

def expand(problem: str, chain: str, n: int = 3) -> list[str]:
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=400,
        system=(
            f"Generate {n} different next reasoning steps, separated by '|||'. "
            "Each step should advance toward solving the problem."
        ),
        messages=[{
            "role": "user",
            "content": f"Problem: {problem}\n\nReasoning so far:\n{chain}\n\nNext steps:"
        }],
    )
    return [s.strip() for s in resp.content[0].text.split("|||") if s.strip()]

def tree_search(problem: str, max_depth: int = 4, beam_width: int = 3) -> str:
    root = Node(chain="", score=1.0, depth=0)
    beam = [root]
    for depth in range(max_depth):
        candidates = []
        for node in beam:
            for step in expand(problem, node.chain, n=beam_width):
                new_chain = (node.chain + "\n" + step).strip()
                score = llm_score(problem, new_chain)
                child = Node(chain=new_chain, score=score, depth=depth + 1, parent=node)
                node.children.append(child)
                candidates.append(child)
        beam = sorted(candidates, key=lambda n: n.score, reverse=True)[:beam_width]
        print(f"Depth {depth + 1}: best score = {beam[0].score:.2f}")

    # Final answer from the best leaf
    best = beam[0]
    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Problem: {problem}\n\nReasoning:\n{best.chain}\n\nFinal answer:"
        }],
    )
    return resp.content[0].text

if __name__ == "__main__":
    problem = (
        "A snail climbs 3 feet up a wall during the day and slides back 2 feet "
        "at night. The wall is 10 feet tall. How many days to reach the top?"
    )
    print(tree_search(problem))
```
SECTION 06

When to Use Tree Search

Tree search is not universally beneficial: it requires significant compute overhead. Use it when the task characteristics justify the cost.

Good candidates for tree search:

- Competition math and theorem proving, where answers are verifiable
- Multi-step code synthesis, where candidates can be scored by running tests
- Complex planning, where one early wrong step dooms a greedy trajectory
- Latency-tolerant batch workloads where accuracy is paramount

Poor candidates:

- Open-ended generation (creative writing, summarization) with no reliable reward signal
- Latency-sensitive interactive applications
- Tasks the base model already solves reliably in a single pass

Cost warning: MCTS with a branching factor of 3 and depth of 4 explores up to 3^4 = 81 complete reasoning paths before pruning, with an LLM generation call for every expanded node. With a PRM adding another call per step, tree search can easily cost 100–200x a single inference. Design your scoring budget carefully.
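A back-of-envelope helper makes this arithmetic explicit. This is a hypothetical sketch that counts un-batched calls, one generation call and one PRM call per expanded node, while the 3^4 = 81 figure above counts complete root-to-leaf paths:

```python
def worst_case_calls(branching: int, depth: int) -> dict[str, int]:
    """Worst-case search size for a full tree, before any pruning."""
    leaf_paths = branching ** depth                           # complete reasoning chains
    nodes = sum(branching ** d for d in range(1, depth + 1))  # every expanded node
    return {
        "leaf_paths": leaf_paths,
        "generation_calls": nodes,  # one un-batched LLM call per node
        "prm_calls": nodes,         # one scoring call per node
    }
```

For branching 3 and depth 4 this gives 81 leaf paths and 120 generation plus 120 scoring calls, which is roughly where the 100–200x multiplier comes from; batching several candidates per call reduces it.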
SECTION 07

Inference-Time Scaling

Tree search is a specific mechanism enabling a broader phenomenon, inference-time scaling: the empirical finding that spending more compute at inference time (rather than training time) can substantially improve accuracy, and that these gains follow predictable scaling laws.

The 2024 paper "Scaling LLM Test-Time Compute Optimally" (Snell et al.) showed that optimal allocation of a fixed compute budget between Best-of-N, beam search, and MCTS depends strongly on problem difficulty and the quality of the available PRM. For easy problems, a single well-calibrated sample is optimal. For hard problems, MCTS with a strong PRM consistently wins.

A key finding: for a fixed compute budget, a smaller model with tree search can match or exceed a much larger model running greedy inference. This creates a fundamental tradeoff between model size (training cost) and inference search (per-query cost). The right point on this tradeoff depends on query volume and latency requirements.

Practical implication: If you have a latency-insensitive batch workload requiring high accuracy (e.g. overnight scientific calculations, due diligence analysis), inference-time tree search on a mid-sized model may be more cost-effective than training or fine-tuning a larger model.