Chain-of-thought, tree search, and test-time compute scaling — how models think step by step
A standard LLM operates in one forward pass: input → output. No intermediate deliberation. The model generates the final answer directly, token by token, in a single left-to-right pass.
Reasoning LLMs fundamentally change this: they generate explicit intermediate steps (tokens) before the final answer. These intermediate tokens represent "thinking" — the model shows its work, explores paths, checks assumptions, and validates logic before committing to a conclusion.
Two axes define reasoning approaches: where it happens (in the prompt via CoT, in model weights from training, or via search algorithms) and when it happens (during training time with special loss functions, or at inference time with compute budget).
The key insight: more thinking tokens = better answers on hard tasks, at the cost of latency and token spend. This trades inference compute for quality — a new scaling dimension beyond just model size and pretraining tokens.
Chain-of-thought prompting began with a simple discovery: Wei et al. 2022 showed that few-shot exemplars containing explicit reasoning steps unlock multi-step reasoning in large models, and Kojima et al. 2022 found that merely appending "Let's think step by step" works zero-shot. This costs nearly nothing: no retraining, no new data, just better prompting.
Zero-shot CoT: Append "Let's think step by step" to any prompt. Works reliably only on sufficiently large models (roughly ≥100B parameters in the original study). The reasoning capability is already latent from pretraining; the phrase simply elicits it.
Few-shot CoT: Provide worked examples showing explicit reasoning chains. More expensive to construct, but more reliable. Show the model a question, then step-by-step reasoning, then the answer. Few-shot versions reliably outperform zero-shot.
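To make the two styles concrete, here is a minimal sketch of how the prompts differ (the question and exemplar are invented for illustration):

```python
def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase to any question."""
    return f"Q: {question}\nA: Let's think step by step."

# One hand-written worked example with explicit reasoning (few-shot exemplar).
FEW_SHOT_EXEMPLAR = (
    "Q: A bag has 3 red and 5 blue marbles. What fraction is red?\n"
    "A: Total marbles = 3 + 5 = 8. Red marbles = 3. Fraction = 3/8. "
    "The answer is 3/8.\n\n"
)

def few_shot_cot(question: str) -> str:
    """Few-shot CoT: prepend worked examples, then ask the new question."""
    return FEW_SHOT_EXEMPLAR + f"Q: {question}\nA:"
```

The zero-shot version costs nothing to build; the few-shot version buys reliability at the cost of writing (and spending tokens on) the exemplars.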
Self-consistency: Don't just sample one CoT chain. Sample multiple independent chains (typically 5–10), let the model reason differently each time, then take a majority vote on the final answer. Adds 3–5% accuracy on math benchmarks with just sampling cost.
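The voting step can be sketched in a few lines. Here `sample_answer` stands in for a real temperature>0 model call that returns the final answer extracted from one reasoning chain (an assumption of this sketch):

```python
import random
from collections import Counter

def self_consistency(sample_answer, question: str, n: int = 7) -> str:
    """Sample n independent CoT chains and majority-vote the final answers.

    `sample_answer` is any callable question -> final-answer string; in
    practice, one temperature>0 model call per chain.
    """
    answers = [sample_answer(question) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Toy stand-in sampler: reasons correctly about 2/3 of the time.
random.seed(0)
def noisy_sampler(question: str) -> str:
    return "42" if random.random() < 0.66 else "41"

print(self_consistency(noisy_sampler, "What is 6 * 7?"))  # majority recovers "42"
```

The point of the vote: individual chains may go wrong in different ways, but correct reasoning paths tend to converge on the same final answer.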
Instead of prompting for CoT, train the model to produce reasoning as part of its learned output distribution. OpenAI's o1 approach: train on long CoT traces, then hide the reasoning scratchpad from users at inference time.
This shifts reasoning from a prompting trick to a learned behavior. The model learns when to reason, how deep to reason, and how to validate its own work.
Two reward modeling approaches exist. An Outcome Reward Model (ORM) scores only the final answer, correct or incorrect. Simple to label, but it can't penalize flawed intermediate steps that happen to luck into the right final answer.
Process Reward Model (PRM) annotates every reasoning step, scoring each independently. Much more expensive to label (each solution requires dozens of step annotations). But dramatically better for hard math — it teaches the model which intermediate reasoning paths are sound, not just whether the final answer is right.
| Aspect | ORM | PRM |
|---|---|---|
| What it scores | Final answer | Each step |
| Data needed | (prompt, answer) pairs | Step-by-step annotations |
| Good for | General tasks | Math, code, logic |
| Weakness | Can't catch wrong steps that luck into right answer | Expensive to label |
| Used by | Most RLHF pipelines | DeepMind, OpenAI o1 (rumored) |
PRMs represent the frontier of reasoning supervision. OpenAI's "Let's Verify Step by Step" (Lightman et al. 2023) and earlier DeepMind work on process- vs. outcome-based feedback showed that process supervision yields significantly higher accuracy on competition-level mathematics than outcome supervision alone.
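To see what the labeling-cost difference means concretely, here are illustrative training-record shapes for the two approaches (the field names are invented for this sketch, not a standard schema):

```python
# ORM: one label per solution. Cheap, but the reasoning is a black box.
orm_record = {
    "prompt": "Solve: 12 * 15 = ?",
    "solution": "12 * 15 = 12 * 10 + 12 * 5 = 120 + 60 = 180",
    "label": 1,  # single bit: was the final answer correct?
}

# PRM: one label per step. Expensive, but localizes exactly where
# reasoning goes wrong.
prm_record = {
    "prompt": "Solve: 12 * 15 = ?",
    "steps": [
        "12 * 15 = 12 * 10 + 12 * 5",  # sound decomposition
        "12 * 10 = 120",
        "12 * 5 = 70",                 # arithmetic slip
        "120 + 70 = 190",
    ],
    "step_labels": [1, 1, 0, 0],  # the slip and everything downstream get 0
}
```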
Instead of one left-to-right reasoning chain, explore a tree of possible reasoning paths. Different intermediate steps lead to different continuations. Some paths are correct; some are dead ends.
Best-of-N (BoN): Generate N independent solutions, run each through a verifier (e.g., "does the code run?", "does the math check out?"), pick the highest-scored. Simple, embarrassingly parallel, but expensive — N=16 means 16× compute.
Beam search over reasoning steps: Maintain a fixed set of k "active" partial chains. At each reasoning step, expand each chain, score all branches, prune to top k. More efficient than BoN but harder to parallelize within a single request.
Monte Carlo Tree Search (MCTS): Sample reasoning paths, backpropagate outcome rewards through the tree structure. Focuses compute on promising branches. Used in rStar-Math and similar competition-math systems. Highest quality, but typically 50–200× the compute.
| Method | Compute | Quality | Parallelizable | When to use |
|---|---|---|---|---|
| Greedy (no search) | 1× | Baseline | N/A | Latency-critical |
| Best-of-N (N=16) | 16× | +5–15% on math | Yes | Offline eval, high accuracy |
| Beam search (k=4) | 4× | +3–8% | Partial | Batch inference |
| MCTS | 50–200× | +15–25% | Partial | Research, competition math |
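Of the methods above, Best-of-N is the easiest to sketch. Here `generate` and `verify` are placeholders for a sampling model call and a domain verifier respectively (assumptions of this sketch):

```python
def best_of_n(generate, verify, problem: str, n: int = 16):
    """Best-of-N: sample n candidate solutions, keep the top-scoring one.

    `generate`: problem -> candidate solution (one temperature>0 model call)
    `verify`:   (problem, candidate) -> score in [0, 1], e.g. run the code,
                check the math with sympy, or query a reward model
    """
    candidates = [generate(problem) for _ in range(n)]  # embarrassingly parallel
    return max(candidates, key=lambda c: verify(problem, c))

# Toy demo: candidates are numbers, the verifier checks c * c == 49.
cands = iter([6, 8, 7, 5])
answer = best_of_n(lambda p: next(cands),
                   lambda p, c: 1.0 if c * c == 49 else 0.0,
                   "sqrt(49)?", n=4)
# answer == 7
```

In production the candidate generations run as parallel API calls, which is why BoN costs N× compute but adds little wall-clock latency.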
DeepSeek-R1 demonstrated a profound finding: you can train reasoning capability via pure reinforcement learning, without supervised CoT traces. Start from a base model. Apply a simple RL algorithm. Reasoning emerges.
GRPO (Group Relative Policy Optimization): Sample a group of G responses. Score each with a rule-based reward (e.g., 1 if final answer correct). Compute advantage for each response relative to the group mean, not global baseline. Update policy to increase likelihood of high-advantage responses.
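The group-relative advantage step can be sketched in a few lines. This omits GRPO's clipped policy-update objective and KL penalty toward a reference model, which the full algorithm (as described in the DeepSeekMath paper) also includes:

```python
import math

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each reward against its group.

    No learned value network: the group mean is the baseline and the group
    standard deviation normalizes the scale.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g) or 1.0  # guard zero variance
    return [(r - mean) / std for r in rewards]

# Group of 4 sampled answers; rule-based reward: 1.0 if final answer correct.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct answers get +1.0 advantage, incorrect get -1.0; the policy update
# then pushes probability mass toward the positive-advantage responses.
```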
GRPO is simpler than PPO: it drops the learned value (critic) network entirely, using the group mean as its baseline, and DeepSeek-R1 pairs it with rule-based rewards rather than a learned reward model, sidestepping the reward hacking that learned reward models invite. And empirically it works: DeepSeek-R1 develops strong reasoning without ever seeing annotated CoT data during RL training.
A key emergent phenomenon: the model develops an "aha moment" behavior — spontaneously re-examining its work mid-chain when it notices a mistake, then correcting course. This self-correction emerges purely from the RL signal, not from explicit training.
| Method | Needs reward model | Sample efficiency | Reasoning quality |
|---|---|---|---|
| PPO | Yes | Medium | High |
| GRPO | No (rule-based reward) | High | High (DeepSeek-R1) |
| DPO | No | High | Medium (better for alignment than reasoning) |
GRPO's success suggests reasoning doesn't need supervised annotations — just a clear outcome signal and enough RL iterations. This is a major shift in how reasoning capability is trained.
Historical scaling laws focused on training compute: more parameters, more tokens, better model. But for reasoning tasks, scaling inference compute yields a different curve.
Snell et al. 2024 showed: a smaller model with 16× more inference compute can match a much larger model on math benchmarks. You can spend compute at inference time to compensate for model size. This creates a new product dimension.
Instead of training one massive model, train a smaller base and route queries intelligently: easy queries go to a cheap, fast model. Hard queries route to a reasoning-heavy path with more thinking tokens. High-stakes queries get maximum compute. Each query's difficulty is matched to appropriate resources.
| Approach | Model | Tokens per answer | Latency | Cost/query |
|---|---|---|---|---|
| Standard | GPT-4o | ~300 | 2s | Low |
| Reasoning | o1-mini | ~2000 | 15s | Medium |
| Heavy reasoning | o1 | ~5000 | 40s | High |
| Hybrid | Route by difficulty | varies | varies | Optimal |
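A hybrid router can be sketched as below. The model names mirror the table, but the thresholds, the `max_thinking_tokens` field, and the difficulty score (which in practice might come from a small classifier) are all assumptions of the sketch:

```python
def route(query: str, difficulty: float, stakes: str = "normal") -> dict:
    """Match each query's estimated difficulty to an inference budget.

    `difficulty` in [0, 1] is assumed to come from an upstream estimator.
    """
    if stakes == "high":
        return {"model": "o1", "max_thinking_tokens": 5000}       # maximum compute
    if difficulty < 0.3:
        return {"model": "gpt-4o", "max_thinking_tokens": 0}      # direct answer
    if difficulty < 0.7:
        return {"model": "o1-mini", "max_thinking_tokens": 2000}  # moderate reasoning
    return {"model": "o1", "max_thinking_tokens": 5000}           # heavy reasoning

print(route("What is the capital of France?", difficulty=0.05))
```

The economics follow from the table: most traffic is easy and lands on the cheap path, so the average cost per query stays close to the standard model's while hard queries still get reasoning-level accuracy.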
Reasoning models are now production-ready. The ecosystem has matured rapidly. Multiple options exist for different budget and latency constraints.
```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ReasoningTrace(BaseModel):
    thinking: str
    answer: str
    confidence: float

COT_SYSTEM = """You are a careful reasoner. For every question:
1. Think step-by-step in the 'thinking' field
2. Only then give your final 'answer'
3. Rate your 'confidence' from 0.0 to 1.0
Show your work explicitly — do not skip steps."""

def cot_solve(problem: str) -> ReasoningTrace:
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": COT_SYSTEM},
            {"role": "user", "content": problem},
        ],
        response_format=ReasoningTrace,
        temperature=0.0,
    )
    return result.choices[0].message.parsed

# Example: multi-step math problem
trace = cot_solve(
    "A train leaves city A at 60mph. Another leaves city B "
    "(300 miles away) at 40mph heading toward A. When do they meet?"
)
print(f"Thinking: {trace.thinking[:200]}")
print(f"Answer: {trace.answer}")
print(f"Confidence: {trace.confidence:.0%}")
```
```python
# Process Reward Model pattern: score each reasoning step
def score_reasoning_steps(steps: list[str], question: str) -> list[float]:
    """PRM: score each intermediate reasoning step for correctness."""
    scores = []
    for i, step in enumerate(steps):
        prompt = (
            f"Q: {question}\nReasoning so far:\n"
            + "\n".join(steps[: i + 1])
            + "\nIs the last step correct and helpful? Score 0.0-1.0:"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0.0,
        ).choices[0].message.content.strip()
        try:
            scores.append(float(resp))
        except ValueError:  # model returned non-numeric text
            scores.append(0.5)  # fall back to a neutral score
    return scores
```
Multi-step math: Algebra, calculus, probability. Reasoning models excel. Include a verifier (sympy) when possible.
Code debugging: The model can trace execution, identify the bug, propose a fix. Better than prompt engineering alone.
Logical deduction: Logic puzzles, constraint satisfaction. Reasoning helps systematically explore the solution space.
Long legal/medical analysis: Complex documents with many dependencies. Reasoning helps track constraints and cross-references.
Simple retrieval: "What is the capital of France?" No reasoning needed. 100ms model beats 10s reasoning model.
Summarization: Reasoning adds latency with no quality gain. SFT-optimized models are faster and sufficient.
Classification: Is this email spam? Reasoning is overkill. A small classifier or few-shot prompt is faster and cheaper.
Measure the impact: add reasoning only where it measurably improves accuracy or enables capabilities you can't otherwise achieve.