REASONING & INFERENCE

Advanced Reasoning in LLMs

Chain-of-thought, tree search, and test-time compute scaling — how models think step by step

  • CoT → o1 → R1: the evolution of reasoning
  • Think before answering: the core insight
  • Scale inference, not just training: the paradigm shift
Contents
  1. What is reasoning
  2. Chain-of-thought prompting
  3. Scratchpad training & PRM
  4. Tree search at inference
  5. DeepSeek-R1 & GRPO
  6. Test-time compute scaling
  7. Practical implementation
01 — Foundation

What Is Reasoning in LLMs?

A standard LLM operates in one forward pass: input → output. No intermediate deliberation. The model generates the final answer directly, token by token, in a single left-to-right pass.

Reasoning LLMs fundamentally change this: they generate explicit intermediate steps (tokens) before the final answer. These intermediate tokens represent "thinking" — the model shows its work, explores paths, checks assumptions, and validates logic before committing to a conclusion.

Two axes define reasoning approaches: where it happens (in the prompt via CoT, in model weights from training, or via search algorithms) and when it happens (during training time with special loss functions, or at inference time with compute budget).

The key insight: more thinking tokens = better answers on hard tasks, at the cost of latency and token spend. This trades inference compute for quality — a new scaling dimension beyond just model size and pretraining tokens.

⚠️ Reasoning models trade latency for accuracy. Always benchmark both metrics before choosing. A 50-second reasoning model is useless in a latency-critical chat interface, but essential in batch scoring or research contexts.
02 — Prompting Method

Chain-of-Thought Prompting

Wei et al. (2022) made a simple discovery: prompting with worked step-by-step examples unlocks multi-step reasoning in large models, and Kojima et al. (2022) showed that merely appending "Let's think step by step" works even zero-shot. This costs nearly nothing: no retraining, no new data, just better prompting.

Zero-shot CoT: Append "Let's think step by step" to any prompt. Works reliably on models of roughly 100B+ parameters. The model has already learned to reason during training; it just doesn't exhibit that reasoning by default.

Few-shot CoT: Provide worked examples showing explicit reasoning chains: a question, then its step-by-step reasoning, then the answer. More expensive to construct, but few-shot CoT reliably outperforms zero-shot.
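A few-shot CoT prompt is just careful string assembly. A minimal sketch (the exemplar and its reasoning are illustrative, not drawn from any benchmark):

```python
# Build a few-shot CoT prompt: each exemplar shows question -> reasoning -> answer.
EXEMPLARS = [
    {
        "q": "A book costs $8. How much do 3 books cost?",
        "steps": "Step 1: 3 × 8 = 24.",
        "a": "$24",
    },
]

def build_cot_prompt(question: str) -> str:
    parts = []
    for ex in EXEMPLARS:
        parts.append(
            f"Q: {ex['q']}\nLet's think step by step.\n{ex['steps']}\nA: {ex['a']}"
        )
    # The new question ends with the same cue so the model continues the pattern.
    parts.append(f"Q: {question}\nLet's think step by step.")
    return "\n\n".join(parts)

prompt = build_cot_prompt("If 3 items cost $4.50 each and you have $20, what's left?")
```

The model's completion then starts mid-reasoning-chain, mimicking the exemplar's structure.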

Self-consistency: Don't just sample one CoT chain. Sample multiple independent chains (typically 5–10), let the model reason differently each time, then take a majority vote on the final answer. This often adds several points of accuracy on math benchmarks for only the extra sampling cost.
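Self-consistency is just a majority vote over the final answers of independently sampled chains. A minimal sketch, with the sampling call elided and only the vote shown:

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    """Return the most common final answer across sampled CoT chains.

    Answers are normalized so '6.50' and ' 6.50 ' count as the same vote.
    """
    normalized = [a.strip().lower() for a in final_answers]
    winner, _count = Counter(normalized).most_common(1)[0]
    return winner

# Five sampled chains, three of which agree:
majority_vote(["6.50", "6.50", "7.00", "6.50", "6.00"])  # → "6.50"
```

Note the vote is over final answers only; the chains themselves can disagree wildly, and that diversity is precisely what makes the vote robust.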

Direct answer vs CoT:

Direct:
  Q: If 3 items cost $4.50 each and you have $20, what's left?
  A: $6.50

CoT:
  Q: [same question]
  Let's think step by step.
  Step 1: Cost = 3 × 4.50 = 13.50
  Step 2: Change = 20 - 13.50 = 6.50
  A: $6.50 (same answer, but auditable)
💡 CoT helps mainly on multi-step math, logic, and code. It rarely helps on factual recall or single-lookup tasks. Don't use CoT for everything — measure the impact on your specific task.
03 — Training Method

Scratchpad Training and Process Reward Models

Instead of prompting for CoT, train the model to produce reasoning as part of its learned output distribution. OpenAI's o1 approach: train on long CoT traces, then hide the reasoning scratchpad from users at inference time.

This shifts reasoning from a prompting trick to a learned behavior. The model learns when to reason, how deep to reason, and how to validate its own work.

Two reward modeling approaches exist: Outcome Reward Model (ORM) scores only the final answer — correct or incorrect. Simple to label. But it can't penalize flawed intermediate steps that luckily arrive at the right conclusion.

Process Reward Model (PRM) annotates every reasoning step, scoring each independently. Much more expensive to label (each solution requires dozens of step annotations). But dramatically better for hard math — it teaches the model which intermediate reasoning paths are sound, not just whether the final answer is right.

| Aspect         | ORM                                                 | PRM                           |
|----------------|-----------------------------------------------------|-------------------------------|
| What it scores | Final answer                                        | Each step                     |
| Data needed    | (prompt, answer) pairs                              | Step-by-step annotations      |
| Good for       | General tasks                                       | Math, code, logic             |
| Weakness       | Can't catch wrong steps that luck into right answer | Expensive to label            |
| Used by        | Most RLHF pipelines                                 | DeepMind, OpenAI o1 (rumored) |
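In best-of-N reranking with a PRM, the per-step scores must be collapsed into one solution-level score; taking the minimum (a chain is only as sound as its weakest step) or the product are common choices in the PRM literature. A minimal dependency-free sketch, with the scoring model itself elided:

```python
import math

def solution_score(step_scores: list[float], how: str = "min") -> float:
    """Aggregate per-step PRM scores into a single solution-level score."""
    if not step_scores:
        return 0.0
    if how == "min":   # weakest-step aggregation
        return min(step_scores)
    if how == "prod":  # treat steps as independent correctness probabilities
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")

def best_of_n(candidates: list[list[float]]) -> int:
    """Return the index of the candidate solution with the best aggregate score."""
    return max(range(len(candidates)), key=lambda i: solution_score(candidates[i]))

# Candidate 1 has a weak middle step; candidate 0 does not, so it wins.
best_of_n([[0.9, 0.8, 0.9], [0.9, 0.2, 0.95]])  # → 0
```

The "min" aggregation is deliberately harsh: one unsound step sinks the whole chain, which is exactly the failure mode ORMs cannot catch.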

PRMs represent the frontier of reasoning supervision. Research from OpenAI (e.g., "Let's Verify Step by Step") and DeepMind suggests PRMs can unlock significantly higher accuracy on competition-level mathematics and formal proofs.

05 — Pure RL Approach

DeepSeek-R1 and GRPO

DeepSeek-R1 demonstrated a profound finding: reasoning capability can be trained via pure reinforcement learning, without supervised CoT traces (its R1-Zero variant saw none at all). Start from a base model. Apply a simple RL algorithm. Reasoning emerges.

GRPO (Group Relative Policy Optimization): Sample a group of G responses. Score each with a rule-based reward (e.g., 1 if final answer correct). Compute advantage for each response relative to the group mean, not global baseline. Update policy to increase likelihood of high-advantage responses.

GRPO sidesteps the reward hacking seen in PPO-based RLHF, where the policy learns to exploit a learned reward model for inflated scores. It is also simpler: with rule-based rewards there is no reward model to train, and the group-relative baseline removes the need for a separate value (critic) network. And empirically it works: DeepSeek-R1 shows reasoning ability without ever seeing annotated CoT data during RL training.

A key emergent phenomenon: the model develops an "aha moment" behavior — spontaneously re-examining its work mid-chain when it notices a mistake, then correcting course. This self-correction emerges purely from the RL signal, not from explicit training.

GRPO update rule (pseudocode):

  for each prompt p:
      sample G responses {r_1, ..., r_G}
      score_i     = reward_fn(r_i)             # e.g., 1 if final answer correct
      advantage_i = (score_i - mean) / std     # mean, std taken over the group
      loss        = -E[advantage_i * log π(r_i | p)]
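The reward and advantage steps of this pseudocode can be made concrete. A minimal dependency-free sketch, with the policy update elided; the regex answer-extraction is an illustrative assumption, not DeepSeek's actual reward function:

```python
import re

def reward_fn(response: str, gold: str) -> float:
    """Rule-based reward: 1.0 if the last number in the response matches gold."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == gold else 0.0

def group_advantages(scores: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each score against the group's own mean and std (GRPO's baseline)."""
    g = len(scores)
    mean = sum(scores) / g
    std = (sum((s - mean) ** 2 for s in scores) / g) ** 0.5
    return [(s - mean) / (std + eps) for s in scores]

# A group of G=4 sampled responses, two of which end in the correct answer:
scores = [reward_fn(r, "6.50") for r in
          ["... so the change is 6.50", "... leaving 7.00",
           "... answer: 6.50", "... answer: 5.50"]]
adv = group_advantages(scores)  # correct responses get positive advantage
```

Because the baseline is the group mean rather than a learned value estimate, a response is only rewarded for being better than its siblings for the same prompt.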
| Method | Needs reward model     | Sample efficiency | Reasoning quality                            |
|--------|------------------------|-------------------|----------------------------------------------|
| PPO    | Yes                    | Medium            | High                                         |
| GRPO   | No (rule-based reward) | High              | High (DeepSeek-R1)                           |
| DPO    | No                     | High              | Medium (better for alignment than reasoning) |

GRPO's success suggests reasoning doesn't need supervised annotations — just a clear outcome signal and enough RL iterations. This is a major paradigm shift in alignment.

06 — Scaling Laws

Test-Time Compute Scaling

Historical scaling laws focused on training compute: more parameters, more tokens, better model. But for reasoning tasks, scaling inference compute yields a different curve.

Snell et al. 2024 showed: a smaller model with 16× more inference compute can match a much larger model on math benchmarks. You can spend compute at inference time to compensate for model size. This creates a new product dimension.

Instead of training one massive model, train a smaller base and route queries intelligently: easy queries go to a cheap, fast model. Hard queries route to a reasoning-heavy path with more thinking tokens. High-stakes queries get maximum compute. Each query's difficulty is matched to appropriate resources.

| Approach        | Model               | Tokens per answer | Latency | Cost/query |
|-----------------|---------------------|-------------------|---------|------------|
| Standard        | GPT-4o              | ~300              | 2 s     | Low        |
| Reasoning       | o1-mini             | ~2,000            | 15 s    | Medium     |
| Heavy reasoning | o1                  | ~5,000            | 40 s    | High       |
| Hybrid          | Route by difficulty | varies            | varies  | Optimal    |
💡 The emerging pattern: use a fast model to classify query difficulty, then route hard queries to reasoning model. This optimizes total cost while maintaining quality SLAs. A 10-token classifier on every request costs almost nothing but determines the full inference strategy downstream.
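One way to implement this routing pattern is to keep the classifier pluggable and the dispatch table pure, so the logic can be tested without network calls. The tier names, model names, and token budgets below are illustrative assumptions, not real API identifiers:

```python
from typing import Callable

# Illustrative tiers: model names and budgets are assumptions, not fixed APIs.
ROUTES = {
    "easy":     {"model": "fast-chat-model",       "max_reasoning_tokens": 0},
    "hard":     {"model": "reasoning-model",       "max_reasoning_tokens": 2000},
    "critical": {"model": "reasoning-model-large", "max_reasoning_tokens": 5000},
}

def route(query: str, classify: Callable[[str], str]) -> dict:
    """Pick an inference config based on a cheap difficulty classifier."""
    difficulty = classify(query)
    # Unknown labels fall back to the safe middle tier.
    return ROUTES.get(difficulty, ROUTES["hard"])

# Stub classifier for testing; in production this is the ~10-token LLM call.
def keyword_classifier(query: str) -> str:
    hard_words = ("prove", "derive", "debug")
    return "hard" if any(w in query.lower() for w in hard_words) else "easy"

route("What is the capital of France?", keyword_classifier)  # → easy tier, 0 thinking tokens
```

Swapping `keyword_classifier` for a small LLM call changes nothing downstream, which is the point of keeping the dispatch pure.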
07 — Production

Practical Implementation

Reasoning models are now production-ready. The ecosystem has matured rapidly. Multiple options exist for different budget and latency constraints.

Hosted API
OpenAI o1 / o3
Fully trained reasoning models. Highest quality on benchmarks. High latency (40s+), high cost. State-of-the-art.
Hosted API
DeepSeek-R1
Open-source reasoning via API. Strong math/code performance. Lower cost than o1. GRPO-trained.
Open Source
QwQ-32B
Alibaba's reasoning model. Can run locally. Smaller than o1 but solid performance on logic puzzles.
Hosted API
Gemini 2.0 Flash Thinking
Google's reasoning model. Extended thinking mode reveals step-by-step work. Good cost/performance ratio.
Framework
vLLM
Open-source serving engine for running reasoning models locally. Handles long reasoning chains efficiently.
Framework
SGLang
Structured generation framework. Supports tree search APIs and reasoning workflows.
Python · Chain-of-thought prompting with structured reasoning traces
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ReasoningTrace(BaseModel):
    thinking: str
    answer: str
    confidence: float

COT_SYSTEM = """You are a careful reasoner. For every question:
1. Think step-by-step in the 'thinking' field
2. Only then give your final 'answer'  
3. Rate your 'confidence' from 0.0 to 1.0

Show your work explicitly — do not skip steps."""

def cot_solve(problem: str) -> ReasoningTrace:
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": COT_SYSTEM},
            {"role": "user", "content": problem}
        ],
        response_format=ReasoningTrace,
        temperature=0.0
    )
    return result.choices[0].message.parsed

# Example: multi-step math problem
trace = cot_solve(
    "A train leaves city A at 60mph. Another leaves city B "
    "(300 miles away) at 40mph heading toward A. When do they meet?"
)
print(f"Thinking: {trace.thinking[:200]}")
print(f"Answer: {trace.answer}")
print(f"Confidence: {trace.confidence:.0%}")

# Process Reward Model pattern: score each reasoning step
def score_reasoning_steps(steps: list[str], question: str) -> list[float]:
    """PRM: score each intermediate reasoning step for correctness."""
    scores = []
    for i in range(len(steps)):
        # Show the chain up to and including step i, then ask for a step score.
        prompt = (
            f"Q: {question}\n"
            "Reasoning so far:\n"
            + "\n".join(steps[: i + 1])
            + "\nIs this step correct and helpful? Score 0.0-1.0:"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0.0,
        ).choices[0].message.content.strip()
        try:
            scores.append(float(resp))
        except ValueError:
            scores.append(0.5)  # neutral fallback when the reply isn't a number
    return scores

When to Use Reasoning Models

Multi-step math: Algebra, calculus, probability. Reasoning models excel. Include a verifier (sympy) when possible.

Code debugging: The model can trace execution, identify the bug, propose a fix. Better than prompt engineering alone.

Logical deduction: Logic puzzles, constraint satisfaction. Reasoning helps systematically explore the solution space.

Long legal/medical analysis: Complex documents with many dependencies. Reasoning helps track constraints and cross-references.
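The verifier suggestion above can be sketched without external dependencies: sympy handles symbolic cases, but for plain arithmetic, exact rational arithmetic from the stdlib `fractions` module suffices. A minimal check of the worked example from earlier on this page (the function name and signature are illustrative):

```python
from fractions import Fraction

def verify_change(unit_price: str, count: int, paid: str, claimed: str) -> bool:
    """Recompute change exactly with rational arithmetic and compare to the model's answer."""
    expected = Fraction(paid) - count * Fraction(unit_price)
    return expected == Fraction(claimed)

# The CoT example: 3 items at $4.50 with $20 should leave $6.50.
verify_change("4.50", 3, "20", "6.50")  # → True
```

Exact rationals avoid the float-rounding false negatives (e.g., 0.1 + 0.2 != 0.3) that plague naive answer checking.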

When NOT to Use

Simple retrieval: "What is the capital of France?" No reasoning needed. 100ms model beats 10s reasoning model.

Summarization: Reasoning adds latency with no quality gain. SFT-optimized models are faster and sufficient.

Classification: Is this email spam? Reasoning is overkill. A small classifier or few-shot prompt is faster and cheaper.

Measure the impact: add reasoning only where it measurably improves accuracy or enables capabilities you can't otherwise achieve.

08 — Further Reading

References

Practitioner Resources
  • OpenAI (2024). Reasoning with o1. Blog post with architectural insights and benchmark results. — openai.com/research
  • DeepSeek (2025). DeepSeek-R1: Our Reasoning Model. Technical blog post with performance analysis. — deepseek.com/blog