Chain-of-thought, tree search, and test-time compute scaling — how models think step by step
A standard LLM operates in one forward pass: input → output. No intermediate deliberation. The model generates the final answer directly, token by token, in a single left-to-right pass.
Reasoning LLMs fundamentally change this: they generate explicit intermediate steps (tokens) before the final answer. These intermediate tokens represent "thinking" — the model shows its work, explores paths, checks assumptions, and validates logic before committing to a conclusion.
Two axes define reasoning approaches: where it happens (in the prompt via CoT, in model weights from training, or via search algorithms) and when it happens (during training time with special loss functions, or at inference time with compute budget).
The key insight: more thinking tokens = better answers on hard tasks, at the cost of latency and token spend. This trades inference compute for quality — a new scaling dimension beyond just model size and pretraining tokens.
Chain-of-thought prompting began with a simple discovery: Wei et al. 2022 showed that few-shot exemplars containing explicit reasoning steps unlock multi-step reasoning in large models, and Kojima et al. 2022 found that merely appending "Let's think step by step" works zero-shot. This costs nearly nothing: no retraining, no new data, just better prompting.
Zero-shot CoT: Append "Let's think step by step" to any prompt. Works reliably only on sufficiently large models (roughly ≥100B parameters in the original study). The reasoning capability is already latent from pretraining; the phrase simply elicits it.
Few-shot CoT: Provide worked examples showing explicit reasoning chains. More expensive to construct, but more reliable. Show the model a question, then step-by-step reasoning, then the answer. Few-shot versions reliably outperform zero-shot.
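To make the two styles concrete, here is a minimal sketch of how the prompts differ (the question and exemplar are invented for illustration):

```python
def zero_shot_cot(question: str) -> str:
    """Zero-shot CoT: append the trigger phrase to any question."""
    return f"Q: {question}\nA: Let's think step by step."

# One hand-written worked example with explicit reasoning (few-shot exemplar).
FEW_SHOT_EXEMPLAR = (
    "Q: A bag has 3 red and 5 blue marbles. What fraction is red?\n"
    "A: Total marbles = 3 + 5 = 8. Red marbles = 3. Fraction = 3/8. "
    "The answer is 3/8.\n\n"
)

def few_shot_cot(question: str) -> str:
    """Few-shot CoT: prepend worked examples, then ask the new question."""
    return FEW_SHOT_EXEMPLAR + f"Q: {question}\nA:"
```

The zero-shot version costs nothing to build; the few-shot version buys reliability at the cost of writing (and spending tokens on) the exemplars.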
Self-consistency: Don't just sample one CoT chain. Sample multiple independent chains (typically 5–10), let the model reason differently each time, then take a majority vote on the final answer. Adds 3–5% accuracy on math benchmarks with just sampling cost.
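The voting step can be sketched in a few lines. Here `sample_answer` stands in for a real temperature>0 model call that returns the final answer extracted from one reasoning chain (an assumption of this sketch):

```python
import random
from collections import Counter

def self_consistency(sample_answer, question: str, n: int = 7) -> str:
    """Sample n independent CoT chains and majority-vote the final answers.

    `sample_answer` is any callable question -> final-answer string; in
    practice, one temperature>0 model call per chain.
    """
    answers = [sample_answer(question) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Toy stand-in sampler: reasons correctly about 2/3 of the time.
random.seed(0)
def noisy_sampler(question: str) -> str:
    return "42" if random.random() < 0.66 else "41"

print(self_consistency(noisy_sampler, "What is 6 * 7?"))  # majority recovers "42"
```

The point of the vote: individual chains may go wrong in different ways, but correct reasoning paths tend to converge on the same final answer.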
Instead of prompting for CoT, train the model to produce reasoning as part of its learned output distribution. OpenAI's o1 approach: train on long CoT traces, then hide the reasoning scratchpad from users at inference time.
This shifts reasoning from a prompting trick to a learned behavior. The model learns when to reason, how deep to reason, and how to validate its own work.
Two reward modeling approaches exist. An Outcome Reward Model (ORM) scores only the final answer, correct or incorrect. Simple to label, but it can't penalize flawed intermediate steps that happen to luck into the right final answer.
Process Reward Model (PRM) annotates every reasoning step, scoring each independently. Much more expensive to label (each solution requires dozens of step annotations). But dramatically better for hard math — it teaches the model which intermediate reasoning paths are sound, not just whether the final answer is right.
| Aspect | ORM | PRM |
|---|---|---|
| What it scores | Final answer | Each step |
| Data needed | (prompt, answer) pairs | Step-by-step annotations |
| Good for | General tasks | Math, code, logic |
| Weakness | Can't catch wrong steps that luck into right answer | Expensive to label |
| Used by | Most RLHF pipelines | DeepMind, OpenAI o1 (rumored) |
PRMs represent the frontier of reasoning supervision. OpenAI's "Let's Verify Step by Step" (Lightman et al. 2023) and earlier DeepMind work on process- vs. outcome-based feedback showed that process supervision yields significantly higher accuracy on competition-level mathematics than outcome supervision alone.
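To see what the labeling-cost difference means concretely, here are illustrative training-record shapes for the two approaches (the field names are invented for this sketch, not a standard schema):

```python
# ORM: one label per solution. Cheap, but the reasoning is a black box.
orm_record = {
    "prompt": "Solve: 12 * 15 = ?",
    "solution": "12 * 15 = 12 * 10 + 12 * 5 = 120 + 60 = 180",
    "label": 1,  # single bit: was the final answer correct?
}

# PRM: one label per step. Expensive, but localizes exactly where
# reasoning goes wrong.
prm_record = {
    "prompt": "Solve: 12 * 15 = ?",
    "steps": [
        "12 * 15 = 12 * 10 + 12 * 5",  # sound decomposition
        "12 * 10 = 120",
        "12 * 5 = 70",                 # arithmetic slip
        "120 + 70 = 190",
    ],
    "step_labels": [1, 1, 0, 0],  # the slip and everything downstream get 0
}
```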
Instead of one left-to-right reasoning chain, explore a tree of possible reasoning paths. Different intermediate steps lead to different continuations. Some paths are correct; some are dead ends.
Best-of-N (BoN): Generate N independent solutions, run each through a verifier (e.g., "does the code run?", "does the math check out?"), pick the highest-scored. Simple, embarrassingly parallel, but expensive — N=16 means 16× compute.
Beam search over reasoning steps: Maintain a fixed set of k "active" partial chains. At each reasoning step, expand each chain, score all branches, prune to top k. More efficient than BoN but harder to parallelize within a single request.
Monte Carlo Tree Search (MCTS): Sample reasoning paths, backpropagate outcome rewards through the tree structure. Focuses compute on promising branches. Used in rStar-Math and similar competition-math systems. Highest quality, but typically 50–200× the compute.
| Method | Compute | Quality | Parallelizable | When to use |
|---|---|---|---|---|
| Greedy (no search) | 1× | Baseline | N/A | Latency-critical |
| Best-of-N (N=16) | 16× | +5–15% on math | Yes | Offline eval, high accuracy |
| Beam search (k=4) | 4× | +3–8% | Partial | Batch inference |
| MCTS | 50–200× | +15–25% | Partial | Research, competition math |
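Of the methods above, Best-of-N is the easiest to sketch. Here `generate` and `verify` are placeholders for a sampling model call and a domain verifier respectively (assumptions of this sketch):

```python
def best_of_n(generate, verify, problem: str, n: int = 16):
    """Best-of-N: sample n candidate solutions, keep the top-scoring one.

    `generate`: problem -> candidate solution (one temperature>0 model call)
    `verify`:   (problem, candidate) -> score in [0, 1], e.g. run the code,
                check the math with sympy, or query a reward model
    """
    candidates = [generate(problem) for _ in range(n)]  # embarrassingly parallel
    return max(candidates, key=lambda c: verify(problem, c))

# Toy demo: candidates are numbers, the verifier checks c * c == 49.
cands = iter([6, 8, 7, 5])
answer = best_of_n(lambda p: next(cands),
                   lambda p, c: 1.0 if c * c == 49 else 0.0,
                   "sqrt(49)?", n=4)
# answer == 7
```

In production the candidate generations run as parallel API calls, which is why BoN costs N× compute but adds little wall-clock latency.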
DeepSeek-R1 demonstrated a profound finding: you can train reasoning capability via pure reinforcement learning, without supervised CoT traces. Start from a base model. Apply a simple RL algorithm. Reasoning emerges.
GRPO (Group Relative Policy Optimization): Sample a group of G responses. Score each with a rule-based reward (e.g., 1 if final answer correct). Compute advantage for each response relative to the group mean, not global baseline. Update policy to increase likelihood of high-advantage responses.
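The group-relative advantage step can be sketched in a few lines. This omits GRPO's clipped policy-update objective and KL penalty toward a reference model, which the full algorithm (as described in the DeepSeekMath paper) also includes:

```python
import math

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: standardize each reward against its group.

    No learned value network: the group mean is the baseline and the group
    standard deviation normalizes the scale.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g) or 1.0  # guard zero variance
    return [(r - mean) / std for r in rewards]

# Group of 4 sampled answers; rule-based reward: 1.0 if final answer correct.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct answers get +1.0 advantage, incorrect get -1.0; the policy update
# then pushes probability mass toward the positive-advantage responses.
```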
GRPO is simpler than PPO: it drops the learned value (critic) network entirely, using the group mean as its baseline, and DeepSeek-R1 pairs it with rule-based rewards rather than a learned reward model, sidestepping the reward hacking that learned reward models invite. And empirically it works: DeepSeek-R1 develops strong reasoning without ever seeing annotated CoT data during RL training.
A key emergent phenomenon: the model develops an "aha moment" behavior — spontaneously re-examining its work mid-chain when it notices a mistake, then correcting course. This self-correction emerges purely from the RL signal, not from explicit training.
| Method | Needs reward model | Sample efficiency | Reasoning quality |
|---|---|---|---|
| PPO | Yes | Medium | High |
| GRPO | No (rule-based reward) | High | High (DeepSeek-R1) |
| DPO | No | High | Medium (better for alignment than reasoning) |
GRPO's success suggests reasoning doesn't need supervised annotations — just a clear outcome signal and enough RL iterations. This is a major shift in how reasoning capability is trained.
Historical scaling laws focused on training compute: more parameters, more tokens, better model. But for reasoning tasks, scaling inference compute yields a different curve.
Snell et al. 2024 showed: a smaller model with 16× more inference compute can match a much larger model on math benchmarks. You can spend compute at inference time to compensate for model size. This creates a new product dimension.
Instead of training one massive model, train a smaller base and route queries intelligently: easy queries go to a cheap, fast model. Hard queries route to a reasoning-heavy path with more thinking tokens. High-stakes queries get maximum compute. Each query's difficulty is matched to appropriate resources.
| Approach | Model | Tokens per answer | Latency | Cost/query |
|---|---|---|---|---|
| Standard | GPT-4o | ~300 | 2s | Low |
| Reasoning | o1-mini | ~2000 | 15s | Medium |
| Heavy reasoning | o1 | ~5000 | 40s | High |
| Hybrid | Route by difficulty | varies | varies | Optimal |
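A hybrid router can be sketched as below. The model names mirror the table, but the thresholds, the `max_thinking_tokens` field, and the difficulty score (which in practice might come from a small classifier) are all assumptions of the sketch:

```python
def route(query: str, difficulty: float, stakes: str = "normal") -> dict:
    """Match each query's estimated difficulty to an inference budget.

    `difficulty` in [0, 1] is assumed to come from an upstream estimator.
    """
    if stakes == "high":
        return {"model": "o1", "max_thinking_tokens": 5000}       # maximum compute
    if difficulty < 0.3:
        return {"model": "gpt-4o", "max_thinking_tokens": 0}      # direct answer
    if difficulty < 0.7:
        return {"model": "o1-mini", "max_thinking_tokens": 2000}  # moderate reasoning
    return {"model": "o1", "max_thinking_tokens": 5000}           # heavy reasoning

print(route("What is the capital of France?", difficulty=0.05))
```

The economics follow from the table: most traffic is easy and lands on the cheap path, so the average cost per query stays close to the standard model's while hard queries still get reasoning-level accuracy.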
Reasoning models are now production-ready. The ecosystem has matured rapidly. Multiple options exist for different budget and latency constraints.
```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ReasoningTrace(BaseModel):
    thinking: str
    answer: str
    confidence: float

COT_SYSTEM = """You are a careful reasoner. For every question:
1. Think step-by-step in the 'thinking' field
2. Only then give your final 'answer'
3. Rate your 'confidence' from 0.0 to 1.0
Show your work explicitly — do not skip steps."""

def cot_solve(problem: str) -> ReasoningTrace:
    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": COT_SYSTEM},
            {"role": "user", "content": problem},
        ],
        response_format=ReasoningTrace,
        temperature=0.0,
    )
    return result.choices[0].message.parsed

# Example: multi-step math problem
trace = cot_solve(
    "A train leaves city A at 60mph. Another leaves city B "
    "(300 miles away) at 40mph heading toward A. When do they meet?"
)
print(f"Thinking: {trace.thinking[:200]}")
print(f"Answer: {trace.answer}")
print(f"Confidence: {trace.confidence:.0%}")
```
```python
# Process Reward Model pattern: score each reasoning step
def score_reasoning_steps(steps: list[str], question: str) -> list[float]:
    """PRM: score each intermediate reasoning step for correctness."""
    scores = []
    for i, step in enumerate(steps):
        prompt = (
            f"Q: {question}\nReasoning so far:\n"
            + "\n".join(steps[: i + 1])
            + "\nIs the last step correct and helpful? Score 0.0-1.0:"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0.0,
        ).choices[0].message.content.strip()
        try:
            scores.append(float(resp))
        except ValueError:  # model returned non-numeric text
            scores.append(0.5)  # fall back to a neutral score
    return scores
```
Multi-step math: Algebra, calculus, probability. Reasoning models excel. Include a verifier (sympy) when possible.
Code debugging: The model can trace execution, identify the bug, propose a fix. Better than prompt engineering alone.
Logical deduction: Logic puzzles, constraint satisfaction. Reasoning helps systematically explore the solution space.
Long legal/medical analysis: Complex documents with many dependencies. Reasoning helps track constraints and cross-references.
Simple retrieval: "What is the capital of France?" No reasoning needed. 100ms model beats 10s reasoning model.
Summarization: Reasoning adds latency with no quality gain. SFT-optimized models are faster and sufficient.
Classification: Is this email spam? Reasoning is overkill. A small classifier or few-shot prompt is faster and cheaper.
Measure the impact: add reasoning only where it measurably improves accuracy or enables capabilities you can't otherwise achieve.