Advanced Reasoning

Extended Thinking

Give the model a private reasoning space before it commits to an answer. This dramatically improves accuracy on hard math, logic, and multi-step problems.

Reasoning space: private
Best use case: harder problems
Key parameter: budget_tokens

SECTION 01

Scratchpad vs visible CoT

Standard chain-of-thought asks the model to reason out loud in its response — what you see is what it thought. Extended thinking goes further: the model gets a private scratch space to work through the problem before producing any visible output. The reasoning happens in a separate, hidden layer.

Think of it like an exam. Regular CoT is "show your work in the answer." Extended thinking is "take as long as you need in a private rough-work section, then write your final answer." The rough work is discarded; only the final answer is returned — unless you explicitly ask to see the thinking too.
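The difference shows up directly in the shape of the request. A minimal sketch (request payloads only; the prompt text and budget are placeholders):

```python
# Visible CoT: the reasoning is requested inside the answer itself
cot_request = {
    "messages": [{
        "role": "user",
        "content": "Think step by step, then answer: ...",
    }],
}

# Extended thinking: the reasoning lives in a private block, controlled
# by a separate parameter; the prompt itself stays clean
thinking_request = {
    "messages": [{
        "role": "user",
        "content": "Answer: ...",
    }],
    "thinking": {"type": "enabled", "budget_tokens": 5000},
}
```

With visible CoT, changing how much the model reasons means rewriting the prompt; with extended thinking it's a numeric knob.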

SECTION 02

How extended thinking works

When extended thinking is enabled, the model generates a special "thinking" block that doesn't appear in the response unless you request it. This block can contain exploratory reasoning, rejected hypotheses, backtracking, and re-evaluation — things the model wouldn't include in a polished answer.

When it matters most: Complex multi-step problems where the first approach often fails — hard math, competitive programming, multi-hop reasoning, strategic planning. For simple tasks, there's no benefit and extra cost.
SECTION 03

The Anthropic API

```python
import anthropic

client = anthropic.Anthropic()

# Enable extended thinking with budget_tokens
response = client.messages.create(
    model="claude-opus-4-6",  # A model that supports extended thinking
    max_tokens=16000,         # Must be > budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 10000  # Max tokens for the internal reasoning
    },
    messages=[{
        "role": "user",
        "content": """Solve this logic puzzle:
Five people (Alice, Bob, Carol, Dave, Eve) each have a different job
(doctor, lawyer, teacher, engineer, chef) and a different pet
(cat, dog, bird, fish, rabbit)...
[puzzle constraints here]
Who has the fish?"""
    }]
)

# The response has multiple content blocks
for block in response.content:
    if block.type == "thinking":
        print("=== REASONING ===")
        print(block.thinking[:500])  # First 500 chars of reasoning
        print("...")
    elif block.type == "text":
        print("=== ANSWER ===")
        print(block.text)
```
SECTION 04

Controlling reasoning depth

```python
import anthropic

client = anthropic.Anthropic()

def think_hard(problem: str, budget: int = 8000) -> str:
    """Solve a hard problem with extended thinking."""
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=budget + 2000,  # Allow room for the answer after thinking
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": problem}]
    )
    # Return only the text answer, not the thinking
    for block in response.content:
        if block.type == "text":
            return block.text
    return ""

# Budget guidance:
#  1,000 tokens: simple multi-step problems
#  5,000 tokens: moderate complexity (logic puzzles, medium math)
# 10,000 tokens: hard problems (competition math, complex code)
# 30,000 tokens: very hard (research-level, multi-constraint planning)

# Quick test
easy = think_hard("What is 23 × 47?", budget=1024)  # Minimum allowed budget; overkill but works
hard = think_hard("""
Write a Python function that finds the longest palindromic subsequence
in a string. Optimise for O(n²) time and O(n) space. Explain your approach.
""", budget=15000)
```
SECTION 05

When to use it

| Use it for | Skip it for |
| --- | --- |
| Competition-level math problems | Simple factual questions |
| Complex logic puzzles (many constraints) | Creative writing |
| Hard coding challenges (algorithm design) | Summarisation or extraction |
| Multi-step strategic planning | Any task where you need a fast response |
| Debugging subtle errors in code | High-volume API calls (cost too high) |
SECTION 06

Cost & latency trade-offs

```
# Pricing model (approximate, Anthropic Claude)
# Thinking tokens are billed at the same rate as output tokens
# A 10,000-token thinking budget at claude-opus-4-6 pricing (~$15/M output tokens)
#   = up to $0.15 per call just for thinking
# vs. standard call without thinking = $0.01-0.03 per call

# Latency: thinking adds ~5-15s depending on budget
# Not suitable for: real-time chat, high-volume batch jobs
# Best for: offline analysis, one-shot hard problems, low-volume high-stakes decisions

# Rule of thumb:
# budget_tokens <  2,000 → latency ~3-5s    cost ~$0.03
# budget_tokens ~  8,000 → latency ~10-20s  cost ~$0.12
# budget_tokens ~ 20,000 → latency ~30-60s  cost ~$0.30
```
Practical default: Start with budget_tokens=8000 for hard problems. If you're still getting wrong answers, double it; if answers are already correct at a lower budget, halve it. There's no benefit to giving more budget than the problem needs.
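That double-on-failure, halve-while-correct rule can be automated. A minimal sketch, assuming a hypothetical `solve(problem, budget)` wrapper around the API and a caller-supplied `is_correct` checker (both stand-ins, not real library functions):

```python
def tune_budget(solve, is_correct, problem, start=8000,
                min_budget=1024, max_budget=32000):
    """Double the budget until the answer checks out, then halve it
    while a smaller budget still succeeds."""
    budget = start
    while not is_correct(solve(problem, budget)):
        if budget >= max_budget:
            return max_budget  # cap reached; stop growing
        budget = min(budget * 2, max_budget)
    while budget > min_budget and is_correct(solve(problem, budget // 2)):
        budget //= 2
    return budget

# Toy stand-ins: pretend this problem needs at least 4000 thinking tokens
solve = lambda problem, budget: "right" if budget >= 4000 else "wrong"
is_correct = lambda answer: answer == "right"

print(tune_budget(solve, is_correct, "puzzle"))              # halving path
print(tune_budget(solve, is_correct, "puzzle", start=1024))  # doubling path
```

In practice this only works when you have a verifiable check (unit tests, a known answer); each probe is a paid API call, so it suits offline calibration rather than per-request tuning.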
SECTION 07

Measuring and optimizing thinking time

Extended thinking consumes tokens and latency, so cost scales with budget_tokens. A large budget (say 50K tokens) can add tens of seconds of latency per request. For production systems, start with a conservative budget (around 10K), measure the accuracy improvement, and increase only if needed. Log the thinking-token usage reported with each response to understand per-request costs.
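Per-request cost accounting can be a small helper. A sketch, assuming thinking tokens are billed as output tokens (consistent with the pricing notes above) and using the illustrative ~$15/M output rate; `thinking_cost` is a hypothetical function, not part of any SDK:

```python
def thinking_cost(usage: dict, output_price_per_mtok: float = 15.0) -> dict:
    """Estimate per-request spend from the token counts the API reports.
    Thinking tokens are billed as output tokens, so the output count
    already includes them."""
    out = usage.get("output_tokens", 0)
    cost = out * output_price_per_mtok / 1_000_000
    return {"output_tokens": out, "estimated_cost_usd": round(cost, 4)}

# A response that spent ~10K tokens thinking plus a short answer
print(thinking_cost({"output_tokens": 10_500}))
```

Emitting this dict into your structured logs for every request makes it easy to see which call sites dominate thinking spend.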

When extended thinking isn't worth it

Extended thinking shines on hard reasoning tasks: math olympiad problems, code generation with tricky edge cases, multi-step logic puzzles, counterfactual reasoning. For simple classification, summarization, or formatting tasks, a standard call is usually enough. In low-latency scenarios (real-time chat, streaming responses), the overhead may not fit your SLA.

| Task type | Extended thinking benefit | Cost-justifiable? |
| --- | --- | --- |
| Math proof verification | Very high (+20–40% accuracy) | Yes |
| Complex reasoning chains | High (+10–20%) | Yes, for critical tasks |
| Code generation (tricky bugs) | Medium (+5–10%) | Maybe (depends on bug severity) |
| Summarization | Low (minimal improvement) | No |
| Simple classification | Minimal | No |
| Real-time chat | Varies, but latency hit is steep | Usually no |

Extended thinking infrastructure: From the user's perspective, extended thinking is simple: you set a budget and receive a response. Behind the scenes, the model spends tokens on hidden reasoning before producing the visible answer. When the feature is enabled, the reasoning (or a summary of it) is returned as separate thinking content blocks, so you can inspect it alongside the final text. Long thinking processes can improve accuracy on tasks like code review, proof verification, and complex reasoning, but the cost in tokens is real.

Teams using extended thinking report best results when the task has a clear, verifiable ground truth. Math proofs, code correctness, and logical consistency are natural fits. For subjective tasks (writing, creative ideation), extended thinking helps less because there's no "correct" answer to deliberate toward. Hybrid approaches—extended thinking for validation, regular thinking for generation—are emerging as best practice in production systems.
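The hybrid pattern is easy to express in code. A structural sketch only: `generate` and `validate` below are local stubs standing in for a cheap standard call and a budgeted extended-thinking call, respectively:

```python
def generate(problem: str) -> str:
    """Cheap draft pass; stub for a standard API call without thinking."""
    return f"draft answer to: {problem}"

def validate(draft: str) -> bool:
    """Careful check pass; stub for a call with a thinking budget."""
    return "answer" in draft

def solve_hybrid(problem: str, retries: int = 2) -> str:
    """Generate cheaply, validate carefully, regenerate on failure."""
    draft = generate(problem)
    for _ in range(retries):
        if validate(draft):
            return draft
        draft = generate(problem)
    return draft  # best effort after retries

print(solve_hybrid("schedule these 5 tasks"))
```

The economics work because most drafts pass validation: you pay the thinking premium on every request, but only for the (shorter) checking step, and the full deliberation cost only on the rare regeneration.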

Comparing extended thinking to other reasoning approaches: Before extended thinking, teams used chain-of-thought (CoT) prompting: ask the model to explain its reasoning before answering. CoT improves accuracy, but the reasoning is part of the visible output, and an error made mid-explanation propagates straight into the answer. With extended thinking, the deliberation happens internally; you see only the final answer (and, optionally, the thinking blocks). Early results suggest extended thinking outperforms visible CoT by a small margin (2–5%) on math and logic tasks.

Another approach: debate or multi-agent reasoning, where multiple agents take positions and critique each other. This is conceptually interesting but requires many API calls (expensive). Extended thinking achieves similar accuracy improvements with a single model and single API call, making it more practical.

The long-term vision: hybrid reasoning systems that combine extended thinking (for hard problems), fast heuristics (for easy problems), and external tools (for factual lookup). Teams are starting to implement this—detect problem difficulty, route accordingly, and optimize both quality and cost. As extended thinking becomes cheaper and more widely available, this becomes the standard approach.
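A difficulty-based router can start very simply. A sketch under stated assumptions: `estimate_difficulty` is a crude keyword heuristic invented for illustration (a production router might use a small classifier), and `route` returns extra request parameters to merge into an API call:

```python
def estimate_difficulty(task: str) -> str:
    """Crude keyword heuristic, purely illustrative."""
    hard_markers = ("prove", "optimize", "multi-step", "debug", "puzzle")
    return "hard" if any(m in task.lower() for m in hard_markers) else "easy"

def route(task: str) -> dict:
    """Extra request parameters: a thinking budget for hard tasks,
    nothing for easy ones."""
    if estimate_difficulty(task) == "hard":
        return {"thinking": {"type": "enabled", "budget_tokens": 10_000}}
    return {}

print(route("Prove that this algorithm terminates"))
print(route("Summarise this email"))
```

Even a heuristic this crude captures the core idea: easy traffic skips the thinking overhead entirely, so the premium is paid only where it can change the outcome.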

Computational Complexity

Extended thinking requires significantly more inference-time computation than standard chain-of-thought. The model generates detailed internal reasoning tokens that aren't part of the final output, and this overhead translates directly into increased latency and cost. For cost-sensitive applications, you must balance the improved quality against this expense. The relationship between thinking budget and answer quality is roughly logarithmic: initial increases provide substantial improvements, but diminishing returns set in.

Optimizing extended thinking involves careful budget selection. Too small a budget underutilizes the capability; too large wastes resources. Empirical testing on your specific use case is essential for finding the optimal trade-off between performance and cost.
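Empirical testing can be as simple as sweeping budgets over a small evaluation set. A sketch with hypothetical stand-ins: `solve(problem, budget)` wraps the API and `is_correct` scores against known answers:

```python
def sweep_budgets(solve, is_correct, problems,
                  budgets=(1024, 4000, 10000, 20000)):
    """Accuracy at each thinking budget, to locate the knee of the curve."""
    results = {}
    for b in budgets:
        correct = sum(1 for p in problems if is_correct(solve(p, b)))
        results[b] = correct / len(problems)
    return results

# Toy stand-ins: each "problem" is the budget it secretly needs
solve = lambda problem, budget: budget >= problem
is_correct = lambda ok: ok
print(sweep_budgets(solve, is_correct, [2000, 5000, 12000]))
```

Plotting accuracy against budget typically shows the logarithmic shape described above; pick the smallest budget past the knee rather than the one with the best raw score.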

Industry Adoption Trends

Major research institutions and AI labs continue to explore extended reasoning approaches. OpenAI's o1 model family represents the most prominent commercial implementation. Research from Anthropic, OpenAI, and others demonstrates that scaling compute specifically allocated to reasoning improves performance on challenging problems. This trend reflects a shift from attempting to improve performance through larger models to improving it through more computation allocated at inference time.

As extended thinking becomes more prevalent, best practices are emerging. Organizations are discovering which tasks benefit most from this approach—typically reasoning-heavy problems including mathematics, coding, and complex analysis. For simpler tasks like classification or straightforward summarization, extended thinking provides minimal benefit and should be avoided to control costs.

```python
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=8000,
    thinking={
        "type": "enabled",
        "budget_tokens": 5000  # Allocate thinking budget
    },
    messages=[{
        "role": "user",
        "content": "Solve this complex problem..."
    }]
)

# Use response.content for the final answer
```

As reasoning models become more common, tools and frameworks are evolving to support them effectively. Integration with existing ML pipelines requires careful consideration of the extended latency these models introduce. Production systems must account for variable response times depending on problem complexity and allocated thinking budget.