Give the model a private reasoning space before it commits to an answer. This can substantially improve accuracy on hard math, logic, and multi-step problems.
Standard chain-of-thought asks the model to reason out loud in its response: what you see is what it thought. Extended thinking goes further: the model gets a private scratch space to work through the problem before producing its answer. The reasoning lives in a separate block, apart from the final answer.
Think of it like an exam. Regular CoT is "show your work in the answer." Extended thinking is "take as long as you need in a private rough-work section, then write your final answer." The rough work stays out of the final answer, and you decide whether to keep it or show it to anyone.
When extended thinking is enabled, the model generates a special "thinking" block before its answer. In the Anthropic Messages API used in the example later in this article, that block is returned as its own content block next to the answer text, so it never reaches your users unless your application displays it. This block can contain exploratory reasoning, rejected hypotheses, backtracking, and re-evaluation: things the model wouldn't include in a polished answer.
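Separating the thinking from the answer comes down to filtering content blocks by type. The sketch below assumes the block shapes of the Anthropic Messages API (a `type` of `"thinking"` with a `thinking` field, or `"text"` with a `text` field). The helper name `split_thinking` and the sample content are hypothetical, and `SimpleNamespace` objects stand in for real response blocks so the example runs offline.

```python
from types import SimpleNamespace


def split_thinking(content):
    """Return (thinking_text, answer_text) from a list of content blocks."""
    thinking_parts, answer_parts = [], []
    for block in content:
        if block.type == "thinking":
            thinking_parts.append(block.thinking)
        elif block.type == "text":
            answer_parts.append(block.text)
        # Any other block type is ignored in this sketch.
    return "\n".join(thinking_parts), "".join(answer_parts)


# Stand-in for response.content; real blocks would come from the API.
fake_content = [
    SimpleNamespace(type="thinking", thinking="Try x = 2. Check: 2 + 2 = 4."),
    SimpleNamespace(type="text", text="x = 2"),
]
thinking, answer = split_thinking(fake_content)
print(answer)
```

Logging `thinking` while showing only `answer` keeps the rough work available for debugging without exposing it to end users.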
| Use it for | Skip it for |
|---|---|
| Competition-level math problems | Simple factual questions |
| Complex logic puzzles (many constraints) | Creative writing |
| Hard coding challenges (algorithm design) | Summarisation or extraction |
| Multi-step strategic planning | Any task where you need fast response |
| Debugging subtle errors in code | High-volume API calls (cost too high) |
Extended thinking adds tokens and latency, so cost scales with the thinking budget (`budget_tokens` in the API example later in this article). A request with a 50K-token budget can add 5–20 seconds of latency, and more if the model uses most of that budget. For production systems, start with a conservative budget (10K), measure accuracy improvement, and increase only if needed. Thinking tokens count toward output tokens, so track output token usage in your logs to understand per-request costs.
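A rough upper bound on per-request cost follows from the budget itself. This sketch assumes thinking tokens are billed at the same rate as ordinary output tokens; the price constant is a placeholder, not a real price, and the function name is hypothetical.

```python
def estimate_output_cost_usd(thinking_tokens, answer_tokens, usd_per_million_output_tokens):
    """Cost of one response, assuming thinking is billed as output tokens."""
    total_tokens = thinking_tokens + answer_tokens
    return total_tokens * usd_per_million_output_tokens / 1_000_000


# Placeholder price for illustration only; check your provider's price list.
PLACEHOLDER_PRICE = 15.0

for budget in (10_000, 50_000):
    # Worst case: the model spends the whole budget, then writes a 500-token answer.
    worst_case = estimate_output_cost_usd(budget, 500, PLACEHOLDER_PRICE)
    print(f"budget={budget}: up to ${worst_case:.4f} per request")
```

The budget is a ceiling rather than a fixed spend, so actual costs are usually lower; logging the reported output token count per request shows how far below the ceiling you typically land.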
Extended thinking shines on hard reasoning tasks: math olympiad problems, code generation with tricky edge cases, multi-step logic puzzles, counterfactual reasoning. For simple classification, summarization, or formatting tasks, a standard response without extended thinking is usually enough. In low-latency scenarios (real-time chat, streaming responses), the overhead may not fit your SLA.
| Task Type | Extended Thinking Benefit | Cost-Justifiable? |
|---|---|---|
| Math proof verification | Very high (+20–40% accuracy) | Yes |
| Complex reasoning chains | High (+10–20%) | Yes, for critical tasks |
| Code generation (tricky bugs) | Medium (+5–10%) | Maybe (depends on bug severity) |
| Summarization | Low (minimal improvement) | No |
| Simple classification | Minimal | No |
| Real-time chat | Varies, but latency hit is steep | Usually no |
Extended thinking infrastructure: From the caller's perspective, extended thinking is straightforward: you set a budget and receive a response. Behind the scenes, the model divides its output tokens between the hidden thinking and the visible response. Thinking is off by default; once you enable it through the `thinking` request parameter, the response carries the reasoning, or on some models a summary of it, in a separate block. Long thinking processes can improve accuracy on tasks like code review, proof verification, and complex reasoning, but the cost in tokens is real.
Teams using extended thinking report best results when the task has a clear, verifiable ground truth. Math proofs, code correctness, and logical consistency are natural fits. For subjective tasks (writing, creative ideation), extended thinking helps less because there's no "correct" answer to deliberate toward. Hybrid approaches, which use extended thinking for validation and standard responses for generation, are emerging as best practice in production systems.
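The hybrid pattern can be sketched as a loop that drafts cheaply and spends the thinking budget only on checking. Everything here is an assumption for illustration: `generate` would wrap a standard model call and `validate` a call with extended thinking enabled, but both are passed in as plain callables (stubbed below) so the control flow is runnable without an API key.

```python
from typing import Callable, Optional


def generate_then_validate(
    task: str,
    generate: Callable[[str], str],
    validate: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Draft with a fast call, accept the first draft the validator approves."""
    for _ in range(max_attempts):
        draft = generate(task)
        if validate(task, draft):
            return draft
    return None  # No draft passed validation within the attempt limit.


# Stubs standing in for model calls: the first draft is wrong, the second is right.
drafts = iter(["def add(a, b): return a - b", "def add(a, b): return a + b"])
result = generate_then_validate(
    "write add()",
    generate=lambda task: next(drafts),
    validate=lambda task, draft: "a + b" in draft,
)
print(result)
```

This fits tasks with verifiable ground truth, because the validator has something concrete to check the draft against.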
Comparing extended thinking to other reasoning approaches: Before extended thinking, teams used chain-of-thought (CoT) prompting: ask the model to explain its reasoning before answering. CoT improves accuracy, but the reasoning is part of the visible answer, and an error buried in a long explanation is easy to miss. Extended thinking moves the deliberation into a separate block, so the final answer stays clean. Early results suggest extended thinking outperforms visible CoT by a small margin (2–5%) on math and logic tasks.
Another approach is debate or multi-agent reasoning, where multiple agents take positions and critique each other. This is conceptually interesting but requires many API calls, which makes it expensive. Extended thinking can deliver comparable accuracy improvements with a single model and a single API call, making it more practical.
The long-term vision: hybrid reasoning systems that combine extended thinking (for hard problems), fast heuristics (for easy problems), and external tools (for factual lookup). Teams are starting to implement this: detect problem difficulty, route accordingly, and optimize both quality and cost. As extended thinking becomes cheaper and more widely available, this is likely to become the standard approach.
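Difficulty-based routing can start as simply as deciding, per request, whether to attach a thinking configuration. The keyword list and word-count threshold below are hypothetical heuristics, not a recommended classifier; the returned dictionary mirrors the `thinking` parameter shape used in this article's API example.

```python
# Hypothetical signals that a prompt needs careful reasoning.
HARD_HINTS = ("prove", "derive", "optimal", "debug", "constraint")


def thinking_config_for(prompt, budget_tokens=10_000):
    """Return a thinking config for hard-looking prompts, else None."""
    text = prompt.lower()
    looks_hard = len(text.split()) > 150 or any(hint in text for hint in HARD_HINTS)
    if looks_hard:
        return {"type": "enabled", "budget_tokens": budget_tokens}
    return None


def request_kwargs(prompt):
    """Build keyword arguments for a messages call, adding thinking only when routed."""
    kwargs = {"messages": [{"role": "user", "content": prompt}]}
    config = thinking_config_for(prompt)
    if config is not None:
        kwargs["thinking"] = config
    return kwargs


print(request_kwargs("Prove that the square root of 2 is irrational"))
```

In practice the heuristic would be replaced by something measured against your own traffic, such as a small classifier or a cheap first-pass model call.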
Extended thinking requires significantly more inference-time computation than standard chain-of-thought. The model generates detailed internal reasoning tokens that aren't part of the final output. This computational overhead translates directly to increased latency and cost. For cost-sensitive applications, you must balance the improved quality against this computational expense. The relationship between thinking budget and final answer quality is roughly logarithmic: initial increases provide substantial improvements, but diminishing returns set in.
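One way to make the logarithmic relationship concrete is the following illustrative model. It is a simplification, not a measured law: $B$ is the thinking budget in tokens, and $a$ and $k$ are task-dependent constants you would have to fit from your own evaluations.

```latex
\[
  \mathrm{acc}(B) \approx a + k \ln B,
  \qquad
  \mathrm{acc}(2B) - \mathrm{acc}(B) \approx k \ln 2 .
\]
```

Under this model, every doubling of the budget buys the same absolute accuracy gain $k \ln 2$ while the worst-case token cost doubles, which is why the later doublings are the hardest to justify.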
Optimizing extended thinking involves careful budget selection. Too small a budget underutilizes the capability; too large wastes resources. Empirical testing on your specific use case is essential for finding the optimal trade-off between performance and cost.
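Once you have run your evaluation set at several budgets, budget selection can be reduced to picking the smallest budget whose accuracy is close enough to the best one. The function name, the tolerance, and the accuracy figures below are all made up for illustration.

```python
def pick_budget(accuracy_by_budget, tolerance=0.01):
    """Return the smallest budget whose accuracy is within `tolerance` of the best."""
    if not accuracy_by_budget:
        raise ValueError("need at least one (budget, accuracy) measurement")
    best = max(accuracy_by_budget.values())
    for budget in sorted(accuracy_by_budget):
        if accuracy_by_budget[budget] >= best - tolerance:
            return budget


# Made-up sweep results: thinking budget in tokens -> accuracy on an eval set.
sweep = {2_000: 0.61, 5_000: 0.72, 10_000: 0.785, 20_000: 0.788, 50_000: 0.79}
chosen = pick_budget(sweep, tolerance=0.01)
print(chosen)
```

A wider tolerance trades a little accuracy for a much cheaper budget; setting it to zero always picks the most accurate budget regardless of cost.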
Major research institutions and AI labs continue to explore extended reasoning approaches. OpenAI's o1 model family is one of the most prominent commercial implementations. Research from Anthropic, OpenAI, and others demonstrates that scaling compute specifically allocated to reasoning improves performance on challenging problems. This trend reflects a shift from improving performance through larger models to improving it through more computation at inference time.
As extended thinking becomes more prevalent, best practices are emerging. Organizations are discovering which tasks benefit most from this approach—typically reasoning-heavy problems including mathematics, coding, and complex analysis. For simpler tasks like classification or straightforward summarization, extended thinking provides minimal benefit and should be avoided to control costs.
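```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=8000,  # must be larger than budget_tokens
    thinking={
        "type": "enabled",
        "budget_tokens": 5000,  # Allocate thinking budget
    },
    messages=[{
        "role": "user",
        "content": "Solve this complex problem...",
    }],
)

# response.content is a list of blocks: thinking blocks first, then text.
# Join the text blocks to get the final answer.
final_answer = "".join(
    block.text for block in response.content if block.type == "text"
)
print(final_answer)
```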
As reasoning models become more common, tools and frameworks are evolving to support them effectively. Integration with existing ML pipelines requires careful consideration of the extended latency these models introduce. Production systems must account for variable response times depending on problem complexity and allocated thinking budget.
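To plan for variable response times you first need to measure them per budget. The sketch below is a minimal, hypothetical `LatencyLog` helper that times any callable and groups the samples by thinking budget; the lambda stands in for a real model call.

```python
import time
from collections import defaultdict
from statistics import median


class LatencyLog:
    """Collects wall-clock latencies grouped by thinking budget."""

    def __init__(self):
        self._samples = defaultdict(list)

    def timed(self, budget_tokens, fn, *args, **kwargs):
        """Call fn, record how long it took under budget_tokens, return its result."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Recorded even if fn raises, so failures still count toward latency.
            self._samples[budget_tokens].append(time.perf_counter() - start)

    def median_seconds(self, budget_tokens):
        """Median latency for a budget; raises StatisticsError if nothing was recorded."""
        return median(self._samples[budget_tokens])


log = LatencyLog()
reply = log.timed(5_000, lambda: "stubbed model reply")  # stand-in for an API call
print(log.median_seconds(5_000))
```

Comparing medians (and, with more samples, tail percentiles) across budgets shows which budgets fit a given timeout or SLA.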