Advanced Reasoning

Chain-of-Thought Prompting

Prompt the model to reason step by step before answering. A single phrase reliably improves accuracy on multi-step math, logic, and planning tasks by 20–40%.

20–40%
Accuracy boost
"Think step by step"
Trigger phrase
Free
No extra cost

Table of Contents

SECTION 01

Why reasoning out loud helps

When you ask someone a hard question, they think before they answer. A student solving a geometry problem writes intermediate steps on paper. A doctor works through a differential diagnosis before landing on a conclusion. The act of writing down intermediate thoughts helps humans — and it turns out to help LLMs too.

Without CoT, a model predicts the answer token by token in one pass — it has to "think" and "answer" at the same time. With CoT, it generates a reasoning trace first. Each reasoning token becomes additional context that informs the next one. The model can "see" its own work as it goes.

The key insight: LLMs generate tokens left to right. If you force intermediate reasoning steps to be written out, later tokens (the answer) are conditioned on those steps. You're giving the model more "computation" — measured in tokens — to work with.
SECTION 02

Zero-shot CoT: the phrase that changed AI

In 2022, Kojima et al. showed that adding "Let's think step by step" to a prompt improved accuracy on multi-step arithmetic by 40% — with no examples, no fine-tuning, nothing else changed. It's one of the highest-leverage prompt tricks ever discovered.

❌ Without CoT: Q: "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have?" A: 11 (might be wrong for harder variants) ✓ With zero-shot CoT — just add the phrase: Q: "Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. How many does he have? Think step by step." A: Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 balls. Total: 5 + 6 = 11. The answer is 11.

Common trigger phrases (all roughly equivalent):

SECTION 03

Few-shot CoT: show a reasoning chain

Instead of just prompting for reasoning, show the model an example of good reasoning. This is more reliable than zero-shot CoT for complex or domain-specific tasks.

Q: A train leaves London at 9am going 120 km/h. Another leaves Paris (450km away) at 10am going 90 km/h towards London. When do they meet? A: Let me work through this: - Train 1 starts at t=0 (9am), speed 120 km/h - Train 2 starts at t=1 (10am), speed 90 km/h, from 450km away - After Train 2 departs, combined closing speed = 120 + 90 = 210 km/h - At 10am, Train 1 has already covered 120km → gap = 450 - 120 = 330 km - Time to close 330km at 210 km/h = 330/210 ≈ 1.57 hours = 1h34m - They meet at approximately 11:34am. --- Now solve: [NEW PROBLEM] A: [Model now reasons in the same style before answering]

For domain-specific tasks (legal reasoning, medical diagnosis), providing a domain-appropriate reasoning chain is much more reliable than zero-shot CoT.

SECTION 04

When CoT helps (and doesn't)

Task typeCoT helps?Why
Multi-step arithmetic✅ StronglyEach step narrows error; intermediate values matter
Logical deduction✅ StronglyExplicit premise → conclusion trace
Code generation✅ Helps"Plan before coding" reduces structural errors
Simple factual lookup❌ No benefit"What's the capital of France?" — chain is wasteful
Single-word classification❌ Can hurtModel may overthink and change a correct answer
Creative writing⚠️ MixedPlanning helps structure; can hurt spontaneity
When CoT goes wrong: Models can produce confident-looking but wrong reasoning chains — especially on unfamiliar domains. The reasoning looks good; the answer is still wrong. Always validate with known test cases.
SECTION 05

CoT + self-consistency

Run CoT multiple times with temperature > 0, then take the majority answer. This stack (CoT × self-consistency) is one of the most reliable prompting strategies available without fine-tuning.

import anthropic from collections import Counter client = anthropic.Anthropic() def cot_with_consistency(question: str, n_samples: int = 5) -> str: """Generate N chain-of-thought reasoning paths, return majority answer.""" prompt = f"""{question} Think step by step. At the very end, write: FINAL ANSWER: [your answer]""" answers = [] for _ in range(n_samples): response = client.messages.create( model="claude-opus-4-6", max_tokens=500, temperature=0.7, # Diversity needed for voting to help messages=[{"role": "user", "content": prompt}] ).content[0].text # Extract final answer if "FINAL ANSWER:" in response: answer = response.split("FINAL ANSWER:")[-1].strip().split("\n")[0] answers.append(answer) # Majority vote if not answers: return "Could not extract answer" return Counter(answers).most_common(1)[0][0] result = cot_with_consistency( "A store has a 20% off sale. A jacket normally costs $85. " "With an additional 10% loyalty discount on the sale price, " "what is the final price?" ) print(result) # More reliable than single-sample CoT
SECTION 06

Implementation patterns

import anthropic client = anthropic.Anthropic() # Pattern 1: Zero-shot CoT suffix def ask_with_cot(question: str) -> str: return client.messages.create( model="claude-opus-4-6", max_tokens=800, messages=[{"role": "user", "content": f"{question}\n\nThink step by step."}] ).content[0].text # Pattern 2: Separate reasoning from answer (structured CoT) def structured_cot(question: str) -> dict: response = client.messages.create( model="claude-opus-4-6", max_tokens=800, messages=[{"role": "user", "content": f""" {question} First, reason through this carefully in a block. Then give your final answer in an block. [your step-by-step reasoning here] [concise final answer only] """}] ).content[0].text import re thinking = re.search(r'(.*?)', response, re.DOTALL) answer = re.search(r'(.*?)', response, re.DOTALL) return { "reasoning": thinking.group(1).strip() if thinking else "", "answer": answer.group(1).strip() if answer else response } result = structured_cot("If I invest $10,000 at 7% annual return for 20 years, what do I end up with?") print("Answer:", result["answer"]) # print("Reasoning:", result["reasoning"]) # for debugging

Chain-of-Thought Prompting Variants

Chain-of-thought (CoT) prompting improves LLM performance on multi-step reasoning tasks by encouraging the model to generate intermediate reasoning steps before producing a final answer. The explicit reasoning trace allows the model to decompose complex problems into manageable sub-steps, dramatically improving accuracy on arithmetic, logical inference, and commonsense reasoning tasks.

VariantHow TriggeredControlToken CostBest For
Zero-shot CoT"Think step by step"MinimalModerateQuick deployment
Few-shot CoTExample reasoning tracesHighModerate + examplesDomain-specific tasks
Auto-CoTAuto-generated examplesMediumHigh (generation)Large task diversity
Self-consistencyMultiple CoT paths sampledHighVery high (N× samples)High-stakes answers
Program-aidedCode generation + executionHighModerate + executionMath, structured tasks

Self-consistency CoT generates multiple independent reasoning chains for the same problem (typically 10–40 samples) and selects the answer by majority vote across the chains. This approach leverages the insight that while any single reasoning path may contain errors, the correct answer tends to appear more frequently than any specific incorrect answer when sampling multiple paths. Self-consistency reliably improves accuracy on tasks with verifiable correct answers at the cost of a linear increase in inference cost proportional to the number of sampled paths.

Program-aided language models (PAL) combine chain-of-thought reasoning with code execution. Instead of performing arithmetic or logical operations in the token stream — where LLMs are error-prone — the model is prompted to write executable code that computes the answer. The code is executed by a Python interpreter and the result is returned as the final answer. This approach sidesteps the fundamental limitation of LLMs performing multi-step arithmetic in their forward pass, achieving near-perfect accuracy on math word problems that confound standard CoT approaches.

CoT Prompting Best Practices

Effective chain-of-thought prompts share several design principles that maximize the quality of the reasoning trace. Reasoning steps should be concrete and verifiable rather than abstract — "multiply 347 by 6 to get 2082" is better than "perform the multiplication." The final answer should be clearly delineated from the reasoning using explicit markers like "Therefore, the answer is..." or "Final Answer:" so that answer extraction from the reasoning trace is unambiguous.

Few-shot CoT example selection matters significantly for downstream quality. Examples that share structural similarity with the target task — same number of reasoning steps, same type of logical operations — produce better reasoning chains than generic examples chosen for variety. For domain-specific tasks, examples drawn from the target domain outperform general reasoning examples even when the general examples are of higher quality, suggesting that the model uses the examples to calibrate its reasoning style and domain vocabulary as much as to learn the reasoning format itself.

Negative examples — demonstrations of incorrect reasoning that the model should avoid — can be incorporated into few-shot CoT prompts to explicitly teach the model what constitutes a reasoning error. This is particularly useful for tasks where models have a known systematic failure mode, such as ignoring units in physics problems or confusing necessary and sufficient conditions in logical reasoning tasks.

Verification steps improve chain-of-thought reliability by prompting the model to explicitly check its work after producing an initial answer. A two-step prompt structure — first generate a solution with reasoning, then prompt the model to verify whether the solution is correct and identify any errors — catches a significant fraction of errors that slipped through the initial reasoning chain. The verification step works because generating a critique of an answer activates different reasoning patterns than generating the answer itself, making the model sensitive to errors it would overlook when reading its own reasoning as context for the next token.