Prompt the model to reason step by step before answering. A single phrase reliably improves accuracy on multi-step math, logic, and planning tasks by 20–40%.
When you ask someone a hard question, they think before they answer. A student solving a geometry problem writes intermediate steps on paper. A doctor works through a differential diagnosis before landing on a conclusion. The act of writing down intermediate thoughts helps humans — and it turns out to help LLMs too.
Without CoT, a model predicts the answer token by token in one pass — it has to "think" and "answer" at the same time. With CoT, it generates a reasoning trace first. Each reasoning token becomes additional context that informs the next one. The model can "see" its own work as it goes.
In 2022, Kojima et al. showed that adding "Let's think step by step" to a prompt improved accuracy on multi-step arithmetic by 40% — with no examples, no fine-tuning, nothing else changed. It's one of the highest-leverage prompt tricks ever discovered.
Common trigger phrases (all roughly equivalent):

- "Let's think step by step."
- "Let's work through this carefully."
- "Think through the problem before answering."
- "First, reason step by step, then give the final answer."
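Attaching a trigger is a one-line change to prompt construction. A minimal sketch of the prompt-assembly side (the function name and the omitted API call are illustrative, not part of any library):

```python
# Zero-shot CoT: append a trigger phrase so the model generates a reasoning
# trace before the final answer. Only prompt assembly is shown; sending the
# prompt to an actual model is out of scope here.

COT_TRIGGER = "Let's think step by step."

def build_zero_shot_cot_prompt(question: str) -> str:
    """Return the question with the CoT trigger appended."""
    return f"{question}\n\n{COT_TRIGGER}"

prompt = build_zero_shot_cot_prompt(
    "A train travels 60 km in 45 minutes. What is its speed in km/h?"
)
print(prompt)
```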
Instead of just prompting for reasoning, show the model a worked example of good reasoning (few-shot CoT). For complex or domain-specific tasks (legal reasoning, medical diagnosis), providing a domain-appropriate reasoning chain is markedly more reliable than zero-shot CoT.
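One way to assemble such a prompt, sketched with a single arithmetic exemplar (the exemplar wording and helper name are illustrative):

```python
# Few-shot CoT: prepend worked examples whose answers end with an explicit
# marker, then leave the new question open for the model to complete.

EXAMPLES = [
    {
        "question": ("Roger has 5 tennis balls. He buys 2 cans of "
                     "3 tennis balls each. How many does he have now?"),
        "reasoning": ("Roger started with 5 balls. 2 cans of 3 balls each "
                      "is 2 * 3 = 6 balls. 5 + 6 = 11."),
        "answer": "11",
    },
]

def build_few_shot_cot_prompt(question: str, examples=EXAMPLES) -> str:
    """Interleave Q/A exemplars with reasoning, ending on an open 'A:'."""
    parts = []
    for ex in examples:
        parts.append(
            f"Q: {ex['question']}\n"
            f"A: {ex['reasoning']} Therefore, the answer is {ex['answer']}.\n"
        )
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

print(build_few_shot_cot_prompt(
    "A farm has 3 coops with 12 hens each. How many hens in total?"
))
```

For domain-specific use, swap the exemplar list for chains written in the target domain's vocabulary.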
| Task type | CoT helps? | Why |
|---|---|---|
| Multi-step arithmetic | ✅ Strongly | Each step narrows error; intermediate values matter |
| Logical deduction | ✅ Strongly | Explicit premise → conclusion trace |
| Code generation | ✅ Helps | "Plan before coding" reduces structural errors |
| Simple factual lookup | ❌ No benefit | "What's the capital of France?" — chain is wasteful |
| Single-word classification | ❌ Can hurt | Model may overthink and change a correct answer |
| Creative writing | ⚠️ Mixed | Planning helps structure; can hurt spontaneity |
Run CoT multiple times with temperature > 0, then take the majority answer. Stacking self-consistency on top of CoT is one of the most reliable prompting strategies available without fine-tuning.
Chain-of-thought (CoT) prompting improves LLM performance on multi-step reasoning tasks by encouraging the model to generate intermediate reasoning steps before producing a final answer. The explicit reasoning trace allows the model to decompose complex problems into manageable sub-steps, dramatically improving accuracy on arithmetic, logical inference, and commonsense reasoning tasks.
| Variant | How Triggered | Control | Token Cost | Best For |
|---|---|---|---|---|
| Zero-shot CoT | "Think step by step" | Minimal | Moderate | Quick deployment |
| Few-shot CoT | Example reasoning traces | High | Moderate + examples | Domain-specific tasks |
| Auto-CoT | Auto-generated examples | Medium | High (generation) | Large task diversity |
| Self-consistency | Multiple CoT paths sampled | High | Very high (N× samples) | High-stakes answers |
| Program-aided | Code generation + execution | High | Moderate + execution | Math, structured tasks |
Self-consistency CoT generates multiple independent reasoning chains for the same problem (typically 10–40 samples) and selects the answer by majority vote across the chains. This approach leverages the insight that while any single reasoning path may contain errors, the correct answer tends to appear more frequently than any specific incorrect answer when sampling multiple paths. Self-consistency reliably improves accuracy on tasks with verifiable correct answers, at an inference cost that grows linearly with the number of sampled paths.
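The voting logic itself is small. A sketch with a stubbed sampler standing in for repeated temperature > 0 completions (the stub and function names are illustrative):

```python
import itertools
from collections import Counter

def self_consistency(sample_fn, question: str, n: int = 10) -> str:
    """Draw n independent final answers and return the majority vote."""
    answers = [sample_fn(question) for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Stub sampler: a real one would run a full CoT completion at temperature > 0
# and extract the final answer from each chain.
_fake_answers = itertools.cycle(["42", "42", "41", "42", "42"])
result = self_consistency(lambda q: next(_fake_answers), "dummy question", n=10)
print(result)  # prints 42 (8 of 10 sampled chains agree)
```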
Program-aided language models (PAL) combine chain-of-thought reasoning with code execution. Instead of performing arithmetic or logical operations in the token stream — where LLMs are error-prone — the model is prompted to write executable code that computes the answer. The code is executed by a Python interpreter and the result is returned as the final answer. This approach sidesteps the fundamental limitation of LLMs performing multi-step arithmetic in their forward pass, achieving near-perfect accuracy on math word problems that confound standard CoT approaches.
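The PAL loop reduces to "execute the model's code and read off a variable." A self-contained sketch in which the model's output is hard-coded rather than fetched from an API (the `answer` variable convention and helper name are assumptions for illustration):

```python
# PAL sketch: the model is prompted to emit Python that assigns its result
# to `answer`; the host executes it. The generated code below is hard-coded
# to keep the example self-contained.

model_generated_code = """
# Olivia has $23. She buys 5 bagels at $3 each. How much money is left?
money = 23
bagels = 5
price_each = 3
answer = money - bagels * price_each
"""

def run_pal(code: str):
    """Execute model-written code and return its `answer` variable."""
    namespace = {}
    # Real deployments must sandbox this exec call; model code is untrusted.
    exec(code, namespace)
    return namespace["answer"]

print(run_pal(model_generated_code))  # prints 8
```

The arithmetic happens in the interpreter, not the token stream, which is the whole point of the approach.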
Effective chain-of-thought prompts share several design principles that maximize the quality of the reasoning trace. Reasoning steps should be concrete and verifiable rather than abstract — "multiply 347 by 6 to get 2082" is better than "perform the multiplication." The final answer should be clearly delineated from the reasoning using explicit markers like "Therefore, the answer is..." or "Final Answer:" so that answer extraction from the reasoning trace is unambiguous.
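With explicit markers, answer extraction reduces to a pattern match. A sketch using the two marker strings mentioned above (the regex and helper are illustrative):

```python
import re

# Match either answer marker and capture the rest of that line,
# dropping a trailing period if present.
ANSWER_RE = re.compile(
    r"(?:Therefore, the answer is|Final Answer:)\s*(.+?)\.?\s*$",
    re.MULTILINE,
)

def extract_answer(trace: str):
    """Return the text after the answer marker, or None if no marker found."""
    match = ANSWER_RE.search(trace)
    return match.group(1).strip() if match else None

trace = "347 * 6 = 2082.\nTherefore, the answer is 2082."
print(extract_answer(trace))  # prints 2082
```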
Few-shot CoT example selection matters significantly for downstream quality. Examples that share structural similarity with the target task — same number of reasoning steps, same type of logical operations — produce better reasoning chains than generic examples chosen for variety. For domain-specific tasks, examples drawn from the target domain outperform general reasoning examples even when the general examples are of higher quality, suggesting that the model uses the examples to calibrate its reasoning style and domain vocabulary as much as to learn the reasoning format itself.
Negative examples — demonstrations of incorrect reasoning that the model should avoid — can be incorporated into few-shot CoT prompts to explicitly teach the model what constitutes a reasoning error. This is particularly useful for tasks where models have a known systematic failure mode, such as ignoring units in physics problems or confusing necessary and sufficient conditions in logical reasoning tasks.
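One way to embed a negative demonstration, here targeting the ignored-units failure mode mentioned above (the labels and wording are illustrative, not a standard format):

```python
# A few-shot block pairing a flawed chain, explicitly labeled as wrong,
# with its correction, to steer the model away from a known failure mode.

NEGATIVE_EXEMPLAR = (
    "Q: A car travels 90 km in 90 minutes. What is its speed in km/h?\n"
    "Incorrect reasoning (avoid this): 90 / 90 = 1, so the speed is 1 km/h. "
    "This ignores that 90 minutes is 1.5 hours.\n"
    "Correct reasoning: 90 minutes = 1.5 hours. 90 / 1.5 = 60. "
    "Therefore, the answer is 60 km/h.\n"
)

def with_negative_exemplar(question: str) -> str:
    """Prepend the contrastive exemplar to a new question."""
    return f"{NEGATIVE_EXEMPLAR}\nQ: {question}\nA:"

print(with_negative_exemplar(
    "A cyclist covers 30 km in 40 minutes. What is the speed in km/h?"
))
```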
Verification steps improve chain-of-thought reliability by prompting the model to explicitly check its work after producing an initial answer. A two-step prompt structure — first generate a solution with reasoning, then prompt the model to verify whether the solution is correct and identify any errors — catches a significant fraction of errors that slipped through the initial reasoning chain. The verification step works because generating a critique of an answer activates different reasoning patterns than generating the answer itself, making the model sensitive to errors it would overlook when reading its own reasoning as context for the next token.
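The two-step structure can be wired up as two templated calls. In this sketch, `ask` is a hypothetical stand-in for any chat-completion function, and the template wording is illustrative:

```python
# Generate-then-verify: one call drafts a solution, a second call critiques
# it. `ask` is a placeholder for a real model call.

SOLVE_TEMPLATE = (
    "Solve the problem below. Show your reasoning, then end with a line "
    "starting 'Final Answer:'.\n\nProblem: {problem}"
)

VERIFY_TEMPLATE = (
    "Problem: {problem}\n\nProposed solution:\n{solution}\n\n"
    "Check each step of the proposed solution. If you find an error, "
    "correct it and give a revised 'Final Answer:' line; otherwise "
    "restate the original 'Final Answer:' line."
)

def solve_and_verify(ask, problem: str) -> str:
    """Draft a solution, then pass it back to the model for checking."""
    draft = ask(SOLVE_TEMPLATE.format(problem=problem))
    return ask(VERIFY_TEMPLATE.format(problem=problem, solution=draft))

# Demo with a canned responder in place of a real model:
def fake_ask(prompt: str) -> str:
    return "3 + 4 = 7.\nFinal Answer: 7"

print(solve_and_verify(fake_ask, "What is 3 + 4?"))
```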