Self-Consistency

The idea in one sentence
Why it works
How to implement it
Choosing N and temperature
When it helps most
Cost vs quality curve

SECTION 01

The idea in one sentence

Ask the same question multiple times with slight randomness, and take the most common answer. If 7 out of 10 independent reasoning chains arrive at the same conclusion, that conclusion is probably right.

It's the same principle as getting opinions from several doctors before major surgery, or using an ensemble of models in machine learning: diverse reasoning paths plus a majority vote give a more reliable answer.

SECTION 02

Why it works

LLMs are probabilistic. At temperature > 0, the same prompt produces different outputs each time. Some of those reasoning paths are wrong. But wrong paths tend to go wrong in different ways, so their answers scatter. If the correct answer is reachable via multiple valid reasoning chains, it tends to show up more often in a large sample than any individual wrong answer.

Example: "What is 15% of 340?" Run 7 times (temperature 0.7):

Path 1: 15/100 × 340 = 51 → Answer: 51 ✓
Path 2: 0.15 × 300 = 45, 0.15 × 40 = 6, 45 + 6 = 51 → Answer: 51 ✓
Path 3: 15 × 34 / 10 = 51 → Answer: 51 ✓
Path 4: 34 × 0.15 = 5.1 ← used 34 instead of 340 → Answer: 5.1 ✗
Path 5: 340 × 0.15 = 51 → Answer: 51 ✓
Path 6: 10% = 34, 5% = 17, total = 51 → Answer: 51 ✓
Path 7: 340/100 × 15 = 51 → Answer: 51 ✓

Majority: 51 (6 out of 7). The wrong path is outvoted.
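The tally at the end of that example is just a frequency count. A minimal sketch of the vote, with the seven final answers typed in by hand rather than produced by a model:

```python
from collections import Counter

# Final answers from the seven sampled paths in the example above.
path_answers = ["51", "51", "51", "5.1", "51", "51", "51"]

counts = Counter(path_answers)
winner, votes = counts.most_common(1)[0]
agreement = votes / len(path_answers)

print(f"Majority: {winner} ({votes} out of {len(path_answers)}, {agreement:.0%})")
# Majority: 51 (6 out of 7, 86%)
```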

SECTION 03

How to implement it

import re
from collections import Counter

import anthropic

client = anthropic.Anthropic()


def self_consistent_answer(
    question: str,
    n: int = 7,
    temperature: float = 0.7
) -> dict:
    """
    Run N CoT paths, extract final answers, return majority vote.
    """
    prompt = f"""{question}

Think step by step. At the very end, write exactly:
ANSWER: [your final answer only, no explanation]"""

    raw_answers = []
    for _ in range(n):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=600,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}]
        ).content[0].text

        # Extract the ANSWER line. Take the last match in case the model
        # writes "ANSWER:" somewhere earlier in its reasoning.
        matches = re.findall(r'ANSWER:\s*(.+)', response, re.IGNORECASE)
        if matches:
            raw_answers.append(matches[-1].strip())

    # Every sample failed to produce a parseable ANSWER line
    if not raw_answers:
        return {"answer": None, "confidence": 0.0, "distribution": {}, "n_valid": 0}

    counts = Counter(raw_answers)
    winner, count = counts.most_common(1)[0]
    confidence = count / len(raw_answers)

    return {
        "answer": winner,
        "confidence": confidence,  # 1.0 = unanimous, 0.14 = only 1 in 7
        "distribution": dict(counts),
        "n_valid": len(raw_answers)
    }


result = self_consistent_answer(
    "A car travels 60 km in 45 minutes. What is its speed in km/h?",
    n=7
)
print(f"Answer: {result['answer']} (confidence: {result['confidence']:.0%})")
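One thing to watch: the function above votes on raw strings, so "80", "80.0" and "80 km/h" would count as three different answers and split the vote. A rough sketch of a normalization step you could apply before counting. The helper name `normalize_answer` is made up for this example, and it only suits questions whose answer is a single number or a short label:

```python
import re
from collections import Counter


def normalize_answer(raw: str) -> str:
    """Map equivalent answer strings to one canonical voting key (hypothetical helper)."""
    text = raw.strip().lower().rstrip(".")
    # If the answer contains a number, vote on the first number only, so that
    # "80", "80.0" and "80 km/h" all count as the same answer.
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    if match:
        return f"{float(match.group()):g}"
    return text


raw_answers = ["80 km/h", "80", "80.0 km/h", "45", "80 km/h."]
counts = Counter(normalize_answer(a) for a in raw_answers)
print(counts.most_common())
# [('80', 4), ('45', 1)]
```

Keeping only the first number is a deliberate simplification. Answers like "3/4" or "between 5 and 10" need task-specific parsing instead.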

SECTION 04

Choosing N and temperature

Parameter	Recommendation	Reasoning
N (samples)	5 to 7 is the sweet spot	Returns diminish beyond 7. At N=5 the winner needs 3/5 agreement, which is unlikely to happen by chance when wrong answers scatter.
Temperature	0.5–0.8	Too low (0.1) → nearly identical paths, no diversity. Too high (1.0 and above) → more paths derail and the vote gets noisy.
Confidence threshold	> 0.6 for high stakes	If the majority answer only has 3/7 (about 0.43), consider escalating or flagging for human review.
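The confidence threshold in the table can be turned into a small routing rule. This is a sketch of one possible policy, not a fixed recipe: the function name `route_answer` and the three outcome labels are invented here, and the result dict is assumed to have the `answer` and `confidence` keys returned by `self_consistent_answer`.

```python
def route_answer(result: dict, threshold: float = 0.6) -> str:
    """Decide what to do with a self-consistency result (illustrative policy)."""
    if result["answer"] is None:
        return "retry"            # no parseable answers at all
    if result["confidence"] > threshold:
        return "accept"
    return "human_review"         # weak majority: escalate rather than trust it


strong = {"answer": "51", "confidence": 6 / 7}
weak = {"answer": "51", "confidence": 3 / 7}
print(route_answer(strong), route_answer(weak))
# accept human_review
```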

SECTION 05

When it helps most

Math and arithmetic: Works extremely well. Isolated calculation errors get outvoted across paths.
Logical deduction: Multiple paths to the same conclusion increase confidence.
Factual questions with objective answers: Strong signal from majority.
Classification: Good for borderline cases where the model is uncertain.

When it doesn't help:

Open-ended generation (essays, code): No single "right" answer to vote on.
Tasks where all paths make the same systematic error: If the model is consistently wrong (e.g., wrong knowledge), majority vote just confirms the wrong answer more confidently.
Questions where you can verify the answer directly: Just check the output instead.

SECTION 06

Cost vs quality curve

# Self-consistency multiplies cost by N
# Example pricing (GPT-4o): ~$5/M input tokens, ~$15/M output tokens
# For a question with 200-token input + 400-token CoT output:
#   Single call: (200 × 5 + 400 × 15) / 1M = $0.007 per question
#   5× self-consistency: $0.035 per question
#   10× self-consistency: $0.07 per question

# Break-even analysis:
# Self-consistency improves accuracy by ~10-15% on reasoning tasks
# Is +10-15% accuracy worth 5× the cost?
#   → For high-stakes decisions (medical triage, financial analysis): YES
#   → For bulk low-stakes classification: NO

# Practical alternative: use self-consistency only when confidence is low
question = "A car travels 60 km in 45 minutes. What is its speed in km/h?"
result = self_consistent_answer(question, n=3)  # Cheap first pass
if result["confidence"] < 0.7:  # Run more samples only if uncertain
    result = self_consistent_answer(question, n=10)  # Expensive follow-up

Cheap version: Run 3 samples first. If 3/3 agree, you're done. Only run 7+ samples when there's disagreement. This keeps average cost low while still catching unreliable cases.
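To see what the cheap version saves, you can put the arithmetic from this section into a couple of functions. A minimal sketch: the prices are the example figures used above, and `p_unanimous` (the share of questions where the first 3 samples all agree) is something you would have to measure yourself. The 0.8 below is a placeholder, not a benchmark result.

```python
def cost_per_question(
    input_tokens: int,
    output_tokens: int,
    n: int = 1,
    input_price_per_m: float = 5.0,    # USD per million input tokens (example price)
    output_price_per_m: float = 15.0,  # USD per million output tokens (example price)
) -> float:
    """Cost in USD of sampling n answers for one question."""
    single = (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000
    return n * single


def adaptive_cost(
    input_tokens: int,
    output_tokens: int,
    p_unanimous: float,
    n_first: int = 3,
    n_followup: int = 10,
) -> float:
    """Expected cost when the follow-up pass runs only after a non-unanimous first pass."""
    first = cost_per_question(input_tokens, output_tokens, n_first)
    followup = cost_per_question(input_tokens, output_tokens, n_followup)
    return first + (1 - p_unanimous) * followup


print(f"single call:   ${cost_per_question(200, 400):.4f}")
print(f"always N=10:   ${cost_per_question(200, 400, n=10):.4f}")
print(f"adaptive 3+10: ${adaptive_cost(200, 400, p_unanimous=0.8):.4f}")
```

Under these assumptions the adaptive scheme averages about $0.035 per question, half the cost of always running 10 samples.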

Computational Cost Considerations

Self-consistency requires N full generations from the model, multiplying your inference cost by a factor of N. For large models, this cost becomes substantial. Accuracy gains typically show diminishing returns somewhere between N=5 and N=10, making this range the practical sweet spot for most applications. Understanding this trade-off is essential when deciding whether self-consistency fits your use case and budget.

Cost optimization involves several strategies. Tuning temperature so that samples are diverse without degrading in quality means fewer samples are wasted. Early stopping, which ends sampling once the leading answer can no longer be overtaken, reduces the number of calls. For production systems, caching and batching multiple self-consistency jobs together can reduce per-query overhead.
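Early stopping is easy to get subtly wrong, so here is one way it could look. This is a sketch, not the canonical method: samples are drawn one at a time, and sampling stops as soon as the leader has more votes than the runner-up could reach even if it won every remaining sample. `sample_answer` is a placeholder for whatever function makes one model call and returns the extracted answer.

```python
from collections import Counter
from typing import Callable


def sample_until_decided(sample_answer: Callable[[], str], max_n: int = 10) -> dict:
    """Draw answers one at a time and stop once the leader cannot be overtaken."""
    counts: Counter = Counter()
    drawn = 0
    for drawn in range(1, max_n + 1):
        counts[sample_answer()] += 1
        ranked = counts.most_common(2)
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = max_n - drawn
        # Even if every remaining sample went to the runner-up (or to a
        # brand-new answer), it still could not catch the leader.
        if lead > runner_up + remaining:
            break
    winner, votes = counts.most_common(1)[0]
    return {"answer": winner, "votes": votes, "samples_used": drawn}


# Stand-in sampler that replays a fixed sequence instead of calling a model.
fake_answers = iter(["80", "80", "75", "80", "80", "80", "80", "80", "80", "80"])
result = sample_until_decided(lambda: next(fake_answers), max_n=10)
print(result)
# {'answer': '80', 'votes': 6, 'samples_used': 7}
```

The stopping rule never changes which answer wins compared with running all `max_n` samples. It only skips calls whose outcome can no longer matter.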

N Paths	Relative Cost	Typical Quality Gain	Best For
1	1x	Baseline	Speed-critical tasks
3	3x	+5-10%	Balanced approach
5	5x	+8-15%	Most applications
10	10x	+10-18%	Quality-critical

Organizations implementing self-consistency benefit from careful measurement and experimentation. Different models respond differently to this technique, and different domains see varying amounts of improvement. Thorough benchmarking on representative examples from your specific domain is essential before deploying self-consistency in production.
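A benchmark for this does not need much machinery: run the same labeled questions at several values of N and compare accuracy. The sketch below uses a toy sampler that is right 70% of the time in place of a real model. The names `Sampler`, `majority_answer`, `accuracy` and `noisy_sampler` are all invented for the example, and the 70% figure is arbitrary.

```python
import random
from collections import Counter
from typing import Callable

Sampler = Callable[[str], str]  # takes a question, returns one sampled answer


def majority_answer(sampler: Sampler, question: str, n: int) -> str:
    """Sample n answers for one question and return the most common one."""
    votes = Counter(sampler(question) for _ in range(n))
    return votes.most_common(1)[0][0]


def accuracy(sampler: Sampler, dataset: list[tuple[str, str]], n: int) -> float:
    """Fraction of (question, gold answer) pairs where the majority answer is correct."""
    correct = sum(majority_answer(sampler, q, n) == gold for q, gold in dataset)
    return correct / len(dataset)


# Toy stand-in for a model: right 70% of the time, otherwise one of two wrong answers.
rng = random.Random(0)
GOLD = {f"q{i}": str(i) for i in range(200)}


def noisy_sampler(question: str) -> str:
    if rng.random() < 0.7:
        return GOLD[question]
    return rng.choice(["wrong-a", "wrong-b"])


dataset = list(GOLD.items())
for n in (1, 3, 5, 10):
    print(f"N={n:>2}: accuracy {accuracy(noisy_sampler, dataset, n):.2f}")
```

To benchmark a real model, swap `noisy_sampler` for a function that makes one API call and returns the extracted answer, and replace `dataset` with representative examples from your domain.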

The effectiveness of self-consistency depends significantly on model capability. The model has to reach the correct answer more often than any single wrong answer; a model that is usually wrong in the same way gains nothing from voting. The technique works particularly well for reasoning tasks where multiple valid solution paths exist. For classification tasks where the model is already confident, self-consistency provides minimal benefit; the gains come from borderline cases. Practitioners should measure improvements empirically on their specific tasks rather than assuming uniform gains across all domains.

Research into self-consistency has explored numerous variations and improvements. Using embeddings to cluster similar solutions before voting and employing confidence scores for weighted voting both show promise, and the technique is normally paired with chain-of-thought prompting, as in the implementation above. The core insight, that multiple independent reasoning attempts can improve accuracy, has proven robust across many applications and model families.
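Weighted voting is a small change to the plain majority vote: each sample adds its confidence score to its answer's total instead of adding 1. A hedged sketch follows. Where the scores come from (a self-reported rating, a log-probability, a verifier model) is left open, and the numbers below are made up for illustration.

```python
from collections import defaultdict


def weighted_vote(samples: list[tuple[str, float]]) -> tuple[str, float]:
    """Return the answer with the largest total weight and its share of all weight."""
    totals: dict[str, float] = defaultdict(float)
    for answer, weight in samples:
        totals[answer] += weight
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())


# (answer, confidence score) pairs; the scores are invented for this example.
samples = [("51", 0.9), ("51", 0.8), ("5.1", 0.3), ("51", 0.7), ("5.1", 0.4)]
winner, share = weighted_vote(samples)
print(winner, round(share, 2))
# 51 0.77
```

With all weights set to 1.0 this reduces to the ordinary majority vote. The sketch assumes at least one sample has a positive weight.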

Implementation considerations include caching and parallelization strategies. The N samples do not depend on each other, so they can be requested concurrently; wall-clock latency then approaches that of the slowest single call instead of the sum of all N. Batching multiple self-consistency queries together reduces per-query overhead. For latency-sensitive applications, understanding the computation-quality trade-off is essential for deploying self-consistency effectively in production systems.
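Since the samples are independent, a thread pool is one simple way to run them side by side. The sketch below uses a fake model call that just sleeps, so it runs without an API key. `parallel_vote` and `fake_model_call` are illustrative names. In real use, check your provider's rate limits before raising `max_workers`.

```python
import time
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def parallel_vote(sample_answer: Callable[[int], str], n: int = 5, max_workers: int = 5) -> dict:
    """Run n independent sampling calls concurrently and majority-vote the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(sample_answer, range(n)))
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return {"answer": winner, "confidence": votes / n, "distribution": dict(counts)}


def fake_model_call(i: int) -> str:
    """Stand-in for one API request: sleeps briefly, then returns a canned answer."""
    time.sleep(0.1)
    return "80" if i != 2 else "75"


start = time.perf_counter()
result = parallel_vote(fake_model_call, n=5)
elapsed = time.perf_counter() - start
print(result, f"{elapsed:.2f}s")
```

Run sequentially, the five fake calls would take about half a second. With five workers the total should be close to the duration of a single call.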

The mathematical intuition behind self-consistency involves ensemble methods and diversity in predictions. When diverse models or diverse samples from the same model provide complementary strengths, aggregating their outputs improves performance. Self-consistency captures this principle through controlled randomness in sampling, making it a practical implementation of ensemble learning at inference time.
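The ensemble intuition can be made concrete with a simplified model. Assume each sample is independently correct with probability p, and that the vote succeeds when more than half of the samples are correct. Both assumptions are idealizations: samples from one model are correlated, and a plurality can win with fewer than half the votes. So treat the numbers as a rough guide only.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """P(more than half of n independent samples are correct), each correct with probability p."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))


for n in (1, 3, 5, 7):
    print(n, round(majority_correct_probability(0.7, n), 3))
# 1 0.7
# 3 0.784
# 5 0.837
# 7 0.874
```

The gains shrink with each step (+0.084, +0.053, +0.037), which matches the diminishing-returns pattern described earlier. If p is below 0.5 in this model, adding samples makes the majority less likely to be right, not more.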

Real-world deployments of self-consistency must balance quality improvements against computational costs. For high-volume applications processing thousands of queries daily, the 5-10x cost multiplier becomes significant. Understanding the diminishing returns curve for your specific task helps determine optimal N values. Smaller N values like 3 often capture a large share of the benefit at significantly lower cost, making them pragmatic choices for production systems.

The temperature parameter significantly affects diversity in self-consistency. Higher temperatures increase randomness, promoting diverse solutions. Lower temperatures concentrate probability mass on the most likely outputs, reducing diversity. Tuning temperature to balance diversity and quality requires empirical testing. Temperatures that are too high produce poor solutions; temperatures that are too low eliminate the diversity that self-consistency depends on.
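What "concentrating probability mass" means is easiest to see on a tiny example. In the standard formulation, temperature divides the model's scores (logits) before the softmax, so low values sharpen the distribution and high values flatten it. The three logits below are made-up scores for three candidate next tokens, not output from any real model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities after scaling them by 1/temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]   # made-up scores for three candidate next tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.1 nearly all the probability lands on the top-scoring token, so repeated samples look alike. At 1.5 the three options are much closer together, so samples diverge more often.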

Future improvements in self-consistency include learned aggregation mechanisms where neural networks learn optimal voting strategies rather than using fixed majority voting. Combining self-consistency with other ensemble techniques and with retrieval-augmented generation shows promise for further improvements. As systems become more complex, principled approaches to combining multiple signals become increasingly valuable.

Self-consistency has proven valuable across numerous applications, but the gains are not uniform. Measure the improvement on your own domain, and tune N and temperature for your exact use case, rather than assuming universal benefits.
