# Self-consistency: sample many, take the majority
Self-consistency runs the same Chain-of-Thought prompt multiple times and picks the most common answer. Learn how it works, when it pays off, and how to keep the cost down.
Run the same Chain-of-Thought prompt five times on a hard math problem. Watch what happens. You'll see five different reasoning traces. Maybe two arrive at 42, one at 38, one at 45, one bails out halfway. Which is right?
Take a vote. The majority wins. That's self-consistency — the simplest reliable accuracy boost you can bolt onto any reasoning prompt. It works because wrong reasoning paths are usually diverse (lots of ways to be wrong) while correct reasoning paths converge (the right answer is usually right for the same reason every time).
Trade tokens for accuracy. On problems where being right matters more than being fast, the trade is almost always favorable.
## The mental model: wisdom of the crowd, in one mind
Ask ten experts the same question independently and average their answers — the average is usually better than any individual expert's answer. This is the "wisdom of the crowd" effect, first formalized by Francis Galton in 1907.
Self-consistency applies the same principle to a single model. By raising the temperature, you surface different reasoning paths the model could take. By voting on final answers, you let the right paths dominate. It's the crowd, but inside one head.
## The mechanism
Three steps:
- Run the same CoT prompt N times at a temperature high enough to get diverse reasoning chains (typically 0.7).
- Parse the final answer from each run.
- Take the most common one. Break ties by confidence, by recency, or just by picking the first.
The intuition: there are usually many wrong reasoning paths that lead to many different wrong answers. There's usually a smaller set of correct reasoning paths that converge on the same correct answer. Voting amplifies the signal.
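Here's the whole loop as a minimal sketch. `complete()` is a hypothetical stand-in for whichever LLM client you use; everything else is standard library.

```python
# Self-consistency in ~15 lines. `complete()` is a hypothetical
# stand-in for your LLM client of choice; swap in a real API call.
import re
from collections import Counter

def complete(prompt: str, temperature: float) -> str:
    """Hypothetical: send the prompt to your model, return its text."""
    raise NotImplementedError

def self_consistency(prompt: str, n: int = 5, temperature: float = 0.7) -> str:
    answers = []
    for _ in range(n):
        output = complete(prompt, temperature=temperature)  # 1. sample
        match = re.search(r"Final answer:\s*(.+)", output)  # 2. parse
        if match:
            answers.append(match.group(1).strip())
    # 3. vote; most_common() breaks ties by first appearance
    return Counter(answers).most_common(1)[0][0]
```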
## A worked example
Imagine you ran a CoT prompt for a math question 5 times. The reasoning chains differ; the final answers cluster:
| Run | Final answer | Path summary |
|---|---|---|
| 1 | 42 | Set up equation, solved directly |
| 2 | 42 | Worked backwards from constraint |
| 3 | 42 | Substitution, two-step verification |
| 4 | 38 | Arithmetic error in step 3 |
| 5 | 42 | Different framing, same answer |
Majority answer: 42, with a 4/5 vote — high confidence. Three different valid paths arrived there; one outlier had an arithmetic slip. Pure CoT might have surfaced any of these runs as "the answer." Voting picks the consensus.
## The prompt itself doesn't change
Self-consistency is an orchestration pattern, not a prompting one. You write a normal CoT prompt, then call it N times in your application code (or batch evaluator). The only nudge inside the prompt is to make the final answer easy to parse.
```
Solve the problem below. Show your reasoning step by step,
then end with a single line in the exact format:

Final answer: <value>

Problem:
{{question}}

Let's think step by step.
```

In your code, you sample this prompt N times at temperature 0.7, regex out the line after `Final answer:`, then count.
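As a concrete sketch of that parse-and-count step, with toy outputs standing in for real completions (the numbers mirror the worked example above):

```python
# Parsing sketch: pull the answer line out of each sampled output,
# drop runs that ignored the format, then tally.
import re
from collections import Counter

ANSWER_RE = re.compile(r"^Final answer:\s*(.+?)\s*$", re.MULTILINE)

def parse_answer(output: str) -> str | None:
    """Return the answer line, or None if the run ignored the format."""
    m = ANSWER_RE.search(output)
    return m.group(1) if m else None

# Toy outputs standing in for five real completions:
outputs = [
    "Set up the equation...\nFinal answer: 42",
    "Work backwards from the constraint...\nFinal answer: 42",
    "Substitute and verify...\nFinal answer: 42",
    "Step 3 slips: 17 + 12 = 31...\nFinal answer: 38",
    "Different framing, same result...\nFinal answer: 42",
]

votes = Counter(a for a in map(parse_answer, outputs) if a is not None)
answer, count = votes.most_common(1)[0]
print(f"{answer} ({count}/{len(outputs)} votes)")  # -> 42 (4/5 votes)
```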
## Picking N — how many samples?
| Use case | Suggested N | Reasoning |
|---|---|---|
| Quick A/B test, see if self-consistency helps at all | 3 | Minimum useful — catches obvious outliers |
| Most production tasks where you want the lift | 5 | Sweet spot — meaningful noise reduction at 5x cost |
| High-stakes outputs (financial, legal, medical extraction) | 10 | More samples = more confidence; cost typically justified |
| Research benchmarks, paper-quality results | 20–40 | Diminishing returns past 10 but matters for SoTA claims |
| Real-time interactive (latency-sensitive) | Skip self-consistency | Latency cost is unacceptable; accept single-run accuracy |
## Why temperature matters — and why 0 breaks the technique
If you run the prompt N times at temperature 0, you'll get N nearly-identical outputs. Voting on identical samples is useless. You need temperature high enough to produce diverse reasoning chains but not so high that the model produces incoherent output.
- Temperature 0.7 is a sane default for most models. Produces meaningful diversity without going off the rails.
- Move up to temperature 1.0 when the 0.7 samples look too similar (the model is near-deterministic on this task).
- Temperature below 0.5 defeats the purpose; the samples cluster too tightly to produce useful voting signal.
## Going further: extensions worth knowing
### Weighted voting
If your prompt produces a confidence score per run, weight votes by confidence instead of treating each run as equal. Useful when some reasoning chains are obviously sloppy (the model itself rates them as low-confidence) and shouldn't carry equal weight.
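A minimal sketch, assuming each run returns an (answer, confidence) pair because the prompt also asked for a 0–1 self-rating:

```python
# Weighted voting sketch: sum self-reported confidence per answer
# instead of counting runs equally. The (answer, confidence) pairs are
# assumed to come from a prompt that also asks for a 0-1 confidence.
from collections import defaultdict

runs = [("38", 0.95), ("38", 0.90), ("42", 0.40), ("42", 0.35), ("42", 0.30)]

weights: dict[str, float] = defaultdict(float)
for answer, confidence in runs:
    weights[answer] += confidence

print(max(weights, key=weights.get))  # -> 38: three sloppy "42" runs
                                      #    lose to two confident "38"s
```

Note the flip: a plain count would pick 42 (three votes to two); weighting lets the two confident runs outvote the three sloppy ones.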
### Verifier reranking
Run a second prompt that checks each candidate's reasoning for consistency, then pick the highest-scored. Heavier than voting but more accurate — particularly when wrong answers are common enough to win the vote, which is exactly the failure mode pure voting can't catch.
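A sketch of the rerank step, reusing the hypothetical `complete()` helper from the earlier sketch; the 0–10 scale and the verifier wording are illustrative, not a fixed recipe:

```python
# Verifier-reranking sketch: score each candidate's reasoning with a
# second prompt, keep the best. `complete()` is the same hypothetical
# LLM call as before; the 0-10 scale is an illustrative choice.
VERIFIER_PROMPT = """Rate the reasoning below for correctness and internal
consistency on a scale of 0 to 10. Reply with the number only.

{reasoning}"""

def rerank(candidates: list[str]) -> str:
    def score(candidate: str) -> float:
        reply = complete(VERIFIER_PROMPT.format(reasoning=candidate),
                         temperature=0.0)  # deterministic judge
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0  # unparseable verdict scores as zero
    return max(candidates, key=score)
```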
### Early stopping
If your first 3 samples all agree, skip the rest. Saves tokens on easy problems while still paying full cost on hard ones (where samples diverge). Drops average per-query cost significantly while preserving accuracy on the cases that matter.
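A sketch, reusing the hypothetical `complete()` and `parse_answer()` from the earlier snippets:

```python
# Early-stopping sketch: sample one run at a time; if the first k
# parsed answers agree unanimously, return immediately. Otherwise
# spend the full budget and take the majority as usual.
from collections import Counter

def self_consistency_early_stop(prompt: str, k: int = 3, n_max: int = 10) -> str:
    answers: list[str] = []
    for _ in range(n_max):
        answer = parse_answer(complete(prompt, temperature=0.7))
        if answer is not None:
            answers.append(answer)
        if len(answers) == k and len(set(answers)) == 1:
            return answers[0]  # unanimous early agreement: easy problem
    return Counter(answers).most_common(1)[0][0]  # hard problem: full vote
```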
### Mixed-temperature voting
Run some samples at temperature 0 (the model's "most likely" answer) and others at higher temperature (diverse paths). The T=0 sample anchors the "default" reasoning; the higher-T samples explore. If the anchor agrees with the majority of the diverse samples, you have very high confidence. If they disagree, the case is genuinely hard and worth escalating.
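A sketch of the anchor-and-explore pattern, again built on the earlier helpers; the "escalate" label is a placeholder for whatever review path you have:

```python
# Mixed-temperature sketch: one greedy anchor at T=0 plus diverse
# samples at T=0.7. Agreement between anchor and majority signals
# high confidence; disagreement flags the case for escalation.
from collections import Counter

def mixed_temperature_vote(prompt: str, n_diverse: int = 5) -> tuple[str, str]:
    anchor = parse_answer(complete(prompt, temperature=0.0))
    diverse = [a for _ in range(n_diverse)
               if (a := parse_answer(complete(prompt, temperature=0.7))) is not None]
    majority = Counter(diverse).most_common(1)[0][0]
    status = "high-confidence" if anchor == majority else "escalate"
    return majority, status
```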
## Common mistakes
- Forgetting to raise the temperature. Self-consistency at temperature 0 is just a single CoT run with extra steps and extra cost.
- Voting on the full output, not the answer. Two runs with different reasoning but the same final answer are agreeing. Voting on full text counts them as different. Always parse the answer first.
- Using it on tasks with continuous outputs. Voting works for discrete answers (numbers, classifications, multiple-choice). For free-text outputs, you'd need a similarity metric or an LLM judge — at that point, plain prompting is usually cheaper.
- Logging only the winner. When the vote is split, the disagreement IS information. Log all samples; downstream alerts can flag low-confidence decisions for human review.
- Applying it without measuring the lift. 5x cost is real. Run an A/B test against single-run CoT first — see A/B testing prompts — to confirm the accuracy gain justifies the cost on your task.
## Quick reference
- What it is: run the same CoT prompt N times at temperature 0.7+, parse the final answer from each, take the majority.
- The intuition: wrong paths diverge, correct paths converge. Voting amplifies the signal.
- Default N: 5 for most cases. 3 for cheap experiments. 10+ for high-stakes work.
- Temperature: 0.7 default. Lower defeats the purpose; higher trades coherence for diversity.
- When to skip: latency-critical interactive use, free-text outputs, tasks where single-run CoT already hits 95%+ accuracy.
## What to read next
Self-consistency builds on Chain-of-Thought — read that first if you haven't. For tasks where wrong early reasoning steps invalidate the chain, see Tree of Thoughts. For the practical "is the new technique actually better?" question, see A/B testing prompts. The original paper is Wang et al. (2022), "Self-Consistency Improves Chain of Thought Reasoning in Language Models"; see the papers page.