Chain-of-Thought prompting: make the model show its work
Chain-of-Thought (CoT) prompting asks the model to reason step by step before answering. Learn the zero-shot trick, the few-shot pattern, and exactly when CoT outperforms a direct prompt.
Try this with any non-reasoning model. Ask it: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?" More often than not you'll get a confident, slightly wrong number: the model jumps to an answer without thinking.
Now try the exact same question with five extra words at the end: "Let's think step by step." The model now produces something like "Roger started with 5. He bought 2 cans × 3 balls = 6. Total: 5 + 6 = 11." Confident. Right. Visible reasoning.
That gap is Chain-of-Thought (CoT) prompting in its simplest form. It's one of the highest-leverage techniques in modern prompt engineering — and one of the easiest to misuse. This guide explains why it works, when it earns its keep, when it makes things worse, and the variants worth knowing.
The whole idea in one line
Give the model room to reason in generated tokens before it commits to an answer; the intermediate steps become working memory it can build on.
The mental model: thinking in tokens
Every token a language model generates is a forward-pass calculation through its weights. The model has a finite amount of "thinking room" per pass.
When you ask a hard question and demand an immediate answer, you're asking the model to do all the reasoning silently, in a single pass. For problems that legitimately require multiple steps, this is asking too much of the model's computational budget.
CoT solves this by giving the model tokens to think with. Each intermediate reasoning step is text the model generates AND context it can attend to when generating the next step. Working memory in token space.
Two flavors: zero-shot and few-shot CoT
CoT comes in two practical varieties. Both work; pick based on how much you need to lock in the reasoning style.
Zero-shot CoT — the magic phrase
Append a short instruction to your prompt that asks the model to reason before answering. The 2022 paper that introduced the zero-shot variant (Kojima et al., "Large Language Models are Zero-Shot Reasoners") used the phrase "Let's think step by step", and it still works on modern non-reasoning models.
Cheap (one phrase, no examples), works on most reasoning tasks, but you have less control over the reasoning shape.
Without CoT:
A bag has 12 apples. You eat 3, give 4 to a friend, and buy 7 more. How many apples do you have? Answer:

With CoT:
A bag has 12 apples. You eat 3, give 4 to a friend, and buy 7 more. How many apples do you have? Let's think step by step.
With the second, the model produces something like: "Start with 12. Eat 3 → 9. Give 4 → 5. Buy 7 → 12. So the answer is 12." Visible reasoning, traceable to a correct answer.
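In code, the whole trick is one appended suffix. Here's a minimal sketch in Python, where call_llm is a hypothetical stand-in for whatever completion API you actually use (not a real library function):

```python
# Hypothetical helper: wrap your provider's completion API here.
def call_llm(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("wire this to your LLM provider")

COT_SUFFIX = "\n\nLet's think step by step."

def ask(question: str, use_cot: bool = True) -> str:
    """Optionally append the zero-shot CoT trigger before sending."""
    prompt = question + COT_SUFFIX if use_cot else question
    return call_llm(prompt)
```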
Few-shot CoT — demonstrate the reasoning shape
For tasks where the style of reasoning matters — domain-specific analysis, structured decisions, classification with rationale — show the model what good reasoning looks like by including 2-3 worked examples.
More expensive (extra tokens) but more reliable. The examples teach the model not just to reason, but to reason like you would.
Classify each support ticket as P0, P1, or P2.
Show your reasoning before the answer.
Ticket: "Site is down, customers can't log in."
Reasoning: Production is down, all users impacted, revenue blocked.
That's a system-wide outage.
Priority: P0
Ticket: "Export button fails on Safari only."
Reasoning: Single-browser issue, has a workaround (use Chrome), no
revenue impact. Annoying but not urgent.
Priority: P2
Ticket: "Billing dashboard shows wrong totals for some customers."
Reasoning: Affects multiple customers, financial accuracy issue,
no workaround. Not full outage but high impact.
Priority: P1
Ticket: "{{ticket_body}}"
Reasoning:

The few-shot examples teach the model what counts as P0 vs P1 vs P2, and how to think about it. That's worth more than any number of bullet-point definitions in an instruction.
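To fill the {{ticket_body}} slot programmatically, here's one way to assemble the prompt; render_prompt and EXAMPLES are illustrative names, not a library API:

```python
# Worked examples demonstrating the reasoning shape we want the model to copy.
EXAMPLES = [
    ("Site is down, customers can't log in.",
     "Production is down, all users impacted, revenue blocked. That's a system-wide outage.",
     "P0"),
    ("Export button fails on Safari only.",
     "Single-browser issue, has a workaround (use Chrome), no revenue impact. Annoying but not urgent.",
     "P2"),
    ("Billing dashboard shows wrong totals for some customers.",
     "Affects multiple customers, financial accuracy issue, no workaround. Not full outage but high impact.",
     "P1"),
]

def render_prompt(ticket_body: str) -> str:
    """Build the few-shot CoT prompt: instructions, worked examples, new ticket."""
    parts = [
        "Classify each support ticket as P0, P1, or P2.",
        "Show your reasoning before the answer.",
        "",
    ]
    for ticket, reasoning, priority in EXAMPLES:
        parts += [f'Ticket: "{ticket}"', f"Reasoning: {reasoning}", f"Priority: {priority}", ""]
    parts += [f'Ticket: "{ticket_body}"', "Reasoning:"]
    return "\n".join(parts)
```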
When CoT pays off — and when it doesn't
CoT is not free — it adds tokens and latency. The value comes from the reasoning part, so the question is whether your task actually needs reasoning.
When to use CoT
| Task | Use CoT? | Why |
|---|---|---|
| Math word problems, multi-step arithmetic | Yes | Direct answers fail unpredictably; reasoning fixes most |
| Multi-step logic puzzles, planning tasks | Yes | Wrong intermediate steps are visible and debuggable |
| Decisions with stated criteria (priority, severity, scoring) | Yes (few-shot) | Examples teach the rubric; reasoning shows it being applied |
| Code analysis or generation with edge cases | Yes | Surfaces assumptions before the model commits to code |
| Sentiment classification, simple labels | No | Reasoning often turns a correct snap judgment into overanalysis |
| Translation, format conversion (JSON ↔ CSV) | No | Mechanical task; reasoning adds tokens without lifting quality |
| Direct creative writing (a poem, a tagline) | No | Reasoning produces explanatory prose, not creative output |
| Single-fact lookups | No | You're paying tokens for no benefit |
Reasoning models change the math
The newer class of reasoning models — OpenAI o1/o3, Claude with extended thinking, Gemini with thinking mode — does CoT internally. The model reasons before producing its visible output, without you asking.
On these models:
- Don't add "think step by step". At best a no-op; sometimes interferes with the internal reasoning loop.
- Few-shot CoT can hurt instead of help. Reasoning models often perform better with concise, zero-shot prompts than with examples — counter to everything that's true on regular models.
- Trust the model's built-in reasoning. State the task clearly, give essential context, and let the model do its thing. Less is more.
For model-by-model guidance, see the per-model guides: ChatGPT, Claude, Gemini.
Going further: extensions worth knowing
Self-consistency
Run a CoT prompt N times at non-zero temperature, take the majority answer. Wrong reasoning paths are diverse; correct ones converge. A 5x cost increase often produces measurably better accuracy on hard tasks. See Self-consistency.
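A compact sketch of the loop, reusing the hypothetical call_llm helper from earlier and assuming your prompt instructs the model to end with a "Final answer: <value>" line:

```python
import re
from collections import Counter

def extract_final_answer(response: str) -> str | None:
    """Pull the value out of a trailing 'Final answer: <value>' line."""
    match = re.search(r"Final answer:\s*(.+)", response)
    return match.group(1).strip() if match else None

def self_consistency(prompt: str, n: int = 5) -> str | None:
    """Sample n reasoning paths at temperature > 0, return the majority answer."""
    answers = []
    for _ in range(n):
        answer = extract_final_answer(call_llm(prompt, temperature=0.8))
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None
```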
Tree of Thoughts
CoT reasons in a straight line; ToT reasons in a tree. For problems where wrong first steps cascade (planning, multi-step puzzles), ToT lets the model branch, evaluate partial solutions, and backtrack. See Tree of Thoughts.
ReAct
CoT plus tools. The model alternates reasoning steps with actions (search, calculate, look up). The blueprint behind almost every modern agent. See ReAct.
Hidden CoT — get the reasoning, hide it from users
You want CoT's accuracy benefit but don't want users seeing the reasoning. Two patterns:
- Structured output — wrap reasoning in <reasoning> tags and the final answer in <answer> tags; your application code strips the reasoning before showing the user (see the sketch after this list).
- Two-call chain — run a CoT prompt to get the answer, then a separate "polish this for the user" prompt that takes the answer and produces clean output. Costs more; gives cleaner separation. See prompt chaining.
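A minimal parser for the structured-output pattern; the tag names are just the ones used above, and call_llm is the hypothetical helper from earlier:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer) using XML-style tags.

    Falls back to treating the whole response as the answer if the
    model didn't emit the tags.
    """
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else "",
        answer.group(1).strip() if answer else response.strip(),
    )

# Show users only the answer; keep the reasoning for logs and debugging.
reasoning, answer = split_reasoning(call_llm("...your tagged CoT prompt..."))
```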
Common mistakes
- Asking for "just the answer" while keeping CoT enabled. The reasoning IS the technique. Suppress it from the visible output via tags or chained prompts — don't skip it altogether.
- Using CoT on tasks that don't need it. Sentiment classification with CoT often gets worse — the model talks itself into nuance the simple answer didn't need.
- Examples with sloppy reasoning. Few-shot CoT with hand-wavy examples teaches the model to be hand-wavy. The bar for example quality is higher than for instruction quality.
- Forgetting to specify final-answer format. Without explicit structure, the model returns "...so the answer is 12" in prose. Specify a fixed line such as Final answer: <value> so the output is parseable and consistent.
- Adding CoT to reasoning models. Already covered above, but worth repeating: o1, o3, Claude thinking mode, and Gemini thinking mode don't need it.
Quick reference
The 60-second summary
What it is: asking the model to reason step by step before answering. Two flavors — zero-shot (one magic phrase) and few-shot (worked examples).
What it solves: single-pass reasoning failures on multi-step problems — math, logic, structured decisions, multi-step analysis.
What to remember: use CoT when the task requires reasoning; skip it on lookups, simple labels, mechanical conversions, and reasoning models.
Cost: tokens, latency. Worth it on hard tasks; waste on easy ones.
What to read next
For the high-accuracy version of CoT, see Self-consistency. For exploration-heavy tasks where wrong first steps cascade, see Tree of Thoughts. For tasks where reasoning needs to interleave with tool calls, see ReAct. For the original 2022 paper and other primary sources, see our papers list.
Put this guide to work
Save your prompts, version every change, and share them with your team — free for up to 200 prompts.