Chain-of-Thought prompting: make the model show its work
Chain-of-Thought (CoT) prompting asks the model to reason step by step before answering. Learn the zero-shot trick, the few-shot pattern, and exactly when CoT outperforms a direct prompt.
Try this with any non-reasoning model. Ask it: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?" More often than not you'll get a confident, slightly wrong number: the model jumps to an answer without thinking.
Now try the exact same question with five extra words at the end: "Let's think step by step." The model now produces something like "Roger started with 5. He bought 2 cans × 3 balls = 6. Total: 5 + 6 = 11." Confident. Right. Visible reasoning.
That gap is Chain-of-Thought (CoT) prompting in its simplest form. It's one of the highest-leverage techniques in modern prompt engineering — and one of the easiest to misuse. This guide explains why it works, when it earns its keep, when it makes things worse, and the variants worth knowing.
The whole idea in one line
Give the model room to reason in generated tokens before it commits to an answer; the intermediate steps become working memory it can build on.
The mental model: thinking in tokens
Every token a language model generates is a forward-pass calculation through its weights. The model has a finite amount of "thinking room" per pass.
When you ask a hard question and demand an immediate answer, you're asking the model to do all the reasoning silently, in a single pass. For problems that legitimately require multiple steps, this is asking too much of the model's computational budget.
CoT solves this by giving the model tokens to think with. Each intermediate reasoning step is text the model generates AND context it can attend to when generating the next step. Working memory in token space.
Two flavors: zero-shot and few-shot CoT
CoT comes in two practical varieties. Both work; pick based on how much you need to lock in the reasoning style.
Zero-shot CoT — the magic phrase
Append a short instruction to your prompt that asks the model to reason before answering. The 2022 paper that introduced the zero-shot variant (Kojima et al., "Large Language Models are Zero-Shot Reasoners") used the phrase "Let's think step by step", and it still works on modern non-reasoning models.
Cheap (one phrase, no examples), works on most reasoning tasks, but you have less control over the reasoning shape.
Without CoT:
A bag has 12 apples. You eat 3, give 4 to a friend, and buy 7 more. How many apples do you have? Answer:

With CoT:
A bag has 12 apples. You eat 3, give 4 to a friend, and buy 7 more. How many apples do you have? Let's think step by step.
With the second, the model produces something like: "Start with 12. Eat 3 → 9. Give 4 → 5. Buy 7 → 12. So the answer is 12." Visible reasoning, traceable to a correct answer.
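In code, the whole trick is one appended suffix. Here's a minimal sketch in Python, where call_llm is a hypothetical stand-in for whatever completion API you actually use (not a real library function):

```python
# Hypothetical helper: wrap your provider's completion API here.
def call_llm(prompt: str, temperature: float = 0.7) -> str:
    raise NotImplementedError("wire this to your LLM provider")

COT_SUFFIX = "\n\nLet's think step by step."

def ask(question: str, use_cot: bool = True) -> str:
    """Optionally append the zero-shot CoT trigger before sending."""
    prompt = question + COT_SUFFIX if use_cot else question
    return call_llm(prompt)
```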
Few-shot CoT — demonstrate the reasoning shape
For tasks where the style of reasoning matters — domain-specific analysis, structured decisions, classification with rationale — show the model what good reasoning looks like by including 2-3 worked examples.
More expensive (extra tokens) but more reliable. The examples teach the model not just to reason, but to reason like you would.
Classify each support ticket as P0, P1, or P2.
Show your reasoning before the answer.
Ticket: "Site is down, customers can't log in."
Reasoning: Production is down, all users impacted, revenue blocked.
That's a system-wide outage.
Priority: P0
Ticket: "Export button fails on Safari only."
Reasoning: Single-browser issue, has a workaround (use Chrome), no
revenue impact. Annoying but not urgent.
Priority: P2
Ticket: "Billing dashboard shows wrong totals for some customers."
Reasoning: Affects multiple customers, financial accuracy issue,
no workaround. Not full outage but high impact.
Priority: P1
Ticket: "{{ticket_body}}"
Reasoning:

The few-shot examples teach the model what counts as P0 vs P1 vs P2, and how to think about it. That's worth more than any number of bullet-point definitions in an instruction.
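To fill the {{ticket_body}} slot programmatically, here's one way to assemble the prompt; render_prompt and EXAMPLES are illustrative names, not a library API:

```python
# Worked examples demonstrating the reasoning shape we want the model to copy.
EXAMPLES = [
    ("Site is down, customers can't log in.",
     "Production is down, all users impacted, revenue blocked. That's a system-wide outage.",
     "P0"),
    ("Export button fails on Safari only.",
     "Single-browser issue, has a workaround (use Chrome), no revenue impact. Annoying but not urgent.",
     "P2"),
    ("Billing dashboard shows wrong totals for some customers.",
     "Affects multiple customers, financial accuracy issue, no workaround. Not full outage but high impact.",
     "P1"),
]

def render_prompt(ticket_body: str) -> str:
    """Build the few-shot CoT prompt: instructions, worked examples, new ticket."""
    parts = [
        "Classify each support ticket as P0, P1, or P2.",
        "Show your reasoning before the answer.",
        "",
    ]
    for ticket, reasoning, priority in EXAMPLES:
        parts += [f'Ticket: "{ticket}"', f"Reasoning: {reasoning}", f"Priority: {priority}", ""]
    parts += [f'Ticket: "{ticket_body}"', "Reasoning:"]
    return "\n".join(parts)
```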
When CoT pays off — and when it doesn't
CoT is not free — it adds tokens and latency. The value comes from the reasoning part, so the question is whether your task actually needs reasoning.
When to use CoT
| Task | Use CoT? | Why |
|---|---|---|
| Math word problems, multi-step arithmetic | Yes | Direct answers fail unpredictably; reasoning fixes most |
| Multi-step logic puzzles, planning tasks | Yes | Wrong intermediate steps are visible and debuggable |
| Decisions with stated criteria (priority, severity, scoring) | Yes (few-shot) | Examples teach the rubric; reasoning shows it being applied |
| Code analysis or generation with edge cases | Yes | Surfaces assumptions before the model commits to code |
| Sentiment classification, simple labels | No | Reasoning often turns a correct snap judgment into overanalysis |
| Translation, format conversion (JSON ↔ CSV) | No | Mechanical task; reasoning adds tokens without lifting quality |
| Direct creative writing (a poem, a tagline) | No | Reasoning produces explanatory prose, not creative output |
| Single-fact lookups | No | You're paying tokens for no benefit |
Reasoning models change the math
The newer class of reasoning models — OpenAI o1/o3, Claude with extended thinking, Gemini with thinking mode — does CoT internally. The model reasons before producing its visible output, without you asking.
On these models:
- Don't add "think step by step". At best a no-op; sometimes interferes with the internal reasoning loop.
- Few-shot CoT can hurt instead of help. Reasoning models often perform better with concise, zero-shot prompts than with examples — counter to everything that's true on regular models.
- Trust the model's built-in reasoning. State the task clearly, give essential context, and let the model do its thing. Less is more.
For model-by-model guidance, see the per-model guides: ChatGPT, Claude, Gemini.
Going further: extensions worth knowing
Self-consistency
Run a CoT prompt N times at non-zero temperature, take the majority answer. Wrong reasoning paths are diverse; correct ones converge. A 5x cost increase often produces measurably better accuracy on hard tasks. See Self-consistency.
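A compact sketch of the loop, reusing the hypothetical call_llm helper from earlier and assuming your prompt instructs the model to end with a "Final answer: <value>" line:

```python
import re
from collections import Counter

def extract_final_answer(response: str) -> str | None:
    """Pull the value out of a trailing 'Final answer: <value>' line."""
    match = re.search(r"Final answer:\s*(.+)", response)
    return match.group(1).strip() if match else None

def self_consistency(prompt: str, n: int = 5) -> str | None:
    """Sample n reasoning paths at temperature > 0, return the majority answer."""
    answers = []
    for _ in range(n):
        answer = extract_final_answer(call_llm(prompt, temperature=0.8))
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None
```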
Tree of Thoughts
CoT reasons in a straight line; ToT reasons in a tree. For problems where wrong first steps cascade (planning, multi-step puzzles), ToT lets the model branch, evaluate partial solutions, and backtrack. See Tree of Thoughts.
ReAct
CoT plus tools. The model alternates reasoning steps with actions (search, calculate, look up). The blueprint behind almost every modern agent. See ReAct.
Hidden CoT — get the reasoning, hide it from users
You want CoT's accuracy benefit but don't want users seeing the reasoning. Two patterns:
- Structured output — wrap reasoning in <reasoning> tags and the final answer in <answer> tags; your application code strips the reasoning before showing the user (see the sketch after this list).
- Two-call chain — run a CoT prompt to get the answer, then a separate "polish this for the user" prompt that takes the answer and produces clean output. Costs more; gives cleaner separation. See prompt chaining.
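A minimal parser for the structured-output pattern; the tag names are just the ones used above, and call_llm is the hypothetical helper from earlier:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer) using XML-style tags.

    Falls back to treating the whole response as the answer if the
    model didn't emit the tags.
    """
    reasoning = re.search(r"<reasoning>(.*?)</reasoning>", response, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return (
        reasoning.group(1).strip() if reasoning else "",
        answer.group(1).strip() if answer else response.strip(),
    )

# Show users only the answer; keep the reasoning for logs and debugging.
reasoning, answer = split_reasoning(call_llm("...your tagged CoT prompt..."))
```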
Common mistakes
- Asking for "just the answer" while keeping CoT enabled. The reasoning IS the technique. Suppress it from the visible output via tags or chained prompts — don't skip it altogether.
- Using CoT on tasks that don't need it. Sentiment classification with CoT often gets worse — the model talks itself into nuance the simple answer didn't need.
- Examples with sloppy reasoning. Few-shot CoT with hand-wavy examples teaches the model to be hand-wavy. The bar for example quality is higher than for instruction quality.
- Forgetting to specify final-answer format. Without explicit structure, the model returns "...so the answer is 12" in prose. Specify a fixed line such as Final answer: <value> so the output is parseable and consistent.
- Adding CoT to reasoning models. Already covered above, but worth repeating: o1, o3, Claude thinking mode, and Gemini thinking mode don't need it.
Quick reference
The 60-second summary
What it is: asking the model to reason step by step before answering. Two flavors — zero-shot (one magic phrase) and few-shot (worked examples).
What it solves: single-pass reasoning failures on multi-step problems — math, logic, structured decisions, multi-step analysis.
What to remember: use CoT when the task requires reasoning; skip it on lookups, simple labels, mechanical conversions, and reasoning models.
Cost: tokens, latency. Worth it on hard tasks; waste on easy ones.
What to read next
For the high-accuracy version of CoT, see Self-consistency. For exploration-heavy tasks where wrong first steps cascade, see Tree of Thoughts. For tasks where reasoning needs to interleave with tool calls, see ReAct. For the original 2022 paper and other primary sources, see our papers list.
Put this guide to work
Save your prompts, version every change, and share them with your team — free for up to 200 prompts.