Tree of Thoughts: when reasoning needs to backtrack
Tree of Thoughts (ToT) explores multiple reasoning paths and lets the model abandon dead ends. Learn the structure, when ToT is worth the orchestration cost, and a simpler approximation you can ship today.
You're solving a 5-move chess puzzle. Move 1: an obvious developing move. Move 2: three reasonable choices, each leading to very different positions. You spend 30 seconds on each option, sketching the next 2-3 moves to see which line looks best. The first option hits a dead end at move 4. You back up. Try option 2. Better. Commit.
That's exactly how Tree of Thoughts (ToT) wants the model to think. Chain-of-Thought reasons in a straight line — if move 2 is wrong, every move after is doomed. ToT explores multiple reasoning paths at decision points, evaluates partial solutions, and abandons the ones that don't pan out. The way humans solve hard problems they've never seen before.
It costs more orchestration than CoT and more tokens than self-consistency. But on problems where wrong first steps can't be recovered from — strategic planning, multi-step puzzles, complex code refactors — ToT outperforms both.
The whole idea in one line
Instead of committing to a single chain of reasoning, generate several candidate next steps, score them, expand the most promising, and backtrack when a branch dies.
The mental model: search, not a straight line#
Chain-of-Thought is a depth-first walk: take one step, then the next, then the next, never looking back. Tree of Thoughts is a search: explore multiple branches at each node, score them, expand the best ones, abandon the rest.
This kind of search has been used in problem-solving for decades: chess engines use it (alpha-beta search), planners use it (A* search), and human experts use it informally. ToT applies it to LLM reasoning.
Two requirements for ToT to be worth the complexity:
- The problem decomposes into discrete steps. No steps, no tree.
- You can score partial solutions. Without an evaluator, you can't prune — every branch grows forever.
Why CoT alone isn't enough#
Imagine the model is solving a logic puzzle. CoT produces one move, then the next, then the next — a single line of reasoning. If move 2 is wrong, moves 3-5 are doomed and the model has no way to know.
ToT lets the model say: "at move 2, I have three options. Let me explore each one shallowly, evaluate the resulting position, and pick the most promising." If option A leads to a dead end three moves deeper, the model backtracks to move 2 and tries option B.
The ToT loop, abstractly#
- Decompose. Break the problem into a sequence of decisions (moves, steps, sub-goals).
- Generate. At the current step, ask the model for K candidate next-steps.
- Evaluate. Ask the model (or a separate evaluator prompt) to score each candidate on promise / feasibility / progress.
- Prune. Keep the top M candidates.
- Repeat. For each surviving candidate, go back to step 2.
- Stop. When you reach a complete solution or a budget limit.
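As a sketch, the loop above is about twenty lines of Python. `generate` and `evaluate` here are hypothetical wrappers around LLM calls (one prompt for proposing next steps, one for scoring), not any specific provider's API:

```python
# A minimal sketch of the ToT loop as beam search over partial reasoning
# paths. `generate` and `evaluate` are hypothetical LLM-call wrappers.
from typing import Callable

def tree_of_thoughts(
    problem: str,
    generate: Callable[[str, list[str]], list[str]],  # (problem, path) -> K candidate next steps
    evaluate: Callable[[str, list[str]], float],      # (problem, path) -> promise score
    keep: int = 2,                                    # paths kept after pruning
    max_depth: int = 5,                               # budget: max steps per branch
) -> list[str]:
    frontier: list[list[str]] = [[]]       # partial paths; start from the empty path
    best_path, best_score = [], float("-inf")
    for _ in range(max_depth):             # Stop: depth budget
        scored = []
        for path in frontier:
            for step in generate(problem, path):      # Generate
                candidate = path + [step]
                score = evaluate(problem, candidate)  # Evaluate
                scored.append((score, candidate))
                if score > best_score:
                    best_score, best_path = score, candidate
        scored.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [path for _, path in scored[:keep]]  # Prune, then Repeat
    return best_path                       # best solution found within budget
```

The prune step is what makes this a search rather than N parallel chains: low-scoring paths stop consuming budget.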
A practical, prompt-only version#
The full ToT framework requires orchestration code — a tree data structure, evaluator calls, pruning logic. That's worth building for production use. But there's a lighter-weight version that runs in a single prompt and captures most of the value:
```
Solve the problem below using a tree-of-thoughts approach.

For each step, do this:
1. Generate 3 candidate next-moves.
2. For each candidate, briefly assess: would this likely lead to a good
   solution? Output a score 1-5 and one-sentence reasoning.
3. Pick the highest-scored candidate. If two tie, prefer the simpler one.
4. Continue from that move.

If you reach a dead end, backtrack to the most recent step where you
had a >=4-rated alternative you didn't take. Mark this clearly with
"BACKTRACK to step N: <reason>".

End when you have a complete solution. Final line:
Final answer: <answer>

Problem:
{{problem}}
```

This isn't the full algorithm — there's no real tree, just sequential exploration with explicit backtracking. But it captures the most useful behaviors: considering alternatives at each step, evaluating before committing, and going back when stuck.
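Wiring this template into code is mostly slot-filling and parsing the pinned final line. A minimal sketch, assuming a generic `call_model(prompt) -> str` client (hypothetical, not a specific SDK):

```python
# Hypothetical wiring for the template above; only the slot-filling and
# final-line parsing are the point here.
TOT_PROMPT = "..."  # paste the full template above; use {problem} as the slot

def solve(problem: str, call_model) -> str:
    response = call_model(TOT_PROMPT.format(problem=problem))
    # The template pins the last line to "Final answer: <answer>",
    # so scan from the end of the response.
    for line in reversed(response.strip().splitlines()):
        if line.startswith("Final answer:"):
            return line.removeprefix("Final answer:").strip()
    raise ValueError("no 'Final answer:' line found; retry or raise the budget")
```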
When ToT is worth the effort#
ToT vs. its alternatives
| If your situation is… | Reach for… | Why |
|---|---|---|
| Problem has a search structure (branching choices) | ToT | The whole point: explore multiple paths, prune the bad ones |
| Wrong first decisions can't be recovered from | ToT | Pure CoT doesn't backtrack; ToT does |
| You can score partial solutions | ToT | Without an evaluator, ToT can't prune — and is just expensive CoT |
| Linear reasoning, no branching | CoT | No tree to explore — ToT pays branching cost for nothing |
| Hard reasoning where single-run errors are common | Self-consistency | Cheaper than ToT; runs CoT N times and votes |
| Latency-sensitive interactive use | CoT or reasoning model | ToT is 2–3× the latency of CoT; usually too slow |
| Production planning / agent loops | Real ToT (orchestrated) | Single-prompt approximation breaks at scale; build the real thing |
When ToT is overkill#
- Single-step tasks. If the problem doesn't decompose into discrete decisions, there is no tree to explore.
- Tasks where self-consistency already gets you to 95%+ accuracy. Don't introduce ToT's complexity for marginal gains.
- Latency-sensitive applications. Even the lightweight prompt-only version typically takes 2-3x the tokens of plain CoT.
Going further: production ToT#
Build the real thing for production use#
If you're building a production agent that solves problems requiring real planning — code generation across large codebases, tool-use with rollback, simulation-based decision-making — invest in the full ToT orchestration: a search tree in code, separate evaluator calls, depth-first or beam search, explicit budget. That's a system, not a prompt.
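A skeleton for that system might look like the sketch below: an explicit `Node` tree, depth-first expansion, score-based pruning, and backtracking that falls out of trying siblings. `generate` and `evaluate` are again hypothetical LLM-call wrappers:

```python
# A sketch of the orchestrated version: explicit tree, depth-first search,
# score-based pruning. Not a full implementation; no token accounting here.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Node:
    steps: list[str]                   # reasoning path from the root
    score: float = 0.0                 # evaluator's promise score for this path
    children: list["Node"] = field(default_factory=list)

def search(
    node: Node,
    problem: str,
    generate: Callable[[str, list[str]], list[str]],
    evaluate: Callable[[str, list[str]], float],
    depth_left: int,
    min_score: float = 3.0,            # prune branches the evaluator rates below this
) -> Node:
    if depth_left == 0:
        return node                    # depth budget hit: return best-so-far
    best = node
    for step in generate(problem, node.steps):
        child = Node(node.steps + [step])
        child.score = evaluate(problem, child.steps)
        node.children.append(child)
        if child.score < min_score:
            continue                   # dead end: move on to the next sibling
        deepest = search(child, problem, generate, evaluate, depth_left - 1, min_score)
        if deepest.score > best.score:
            best = deepest             # backtracking is implicit: siblings get tried
    return best
```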
LLM-as-evaluator#
For non-trivial problems, the evaluator is its own LLM call — separate from the generator. Different prompt, narrowly scoped to scoring. Separation prevents the generator's biases from leaking into the evaluator's scores. Roughly 2× the cost; substantially better pruning.
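Concretely, the evaluator can be this small, assuming the same generic `call_model` client as above (a sketch, not a specific provider's API):

```python
# A separate evaluator call, narrowly scoped to scoring a partial solution.
EVAL_PROMPT = """You are scoring a partial solution. Do NOT continue it.

Problem: {problem}

Partial solution so far:
{steps}

On a scale of 1-5, how likely is this path to lead to a correct,
complete solution? Reply with only the number."""

def evaluate(problem: str, steps: list[str], call_model) -> float:
    reply = call_model(EVAL_PROMPT.format(problem=problem, steps="\n".join(steps)))
    try:
        return float(reply.strip())
    except ValueError:
        return 1.0  # unparseable score: treat as low promise rather than crash
```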
Beam search vs. greedy#
Pure greedy search keeps only the top-scored candidate at each step. Beam search keeps the top K — wider but more expensive. For high-stakes problems where the "obvious" best move sometimes leads astray, beam search's extra exploration often pays off. The same trade-off is standard in neural sequence decoders.
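In code, the difference is one parameter. Reusing the `(score, path)` pairs from the loop sketch earlier:

```python
# Greedy vs. beam: the only difference is how many scored candidates
# survive each round. `scored` holds (score, path) pairs.
def prune(scored: list[tuple[float, list[str]]], beam_width: int = 1) -> list[list[str]]:
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for _, path in scored[:beam_width]]  # beam_width=1 is greedy
```

A reasonable default: start greedy, and widen the beam only if traces show the top-ranked candidate regularly failing downstream.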
Budget management#
ToT can run forever if you let it. Set explicit budgets:
- Depth budget — max steps in any branch.
- Width budget — max candidates per step.
- Token budget — overall cost ceiling.
When any budget is hit, return the best solution found so far rather than failing. Anytime-algorithm semantics.
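One way to make those budgets explicit in code (a sketch only; token accounting depends on your client):

```python
# Explicit budgets with anytime semantics: when charge() returns False,
# stop expanding and return the best solution found so far.
from dataclasses import dataclass

@dataclass
class Budget:
    max_depth: int = 6         # depth budget: longest allowed branch
    max_width: int = 3         # width budget: candidates generated per step
    max_tokens: int = 50_000   # token budget: overall cost ceiling
    spent_tokens: int = 0

    def charge(self, tokens: int) -> bool:
        """Record spend; False means stop and return the best-so-far."""
        self.spent_tokens += tokens
        return self.spent_tokens <= self.max_tokens
```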
Common mistakes#
- Trying ToT on a task that's really linear. If there's only one reasonable move at each step, you're paying for branching that doesn't exist.
- No clear evaluation criteria. Without a way to score candidates, the model picks arbitrarily — you're burning tokens for nothing.
- Letting the tree grow unbounded. Set a depth limit and a candidate-count budget before the first call, not after the bill arrives.
- Skipping the backtracking instruction. Without explicit permission to backtrack, the model pushes through dead ends, because single-path reasoning is the habit its training instilled.
- Reaching for ToT before trying CoT and self-consistency. Most reasoning tasks are sufficiently solved by cheaper techniques. ToT is the last-resort tool, not the first-reach one.
Quick reference#
The 60-second summary
What it is: reasoning as search — generate multiple candidates per step, score them, expand the best, backtrack from dead ends.
When it shines: branching decisions, hard planning, problems where wrong first steps invalidate later ones.
When to skip: linear tasks, cases where self-consistency already works, latency-sensitive interactive use.
Two requirements: the problem decomposes into steps; you can score partial solutions.
The discipline: always set depth, width, and token budgets. Always use a separate evaluator for scoring on non-trivial problems.
What to read next#
Before going deeper, make sure Chain-of-Thought and self-consistency are part of your toolkit — they're cheaper and sufficient for 80% of reasoning tasks. For tasks involving tool calls or external lookups, ReAct is the next pattern to learn. For the original ToT paper (Yao et al., 2023), see the papers page.