Tree-of-thoughts (ToT) is a reasoning pattern that turns a single linear chain of thought into a deliberate search over many possible reasoning paths. The model proposes several candidate next "thoughts," evaluates each one, and explores the most promising branches further, backtracking when a branch looks like a dead end. It is what chess engines have done for decades, applied to natural-language reasoning steps.
The pattern was formalized in May 2023 by Shunyu Yao and collaborators in "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." Before ToT, the standard answer to "the model got it wrong" was either more few-shot examples or self-consistency (sample N chains, majority vote). ToT showed that an explicit search over self-generated thoughts, with the model itself acting as both generator and evaluator, dramatically outperforms majority voting on tasks where the solution space has branching structure.
In 2026, ToT lives in an interesting niche. The biggest reasoning models do a version of this internally (GPT-5.5's reasoning passes, Claude 4.7's extended thinking). But for production agent loops, planning tasks, and hard combinatorial puzzles where you want explicit control over how the search proceeds, ToT remains a useful tool, especially when paired with a strong evaluator model and a cheap proposer model.
TL;DR
- What it is: a search procedure where an LLM generates multiple candidate "thoughts" at each step, evaluates them, and explores the most promising branches.
- Origin: Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (May 2023, NeurIPS 2023).
- Key ingredients: a thought generator, a thought evaluator (same or different model), and a search strategy (BFS or DFS).
- Best on: puzzles, planning, hard math, creative writing with constraints. Game of 24, constrained creative writing, and 5x5 mini crosswords are the canonical demos.
- Cost: 10x to 100x the tokens of a single CoT call. Use only when single-pass reasoning fails.
- 2026 context: frontier reasoning models do this internally; explicit ToT is most valuable when you need transparency, control, or a cheaper proposer with an expensive evaluator.
Origin: Yao et al., 2023
The canonical reference is Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan, "Tree of Thoughts: Deliberate Problem Solving with Large Language Models," NeurIPS 2023. The paper benchmarks three tasks designed to be hard for chain-of-thought:
- Game of 24: given four numbers, reach 24 using basic arithmetic and parentheses.
- Creative writing: generate a coherent passage with constrained sentence endings.
- 5x5 mini crosswords.
On Game of 24, CoT solves around 4 percent of instances. ToT with breadth-first search and a thought evaluator solves 74 percent. The gap is the whole story: branching plus evaluation plus search beats one shot at the answer.
The framework borrows from Newell and Simon's classic "deliberate problem solving" view: System 2 reasoning is search, not pattern completion. ToT is the LLM version of that.
How ToT works
A ToT run has four conceptual pieces.
1. Thought decomposition. Break the problem into discrete "thoughts." For Game of 24, a thought is one arithmetic step ("3 + 5 = 8"). For creative writing, a thought is a plan for one paragraph. The granularity matters: too coarse and the tree is shallow but each node is hard; too fine and the tree explodes.
2. Thought generator. At each tree node, ask the model to propose k candidate next thoughts. You can sample with temperature > 0 to get diversity, or use a "propose prompt" that explicitly asks for multiple options.
3. Thought evaluator. Score each candidate. The simplest form asks the model "is this thought a) sure to lead to a solution, b) possible, c) impossible?" The evaluator can be the same model or a different one. Yao et al. used GPT-4 as both proposer and evaluator.
4. Search algorithm. Decide which nodes to expand next. Two common choices:
- BFS: keep the top b candidates at each depth, expand all of them, repeat. Good for problems with bounded depth.
- DFS: go deep on the most promising branch, backtrack when it looks unpromising. Good for problems with variable depth.
ToT is not magic. It is a systematic way to spend more compute in exchange for higher accuracy on tasks where a single forward pass cannot fit the search.
Implementation sketch in Python
A minimal Game of 24 ToT.
from openai import OpenAI
client = OpenAI()
PROPOSE_PROMPT = """You are solving the Game of 24. Given the current numbers,
propose 3 possible next arithmetic steps. Each step combines two numbers using
+, -, *, or /. Output one per line in the format:
<a> <op> <b> = <result> (remaining: <list>)
Current numbers: {numbers}"""
EVAL_PROMPT = """Given current numbers {numbers}, can we reach 24 using +,-,*,/?
Answer 'sure', 'likely', or 'impossible' with one short reason."""
def propose(numbers):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": PROPOSE_PROMPT.format(numbers=numbers)}],
        temperature=0.7,
    )
    return parse_steps(resp.choices[0].message.content)

def evaluate(numbers):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": EVAL_PROMPT.format(numbers=numbers)}],
    )
    text = resp.choices[0].message.content.lower()
    return {"sure": 1.0, "likely": 0.5, "impossible": 0.0}.get(
        next((k for k in ("sure", "likely", "impossible") if k in text), "likely"),
        0.5,
    )

def tot_bfs(start_numbers, beam_width=5, max_depth=3):
    frontier = [(start_numbers, [])]
    for depth in range(max_depth):
        candidates = []
        for numbers, path in frontier:
            for step in propose(numbers):
                new_numbers = apply_step(numbers, step)
                score = evaluate(new_numbers)
                candidates.append((new_numbers, path + [step], score))
        candidates.sort(key=lambda c: c[2], reverse=True)
        frontier = [(n, p) for n, p, _ in candidates[:beam_width]]
        for numbers, path in frontier:
            if len(numbers) == 1 and abs(numbers[0] - 24) < 1e-6:
                return path
    return None

The helpers parse_steps and apply_step are problem-specific arithmetic plumbing. The interesting part is the loop: propose, evaluate, prune, repeat.
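For concreteness, here is one possible sketch of those two helpers, assuming the output format requested by the propose prompt above. The regex is illustrative, not robust to every model formatting quirk, and it trusts the model's claimed result rather than recomputing it.

```python
import re

def parse_steps(text):
    """Extract steps like '3 + 5 = 8 (remaining: [8, 8, 8])' from model output."""
    pattern = r"(\d+(?:\.\d+)?)\s*([+\-*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)"
    steps = []
    for line in text.splitlines():
        m = re.search(pattern, line)
        if m:
            a, op, b, result = m.groups()
            steps.append((float(a), op, float(b), float(result)))
    return steps

def apply_step(numbers, step):
    """Remove the two operands from the pool and add the claimed result."""
    a, op, b, result = step
    pool = list(numbers)
    pool.remove(a)
    pool.remove(b)
    return pool + [result]
```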
This skeleton easily becomes a generic ToT runner if you parametrize the proposer prompt and the evaluator prompt. The Princeton authors released a reference implementation that follows roughly this shape.
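One refinement from the paper: rather than trusting a single evaluator call per state, sample the value prompt several times and aggregate, which smooths out individual misjudgments. A minimal sketch; the scorer is injected as a parameter (a name of my choosing) so the function stays independent of any particular API client.

```python
def evaluate_averaged(numbers, n_samples=3, evaluate_fn=None):
    """Average several independent evaluator samples to reduce variance.

    evaluate_fn is the single-call scorer (e.g. an LLM-backed evaluate
    function); injecting it keeps this sketch free of API specifics.
    """
    scores = [evaluate_fn(numbers) for _ in range(n_samples)]
    return sum(scores) / len(scores)
```

Sampling three times roughly triples evaluation cost, so this is worth it mainly when pruning decisions are the bottleneck.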
ToT vs CoT vs least-to-most
These three patterns form a clean progression.
Chain-of-thought. One linear reasoning trace. Cheap. Works when the right answer is reachable in one pass with intermediate steps. See chain-of-thought prompting.
Least-to-most. Decompose the problem into ordered subproblems, then chain. The structure is linear, but the work is split. Better than CoT on tasks that benefit from explicit decomposition. See least-to-most prompting.
Tree-of-thoughts. The decomposition has branching: at each step, multiple options, evaluated, with backtracking. ToT subsumes both CoT (a tree with beam width 1) and least-to-most (a chain with explicit subproblems).
Cost scales with the structure. CoT is 1x. Least-to-most is roughly k x for k subproblems. ToT with beam width b and depth d is O(b * d * proposals) model calls. A typical Game of 24 run costs 50 to 100 model calls per puzzle.
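To make that call-count figure concrete, here is a back-of-the-envelope counter for a beam-search run, assuming one propose call per expanded node and one evaluate call per candidate (a simplification; real runs add retries and parsing failures).

```python
def tot_call_count(beam_width, depth, proposals_per_node):
    """Count model calls for a beam-search ToT run.

    Each level issues one propose call per frontier node and one evaluate
    call per generated candidate; the frontier is then pruned to beam_width.
    """
    calls = 0
    frontier = 1  # start from the single root state
    for _ in range(depth):
        calls += frontier                       # propose calls
        calls += frontier * proposals_per_node  # evaluate calls
        frontier = min(beam_width, frontier * proposals_per_node)
    return calls
```

With beam width 5, depth 3, and 5 proposals per node this comes to 66 calls, consistent with the 50-to-100-calls-per-puzzle figure above.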
BFS vs DFS
The Yao et al. paper compares both.
BFS maintains a fixed-width frontier at each depth. Good when:
- the solution depth is roughly known,
- you want bounded worst-case cost,
- you want easy parallelism (all frontier nodes can be expanded concurrently).
DFS greedily descends the highest-scored branch and backtracks on failure. Good when:
- the solution depth varies,
- you want to stop early when a sure path is found,
- you have a strong evaluator that reliably ranks branches.
In practice, BFS with width 5 and depth 3 to 5 is a reasonable default. If you have a cheap evaluator and an expensive proposer, BFS amortizes well. If the proposer is cheap and the evaluator is expensive, DFS often wins.
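For completeness, a DFS variant with backtracking can be sketched with the same propose/evaluate split. All problem specifics are injected as parameters (names and the pruning threshold are mine, chosen for illustration, not from the paper's reference code).

```python
def tot_dfs(state, path, propose_fn, apply_fn, evaluate_fn, is_goal,
            max_depth=4, prune_below=0.25):
    """Depth-first ToT: descend the highest-scored branch, backtrack on dead ends.

    propose_fn(state) -> candidate steps, apply_fn(state, step) -> next state,
    evaluate_fn(state) -> score in [0, 1], is_goal(state) -> bool.
    """
    if is_goal(state):
        return path
    if len(path) >= max_depth:
        return None
    # Score every child, then try them best-first.
    scored = sorted(
        ((evaluate_fn(apply_fn(state, step)), step) for step in propose_fn(state)),
        key=lambda s: s[0],
        reverse=True,
    )
    for score, step in scored:
        if score < prune_below:
            continue  # prune branches the evaluator considers hopeless
        result = tot_dfs(apply_fn(state, step), path + [step], propose_fn,
                         apply_fn, evaluate_fn, is_goal, max_depth, prune_below)
        if result is not None:
            return result
    return None  # every branch exhausted: caller backtracks
```

The early return on a goal state is the payoff DFS promises: when the evaluator ranks well, a sure path terminates the search without expanding the rest of the frontier.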
When ToT is worth the cost
A short checklist.
- The problem has clear branching structure (multiple plausible next steps).
- Single-pass CoT, even with self-consistency, fails or barely solves the task.
- You can write a useful evaluator prompt. If "is this thought good?" cannot be answered locally, ToT will not work; the search has no signal.
- The task tolerates 5 to 60 second latency.
- The unit economics tolerate 10x to 100x the per-task token cost.
Tasks where ToT shines: combinatorial puzzles, multi-step math, route planning, code generation with constraints, retrieval-heavy agents that need to choose between query reformulations, structured creative writing.
Tasks where ToT is overkill: any question a reasoning model gets right in one pass, anything latency-sensitive, anything where the evaluator cannot distinguish good from bad locally.
TypeScript variant
The same skeleton in TS, for teams already in Node land.
import OpenAI from "openai";
const client = new OpenAI();
type Node = { state: number[]; path: string[]; score: number };
async function propose(state: number[]): Promise<string[]> {
  const resp = await client.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0.7,
    messages: [
      { role: "user", content: `Propose 3 next steps for: ${state.join(", ")}` },
    ],
  });
  return (resp.choices[0].message.content ?? "").split("\n").filter(Boolean);
}

async function evaluate(state: number[]): Promise<number> {
  const resp = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "user", content: `Can we reach 24 from ${state.join(", ")}? Answer sure, likely, or impossible.` },
    ],
  });
  const text = (resp.choices[0].message.content ?? "").toLowerCase();
  if (text.includes("sure")) return 1.0;
  if (text.includes("impossible")) return 0.0;
  return 0.5;
}

You wire propose and evaluate into a BFS or DFS loop just as in the Python version.
Tracing a ToT run
The hardest part of operating a ToT system is debugging it. With dozens of model calls per task, you need observability. Every proposal, every evaluation, every pruning decision should be a span.
from respan import respan, observe
respan.init()
@observe()
def propose(numbers): ...
@observe()
def evaluate(numbers): ...
@observe()
def tot_bfs(start_numbers): ...

That gives you a nested trace where each call shows the prompt, response, tokens, and latency. When ToT fails on a puzzle, you scroll the tree and see which evaluation was overconfident, which branch was pruned too early. See LLM tracing for the wiring details.
2026 context: do reasoning models obsolete ToT?
Partially. GPT-5.5, Claude Opus 4.7 with extended thinking, and Gemini 3.1 Pro all run a learned, internal search at inference time. For many problems where you would have reached for ToT in 2023, you now just pass the prompt to a reasoning model and accept a few thousand reasoning tokens of cost. The pattern lives inside the model.
But explicit ToT still wins in three cases.
- You need transparency. A reasoning model's internal tokens are opaque or summarized. ToT gives you every branch, every score, every decision, in your own logs.
- You want a cheap proposer with an expensive evaluator (or vice versa). Reasoning models bundle proposal and evaluation. ToT lets you mix and match: Sonnet 4.6 as proposer, Opus 4.7 as evaluator, for instance.
- The search structure is domain-specific. If you know the branching factor, depth, or pruning rules of your problem, an explicit ToT loop encodes that knowledge directly. The model does not have to rediscover it on every call.
The honest 2026 default is: try a reasoning model first, with a strong prompt. If that fails or is too expensive, build ToT.
FAQ
Is tree-of-thoughts the same as ReAct? No. ReAct interleaves reasoning with external tool calls. ToT interleaves reasoning with self-evaluation over multiple branches. They compose: a ReAct agent can use ToT internally to plan its next action.
Do I need a separate evaluator model? Not strictly, but it helps. Using the same model for both propose and evaluate works on easy tasks. On harder tasks, an evaluator with different priors (or a different temperature, or a different system prompt framing it as a critic) catches more bad branches.
How many candidates per step? Three to five is typical. More candidates means broader search, but without pruning the number of calls grows exponentially with depth. Beam-style BFS with width 5 and five proposals per step is a reasonable starting point.
Is ToT worth using on GPT-5.5? Usually not for general reasoning. GPT-5.5's internal reasoning matches or exceeds explicit ToT on most benchmarks. Use it when you need the audit trail, custom search structure, or proposer/evaluator separation.
Can ToT work with smaller models? Yes, and that is one of its strongest use cases. ToT with GPT-4o-mini as proposer and evaluator beats vanilla CoT with GPT-4o on Game of 24 in many runs, at a fraction of the cost per puzzle.
How does ToT interact with retrieval? Cleanly. Each "thought" can include a retrieval action ("look up X, then continue"). The evaluator scores not just the reasoning but the quality of the retrieved evidence. This is the entry point to agentic RAG.
Is there an open-source library for ToT? The Princeton authors released a reference implementation alongside the paper. LangGraph and DSPy both support ToT-style patterns, though usually you end up writing a custom loop for production. See LangChain vs LangGraph for context.