Chain-of-thought (CoT) prompting is the technique of asking a language model to produce its intermediate reasoning steps before committing to a final answer. Instead of "What is 17 times 24?" returning "408" directly, the model writes out "17 times 20 is 340, 17 times 4 is 68, 340 plus 68 is 408." That visible reasoning trace, it turns out, makes the final answer dramatically more accurate on multi-step problems.
The technique came out of Google Research in early 2022, before ChatGPT existed. Back then, scaling alone was not enough to crack arithmetic word problems or symbolic reasoning. CoT was the first widely adopted prompting trick that materially closed that gap, and it set the template for everything that followed: least-to-most, self-consistency, tree-of-thoughts, ReAct.
In 2026, the picture is more nuanced. Frontier models like GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro have reasoning baked in. Explicit "let's think step by step" no longer helps the way it did in 2022, and sometimes hurts. But CoT remains essential for smaller models, for structured tasks where you want auditable traces, and as the conceptual foundation for every modern reasoning pattern.
TL;DR
- What it is: prompting an LLM to produce intermediate reasoning steps before a final answer.
- Origin: Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Google Research, 2022).
- Two flavors: zero-shot CoT ("Let's think step by step") and few-shot CoT (worked-example demonstrations).
- When it helps: arithmetic, symbolic reasoning, multi-hop questions on non-reasoning models, and any task where you want an auditable trace.
- When it hurts: native reasoning models (GPT-5.5, Claude 4.7 with extended thinking) often regress when forced into explicit CoT; latency and cost go up; verbose answers fail strict output schemas.
- Modern role: still useful for small models, evals, and as the basis for least-to-most, ToT, and ReAct.
Origin: Wei et al., 2022
The canonical reference is Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," presented at NeurIPS 2022. The paper showed that when you prompt a sufficiently large model with a handful of input-output pairs that include reasoning steps, accuracy on math word problems (GSM8K), commonsense reasoning (StrategyQA), and symbolic tasks jumps significantly. The effect was emergent: small models did not benefit, but performance improved sharply once model scale crossed roughly 100B parameters.
A companion paper, Kojima et al., "Large Language Models are Zero-Shot Reasoners" (2022), showed you do not even need few-shot examples. Appending the literal string "Let's think step by step." was enough to unlock the same reasoning trace on many tasks. That one sentence became the most copy-pasted line in prompt engineering for two years.
How CoT works mechanically
There are two common views, and both are useful.
The autoregressive view. Language models predict the next token conditioned on everything that came before. If you force the model to produce a sequence of partial computations first, each partial result becomes part of the conditioning context for the final answer. The model is, in effect, computing on a longer "working tape," which lets it solve problems that need more than one forward pass of internal computation.
The distributional view. Pretraining data contains lots of worked examples: textbook solutions, Stack Overflow answers, step-by-step tutorials. When you prompt the model with a few-shot CoT example, you push it into the region of its distribution where step-by-step explanations are the norm. The model produces text that looks like a worked example, and worked-example text tends to be correct.
Neither view is the whole story, but both predict the same thing: visible reasoning helps when the underlying task needs more compute than one token-prediction step provides.
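A toy illustration of the working-tape idea, using the arithmetic example from the introduction (plain strings, no API call; the point is what ends up in the conditioning context):

question = "What is 17 times 24?"

# Direct prompt: the model must emit "408" conditioned only on the question.
direct_prompt = question

# CoT prompt: every partial result the model writes becomes context for the
# next token, so the final "408" is predicted from "340 plus 68 is", not from
# the bare question.
partial_trace = "17 times 20 is 340. 17 times 4 is 68. 340 plus 68 is"
cot_prompt = f"{question}\n{partial_trace}"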
Zero-shot CoT
Zero-shot CoT is the lightest possible version. You add one trigger phrase and let the model improvise.
from openai import OpenAI
client = OpenAI()
prompt = """A juggler can juggle 16 balls. Half of the balls are golf balls,
and half of the golf balls are blue. How many blue golf balls are there?
Let's think step by step."""
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)

On a small or older model, that single trailing sentence typically takes accuracy on GSM8K from the 20s to the 50s. On GPT-5.5 in 2026, it usually has no effect or slightly hurts, because the model already routes hard prompts through its internal reasoning stage.
A small variant: ask for the answer in a specific final format so you can parse it.
prompt += "\n\nThink step by step, then output the final answer on a line starting with 'Answer:'."

That keeps zero-shot CoT compatible with downstream parsing.
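One way to parse that final line, reusing the client from the snippet above (a sketch; the same regex shows up in the TypeScript version below):

import re

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
text = resp.choices[0].message.content or ""

# Keep the full trace for logging, but hand only the final value downstream.
match = re.search(r"Answer:\s*(.+)", text)
answer = match.group(1).strip() if match else None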
Few-shot CoT
Few-shot CoT is the original Wei et al. recipe. You include 2 to 8 worked examples in the prompt, each showing input, reasoning, and answer. The model imitates the pattern.
fewshot = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A: The cafeteria had 23 apples originally. They used 20 to make lunch, so they had
23 - 20 = 3. They bought 6 more, so they had 3 + 6 = 9. The answer is 9.
Q: {question}
A:"""
question = "A store had 50 shirts. They sold 18 and received a shipment of 25 more. How many do they have?"
prompt = fewshot.format(question=question)

Few-shot CoT outperforms zero-shot on most benchmarks because the demonstrations anchor both the reasoning style and the output format. It costs more tokens, which matters for high-volume workloads.
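Sending the formatted prompt uses the same chat call as before (a sketch reusing the client defined earlier; with these demonstrations the model should end with "The answer is 57," since 50 - 18 + 25 = 57):

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
# The demonstrations anchor the format, so the final sentence is easy to parse.
print(resp.choices[0].message.content)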
CoT in TypeScript
Same idea, different language.
import OpenAI from "openai";
const client = new OpenAI();
async function cotAnswer(question: string) {
  const resp = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "You solve problems step by step. End with a line: 'Answer: <value>'.",
      },
      { role: "user", content: question },
    ],
  });
  const text = resp.choices[0].message.content ?? "";
  const match = text.match(/Answer:\s*(.+)/);
  return { reasoning: text, answer: match?.[1]?.trim() };
}

When you log the reasoning field to a tracing platform, you get an auditable trail of how the model got to each answer. See LLM tracing for how teams wire this into production.
CoT vs least-to-most vs tree-of-thoughts
CoT is the simplest reasoning prompt. Two follow-ups push it further.
Least-to-most prompting (Zhou et al., 2022, "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models") breaks a hard problem into a list of subproblems, then solves them in order, feeding each answer into the next prompt. It outperforms vanilla CoT on tasks that require longer compositional reasoning, like SCAN-style command parsing or multi-hop QA. See least-to-most prompting for a full walkthrough.
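A minimal sketch of that two-stage loop, reusing the OpenAI client from earlier (the decomposition prompt and helper name are illustrative, not the exact recipe from the paper):

def least_to_most(question: str) -> str:
    # Stage 1: ask the model to decompose the problem into ordered subquestions.
    decomposition = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"{question}\n\nList the subquestions needed to solve this, one per line.",
        }],
    ).choices[0].message.content or ""
    subquestions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve each subquestion in order, feeding earlier answers back in.
    context, answer = question, ""
    for sub in subquestions:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": f"{context}\n\nQ: {sub}\nA:"}],
        )
        answer = resp.choices[0].message.content or ""
        context += f"\nQ: {sub}\nA: {answer}"
    return answer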
Tree-of-thoughts (Yao et al., 2023) generalizes CoT into a search procedure: at each step, the model proposes multiple candidate thoughts, evaluates them, and keeps the most promising branches. It is more expensive and more powerful, especially on planning and puzzle-style tasks.
A useful mental model:
- CoT: one linear chain of reasoning.
- Least-to-most: an explicit decomposition followed by a chain.
- Tree-of-thoughts: a branching search over chains, with a critic at each step.
You can think of CoT as "imagine one solution path." Least-to-most as "imagine the outline first, then fill it in." ToT as "imagine many paths and search."
When CoT helps, when it hurts
Helps:
- Older or smaller models (under roughly 30B parameters, or any model without a built-in reasoning mode).
- Arithmetic and symbolic tasks where the model needs to manipulate values across multiple steps.
- Multi-hop questions that require chaining facts the model already knows.
- Audit trails. When you need to show a reviewer or eval system how the model got to its answer.
Hurts or no-ops:
- Native reasoning models. GPT-5.5 has reasoning baked in. Claude Opus 4.7 has extended thinking. Gemini 3.1 Pro has its own internal scratchpad. Forcing them into explicit CoT often produces redundant output and occasionally regresses accuracy because they were not tuned to expose intermediate reasoning the same way.
- Latency-sensitive paths. A 300-token reasoning trace is 300 tokens of latency on top of the final answer.
- Strict schemas. If your application expects a JSON object, CoT prose has to be parsed out or stripped. Use structured outputs or function calling instead.
- Tasks the model would get right anyway. "Translate this sentence to French" does not benefit from CoT.
A useful rule for 2026: if you are on GPT-5.5 or Claude 4.7 with a single-step task, skip explicit CoT. If you are on Sonnet 4.6, Haiku, GPT-5 nano, or any open-source model under 70B for a multi-step task, try CoT first.
Measuring whether CoT helps
Do not assume. Run an eval.
from respan import respan
respan.init()
# client is the OpenAI client from the earlier snippets; load_eval_set and
# log_pair are placeholder helpers for your own dataset loading and logging.
cases = load_eval_set("gsm8k_subset.jsonl")

for case in cases:
    plain = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": case["question"]}],
    )
    cot = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user",
             "content": case["question"] + "\n\nLet's think step by step."}
        ],
    )
    log_pair(case, plain, cot)

Wire that into prompt evaluation so you can see the lift (or regression) per model and per task type. Teams often find CoT helps on one slice (arithmetic) and hurts on another (factual lookup) within the same product.
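If you also want a headline accuracy number, a small scorer can pull the last integer out of each response and compare it to the gold label (a sketch; it assumes each case carries a numeric "answer" field, which is not shown above):

import re

def last_int(text: str) -> str | None:
    # GSM8K answers are integers; grab the last one the model wrote.
    numbers = re.findall(r"-?\d+", text.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(case, resp) -> bool:
    prediction = last_int(resp.choices[0].message.content or "")
    return prediction is not None and prediction == str(case["answer"])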
FAQ
Is chain-of-thought still useful in 2026? Yes, for smaller models, for tasks that need auditable reasoning, and as the conceptual basis for least-to-most and ToT. On frontier reasoning models (GPT-5.5, Claude Opus 4.7 with extended thinking, Gemini 3.1 Pro), explicit CoT is often unnecessary and sometimes hurts.
What is the difference between CoT and "extended thinking"? Extended thinking (Anthropic) and reasoning tokens (OpenAI o-series, GPT-5.5) are server-side, model-trained versions of the same idea. The model produces internal reasoning tokens that you may or may not see in the response, depending on the API. CoT is the prompt-level technique you apply yourself.
Should I use CoT with structured outputs? Not directly. CoT produces prose; structured outputs want JSON. Either run a two-step prompt (CoT first, then a "now extract the answer as JSON" call) or use a reasoning-capable model that supports a separate reasoning field.
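A minimal two-step sketch (the question variable and the JSON shape are illustrative; response_format here requests the OpenAI JSON mode):

reasoning = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": question + "\n\nThink step by step."}],
).choices[0].message.content or ""

# Second call: strip the prose down to a machine-readable object.
extraction = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": f"Reasoning:\n{reasoning}\n\nReturn only JSON of the form {{\"answer\": <value>}}.",
    }],
)
print(extraction.choices[0].message.content)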
Does CoT increase hallucinations? It can. A model that reasons confidently down a wrong path will commit to a wrong answer. Self-consistency (sampling multiple chains and majority-voting) mitigates this on hard tasks.
What is self-consistency? Wang et al., 2022, "Self-Consistency Improves Chain of Thought Reasoning in Language Models." Sample N CoT chains at temperature > 0, take the majority final answer. Cheap, robust, often beats single-shot CoT.
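A minimal sketch of self-consistency on top of the zero-shot setup above (the sample count and temperature are arbitrary choices, not values from the paper):

import re
from collections import Counter

def self_consistent_answer(question: str, n: int = 5) -> str | None:
    votes = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0.8,  # > 0 so the sampled chains actually differ
            messages=[{
                "role": "user",
                "content": question + "\n\nThink step by step, then end with 'Answer: <value>'.",
            }],
        )
        text = resp.choices[0].message.content or ""
        match = re.search(r"Answer:\s*(.+)", text)
        if match:
            votes.append(match.group(1).strip())
    # Majority vote over the final answers; the divergent reasoning is discarded.
    return Counter(votes).most_common(1)[0][0] if votes else None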
Does CoT work with images and multimodal inputs? Yes. Multimodal CoT (Zhang et al., 2023) showed that asking a vision-language model to describe relevant image content before answering improves accuracy on chart and diagram questions.
Is "Let's think step by step" really the magic phrase? It was, on 2022-era models. On modern models the exact phrase matters less than the structure. "Show your work" or "Explain your reasoning, then give the final answer" work as well.