Least-to-most prompting is a technique where you ask a model to first decompose a hard problem into a sequence of easier sub-problems, then solve each sub-problem in order, with each answer feeding the next. It was introduced by Zhou et al. in the 2022 paper "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." The result that put it on the map was a jump from roughly 16 percent to over 99 percent on the SCAN compositional generalization benchmark with GPT-3 class models, a number that did not move much with vanilla Chain-of-Thought.
Four years later, the technique is still useful, but for different reasons. Modern reasoning models like GPT-5.5 and Claude Opus 4.7 do a lot of internal decomposition on their own. You do not need to spell out the steps to multiply two-digit numbers. Where least-to-most still pays off is on tasks the model has not seen at training time, on domain-specific compositional problems, and as an explicit prompt structure that improves reliability on long-tail inputs.
This article covers the original idea, a worked example, the comparison with Chain-of-Thought and Tree-of-Thoughts, and where the technique fits in 2026 when the strongest models already think before they answer. For broader prompt-engineering context, see Best Prompt Engineering Tools.
TL;DR
- What it is. A two-stage prompt: first decompose the problem into ordered sub-problems, then solve each in order, conditioning on prior answers.
- Origin. Zhou et al., "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models," 2022.
- Key result. Solved compositional generalization tasks (SCAN, last-letter concatenation, complex math word problems) where Chain-of-Thought stalled.
- vs CoT. Chain-of-Thought reasons inside one answer. Least-to-most explicitly factors the problem first, then solves each piece.
- vs Tree-of-Thoughts. Tree-of-Thoughts explores alternative reasoning branches. Least-to-most stays linear.
- In 2026. Less needed for well-known task families because reasoning models decompose internally. Still valuable for novel domains and as a reliability scaffold.
The original idea
The Zhou et al. paper observed that Chain-of-Thought prompting (Wei et al. 2022, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models") improves single-shot reasoning by encouraging the model to write intermediate steps. But CoT struggles when the test problem is harder than the few-shot examples, particularly on compositional tasks where the final answer requires combining sub-skills the model has only seen in isolation.
The fix the authors proposed is to make decomposition explicit. The prompt has two stages:
1. Decomposition. Given the problem, list the sub-problems that need to be solved, in the order they should be solved.
2. Solution. For each sub-problem in turn, solve it using the previously solved sub-problems as context.
The contribution was less the structure (decomposition has been a teaching idea forever) and more the empirical demonstration. On SCAN, a compositional command-execution benchmark, GPT-3 with CoT got around 16 percent. With least-to-most prompting it crossed 99 percent. The technique also transferred to last-letter concatenation and math word problems, with the largest gains on inputs longer or harder than the few-shot examples.
How it works in practice
Concretely, a least-to-most setup is two prompts run in sequence. The first asks for the decomposition. The second solves each sub-problem in turn.
```python
DECOMPOSE_PROMPT = """
You will be given a problem. Your job is to break it into a list of
simpler sub-problems that, solved in order, will let you answer the
original problem.

Problem: {problem}

Sub-problems (numbered, in order):
"""

SOLVE_PROMPT = """
Problem: {problem}

Sub-problems solved so far:
{prior}

Now solve sub-problem: {current}
Show your work, then give the answer on a final line as "Answer: ...".
"""
```
The runtime loop calls the model once to get the decomposition, then calls the model once per sub-problem, threading prior answers into the context. This is structurally a prompt chain where the chain is generated by the model rather than hard-coded.
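A minimal driver for those two templates might look like the sketch below. It assumes an OpenAI-style chat client and parses the numbered plan and the "Answer:" lines with regexes; the model name is a placeholder, not a recommendation.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def least_to_most(problem: str) -> str:
    # Stage 1: one call to get the numbered decomposition.
    plan = ask(DECOMPOSE_PROMPT.format(problem=problem))
    subs = [m.group(1).strip() for m in re.finditer(r"^\s*\d+\.\s*(.+)$", plan, re.M)]

    # Stage 2: one call per sub-problem, threading prior answers into the context.
    prior, answer = [], ""
    for sub in subs:
        out = ask(SOLVE_PROMPT.format(
            problem=problem,
            prior="\n".join(prior) or "(none)",
            current=sub,
        ))
        match = re.search(r"Answer:\s*(.+)", out)
        answer = match.group(1).strip() if match else out.strip()
        prior.append(f"{sub} => {answer}")
    return answer
```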
A worked example
Take a math word problem in the style of the original paper. "A juggler has 16 balls. Half of them are golf balls, and half of the golf balls are blue. How many blue golf balls are there?"
A modern reasoning model like GPT-5.5 will get this in one shot. But let's run least-to-most on a smaller model to see the structure.
Stage 1, decomposition. The model outputs:
1. How many golf balls are there?
2. How many blue golf balls are there?
Stage 2, solve in order.
Sub-problem 1: "How many golf balls are there?" Context: none yet. The model answers: "Half of 16 is 8. Answer: 8."
Sub-problem 2: "How many blue golf balls are there?" Context: sub-problem 1 gave us 8 golf balls. The model answers: "Half of 8 is 4. Answer: 4."
Notice how each step only has to reason about one thing. The compositional pressure (combine "half of all balls are golf" with "half of golf balls are blue") is dissolved by the ordering.
For a more interesting case, take a SCAN command: "jump twice and walk thrice." The decomposition produces:
1. What action does "jump twice" produce?
2. What action does "walk thrice" produce?
3. How are these combined for "and"?
Each piece is in the model's training distribution; the combination is what was novel. Least-to-most makes the combination explicit and the model executes it: "jump twice" expands to I_JUMP I_JUMP, "walk thrice" to I_WALK I_WALK I_WALK, and "and" concatenates the two in order, giving I_JUMP I_JUMP I_WALK I_WALK I_WALK in SCAN's action-token notation.
Least-to-most vs Chain-of-Thought
The key difference is that CoT keeps everything in one model call. The model writes its working, and you take whatever the final answer is. Least-to-most splits the decomposition into a separate stage with a separate prompt and threads sub-answers explicitly.
This means least-to-most is:
- More reliable on out-of-distribution problems. If the test problem is more complex than your few-shot examples, CoT can miss steps; least-to-most forces them.
- More expensive. N+1 calls instead of one. Latency scales with the number of steps, and total tokens grow roughly quadratically, because each step re-reads all prior answers (see the cost sketch after this list).
- Easier to debug. Each sub-problem has its own input and output. If step 3 is wrong, you know it.
- More verbose to author. You need two prompts, the orchestration, and the threading.
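A back-of-envelope cost model for an N-step chain, assuming P prompt tokens for the problem statement and roughly A tokens per sub-answer (a sketch, not a billing formula):

```python
def ltm_cost(N: int, P: int, A: int) -> tuple[int, int]:
    """Rough call and token counts for an N-step least-to-most chain."""
    calls = N + 1  # 1 decomposition call + N solve calls
    # Each solve call re-reads the problem plus all prior answers;
    # summing k-1 prior answers over k = 1..N, plus N new answers, gives N(N+1)/2.
    tokens = (N + 1) * P + A * N * (N + 1) // 2
    return calls, tokens
```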
When to reach for CoT. Single-pass reasoning that plays to the model's strengths: arithmetic, short logical chains, well-known task families.
When to reach for least-to-most. Compositional tasks where you suspect the model will skip steps, novel domains, and high-stakes tasks where you want per-step inspection.
For an excellent reference on the canonical CoT setup, see Wei et al. 2022 "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models."
Least-to-most vs Tree-of-Thoughts
Tree-of-Thoughts (Yao et al. 2023) generalizes CoT into a search procedure. At each step the model proposes multiple candidate continuations, an evaluator scores them, and the system explores promising branches. ToT is useful when there is real ambiguity at each step and exploring alternatives matters.
Least-to-most is strictly linear. There is no branching. One decomposition, one solution path. This makes it cheaper and easier to reason about than ToT, at the cost of being unable to recover from a bad decomposition.
A useful mental model:
- CoT. Think step by step inside one answer.
- Least-to-most. Plan the steps, then execute them in order.
- Tree-of-Thoughts. Plan, execute, but also try multiple paths and pick the best.
For most production work, CoT plus prompt chaining gets you 80 percent of the way. Least-to-most and ToT are reserved for harder reasoning surfaces.
Where it still matters in 2026
GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro all do meaningful internal decomposition. You can hand them the juggler problem and they will get it without prompting. So why bother with least-to-most?
Novel domains. When you are operating in a corpus the model has not memorized (your internal policy documents, a proprietary query language, a domain ontology), the model's internal decomposition is weaker. Explicit least-to-most prompting recovers reliability.
Long horizons. A reasoning model can think for a few thousand tokens, but on tasks that require dozens of dependent sub-answers the budget runs out and quality degrades. Externalizing the decomposition lets you spend that budget per sub-problem rather than across the whole task.
Inspectability. Regulatory or high-stakes work often requires the reasoning to be auditable per step. Least-to-most surfaces each sub-answer as a discrete artifact you can log, eval, and store.
Smaller models. If cost or latency pushes you to Haiku 4.5 or GPT-5.4-nano, those models benefit enormously from the structural prompt. The technique punches above its weight on smaller models.
Tool-augmented chains. When sub-problems require tool calls (look up a fact, run a calculation), least-to-most maps naturally onto a chain where some steps are LLM calls and some are tools. The decomposition becomes the plan.
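A sketch of that shape, with a toy calculator tool and a hypothetical `call_llm` helper; the arithmetic-detection regex is illustrative, not a real router:

```python
import re

def calculator(expression: str) -> str:
    # Toy tool: evaluate a bare arithmetic expression (demo only; never eval untrusted input).
    return str(eval(expression, {"__builtins__": {}}))

def run_plan(problem: str, sub_problems: list[str], call_llm) -> list[str]:
    """Execute a least-to-most plan where some steps are tool calls and some are LLM calls."""
    prior = []
    for sub in sub_problems:
        if re.fullmatch(r"[\d\s+\-*/().]+", sub):  # looks like bare arithmetic
            answer = calculator(sub)
        else:
            answer = call_llm(problem=problem, prior=prior, current=sub)
        prior.append(f"{sub} => {answer}")
    return prior
```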
Implementing least-to-most with structured outputs
A robust implementation uses structured outputs for the decomposition so you do not have to parse free text.
```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.respan.ai/v1")

# Literal JSON braces are doubled so str.format leaves them alone.
DECOMPOSE = """
Decompose the problem into ordered sub-problems.
Return JSON: {{"sub_problems": ["...", "...", ...]}}.

Problem: {problem}
"""

SOLVE = """
Problem: {problem}

Solved so far:
{prior}

Solve this sub-problem and return JSON {{"answer": "..."}}:
{current}
"""

def least_to_most(problem: str, model: str = "anthropic/claude-sonnet-4.6") -> str:
    # Stage 1: one call for the plan, parsed as JSON.
    plan = json.loads(client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": DECOMPOSE.format(problem=problem)}],
        response_format={"type": "json_object"},
    ).choices[0].message.content)

    # Stage 2: one call per sub-problem, threading prior answers into the context.
    prior_lines = []
    answer = ""
    for sub in plan["sub_problems"]:
        result = json.loads(client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": SOLVE.format(
                problem=problem,
                prior="\n".join(prior_lines) or "(none)",
                current=sub,
            )}],
            response_format={"type": "json_object"},
        ).choices[0].message.content)
        answer = result["answer"]
        prior_lines.append(f"- {sub} => {answer}")
    return answer  # the final sub-answer is the answer to the original problem
```
In production, wrap each LLM call in tracing so each sub-problem becomes its own span. See LLM Tracing.
Pitfalls
The decomposition is wrong. Garbage in, garbage out. If the model produces a bad plan, every subsequent step inherits it. Mitigation: evaluate the decomposition step on a fixture, and consider sampling multiple decompositions and voting.
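A minimal sketch of the voting idea, reusing the `client` and `DECOMPOSE` template from the implementation above. Exact-string voting is crude (free-text plans rarely collide verbatim), so the light normalization here is illustrative; in practice you might vote on final answers instead.

```python
import json
from collections import Counter

def vote_on_decomposition(problem: str, model: str, k: int = 5) -> list[str]:
    """Sample k plans at nonzero temperature and keep the most common one."""
    votes = Counter()
    for _ in range(k):
        content = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": DECOMPOSE.format(problem=problem)}],
            response_format={"type": "json_object"},
            temperature=0.8,  # diversity across samples
        ).choices[0].message.content
        subs = json.loads(content)["sub_problems"]
        # Normalize lightly so trivially different phrasings can still collide.
        votes[tuple(s.strip().lower().rstrip("?.") for s in subs)] += 1
    return list(votes.most_common(1)[0][0])
```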
Sub-problem prompts forget the original context. Always pass the original problem alongside the prior answers. The model needs to know what it is ultimately trying to solve.
Cost balloons on long decompositions. A 12-step plan means 13 calls. Set a max-steps budget and have the prompt return early when the answer is in hand.
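A sketch of that guard; the `done` flag is hypothetical and would need to be added to the SOLVE prompt's output format:

```python
MAX_STEPS = 8  # hard cap on solve calls, regardless of plan length

def run_with_budget(sub_problems: list[str], solve) -> str:
    """solve(sub) returns a dict with 'answer' and an optional 'done' flag."""
    answer = ""
    for sub in sub_problems[:MAX_STEPS]:  # never exceed the budget
        result = solve(sub)
        answer = result["answer"]
        if result.get("done"):  # model signals the original problem is already solved
            break
    return answer
```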
Mixing styles. Do not mix free-form CoT inside each sub-step with the least-to-most outer loop unless you actually want both. Pick one structure.
Skipping evaluation. With this many moving parts, eval is not optional. Add a per-step eval that checks each sub-answer against a fixture for your most common task shapes. See What Is Prompt Evaluation.
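As a sketch, a per-step fixture for the juggler example might look like this; `solve_step` stands in for whatever function runs one sub-problem:

```python
# Expected plan and per-step answers for one common task shape.
FIXTURE = {
    "problem": "A juggler has 16 balls. Half of them are golf balls, "
               "and half of the golf balls are blue. How many blue golf balls are there?",
    "steps": [
        ("How many golf balls are there?", "8"),
        ("How many blue golf balls are there?", "4"),
    ],
}

def eval_chain(solve_step) -> None:
    """solve_step(problem, prior, sub) -> answer string; check each step independently."""
    prior = []
    for sub, expected in FIXTURE["steps"]:
        got = solve_step(FIXTURE["problem"], prior, sub)
        assert expected in got, f"step {sub!r}: expected {expected!r}, got {got!r}"
        prior.append(f"{sub} => {got}")
```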
FAQ
Did Zhou et al. invent decomposition prompting? The general idea of breaking down problems predates LLMs by decades. The contribution of the 2022 paper was the empirical case for ordering and conditioning on prior answers as a prompt-only technique for GPT-3-class models.
Should I still use least-to-most with reasoning models? Often no, on tasks the model handles natively. Yes, on novel domains, long horizons, or when you need per-step inspection.
Is least-to-most a kind of prompt chaining? Yes. It is a chain where the model generates the steps. Compare to a hand-authored prompt chain where the developer fixes the steps.
How is this different from agents? Agents loop and choose tools based on observations. Least-to-most produces a plan up front and executes it linearly. No mid-flight replanning, no tool selection.
Can I parallelize the sub-problems? Only if they are independent. The whole point of least-to-most is dependency: step k uses step k-1's answer. If your decomposition produces independent sub-problems, you have a fan-out, not a least-to-most chain.
What is the relationship to "self-ask"? Self-ask is a near-cousin: the model asks itself follow-up questions and answers them. Least-to-most is more structured: a full decomposition first, then sequential solution.
Does it help with hallucinations? Indirectly. By making each step small and inspectable, hallucinations are easier to catch. It does not prevent them.