Prompt chaining is the practice of breaking a task into a series of smaller LLM calls, where each step has a narrow job and each step's output feeds the next. Instead of asking one giant prompt to "read this contract, find the risky clauses, rewrite them, and produce a summary," you split that into four calls: extract clauses, classify risk, rewrite, summarize. Each call is shorter, easier to evaluate, and easier to debug when it goes wrong.
If you have spent any time on production LLM code, you have seen the failure mode that motivates chaining. A 2,000-token system prompt that tries to do five things at once works on the demo, then quietly degrades on the long tail. One step ships malformed JSON. Another forgets a constraint buried on line 47. You cannot tell which because there is only one log line. Prompt chaining is the structural fix.
This article covers what prompt chaining is, the patterns that show up most often in production, a runnable Python example, the trade-offs versus agents and tool use, and the observability hooks that make chains debuggable. For broader context on building reliable LLM systems, see LLM Observability and LLM Evals.
TL;DR
- Definition. A pipeline of sequential LLM calls where each step's output becomes the next step's input.
- Why it beats one giant prompt. Higher reliability, smaller token windows per step, isolated debugging, cheaper retries.
- Common patterns. extract then reason then format; retrieve then rerank then answer; plan then execute then verify; classify then route then respond.
- Versus agents. Chains are static and deterministic. Agents loop and choose tools. Reach for chains first.
- What you need. A prompt registry, structured outputs at each step, and tracing so you can see every hop.
- When it breaks. Error compounding. Step 3 inherits step 2's hallucination. Evaluate every step, not just the final answer.
Why a chain beats a mega-prompt
Four reasons, and they all reinforce each other.
Reliability. Each call has one job. A prompt that only needs to extract dates from a paragraph is short, easy to test, and rarely wrong. A prompt that needs to extract dates, summarize the paragraph, decide a routing label, and produce JSON has four times the surface area for failure. In practice, splitting that 4-in-1 prompt into four steps cuts end-to-end error rate in half, sometimes more.
Debuggability. When the final answer is wrong, you need to know which hop produced the bad data. With one mega-prompt the only signal is "the output is wrong." With a chain, you look at the trace, find the step where the data went off, and fix that prompt in isolation.
Cost and latency control. Different steps need different models. The extraction step might run fine on Haiku 4.5 at a fraction of the price of Opus 4.7. The reasoning step needs the heavy model. With a chain you route each step to the cheapest model that meets its bar. A mega-prompt forces you to pay flagship rates for the easy parts.
Evaluation. You can ship an eval for each step independently. The extraction step gets evaluated against a fixture of paragraphs and ground-truth dates. The classification step gets evaluated against a confusion matrix. Per-step evals are tractable, fast, and give you a clear answer when a regression shows up. See What Is Prompt Evaluation.
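To make that concrete, a per-step eval can be a few lines over a fixture. This is a minimal sketch, assuming a hypothetical extract_dates(text) wrapper around the extraction prompt that returns ISO date strings:

# Minimal per-step eval sketch. extract_dates() is a hypothetical wrapper
# around the extraction prompt; the fixture pairs inputs with ground truth.
FIXTURE = [
    ("The lease starts on 2024-03-01 and ends on 2025-02-28.",
     ["2024-03-01", "2025-02-28"]),
    ("Payment is due within 30 days of invoice.", []),
]

def eval_extraction(extract_dates) -> float:
    hits = sum(1 for text, expected in FIXTURE if extract_dates(text) == expected)
    return hits / len(FIXTURE)  # step-level accuracy, tracked per prompt version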
The patterns that ship
Most production chains are a variation on one of four shapes.
Extract then reason then format
The most common pattern in document workflows. You have a long input (an email thread, a PDF, a transcript). Step one pulls structured facts. Step two reasons over those facts. Step three formats the final response.
[long PDF]
-> step 1: extract {parties, amounts, dates, jurisdictions} as JSON
-> step 2: classify {risk_level, requires_legal_review}
-> step 3: render markdown summary for Slack
Splitting like this lets you cache the extraction step (the PDF rarely changes), reason cheaply over the small JSON instead of the whole PDF, and swap the formatting step (Slack today, email tomorrow) without retraining anything.
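One way to get that caching, sketched below under the assumption of a hypothetical extract_facts(pdf_text) wrapper for step one, is to key the extraction output on a hash of the document:

import hashlib

_extraction_cache: dict[str, dict] = {}

def cached_extract(pdf_text: str, extract_facts) -> dict:
    # Key the cache on document content; the PDF rarely changes, so repeat
    # runs skip the most expensive call in the chain.
    key = hashlib.sha256(pdf_text.encode()).hexdigest()
    if key not in _extraction_cache:
        _extraction_cache[key] = extract_facts(pdf_text)  # the step-one LLM call
    return _extraction_cache[key]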
Retrieve then rerank then answer
The RAG pattern, written as a chain. Step one runs vector search to fetch candidate chunks. Step two uses a small LLM to rerank those chunks by relevance to the question. Step three uses a larger model to write the final answer grounded in the top-k chunks.
The rerank step is the win here. Cheap retrievers return noisy results; a tight rerank prompt pushes precision up dramatically before you spend tokens on the answer step. For more on this, see What Is a RAG Pipeline.
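A rough sketch of the rerank step, assuming a hypothetical score_relevance(question, chunk) helper that wraps a cheap model call and returns a numeric score:

def rerank(question: str, chunks: list[str], score_relevance, top_k: int = 5) -> list[str]:
    # score_relevance() is a cheap LLM call returning a 0-10 score; keep only
    # the highest-scoring chunks before spending tokens on the answer step.
    scored = [(score_relevance(question, c), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]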
Plan then execute then verify
Useful when the task has multiple sub-questions. Step one generates a plan: a list of sub-questions to answer. Step two answers each sub-question in parallel (this is where chains start to look like graphs). Step three writes the unified answer and verifies it against the plan.
This is the shape behind most "deep research" features shipped in 2025 and 2026. It scales because each sub-question is independent. Note: once your plan step starts choosing tools dynamically, you have crossed into agent territory. See the comparison section below.
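A minimal sketch of the fan-out in step two, assuming a hypothetical async answer_subquestion(question) helper that wraps one LLM call:

import asyncio

async def execute_plan(sub_questions: list[str], answer_subquestion) -> list[str]:
    # Each sub-question is independent, so the whole plan costs roughly one
    # call's worth of latency instead of N sequential calls.
    return await asyncio.gather(*(answer_subquestion(q) for q in sub_questions))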
Classify then route then respond
The pattern behind any support bot, smart inbox, or product router. Step one classifies the incoming message into one of N intents. Step two routes to a specialized prompt or tool based on that intent. Step three composes the response.
The router is short and cheap. The specialized prompts are focused and high quality. Compare to a mega-prompt trying to handle every intent in one shot; the router pattern wins on every axis. See Intent Classification With LLMs for the classification step in detail.
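In code, the route step is often just a mapping from intent to a specialized prompt; the prompt texts below are illustrative placeholders:

SPECIALIZED_PROMPTS = {
    "billing": "You are a billing specialist. Resolve the charge question...",
    "bug": "You are a support engineer. Ask for repro steps if they are missing...",
    "feature_request": "Thank the customer and log the request for the product team...",
}

def route(intent: str) -> str:
    # Fall back to a generic prompt for intents the classifier marks "other".
    return SPECIALIZED_PROMPTS.get(intent, "You are a helpful support agent...")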
A runnable example
A small chain that classifies an incoming support ticket, extracts the entities, and drafts a reply. Three steps; the first two return structured JSON handoffs.
import os
import json
from openai import OpenAI
client = OpenAI(
    base_url="https://api.respan.ai/v1",
    api_key=os.environ["RESPAN_API_KEY"],
)
CLASSIFY = """You are a support triage classifier.
Return JSON: {"intent": "billing" | "bug" | "feature_request" | "other", "urgency": "low" | "medium" | "high"}.
Ticket: {ticket}"""
EXTRACT = """Extract entities from the ticket as JSON:
{"account_id": str | null, "product_area": str | null, "error_code": str | null}.
Ticket: {ticket}"""
DRAFT = """Draft a reply for a {intent} ticket with {urgency} urgency.
Customer context: {entities}.
Keep it under 120 words. Friendly, specific, no marketing fluff.
Ticket: {ticket}"""
def step(prompt, model="anthropic/claude-haiku-4.5", json_mode=False):
    kwargs = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}
    return client.chat.completions.create(**kwargs).choices[0].message.content
def run_chain(ticket: str) -> str:
    cls = json.loads(step(CLASSIFY.format(ticket=ticket), json_mode=True))
    ent = json.loads(step(EXTRACT.format(ticket=ticket), json_mode=True))
    reply = step(
        DRAFT.format(ticket=ticket, intent=cls["intent"], urgency=cls["urgency"], entities=json.dumps(ent)),
        model="anthropic/claude-sonnet-4.6",
    )
    return reply

A few things worth noting. The first two steps use Haiku 4.5 because they are simple. The drafting step uses Sonnet 4.6 because it benefits from a stronger model. Each step has its own prompt that you can version and evaluate independently. The JSON mode calls keep the intermediate hops machine-parseable. Wrap the chain in tracing and every hop becomes a span with cost, latency, and model attribution; see LLM Tracing.
Prompt chaining vs agents vs tool use
People often conflate these. The distinctions matter.
Prompt chain. A static, predetermined sequence of LLM calls. You wrote the order. The graph does not change at runtime. Deterministic, easy to reason about, easy to evaluate.
Tool use. A single LLM call where the model can invoke functions you defined. The model decides which tool to call and with what arguments. Useful when you want the model to read from a database or call an API as part of one turn.
Agent. A loop. The model takes an action, observes the result, decides the next action, and repeats until done or hits a budget. Tool use is one ingredient; the loop and the open-ended planning are what make it an agent.
Rule of thumb. Start with a chain. If a chain cannot capture the variability of the task because the next step truly depends on what the previous step found, add tool use. If the variability is so high that you cannot predict a sensible step count, you have an agent. Agents are powerful and expensive in tokens, latency, and failure modes. Most teams reach for them too early. See What Is Agentic RAG for the next step up.
Failure modes to plan for
Error compounding. A 5 percent error rate per step over four steps gives you an 81 percent end-to-end success rate. Watch the multiplication. The fix is per-step evaluation plus retries on failures you can detect, like malformed JSON.
Bad handoff schemas. Step 1 emits a field step 2 does not expect, or step 2 emits a field step 3 misreads. Define schemas explicitly using JSON Schema or Pydantic and validate at each boundary. Failure to parse is a known failure, not a silent one.
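A sketch of one validated boundary with a scoped retry, using Pydantic; call_classifier is a stand-in for the step-one LLM call:

from pydantic import BaseModel, ValidationError

class Classification(BaseModel):
    intent: str
    urgency: str

def classify_with_retry(ticket: str, call_classifier, max_retries: int = 2) -> Classification:
    # Parse failures are caught at the boundary and retried in place,
    # so a malformed hop never propagates downstream.
    for attempt in range(max_retries + 1):
        raw = call_classifier(ticket)
        try:
            return Classification.model_validate_json(raw)
        except ValidationError:
            if attempt == max_retries:
                raise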
Hidden state. Some teams pass the entire conversation forward at every step. This blows up tokens, leaks context, and makes each step harder to reason about. Pass only what the next step needs. If a step needs the original input, pass the original input; do not paste the previous step's chain-of-thought.
Untraced chains. If you cannot see every hop, you will spend hours bisecting failures by hand. Wrap every step in tracing from day one.
No prompt versioning. When you change step 2, you have changed the input distribution that step 3 sees. Without prompt versioning, you have no rollback. See What Is Prompt Versioning.
How to make a chain debuggable
Three practices, in priority order.
- Trace every hop. Each LLM call is a span. Capture model, prompt id, input tokens, output tokens, latency, cost, and the actual prompt and response strings. When the final answer is wrong, you open the trace and bisect in seconds.
- Evaluate every step. Add a per-step eval that runs on a fixture of inputs. A regression in step 2 is detectable before it ships. The mega-prompt cannot do this; the chain makes it natural.
- Version every prompt. Each step's prompt lives in a registry with a version id. The trace records which version produced each hop. A bad change is one rollback away.
Together these three turn a chain from "vibes-driven code that mostly works" into a system you can operate.
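A minimal sketch of the first practice: wrap each call so it records a span. Real setups ship these spans to a tracing backend and add token and cost fields from the API response.

import time
import uuid

TRACE: list[dict] = []

def traced_step(name: str, prompt_version: str, call, *args, **kwargs):
    # One span per LLM call: which step and prompt version ran, how long it
    # took, and a preview of the output for bisecting bad hops.
    start = time.perf_counter()
    output = call(*args, **kwargs)
    TRACE.append({
        "span_id": uuid.uuid4().hex,
        "step": name,
        "prompt_version": prompt_version,
        "latency_s": round(time.perf_counter() - start, 3),
        "output_preview": str(output)[:200],
    })
    return output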
FAQ
Is prompt chaining the same as a workflow? Effectively, yes, for LLM workloads. A "chain" is workflow vocabulary applied to a sequence of model calls. Same idea: steps, inputs, outputs, retries.
When should I switch from a chain to an agent? When the next step truly cannot be predicted ahead of time and depends on the result of the previous step in non-trivial ways. If you can write the graph on a napkin, a chain is fine. If the graph is "the model picks," you have an agent.
Do I need a framework for chains? No. Most production chains are a hundred lines of Python with a few await calls and a typed schema between steps. Frameworks help with prebuilt patterns; they also add cognitive overhead. Start with plain functions.
Can I parallelize a chain? Yes. Steps that do not depend on each other can run in parallel. Use asyncio.gather in Python. Latency drops, cost stays the same.
How does prompt chaining interact with structured outputs? Beautifully. Structured outputs at each hop give you machine-parseable handoffs and remove a category of "the model added a preamble" bugs.
Does chaining cost more? Per token, yes, because you make more calls. Per successful task, often less, because you can route cheap steps to cheap models and retries are scoped to the failed step.
What if step 3 needs the original document, not just step 2's summary? Pass it forward. Chains are not a forced waterfall. Each step can take any subset of upstream state as input.