The internet has settled into two camps on agent workflows. Camp one writes essays about whether you should call it an "agent" or a "workflow." Camp two ships agent systems and then quietly rolls them back when they cost $40 per query and answer half the time.
I think the camp-two problem is fixable. Most production agent workflows fail in predictable ways, and most of those failures map back to one of five workflow patterns plus three anti-patterns. This piece is the field guide we use at Respan after watching thousands of production agent loops trace through our backend. Patterns that work, with skeletons. Patterns that don't, with the reason. And for every pattern, the one trace signal that tells you it has gone bad.
TL;DR
- An agent workflow is any LLM application that makes more than one call per user request and lets a model influence what happens next. "Agent" and "workflow" are not different categories; they sit on a spectrum of how much routing freedom the model has.
- Five patterns ship reliably: router, parallelizer, evaluator-optimizer, orchestrator-workers, hierarchical handoff.
- Three patterns burn money: pure ReAct loops without a step budget, fully autonomous "let the agent decide what to do," recursive subagent spawning.
- Every pattern has one failure mode that matters: token cost, latency, loop divergence, or judge-evaluator collapse. Knowing which one applies to your pattern is the difference between a 2-hour fix and a 2-week incident.
- Tracing is the only way to debug agent workflows. Without trace trees you are guessing what the model did between input and output. With traces, the failure mode is usually visible in the first 30 seconds.
Workflow vs agent: it is a spectrum
The cleanest framing is the one Anthropic published in "Building Effective Agents" in late 2024. A workflow is code that orchestrates LLM calls along a fixed path. An agent is code that lets the LLM choose the path. Both are valid. The decision is about how much freedom the model needs to do the job.
Workflows are predictable, cheaper, and easier to debug. Agents handle open-ended problems but cost more and fail in more interesting ways. Most production systems are somewhere in between. A router picks one of three pre-built workflows. An orchestrator-workers setup hands each subtask to a constrained agent. Pure agents (model picks any tool, no path constraints) are rare in production for good reason.
If you take one thing from this piece: start with the most constrained pattern that solves your problem, then loosen the constraints only when measurement says the constraint is the bottleneck.
Pattern 1: Router
Shape. One LLM call classifies the input. Code dispatches to one of N specialized handlers.
Code skeleton.
ROUTER_PROMPT = """Classify this user message into one of:
- billing
- technical_support
- sales
- other
Message: {message}
Return JSON: {{"category": "billing|technical_support|sales|other", "reason": "..."}}
"""

def route(message: str):
    # llm_call is assumed to return the model's JSON reply already parsed into a dict
    decision = llm_call(ROUTER_PROMPT.format(message=message))
    # HANDLERS maps each category name to its specialized handler function
    handler = HANDLERS[decision["category"]]
    return handler(message)

When to use. Whenever you would otherwise concatenate three or four use cases into one giant system prompt. Routing lets you keep each handler simple. The billing handler doesn't need to know how to do technical support. The sales handler doesn't have to be told not to refund anything.
The failure mode. Confident misrouting. The router commits to a wrong bucket with 0.95 confidence and the user gets a billing answer to a technical question. Look-alike inputs (e.g. "I was charged for a feature that doesn't work") drift between billing and tech support depending on phrasing.
Trace signal that tells you. A spike in the "other" bucket, or downstream handlers escalating more often. If 8% of routed sessions end with the wrong handler escalating, the router is the bug.
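Both router signals can be computed from logged sessions. The following is a minimal sketch, not Respan's implementation: it assumes each session is logged as a dict with the hypothetical fields `category` and `handler_escalated`, and it treats any category outside the prompt's four buckets as "other".

```python
from collections import Counter

# The four buckets named in the router prompt above.
VALID_CATEGORIES = {"billing", "technical_support", "sales", "other"}


def safe_category(decision: dict) -> str:
    """Fall back to "other" when the router returns a missing or unknown category."""
    category = decision.get("category")
    return category if category in VALID_CATEGORIES else "other"


def router_health(sessions: list[dict]) -> dict:
    """Summarize the two router signals over a batch of logged sessions.

    Each session is assumed to look like
    {"category": str, "handler_escalated": bool} (hypothetical schema).
    """
    total = len(sessions)
    if total == 0:
        return {"other_rate": 0.0, "escalation_rate": 0.0}
    counts = Counter(safe_category(s) for s in sessions)
    escalated = sum(1 for s in sessions if s.get("handler_escalated"))
    return {
        "other_rate": counts["other"] / total,
        "escalation_rate": escalated / total,
    }
```

Tracking both rates over time, rather than as a single snapshot, is what makes a spike visible.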
Pattern 2: Parallelizer
Shape. Fan out N LLM calls in parallel on related subtasks, then aggregate.
Code skeleton.
import asyncio

async def parallel_review(document: str):
    # llm_call is assumed to be an async function, so each call here yields a coroutine
    tasks = [
        llm_call(FACT_CHECK_PROMPT.format(doc=document)),
        llm_call(STYLE_CHECK_PROMPT.format(doc=document)),
        llm_call(SAFETY_CHECK_PROMPT.format(doc=document)),
    ]
    # Run the three reviews concurrently and wait for all of them
    fact, style, safety = await asyncio.gather(*tasks)
    return aggregate([fact, style, safety])

When to use. Independent subtasks. Multi-aspect review (fact, style, safety). Voting and self-consistency (call the same prompt 5 times, take majority). Anything where latency matters and the subtasks don't depend on each other.
The failure mode. Aggregation lies. Three parallel reviewers disagree, the aggregator picks the majority, and the lone correct reviewer gets overruled. Or two reviewers hallucinate the same error and the aggregator marks it as confirmed.
Trace signal. Compare inter-reviewer agreement over time. If agreement is too high (above 0.9), your prompts are redundant and you are paying 3x for one opinion. If agreement is too low (below 0.6), the aggregator is doing all the work and the parallel structure is wasted.
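One simple way to measure agreement is the fraction of reviewer pairs that returned the same verdict, averaged over recent runs. This is a sketch under that assumption; the verdict labels and function names are illustrative, and only the 0.9 and 0.6 thresholds come from the text above.

```python
from itertools import combinations


def pairwise_agreement(verdicts: list[str]) -> float:
    """Fraction of reviewer pairs that returned the same verdict for one document."""
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        return 1.0  # zero or one reviewer: nothing to disagree with
    return sum(a == b for a, b in pairs) / len(pairs)


def agreement_status(batches: list[list[str]], high: float = 0.9, low: float = 0.6) -> str:
    """Classify mean agreement across runs using the thresholds from the text."""
    if not batches:
        raise ValueError("need at least one run to measure agreement")
    mean = sum(pairwise_agreement(b) for b in batches) / len(batches)
    if mean > high:
        return "redundant"  # paying several times for one opinion
    if mean < low:
        return "aggregator_dominated"  # the aggregator is doing all the work
    return "healthy"
```

Pairwise agreement ignores whether the shared verdict is correct, so it flags redundancy but not the case where two reviewers hallucinate the same error.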
Pattern 3: Evaluator-Optimizer
Shape. A generator produces output. A separate evaluator critiques it. The generator revises. Repeat until the evaluator passes or you hit a budget.
Code skeleton.
def evaluator_optimizer(task: str, max_iters: int = 3):
    output = generator(task)
    for _ in range(max_iters):
        # The evaluator is assumed to return {"passes": bool, "feedback": str}
        critique = evaluator(task, output)
        if critique["passes"]:
            return output
        # Revise using the previous draft and the evaluator's feedback
        output = generator(task, prior=output, critique=critique["feedback"])
    return output  # best effort after budget

When to use. Tasks with a clear quality bar the evaluator can score (code generation with test cases, factual writing with citation checks, schema-conformant output). When the first pass is rarely perfect but the second or third is reliable.
The failure mode. Evaluator collapse. The evaluator and generator converge on the same wrong answer after two iterations. Or the evaluator's bar drifts looser across iterations because it is now critiquing its own previous critique. We have seen pass rates jump from 0.6 to 0.95 over 3 iterations on the same task while the actual quality did not move at all.
Trace signal. Per-iteration evaluator scores trending up. If iteration 1 averages 0.55 and iteration 3 averages 0.95, but blind sampled human grading stays flat at 0.7, your evaluator is gaming itself. Pin the evaluator prompt, sample-grade weekly, alert on drift.
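The drift check can be sketched as a comparison between the evaluator's final-iteration average and a blind human-graded sample. This assumes both are on the same 0 to 1 scale; the `max_gap` threshold of 0.15 and all names are illustrative choices, not values from our production setup.

```python
from statistics import mean


def evaluator_drift(
    scores_by_iteration: dict[int, list[float]],
    human_scores: list[float],
    max_gap: float = 0.15,  # illustrative alert threshold
) -> dict:
    """Compare evaluator scores against blind human grades on the same scale."""
    first_iter = min(scores_by_iteration)
    final_iter = max(scores_by_iteration)
    evaluator_final = mean(scores_by_iteration[final_iter])
    gap = evaluator_final - mean(human_scores)
    return {
        # How much the evaluator thinks quality improved across iterations
        "evaluator_gain": evaluator_final - mean(scores_by_iteration[first_iter]),
        # How far the evaluator's final score sits above the human sample
        "gap_vs_human": gap,
        "drifting": gap > max_gap,
    }
```

With the numbers from the paragraph above (0.55 at iteration 1, 0.95 at iteration 3, humans flat at 0.7), the gap is 0.25 and the check flags drift.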
Pattern 4: Orchestrator-Workers
Shape. A top-level model decomposes the task into subtasks. It dispatches each subtask to a worker (a constrained subagent with a smaller tool set). Workers return, orchestrator synthesizes.
Code skeleton.
def orchestrator(task: str):
    # The planner is assumed to return {"subtasks": [{"worker_type": ..., ...}, ...]}
    plan = llm_call(PLAN_PROMPT.format(task=task))
    results = []
    for subtask in plan["subtasks"]:
        # WORKERS maps a worker type to a constrained subagent
        worker = WORKERS[subtask["worker_type"]]
        results.append(worker(subtask))
    return llm_call(SYNTHESIS_PROMPT.format(task=task, results=results))

When to use. Complex tasks that genuinely decompose: research that needs search plus calculation plus citation, customer issues that need account lookup plus refund processing plus follow-up scheduling.
The failure mode. Planning explosion. The orchestrator produces a 14-step plan when 3 steps would do. Token cost balloons, latency hits 30 seconds, and most steps return findings the synthesis step ignores. Or the orchestrator under-plans: a complex task gets a one-step plan that misses the whole point.
Trace signal. Plan length distribution. If your orchestrator's plans show a fat tail (median 4 steps, p99 22 steps), the long tail is where money disappears. Cap plan length explicitly in the prompt, fall back to a simpler workflow on overrun.
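A prompt-level cap is only a request, so it helps to enforce the cap in code as well. The sketch below assumes the plan shape used in the skeleton above (a dict with a `subtasks` list); `max_steps=6` is an illustrative default, and returning `None` is one hypothetical way to signal a fallback to a simpler workflow.

```python
from typing import Optional


def cap_plan(plan: dict, max_steps: int = 6) -> Optional[list]:
    """Return the subtasks if the plan is usable, else None to trigger the fallback."""
    subtasks = plan.get("subtasks", [])
    if not subtasks or len(subtasks) > max_steps:
        return None  # empty or overlong plan: use the simpler workflow instead
    return subtasks


def percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile, enough to spot a fat tail in plan lengths."""
    if not values:
        raise ValueError("no plan lengths recorded")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding surprises
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]
```

Comparing `percentile(lengths, 50)` with `percentile(lengths, 99)` over logged plan lengths shows whether the tail is where the cost sits.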
Pattern 5: Hierarchical handoff
Shape. A front-line agent handles the request until it hits a bound (knowledge gap, permission boundary, escalation trigger), then hands off to a more specialized agent with broader tools or higher privileges.
Code skeleton.
def front_line(message: str, context: dict):
    # llm_call is assumed to return a dict; "escalate_to" names a specialist when set
    response = llm_call(FRONT_LINE_PROMPT.format(message=message, ctx=context))
    if response.get("escalate_to"):
        target = SPECIALIST_AGENTS[response["escalate_to"]]
        # Dict union (Python 3.9+) passes the front-line response along as context
        return target(message, context | {"prior": response})
    return response

When to use. Asymmetric workloads where 80% of requests are simple and 20% need horsepower. Customer support (front-line answers FAQs, escalates billing disputes to a refund-authorized agent). Triage systems. Anything with a clear "do I need to wake the senior up" boundary.
The failure mode. Escalation cascade. Front-line hands off too eagerly, specialist hands off again, you end up paying 3x cost and 4x latency on what should have been a Tier-1 ticket.
Trace signal. Handoff depth per session. A healthy hierarchy looks like {0: 70%, 1: 25%, 2: 4%, 3+: 1%}. If you see {0: 40%, 1: 30%, 2: 20%, 3+: 10%}, the front line is escalating instead of solving.
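The depth distribution is straightforward to compute once each session records how many handoffs it went through. A minimal sketch, assuming a list of per-session handoff depths; the 0.6 floor for depth-0 resolution is an illustrative threshold chosen to sit between the healthy (70%) and unhealthy (40%) examples above.

```python
from collections import Counter


def handoff_distribution(depths: list[int]) -> dict[str, float]:
    """Share of sessions at each handoff depth, bucketed as 0, 1, 2, 3+."""
    if not depths:
        raise ValueError("no sessions recorded")
    buckets = Counter("3+" if d >= 3 else str(d) for d in depths)
    total = len(depths)
    return {key: buckets[key] / total for key in ("0", "1", "2", "3+")}


def over_escalating(distribution: dict[str, float], min_front_line_share: float = 0.6) -> bool:
    """Flag a front line that resolves too few sessions without any handoff."""
    return distribution["0"] < min_front_line_share
```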
Three patterns we tell teams to stop using
Pure ReAct loops with no step budget. The classic ReAct pattern (Thought, Action, Observation, repeat) is fine. ReAct with no max-step parameter is not. It will loop forever on edge cases. Always set a step cap and a token cap and return a graceful "could not complete" when either is hit.
Fully autonomous "the agent decides what to do." Marketed as the holy grail. Almost never the right shape in production. You sacrifice every debugging affordance for a system that is hard to constrain, hard to evaluate, and produces wildly inconsistent results on similar inputs. Use this pattern only when the problem space genuinely cannot be decomposed.
Recursive subagent spawning. Subagent A spawns subagent B to handle a piece of A's task. B might spawn C. Three weeks later you discover a session in production that spawned 47 subagents and cost $89 to answer one user question. If you must spawn subagents, cap the recursion depth and total spawn count at the orchestrator level.
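Capping recursion at the orchestrator level can be sketched as one budget object shared by every subagent in a session. Everything here is illustrative: `plan_children` stands in for the model call that decides which subagents to spawn, and the default caps are placeholders.

```python
from typing import Callable


class SpawnBudgetExceeded(RuntimeError):
    """Raised when a session tries to exceed its spawn limits."""


class SpawnBudget:
    """Session-wide limits, owned by the orchestrator and shared by all subagents."""

    def __init__(self, max_depth: int = 2, max_spawns: int = 8):
        self.max_depth = max_depth
        self.max_spawns = max_spawns
        self.spawned = 0

    def register(self, depth: int) -> None:
        """Record one spawn at the given depth, or raise if a cap would be broken."""
        if depth > self.max_depth:
            raise SpawnBudgetExceeded(f"depth {depth} exceeds cap {self.max_depth}")
        if self.spawned >= self.max_spawns:
            raise SpawnBudgetExceeded(f"spawn count cap {self.max_spawns} reached")
        self.spawned += 1


def run_agent(
    task: str,
    plan_children: Callable[[str], list[str]],
    budget: SpawnBudget,
    depth: int = 0,
) -> list[str]:
    """Run a task and, recursively, the subagents it asks for, within the budget."""
    visited = [task]
    for child in plan_children(task):
        budget.register(depth + 1)  # check the caps before the child runs
        visited.extend(run_agent(child, plan_children, budget, depth + 1))
    return visited
```

The orchestrator catches `SpawnBudgetExceeded` once at the top and returns a graceful failure, instead of each subagent deciding for itself when to stop.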
Observability for agent workflows
A pattern table is one half of the story. The other half is how you know which pattern is misbehaving in production. Three signals matter:
- Loop count per session. Histogram, not average. Average hides the long tail where the money goes.
- Tokens per session. Same shape. p50 and p99 should be within 5x of each other for a healthy workflow.
- Per-step pass rate. For evaluator-optimizer and orchestrator-workers patterns. Set explicit thresholds and alert.
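The first two signals can be sketched with the standard library alone. This assumes you can export per-session loop counts and token totals as plain lists; the bucket size and function names are illustrative.

```python
from collections import Counter
from statistics import quantiles


def tail_ratio(values: list[float]) -> float:
    """Return p99 / p50 for a per-session metric such as total tokens."""
    # quantiles(..., n=100) returns 99 cut points: index 49 is p50, index 98 is p99
    cuts = quantiles(values, n=100)
    return cuts[98] / cuts[49]


def loop_histogram(loop_counts: list[int], bucket: int = 5) -> dict[str, int]:
    """Bucket per-session loop counts so the long tail stays visible."""
    counts = Counter((c // bucket) * bucket for c in loop_counts)
    return {f"{start}-{start + bucket - 1}": counts[start] for start in sorted(counts)}
```

A `tail_ratio` above 5 on tokens per session is the condition the second bullet describes as unhealthy.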
Respan captures agent workflows as nested traces using @workflow and @task decorators in the Python SDK, plus auto-instrumentation for OpenAI Agents SDK, Claude Agent SDK, LangGraph, CrewAI, Mastra, Pydantic AI, Google ADK, and others. Pattern metadata rides on the standard span attributes (workflow, workflow path, metadata) so the same dashboards work across patterns. The "monitor an AI agent in production" cookbook is the end-to-end walkthrough. See "LLM tracing" and our "LLM agents and tools" deep-dive for the data model. For the evaluation layer that sits on top, the "agent evaluation guide" covers what to score.
If you are still picking a framework, "agent frameworks explained" walks through what LangGraph, CrewAI, OpenAI Agents SDK, and Anthropic's Claude Agent SDK each emit and how trace shapes compare.
Common gotchas
Mistakes we have seen, ranked by frequency:
- No step budget. Every agent workflow needs an explicit cap. A user session that hits 50 steps is almost always a bug, not a hard task.
- Step budget set in tokens only. Token caps don't catch cheap loops that hit your downstream APIs 200 times. Cap step count and tool-call count too.
- Treating the orchestrator's plan as ground truth. The plan is a hypothesis. Log it, score completion against it, and alert when the actual run diverges from the plan by more than 30%.
- Trace tree without a session-level summary. A 200-span trace is useless if you have to read every span. Attach a session-level summary span with cost, step count, handoff depth, final status.
- No evaluator on the evaluator. If you are using an evaluator-optimizer pattern, sample-grade the evaluator weekly. It is the component most likely to drift silently.
- Re-prompting the same model for self-critique. Use a different model (or at least a different prompt configuration) for the evaluator. Self-critique with the same model converges on agreement, not accuracy.
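The first two gotchas come down to enforcing several caps at once and failing gracefully when any of them trips. A minimal sketch: the `step` callable stands in for one agent iteration, and the default limits are placeholders rather than recommendations.

```python
import time
from typing import Callable, Optional, Tuple


class Budget:
    """Tracks steps, tokens, tool calls, and wall-clock time for one session."""

    def __init__(self, max_steps: int = 15, max_tokens: int = 50_000,
                 max_tool_calls: int = 30, max_seconds: float = 60.0):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.steps = self.tokens = self.tool_calls = 0
        self.started = time.monotonic()

    def charge(self, tokens: int = 0, tool_calls: int = 0) -> None:
        """Record the cost of one completed step."""
        self.steps += 1
        self.tokens += tokens
        self.tool_calls += tool_calls

    def exceeded(self) -> Optional[str]:
        """Return the name of the first tripped cap, or None if within budget."""
        if self.steps >= self.max_steps:
            return "step_cap"
        if self.tokens >= self.max_tokens:
            return "token_cap"
        if self.tool_calls >= self.max_tool_calls:
            return "tool_call_cap"
        if time.monotonic() - self.started >= self.max_seconds:
            return "wall_clock_cap"
        return None


def run_with_budget(step: Callable[[], Tuple[Optional[str], int, int]], budget: Budget) -> dict:
    """Call step() until it returns a result or any cap trips.

    step() returns (result_or_None, tokens_used, tool_calls_made).
    """
    while True:
        tripped = budget.exceeded()
        if tripped:
            return {"status": "could_not_complete", "reason": tripped}
        result, tokens, tool_calls = step()
        budget.charge(tokens=tokens, tool_calls=tool_calls)
        if result is not None:
            return {"status": "ok", "result": result}
```

A loop that spends few tokens but makes many tool calls per step trips `tool_call_cap` long before the token cap, which is the case a token-only budget misses.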
FAQ
What is the difference between an agent workflow and a normal LLM application? A normal LLM application makes one call per user request. An agent workflow makes multiple, and at least one of those calls influences what happens next. As soon as a model decides which function to call, you are running an agent workflow.
Should I use LangGraph, CrewAI, or build my own? For the five patterns above, build your own first. Each pattern is 30 to 80 lines of code. Frameworks earn their keep when you need persistence, retries, and visual workflow editors. They cost you when you need to debug an unusual failure mode.
How many steps is too many? For most production workflows, p50 should be under 5 steps and p99 under 15. Routinely exceeding 15 steps is a sign your decomposition is wrong.
Should the evaluator be the same model as the generator? No. Same-model self-critique tends to agree with itself. Use a different model, or at minimum a different system prompt that role-plays a stricter reviewer.
What is the cheapest way to add tracing? OpenTelemetry plus a hosted backend. Adding spans to your existing code is usually a one-day change. See LLM observability for what to capture.
How do I budget for an agent workflow? Token cap, step cap, tool-call cap, wall-clock cap. Set all four. The first one that trips returns a graceful failure. Without all four, one will eventually wedge a session.
Is the orchestrator-workers pattern worth the complexity? Only if your task genuinely decomposes into independent subtasks. If the subtasks share context heavily, you will spend more tokens passing context around than you save with decomposition. Test before committing.