The first time you debug an agent in production, you do what you always did. You read the stack trace. There isn't one. You check the logs. They show 47 LLM calls and a final user-facing answer that is wrong, but nothing about why the model made the choices it made. You ask the user to reproduce. They can't, because the model picked a slightly different tool order this time and the bug is gone.
This is the moment a lot of teams quietly migrate to traces. Agent debugging is not a smaller version of API debugging. It is its own discipline, because the failure surface is the model's decision tree across multiple steps, not a single function call. This piece is the method we use at Respan after watching engineers debug thousands of production agent loops. The trace tree is the unit of analysis. Five bug shapes cover roughly 90% of what you will see. The span schema you set up upfront determines whether the next debug session takes 5 minutes or 5 hours.
TL;DR
- Stop debugging with print statements. Agents make non-deterministic, multi-step decisions. Print statements show you one path of one run. You need the tree.
- Five bug shapes cover most production failures: stuck loops, hallucinated tool arguments, lost context, wrong-path planning, silent degradation across deploys.
- A useful trace span includes: model name, prompt hash, tool name, full arguments, full result, token counts, latency, error mode. Skip any of these and you will not be able to debug the next incident.
- Reproduce locally with the captured prompt, not the captured user query. The query is what the user said. The prompt is what the model actually saw, which is what produced the bug.
- One trace tree per session, not per call. The bug is in the relationship between calls, not inside any single call.
Why agent debugging is different
A normal API call fails in one of three ways. Bad input, bad code, bad downstream dependency. You read the stack trace, you know which one. Agent calls fail in different ways:
- The model decided to call the wrong tool.
- The model called the right tool with wrong arguments it hallucinated from prior context.
- The model called the right tool with right arguments, got the right result, then drew a wrong conclusion.
- The model did all three of those correctly but the previous step in the chain set up bad context.
- The model would have done it right on Monday but you deployed a prompt change Friday and it now picks step 3 differently.
None of these show up in a stack trace. Three of them aren't even deterministic. The state that matters for debugging is the full conversation history at each turn, the prompt template version, the model version, the tool definitions, and the actual sampled response. If you do not capture all of that you are debugging blind.
The trace tree: anatomy of a useful one
A good agent trace looks like a tree. The root is the user request. Each LLM call is a span. Each tool call is a child span of the LLM call that triggered it. Sub-agents (when the orchestrator-workers pattern from the agent workflow guide hands off) are sub-trees attached to their parent span.
A useful span schema, expressed as generic OTel attributes you would attach yourself or get for free via an instrumentation library:
# Per-span attributes that make debugging fast.
span.set_attribute("model", "claude-sonnet-4-6")
span.set_attribute("prompt_version", "v17")
span.set_attribute("prompt_hash", sha256(rendered_prompt))
span.set_attribute("usage.prompt_tokens", usage.input_tokens)
span.set_attribute("usage.completion_tokens", usage.output_tokens)
span.set_attribute("usage.cached_tokens", usage.cache_read_input_tokens)
span.set_attribute("latency_ms", elapsed_ms)
span.set_attribute("tool.name", tool_name)
span.set_attribute("tool.args_json", json.dumps(tool_args))
span.set_attribute("tool.result_json", json.dumps(tool_result)[:8192])
span.set_attribute("tool.error", tool_error or "")If you are running Respan, most of these are populated for you. The Respan span model already includes input, output, model, usage, metadata, customer_identifier, thread_identifier, and prompt versioning fields. See the span attributes reference for the full set.
Two things matter more than people expect. First, the prompt hash. When the same query produces different behavior on two days, the first question is: did the rendered prompt actually differ? The hash gives you a yes or no in one second. Second, the full tool result, truncated to 8KB or so. Truncate too aggressively and you lose the chunk that produced the hallucination. Save too much and your storage bill blows up. 8KB is the rough sweet spot for most workloads.
If your tracing system can't store this much per span, that is the first thing to fix. Everything below assumes the data is there.
Bug shape 1: Stuck loops
Symptom. The agent calls the same tool 8, 14, 47 times in a row, eventually gives up or hits a step cap, and the user gets an apology.
How to find it in a trace. Group tool spans by tool.name within a session. Look for runs longer than 3 consecutive calls. In a healthy session this almost never happens.
Common cause. The tool returned an error or empty result, the model interpreted that as "I should try again," and there is no escape clause in the prompt. Often the tool itself is fine and the model just needs to give up. Sometimes the tool is genuinely broken and the model is correctly retrying it, but the tool is going to keep failing.
Fix. Two things together. First, add a retry: False flag in your structured error responses so the model knows the result is terminal. We covered the pattern in the MCP server tutorial. Second, count repeated tool calls in your agent loop and inject an "you have called this tool 3 times, do not call it again, explain to the user what's wrong" message after the third call. This is one of the highest-ROI fixes in agent debugging.
Bug shape 2: Hallucinated tool arguments
Symptom. The tool call uses an argument that does not exist. A user ID the user never mentioned. A date in the future when nothing should be future-dated. A SQL table that isn't in your schema.
How to find it in a trace. Diff tool.args_json against agent.input (the original user text) and the prior tool results. If an argument appears that has no source in either, it was hallucinated.
Common cause. Tool description is ambiguous about what arguments it expects, and the model fills the gap with a plausible-looking value. Or the tool argument type is too permissive (a generic string field that the model decides to use for free-form intent).
Fix. Tighter tool descriptions and tighter argument types. Use enums where possible (status: "open" | "closed" | "pending"). Add a sentence to the docstring like "If you do not know X, do not guess. Ask the user." Models follow this surprisingly well when it is explicit.
Bug shape 3: Lost context
Symptom. The user asked a question, the agent answered. The user asked a follow-up. The agent answered as if the first turn never happened.
How to find it in a trace. Read the llm.input of the second turn (the rendered prompt the model actually saw). If the first turn's content is missing or summarized away, that is your bug.
Common cause. One of three. Conversation history was truncated to fit the context window and the truncation removed the relevant turn. Or your code passes only the last user message to the model, not the conversation. Or the agent framework summarized the history aggressively and dropped a detail that mattered.
Fix. Audit your context construction. If you must truncate, prefer dropping middle turns and keeping the first and last few. Log the rendered prompt every turn so this bug is always one query away from being visible.
Bug shape 4: Wrong-path planning
Symptom. The orchestrator produces a plan with the wrong shape. A 14-step plan when 3 would do. A plan that skips the step that actually solves the task. A plan that loops back to an earlier step that was already done.
How to find it in a trace. The first LLM call in an orchestrator-workers session emits the plan. Inspect it. Then compare the plan to the actual sequence of spans that followed. Healthy sessions match plan to execution within 1 to 2 steps. Bad sessions diverge wildly.
Common cause. The orchestrator prompt does not constrain plan length, format, or step types tightly enough. The model is told "make a plan" without telling it what good looks like.
Fix. Constrain the plan output. A JSON schema with max_steps: 6 and a closed-vocabulary step_type enum. Include 2 to 3 example plans in the prompt (few-shot). Reject plans that don't validate and re-prompt with the validation error.
Bug shape 5: Silent degradation
Symptom. The agent worked great in May. In June, users start complaining. No code changed. The metrics dashboard shows a 4% drop in answer quality scores but nothing dramatic.
How to find it in a trace. Compare a working trace from May to a broken trace from June with the same input shape. Check llm.model (did the API auto-upgrade?), llm.prompt_version (did a teammate ship a prompt change?), llm.prompt_hash (same template, but did a referenced variable change content?). One of these almost always differs.
Common cause. A model auto-upgrade. We have seen gpt-4o-latest aliases silently roll forward and break agents that depended on a specific version's tool-calling behavior. Or a docs change in a referenced corpus that the agent retrieves. Or a tool's response shape changed because an upstream API was updated.
Fix. Pin model versions explicitly. Pin prompt versions and store them in source control with hashes. Set up an evaluation alert (see our agent evaluation guide) that fires on per-week quality regressions, not just absolute thresholds.
Setting up traces for debuggability
Three rules I would write down before adding tracing to a new agent system:
- Every LLM call is a span. Every tool call is a span. Every sub-agent is a sub-tree. Skipping any of these breaks the parent-child structure that makes the bug findable.
- Capture the rendered prompt, not just the template. The rendered prompt is what the model actually saw. The template plus variables is what you intended. Bugs live in the difference.
- Attach a session-level summary span at the end. Total cost, step count, handoff depth, final status. This is how you spot bad sessions in aggregate before reading any individual trace.
If you are using Respan, the SDK and the gateway populate most of these attributes automatically when you wrap workflows with the @workflow and @task decorators or call models through the gateway. See LLM observability for the architecture and LLM workflows and tracing for the data model details.
Reproducing a bug in dev
When you find a bad session in production, do not ask the user to reproduce. Pull the rendered prompt directly from the trace and replay it in a notebook against the same pinned model. If you reproduce the bug, you can iterate on the prompt without involving the user. If you cannot reproduce, the bug depends on prior context that the trace didn't capture, which is itself a fix: add more span attributes.
A short replay script:
import anthropic
client = anthropic.Anthropic()
# From the trace: pinned model, rendered prompt
MODEL = "claude-sonnet-4-6"
RENDERED_PROMPT = trace_span["llm.input"] # full chat history as messages
resp = client.messages.create(
model=MODEL,
max_tokens=2048,
messages=RENDERED_PROMPT,
tools=trace_span["llm.tools"], # captured from the span
)
print(resp)If the bad behavior reproduces, you have a tight iteration loop. Change one variable at a time. Stricter tool description. Tighter system prompt. Different model. Each change replays in seconds and the trace shows whether the bug shape changed.
Common gotchas
Mistakes we see, ranked:
- Tracing only LLM calls, not tool calls. Most bugs live in the gap between the two. If your trace tree has gaps where the tool call should be, debugging is much harder.
- Storing the prompt template instead of the rendered prompt. Useless when the template uses a variable whose value changed between runs.
- Truncating tool results too aggressively. A 2KB cap on result strings hides the chunk that caused the hallucination half the time.
- No session-level summary. Forces you to read every trace one by one. With a session summary span, you can sort by cost or step count and find the bad ones in 30 seconds.
- No prompt versioning. When the answer changes between Monday and Tuesday and you cannot tell whether the prompt did, the bug becomes detective work.
- Trying to reproduce with the user's query instead of the trace's prompt. The user's query went through your context construction code, which is part of what produced the bug. Skip that layer when reproducing.
- Reading traces sequentially. Aggregate first. Sort sessions by some signal (cost, step count, eval score). Look at the worst 10. The bug pattern is usually obvious in 5 of them.
FAQ
Do I need OpenTelemetry or can I roll my own tracing? Roll your own first if you want to understand what to capture. Switch to OTel once you know which attributes you actually use. Custom tracing is 50 lines of code. OTel becomes worth it when you have multiple services to correlate across.
How much data does this cost to store? For a typical production agent (5 to 10 steps per session, 4KB per span), about 30 to 50KB per session uncompressed. At 10K sessions per day that is roughly 300 to 500 MB per day. Most observability backends compress this well.
Should I sample traces or capture all of them? Capture all of them for the first 90 days while you are learning the bug shapes. Sample to 10 to 20% once you have alerting and aggregation set up. Always capture 100% of errored sessions.
What if the model is non-deterministic and the bug only happens 30% of the time?
Set temperature=0 for the replay. If the original ran at higher temperature, you may not reproduce exactly, but the same temperature=0 run on the rendered prompt will tell you whether the prompt itself is the problem or whether sampling was unlucky.
Should I log the full conversation history or just diffs? Full history per span. Diffs are clever and useless when the bug is in the part that didn't change.
Is print debugging ever the right answer for agents? For the first 30 minutes of getting a new agent off the ground, sure. After that, no. The cost of adding tracing is small compared to the cost of debugging without it once.
What's the single highest-ROI fix for agent debug speed? Pin model versions explicitly. We have lost more debug hours to silent model upgrades than to any other single cause.