LLM tracing is the practice of capturing every step of an LLM-powered request — the prompt, the completion, retrieval calls, tool invocations, intermediate agent decisions, latency, and cost — as a hierarchical trace with parent-child spans. It's the foundation of debugging multi-step agents in production.

What's the difference between LLM tracing and logging?

Logs are flat, structured events. Traces are causal: they tie events into a parent-child hierarchy that captures one logical operation end-to-end. For a single-call LLM you can get by with logs; for an agent that fans out into 10+ LLM and tool calls, you need a trace or you can't reconstruct what happened.

Should I use OpenTelemetry for LLM tracing?

Yes, by default. OpenTelemetry's GenAI semantic conventions (gen_ai.* attributes) are stable enough as of 2025 that any new tracing infrastructure should be OTel-native. Going proprietary creates portability debt that gets expensive fast.

Should I sample LLM traces in production?

Most teams shouldn't. Capture 100% to start. The rare failure modes — hallucinations, retries, runaway agent loops — are exactly what you most need traces of, and they're usually long-tail. Sample after you've decided what to drop, not before. Storage is cheaper than missed bugs.

What attributes belong on an LLM span?

Following OTel GenAI conventions: gen_ai.system (provider), gen_ai.request.model, gen_ai.request.temperature, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons. Plus a custom set: prompt template id and version, eval scores, user id, feature/route id. The custom set is what lets you slice metrics by feature and prompt later.

How do I trace tool calls in an agent?

Each tool call is a child span of the parent agent run. Capture the tool name, the arguments (sanitized), the result, and the latency. If a tool itself makes external HTTP calls, those become grandchildren. The tree structure is what makes agent debugging tractable.
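
As a rough illustration of the fields listed above, here is a minimal sketch in plain Python. It is not the Respan SDK: `record_tool_call`, `sanitize`, the `SENSITIVE_KEYS` deny-list, and the `lookup_order` tool are all hypothetical names chosen for the example.

```python
import time
from typing import Any, Callable

# Hypothetical deny-list; a real system would use its own redaction policy.
SENSITIVE_KEYS = {"password", "api_key", "token", "email"}


def sanitize(args: dict[str, Any]) -> dict[str, Any]:
    """Mask values whose key is on the deny-list; leave the rest untouched."""
    return {
        key: ("[REDACTED]" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in args.items()
    }


def record_tool_call(
    parent_span_id: str, tool: Callable[..., Any], **kwargs: Any
) -> dict[str, Any]:
    """Run a tool and return a span-like record linked to its parent."""
    start = time.perf_counter()
    result, error = None, None
    try:
        result = tool(**kwargs)
    except Exception as exc:  # record the failure instead of losing it
        error = repr(exc)
    return {
        "name": f"tool.{tool.__name__}",
        "parent_span_id": parent_span_id,
        "arguments": sanitize(kwargs),
        "result": result,
        "error": error,
        "latency_ms": (time.perf_counter() - start) * 1000,
    }


def lookup_order(order_id: str, email: str) -> str:
    """Stand-in tool used only for this example."""
    return f"order {order_id}: shipped"


span = record_tool_call(
    "agent-run-1", lookup_order, order_id="A17", email="a@example.com"
)
```

The record keeps the parent reference, so a viewer can attach it under the agent run, and the email argument is masked before anything is stored.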

What's the difference between a trace and a request?

A trace is one logical operation as the user perceives it — 'send a customer support reply.' A request is one HTTP call. For a non-agent app they're often 1:1; for an agent, one trace contains many LLM and tool requests. The unit of analysis for LLM observability should be the trace, not the request.

How long should I keep LLM traces?

30-90 days for hot search and debugging; longer cold storage if you want to mine traces for evaluation datasets. Most teams under-store, then regret it the first time they need to reproduce a 6-month-old failure for a customer ticket.

LLM Tracing: How to Trace Multi-Step Agent Pipelines (2026 Guide)

Frank Chen · Head of DevRel, Respan

Last updated May 10, 2026 · 12 min read

TL;DR

LLM tracing is the foundation of agent observability — capturing every LLM call, tool invocation, and intermediate step as a hierarchical span tree. Use OpenTelemetry GenAI semantic conventions by default, capture 100% of traces (don't sample early), and treat the trace (not the LLM call) as the unit of analysis. The teams that get this right debug agent regressions in minutes; teams that don't take days.

What is LLM tracing?

An LLM trace captures one logical operation end-to-end as a tree of spans. The root span is the user request. Each step — retrieval, LLM call, tool invocation, agent decision, sub-agent call — becomes a child span with timing, inputs, outputs, attributes, and a parent reference. The result is a structured object that lets you reconstruct exactly what happened, in what order, and with what cost.
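
To make the tree shape concrete, here is a minimal sketch of a span tree in plain Python. The `Span` class, its field names, and the example durations and costs are assumptions made for illustration; they are not a Respan or OpenTelemetry data model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in a trace; children are the steps it triggered."""

    name: str
    duration_ms: float
    cost_usd: float = 0.0
    children: list[Span] = field(default_factory=list)

    def total_cost(self) -> float:
        """Cost of this span plus everything beneath it."""
        return self.cost_usd + sum(c.total_cost() for c in self.children)

    def count(self) -> int:
        """Number of spans in this subtree, including this one."""
        return 1 + sum(c.count() for c in self.children)

    def render(self, depth: int = 0) -> list[str]:
        """Indented outline of the subtree, one line per span."""
        lines = [f"{'  ' * depth}{self.name} ({self.duration_ms:.0f} ms)"]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


# Illustrative numbers only.
root = Span(
    "user_request",
    1840,
    children=[
        Span("retrieve", 90),
        Span("llm.plan", 700, 0.002, [Span("tool.lookup_order", 120)]),
        Span("llm.reply", 900, 0.003),
    ],
)

print("\n".join(root.render()))
```

Because every span hangs off a parent, questions like "what did this request cost" or "which step ran under which LLM call" become simple tree walks rather than log correlation.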

One support agent run, six spans. The tool calls are children of the LLM call that requested them. Without this hierarchy, you get six separate log lines and no way to know they belong to the same user action.

For a single-call LLM endpoint (chatbot reply, completion API), tracing looks a lot like enriched logging. The discipline becomes essential the moment you're running multi-step agents: an agent that makes ten LLM calls and twenty tool calls per user action is impossible to debug from logs. The agent runs we see typically touch a handful of spans on the median path; the long tail of complex agent runs reaches into the dozens.

A real trace from Respan: one user request fanned into a retrieval span, two LLM calls, and three tool invocations.

Why tracing is the foundation

Of the five pillars of LLM observability, tracing is the one the others rest on. Evals score traces. Metrics aggregate over traces. Prompt management uses trace data to A/B test. Dataset curation harvests traces. Without tracing, the rest is impossible.

Three concrete consequences of skipping tracing:

Agent regressions take days to bisect. A user reports a bad output. Without a trace, you have no record of what tools were called, in what order, with what arguments, or what the intermediate LLM decisions were. You're reverse-engineering from log lines.
Per-feature cost attribution is impossible. If you only have flat LLM call logs, you can total your bill by model — but not by feature, customer, or agent flow. You spend without knowing where.
Quality regressions surface in support tickets. Without traces tied to eval scores, a degraded prompt slips through staging and shows up as customer complaints, not a chart.
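
The cost attribution point can be sketched in a few lines. This assumes each exported span carries `feature.id` and `cost.usd` attributes; the records below are made-up data and `cost_by` is a hypothetical helper, not a platform API.

```python
from collections import defaultdict

# Made-up span records for illustration.
spans = [
    {"trace_id": "t1", "feature.id": "support", "user.id": "u_1", "cost.usd": 0.004},
    {"trace_id": "t1", "feature.id": "support", "user.id": "u_1", "cost.usd": 0.002},
    {"trace_id": "t2", "feature.id": "search", "user.id": "u_2", "cost.usd": 0.001},
]


def cost_by(records: list[dict], key: str) -> dict[str, float]:
    """Sum cost.usd grouped by any attribute present on the spans."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.get(key, "unknown")] += record.get("cost.usd", 0.0)
    return dict(totals)


by_feature = cost_by(spans, "feature.id")
by_user = cost_by(spans, "user.id")
```

With flat per-model logs, the grouping key simply is not there to aggregate on.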

Founder's take

Frank Chen · Head of DevRel, Respan

Sampling LLM traces in production is the single most common mistake I see. Teams default to 1% or 10% to "save cost," and then six months later they hit a customer-reported hallucination they can't reproduce because the relevant trace was sampled out. The rare failure modes are precisely what you need to capture — they're long-tail by definition.

Capture 100% from day one. Storage is cheaper than missed bugs, and a trace is one row in a columnar store, not a megabyte. If you can't afford to capture every trace, you probably can't afford to ship the agent.

Anatomy of an LLM span

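The OpenTelemetry GenAI semantic conventions define a shared schema for LLM spans. The conventions are still marked as in development, and attribute names can change between releases (newer releases rename gen_ai.system to gen_ai.provider.name, for example), so pin the version you target. Every conformant tracing platform should produce these attributes: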

gen_ai.system — provider name (openai, anthropic, etc.)
gen_ai.request.model — the model called (gpt-4o, claude-3-5-sonnet)
gen_ai.request.temperature, gen_ai.request.max_tokens, etc.
gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
gen_ai.response.finish_reasons
gen_ai.response.id — provider's response ID for cross-referencing

On top of the standard, add a custom set:

prompt.template_id + prompt.version — which prompt was used
eval.score per evaluator — quality scores attached after the fact
user.id + feature.id — for slicing metrics by user and feature
cost.usd — pre-computed cost so dashboards don't recompute

The custom set is what turns a trace from a debugging artifact into an analytics primitive. It's also what most teams forget to add in v1.
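
One way to keep the standard and custom sets together is to build them in a single place. The sketch below is an assumption-laden example: `llm_span_attributes` is a hypothetical helper, and the `PRICES_PER_1M` table holds placeholder prices, not real provider pricing.

```python
from typing import Any, Optional

# Placeholder prices in USD per 1M tokens as (input, output); not real pricing.
PRICES_PER_1M = {"gpt-4o": (2.50, 10.00)}


def llm_span_attributes(
    *,
    provider: str,
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reason: str,
    template_id: str,
    template_version: str,
    user_id: str,
    feature_id: str,
    temperature: Optional[float] = None,
) -> dict[str, Any]:
    """Return the standard gen_ai.* attributes plus the custom set."""
    in_price, out_price = PRICES_PER_1M[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    attrs: dict[str, Any] = {
        "gen_ai.system": provider,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": [finish_reason],
        # Custom set from the list above
        "prompt.template_id": template_id,
        "prompt.version": template_version,
        "user.id": user_id,
        "feature.id": feature_id,
        "cost.usd": cost,
    }
    if temperature is not None:
        attrs["gen_ai.request.temperature"] = temperature
    return attrs


attrs = llm_span_attributes(
    provider="openai",
    model="gpt-4o",
    input_tokens=1200,
    output_tokens=300,
    finish_reason="stop",
    template_id="support",
    template_version="v3",
    user_id="u_123",
    feature_id="support_agent",
    temperature=0.2,
)
```

Computing `cost.usd` at write time means dashboards read a number instead of re-deriving it from token counts and a price table that may have changed since.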

Should you sample?

Default answer: no. Capture 100% of traces in production. The objections to this default are usually wrong:

"It's too expensive." A trace is hundreds of bytes to a few kilobytes in columnar storage. At Respan scale (80M requests/day), full-fidelity capture costs cents per million traces. The model bill is 1000× larger.
"We have too much volume." Volume is what you instrument for. The whole point of observability is finding signal in volume.
"We only care about errors." Most LLM failures are HTTP 200 — the response is technically successful but functionally wrong. You can't filter on HTTP status to find them.

When sampling is genuinely required (regulatory, vendor cost limits), prefer tail-based sampling: capture all errors, all slow traces, all traces that hit eval thresholds, plus a random sample of the rest. Never head-sample on input alone.
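
A tail-based decision of that kind might look like the sketch below. The thresholds, the `TraceSummary` shape, and the reading of "hit eval thresholds" as "scored below a floor" are all assumptions for the example; the key property is that the decision runs after the trace has finished, when errors, latency, and scores are known.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TraceSummary:
    """What is known about a trace once it has completed."""

    has_error: bool
    duration_ms: float
    min_eval_score: Optional[float]  # None if no evaluator ran


def keep_trace(
    trace: TraceSummary,
    *,
    slow_ms: float = 5000,        # assumed latency threshold
    eval_floor: float = 0.7,      # assumed quality threshold
    baseline_rate: float = 0.1,   # random sample of healthy traces
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide after completion whether to retain the trace."""
    if trace.has_error:
        return True
    if trace.duration_ms >= slow_ms:
        return True
    if trace.min_eval_score is not None and trace.min_eval_score < eval_floor:
        return True
    # Everything else: keep a random baseline for comparison.
    return rng() < baseline_rate


decisions = [
    keep_trace(TraceSummary(True, 300, None)),
    keep_trace(TraceSummary(False, 9000, 0.9)),
    keep_trace(TraceSummary(False, 300, 0.4)),
]
```

Head sampling cannot do this, because at the start of a request none of these signals exist yet.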

How to instrument

Path 1: SDK (fastest)

from respan import Respan
from openai import OpenAI

respan = Respan(api_key="...")
client = respan.wrap(OpenAI())

# Every call is now a span with full GenAI attributes
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}],
    metadata={
        "user_id": "u_123",
        "feature": "support_agent",
        "prompt_template_id": "support.v3",
    },
)

Path 2: OpenTelemetry-native

If you're already on OTel, point your existing exporter at Respan's OTLP endpoint. No code changes, full GenAI conventions support.

# Standard OTel SDK config — Respan accepts OTLP
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.respan.ai/v1/otlp
# "Bearer" and the key must be separated by a space; some SDKs expect it URL-encoded as %20
OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer ${RESPAN_API_KEY}"

Wrapping agent and tool spans

For multi-step agents, manually wrap each step so the trace tree mirrors your logical operation:

with respan.span("support_agent.run") as run:
    run.set_attributes({"feature.id": "support", "ticket.id": ticket.id})

    with respan.span("retrieve") as retrieval:
        chunks = vector_db.search(ticket.text)
        retrieval.set_attribute("retrieved.count", len(chunks))

    with respan.span("draft_reply"):
        reply = client.chat.completions.create(...)

    with respan.span("send_email"):
        # Send the generated text, not the raw response object
        email_service.send(reply.choices[0].message.content)

The trace now has three child spans under support_agent.run (four spans including the root), and you can debug each step independently.
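
For intuition about how nested `with` blocks turn into parent-child links, here is a minimal sketch built on the standard library's `contextvars`. It is not how the Respan SDK is implemented; `span`, `finished`, and the integer IDs are hypothetical and only illustrate the mechanism.

```python
import contextvars
import itertools
import time
from contextlib import contextmanager

# The ID of the span that is currently open, if any.
_current = contextvars.ContextVar("current_span_id", default=None)
_ids = itertools.count(1)
finished = []  # completed span records, in the order they closed


@contextmanager
def span(name):
    """Open a span whose parent is whatever span is open right now."""
    span_id = next(_ids)
    parent_id = _current.get()
    token = _current.set(span_id)
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        _current.reset(token)  # restore the parent as the current span
        finished.append({
            "id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


with span("support_agent.run"):
    with span("retrieve"):
        pass
    with span("draft_reply"):
        pass
    with span("send_email"):
        pass

parents = {s["name"]: s["parent_id"] for s in finished}
```

Each inner block records the outer span as its parent without any ID being passed by hand, which is why wrapping steps in context managers is enough to get a correct tree.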

A real debugging story

One Respan customer shipped a prompt tweak on a Friday. Their support agent had been running smoothly: time to first token (TTFT) under 1.5s, eval scores stable. By Monday morning, one customer (out of thousands) reported that replies had gone "robotic and confused." Their PagerDuty was silent. Their Datadog dashboard was green. Cost looked fine.

They opened the trace for the complaining customer's last interaction. The trace tree showed something the dashboards couldn't: a tool call that had previously been a leaf was now being followed by a second LLM call to "interpret" the tool result, and the interpretation was sometimes hallucinating. The new prompt had subtly encouraged tool-result reflection that the agent didn't need.

Without traces, this would have been three days of bisecting prompts. With traces, it was 8 minutes. The fix was a one-line revert. The lesson is the lesson of the whole pillar: aggregate metrics tell you something is wrong; traces tell you what.

Common tracing mistakes

Sampling early to "save cost." See above. Don't.
Forgetting to instrument tools. Half the trace is missing if tool calls aren't spans. Tools fail more than LLMs do.
Trace IDs not propagated across services. If your retrieval service runs in a different process, you need to pass the trace context (W3C traceparent) so spans connect. Otherwise you have two disconnected traces.
Capturing PII without thinking. Decide upfront what to redact: emails, names, account numbers. Most platforms (including Respan) support redaction at ingest.
One trace per LLM call. If the user-facing operation is "draft a reply," that's the trace. The two LLM calls inside are spans. Conflating them flattens the tree.
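
On the propagation point, the W3C traceparent header has the form version-traceid-parentid-flags, all lowercase hex. The sketch below uses only the standard library to show what gets passed between services; the function names are hypothetical, and in practice an OpenTelemetry propagator would usually handle this for you.

```python
import re
import secrets
from typing import Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace and return its traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> Optional[dict]:
    """Return the header's fields, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec treats version ff and all-zero IDs as invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming: str) -> Optional[str]:
    """Header for a downstream call: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return None
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{parsed['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The downstream service parses `outgoing`, reuses the trace ID, and records the incoming span ID as its parent, which is what joins the two processes into one tree.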

Tracing tools compared

Most LLM observability platforms include tracing as core. The differentiators: instrumentation model (SDK vs OTel vs proxy), depth of GenAI semantic conventions support, and whether tracing is paired with evals + prompt management. Full feature comparison on the LLM observability pillar.

Respan: SDK + OTel + Proxy. 100% capture default. GenAI conventions native. Paired with evals, gateway, prompt mgmt in one platform.
Langfuse: SDK + OTel. Open source. Strong tracing UI. No gateway.
LangSmith: SDK-first, LangChain-native. Less general OTel support.
Helicone: Proxy-based. Easiest one-line install. Less depth on agent tracing.
Braintrust: Eval-first product, tracing is solid but secondary.
Datadog LLM: Bolted onto APM. Good if you already use Datadog.

Frequently asked questions

Frank Chen

Head of DevRel, Respan

Head of DevRel at Respan (YC W24). Working alongside the team running the infrastructure that handles 80M+ LLM requests a day.

Add tracing to your LLM app in two lines

100% trace capture by default. SDK or OTel. Pairs with evals, gateway, and prompt management.
