TL;DR
LLM tracing is the foundation of agent observability — capturing every LLM call, tool invocation, and intermediate step as a hierarchical span tree. Use OpenTelemetry GenAI semantic conventions by default, capture 100% of traces (don't sample early), and treat the trace (not the LLM call) as the unit of analysis. The teams that get this right debug agent regressions in minutes; teams that don't take days.
What is LLM tracing?
An LLM trace captures one logical operation end-to-end as a tree of spans. The root span is the user request. Each step — retrieval, LLM call, tool invocation, agent decision, sub-agent call — becomes a child span with timing, inputs, outputs, attributes, and a parent reference. The result is a structured object that lets you reconstruct exactly what happened, in what order, and with what cost.
One support agent run, six spans. The tool calls are children of the LLM call that requested them. Without this hierarchy, you get six separate log lines and no way to know they belong to the same user action.
For a single-call LLM endpoint (chatbot reply, completion API), tracing looks a lot like enriched logging. The discipline becomes essential the moment you're running multi-step agents: an agent that makes ten LLM calls and twenty tool calls per user action is impossible to debug from logs. The agent runs we see typically touch a handful of spans on the median path; the long tail of complex agent runs reaches into the dozens.

A real trace from Respan: one user request fanned into a retrieval span, two LLM calls, and three tool invocations.
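To make the structure concrete, here is a minimal sketch of how a run like that might serialize, assuming a generic span schema. The field names, IDs, and durations are illustrative, not any platform's actual export format; real exporters follow OTLP and the GenAI semantic conventions.

```python
# Illustrative shape of one support-agent trace: a root span, a retrieval span,
# two LLM calls, and three tool calls parented to the LLM call that requested them.
trace = {
    "trace_id": "a1b2c3",
    "spans": [
        {"span_id": "s1", "parent_id": None, "name": "support_agent.run",   "duration_ms": 4210},
        {"span_id": "s2", "parent_id": "s1", "name": "retrieve",            "duration_ms": 180},
        {"span_id": "s3", "parent_id": "s1", "name": "llm.chat gpt-4o",     "duration_ms": 2100,
         "attributes": {"gen_ai.usage.input_tokens": 1450, "gen_ai.usage.output_tokens": 320}},
        {"span_id": "s4", "parent_id": "s3", "name": "tool.search_orders",  "duration_ms": 240},
        {"span_id": "s5", "parent_id": "s3", "name": "tool.fetch_customer", "duration_ms": 150},
        {"span_id": "s6", "parent_id": "s3", "name": "tool.refund_policy",  "duration_ms": 90},
        {"span_id": "s7", "parent_id": "s1", "name": "llm.chat gpt-4o",     "duration_ms": 800},
    ],
}
```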
Why tracing is the foundation
Of the five pillars of LLM observability, tracing is the one the others rest on. Evals score traces. Metrics aggregate over traces. Prompt management uses trace data to A/B test. Dataset curation harvests traces. Without tracing, the rest is impossible.
Three concrete consequences of skipping tracing:
- Agent regressions take days to bisect. A user reports a bad output. Without a trace, you have no record of what tools were called, in what order, with what arguments, or what the intermediate LLM decisions were. You're reverse-engineering from log lines.
- Per-feature cost attribution is impossible. If you only have flat LLM call logs, you can total your bill by model — but not by feature, customer, or agent flow. You spend without knowing where.
- Quality regressions surface in support tickets. Without traces tied to eval scores, a degraded prompt slips through staging and shows up as customer complaints, not a chart.

Sampling LLM traces in production is the single most common mistake I see. Teams default to 1% or 10% to "save cost," and then six months later they hit a customer-reported hallucination they can't reproduce because the relevant trace was sampled out. The rare failure modes are precisely what you need to capture — they're long-tail by definition.
Capture 100% from day one. Storage is cheaper than missed bugs, and a trace is one row in a columnar store, not a megabyte. If you can't afford to capture every trace, you probably can't afford to ship the agent.
Stop sampling. Start capturing 100%.
Respan tracing captures every LLM call, tool invocation, and sub-agent recursion — with full OpenTelemetry GenAI conventions support. Free to try, no credit card.
Try Respan free
Anatomy of an LLM span
The OpenTelemetry GenAI semantic conventions define a stable schema for LLM spans. Every conformant tracing platform should produce these attributes:
- `gen_ai.system` — provider name (openai, anthropic, etc.)
- `gen_ai.request.model` — the model called (gpt-4o, claude-3-5-sonnet)
- `gen_ai.request.temperature`, `gen_ai.request.max_tokens`, etc.
- `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`
- `gen_ai.response.finish_reasons`
- `gen_ai.response.id` — provider's response ID for cross-referencing
On top of the standard, add a custom set:
- `prompt.template_id` + `prompt.version` — which prompt was used
- `eval.score` per evaluator — quality scores attached after the fact
- `user.id` + `feature.id` — for slicing metrics by user and feature
- `cost.usd` — pre-computed cost so dashboards don't recompute
The custom set is what turns a trace from a debugging artifact into an analytics primitive. It's also what most teams forget to add in v1.
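A hedged sketch of what setting both sets looks like with the plain OpenTelemetry Python API. The attribute values are illustrative, and a wrapper SDK would normally populate the `gen_ai.*` fields for you; `eval.score` is typically attached later, after evaluators run.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-tracing-demo")

# One LLM call as a span, carrying both standard GenAI and custom attributes.
with tracer.start_as_current_span("chat gpt-4o") as span:
    # Standard OpenTelemetry GenAI semantic-convention attributes
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    span.set_attribute("gen_ai.request.temperature", 0.2)
    span.set_attribute("gen_ai.usage.input_tokens", 1450)
    span.set_attribute("gen_ai.usage.output_tokens", 320)
    span.set_attribute("gen_ai.response.finish_reasons", ["stop"])
    # Custom attributes that make the trace sliceable later
    span.set_attribute("prompt.template_id", "support.v3")
    span.set_attribute("prompt.version", "3")
    span.set_attribute("user.id", "u_123")
    span.set_attribute("feature.id", "support_agent")
    span.set_attribute("cost.usd", 0.0062)
```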
Should you sample?
Default answer: no. Capture 100% of traces in production. The objections to this default are usually wrong:
- "It's too expensive." A trace is hundreds of bytes to a few kilobytes in columnar storage. At Respan scale (80M requests/day), full-fidelity capture costs cents per million traces. The model bill is 1000× larger.
- "We have too much volume." Volume is what you instrument for. The whole point of observability is finding signal in volume.
- "We only care about errors." Most LLM failures are HTTP 200 — the response is technically successful but functionally wrong. You can't filter on HTTP status to find them.
When sampling is genuinely required (regulatory, vendor cost limits), prefer tail-based sampling: capture all errors, all slow traces, all traces that hit eval thresholds, plus a random sample of the rest. Never head-sample on input alone.
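If sampling is forced on you, a minimal sketch of that tail-based decision, assuming a hypothetical completed-trace summary object; the thresholds and the 10% keep rate are placeholders to tune.

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class CompletedTrace:
    """Hypothetical summary of a finished trace, available only after it completes."""
    error: bool
    duration_ms: float
    min_eval_score: Optional[float]

def keep_trace(t: CompletedTrace) -> bool:
    """Tail-based sampling: decide after the trace is complete, never on input alone."""
    if t.error:                          # keep every error
        return True
    if t.duration_ms > 10_000:           # keep every slow trace
        return True
    if t.min_eval_score is not None and t.min_eval_score < 3:
        return True                      # keep traces that tripped an eval threshold
    return random.random() < 0.10        # keep a random 10% of the healthy remainder
```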
How to instrument
Path 1: SDK (fastest)
```python
from openai import OpenAI
from respan import Respan

respan = Respan(api_key="...")
client = respan.wrap(OpenAI())

# Every call is now a span with full GenAI attributes
client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "..."}],
    metadata={
        "user_id": "u_123",
        "feature": "support_agent",
        "prompt_template_id": "support.v3",
    },
)
```
Path 2: OpenTelemetry-native
If you're already on OTel, point your existing exporter at Respan's OTLP endpoint. No code changes, full GenAI conventions support.
```bash
# Standard OTel SDK config — Respan accepts OTLP
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.respan.ai/v1/otlp
OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer ${RESPAN_API_KEY}"
```
Wrapping agent and tool spans
For multi-step agents, manually wrap each step so the trace tree mirrors your logical operation:
```python
with respan.span("support_agent.run") as run:
    run.set_attributes({"feature.id": "support", "ticket.id": ticket.id})

    with respan.span("retrieve") as retrieval:
        chunks = vector_db.search(ticket.text)
        retrieval.set_attribute("retrieved.count", len(chunks))

    with respan.span("draft_reply"):
        reply = client.chat.completions.create(...)

    with respan.span("send_email"):
        email_service.send(reply)
```
The trace now has four spans under support_agent.run and you can debug each step independently.
A real debugging story
One Respan customer shipped a prompt tweak on a Friday. Their support agent had been running smoothly — TTFT under 1.5s, eval scores stable. By Monday morning, one customer (out of thousands) reported that replies had gone "robotic and confused." Their PagerDuty was silent. Their Datadog dashboard was green. Cost looked fine.
They opened the trace for the complaining customer's last interaction. The trace tree showed something the dashboards couldn't: a tool call that had previously been a leaf was now being followed by a second LLM call to "interpret" the tool result, and the interpretation was sometimes hallucinating. The new prompt had subtly encouraged tool-result reflection that the agent didn't need.
Without traces, this would have been three days of bisecting prompts. With traces, it was 8 minutes. The fix was a one-line revert. The lesson is the lesson of the whole pillar: aggregate metrics tell you something is wrong; traces tell you what.
Common tracing mistakes
- Sampling early to "save cost." See above. Don't.
- Forgetting to instrument tools. Half the trace is missing if tool calls aren't spans. Tools fail more than LLMs do.
- Trace IDs not propagated across services. If your retrieval service runs in a different process, you need to pass the trace context (W3C traceparent) so spans connect. Otherwise you have two disconnected traces.
- Capturing PII without thinking. Decide upfront what to redact: emails, names, account numbers. Most platforms (including Respan) support redaction at ingest.
- One trace per LLM call. If the user-facing operation is "draft a reply," that's the trace. The two LLM calls inside are spans. Conflating them flattens the tree.
LLM tracing tools compared (May 2026)
Most LLM observability platforms include tracing as core. The differentiators are instrumentation model, OTel GenAI conventions depth, agent-waterfall quality, and whether tracing is bundled with evals + gateway + prompt management. Every cell below verified against the vendor's docs on 2026-05-16.
| Tool | Instrumentation | 100% capture default | OTel GenAI conventions | Agent waterfall | Self-host | Bundled evals + gateway |
|---|---|---|---|---|---|---|
| Respan | SDK + OTel + Proxy | Yes | Yes | Yes | Enterprise | Yes |
| Langfuse | SDK + OTel | Yes | Yes | Yes | OSS | Evals only |
| LangSmith | SDK (LangChain) | Yes | Partial | Yes | No | Evals only |
| Helicone | Proxy + SDK | Yes | Partial | Partial | OSS app | Gateway only |
| Braintrust | SDK | Yes | Partial | Yes | No | Evals only |
| Phoenix (Arize) | OTel | Yes | Yes | Yes | OSS | Evals only |
| Datadog LLM Observability | APM agent + OTel | Configurable | Yes | Partial | No | No |
Which tracing tool should you pick?
Pick the path that matches you.
- You're all-in on LangChain. Pick LangSmith. Native LCEL integration, evaluators tied to LangChain primitives, lowest friction if you're never leaving the ecosystem. Trade-off: lock-in to one framework.
- You want open-source self-hosted. Pick Langfuse or Phoenix. Langfuse has the more polished product, Phoenix is more academic / research-oriented. Both OTel-native.
- You already run Datadog or Honeycomb company-wide. Use that platform's LLM module — and ALSO export OTLP to a dedicated LLM platform like Respan or Langfuse for the LLM-specific UI. Generic APM gives you correlation; dedicated tools give you debugging.
- Tracing is the starting point but you'll need gateway + evals + prompt mgmt soon. Pick Respan. The four pillars share auth, billing, and a single trace surface — meaning your eval scores attach to gateway traces attach to prompt versions, all in one tree.
- You're eval-first, tracing is secondary. Pick Braintrust. Their tracing is solid but the strength of their product is the scoring functions library.
Failure modes you should plan for
Disconnected traces across services
Your retrieval service runs in Python, your orchestration in Node, your frontend triggers them both. If trace context (W3C traceparent) doesn't propagate across HTTP boundaries, you get two or three disconnected traces instead of one. Mitigation: pass the trace header explicitly on every service-to-service call; verify the spans connect in a real production trace before you call this done.
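A minimal sketch of explicit context propagation with the OpenTelemetry Python API; the service URL and handler shape are placeholders.

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("orchestrator")

# Caller side: inject the current trace context into the outgoing HTTP headers.
with tracer.start_as_current_span("call_retrieval_service"):
    headers = {}
    inject(headers)  # adds the W3C traceparent header to the dict
    resp = requests.post("https://retrieval.internal/search",
                         json={"query": "order status"}, headers=headers)

# Callee side (retrieval service): pick the context back up so spans join the same trace.
def handle_search(incoming_headers: dict, query: str):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("retrieve", context=ctx):
        ...  # retrieval spans now attach to the caller's trace
```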
PII captured into spans by accident
Default instrumentation captures input and output payloads verbatim. That includes any PII the user sent — emails, names, account numbers, sometimes credit cards. Decide upfront what to redact and configure the redaction at ingest. Respan does this with regex + LLM-based detection; LangSmith and Langfuse expose redaction APIs but require you to wire them in.
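A rough sketch of regex redaction applied before a payload is attached to a span; the patterns are deliberately simplistic and the account-ID format is hypothetical.

```python
import re

# Deliberately simple patterns; real redaction needs broader coverage (names, addresses, ...).
EMAIL   = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD    = re.compile(r"\b(?:\d[ -]?){13,16}\b")
ACCOUNT = re.compile(r"\bACCT-\d{6,}\b")   # hypothetical internal account-ID format

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = CARD.sub("[CARD]", text)
    text = ACCOUNT.sub("[ACCOUNT]", text)
    return text

# Redact before the payload ever leaves your process, then attach it to the span.
user_message = "Reach me at jane@example.com about card 4111 1111 1111 1111"
span_input = redact(user_message)   # "Reach me at [EMAIL] about card [CARD]"
```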
Trace storage cost spirals at scale
At 80M+ requests/day, full-fidelity trace capture can grow into terabytes/month if you don't compress payloads and prune attributes. Most platforms automate this; if yours doesn't, you'll find out at the next month's bill. Verify before you scale.
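If your platform doesn't prune for you, a minimal sketch of the kind of payload truncation you'd run before export; the character cap and the dropped attribute keys are assumptions to adjust for your own schema.

```python
MAX_PAYLOAD_CHARS = 4_000   # arbitrary cap; tune to your storage budget
DROP_ATTRIBUTES = {"debug.raw_request", "debug.raw_response"}   # hypothetical noisy keys

def prune_attributes(attributes: dict) -> dict:
    """Drop known-noisy keys and truncate oversized string payloads before export."""
    pruned = {}
    for key, value in attributes.items():
        if key in DROP_ATTRIBUTES:
            continue
        if isinstance(value, str) and len(value) > MAX_PAYLOAD_CHARS:
            value = value[:MAX_PAYLOAD_CHARS] + f"...[truncated {len(value) - MAX_PAYLOAD_CHARS} chars]"
        pruned[key] = value
    return pruned
```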
Production tracing checklist
- Capture 100% by default; revisit only if storage cost actually becomes a constraint.
- Use OpenTelemetry GenAI semantic conventions for all `gen_ai.*` attributes — it's the only portable choice.
- Add a custom attribute set: `prompt.template_id`, `prompt.version`, `user.id`, `feature.id`, `cost.usd`. Without these the trace can't be sliced by what matters.
- Verify W3C `traceparent` propagation across every service boundary with a real end-to-end test.
- Configure PII redaction at ingest before going live, not after the first audit.
- Tie eval scores to trace IDs so you can search "show me traces where faithfulness < 3."
- Retention: 30-90 days hot, longer cold storage for traces curated into eval datasets.
- Make sure tool calls are instrumented — they're the biggest blind spot in 80% of agent traces.
See Respan tracing in 5 minutes
OpenTelemetry-native. 100% capture default. Agent-aware waterfall with collapsible sub-agent recursion. Trace → eval → prompt version in one tree.
Related guides: LLM observability · LLM gateway · LLM evals