LLM tracing is the practice of capturing every step of an LLM-powered request — the prompt, completion, retrieval calls, tool invocations, intermediate decisions, latency, and cost — as a hierarchical trace with parent-child spans. It's the foundation of debugging multi-step agents in production. Without it, debugging is guesswork.
This is the short version. For the comprehensive treatment, see our LLM Tracing pillar guide.
TL;DR
A trace captures one logical operation end-to-end as a tree of spans. Root span = user request. Each step (retrieval, LLM call, tool call, agent decision) is a child span with timing, inputs, outputs, attributes, and a parent reference.
For a single-call LLM endpoint, tracing looks a lot like enriched logging. The discipline becomes essential the moment you're running multi-step agents — an agent that makes ten LLM calls and twenty tool calls per user action is impossible to debug from logs.
What's in a trace
Every trace contains:
- Root span — the top-level user-perceived operation
- Child spans — each step of work (retrieval, LLM call, tool invocation)
- Parent-child relationships — the tree structure that lets you reconstruct what happened
- Timing — start, duration, latency at each level
- Attributes — input prompt, output completion, model name, tokens, cost, eval scores
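As a rough illustration, a single traced request might be reconstructed into something like the following. The field names here are illustrative, not any particular platform's schema:

trace = {
    "trace_id": "a1b2c3",
    "root_span": {                              # the user-perceived operation
        "name": "POST /support/ask",
        "duration_ms": 4200,
        "children": [                           # each step of work, in order
            {"name": "retrieval.vector_search", "duration_ms": 180},
            {"name": "llm.chat", "duration_ms": 2900,
             "attributes": {"gen_ai.request.model": "gpt-5.4",
                            "gen_ai.usage.input_tokens": 1450,
                            "cost.usd": 0.012}},
            {"name": "tool.create_ticket", "duration_ms": 650},
        ],
    },
}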
For LLM-specific spans, the data follows OpenTelemetry's GenAI semantic conventions:
- gen_ai.system — provider (openai, anthropic, etc.)
- gen_ai.request.model — which model
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
- gen_ai.response.finish_reasons
- Custom attributes: prompt.template_id, eval.score, user.id, cost.usd
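A minimal sketch of attaching these attributes with the OpenTelemetry Python SDK (the values are placeholders):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm.chat") as span:
    # GenAI semantic-convention attributes
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-5.4")
    span.set_attribute("gen_ai.usage.input_tokens", 1450)
    span.set_attribute("gen_ai.usage.output_tokens", 212)
    span.set_attribute("gen_ai.response.finish_reasons", ["stop"])
    # Custom attributes layered on top
    span.set_attribute("prompt.template_id", "support_v3")
    span.set_attribute("user.id", "u_123")
    span.set_attribute("cost.usd", 0.012)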
Why tracing matters
Three things that break without it:
- Bad outputs have no trail. A user reports a hallucination. Without a trace, you have no record of what was asked, what context was retrieved, what model was called. You can't reproduce, isolate, or fix it.
- Multi-step agents are black boxes. Ten LLM calls + twenty tool calls per user action is impossible to debug from logs alone.
- Per-feature cost attribution is impossible. You can't tell which feature, customer, or prompt change is driving cost without traces.
OpenTelemetry GenAI conventions
In 2025-2026 the OpenTelemetry community standardized GenAI semantic conventions — a stable schema for LLM spans. Every modern LLM observability platform supports these. Your tracing should be OTel-native by default; vendor-specific formats create portability debt.
Sampling
Don't sample LLM traces in production. The rare failure modes — hallucinations, retries, runaway agent loops — are exactly what you most need traces of, and they're long-tail by definition. Capture 100%.
Common objections:
- "It's too expensive" — a trace is hundreds of bytes to a few kilobytes in columnar storage. The model bill is 1000× larger.
- "We have too much volume" — volume is what you instrument for.
- "We only care about errors" — most LLM failures are HTTP 200 (response succeeded but is wrong).
If sampling is genuinely required (e.g., vendor cost limits), use tail-based sampling: keep all errors, slow traces, and traces that trip eval thresholds, plus a random sample of the rest. Never head-sample, where the keep/drop decision is made before you know how the trace turned out.
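As an illustration of the policy (not any vendor's API), a tail-sampling decision applied once a trace has completed might look like this; the CompletedTrace summary type and its fields are hypothetical:

import random
from dataclasses import dataclass

@dataclass
class CompletedTrace:              # hypothetical summary of a finished trace
    has_error: bool
    duration_ms: float
    min_eval_score: float | None

def keep_trace(t: CompletedTrace) -> bool:
    """Tail-based sampling: decide only after the trace has completed."""
    if t.has_error:
        return True                        # keep every failure
    if t.duration_ms > 10_000:
        return True                        # keep slow traces
    if t.min_eval_score is not None and t.min_eval_score < 0.5:
        return True                        # keep traces that tripped an eval threshold
    return random.random() < 0.05          # 5% random sample of everything else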
How to start
Two paths to instrumentation:
SDK (fastest):
from respan import Respan
from openai import OpenAI
respan = Respan(api_key="...")
client = respan.wrap(OpenAI())
# Every call is now traced
client.chat.completions.create(
model="gpt-5.4",
messages=[{"role": "user", "content": "..."}],
metadata={"user_id": "u_123", "feature": "support_agent"},
)
OpenTelemetry-native (portable):
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.respan.ai/v1/otlp
OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer ${RESPAN_API_KEY}"
For multi-step agents, manually wrap each step in a span so the trace tree mirrors your logical operation.
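A minimal sketch with the OpenTelemetry Python SDK, assuming a hypothetical support agent; retrieve, call_model, and create_ticket stand in for your own steps:

from opentelemetry import trace

tracer = trace.get_tracer("support_agent")

def handle_request(question: str) -> str:
    # Root span: one user-facing operation, end to end
    with tracer.start_as_current_span("support_agent.request") as root:
        root.set_attribute("user.id", "u_123")

        with tracer.start_as_current_span("retrieval.vector_search"):
            context = retrieve(question)            # hypothetical retrieval helper

        with tracer.start_as_current_span("llm.chat") as llm_span:
            llm_span.set_attribute("gen_ai.request.model", "gpt-5.4")
            answer = call_model(question, context)  # hypothetical LLM-call helper

        with tracer.start_as_current_span("tool.create_ticket"):
            create_ticket(answer)                   # hypothetical tool helper

        return answer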
Common mistakes
- Sampling early to "save cost"
- Forgetting to instrument tools (they fail more than LLMs do)
- Trace IDs not propagated across services (see the propagation sketch after this list)
- Capturing PII without thinking about redaction
- One trace per LLM call when the user-facing operation is the actual unit
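For the cross-service case, a sketch using OpenTelemetry's propagation API; the service split, handler names, and URL are made up:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
import requests

tracer = trace.get_tracer("gateway")

# Calling service: inject the current trace context into outgoing headers
def call_agent_service(payload: dict):
    with tracer.start_as_current_span("call_agent_service"):
        headers = {}
        inject(headers)  # adds the W3C traceparent header to the carrier dict
        requests.post("https://agent.internal/run", json=payload, headers=headers)

# Receiving service: continue the same trace from the incoming headers
def handle_agent_run(headers: dict, payload: dict):
    ctx = extract(headers)
    with tracer.start_as_current_span("agent.run", context=ctx):
        ...  # agent work happens inside the propagated trace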
See the LLM Tracing pillar for the full treatment.
FAQ
What's the difference between LLM tracing and logging? Logs are flat events. Traces are causal — they tie events into a parent-child hierarchy that captures one logical operation end-to-end.
Should I use OpenTelemetry? Yes, by default. OTel GenAI conventions are stable and portable across vendors.
Should I sample? Mostly no. Capture 100% in production. Tail-based sampling if vendor costs force it.
How long should I keep traces? 30-90 days for hot search. Longer cold storage if you mine traces for evaluation datasets.
Which platform should I use? See 9 Best LLM Observability Tools in 2026.