LLM tracing is the practice of capturing every step of an LLM-powered request — the prompt, completion, retrieval calls, tool invocations, intermediate decisions, latency, and cost — as a hierarchical trace with parent-child spans. It's the foundation of debugging multi-step agents in production. Without it, debugging is guesswork.
This is the short version. For the comprehensive treatment, see our LLM Tracing pillar guide.
TL;DR
A trace captures one logical operation end-to-end as a tree of spans. Root span = user request. Each step (retrieval, LLM call, tool call, agent decision) is a child span with timing, inputs, outputs, attributes, and a parent reference.
For a single-call LLM endpoint, tracing looks a lot like enriched logging. The discipline becomes essential the moment you're running multi-step agents — an agent that makes ten LLM calls and twenty tool calls per user action is impossible to debug from logs.
What's in a trace
Every trace contains:
- Root span — the top-level user-perceived operation
- Child spans — each step of work (retrieval, LLM call, tool invocation)
- Parent-child relationships — the tree structure that lets you reconstruct what happened
- Timing — start, duration, latency at each level
- Attributes — input prompt, output completion, model name, tokens, cost, eval scores
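As a rough illustration, a single traced request might be reconstructed into something like the following. The field names here are illustrative, not any particular platform's schema:

trace = {
    "trace_id": "a1b2c3",
    "root_span": {                              # the user-perceived operation
        "name": "POST /support/ask",
        "duration_ms": 4200,
        "children": [                           # each step of work, in order
            {"name": "retrieval.vector_search", "duration_ms": 180},
            {"name": "llm.chat", "duration_ms": 2900,
             "attributes": {"gen_ai.request.model": "gpt-5.4",
                            "gen_ai.usage.input_tokens": 1450,
                            "cost.usd": 0.012}},
            {"name": "tool.create_ticket", "duration_ms": 650},
        ],
    },
}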
For LLM-specific spans, the data follows OpenTelemetry's GenAI semantic conventions:
- gen_ai.system — provider (openai, anthropic, etc.)
- gen_ai.request.model — which model
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
- gen_ai.response.finish_reasons
- Custom attributes: prompt.template_id, eval.score, user.id, cost.usd
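A minimal sketch of attaching these attributes with the OpenTelemetry Python SDK (the values are placeholders):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("llm.chat") as span:
    # GenAI semantic-convention attributes
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-5.4")
    span.set_attribute("gen_ai.usage.input_tokens", 1450)
    span.set_attribute("gen_ai.usage.output_tokens", 212)
    span.set_attribute("gen_ai.response.finish_reasons", ["stop"])
    # Custom attributes layered on top
    span.set_attribute("prompt.template_id", "support_v3")
    span.set_attribute("user.id", "u_123")
    span.set_attribute("cost.usd", 0.012)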
Why tracing matters
Three things that break without it:
- Bad outputs have no trail. A user reports a hallucination. Without a trace, you have no record of what was asked, what context was retrieved, what model was called. You can't reproduce, isolate, or fix it.
- Multi-step agents are black boxes. Ten LLM calls + twenty tool calls per user action is impossible to debug from logs alone.
- Per-feature cost attribution is impossible. You can't tell which feature, customer, or prompt change is driving cost without traces.
OpenTelemetry GenAI conventions
In 2025-2026 the OpenTelemetry community standardized GenAI semantic conventions — a stable schema for LLM spans. Every modern LLM observability platform supports these. Your tracing should be OTel-native by default; vendor-specific formats create portability debt.
Sampling
Don't sample LLM traces in production. The rare failure modes — hallucinations, retries, runaway agent loops — are exactly what you most need traces of, and they're long-tail by definition. Capture 100%.
Common objections:
- "It's too expensive" — a trace is hundreds of bytes to a few kilobytes in columnar storage. The model bill is 1000× larger.
- "We have too much volume" — volume is what you instrument for.
- "We only care about errors" — most LLM failures are HTTP 200 (response succeeded but is wrong).
If sampling is genuinely required (e.g., vendor cost limits), use tail-based sampling: keep all errors, slow traces, and traces that trip eval thresholds, plus a random sample of the rest. Never head-sample, where the keep/drop decision is made before you know how the trace turned out.
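As an illustration of the policy (not any vendor's API), a tail-sampling decision applied once a trace has completed might look like this; the CompletedTrace summary type and its fields are hypothetical:

import random
from dataclasses import dataclass

@dataclass
class CompletedTrace:              # hypothetical summary of a finished trace
    has_error: bool
    duration_ms: float
    min_eval_score: float | None

def keep_trace(t: CompletedTrace) -> bool:
    """Tail-based sampling: decide only after the trace has completed."""
    if t.has_error:
        return True                        # keep every failure
    if t.duration_ms > 10_000:
        return True                        # keep slow traces
    if t.min_eval_score is not None and t.min_eval_score < 0.5:
        return True                        # keep traces that tripped an eval threshold
    return random.random() < 0.05          # 5% random sample of everything else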
How to start
Two paths to instrumentation:
SDK (fastest):
from respan import Respan
from openai import OpenAI
respan = Respan(api_key="...")
client = respan.wrap(OpenAI())
# Every call is now traced
client.chat.completions.create(
model="gpt-5.4",
messages=[{"role": "user", "content": "..."}],
metadata={"user_id": "u_123", "feature": "support_agent"},
)
OpenTelemetry-native (portable):
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.respan.ai/v1/otlp
OTEL_EXPORTER_OTLP_HEADERS="authorization=Bearer ${RESPAN_API_KEY}"
For multi-step agents, manually wrap each step in a span so the trace tree mirrors your logical operation.
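A minimal sketch with the OpenTelemetry Python SDK, assuming a hypothetical support agent; retrieve, call_model, and create_ticket stand in for your own steps:

from opentelemetry import trace

tracer = trace.get_tracer("support_agent")

def handle_request(question: str) -> str:
    # Root span: one user-facing operation, end to end
    with tracer.start_as_current_span("support_agent.request") as root:
        root.set_attribute("user.id", "u_123")

        with tracer.start_as_current_span("retrieval.vector_search"):
            context = retrieve(question)            # hypothetical retrieval helper

        with tracer.start_as_current_span("llm.chat") as llm_span:
            llm_span.set_attribute("gen_ai.request.model", "gpt-5.4")
            answer = call_model(question, context)  # hypothetical LLM-call helper

        with tracer.start_as_current_span("tool.create_ticket"):
            create_ticket(answer)                   # hypothetical tool helper

        return answer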
Common mistakes
- Sampling early to "save cost"
- Forgetting to instrument tools (they fail more than LLMs do)
- Trace IDs not propagated across services (see the propagation sketch after this list)
- Capturing PII without thinking about redaction
- One trace per LLM call when the user-facing operation is the actual unit
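For the cross-service case, a sketch using OpenTelemetry's propagation API; the service split, handler names, and URL are made up:

from opentelemetry import trace
from opentelemetry.propagate import inject, extract
import requests

tracer = trace.get_tracer("gateway")

# Calling service: inject the current trace context into outgoing headers
def call_agent_service(payload: dict):
    with tracer.start_as_current_span("call_agent_service"):
        headers = {}
        inject(headers)  # adds the W3C traceparent header to the carrier dict
        requests.post("https://agent.internal/run", json=payload, headers=headers)

# Receiving service: continue the same trace from the incoming headers
def handle_agent_run(headers: dict, payload: dict):
    ctx = extract(headers)
    with tracer.start_as_current_span("agent.run", context=ctx):
        ...  # agent work happens inside the propagated trace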
See the LLM Tracing pillar for the full treatment.
FAQ
What's the difference between LLM tracing and logging? Logs are flat events. Traces are causal — they tie events into a parent-child hierarchy that captures one logical operation end-to-end.
Should I use OpenTelemetry? Yes, by default. OTel GenAI conventions are stable and portable across vendors.
Should I sample? Mostly no. Capture 100% in production. Tail-based sampling if vendor costs force it.
How long should I keep traces? 30-90 days for hot search. Longer cold storage if you mine traces for evaluation datasets.
Which platform should I use? See 9 Best LLM Observability Tools in 2026.