A RAG pipeline that has no observability is a black box that occasionally embarrasses you in front of customers. You know the answer was wrong, you usually can't tell whether the retriever or the generator caused it, and by the time you ask the user to reproduce the problem the cache has rotated and the bug is gone. This is the part of running a RAG system that benchmarks don't cover.
Observability is the operational answer. Not evaluation, which is "is the system good." Observability is "what is the system doing right now, why, and is something drifting." The two work together. Evaluation gives you scores. Observability gives you the trace tree the score is computed against, the dashboard that surfaces a regression, and the alert that fires before users complain. This piece is the instrumentation method we use at Respan, with the span schema, the dashboards, and the gotchas we have learned from instrumenting thousands of RAG pipelines.
TL;DR
- RAG observability has four telemetry layers: traces (per-call spans), metrics (aggregated counts and histograms), eval scores (online and offline), and alerts. You need all four. Skipping any one of them leaves a class of bug undetectable.
- Per-call spans for retrieval and generation must be separate. Combining them into one span hides the most common production bug, which is "the retriever returned bad chunks but the generator made the answer sound fine."
- Five dashboards cover roughly 90% of production debugging: traffic health, retrieval quality, generation quality, cost and latency, eval score trends.
- Running online evals on a 1 to 5% sample of traffic is the only way to catch drift on questions your golden set never covered. Wire judge scores back to traces so you can drill from "score dropped" to "here is the failing span."
- Pin everything. Model version, prompt version, embedding model version, judge model version, judge prompt version. Every silent upgrade is a future incident.
RAG observability vs RAG evaluation
These overlap and people use the terms interchangeably. They are different jobs.
RAG evaluation is "measure quality." You build a golden set of questions with labeled answers and labeled must-have chunks. You compute context recall, faithfulness, citation accuracy. The output is scores.
RAG observability is "see what is happening." You instrument every call. You capture every retrieval, every generation, every judge score. You aggregate into dashboards. You alert on regression. The output is operational visibility.
You need both. Evaluation without observability gives you a number that tells you nothing about which trace caused the score drop. Observability without evaluation gives you beautiful dashboards of garbage quality. The two compose into a working production loop: evaluation scores ride on top of observability traces, alerts fire on aggregated trends, debugging starts at the trace.
The four telemetry layers
Layer 1: Traces. Per-call spans. The root span is the user request. Child spans are retrieval, generation, judge, and any reranker or post-processor.
Layer 2: Metrics. Aggregated counts and histograms derived from spans. Tokens per request (p50, p99). Retrieval latency. Cost per session. Cache hit rate.
Layer 3: Eval scores. Faithfulness, citation accuracy, context recall, context precision. From online judges on sampled traffic and from offline runs on the golden set.
Layer 4: Alerts. Rolling-window thresholds on metrics and scores. Cost per query above $0.40. Faithfulness 7-day rolling average below 0.92. Retrieval latency p99 above 800ms.
The four layers compose. An alert fires. You drill to the metric that triggered it. You drill to the spans that drove the metric. You read the prompts and results in the spans. You find the bug.
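The drill-down from metric to span only works if metrics are derived from span data rather than logged separately. A minimal sketch of that idea, assuming spans are exported as plain dicts; the field names `trace_id`, `name`, and `latency_ms` and the helper functions are illustrative, not a Respan or OpenTelemetry schema:

```python
import math

# Hypothetical exported spans; field names are assumptions for illustration.
spans = [
    {"trace_id": "t1", "name": "retrieval", "latency_ms": 120},
    {"trace_id": "t2", "name": "retrieval", "latency_ms": 95},
    {"trace_id": "t3", "name": "retrieval", "latency_ms": 910},
    {"trace_id": "t4", "name": "retrieval", "latency_ms": 130},
    {"trace_id": "t1", "name": "generation", "latency_ms": 1800},
]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_metric(spans, span_name, pct):
    """Layer 2: aggregate one span type into a percentile latency metric."""
    return percentile([s["latency_ms"] for s in spans if s["name"] == span_name], pct)

def spans_behind(spans, span_name, threshold_ms):
    """Drill from the metric back to the trace IDs that drove it."""
    return [s["trace_id"] for s in spans
            if s["name"] == span_name and s["latency_ms"] > threshold_ms]

p50 = latency_metric(spans, "retrieval", 50)
p99 = latency_metric(spans, "retrieval", 99)
slow_traces = spans_behind(spans, "retrieval", 800)  # 800ms threshold from Layer 4
```

Because the metric and the drill-down read the same records, the alert on retrieval p99 leads directly to the trace IDs worth opening.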
The span schema
The single highest-ROI decision is the span schema. Get this wrong and you cannot debug anything. Get this right and most production debugging is a 5-minute trace read. The attributes below are illustrative custom names. With Respan, the equivalents are populated automatically through the SDK and gateway: input, output, model, usage, metadata, trace_id, parent_span_id, plus the scores field for eval results. See the span attributes reference for the full set.
import hashlib
import json

# Retrieval span (custom attributes if not using auto-instrumentation)
span.set_attribute("query", query)
span.set_attribute("embedding_model", "text-embedding-3-small")
span.set_attribute("embedding_version", "v2")
span.set_attribute("retriever", "pgvector")
span.set_attribute("k", 10)
span.set_attribute("reranker", "cohere-rerank-v3")
# IDs and scores only, serialized to strings; chunk contents are too large
span.set_attribute("retrieved_chunk_ids", json.dumps(chunk_ids))
span.set_attribute("retrieved_scores", json.dumps(scores))
span.set_attribute("latency_ms", retrieval_ms)

# Generation span
span.set_attribute("model", "claude-sonnet-4-6")
span.set_attribute("prompt_version", "rag-system-v8")
# Hash the rendered prompt bytes; the hex digest keeps the attribute a plain string
span.set_attribute("prompt_hash", hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest())
span.set_attribute("usage.prompt_tokens", usage.input_tokens)
span.set_attribute("usage.completion_tokens", usage.output_tokens)
span.set_attribute("usage.cached_tokens", usage.cache_read_input_tokens)
span.set_attribute("context_token_count", count_tokens(context))
span.set_attribute("answer", answer[:8192])  # truncate long answers
span.set_attribute("citation_count", len(citations))

# Judge span (attached async after the response is delivered)
span.set_attribute("judge_model", "claude-sonnet-4-6")
span.set_attribute("judge_prompt_version", "faithfulness-v3")
# Eval scores attach to the existing trace as a Score in Respan;
# for raw OTel, attach as span attributes:
span.set_attribute("score.faithfulness", 0.94)
span.set_attribute("score.citation_accuracy", 0.91)
span.set_attribute("score.context_relevance", 4.2)
span.set_attribute("score.answer_relevance", 0.88)

The non-obvious attributes that matter:
- Prompt hash on the generation span. When users start complaining about a regression, the first question is "did the rendered prompt actually change?" The hash answers that in one second.
- Retrieved chunk IDs and scores on the retrieval span. Not the chunk contents (too large) but the IDs and similarity scores. With these you can replay retrieval, diff against another session, and identify the broken chunk.
- Embedding model version separate from name. When you upgrade the embedding model and forget to re-embed the corpus, retrieval quality silently degrades because the query vector is now in a different space than the index vectors.
- Judge model and prompt version on the judge span. Identical to the LLM-as-judge pinning rule from RAG evaluation. Unpinned judges drift and invalidate trendlines.
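The attributes above earn their keep at debugging time. A minimal sketch of the three checks they enable, using only the standard library; the function names and sample values are hypothetical and not part of any SDK:

```python
import hashlib

def prompt_hash(rendered_prompt: str) -> str:
    """Stable fingerprint of the fully rendered prompt (template plus context)."""
    return hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()

def diff_retrieval(good_ids, bad_ids):
    """Compare the retrieved chunk IDs of a good session and a bad one."""
    good, bad = set(good_ids), set(bad_ids)
    return {
        "missing": sorted(good - bad),     # chunks the bad session lost
        "unexpected": sorted(bad - good),  # chunks only the bad session saw
    }

def embedding_mismatch(query_version: str, index_version: str) -> bool:
    """True when the query was embedded with a different version than the index."""
    return query_version != index_version

# Did the rendered prompt actually change between two sessions?
prompt_changed = prompt_hash("Answer using the context.") != prompt_hash("Answer using the context. ")

# Which chunks differ between a good session and a bad one?
chunk_diff = diff_retrieval(["c1", "c2", "c3"], ["c1", "c4", "c3"])

# Was the corpus re-embedded after the embedding model upgrade?
stale_index = embedding_mismatch("v2", "v1")
```

Each check reads only what the span already stores, which is why the IDs, hashes, and versions need to be on the span at write time.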
The five dashboards
A dashboard exists because someone needed to answer a question repeatedly. These are the five questions that come up most often.
Dashboard 1: Traffic health. Sessions per minute, error rate, retrieval timeouts, generation timeouts, token-budget overruns. The "is anything broken right now" view.
Dashboard 2: Retrieval quality. Context recall and precision on the golden set (refreshed daily). Online context relevance (LLM-judged) on a sampled production stream. Retrieved chunk count histogram. Rerank score distribution. Cache hit rate at the retrieval layer if you have one.
Dashboard 3: Generation quality. Faithfulness, citation accuracy, answer relevance trended over the last 30 days. Per-segment breakdowns (legal queries vs general chat vs API documentation queries). Drill-down to the spans behind low-scoring sessions.
Dashboard 4: Cost and latency. Cost per session. Token spend by component (system prompt, retrieved context, response). p50 and p99 latency for retrieval and for generation, separately. Cache hit rate at the provider prompt cache layer (see LLM cache layers).
Dashboard 5: Eval score trends. Weekly rolling averages of every eval metric. Anomaly detection on top. This is the dashboard you check on Monday morning before you do anything else.
If you build only one of these first, build dashboard 3. Generation quality is where users notice problems. The others are diagnostic. This one is symptomatic.
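Dashboard 5 mentions anomaly detection without naming a method. One simple option, shown here as an assumption rather than a description of what any platform does, is a z-score of the latest weekly average against the preceding weeks. The scores are hypothetical:

```python
from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it sits more than z_threshold standard deviations from the history mean."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu  # any change from a perfectly flat history is notable
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical weekly rolling averages of faithfulness
weekly_faithfulness = [0.94, 0.95, 0.93, 0.94, 0.95, 0.94]

drop_flagged = is_anomalous(weekly_faithfulness, 0.88)
normal_week_flagged = is_anomalous(weekly_faithfulness, 0.94)
```

A z-score is cheap and easy to explain, but it assumes roughly stable variance; seasonal traffic may need a more careful baseline.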
Online evals: the missing layer
Offline evaluation on a golden set answers "do my known examples still work." It does not answer "do the new questions users are asking work." The gap between the two is where most regressions hide.
The fix is online evaluation. Sample 1 to 5% of production traffic. Run the same judges (faithfulness, citation accuracy, context relevance) asynchronously after the user response is delivered. Attach the scores to the trace as judge spans. Aggregate into dashboard 3.
import asyncio
import random

ONLINE_EVAL_SAMPLE_RATE = 0.02  # 2% of production traffic

# Hold references so pending judge tasks are not garbage-collected mid-flight
_background_tasks: set[asyncio.Task] = set()

async def rag_call_with_eval(query: str):
    response, trace_id = await rag_pipeline(query)
    if random.random() < ONLINE_EVAL_SAMPLE_RATE:
        # Fire and forget: the user response never waits on the judge
        task = asyncio.create_task(judge_and_attach(trace_id, query, response))
        _background_tasks.add(task)
        task.add_done_callback(_background_tasks.discard)
    return response

async def judge_and_attach(trace_id, query, response):
    # Pull the retrieved context from the trace store, judge, attach scores
    context = await trace_store.get_retrieval_context(trace_id)
    scores = await run_judges(query, context, response)
    await trace_store.attach_judge_span(trace_id, scores)

Three rules for the online eval pipeline.
- Asynchronous. Never block the user response on a judge call. Run after the user has their answer.
- Sampled, not full. Even at $0.001 per judge call, 100% sampling on a million-request-per-day system is $1000 per day. 2% is usually enough to see drift.
- Stratified. Don't sample uniformly. Oversample the segments that matter most (high-stakes queries, new product surfaces, freshly deployed prompt versions).
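The sampling and cost rules can be sketched together. This version swaps `random.random()` for a hash of the trace ID so the sampling decision is reproducible, which is a design choice of this sketch and not something the pipeline above requires. The segment names and per-segment rates are hypothetical; judging every errored response follows the FAQ guidance further down:

```python
import hashlib

# Hypothetical per-segment sample rates; segment names are illustrative.
SAMPLE_RATES = {"default": 0.02, "legal": 0.10, "new_prompt_version": 0.25}

def sample_rate(segment: str, errored: bool = False) -> float:
    """Oversample the segments that matter; always judge errored responses."""
    if errored:
        return 1.0
    return SAMPLE_RATES.get(segment, SAMPLE_RATES["default"])

def should_judge(trace_id: str, segment: str, errored: bool = False) -> bool:
    """Deterministic sampling: map the trace ID to [0, 1) and compare to the rate."""
    rate = sample_rate(segment, errored)
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

def daily_judge_cost(requests_per_day: int, rate: float, cost_per_call: float) -> float:
    """Expected daily spend for one judge call per sampled request."""
    return requests_per_day * rate * cost_per_call

full_cost = daily_judge_cost(1_000_000, 1.0, 0.001)      # 100% sampling
sampled_cost = daily_judge_cost(1_000_000, 0.02, 0.001)  # 2% sampling
```

At the figures quoted above, 100% sampling costs about $1000 per day and 2% sampling about $20 per day, per judge. Running three judges per sampled request triples both numbers.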
Alerts that actually fire
A dashboard nobody looks at is not observability. The trick is alerting only on signals that mean "go fix something now." Three classes of alert worth setting up.
Hard regressions on rolling averages. The 7-day rolling average of faithfulness drops below your floor (we use 0.90 for general chat, 0.96 for any regulated workload). Citation accuracy drops below 0.92. Context recall on the golden set drops more than 5% week-over-week.
Cost outliers. Cost per session p99 spikes above your budget by 3x. Often the first signal that an agent loop has gone recursive or a prompt change blew the context window.
Silent shape changes. Cache hit rate at any layer drops 30% in 24 hours. Token mix (cached vs uncached input) inverts. These almost always mean someone shipped a change that broke a cache key.
Alerts that are not worth setting up, by contrast: absolute thresholds that look reasonable in development but never trip in production, and per-call alerts that fire 1000 times per day and get muted.
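Two of the alert classes above reduce to a few lines of arithmetic. A minimal sketch, assuming daily scores arrive as a list ordered oldest to newest. The surface names are hypothetical, and reading "drops 30%" as a relative drop is this sketch's interpretation:

```python
from statistics import mean

# Floors from the text: 0.90 for general chat, 0.96 for regulated workloads.
FAITHFULNESS_FLOORS = {"general_chat": 0.90, "regulated": 0.96}

def rolling_average(daily_scores, window=7):
    """Mean of the most recent `window` daily scores."""
    return mean(daily_scores[-window:])

def faithfulness_alert(surface, daily_scores, window=7):
    """Fire when the rolling average drops below the floor for this surface."""
    if len(daily_scores) < window:
        return False  # not enough history; avoid noisy early alerts
    return rolling_average(daily_scores, window) < FAITHFULNESS_FLOORS[surface]

def cache_hit_rate_alert(rate_24h_ago, rate_now, max_relative_drop=0.30):
    """Fire when the cache hit rate fell by at least 30% (relative) in 24 hours."""
    if rate_24h_ago == 0:
        return False  # nothing to drop from
    return (rate_24h_ago - rate_now) / rate_24h_ago >= max_relative_drop

# Hypothetical week of daily faithfulness scores, averaging 0.93
daily = [0.95, 0.94, 0.93, 0.93, 0.92, 0.92, 0.92]

chat_alert = faithfulness_alert("general_chat", daily)
regulated_alert = faithfulness_alert("regulated", daily)
cache_alert = cache_hit_rate_alert(0.60, 0.40)
```

The same week of scores is healthy for general chat and an incident for a regulated workload, which is the case for keeping floors per surface.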
Common gotchas
Mistakes we have seen, ranked:
- One span for the whole RAG call. Hides the retrieval-vs-generation split, which is the single most useful diagnostic in RAG debugging.
- No prompt hash on generation spans. Makes regression debugging guesswork.
- Storing retrieved chunk contents instead of IDs. Storage costs balloon. IDs are what you actually need for replay.
- Caching judge scores forever. When the judge model or prompt changes, your cached scores are wrong. Cache keys must include judge version.
- No segment breakdowns. Aggregate faithfulness of 0.93 hides "0.98 on FAQ, 0.71 on legal." Always break out by intent or category.
- Alerts on a single absolute threshold. A 0.85 faithfulness might be fine for a creative writing assistant and catastrophic for a clinical assistant. Set thresholds per surface.
- Skipping the judge span. Eval scores in a separate database are useless. They have to be on the trace so you can drill from "score dropped" to "here is the failing trace."
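The segment-breakdown gotcha is easy to reproduce with numbers. A small sketch with hypothetical scored traces, constructed so the aggregate matches the 0.93 example above; the `segment` and `faithfulness` field names are assumptions:

```python
from collections import defaultdict
from statistics import mean

def faithfulness_by_segment(scored_traces):
    """Group judge scores by segment and average each group."""
    buckets = defaultdict(list)
    for trace in scored_traces:
        buckets[trace["segment"]].append(trace["faithfulness"])
    return {segment: round(mean(scores), 2) for segment, scores in buckets.items()}

# Hypothetical sample: 22 FAQ traces and 5 legal traces
scored_traces = (
    [{"segment": "faq", "faithfulness": 0.98}] * 22
    + [{"segment": "legal", "faithfulness": 0.71}] * 5
)

overall = round(mean(t["faithfulness"] for t in scored_traces), 2)
per_segment = faithfulness_by_segment(scored_traces)
```

Here the overall average is 0.93 while the legal segment sits at 0.71. The high-volume segment dominates the aggregate, so the weak segment stays invisible until the breakdown is on the dashboard.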
Wiring it together
Respan auto-instruments retrieval, generation, and judge calls through its SDK and gateway. The @workflow and @task decorators capture parent-child structure, vector-DB integrations (Pinecone, Qdrant, Weaviate, Milvus, ChromaDB, LanceDB, Marqo) auto-instrument retrieval, and the "trace and evaluate a RAG pipeline" cookbook walks through the wiring. Eval scores attach to the same span data via online eval automations. If you are building your own, OpenTelemetry plus Grafana is a reasonable starting stack. The data model is what matters. See "LLM observability" for the broader picture and "LLM workflows and tracing" for the trace tree shape across patterns.
For the evaluation layer that sits on top, RAG evaluation covers the six metrics and the judge code. For the cache layers that affect cost dashboards, LLM cache layers covers what to instrument at each layer.
FAQ
What's the difference between RAG observability and RAG evaluation? Evaluation measures quality (the score). Observability captures behavior (the traces, metrics, dashboards, and alerts). The score rides on top of the traces. You need both.
Do I need OpenTelemetry or a hosted observability platform? OTel for the SDK layer is the safest bet. Pick a backend after you understand which attributes you actually query. Hosted RAG-aware platforms (Respan, Phoenix, Confident AI) save weeks if your team is small. Roll your own if you have specialist infrastructure needs.
How much production traffic should I sample for online evals? 1 to 5% is the typical range. Go lower for very high-traffic systems and higher for low-traffic ones. Sample 100% of errored responses regardless of the base rate.
What's the cheapest judge to use for online evals? Claude Haiku 4.5 or GPT-5-mini are both reasonable for the qualitative metrics. Run a calibration where you score 50 examples with both the cheap judge and a stronger one (Sonnet 4.6 or GPT-5) and check correlation. If it stays above 0.85, use the cheap judge and bank the cost savings.
How do I detect retrieval problems specifically? Track context recall on the golden set (offline) plus context relevance (online, LLM-judged). A drop in recall means missing chunks. A drop in relevance means irrelevant chunks ranking too high. Different fixes for each.
Should I trace embedding calls separately? Yes, if you have a reranker or a multi-stage retrieval pipeline. The embedding latency and cost are usually small, but the model version is critical to track.
What is the most common bug RAG observability catches that ad-hoc logging misses? Silent retrieval degradation. The retriever keeps returning chunks. The chunks are increasingly irrelevant because the corpus has drifted or the embedding model was upgraded. The generator still produces a confident-sounding answer. Without per-span context relevance scoring you would never see the drift until users complain.
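The judge calibration described in the FAQ (score the same examples with a cheap and a strong judge, then check correlation) can be sketched as follows. The FAQ does not say which correlation to use; Pearson is assumed here, and a rank correlation such as Spearman would be an equally reasonable choice. The scores are hypothetical and use 4 examples for brevity where the FAQ suggests 50:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

def cheap_judge_is_good_enough(cheap_scores, strong_scores, min_corr=0.85):
    """Apply the 0.85 threshold from the FAQ."""
    return pearson(cheap_scores, strong_scores) >= min_corr

# Hypothetical calibration scores for the same examples
cheap = [0.90, 0.80, 0.70, 0.95]
strong_agrees = [0.92, 0.78, 0.72, 0.94]
strong_disagrees = [0.70, 0.90, 0.80, 0.75]

use_cheap = cheap_judge_is_good_enough(cheap, strong_agrees)
reject_cheap = not cheap_judge_is_good_enough(cheap, strong_disagrees)
```

Rerun the calibration whenever either judge model or judge prompt version changes, since the pinning rule applies to the calibration result too.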