Most RAG evaluation guides explain the metrics. That part is easy. A one-paragraph definition of faithfulness or context recall is in every framework's docs and every conference talk. What kills production RAG systems is the part nobody writes about: a retriever that's "correct" on paper but slow enough to lose users, a judge model that drifts when Anthropic ships a new Sonnet, a 0.95 faithfulness score that hides the fact that 30% of your traffic is now asking out-of-domain questions you never built a golden set for.
This is the method we run at Respan across roughly 80M LLM requests per day. Six metrics, split across the two real failure surfaces. A golden set built from production logs rather than synthetic generators. Reproducible LLM-as-judge code with the gotchas labeled. An online-eval loop that catches the failure modes the offline run can't see.
TL;DR
- A RAG system has two failure surfaces: retrieval (did we find the right context?) and generation (did the model use it correctly?). Measure them separately, fix them separately.
- The 6 metrics worth measuring: context recall, context precision, context relevance, faithfulness, answer relevance, citation accuracy.
- Build a 100 to 300 question golden set from real production logs. Synthetic questions are a backstop, never the foundation. Refresh quarterly.
- LLM-as-judge is the standard for the qualitative metrics. Pin the judge model and pin the prompt. Drift here silently invalidates six months of trendlines.
- Run offline evals on every retriever or prompt change. Run online evals on a 1 to 5% sample of production traffic. The two catch different failure modes.
- Wire eval results into your LLM observability stack so score drops correlate to deploys, segments, and traffic shifts. A score in a notebook is interesting once. A score on a dashboard is operational.
The two failure surfaces
A RAG call is, at minimum, three steps. Embed the query. Retrieve top-k chunks. Ask the LLM to answer using those chunks. Failures cluster cleanly into two surfaces:
| Surface | What goes wrong | What users say |
|---|---|---|
| Retrieval | Wrong chunks, missing chunks, irrelevant chunks ranked high | "It says it can't find this but I know it's in the docs" |
| Generation | Hallucination despite good context, citation errors, the model ignoring what was retrieved | "The cited paragraph doesn't actually say that" |
This split matters because the fixes are different. Bad retrieval means re-chunking, switching embedding model, or adding a reranker. Bad generation means prompt changes, model swap, or stricter output constraints. If you measure end-to-end answer quality only, you will spend a week tuning the wrong layer. We have watched it happen.
Metric 1: Context Recall (retrieval)
Did we retrieve all the chunks needed to answer the question?
For each question in your golden set, label the chunks that contain the answer. Call them must-haves. Context recall is the fraction of must-haves that appear in the retrieved top-k.
context_recall = |retrieved ∩ must_haves| / |must_haves|
Recall under 0.8 is the single most common reason RAG systems "feel dumb." The model never had a chance. The right chunk was never in its context window in the first place.
Track recall at k=5, k=10, and k=20. If recall jumps a lot between k=10 and k=20, your reranker is underperforming and a larger candidate pool fixes it. If recall barely moves, the problem is upstream in the embedding model or the chunking strategy.
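A minimal sketch of the recall-at-k check described above. It assumes chunks are identified by string IDs and the retriever returns them in ranked order; the function name and sample IDs are hypothetical.

```python
def context_recall_at_k(retrieved_ids, must_have_ids, k):
    """Fraction of must-have chunks that appear in the top-k retrieved chunks."""
    must_haves = set(must_have_ids)
    if not must_haves:
        raise ValueError("each golden-set question needs at least one must-have chunk")
    top_k = set(retrieved_ids[:k])
    return len(top_k & must_haves) / len(must_haves)


# Hypothetical ranked retriever output and labels for one golden-set question.
retrieved = ["c7", "c2", "c9", "c4", "c1", "c8", "c3", "c5", "c6", "c0"]
must_haves = ["c2", "c3", "c11"]

recall_by_k = {k: context_recall_at_k(retrieved, must_haves, k) for k in (5, 10, 20)}
```

In this toy case recall rises from 1/3 at k=5 to 2/3 at k=10 and then stays flat, because `c11` is never retrieved at any depth. That flat tail is the "problem is upstream" pattern.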
Metric 2: Context Precision (retrieval)
Of the chunks we retrieved, what fraction were actually useful?
Same labels, flipped denominator:
context_precision = |retrieved ∩ must_haves| / |retrieved|
Low precision means the context window is full of junk. That wastes tokens, raises cost and latency, and gives the model material to hallucinate from. Most teams ignore precision because the LLM "figures it out." It does, until the irrelevant chunk happens to contradict the right answer. Then you have a confident wrong response and no obvious reason for it.
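The same labels give precision with a few lines of code. This is a sketch under the same assumptions as the formula above (string chunk IDs, set-based matching); the sample IDs are made up.

```python
def context_precision(retrieved_ids, must_have_ids):
    """Fraction of retrieved chunks that are labeled must-haves."""
    retrieved = set(retrieved_ids)
    if not retrieved:
        return 0.0  # nothing retrieved, so nothing useful was retrieved
    return len(retrieved & set(must_have_ids)) / len(retrieved)


# Hypothetical top-5 for one question: only one of five chunks is a must-have.
precision = context_precision(["c7", "c2", "c9", "c4", "c1"], ["c2", "c3"])
```

Here `precision` comes out at 0.2: four of the five chunks in the context window are paid for and unused.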
Metric 3: Context Relevance (retrieval, LLM-judged)
Are the retrieved chunks topically relevant to the question, even if not strictly must-haves?
Where recall and precision use a labeled gold set, relevance uses an LLM judge to score each retrieved chunk on a 1 to 5 scale for topical fit. It is cheaper to scale across thousands of production traces. It is less precise than labels. Use it to monitor live traffic where you don't have golden labels yet.
Metric 4: Faithfulness (generation)
Does the answer only make claims that are supported by the retrieved context?
For each claim in the answer, an LLM judge checks whether the cited or implied context entails it. Faithfulness is the fraction of claims that pass.
faithfulness = |claims_supported_by_context| / |total_claims|
This is the metric users care about most under the brand name "hallucination." Here is the part the marketing pages don't tell you: a faithfulness score of 0.95 still means 1 in 20 claims in your answers is unsupported. For regulated industries (legal, healthcare, finance) that ratio is unshippable. You don't need a metric, you need a gate. See our legal AI hallucination and clinical AI hallucination deep-dives for what those gates look like in code.
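To make the metric-versus-gate distinction concrete, here is one possible shape for a gate. It is a sketch, not the implementation from the deep-dives. It assumes the judge returns a dict with a `claims` list whose entries carry a boolean `supported` field; the fallback message and function names are hypothetical.

```python
FALLBACK = "I can't verify an answer to that from the available documents."


def passes_faithfulness_gate(judgment: dict) -> bool:
    """True only if the judge extracted claims and every one is supported."""
    claims = judgment.get("claims", [])
    # No extracted claims means nothing was verified, so fail closed.
    return bool(claims) and all(c.get("supported") is True for c in claims)


def gated_answer(answer: str, judgment: dict) -> str:
    """Return the answer if it passes the gate, otherwise a safe fallback."""
    return answer if passes_faithfulness_gate(judgment) else FALLBACK


# Hypothetical judgment: 19 of 20 claims supported, i.e. a 0.95 score.
judgment = {
    "claims": [{"claim": f"claim {i}", "supported": i != 7} for i in range(20)],
    "faithfulness_score": 0.95,
}
result = gated_answer("draft answer", judgment)
```

The 0.95 score would look healthy on a dashboard. The gate still blocks the response, because one unsupported claim is enough.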
Metric 5: Answer Relevance (generation)
Does the answer actually address the user's question?
A faithful answer can still be irrelevant. Faithfulness asks: "is what's said true, given the docs?" Relevance asks: "is what's said responsive to what was asked?" The first is about truth. The second is about whether you answered.
Standard implementation. An LLM judge generates 3 to 5 alternative questions that the given answer would plausibly answer. Cosine similarity of those alternatives back to the original question gives a relevance score. If the answer "would also answer" five unrelated questions, it is evasive or off-topic.
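The scoring half of that implementation can be sketched as below. The embedding vectors are assumed to come from whatever embedding model you already use, and averaging the similarities is our assumption about how to aggregate them; the toy two-dimensional vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(question_vec, alternative_vecs):
    """Mean similarity of the judge-generated questions to the original question."""
    if not alternative_vecs:
        raise ValueError("need at least one generated alternative question")
    sims = [cosine(question_vec, v) for v in alternative_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one alternative matches the question, one is only partly related.
on_topic = answer_relevance([1.0, 0.0], [[1.0, 0.0], [0.6, 0.8]])
off_topic = answer_relevance([1.0, 0.0], [[0.0, 1.0]])
```

An answer whose reverse-engineered questions all point away from the original question scores near zero, which is the evasive or off-topic case.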
Metric 6: Citation Accuracy (generation)
When the answer cites a source, does that source actually contain the cited claim?
This is the metric that catches the worst class of production failure: confident citations to the wrong paragraph. Users trust cited answers more than uncited ones. So a wrong citation does more damage than no citation. It is the failure mode that turns "the model made a mistake" into "the model is lying with my product."
For each citation in the answer, an LLM judge checks whether the cited chunk supports the specific claim it is attached to. Anything below 0.95 here is a product risk, not just a quality metric. See citation grounding evals for the implementation gotchas.
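A rough sketch of the aggregation, with the LLM judge passed in as a callable so the scoring logic can be tested without an API call. The `(claim, chunk_id)` pair format and the substring stand-in judge are assumptions for illustration only.

```python
def citation_accuracy(citations, chunks, supports):
    """Fraction of citations whose cited chunk supports the attached claim.

    citations: list of (claim, chunk_id) pairs extracted from the answer
    chunks:    dict mapping chunk_id to chunk text
    supports:  callable(claim, chunk_text) -> bool, normally an LLM judge
    """
    if not citations:
        return None  # no citations to score; do not report this as 1.0
    correct = 0
    for claim, chunk_id in citations:
        text = chunks.get(chunk_id)
        # A citation to a chunk that was never retrieved counts as wrong.
        if text is not None and supports(claim, text):
            correct += 1
    return correct / len(citations)


# Stand-in judge for the example only: naive substring match.
def toy_judge(claim, text):
    return claim.lower() in text.lower()


chunks = {"c1": "Refunds take 5 days to process.", "c2": "Shipping is free."}
citations = [("refunds take 5 days", "c1"), ("refunds take 5 days", "c2")]
score = citation_accuracy(citations, chunks, toy_judge)
```

The second citation attaches a true claim to the wrong paragraph, so the score is 0.5 even though nothing in the answer is false.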
Building the golden set
You need 100 to 300 questions with labeled answers and labeled must-have chunks. Sources, in order of value:
- Real production queries. Pull a stratified sample from the last 30 to 90 days. Stratify by intent (factual lookup, summarization, multi-doc reasoning) so the set reflects actual traffic distribution.
- Customer support tickets. Queries that ended in human escalation are exactly the queries you want to test against. They are the hard ones, by definition.
- Edge cases from incident reports. Every hallucination ticket or wrong-answer report becomes a golden-set entry the next time you refresh.
- Synthetic questions generated from your corpus. Backstop only, when sources 1 to 3 don't cover a topic. Mark synthetic entries with a different tag. They tend to be too easy and they inflate your scores.
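Stratifying the production sample by intent might look like the sketch below. It assumes each logged query already carries an `intent` label (from a classifier or manual tagging), and the field names are hypothetical. Quotas are proportional and rounded, so the total can be off by a few from the target.

```python
import random
from collections import defaultdict


def stratified_sample(queries, n, seed=0):
    """Sample about n queries, keeping each intent's share of traffic."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_intent = defaultdict(list)
    for q in queries:
        by_intent[q["intent"]].append(q)
    sample = []
    for intent, group in sorted(by_intent.items()):
        # Proportional quota, with at least one query per intent.
        quota = max(1, round(n * len(group) / len(queries)))
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample


# Hypothetical log: 70% factual lookup, 20% summarization, 10% multi-doc.
logs = (
    [{"intent": "factual_lookup", "text": f"q{i}"} for i in range(700)]
    + [{"intent": "summarization", "text": f"q{i}"} for i in range(200)]
    + [{"intent": "multi_doc", "text": f"q{i}"} for i in range(100)]
)
golden_candidates = stratified_sample(logs, n=100)
```

The output is a candidate list. Each entry still needs a labeled answer and labeled must-have chunks before it counts as a golden-set question.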
Refresh quarterly. Production query distribution shifts. New features land, new user segments arrive, and a stale golden set will pass while real users fail. A team I will not name shipped a 0.94 faithfulness score for six straight months while their support queue filled with complaints. The golden set had stopped reflecting reality back in February.
One common mistake worth calling out: treating the golden set as something you only run on model changes. The golden set is also how you validate retrieval changes, chunking changes, embedding model swaps, and reranker upgrades. Run it on every meaningful pipeline change, not only prompt edits.
LLM-as-judge in Python
The minimum viable judge. Pin the model. Version the prompt. Cache the responses. Log the reasoning, not just the score.
```python
import json
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Pin both. A floating model alias or an unversioned prompt breaks trendlines.
JUDGE_MODEL = "claude-sonnet-4-6"
JUDGE_PROMPT_VERSION = "faithfulness-v3"

FAITHFULNESS_PROMPT = """You are evaluating whether an AI assistant's answer is faithful to the provided context.
Context:
{context}
Question:
{question}
Answer:
{answer}
For each claim made in the answer, decide if it is supported by the context.
Return JSON: {{"claims": [{{"claim": str, "supported": bool, "reason": str}}], "faithfulness_score": float}}
faithfulness_score is the fraction of claims that are supported. Be strict. Partial support counts as unsupported.
"""


def judge_faithfulness(question: str, context: str, answer: str) -> dict:
    resp = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": FAITHFULNESS_PROMPT.format(
                context=context, question=question, answer=answer
            ),
        }],
    )
    text = resp.content[0].text
    # The model may wrap the JSON in prose, so parse from the first "{" to the last "}".
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("judge response contained no JSON object")
    return json.loads(text[start:end])
```

Four things that look minor and bite hard:
- Pin the judge model version. `claude-sonnet-4-6`, not `claude-sonnet-latest`. The day Anthropic ships sonnet-4-7 your scores will drift and you will lose a week looking for a prompt regression that doesn't exist.
- Pin the judge prompt. Treat the judge prompt like a model release. Version it (see `JUDGE_PROMPT_VERSION` above). Log which version judged which run. When you tweak the prompt to be more lenient, last quarter's numbers are no longer comparable.
- Cache responses keyed on `(question, context, answer, judge_model, prompt_version)`. When you re-run the golden set you will judge millions of identical inputs across the year. Caching is the difference between a $5 eval and a $500 eval.
- Log the judge's reasoning, not only the score. When a score drops, you need to read why. Otherwise you cannot tell whether the model got worse or the judge got nitpicky.
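The caching point is easy to get subtly wrong, so here is a minimal sketch. It wraps any judge callable that takes `(question, context, answer)` and keys the cache on those inputs plus the judge model and prompt version. The in-memory dict and the class name are assumptions; a real setup would likely use a persistent store.

```python
import hashlib
import json


def cache_key(question, context, answer, judge_model, prompt_version):
    """Stable hash over everything that can change the judge's verdict."""
    payload = json.dumps(
        [question, context, answer, judge_model, prompt_version],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedJudge:
    """Wraps a judge function so identical inputs are only judged once."""

    def __init__(self, judge_fn, judge_model, prompt_version):
        self.judge_fn = judge_fn
        self.judge_model = judge_model
        self.prompt_version = prompt_version
        self._store = {}  # in-memory only; swap for a persistent store in practice

    def __call__(self, question, context, answer):
        key = cache_key(
            question, context, answer, self.judge_model, self.prompt_version
        )
        if key not in self._store:
            self._store[key] = self.judge_fn(question, context, answer)
        return self._store[key]
```

Because the model and prompt version are part of the key, bumping either one produces new keys and old verdicts are never reused by accident.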
Wiring eval results into observability
A score in a notebook is interesting once. A score on a dashboard is operational. Every eval run (offline and online) should write to the same observability backend as your production traces. That gives you:
- Correlation with deploys. Score drops timestamped against the deploy that caused them. A 6% drop in faithfulness at 14:32 on Tuesday is a different story when you can see the prompt change that shipped at 14:30.
- Per-segment breakdowns. Faithfulness on legal queries versus casual chat. If only one segment regressed, the fix is targeted.
- Trend over time. Slow drift in answer relevance is invisible from a single run. Obvious in a six-month trendline.
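As a rough illustration of the per-segment breakdown, assume each eval result is written as a flat record carrying the metric name, a segment label, the score, and the deploy that was live. The field names and values are hypothetical, not a specific backend's schema.

```python
from collections import defaultdict


def mean_score_by_segment(records, metric):
    """Average one metric per traffic segment."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        if r["metric"] == metric:
            sums[r["segment"]] += r["score"]
            counts[r["segment"]] += 1
    return {segment: sums[segment] / counts[segment] for segment in sums}


# Hypothetical eval records from one day of online evals.
records = [
    {"metric": "faithfulness", "segment": "legal", "score": 0.80, "deploy": "d42"},
    {"metric": "faithfulness", "segment": "legal", "score": 0.90, "deploy": "d42"},
    {"metric": "faithfulness", "segment": "chat", "score": 1.00, "deploy": "d42"},
    {"metric": "answer_relevance", "segment": "chat", "score": 0.70, "deploy": "d42"},
]
by_segment = mean_score_by_segment(records, "faithfulness")
```

The overall faithfulness average here is 0.9. The breakdown shows the gap is entirely in the legal segment.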
Respan models every LLM interaction as a span with input, output, model, metrics, and metadata, and groups spans into traces and threads. Online eval scores attach to the same span data via the online evaluation automations, so a failing eval links directly to the span that produced the bad answer. You read the trace tree and the failure mode is right there. The trace and evaluate a RAG pipeline cookbook walks through this end to end. See LLM observability for the wider picture.
Online evals: a 1 to 5% sample of production traffic
Offline evals on a golden set answer one question: do my known examples still work after a change? They don't tell you whether the new questions your users are asking work.
The fix is online evals. Judge a small random sample of production traffic with the same LLM-as-judge metrics you run offline. Typical setup:
- Sample 1 to 5% of production RAG calls. More for low-traffic apps, less for high-traffic.
- After the user response is delivered, asynchronously run the judges on the trace.
- Attach the resulting scores to the trace as observability attributes.
- Alert on rolling averages, for example faithfulness 7-day rolling average dropping below 0.92.
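The sampling and alerting steps above can be sketched as follows. Hash-based sampling is one option we assume here because it is deterministic per trace; the class and function names are hypothetical, and the rolling window assumes scores arrive in timestamp order.

```python
import hashlib
from collections import deque

SEVEN_DAYS = 7 * 24 * 60 * 60  # window length in seconds


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


class RollingAverage:
    """Time-windowed mean; assumes timestamps are non-decreasing."""

    def __init__(self, window_seconds=SEVEN_DAYS):
        self.window = window_seconds
        self.items = deque()  # (timestamp, score) pairs

    def add(self, timestamp, score):
        self.items.append((timestamp, score))
        # Drop scores that have aged out of the window.
        while self.items[0][0] <= timestamp - self.window:
            self.items.popleft()
        return sum(s for _, s in self.items) / len(self.items)


def should_alert(rolling_mean, threshold=0.92):
    """Alert when the rolling faithfulness average drops below the threshold."""
    return rolling_mean < threshold
```

Hashing the trace ID means the same trace is always in or out of the sample, so a retried judge job never double-counts.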
What this catches that offline evals don't: distribution shift. A new product surface routes a different kind of question to your RAG pipeline. The golden set never covered it. Faithfulness craters on that segment and the offline numbers stay green for weeks while support tickets pile up.
Common gotchas
Mistakes we have seen, ranked by how often they happen:
- Evaluating end-to-end only. You will never know if the retriever or the generator is broken. Split.
- Tiny golden set. 30 questions feels like enough until you realize you are seeing 5% noise on every run. Aim for 200.
- Synthetic-only golden set. It will be too easy. Production traffic is messier than your corpus: typos, ambiguous intent, partial questions, follow-ups that depend on a previous turn.
- Letting the judge model float. "We use the latest GPT" means your scores are not comparable across months. Pin or it doesn't count.
- No score on retrieval relevance. End-to-end faithfulness can pass while context precision is 0.3. You are paying for a context window full of junk and the generator is hiding it.
- Eval scores divorced from traces. A 0.91 faithfulness number tells you nothing actionable. The 9 failing traces tell you everything. Link the score to the span.
- Skipping citation accuracy. It is the metric most correlated with user trust loss, and the metric most teams skip because it is harder to automate.
Where this fits
RAG evaluation is one slice of a broader LLM evaluation practice. The metrics above are RAG-specific. The operational pattern (golden set, LLM judge, online sample, trace-linked scores) is the same one we recommend for agent evaluation, prompt evaluation, and general LLM evaluation.
If you're picking tools, best LLM evaluation tools (2026) walks through the trade-offs across Ragas, DeepEval, Phoenix, and managed platforms.
FAQ
What's the difference between context recall and faithfulness? Context recall measures the retriever: did we find the right chunks? Faithfulness measures the generator: did the model use the chunks correctly? A system can have perfect recall and terrible faithfulness, or the reverse, and the fix is different in each case.
How big should my golden set be? 100 questions is a reasonable floor for a focused use case. 300 is better and covers more intent classes. Below 50, the variance per run swamps the signal and you cannot tell improvement from noise.
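A back-of-envelope way to see the noise: treat each question as an independent pass/fail and use the binomial standard error. That is a simplifying assumption, and the 0.9 pass rate below is only an example.

```python
import math


def score_standard_error(pass_rate, n_questions):
    """Binomial standard error of a pass-rate estimate over n questions."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_questions)


# Run-to-run noise at a hypothetical true pass rate of 0.9.
noise = {n: round(score_standard_error(0.9, n), 3) for n in (30, 100, 200, 300)}
# noise == {30: 0.055, 100: 0.03, 200: 0.021, 300: 0.017}
```

At 30 questions a swing of about five points is within one standard error, so it says nothing about whether a change helped. At 200 to 300 questions the same swing is worth investigating.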
Should I use an LLM-as-judge or human review? Both. LLM-as-judge for the bulk run because it is cheap, fast, and reproducible if you pin the model. Sampled human review for the trickiest cases and to calibrate the judge. Periodically score 20 to 50 examples with both and check correlation. If it drops below 0.7, your prompt or model has drifted.
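The calibration check can be as simple as a Pearson correlation between judge scores and human scores on the same examples. This sketch assumes that is the correlation meant above; the scores are invented, and a real check would use the full 20 to 50 examples.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def judge_is_calibrated(judge_scores, human_scores, threshold=0.7):
    """True if the judge still tracks human review closely enough."""
    return pearson(judge_scores, human_scores) >= threshold


# Hypothetical paired scores for six examples.
judge_scores = [1.0, 0.5, 0.9, 0.2, 0.8, 0.4]
human_scores = [0.9, 0.6, 1.0, 0.1, 0.7, 0.5]
calibrated = judge_is_calibrated(judge_scores, human_scores)
```

If the check fails, read the examples where judge and human disagree before touching the prompt. The disagreement pattern usually shows whether the judge drifted or the labels did.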
Which judge model should I use? A capable model that is not the same one you are evaluating. Claude Sonnet 4.6 and GPT-5 are both reasonable defaults. Pin the version.
Can I evaluate RAG without a golden set? Partially. Context relevance, faithfulness, and answer relevance can be judged on production traces directly. Recall and precision require labels. The golden set is the only way to compare retrieval changes apples-to-apples.
How often should I re-run the golden set? On every meaningful change: prompt edit, model swap, retriever change, chunk strategy change. At minimum weekly as a regression guard. Treat it like a test suite. If you would run pytest, run the eval.
Does caching the judge skew results? No. Identical inputs should produce identical scores. That is the point. Just make sure the cache key includes judge model version and prompt version, so a version bump invalidates the cache automatically.