TL;DR
LLM evals are how you turn "this answer feels right" into a number. Three types matter — rule-based (fast, deterministic), LLM-as-judge (cheap, scales to every request), and human review (slow, ground truth). The teams that ship reliable AI run all three: rule-based and LLM-as-judge on every output, sampled human review on the long tail. The teams that ship one number called "quality" never figure out what's actually wrong.
What are LLM evals?
An LLM eval is a function that takes a model output (and usually the input that produced it) and returns a score against a criterion. The criterion can be objective ("does this output match the JSON schema?"), semi-objective ("is this faithful to the retrieved documents?"), or subjective ("is the tone empathetic?"). The score can be binary, ordinal, or continuous.
Together, evals turn LLM quality into something you can chart over time, alert on, and ship against. Without them, every prompt change is a guess and every model swap is a leap of faith. With them, "quality regressed 7% on faithfulness over the last 24 hours" is a signal you can act on before the support tickets arrive.

A trace in Respan with three evaluators attached. Online evals run on every production request.
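Concretely, an eval is just a function. A minimal sketch of that shape in Python; the `EvalResult` and `Evaluator` names are illustrative, not any particular SDK's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    score: float       # binary (0/1), ordinal (e.g. 1-5), or continuous
    reason: str = ""   # short justification; makes failures debuggable

# An eval takes the model input, the model output, and optional context
# (e.g. retrieved documents) and scores them against one criterion.
Evaluator = Callable[[str, str, dict], EvalResult]
```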
The three types of evals
1. Rule-based
Deterministic checks: schema validation, regex match, length bounds, profanity filter, banned-phrase detection, JSON validity, structured-output compliance. They're fast (microseconds), cheap (free), and reliable (no judge variance).
Use them for anything that has a binary correct/wrong answer. Most teams underuse rule-based evals because LLM-as-judge feels more sophisticated, but rule-based checks catch the simplest failures at the lowest cost.
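A few rule-based checks in that shape, as a rough sketch reusing the `EvalResult` type from the sketch above; the length bound and banned-phrase list are placeholders, not recommendations:

```python
import json
import re

def valid_json(input_text: str, output: str, ctx: dict) -> EvalResult:
    """Structured-output compliance: does the output parse as JSON?"""
    try:
        json.loads(output)
        return EvalResult(score=1, reason="parses as JSON")
    except json.JSONDecodeError as e:
        return EvalResult(score=0, reason=f"invalid JSON: {e}")

def within_length(input_text: str, output: str, ctx: dict) -> EvalResult:
    """Length bound; 1,200 characters is a placeholder value."""
    return EvalResult(score=int(len(output) <= 1200), reason=f"{len(output)} chars")

BANNED = re.compile(r"as an AI language model|I am unable to browse", re.IGNORECASE)

def no_banned_phrases(input_text: str, output: str, ctx: dict) -> EvalResult:
    """Banned-phrase detection against a placeholder list."""
    hit = BANNED.search(output)
    return EvalResult(score=0 if hit else 1, reason=hit.group(0) if hit else "clean")
```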
2. LLM-as-a-judge
A separate LLM scores outputs against a rubric you write. Example rubric for a customer support reply:
Score the reply on faithfulness from 1-5:
5 = every claim is directly supported by the retrieved docs
4 = mostly supported, one minor inferential leap
3 = mostly supported, but one unsupported claim
2 = several claims have no support
1 = the reply contradicts or invents facts not in the docs
Reply: {output}
Retrieved docs: {context}
Return: {"score": <int>, "reason": "<one sentence>"}LLM-as-judge is fast (sub-second), cheap (a few cents per 1k requests), and scales to every production request. The catch: the judge has its own biases, blind spots, and failure modes. Use it for breadth, anchor with humans.
3. Human review
Sampled or full review by domain experts. Slow (minutes per item) and expensive (real people), but the ground truth for high-stakes domains. Three patterns work:
- Random sample: review N items per day. Catches drift you'd miss otherwise.
- Disagreement-triggered: when LLM-as-judge and rule-based scores disagree, route to humans. Highest signal-to-cost.
- Edge case curation: humans label the long tail you sample from production traces. The labeled examples become your gold-set.
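The disagreement-triggered pattern is the easiest to automate. A rough sketch; the thresholds and the in-memory queue are placeholders for whatever review tooling you actually use:

```python
review_queue: list[str] = []  # stand-in for your actual review tooling

def needs_human_review(rule_score: int, judge_score: int) -> bool:
    """Flag traces where rule-based (0/1) and the 1-5 judge disagree.
    Thresholds are placeholders; tune them against your own score distribution."""
    rules_pass_judge_fails = rule_score == 1 and judge_score <= 2
    rules_fail_judge_passes = rule_score == 0 and judge_score >= 4
    return rules_pass_judge_fails or rules_fail_judge_passes

def route(trace_id: str, rule_score: int, judge_score: int) -> None:
    if needs_human_review(rule_score, judge_score):
        review_queue.append(trace_id)
```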

The fastest way to ship eval debt is to define quality after you have a problem. Every team I've watched do this gets the same outcome: a prompt change ships, customer complaints arrive a week later, and the team scrambles to bolt on an evaluator that catches the specific class of failure that just happened. By then the bad data is in the model's training loop, the customer's mental model, and three angry threads. Define your evaluators before the first prompt ships, not after the first complaint.
The other near-universal mistake: a single "quality" score. Quality is 3-5 orthogonal criteria, and they move independently. From the audits we've run against human reviewers, LLM-as-judge agrees with humans far more reliably on objective criteria like faithfulness than on subjective ones like tone — often by ten or more percentage points. If you collapse those into one score you lose the ability to act on either. Decompose or stay confused.
Teams running production evals on Respan
Stop shipping prompts on vibes.
Respan evals run rule-based, LLM-as-judge, and human review on the same traces — online or offline, with versioned datasets. Free to try, no credit card.
Try Respan free
Online vs offline evals
Offline evals run during development against a frozen test set. Fast feedback loop — change a prompt, run the suite, see if the change improved or regressed. The unit is a controlled experiment.
Online evals run continuously against live production traffic. Slower signal — the metrics aggregate over hours or days — but they catch the regressions that don't show up in your test set, because production traffic is messier than any test set you'd write.
Production teams run both, and the two loops feed each other: production traces with low scores get curated back into the offline test set, and offline-validated prompt changes ship to production. Offline is the inner-loop dev experience; online is the canary that catches what offline missed. Skip either side and you have a blind spot.
How to design an eval rubric
Six rules that hold up across the eval rubrics I've seen win:
- Start with user complaints, not abstract criteria. Read the last 30 days of customer feedback. Each common complaint becomes a criterion. "Replies sound robotic" → tone eval. "Wrong dates" → factual accuracy eval.
- One criterion per eval. Don't mix faithfulness and tone in the same prompt — the judge can't disentangle them. One scope, one rubric, one score.
- Use ordinal not binary, except when the criterion is genuinely binary. Faithfulness on a 1-5 scale gives you a smooth dial; pass/fail throws away signal.
- Anchor every score with an example. "5 = like this real example. 1 = like this real example." Drastically reduces judge variance.
- Force the judge to reason before scoring. Output schema: {reason: string, score: int}. Reason first, score second. The chain of thought matters.
- Validate the judge weekly. Random-sample 50 production traces, score them with humans, and compare to the judge's scores. Track agreement; if it drops below 85%, your judge needs work.
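A sketch of that weekly agreement check, assuming you can export (human score, judge score) pairs for the sampled traces:

```python
def judge_human_agreement(pairs: list[tuple[int, int]], tolerance: int = 0) -> float:
    """Fraction of sampled traces where the judge lands within `tolerance`
    points of the human score on the same 1-5 scale."""
    agree = sum(1 for human, judge in pairs if abs(human - judge) <= tolerance)
    return agree / len(pairs)

# Example with a tiny sample; in practice this is the weekly 50-trace batch.
pairs = [(5, 5), (4, 3), (2, 2), (5, 4), (1, 1)]
if judge_human_agreement(pairs) < 0.85:
    print("Judge needs work: re-anchor the rubric or re-baseline the judge model")
```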
How to instrument
```python
from respan import Respan

respan = Respan(api_key="...")

# Define an eval
faithfulness_eval = respan.eval.create(
    name="faithfulness",
    type="llm_as_judge",
    judge_model="gpt-4o",
    prompt="""
Score the reply on faithfulness 1-5 against the retrieved docs.
Return: {"reason": "<one sentence>", "score": <int>}
""",
)

# Attach to live traffic — runs on every request matching the filter
faithfulness_eval.attach_online(filter={"feature.id": "support_agent"})

# Or run offline against a dataset
results = faithfulness_eval.run_offline(dataset_id="support_v3_test")
print(results.aggregate())  # {"mean": 4.2, "p25": 3, "p75": 5}
```
Common eval mistakes
- One "quality" score that conflates everything. Decompose into 3-5 criteria.
- LLM-as-judge with no human anchor. Sample-validate weekly or you're measuring judge bias.
- Test set built from synthetic data only. Synthetic test sets miss the ugly edge cases real users produce. Sample from production traces.
- No regression test before shipping prompt changes. The whole point of evals is to catch regressions before users do.
- Rule-based evals dismissed as "too simple." They catch the simplest failures at the lowest cost. Always run them.
- Eval scores not tied to traces. A score without the trace context is unactionable. The platform should link them automatically.
Eval platforms compared (May 2026)
Most LLM observability platforms include evals now — the differentiators are which eval types they support, whether they run online against live traffic, and how tightly they integrate with tracing and datasets. Verified against vendor docs in May 2026.
| Platform | Rule-based | LLM-as-judge | Online (live traffic) | Dataset versioning | Self-host | Bundled tracing + gateway |
|---|---|---|---|---|---|---|
| Respan | Yes | Yes | Yes | Yes | Enterprise | Yes |
| Braintrust | Yes | Yes (deepest scoring library) | Yes | Yes | No | Proxy only |
| Langfuse | Yes | Yes | Yes | Yes | Yes (open source) | Tracing only |
| LangSmith | Yes | Yes (LangChain-native) | Yes | Yes | Enterprise | Tracing only |
| Phoenix (Arize) | Yes | Yes | Yes | Yes | Yes (open source) | Tracing only |
| Helicone | Limited | Basic | Yes (proxy hooks) | No | Yes (open source) | Proxy gateway |
| Patronus AI | Yes | Yes (Lynx faithfulness judge) | API-based | Yes | No | No |
Cells reflect public docs as of May 2026. "Bundled tracing + gateway" matters because eval scores need to attach to traces, and the gateway is where production scores get captured cheaply.
Which eval platform should you pick?
- You're prompt-iterating and need a fast offline loop — Braintrust or Langfuse. Both excel at the dataset → run-suite → compare experience.
- You're on LangChain or LangGraph end-to-end — LangSmith. The LCEL-native evaluators save real integration work.
- You want evals + tracing + a gateway in one platform — Respan. Score attaches to trace; trace attaches to request; one billing line. (See why bundling matters.)
- You need on-prem / fully self-hosted for compliance — Langfuse or Phoenix. Both ship a real OSS install (not a "community edition" with the eval features stripped out).
- You need state-of-the-art faithfulness scoring out of the box — Patronus's Lynx model is the strongest specialized judge for RAG faithfulness in public benchmarks.
Failure modes you should plan for
Judge drift after model upgrades
You wired the judge against the un-pinned gpt-4o alias, and then OpenAI promoted a new default snapshot. Your scores shift by 8% overnight and you spend a week debugging a "regression" that's actually a different judge. Pin the judge model version explicitly. Re-baseline whenever you bump it.
Test set rot
Your test set was great in Q1. By Q3 it doesn't reflect the queries customers are actually sending — the product changed, the user base grew, the docs got rewritten. Curate from production traces quarterly, retire examples that no longer represent live traffic, and version the test set so old runs stay comparable.
Eval scores you can't act on
"Faithfulness dropped 4% this week" — and then what? If the score is detached from the trace, you can't look at the failing cases. If it's not segmented by feature, you can't tell which surface regressed. Every score needs a trace link and a feature/version tag, or it's just a chart that lights up red and never gets fixed.
Production eval checklist
- 3-5 orthogonal criteria, each tied to a real user complaint pattern
- Rule-based + LLM-as-judge running on 100% of production traffic
- Judge model version pinned; re-baseline on every bump
- Weekly human sample-validation (50+ traces) tracking judge–human agreement
- Test set curated from production traces, refreshed quarterly
- Every score attached to its trace and tagged by feature + prompt version
- Regression alert when any criterion's rolling-24h mean drops > 1 stddev
- Pre-merge eval gate on prompt changes — block deploy if scores regress
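The last two checklist items translate directly into CI. A minimal sketch of a pre-merge gate, assuming you can get per-criterion mean scores for both the baseline prompt and the candidate; the criteria names, numbers, and thresholds are illustrative:

```python
import sys

# Per-criterion mean scores from the offline suite; numbers are illustrative.
baseline = {"faithfulness": 4.3, "tone": 4.0, "json_validity": 1.0}
candidate = {"faithfulness": 4.1, "tone": 4.2, "json_validity": 1.0}

# How much each criterion is allowed to drop before the deploy is blocked.
MAX_REGRESSION = {"faithfulness": 0.1, "tone": 0.2, "json_validity": 0.0}

failures = [
    name for name, base in baseline.items()
    if base - candidate[name] > MAX_REGRESSION[name]
]

if failures:
    print(f"Blocking deploy: regression on {failures}")
    sys.exit(1)
print("Eval gate passed")
```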
Wire up evals on every production request
Rule-based, LLM-as-judge, and human review — scored on the trace that produced the output, tagged by feature and prompt version. Free tier, no card required.
Related guides: LLM observability · LLM tracing · LLM gateway