TL;DR
LLM evals are how you turn "this answer feels right" into a number. Three types matter — rule-based (fast, deterministic), LLM-as-judge (cheap, scales to every request), and human review (slow, ground truth). The teams that ship reliable AI run all three: rule-based and LLM-as-judge on every output, sampled human review on the long tail. The teams that ship one number called "quality" never figure out what's actually wrong.
What are LLM evals?
An LLM eval is a function that takes a model output (and usually the input that produced it) and returns a score against a criterion. The criterion can be objective ("does this output match the JSON schema?"), semi-objective ("is this faithful to the retrieved documents?"), or subjective ("is the tone empathetic?"). The score can be binary, ordinal, or continuous.
Together, evals turn LLM quality into something you can chart over time, alert on, and ship against. Without them, every prompt change is a guess and every model swap is a leap of faith. With them, "quality regressed 7% on faithfulness over the last 24 hours" is a signal you can act on before the support tickets arrive.

A trace in Respan with three evaluators attached. Online evals run on every production request.
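Concretely, an eval is just a function. A minimal sketch of that shape in Python; the `EvalResult` and `Evaluator` names are illustrative, not any particular SDK's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    score: float       # binary (0/1), ordinal (e.g. 1-5), or continuous
    reason: str = ""   # short justification; makes failures debuggable

# An eval takes the model input, the model output, and optional context
# (e.g. retrieved documents) and scores them against one criterion.
Evaluator = Callable[[str, str, dict], EvalResult]
```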
The three types of evals
1. Rule-based
Deterministic checks: schema validation, regex match, length bounds, profanity filter, banned-phrase detection, JSON validity, structured-output compliance. They're fast (microseconds), cheap (free), and reliable (no judge variance).
Use them for anything that has a binary correct/wrong answer. Most teams underuse rule-based evals because LLM-as-judge feels more sophisticated, but rule-based checks catch the simplest failures at the lowest cost.
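A few rule-based checks in that shape, as a rough sketch reusing the `EvalResult` type from the sketch above; the length bound and banned-phrase list are placeholders, not recommendations:

```python
import json
import re

def valid_json(input_text: str, output: str, ctx: dict) -> EvalResult:
    """Structured-output compliance: does the output parse as JSON?"""
    try:
        json.loads(output)
        return EvalResult(score=1, reason="parses as JSON")
    except json.JSONDecodeError as e:
        return EvalResult(score=0, reason=f"invalid JSON: {e}")

def within_length(input_text: str, output: str, ctx: dict) -> EvalResult:
    """Length bound; 1,200 characters is a placeholder value."""
    return EvalResult(score=int(len(output) <= 1200), reason=f"{len(output)} chars")

BANNED = re.compile(r"as an AI language model|I am unable to browse", re.IGNORECASE)

def no_banned_phrases(input_text: str, output: str, ctx: dict) -> EvalResult:
    """Banned-phrase detection against a placeholder list."""
    hit = BANNED.search(output)
    return EvalResult(score=0 if hit else 1, reason=hit.group(0) if hit else "clean")
```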
2. LLM-as-a-judge
A separate LLM scores outputs against a rubric you write. Example rubric for a customer support reply:
Score the reply on faithfulness from 1-5:
5 = every claim is directly supported by the retrieved docs
4 = mostly supported, one minor inferential leap
3 = mostly supported, but one unsupported claim
2 = several claims have no support
1 = the reply contradicts or invents facts not in the docs
Reply: {output}
Retrieved docs: {context}
Return: {"score": <int>, "reason": "<one sentence>"}LLM-as-judge is fast (sub-second), cheap (a few cents per 1k requests), and scales to every production request. The catch: the judge has its own biases, blind spots, and failure modes. Use it for breadth, anchor with humans.
3. Human review
Sampled or full review by domain experts. Slow (minutes per item) and expensive (real people), but the ground truth for high-stakes domains. Three patterns work:
- Random sample: review N items per day. Catches drift you'd miss otherwise.
- Disagreement-triggered: when LLM-as-judge and rule-based scores disagree, route to humans. Highest signal-to-cost.
- Edge case curation: humans label the long tail you sample from production traces. The labeled examples become your gold-set.
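The disagreement-triggered pattern is the easiest to automate. A rough sketch; the thresholds and the in-memory queue are placeholders for whatever review tooling you actually use:

```python
review_queue: list[str] = []  # stand-in for your actual review tooling

def needs_human_review(rule_score: int, judge_score: int) -> bool:
    """Flag traces where rule-based (0/1) and the 1-5 judge disagree.
    Thresholds are placeholders; tune them against your own score distribution."""
    rules_pass_judge_fails = rule_score == 1 and judge_score <= 2
    rules_fail_judge_passes = rule_score == 0 and judge_score >= 4
    return rules_pass_judge_fails or rules_fail_judge_passes

def route(trace_id: str, rule_score: int, judge_score: int) -> None:
    if needs_human_review(rule_score, judge_score):
        review_queue.append(trace_id)
```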

The fastest way to ship eval debt is to define quality after you have a problem. Every team I've watched do this gets the same outcome: a prompt change ships, customer complaints arrive a week later, and the team scrambles to bolt on an evaluator that catches the specific class of failure that just happened. By then the bad data is in the model's training loop, the customer's mental model, and three angry threads. Define your evaluators before the first prompt ships, not after the first complaint.
The other near-universal mistake: a single "quality" score. Quality is 3-5 orthogonal criteria, and they move independently. From the audits we've run against human reviewers, LLM-as-judge agrees with humans far more reliably on objective criteria like faithfulness than on subjective ones like tone — often by ten or more percentage points. If you collapse those into one score you lose the ability to act on either. Decompose or stay confused.
Teams running production evals on Respan
Stop shipping prompts on vibes.
Respan evals run rule-based, LLM-as-judge, and human review on the same traces — online or offline, with versioned datasets. Free to try, no credit card.
Try Respan free
Online vs offline evals
Offline evals run during development against a frozen test set. Fast feedback loop — change a prompt, run the suite, see if the change improved or regressed. The unit is a controlled experiment.
Online evals run continuously against live production traffic. Slower signal — the metrics aggregate over hours or days — but they catch the regressions that don't show up in your test set, because production traffic is messier than any test set you'd write.
Production teams run both, and the two loops feed each other: production traces with low scores get curated back into the offline test set, and offline-validated prompt changes ship to production. Offline is the inner-loop dev experience; online is the canary that catches what offline missed. Skip either side and you have a blind spot.
How to design an eval rubric
Six rules that hold up across the eval rubrics I've seen win:
- Start with user complaints, not abstract criteria. Read the last 30 days of customer feedback. Each common complaint becomes a criterion. "Replies sound robotic" → tone eval. "Wrong dates" → factual accuracy eval.
- One criterion per eval. Don't mix faithfulness and tone in the same prompt — the judge can't disentangle them. One scope, one rubric, one score.
- Use ordinal not binary, except when the criterion is genuinely binary. Faithfulness on a 1-5 scale gives you a smooth dial; pass/fail throws away signal.
- Anchor every score with an example. "5 = like this real example. 1 = like this real example." Drastically reduces judge variance.
- Force the judge to reason before scoring. Output schema: {reason: string, score: int}. Reason first, score second. The chain of thought matters.
- Validate the judge weekly. Random-sample 50 production traces, score them with humans, and compare to the judge's scores. Track agreement; if it drops below 85%, your judge needs work.
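A sketch of that weekly agreement check, assuming you can export (human score, judge score) pairs for the sampled traces:

```python
def judge_human_agreement(pairs: list[tuple[int, int]], tolerance: int = 0) -> float:
    """Fraction of sampled traces where the judge lands within `tolerance`
    points of the human score on the same 1-5 scale."""
    agree = sum(1 for human, judge in pairs if abs(human - judge) <= tolerance)
    return agree / len(pairs)

# Example with a tiny sample; in practice this is the weekly 50-trace batch.
pairs = [(5, 5), (4, 3), (2, 2), (5, 4), (1, 1)]
if judge_human_agreement(pairs) < 0.85:
    print("Judge needs work: re-anchor the rubric or re-baseline the judge model")
```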
How to instrument
```python
from respan import Respan

respan = Respan(api_key="...")

# Define an eval
faithfulness_eval = respan.eval.create(
    name="faithfulness",
    type="llm_as_judge",
    judge_model="gpt-4o",
    prompt="""
Score the reply on faithfulness 1-5 against the retrieved docs.
Return: {"reason": "<one sentence>", "score": <int>}
""",
)

# Attach to live traffic — runs on every request matching the filter
faithfulness_eval.attach_online(filter={"feature.id": "support_agent"})

# Or run offline against a dataset
results = faithfulness_eval.run_offline(dataset_id="support_v3_test")
print(results.aggregate())  # {"mean": 4.2, "p25": 3, "p75": 5}
```
Common eval mistakes
- One "quality" score that conflates everything. Decompose into 3-5 criteria.
- LLM-as-judge with no human anchor. Sample-validate weekly or you're measuring judge bias.
- Test set built from synthetic data only. Synthetic test sets miss the ugly edge cases real users produce. Sample from production traces.
- No regression test before shipping prompt changes. The whole point of evals is to catch regressions before users do.
- Rule-based evals dismissed as "too simple." They catch the simplest failures at the lowest cost. Always run them.
- Eval scores not tied to traces. A score without the trace context is unactionable. The platform should link them automatically.
Eval platforms compared (May 2026)
Most LLM observability platforms include evals now — the differentiators are which eval types they support, whether they run online against live traffic, and how tightly they integrate with tracing and datasets. Verified against vendor docs in May 2026.
| Platform | Rule-based | LLM-as-judge | Online (live traffic) | Dataset versioning | Self-host | Bundled tracing + gateway |
|---|---|---|---|---|---|---|
| Respan | Yes | Yes | Yes | Yes | Enterprise | Yes |
| Braintrust | Yes | Yes (deepest scoring library) | Yes | Yes | No | Proxy only |
| Langfuse | Yes | Yes | Yes | Yes | Yes (open source) | Tracing only |
| LangSmith | Yes | Yes (LangChain-native) | Yes | Yes | Enterprise | Tracing only |
| Phoenix (Arize) | Yes | Yes | Yes | Yes | Yes (open source) | Tracing only |
| Helicone | Limited | Basic | Yes (proxy hooks) | No | Yes (open source) | Proxy gateway |
| Patronus AI | Yes | Yes (Lynx faithfulness judge) | API-based | Yes | No | No |
Cells reflect public docs as of May 2026. "Bundled tracing + gateway" matters because eval scores need to attach to traces, and the gateway is where production scores get captured cheaply.
Which eval platform should you pick?
- You're prompt-iterating and need a fast offline loop — Braintrust or Langfuse. Both excel at the dataset → run-suite → compare experience.
- You're on LangChain or LangGraph end-to-end — LangSmith. The LCEL-native evaluators save real integration work.
- You want evals + tracing + a gateway in one platform — Respan. Score attaches to trace; trace attaches to request; one billing line. (See why bundling matters.)
- You need on-prem / fully self-hosted for compliance — Langfuse or Phoenix. Both ship a real OSS install (not a "community edition" with the eval features stripped out).
- You need state-of-the-art faithfulness scoring out of the box — Patronus's Lynx model is the strongest specialized judge for RAG faithfulness in public benchmarks.
Failure modes you should plan for
Judge drift after model upgrades
You wired the judge against the un-pinned gpt-4o alias, and then OpenAI promoted a new default snapshot. Your scores shift by 8% overnight and you spend a week debugging a "regression" that's actually a different judge. Pin the judge model version explicitly. Re-baseline whenever you bump it.
Test set rot
Your test set was great in Q1. By Q3 it doesn't reflect the queries customers are actually sending — the product changed, the user base grew, the docs got rewritten. Curate from production traces quarterly, retire examples that no longer represent live traffic, and version the test set so old runs stay comparable.
Eval scores you can't act on
"Faithfulness dropped 4% this week" — and then what? If the score is detached from the trace, you can't look at the failing cases. If it's not segmented by feature, you can't tell which surface regressed. Every score needs a trace link and a feature/version tag, or it's just a chart that lights up red and never gets fixed.
Production eval checklist
- 3-5 orthogonal criteria, each tied to a real user complaint pattern
- Rule-based + LLM-as-judge running on 100% of production traffic
- Judge model version pinned; re-baseline on every bump
- Weekly human sample-validation (50+ traces) tracking judge–human agreement
- Test set curated from production traces, refreshed quarterly
- Every score attached to its trace and tagged by feature + prompt version
- Regression alert when any criterion's rolling-24h mean drops > 1 stddev
- Pre-merge eval gate on prompt changes — block deploy if scores regress
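The last two checklist items translate directly into CI. A minimal sketch of a pre-merge gate, assuming you can get per-criterion mean scores for both the baseline prompt and the candidate; the criteria names, numbers, and thresholds are illustrative:

```python
import sys

# Per-criterion mean scores from the offline suite; numbers are illustrative.
baseline = {"faithfulness": 4.3, "tone": 4.0, "json_validity": 1.0}
candidate = {"faithfulness": 4.1, "tone": 4.2, "json_validity": 1.0}

# How much each criterion is allowed to drop before the deploy is blocked.
MAX_REGRESSION = {"faithfulness": 0.1, "tone": 0.2, "json_validity": 0.0}

failures = [
    name for name, base in baseline.items()
    if base - candidate[name] > MAX_REGRESSION[name]
]

if failures:
    print(f"Blocking deploy: regression on {failures}")
    sys.exit(1)
print("Eval gate passed")
```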
Wire up evals on every production request
Rule-based, LLM-as-judge, and human review — scored on the trace that produced the output, tagged by feature and prompt version. Free tier, no card required.
Related guides: LLM observability · LLM tracing · LLM gateway