TL;DR
LLM evals are how you turn "this answer feels right" into a number. Three types matter — rule-based (fast, deterministic), LLM-as-judge (cheap, scales to every request), and human review (slow, ground truth). The teams that ship reliable AI run all three: rule-based on every output, LLM-as-judge on every output, sampled human review on the long tail. The teams that ship one number called "quality" never figure out what's actually wrong.
What are LLM evals?
An LLM eval is a function that takes a model output (and usually the input that produced it) and returns a score against a criterion. The criterion can be objective ("does this output match the JSON schema?"), semi-objective ("is this faithful to the retrieved documents?"), or subjective ("is the tone empathetic?"). The score can be binary, ordinal, or continuous.
Together, evals turn LLM quality into something you can chart over time, alert on, and ship against. Without them, every prompt change is a guess and every model swap is a leap of faith. With them, "quality regressed 7% on faithfulness over the last 24 hours" is a signal you can act on before the support tickets arrive.

A trace in Respan with three evaluators attached. Online evals run on every production request.
The three types of evals
1. Rule-based
Deterministic checks: schema validation, regex match, length bounds, profanity filter, banned-phrase detection, JSON validity, structured-output compliance. They're fast (microseconds), cheap (free), and reliable (no judge variance).
Use them for anything that has a binary correct/wrong answer. Most teams underuse rule-based evals because LLM-as-judge feels more sophisticated, but rule-based checks catch the simplest failures at the lowest cost.
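The checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the banned-phrase list and the 2,000-character bound are placeholder assumptions you'd replace with your own.

```python
import json
import re

# Placeholder banned-phrase pattern; swap in your own list.
BANNED = re.compile(r"\b(guarantee|refund immediately)\b", re.IGNORECASE)

def rule_based_checks(output: str, max_chars: int = 2000) -> dict:
    """Deterministic pass/fail checks: JSON validity, length, banned phrases."""
    is_json = True
    try:
        json.loads(output)
    except ValueError:
        is_json = False
    return {
        "valid_json": is_json,
        "within_length": len(output) <= max_chars,
        "no_banned_phrases": BANNED.search(output) is None,
    }

checks = rule_based_checks('{"answer": "Your order ships Monday."}')
# Every key maps to True for this output.
```

Each check runs in microseconds and never disagrees with itself, which is exactly why they belong on every output.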
2. LLM-as-a-judge
A separate LLM scores outputs against a rubric you write. Example rubric for a customer support reply:
Score the reply on faithfulness from 1-5:
5 = every claim is directly supported by the retrieved docs
4 = mostly supported, one minor inferential leap
3 = mostly supported, but one unsupported claim
2 = several claims have no support
1 = the reply contradicts or invents facts not in the docs
Reply: {output}
Retrieved docs: {context}
Return: {"reason": "<one sentence>", "score": <int>}
LLM-as-judge is fast (sub-second), cheap (a few cents per 1k requests), and scales to every production request. The catch: the judge has its own biases, blind spots, and failure modes. Use it for breadth, anchor with humans.
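Whatever schema you ask for, judges occasionally return malformed JSON, so parse defensively. A minimal sketch, assuming the reason-first schema above; `parse_judge_response` and the clamp-to-rubric behavior are illustrative choices, not a prescribed API:

```python
import json

def parse_judge_response(raw: str) -> tuple[int, str]:
    """Parse the judge's JSON reply; clamp the score into the 1-5 rubric range."""
    try:
        data = json.loads(raw)
        score = int(data["score"])
        reason = str(data.get("reason", ""))
    except (ValueError, KeyError, TypeError):
        # Malformed judge output: flag it rather than guessing a score.
        return 0, "unparseable judge response"
    return max(1, min(5, score)), reason

score, reason = parse_judge_response('{"reason": "one minor leap", "score": 4}')
# score == 4
```

Returning a sentinel score of 0 for unparseable replies makes judge failures visible in your dashboards instead of silently polluting the average.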
3. Human review
Sampled or full review by domain experts. Slow (minutes per item) and expensive (real people), but the ground truth for high-stakes domains. Three patterns work:
- Random sample: review N items per day. Catches drift you'd miss otherwise.
- Disagreement-triggered: when LLM-as-judge and rule-based scores disagree, route to humans. Highest signal-to-cost.
- Edge case curation: humans label the long tail you sample from production traces. The labeled examples become your gold-set.
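The disagreement-triggered pattern reduces to one comparison. A sketch, assuming a pass threshold of 4 on the 1-5 judge scale (the threshold is an assumption, not a standard):

```python
def needs_human_review(rule_pass: bool, judge_score: int,
                       pass_threshold: int = 4) -> bool:
    """Route to humans when rule-based and judge signals disagree.

    Disagreement means rules pass but the judge scores low, or rules
    fail but the judge scores high. pass_threshold=4 is an assumption.
    """
    judge_pass = judge_score >= pass_threshold
    return rule_pass != judge_pass

# Rules passed but the judge flagged the output: send it to a human.
assert needs_human_review(rule_pass=True, judge_score=2) is True
# Both signals agree the output is good: skip human review.
assert needs_human_review(rule_pass=True, judge_score=5) is False
```

When both evaluators agree, a human look rarely changes the verdict; the disagreements are where the reviewer minutes buy the most signal.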

Define evals before you ship
The fastest way to accumulate eval debt is to define quality after you have a problem. Every team I've watched do this gets the same outcome: a prompt change ships, customer complaints arrive a week later, and the team scrambles to bolt on an evaluator that catches the specific class of failure that just happened. By then the bad data is in the model's training loop, the customer's mental model, and three angry support threads. Define your evaluators before the first prompt ships, not after the first complaint.
The other near-universal mistake: a single "quality" score. Quality is 3-5 orthogonal criteria, and they move independently. From the audits we've run against human reviewers, LLM-as-judge agrees with humans far more reliably on objective criteria like faithfulness than on subjective ones like tone — often by ten or more percentage points. Collapse those into one score and you lose the ability to act on any of them. Decompose or stay confused.
Online vs offline evals
Offline evals run during development against a frozen test set. Fast feedback loop — change a prompt, run the suite, see whether the change improved or regressed. The unit is a controlled experiment.
Online evals run continuously against live production traffic. Slower signal — the metrics aggregate over hours or days — but they catch the regressions that don't show up in your test set, because production traffic is messier than any test set you'd write.
The two loops feed each other. Production traces with low scores get curated back into the offline test set; offline-validated prompt changes ship to production. Offline is the inner-loop dev experience; online is the canary that catches what offline missed. Skip either side and you have a blind spot.
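The offline loop usually ends in a gate: block the prompt change if the suite's mean score regresses too far. A sketch under stated assumptions — the 0.2-point tolerance is a placeholder, and real gates often check per-criterion and per-percentile rather than a single mean:

```python
def regression_gate(baseline_scores: list[int], candidate_scores: list[int],
                    max_drop: float = 0.2) -> bool:
    """Allow a prompt change only if the mean eval score holds up.

    max_drop=0.2 (on a 1-5 scale) is an illustrative tolerance.
    """
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return candidate_mean >= baseline_mean - max_drop

# Candidate regressed from mean 4.0 to 3.5: the gate blocks the change.
assert regression_gate([4, 4, 4, 4], [4, 3, 4, 3]) is False
```

Wire this into CI so a failing gate blocks the merge, and the "regression test before shipping" rule below enforces itself.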
How to design an eval rubric
Six rules that hold up across the eval rubrics I've seen win:
- Start with user complaints, not abstract criteria. Read the last 30 days of customer feedback. Each common complaint becomes a criterion. "Replies sound robotic" → tone eval. "Wrong dates" → factual accuracy eval.
- One criterion per eval. Don't mix faithfulness and tone in the same prompt — the judge can't disentangle them. One scope, one rubric, one score.
- Use an ordinal scale, not binary, except when the criterion is genuinely binary. Faithfulness on a 1-5 scale gives you a smooth dial; pass/fail throws away signal.
- Anchor every score with an example. "5 = like this real example. 1 = like this real example." Drastically reduces judge variance.
- Force the judge to reason before scoring. Output schema: {reason: string, score: int}. Reason first, score second. The chain of thought matters.
- Validate the judge weekly. Random-sample 50 production traces, score them with humans, compare to judge scores. Track agreement; if it drops below 85%, your judge needs work.
How to instrument
from respan import Respan
respan = Respan(api_key="...")
# Define an eval
faithfulness_eval = respan.eval.create(
name="faithfulness",
type="llm_as_judge",
judge_model="gpt-4o",
prompt="""
Score the reply on faithfulness 1-5 against the retrieved docs.
Return: {"reason": "<one sentence>", "score": <int>}
""",
)
# Attach to live traffic — runs on every request matching the filter
faithfulness_eval.attach_online(filter={"feature.id": "support_agent"})
# Or run offline against a dataset
results = faithfulness_eval.run_offline(dataset_id="support_v3_test")
print(results.aggregate())  # {"mean": 4.2, "p25": 3, "p75": 5}
Common eval mistakes
- One "quality" score that conflates everything. Decompose into 3-5 criteria.
- LLM-as-judge with no human anchor. Sample-validate weekly or you're measuring judge bias.
- Test set built from synthetic data only. Synthetic test sets miss the ugly edge cases real users produce. Sample from production traces.
- No regression test before shipping prompt changes. The whole point of evals is to catch regressions before users do.
- Rule-based evals dismissed as "too simple." They catch the simplest failures at the lowest cost. Always run them.
- Eval scores not tied to traces. A score without the trace context is unactionable. The platform should link them automatically.
Eval platforms compared
Most LLM observability platforms include evals. The differentiators: judge flexibility, online vs offline support, dataset versioning, and how tightly evals integrate with traces and prompt management. See the full feature matrix on the observability pillar.
- Respan: rule-based + LLM-as-judge + human review, online + offline, datasets versioned, scores attached to every trace.
- Braintrust: eval-first product, deepest scoring functions library, slightly less integrated with tracing.
- Langfuse: solid eval support, open source, strong dataset features.
- LangSmith: LangChain-native, evaluator library tied to LCEL.
- Helicone: basic evals, third-party integrations for advanced.
Head of DevRel at Respan (YC W24). Working alongside the team running the infrastructure that handles 80M+ LLM requests a day.
Connect on LinkedIn →
Score every LLM output, online and offline
Rule-based, LLM-as-judge, and human review — all integrated with traces, prompts, and gateway in one platform.
Related guides: LLM observability · LLM tracing · LLM gateway