Evaluating an LLM for your specific use case is a different problem from running standardized benchmarks. Benchmarks (MMLU, SWE-bench) measure general model capability; evals measure your application's quality on your data. The methods are different and the answer for your workload is rarely "the highest-benchmark model."

This guide is the practical method we use at Respan and recommend to customers. Five steps, a week of work the first time, a few hours per re-run.

TL;DR — the five-step method

1. Define quality criteria specific to your use case (3-5, not one)
2. Build a test set from real production data (50-200 examples)
3. Score outputs with rule-based + LLM-as-judge + sampled human review
4. Compare candidates on quality vs latency vs cost
5. Wire continuous evaluation so production catches drift

Step 1: Define quality criteria

Quality is not a single number. It's 3-5 orthogonal criteria, and they move independently. The biggest single mistake in LLM eval is trying to "test for quality" as one score: you can't, because quality means something different under each criterion.

Read the last 30 days of customer feedback. Each common complaint becomes a criterion. Examples:

Product type	Common criteria
Customer support agent	Faithfulness, empathy tone, escalation accuracy, format compliance
RAG system	Faithfulness, citation correctness, completeness, relevance
Code generation	Compilability, correctness, style, security
Summarization	Faithfulness, completeness, conciseness

Rules:

3-5 criteria, no more
One criterion per evaluator (don't conflate faithfulness and tone)
Use 1-5 ordinal scales for judged criteria, not binary (deterministic rule checks can still be pass/fail)
Anchor every score with an example ("5 = like this real example")
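
One way to keep these rules honest is to encode the criteria as data and check them in code. The sketch below is illustrative only: the `Criterion` class and `check_criteria` helper are hypothetical names, and the check only requires anchors for scores 1 and 5, which is looser than anchoring every score.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """One quality criterion scored on a 1-5 ordinal scale (hypothetical structure)."""
    name: str
    description: str
    # Maps a score to a real example output that earned that score.
    anchors: dict[int, str] = field(default_factory=dict)

    def validate_score(self, score: int) -> int:
        """Reject anything outside the 1-5 ordinal scale."""
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{self.name}: score must be an integer 1-5, got {score!r}")
        return score


def check_criteria(criteria: list[Criterion]) -> None:
    """Enforce the rules above: 3-5 criteria, unique names, anchored endpoints."""
    if not 3 <= len(criteria) <= 5:
        raise ValueError("use 3-5 criteria")
    names = [c.name for c in criteria]
    if len(set(names)) != len(names):
        raise ValueError("criterion names must be unique (one criterion per evaluator)")
    for c in criteria:
        # Assumption: we only insist on the endpoints; anchoring all five is better.
        if 1 not in c.anchors or 5 not in c.anchors:
            raise ValueError(f"{c.name}: anchor at least scores 1 and 5 with examples")
```

Calling `check_criteria` at the start of an eval run fails fast if someone adds a sixth criterion or forgets to attach anchor examples.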

Step 2: Build a test set

Test sets fail when they're synthetic. Real production traffic is messier. Sample from production.

Process:

Pull 500 random production traces from the last week
Cluster by user intent or feature
Sample 10-20 from each cluster
Have a human label expected behavior for each

Target size: 50-200 examples. Quality of examples matters more than size.
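
The sampling step above can be sketched as a stratified sample. This assumes each trace is a dict that already carries an `intent` label from your clustering step; the field name, the `sample_test_set` helper, and the default of 15 per cluster are all illustrative choices, not part of any particular tool.

```python
import random
from collections import defaultdict


def sample_test_set(traces, per_cluster=15, seed=0):
    """Group traces by their 'intent' label and sample up to per_cluster from each."""
    rng = random.Random(seed)  # fixed seed so the test set is reproducible
    clusters = defaultdict(list)
    for trace in traces:
        clusters[trace["intent"]].append(trace)

    sampled = []
    for intent in sorted(clusters):  # sorted for a stable output order
        members = clusters[intent]
        k = min(per_cluster, len(members))  # small clusters contribute everything
        sampled.extend(rng.sample(members, k))
    return sampled
```

Sampling per cluster rather than uniformly keeps rare intents in the test set, which a plain random sample of 500 traces would mostly miss.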

For each example, record:

Input (prompt + context)
Expected behavior (notes, not exact output — too brittle)
Edge case category (so you can slice by category later)
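
A minimal sketch of that record, stored as one JSON object per line, might look like the following. The `EvalExample` class and its field names are assumptions for illustration; the category field is what makes per-slice reporting possible later.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EvalExample:
    prompt: str
    context: str
    expected_behavior: str  # free-text notes, not an exact target output
    category: str           # edge case category, used for slicing results


def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


def mean_score_by_category(examples, scores):
    """Average per-example scores within each category."""
    totals = {}
    for example, score in zip(examples, scores):
        total, count = totals.get(example.category, (0, 0))
        totals[example.category] = (total + score, count + 1)
    return {cat: total / count for cat, (total, count) in totals.items()}
```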

If you can't sample from production yet (pre-launch), use representative synthetic data, but plan to swap in production data within the first month after launch.

Step 3: Score outputs

Three scoring methods, each with a role.

Rule-based (always run first)

Deterministic checks: schema validation, regex match, length bounds, profanity filter, banned-phrase detection, citation count, JSON validity.

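import json

def schema_compliance(output: str) -> int:
    """Return 1 if output is valid JSON that passes the schema check, else 0."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0
    # validate_schema is your own validator (not defined here); it should return a bool
    return 1 if validate_schema(data) else 0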

Cheap, fast, deterministic. Run rule-based checks first: they catch the most obvious failures at the lowest cost.
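
A few of the other deterministic checks listed above can be sketched the same way. The banned phrases, the length bounds, and the `[1]`-style citation format are placeholder assumptions; substitute whatever your product actually uses.

```python
import re

# Illustrative values only; tune for your product.
BANNED_PHRASES = ("as an ai language model", "i cannot help with that")
CITATION_PATTERN = re.compile(r"\[\d+\]")  # assumes citations look like [1], [2]


def length_ok(output: str, min_chars: int = 1, max_chars: int = 2000) -> int:
    return 1 if min_chars <= len(output) <= max_chars else 0


def no_banned_phrases(output: str) -> int:
    lowered = output.lower()
    return 0 if any(phrase in lowered for phrase in BANNED_PHRASES) else 1


def citation_count(output: str) -> int:
    return len(CITATION_PATTERN.findall(output))


def run_rule_checks(output: str, min_citations: int = 1) -> dict[str, int]:
    """Run every check and return a pass/fail flag per check."""
    return {
        "length_ok": length_ok(output),
        "no_banned_phrases": no_banned_phrases(output),
        "has_citations": 1 if citation_count(output) >= min_citations else 0,
    }
```

Returning one flag per check, rather than a single combined pass/fail, keeps the results sliceable in the same way as the quality criteria.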

LLM-as-a-judge

A separate LLM scores outputs against a rubric. Use a cheap judge (GPT-5.4 nano, Claude Haiku 4.5): the judge is billed on every scored output, so per-call savings compound at volume.

Example rubric (faithfulness):

Score the reply on faithfulness from 1-5:
5 = every claim directly supported by retrieved docs
1 = the reply contradicts or invents facts

Reply: {output}
Retrieved docs: {context}

Output JSON: {"reason": "<one sentence>", "score": <int>}
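
Judges do not always return clean JSON, so it helps to parse the reply defensively. The sketch below fills in the rubric above and validates the result; `call_judge` is a stand-in for whatever client function you use to send a prompt and get text back, not a real library API.

```python
import json

RUBRIC = """Score the reply on faithfulness from 1-5:
5 = every claim directly supported by retrieved docs
1 = the reply contradicts or invents facts

Reply: {output}
Retrieved docs: {context}

Output JSON: {{"reason": "<one sentence>", "score": <int>}}"""


def parse_judge_reply(raw: str):
    """Return (score, reason), or None if the judge reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return score, str(data.get("reason", ""))


def judge_faithfulness(output: str, context: str, call_judge):
    """call_judge is a hypothetical callable: prompt text in, judge reply text out."""
    prompt = RUBRIC.format(output=output, context=context)
    return parse_judge_reply(call_judge(prompt))
```

Treating an unparseable reply as `None` rather than as a low score keeps judge failures from being counted as model failures.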

Always validate the judge weekly. Randomly sample 50 production traces, have humans score them, and compare against the judge's scores. Track the agreement rate; if it drops below 85%, improve the rubric.
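
The agreement check can be a few lines. This sketch assumes "agreement" means the judge and the human gave the same score (optionally within a tolerance); that definition is our assumption, and other measures such as correlation or Cohen's kappa are also reasonable.

```python
def agreement_rate(human_scores, judge_scores, tolerance=0):
    """Fraction of items where judge and human scores differ by at most tolerance."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("score lists must be the same length")
    if not human_scores:
        raise ValueError("need at least one scored item")
    matches = sum(
        1 for h, j in zip(human_scores, judge_scores) if abs(h - j) <= tolerance
    )
    return matches / len(human_scores)


def judge_needs_work(human_scores, judge_scores, threshold=0.85, tolerance=0):
    """True when agreement falls below the threshold and the rubric should be revised."""
    return agreement_rate(human_scores, judge_scores, tolerance) < threshold
```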

Human review

Three patterns:

Random sample — review N items per day from production
Disagreement-triggered — when judge and rule-based disagree, route to humans
Edge case curation — humans label the long tail you sample from production

Human review is slow and expensive. The goal is to anchor the automated scores, not to cover everything.
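
The first two patterns can share one review queue. In this sketch, a "disagreement" means the rule-based checks passed while the judge scored 2 or lower, or the checks failed while the judge scored 4 or higher; those cutoffs and the item field names are assumptions to adapt.

```python
import random


def needs_human_review(rule_pass: bool, judge_score: int,
                       pass_at: int = 4, fail_at: int = 2) -> bool:
    """True when the rule-based result and the judge point in opposite directions."""
    judge_pass = judge_score >= pass_at
    judge_fail = judge_score <= fail_at
    return (rule_pass and judge_fail) or (not rule_pass and judge_pass)


def build_review_queue(items, random_n=20, seed=None):
    """Flagged disagreements first, then a random sample of everything else."""
    rng = random.Random(seed)
    flagged, rest = [], []
    for item in items:
        disagree = needs_human_review(item["rule_pass"], item["judge_score"])
        (flagged if disagree else rest).append(item)
    return flagged + rng.sample(rest, min(random_n, len(rest)))
```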

Step 4: Compare candidates

Now run the test set against multiple candidates: different models, different prompts, different agent architectures.

For each candidate, record:

Quality score per criterion (mean and distribution across the test set)
Latency P50/P95
Cost per query

The winner is rarely the candidate with the best quality alone. It's the one on the Pareto frontier of quality, cost, and latency that best fits your application.
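
To make the Pareto idea concrete: a candidate is on the frontier if no other candidate is at least as good on all three axes and strictly better on at least one. The sketch below uses a hypothetical `Candidate` record; the field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float          # mean score, higher is better
    cost_per_query: float   # lower is better
    p95_latency_s: float    # lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every axis and strictly better on one."""
    no_worse = (
        a.quality >= b.quality
        and a.cost_per_query <= b.cost_per_query
        and a.p95_latency_s <= b.p95_latency_s
    )
    strictly_better = (
        a.quality > b.quality
        or a.cost_per_query < b.cost_per_query
        or a.p95_latency_s < b.p95_latency_s
    )
    return no_worse and strictly_better


def pareto_frontier(candidates):
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

Anything off the frontier can be dropped without discussion; choosing among the frontier candidates is the product decision.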

What you'll typically find:

Premium model (GPT-5.5, Opus 4.7) wins on quality but loses on cost
Volume model (GPT-5.4 nano, Haiku 4.5) wins on cost but loses on quality
Mid-tier model (GPT-5.4, Sonnet 4.6) is usually the production answer

The right architecture for production: route different query types to different models via a gateway. Not "pick the best model" — "pick the best model per query type."
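
One simple way to derive such a routing table from eval results is to pick, for each query type, the cheapest model that clears a quality floor. This is only a sketch: the `build_routes` helper, the result layout, and the 4.0 floor are assumptions, and a real gateway would add fallbacks and overrides.

```python
def build_routes(results, quality_floor=4.0):
    """results: {query_type: {model_name: (mean_quality, cost_per_query)}}"""
    routes = {}
    for query_type, by_model in results.items():
        passing = [
            (cost, model)
            for model, (quality, cost) in by_model.items()
            if quality >= quality_floor
        ]
        if passing:
            routes[query_type] = min(passing)[1]  # cheapest model above the floor
        else:
            # Nothing clears the floor: fall back to the highest-quality model.
            routes[query_type] = max(by_model, key=lambda m: by_model[m][0])
    return routes


def pick_model(routes, query_type, default_model):
    """Look up the model for a query type, with a default for unseen types."""
    return routes.get(query_type, default_model)
```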

Step 5: Wire continuous evaluation

Eval runs are useful only if they trigger action.

Offline evals (every prompt or model change):

Run automatically in CI
Block deploy if scores drop more than 5% on any criterion
Surface diffs
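
A CI gate for the offline rules above might look like this. It reads "5%" as a relative drop against the baseline score, which is our interpretation; scores are assumed to be per-criterion means on the 1-5 scale, so the baseline is never zero.

```python
def regressions(baseline, candidate, max_relative_drop=0.05):
    """Return {criterion: relative_drop} for criteria that regressed past the limit."""
    failed = {}
    for criterion, base in baseline.items():
        if criterion not in candidate:
            raise KeyError(f"candidate run is missing criterion {criterion!r}")
        drop = (base - candidate[criterion]) / base
        if drop > max_relative_drop:
            failed[criterion] = drop
    return failed


def gate(baseline, candidate) -> int:
    """Print a per-criterion diff and return a process exit code (0 = deploy allowed)."""
    failed = regressions(baseline, candidate)
    for criterion in baseline:
        status = "FAIL" if criterion in failed else "ok"
        print(f"{criterion}: {baseline[criterion]:.2f} -> {candidate[criterion]:.2f} [{status}]")
    return 1 if failed else 0
```

In a CI job, passing the return value of `gate` to `sys.exit` is enough to block the deploy step.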

Online evals (on production traffic):

Run continuously on a sample
Alert when scores drop more than 5pp week-over-week
Alert on retry rate spikes
Alert on cost-per-active-user spikes
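
For the score alert, percentage points only make sense on a rate, so this sketch tracks the pass rate: the share of sampled production outputs the judge scored 4 or higher. Both the pass cutoff and the use of pass rate as the tracked metric are assumptions.

```python
def pass_rate(scores, pass_at=4):
    """Share of scores at or above the pass cutoff, as a fraction between 0 and 1."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(1 for s in scores if s >= pass_at) / len(scores)


def should_alert(last_week_scores, this_week_scores, max_drop_pp=5.0):
    """True when the pass rate fell by more than max_drop_pp percentage points."""
    drop_pp = (pass_rate(last_week_scores) - pass_rate(this_week_scores)) * 100
    return drop_pp > max_drop_pp
```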

Rollback path:

Every prompt and model version must support one-click rollback
Don't tie rollback to a code deploy
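
Decoupling rollback from deploys usually means the active version is a pointer in a config store rather than a constant in code. The class below is an in-memory stand-in for such a store, written only to show the shape of the idea; `PromptRegistry` and its methods are hypothetical.

```python
class PromptRegistry:
    """In-memory stand-in for a config store that holds prompt versions."""

    def __init__(self):
        self._versions = {}  # version id -> prompt text
        self._history = []   # activation order, newest last

    def publish(self, version: str, text: str) -> None:
        self._versions[version] = text

    def activate(self, version: str) -> None:
        if version not in self._versions:
            raise KeyError(version)
        self._history.append(version)

    def active(self):
        """Return (version, text) for the currently active prompt."""
        if not self._history:
            raise LookupError("no active version")
        version = self._history[-1]
        return version, self._versions[version]

    def rollback(self) -> str:
        """Re-activate the previously active version and return its id."""
        if len(self._history) < 2:
            raise LookupError("nothing to roll back to")
        self._history.pop()
        return self._history[-1]
```

Because the application reads `active()` at request time, a rollback takes effect without building or shipping any code.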

A typical workflow

1. Engineer changes prompt v22 → v23
2. CI runs offline eval suite against v23
3. Compare v23 scores to v22 per criterion
4. If pass: deploy to staging
5. Online evals run on staging traffic
6. Production deploy with traffic split — 5% to v23, 95% to v22
7. Online evals score v23 traffic continuously
8. After 24-48h with stable scores: promote v23 to 100%

This is what mature teams do. The eval pipeline runs at every step.
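
The traffic split in step 6 can be done with deterministic hashing, so the same user or session always lands on the same version. A minimal sketch, assuming you have a stable string key per request (the `request_key` parameter and the helper name are illustrative):

```python
import hashlib


def assign_version(request_key: str, canary_percent: int = 5,
                   stable: str = "v22", canary: str = "v23") -> str:
    """Map a stable key (user or session id) to a prompt version."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return canary if bucket < canary_percent else stable
```

Promoting v23 to 100% is then a config change to `canary_percent`, and tagging each trace with the assigned version lets the online evals score the two groups separately.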

Common evaluation mistakes

One "quality" score conflates everything. Decompose.
Synthetic test set only. Production data is messier.
LLM-as-judge with no human anchor. Validate weekly.
Rule-based skipped because it "feels less sophisticated." It catches the most obvious failures at the lowest cost.
Test scores not tied to model/prompt versions. Unactionable.
No rollback path. Pointless to detect regressions if you can't fix them in 30 seconds.
Skipping production monitoring. Offline evals miss what production exposes.

Tools you can use

Respan — rule-based + LLM-as-judge + human review, online + offline, integrated with traces
Braintrust — deepest scoring functions library
Langfuse — open-source self-host
LangSmith — LangChain-native
Promptfoo — open-source CLI for CI

How to start tomorrow

If you're starting from zero:

Today: pick 3 user complaints. Turn each into a criterion.
This week: build a 50-example test set from production traces.
Next week: wire LLM-as-judge against the criteria. Run on every prompt change.
Week 3: add online eval on production traffic. Alert on score drops.
Week 4: random-sample human review weekly. Track agreement vs judge.

A month from now you'll have a working eval pipeline.

FAQ

Should I use benchmarks like MMLU? For picking a base model, sure. For shipping an application, no — benchmarks don't measure your specific use case.

How big should my test set be? 50-200 examples for a focused eval. Quality of examples matters more than size.

How often should I re-run? Offline: on every prompt or model change. Online: continuously, on a sample of production requests.

Should I trust LLM-as-judge? For breadth — yes. As your only signal — no. Validate weekly with human review.

How much does eval cost? Rule-based: free. LLM-as-judge with cheap judge models: cents per 1k requests. Human review: the cost of reviewers' time. For most teams, total eval cost is under 5% of the LLM bill.

Can I run evals in CI? Yes. Promptfoo is built for it. Most observability platforms have CI integrations.

How to Evaluate an LLM

TL;DR — the five-step method

Step 1: Define quality criteria

Step 2: Build a test set

Step 3: Score outputs

Rule-based (always run first)

LLM-as-a-judge

Human review

Step 4: Compare candidates

Step 5: Wire continuous evaluation

A typical workflow

Common evaluation mistakes

Tools you can use

How to start tomorrow

FAQ

Related articles

How to Test AI Models

8 Best LLM Evaluation Tools in 2026

What Is Prompt Evaluation?

Built for AI agents.
Break less.
Ship more.

Related articles

How-to
How to Test AI Models
How to test AI models in production: rule-based checks, LLM-as-judge, sampled human review, eval pipelines, A/B testing, and the workflow that catches regressions before customers do.
Frank Chen · 18 hours ago

Best of
8 Best LLM Evaluation Tools in 2026
Best LLM evaluation tools in 2026: Respan, Braintrust, Langfuse, LangSmith, Promptfoo, DeepEval, Galileo, Patronus. Pricing, features, and when each is the right pick.
Frank Chen · 18 hours ago

Explainer
What Is Prompt Evaluation?
Prompt evaluation explained: what it is, why it matters, the three types (rule-based, LLM-as-judge, human review), and how to build a real eval pipeline.
Frank Chen · 18 hours ago
