Evaluating an LLM for your specific use case is a different problem from running standardized benchmarks. Benchmarks (MMLU, SWE-bench) measure general model capability; evals measure your application's quality on your data. The methods are different and the answer for your workload is rarely "the highest-benchmark model."
This guide is the practical method we use at Respan and recommend to customers. Five steps, a week of work the first time, a few hours per re-run.
TL;DR — the five-step method
- Define quality criteria specific to your use case (3-5, not one)
- Build a test set from real production data (50-200 examples)
- Score outputs with rule-based + LLM-as-judge + sampled human review
- Compare candidates on quality vs latency vs cost
- Wire continuous evaluation so production catches drift
Step 1: Define quality criteria
Quality is not a single number. It's 3-5 orthogonal criteria, and they move independently. The single biggest mistake in LLM eval is "test for quality" as one score: you can't, because quality means different things across criteria, and a gain on one can hide a regression on another.
Read the last 30 days of customer feedback. Each common complaint becomes a criterion. Examples:
| Product type | Common criteria |
|---|---|
| Customer support agent | Faithfulness, empathy tone, escalation accuracy, format compliance |
| RAG system | Faithfulness, citation correctness, completeness, relevance |
| Code generation | Compilability, correctness, style, security |
| Summarization | Faithfulness, completeness, conciseness |
Rules:
- 3-5 criteria, no more
- One criterion per evaluator (don't conflate faithfulness and tone)
- Use 1-5 ordinal scales, not binary
- Anchor every score with an example ("5 = like this real example")
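Criteria stay orthogonal more easily when they're written down as data rather than prose. Here is a minimal sketch of one way to do that; the criterion names, anchor texts, and the `Criterion` class are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str                # one criterion per evaluator
    description: str         # what this axis measures
    anchors: dict[int, str]  # example-anchored points on the 1-5 ordinal scale

# Illustrative criteria for a RAG support agent; replace with your own complaint-derived set
CRITERIA = [
    Criterion(
        name="faithfulness",
        description="Claims in the reply are supported by the retrieved docs.",
        anchors={5: "every claim traceable to a retrieved doc",
                 1: "reply invents or contradicts facts"},
    ),
    Criterion(
        name="format_compliance",
        description="Reply follows the required response template.",
        anchors={5: "exact template", 1: "free-form text, template ignored"},
    ),
]
```

Keeping criteria as data also makes it easy to generate one judge prompt per criterion in Step 3.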
Step 2: Build a test set
Test sets fail when they're synthetic. Real production traffic is messier. Sample from production.
Process:
- Pull 500 random production traces from the last week
- Cluster by user intent or feature
- Sample 10-20 from each cluster
- Have a human label expected behavior for each
Target size: 50-200 examples. Quality of examples matters more than size.
For each example, record:
- Input (prompt + context)
- Expected behavior (notes, not exact output — too brittle)
- Edge case category (so you can slice by category later)
If you can't sample from production yet (pre-launch), use representative synthetic data but plan to swap to production data within month 1 of launch.
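One possible shape for the test set: an `EvalCase` record per example, built by stratified sampling across intent clusters. A minimal sketch; the field names, the `intent_cluster` label on traces, and the 15-per-cluster default are assumptions to adapt to your own tracing setup:

```python
import json
import random
from collections import defaultdict
from dataclasses import dataclass, asdict

@dataclass
class EvalCase:
    input: str               # prompt + context as sent to the model
    expected_behavior: str   # human notes, not an exact output (too brittle)
    edge_case_category: str  # lets you slice scores by category later

def build_test_set(traces: list[dict], per_cluster: int = 15) -> list[EvalCase]:
    # Group production traces by intent/feature cluster, then sample from each cluster
    clusters: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        clusters[trace["intent_cluster"]].append(trace)
    cases = []
    for label, members in clusters.items():
        for trace in random.sample(members, min(per_cluster, len(members))):
            cases.append(EvalCase(
                input=trace["input"],
                expected_behavior="",      # filled in by a human labeler afterwards
                edge_case_category=label,
            ))
    return cases

def save(cases: list[EvalCase], path: str) -> None:
    # JSONL keeps the test set diffable and versionable alongside prompts
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(asdict(case)) + "\n")
```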
Step 3: Score outputs
Three scoring methods, each with a role.
Rule-based (always run first)
Deterministic checks: schema validation, regex match, length bounds, profanity filter, banned-phrase detection, citation count, JSON validity.
```python
import json

def schema_compliance(output: str) -> int:
    # Deterministic check: 1 if the output is valid JSON that matches the schema, else 0
    try:
        data = json.loads(output)
        return 1 if validate_schema(data) else 0  # validate_schema: your app's own schema check
    except json.JSONDecodeError:
        return 0
```

Cheap, fast, deterministic. Run rule-based checks first; they catch the easiest-to-detect failures at the lowest cost.
LLM-as-a-judge
A separate LLM scores outputs against a rubric. Use a cheap judge (GPT-5.4 nano, Claude Haiku 4.5): judge cost is incurred on every call, so a cheaper model compounds into real savings.
Example rubric (faithfulness):
Score the reply on faithfulness from 1-5:
5 = every claim directly supported by retrieved docs
1 = the reply contradicts or invents facts
Reply: {output}
Retrieved docs: {context}
Output JSON: {"reason": "<one sentence>", "score": <int>}
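Here is a minimal sketch of wiring that rubric into a judge call. `call_judge` is a stand-in for whatever provider client you use (it only needs to take a prompt and return the model's text), so treat it as a placeholder rather than a real API:

```python
import json

FAITHFULNESS_RUBRIC = """Score the reply on faithfulness from 1-5:
5 = every claim directly supported by retrieved docs
1 = the reply contradicts or invents facts
Reply: {output}
Retrieved docs: {context}
Output JSON: {{"reason": "<one sentence>", "score": <int>}}"""

def judge_faithfulness(output: str, context: str, call_judge) -> dict:
    # call_judge(prompt: str) -> str is a placeholder for your provider's completion call
    raw = call_judge(FAITHFULNESS_RUBRIC.format(output=output, context=context))
    try:
        parsed = json.loads(raw)
        score = int(parsed["score"])
        if not 1 <= score <= 5:
            raise ValueError("score out of range")
        return {"score": score, "reason": parsed.get("reason", "")}
    except (json.JSONDecodeError, KeyError, ValueError):
        # An unparseable judge reply is a scoring failure, not a quality signal
        return {"score": None, "reason": "judge output could not be parsed"}
```

Returning `None` on parse failure instead of guessing keeps those traces visible for the disagreement-triggered human review below.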
Validate the judge weekly. Randomly sample 50 production traces, score them with humans, and compare to the judge's scores. Track agreement; if it drops below 85%, improve the rubric.
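The weekly validation itself is a few lines once you have human and judge scores on the same sample. Exact-match agreement is used in this sketch; counting within-one-point as agreement is a reasonable variant:

```python
def judge_agreement(human_scores: list[int], judge_scores: list[int]) -> float:
    # Fraction of sampled traces where the judge matches the human label exactly
    assert len(human_scores) == len(judge_scores) and human_scores
    matches = sum(1 for h, j in zip(human_scores, judge_scores) if h == j)
    return matches / len(human_scores)

# Toy weekly check against the 85% threshold
human = [5, 4, 4, 2, 5]   # human scores on the sampled traces
judge = [5, 4, 3, 2, 5]   # judge scores on the same traces
if judge_agreement(human, judge) < 0.85:
    print("Judge is drifting from human labels: revise the rubric")
```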
Human review
Three patterns:
- Random sample — review N items per day from production
- Disagreement-triggered — when judge and rule-based disagree, route to humans
- Edge case curation — humans label the long tail you sample from production
Human review is slow and expensive. Goal is anchoring, not exhaustive coverage.
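Of the three patterns, disagreement-triggered routing is the only one with real logic in it. A minimal sketch, assuming each trace already carries its rule-based result and judge score; the thresholds are arbitrary starting points:

```python
def needs_human_review(rule_pass: bool, judge_score: int | None) -> bool:
    # Route to a human when automated signals disagree or the judge failed to score
    if judge_score is None:
        return True
    if rule_pass and judge_score <= 2:      # rules say fine, judge says bad
        return True
    if not rule_pass and judge_score >= 4:  # rules say broken, judge says good
        return True
    return False
```

Tune the thresholds so the human queue stays small enough to actually get reviewed.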
Step 4: Compare candidates
Now run the test set against multiple candidates: different models, different prompts, different agent architectures.
For each candidate, record:
- Quality score per criterion (mean and distribution across the test set)
- Latency P50/P95
- Cost per query
The winner is rarely "best quality alone" — it's "best Pareto frontier of quality, cost, and latency for your application."
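Computing the Pareto frontier over (quality, latency, cost) is a quick way to see which candidates are even in contention. A minimal sketch, assuming you've already aggregated one row per candidate:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    quality: float          # mean score across criteria, higher is better
    p95_latency_s: float    # lower is better
    cost_per_query: float   # lower is better

def dominates(a: Candidate, b: Candidate) -> bool:
    # a dominates b if it is at least as good everywhere and strictly better somewhere
    no_worse = (a.quality >= b.quality
                and a.p95_latency_s <= b.p95_latency_s
                and a.cost_per_query <= b.cost_per_query)
    better = (a.quality > b.quality
              or a.p95_latency_s < b.p95_latency_s
              or a.cost_per_query < b.cost_per_query)
    return no_worse and better

def pareto_frontier(candidates: list[Candidate]) -> list[Candidate]:
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```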
What you'll typically find:
- Premium model (GPT-5.5, Opus 4.7) wins on quality but loses on cost
- Volume model (GPT-5.4 nano, Haiku 4.5) wins on cost but loses on quality
- Mid-tier model (GPT-5.4, Sonnet 4.6) is usually the production answer
The right architecture for production: route different query types to different models via a gateway. Not "pick the best model" — "pick the best model per query type."
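That routing can start as nothing more than a lookup table in front of the gateway call; the query types and model slugs below are placeholders:

```python
# Illustrative routing table: query type -> model slug (both are placeholders)
ROUTES = {
    "simple_faq": "small-volume-model",
    "billing_dispute": "mid-tier-model",
    "code_debugging": "premium-model",
}

def pick_model(query_type: str) -> str:
    # Unknown query types fall back to the mid-tier default
    return ROUTES.get(query_type, "mid-tier-model")
```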
Step 5: Wire continuous evaluation
Eval runs are useful only if they trigger action.
Offline evals (every prompt or model change):
- Run automatically in CI
- Block deploy if scores drop more than 5% on any criterion
- Surface diffs
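The "block deploy if scores drop more than 5% on any criterion" rule above is easy to express as a CI step. A minimal sketch, assuming each eval run writes its per-criterion mean scores to a JSON file:

```python
import json
import sys

THRESHOLD = 0.05  # block if any criterion drops more than 5% relative to baseline

def gate(baseline_path: str, candidate_path: str) -> int:
    with open(baseline_path) as f:
        baseline = json.load(f)   # e.g. {"faithfulness": 4.3, "format_compliance": 4.8}
    with open(candidate_path) as f:
        candidate = json.load(f)
    failures = []
    for criterion, base in baseline.items():
        cand = candidate.get(criterion, 0.0)
        if (base - cand) / base > THRESHOLD:
            failures.append(f"{criterion}: {base:.2f} -> {cand:.2f}")
    if failures:
        print("Regression gate failed:\n  " + "\n  ".join(failures))
        return 1   # non-zero exit fails the CI job and blocks the deploy
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```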
Online evals (on production traffic):
- Run continuously on a sample
- Alert when scores drop more than 5pp week-over-week
- Alert on retry rate spikes
- Alert on cost-per-active-user spikes
Rollback path:
- Every prompt and model version must be one-click rollback
- Don't tie rollback to a code deploy
A typical workflow
1. Engineer changes prompt v22 → v23
2. CI runs offline eval suite against v23
3. Compare v23 scores to v22 per criterion
4. If pass: deploy to staging
5. Online evals run on staging traffic
6. Production deploy with traffic split — 5% to v23, 95% to v22
7. Online evals score v23 traffic continuously
8. After 24-48h with stable scores: promote v23 to 100%
This is what mature teams do. The eval pipeline runs at every step.
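Steps 6-8 reduce to a small promotion rule once online scores are flowing. A sketch of that decision, with the minimum soak time and allowed drop as knobs you'd tune:

```python
def should_promote(candidate_scores: list[float], control_scores: list[float],
                   hours_at_split: float, min_hours: float = 24.0,
                   max_drop: float = 0.05) -> bool:
    # Promote the canary (e.g. v23 at 5% traffic) only after enough soak time
    # with no meaningful score drop versus the control version
    if hours_at_split < min_hours or not candidate_scores or not control_scores:
        return False
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    control_mean = sum(control_scores) / len(control_scores)
    return candidate_mean >= control_mean * (1 - max_drop)
```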
Common evaluation mistakes
- One "quality" score conflates everything. Decompose.
- Synthetic test set only. Production data is messier.
- LLM-as-judge with no human anchor. Validate weekly.
- Rule-based checks skipped because they "feel less sophisticated." They catch the easiest failures at the lowest cost.
- Test scores not tied to model/prompt versions. Unactionable.
- No rollback path. Pointless to detect regressions if you can't fix them in 30 seconds.
- Skipping production monitoring. Offline evals miss what production exposes.
Tools you can use
- Respan — rule-based + LLM-as-judge + human review, online + offline, integrated with traces
- Braintrust — deepest scoring functions library
- Langfuse — open-source self-host
- LangSmith — LangChain-native
- Promptfoo — open-source CLI for CI
How to start tomorrow
If you're starting from zero:
- Today: pick 3 user complaints. Turn each into a criterion.
- This week: build a 50-example test set from production traces.
- Next week: wire LLM-as-judge against the criteria. Run on every prompt change.
- Week 3: add online eval on production traffic. Alert on score drops.
- Week 4: random-sample human review weekly. Track agreement vs judge.
A month from now you'll have a working eval pipeline.
FAQ
Should I use benchmarks like MMLU? For picking a base model, sure. For shipping an application, no — benchmarks don't measure your specific use case.
How big should my test set be? 50-200 examples for a focused eval. Quality of examples matters more than size.
How often should I re-run? Offline: every prompt or model change. Online: continuous (sample of every request).
Should I trust LLM-as-judge? For breadth — yes. As your only signal — no. Validate weekly with human review.
How much does eval cost? Rule-based: effectively free. LLM-as-judge with cheap judge models: cents per 1,000 requests. Human review: the real cost is people's time. For most teams, total eval cost stays under 5% of the LLM bill.
Can I run evals in CI? Yes. Promptfoo is built for it. Most observability platforms have CI integrations.