Evaluating an LLM for your specific use case is a different problem from running standardized benchmarks. Benchmarks (MMLU, SWE-bench) measure general model capability; evals measure your application's quality on your data. The methods are different and the answer for your workload is rarely "the highest-benchmark model."
This guide is the practical method we use at Respan and recommend to customers. Five steps, a week of work the first time, a few hours per re-run.
TL;DR: the five-step method
- Define quality criteria specific to your use case (3-5, not one)
- Build a test set from real production data (50-200 examples)
- Score outputs with rule-based + LLM-as-judge + sampled human review
- Compare candidates on quality vs latency vs cost
- Wire continuous evaluation so production catches drift
Step 1: Define quality criteria
Quality is not a single number. It's 3-5 orthogonal criteria, and they move independently. The biggest single mistake in LLM eval is trying to "test for quality." You can't, because quality means something different under each criterion: a reply can be perfectly faithful and still fail on tone.
Read the last 30 days of customer feedback. Each common complaint becomes a criterion. Examples:
| Product type | Common criteria |
|---|---|
| Customer support agent | Faithfulness, empathy tone, escalation accuracy, format compliance |
| RAG system | Faithfulness, citation correctness, completeness, relevance |
| Code generation | Compilability, correctness, style, security |
| Summarization | Faithfulness, completeness, conciseness |
Rules:
- 3-5 criteria, no more
- One criterion per evaluator (don't conflate faithfulness and tone)
- Use 1-5 ordinal scales, not binary
- Anchor every score with an example ("5 = like this real example")
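One way to keep these rules from eroding is to encode them. The sketch below is illustrative only: the `Criterion` class, its field names, and the choice to require anchors at just the two ends of the scale are our assumptions, not a fixed API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """One quality criterion scored on a 1-5 ordinal scale (hypothetical shape)."""
    name: str
    description: str
    # Maps a score to a real example that anchors it, e.g. {5: "...", 1: "..."}.
    anchors: dict = field(default_factory=dict)

    def validate(self) -> None:
        # Anchors must sit on the 1-5 scale.
        if any(score not in range(1, 6) for score in self.anchors):
            raise ValueError(f"{self.name}: anchors must use scores 1-5")
        # Assumption: anchoring both ends of the scale is the minimum.
        if not {1, 5}.issubset(self.anchors):
            raise ValueError(f"{self.name}: anchor at least scores 1 and 5")


def validate_criteria(criteria: list) -> None:
    """Enforce the 3-5 limit and one uniquely named criterion per evaluator."""
    if not 3 <= len(criteria) <= 5:
        raise ValueError("use 3-5 criteria")
    names = [c.name for c in criteria]
    if len(set(names)) != len(names):
        raise ValueError("criterion names must be unique")
    for criterion in criteria:
        criterion.validate()
```

Running `validate_criteria` when the eval suite loads means a sixth criterion or an unanchored scale fails loudly instead of quietly skewing scores.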
Step 2: Build a test set
Test sets fail when they're synthetic. Real production traffic is messier. Sample from production.
Process:
- Pull 500 random production traces from the last week
- Cluster by user intent or feature
- Sample 10-20 from each cluster
- Have a human label expected behavior for each
Target size: 50-200 examples. Quality of examples matters more than size.
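The sampling step can be sketched as a stratified draw. This assumes each trace is a dict that already carries a `"cluster"` label (a hypothetical key; how you assign clusters, by classifier, embeddings, or hand-labeling, is up to you).

```python
import random
from collections import defaultdict


def sample_test_set(traces, per_cluster=15, seed=0):
    """Pick up to `per_cluster` traces from each intent cluster.

    `traces` is a list of dicts with a hypothetical "cluster" key.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_cluster = defaultdict(list)
    for trace in traces:
        by_cluster[trace["cluster"]].append(trace)

    selected = []
    for cluster in sorted(by_cluster):
        items = by_cluster[cluster]
        # Small clusters contribute everything they have.
        selected.extend(rng.sample(items, min(per_cluster, len(items))))
    return selected
```

Sampling per cluster rather than uniformly keeps rare intents in the test set; a uniform draw from 500 traces would mostly return your most common request type.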
For each example, record:
- Input (prompt + context)
- Expected behavior (notes, not exact output; too brittle)
- Edge case category (so you can slice by category later)
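A minimal sketch of that record, plus the slicing it enables. The `EvalExample` class and its field names are hypothetical; use whatever your trace store already provides.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class EvalExample:
    prompt: str
    context: str
    expected_behavior: str  # reviewer notes, not an exact output
    category: str           # edge case category, used for slicing


def mean_score_by_category(examples, scores):
    """Average per-example scores within each edge case category."""
    if len(examples) != len(scores):
        raise ValueError("need exactly one score per example")
    buckets = defaultdict(list)
    for example, score in zip(examples, scores):
        buckets[example.category].append(score)
    return {category: mean(values) for category, values in buckets.items()}
```

A candidate that averages 4.3 overall but 2.0 on one category is a different decision from one that averages 4.3 everywhere, and only the sliced view shows it.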
If you can't sample from production yet (pre-launch), use representative synthetic data but plan to swap to production data within month 1 of launch.
Step 3: Score outputs
Three scoring methods, each with a role.
Rule-based (always run first)
Deterministic checks: schema validation, regex match, length bounds, profanity filter, banned-phrase detection, citation count, JSON validity.
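
    import json

    def schema_compliance(output: str) -> int:
        """Return 1 if output is valid JSON that matches the schema, else 0."""
        try:
            data = json.loads(output)
            # validate_schema is your own validator, defined elsewhere.
            return 1 if validate_schema(data) else 0
        except json.JSONDecodeError:
            return 0

Cheap, fast, deterministic. Run rule-based checks first; they catch the cheapest failures at the lowest cost.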
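A few more of the deterministic checks listed above, in the same 0/1 style. The banned phrases, the length bounds, and the `[1]`-style citation pattern are placeholder assumptions; substitute your own.

```python
import re

# Hypothetical values; replace with your product's own lists and limits.
BANNED_PHRASES = ("as an ai language model", "i cannot help with that")
CITATION_PATTERN = re.compile(r"\[\d+\]")  # assumes citations look like [1]


def within_length(output: str, min_chars: int = 20, max_chars: int = 2000) -> int:
    return 1 if min_chars <= len(output) <= max_chars else 0


def no_banned_phrases(output: str) -> int:
    lowered = output.lower()
    return 0 if any(phrase in lowered for phrase in BANNED_PHRASES) else 1


def has_citations(output: str, minimum: int = 1) -> int:
    return 1 if len(CITATION_PATTERN.findall(output)) >= minimum else 0


def run_rule_checks(output: str) -> dict:
    """Run every check and return a name -> 0/1 map."""
    return {
        "length": within_length(output),
        "banned_phrases": no_banned_phrases(output),
        "citations": has_citations(output),
    }
```

Returning a per-check map rather than a single pass/fail keeps failures attributable, which matters when you later slice results by criterion.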
LLM-as-a-judge
A separate LLM scores outputs against a rubric. Use a cheap judge (GPT-5.4 nano, Claude Haiku 4.5); you pay for the judge on every call, so the savings from a cheap model compound.
Example rubric (faithfulness):
Score the reply on faithfulness from 1-5:
5 = every claim directly supported by retrieved docs
1 = the reply contradicts or invents facts
Reply: {output}
Retrieved docs: {context}
Output JSON: {"reason": "<one sentence>", "score": <int>}
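Judges return malformed output often enough that the reply should be validated before it reaches a dashboard. A sketch of filling the rubric and parsing the reply; the function names are ours, and the call to the judge model itself is left out because it depends on your provider.

```python
import json
import re

RUBRIC = """Score the reply on faithfulness from 1-5:
5 = every claim directly supported by retrieved docs
1 = the reply contradicts or invents facts
Reply: {output}
Retrieved docs: {context}
Output JSON: {"reason": "<one sentence>", "score": <int>}"""


def build_judge_prompt(output: str, context: str) -> str:
    # str.format would choke on the literal JSON braces in the rubric, so
    # substitute only the two known placeholders, in a single pass.
    values = {"output": output, "context": context}
    return re.sub(r"\{(output|context)\}", lambda m: values[m.group(1)], RUBRIC)


def parse_judge_reply(raw: str) -> tuple:
    """Return (score, reason); raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not JSON: {raw!r}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    return score, str(data.get("reason", ""))
```

Treat a `ValueError` here as a judge failure to retry or count, not as a low score for the output being judged.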
Always validate the judge weekly. Randomly sample 50 production traces, have humans score them, and compare against the judge's scores. Track the agreement rate. If it drops below 85%, improve the rubric.
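Agreement can be computed in more than one way. The sketch below counts the share of traces where the judge lands within `tolerance` points of the human score; defaulting to exact match is our assumption, and on a 1-5 scale some teams prefer a tolerance of 1.

```python
def agreement_rate(human_scores, judge_scores, tolerance=0):
    """Fraction of traces where judge and human differ by at most `tolerance`."""
    if len(human_scores) != len(judge_scores) or not human_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(
        abs(human - judge) <= tolerance
        for human, judge in zip(human_scores, judge_scores)
    )
    return hits / len(human_scores)


def rubric_needs_work(human_scores, judge_scores, minimum=0.85, tolerance=0):
    return agreement_rate(human_scores, judge_scores, tolerance) < minimum
```

Whichever definition you choose, keep it fixed from week to week so the trend is comparable.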
Human review
Three patterns:
- Random sample: review N items per day from production
- Disagreement-triggered: when judge and rule-based disagree, route to humans
- Edge case curation: humans label the long tail you sample from production
Human review is slow and expensive. Goal is anchoring, not exhaustive coverage.
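The disagreement-triggered pattern needs a definition of "disagree." One possible definition, with thresholds that are assumptions to tune: the rule checks pass but the judge scores low, or the rule checks fail but the judge scores high.

```python
def needs_human_review(rules_passed: bool, judge_score: int,
                       low: int = 2, high: int = 4) -> bool:
    """Route to a human when the two automated signals point in opposite directions."""
    if rules_passed and judge_score <= low:
        return True   # structurally fine, but the judge thinks it is bad
    if not rules_passed and judge_score >= high:
        return True   # the judge likes it, but it broke a hard rule
    return False
```

Middling judge scores are deliberately not routed here; the random sample covers those.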
Step 4: Compare candidates
Now run the test set against multiple candidates: different models, different prompts, different agent architectures.
For each candidate, record:
- Quality score per criterion (mean and distribution across the test set)
- Latency P50/P95
- Cost per query
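A sketch of that per-candidate summary. The input shape (`"scores"`, `"latency_ms"`, `"cost_usd"` keys) is hypothetical, and the percentile uses the nearest-rank definition; other definitions interpolate and give slightly different numbers.

```python
import math
from collections import Counter
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs):
    """Summarize one candidate's runs over the test set.

    `runs` is a list of dicts with hypothetical keys "scores"
    (criterion -> 1-5 score), "latency_ms", and "cost_usd".
    """
    criteria = list(runs[0]["scores"])
    latencies = [run["latency_ms"] for run in runs]
    return {
        "mean_score": {c: mean(run["scores"][c] for run in runs) for c in criteria},
        # Score counts expose a bimodal result that a mean would hide.
        "score_counts": {c: Counter(run["scores"][c] for run in runs) for c in criteria},
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "cost_per_query_usd": mean(run["cost_usd"] for run in runs),
    }
```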
The winner is rarely "best quality alone." It's the candidate on the Pareto frontier of quality, cost, and latency that best fits your application.
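Finding the frontier is mechanical: drop every candidate that another candidate beats or ties on all three axes and strictly beats on at least one. A minimal sketch, assuming each candidate is reduced to a `(quality, cost, latency)` tuple where higher quality and lower cost and latency are better:

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_frontier(candidates):
    """`candidates` maps name -> (quality, cost, latency); returns the undominated names."""
    return [
        name
        for name, metrics in candidates.items()
        if not any(
            dominates(other, metrics)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]
```

The frontier only narrows the field. Choosing among the survivors is still a product decision about how much quality a dollar or a second is worth to you.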
What you'll typically find:
- Premium model (GPT-5.5, Opus 4.7) wins on quality but loses on cost
- Volume model (GPT-5.4 nano, Haiku 4.5) wins on cost but loses on quality
- Mid-tier model (GPT-5.4, Sonnet 4.6) is usually the production answer
The right architecture for production: route different query types to different models via a gateway. Not "pick the best model," but "pick the best model per query type."
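A routing table can be derived directly from the eval results. The policy sketched here, "cheapest model that clears a quality bar, else the highest-quality one," is one reasonable choice rather than the only one; the input shape and the 4.0 bar are assumptions.

```python
def build_routes(results, min_quality=4.0):
    """Pick a model per query type.

    `results` maps query_type -> {model_name: (mean_quality, cost_per_query)}.
    """
    routes = {}
    for query_type, by_model in results.items():
        good_enough = {
            model: metrics
            for model, metrics in by_model.items()
            if metrics[0] >= min_quality
        }
        if good_enough:
            # Cheapest model that still meets the quality bar.
            routes[query_type] = min(good_enough, key=lambda m: good_enough[m][1])
        else:
            # Nothing clears the bar: fall back to the best quality available.
            routes[query_type] = max(by_model, key=lambda m: by_model[m][0])
    return routes
```

The gateway then looks up `routes[query_type]` per request, so re-running the eval regenerates the table without a code change.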
Step 5: Wire continuous evaluation
Eval runs are useful only if they trigger action.
Offline evals (every prompt or model change):
- Run automatically in CI
- Block deploy if scores drop more than 5% on any criterion
- Surface diffs
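The deploy gate can be a small function that CI calls after the offline run. This sketch reads "5%" as a relative drop against the baseline's mean score per criterion, which is our interpretation; the function name and input shape are hypothetical.

```python
def find_regressions(baseline, candidate, max_drop=0.05):
    """Return {criterion: relative_drop} for every criterion that regressed too far.

    `baseline` and `candidate` map criterion -> mean score on a 1-5 scale.
    """
    failed = {}
    for criterion, base_score in baseline.items():
        drop = (base_score - candidate[criterion]) / base_score
        if drop > max_drop:
            failed[criterion] = round(drop, 4)
    return failed
```

In CI, exit non-zero when the returned dict is non-empty and print it, which covers both "block deploy" and "surface diffs."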
Online evals (on production traffic):
- Run continuously on a sample
- Alert when scores drop more than 5pp week-over-week
- Alert on retry rate spikes
- Alert on cost-per-active-user spikes
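Percentage points only make sense for a rate, so the score alert needs the 1-5 scores converted into one first. The sketch assumes a "pass rate" defined as the share of sampled traffic scoring 4 or higher; that definition and threshold are ours.

```python
def pass_rate(scores, threshold=4):
    """Share of sampled requests scoring at or above `threshold`."""
    if not scores:
        raise ValueError("no scores sampled")
    return sum(score >= threshold for score in scores) / len(scores)


def should_alert(last_week_scores, this_week_scores, max_drop_pp=5.0):
    """True when the pass rate fell by more than `max_drop_pp` percentage points."""
    drop_pp = (pass_rate(last_week_scores) - pass_rate(this_week_scores)) * 100
    return drop_pp > max_drop_pp
```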
Rollback path:
- Every prompt and model version must support one-click rollback
- Don't tie rollback to a code deploy
A typical workflow
1. Engineer changes prompt v22 → v23
2. CI runs offline eval suite against v23
3. Compare v23 scores to v22 per criterion
4. If pass: deploy to staging
5. Online evals run on staging traffic
6. Production deploy with traffic split: 5% to v23, 95% to v22
7. Online evals score v23 traffic continuously
8. After 24-48h with stable scores: promote v23 to 100%
This is what mature teams do. The eval pipeline runs at every step.
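The 5%/95% split in step 6 is commonly done by hashing a stable key, so the same user always sees the same version. A sketch using only the standard library; keying on a user ID and the 0.01% bucket granularity are assumptions.

```python
import hashlib


def assign_version(request_key: str, canary_percent: float = 5.0,
                   canary: str = "v23", stable: str = "v22") -> str:
    """Deterministically map a key (e.g. a user ID) to a prompt version."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # First 8 bytes of the hash -> a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return canary if bucket < canary_percent * 100 else stable
```

Because assignment depends only on the key and the percentage, promoting v23 to 100% or rolling back to 0% is a config change, not a code deploy.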
Common evaluation mistakes
- One "quality" score conflates everything. Decompose.
- Synthetic test set only. Production data is messier.
- LLM-as-judge with no human anchor. Validate weekly.
- Rule-based skipped because it "feels less sophisticated." Catches cheapest failures cheapest.
- Test scores not tied to model/prompt versions. Unactionable.
- No rollback path. Pointless to detect regressions if you can't fix them in 30 seconds.
- Skipping production monitoring. Offline evals miss what production exposes.
Tools you can use
- Respan: rule-based + LLM-as-judge + human review, online + offline, integrated with traces
- Braintrust: deepest scoring functions library
- Langfuse: open-source self-host
- LangSmith: LangChain-native
- Promptfoo: open-source CLI for CI
How to start tomorrow
If you're starting from zero:
- Today: pick 3 user complaints. Turn each into a criterion.
- This week: build a 50-example test set from production traces.
- Next week: wire LLM-as-judge against the criteria. Run on every prompt change.
- Week 3: add online eval on production traffic. Alert on score drops.
- Week 4: random-sample human review weekly. Track agreement vs judge.
A month from now you'll have a working eval pipeline.
FAQ
Should I use benchmarks like MMLU? For picking a base model, sure. For shipping an application, no; benchmarks don't measure your specific use case.
How big should my test set be? 50-200 examples for a focused eval. Quality of examples matters more than size.
How often should I re-run? Offline: on every prompt or model change. Online: continuously, on a sample of production requests.
Should I trust LLM-as-judge? For breadth, yes. As your only signal, no. Validate weekly with human review.
How much does eval cost? Rule-based: free. LLM-as-judge with cheap judge models: cents per 1k requests. Human review: real-people cost. Most teams' total eval cost is under 5% of their LLM bill.
Can I run evals in CI? Yes. Promptfoo is built for it. Most observability platforms have CI integrations.