Testing an AI model isn't like testing a function. There's no input that produces a single deterministic correct output you can assert == against. Outputs vary across runs. Quality is contested. Edge cases are infinite. The classic test pyramid breaks against this reality, and most teams' first instinct — "we'll just unit-test the prompt" — fails within months.
This guide is the pipeline that actually works in production. It's what we run on Respan and what we help customers run on their stacks. The principles apply whether you're testing a single LLM call, a multi-step agent, or a fine-tuned model.
The four-step pipeline
- Define quality criteria specific to your use case (3-5, not one)
- Build a test set from real production data (50-200 examples)
- Score outputs with rule-based + LLM-as-judge + sampled human review
- Wire alerts and rollback so regressions are caught before users notice
That's it. Everything else is implementation detail. Skip any step and the pipeline doesn't work.
Step 1: Define quality criteria
The single biggest mistake in AI testing is testing for "quality" as if it were one thing. Quality is not a single number. It's 3-5 orthogonal criteria, and they move independently.
Read the last 30 days of customer feedback. Each common complaint becomes a criterion. Examples:
| Product type | Common criteria |
|---|---|
| Customer support agent | Faithfulness to docs, empathy tone, escalation accuracy, format compliance, response length |
| RAG system | Faithfulness, citation correctness, completeness, relevance ranking |
| Code generation agent | Compilability, correctness vs spec, style guide compliance, security |
| Summarization | Faithfulness, completeness, conciseness, key fact preservation |
Rules:
- 3-5 criteria, no more
- One criterion per evaluator (don't conflate faithfulness and tone)
- Use ordinal scales (1-5) not binary, except for genuinely binary criteria
- Anchor every score with an example ("5 = like this real example")
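Criteria are easier to enforce when they live in code next to the eval suite. A minimal sketch of what that can look like; the criterion names, scales, and anchor texts here are illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str            # one orthogonal quality dimension
    scale: tuple         # ordinal anchors (1-5), or (0, 1) for genuinely binary checks
    anchor_example: str  # a real output that earns the top score

# Hypothetical criteria for a customer support agent.
CRITERIA = [
    Criterion("faithfulness", (1, 2, 3, 4, 5),
              "Reply where every claim cites a doc passage."),
    Criterion("escalation_accuracy", (0, 1),
              "Reply that correctly escalated a refund dispute."),
]
```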
Step 2: Build a test set
Test sets fail when they're synthetic — real production traffic is messier than any test set you'd hand-write. Sample from production traces:
1. Pull 500 random production traces from the last week
2. Cluster by user intent or feature (sketched in code below)
3. Sample 10-20 from each cluster
4. Have a human label expected behavior for each
Target size: 50-200 examples. Quality matters more than size.
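A minimal sketch of steps 1-3, assuming trace inputs are already exported as text; sentence-transformers and scikit-learn are stand-ins here, and any embedding model plus clustering method works:

```python
import random
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def sample_test_set(trace_inputs: list[str], n_clusters: int = 10,
                    per_cluster: int = 15) -> list[str]:
    # Embed each trace's input, cluster by rough intent, sample per cluster.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(trace_inputs)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    sampled = []
    for cluster in range(n_clusters):
        members = [t for t, lab in zip(trace_inputs, labels) if lab == cluster]
        sampled.extend(random.sample(members, min(per_cluster, len(members))))
    return sampled  # next: a human labels expected behavior for each
```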
For each example, record:
- Input (prompt + context)
- Expected behavior (notes, not exact output)
- Edge case category (so you can slice by category)
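One way to store each example as a record (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class TestExample:
    input: str               # prompt plus any retrieved/context data
    expected_behavior: str   # human notes on what a good output does
    edge_case_category: str  # lets you slice scores by category later
    metadata: dict = field(default_factory=dict)

example = TestExample(
    input="Customer: 'Why was I charged twice?' + billing context",
    expected_behavior="Acknowledge, verify the duplicate charge, offer the refund path.",
    edge_case_category="billing_dispute",
)
```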
If you can't sample from production yet (pre-launch), use representative synthetic data but recognize it'll miss real-world ugliness. Plan to swap to production data within month 1 of launch.
Step 3: Score outputs
Three flavors of scoring, each with a different role.
Rule-based (always run first)
Deterministic checks: schema validity, regex match, length bounds, profanity filter, banned-phrase detection, citation count, JSON structural correctness.
```python
import json

def schema_compliance(output: str) -> int:
    # Binary check: does the output parse as JSON and match the expected schema?
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0
    return 1 if validate_schema(data) else 0  # validate_schema: your app's schema checker
```
Cheap, fast, deterministic. Always run rule-based checks first — they catch the cheapest-to-detect failures at the lowest cost.
LLM-as-a-judge (run on every test)
A separate LLM scores outputs against a rubric. Example rubric:
```
You are evaluating a customer support reply for faithfulness to retrieved docs.
Score 1-5:
5 = every claim is directly supported
4 = mostly supported, one minor inferential leap
3 = mostly supported, one unsupported claim
2 = several claims have no support
1 = the reply contradicts or invents facts
Reply: {output}
Retrieved docs: {context}
Output JSON: {"reason": "<one sentence>", "score": <int>}
```
Use a cheap judge model (GPT-5.4 nano, Claude Haiku 4.5) — judge cost is per-call so cheap models compound favorably.
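A minimal judge call, sketched with the OpenAI Python client; the model name is a placeholder, and any provider works the same way:

```python
import json
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(rubric: str, output: str, context: str) -> dict:
    # str.replace (not str.format) so the literal JSON braces in the
    # rubric template don't break substitution.
    prompt = rubric.replace("{output}", output).replace("{context}", context)
    response = client.chat.completions.create(
        model="gpt-5.4-nano",  # placeholder: use whichever cheap judge you picked
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)  # {"reason": ..., "score": ...}
```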
Validate the judge weekly: randomly sample 50 production traces, score them with humans, and compare the human scores to the judge's. Track agreement. If agreement drops below 85%, your judge needs work.
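Agreement can be as simple as the fraction of samples where judge and human land on the same score; some teams loosen this to within-one-point agreement on a 1-5 scale:

```python
def judge_agreement(human_scores: list[int], judge_scores: list[int],
                    tolerance: int = 0) -> float:
    # Fraction of sampled traces where the judge lands within `tolerance`
    # points of the human score. Investigate if this drops below 0.85.
    pairs = list(zip(human_scores, judge_scores, strict=True))
    return sum(abs(h - j) <= tolerance for h, j in pairs) / len(pairs)
```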
Human review (sampled)
Three human review patterns:
- Random sample — review N items per day from production
- Disagreement-triggered — when LLM-as-judge and rule-based scores disagree, route to humans (sketched below)
- Edge case curation — humans label the long tail you sample from production traces
Human review is slow and expensive. The goal is anchoring, not exhaustive coverage.
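A sketch of disagreement-triggered routing; the scoring scale and the review-queue call are stand-ins for whatever your tooling uses:

```python
def needs_human_review(rules_passed: bool, judge_score: int) -> bool:
    # Disagreement between the two automated signals is the routing trigger:
    # a hard rule failed but the judge scored high, or vice versa.
    if not rules_passed and judge_score >= 4:
        return True
    if rules_passed and judge_score <= 2:
        return True
    return False

# Usage (hypothetical result objects and review tooling):
# if needs_human_review(result.rules_passed, result.judge_score):
#     review_queue.add(result)
```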
Step 4: Wire alerts and rollback
Test runs are useful only if they trigger action.
Offline (development) tests:
- Run on every prompt or model change
- Block deploy if scores drop more than 5% on any criterion (see the gate sketch after this list)
- Surface diffs between old version scores and new version scores
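A minimal version of that gate, assuming per-criterion scores are already averaged over the test set (loading them is left out):

```python
import sys

def gate(old: dict[str, float], new: dict[str, float],
         max_drop: float = 0.05) -> None:
    # Fail CI if any criterion's score regressed by more than max_drop (relative).
    for criterion, old_score in old.items():
        new_score = new[criterion]
        drop = (old_score - new_score) / old_score if old_score else 0.0
        print(f"{criterion}: {old_score:.2f} -> {new_score:.2f}")
        if drop > max_drop:
            print(f"BLOCKED: {criterion} dropped {drop:.1%}")
            sys.exit(1)  # non-zero exit blocks the deploy
```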
Online (production) monitoring:
- Run evaluators continuously on a sample of production traffic
- Alert when scores drop more than 5pp week-over-week on any criterion
- Alert when retry rate spikes
- Alert when cost per active user spikes
Rollback path:
- Every prompt or model version must support one-click rollback
- Don't tie rollback to a code deploy
A typical workflow
What "test before ship" actually looks like in production:
1. Engineer changes prompt v22 → v23
2. CI runs offline eval suite against v23
3. Compare v23 scores to v22 scores per criterion
4. If pass: deploy to staging
5. Staging soak; online evals running on staging traffic
6. Production deploy with traffic split — 5% to v23, 95% to v22
7. Online evals score v23 traffic continuously
8. After 24-48h with stable scores: promote v23 to 100%
9. Continuous monitoring; roll back if scores drop
This is the workflow we see at teams shipping reliable AI. Without these steps, regressions reach customers.
Common AI testing mistakes
- One "quality" score that conflates everything. Decompose into 3-5 criteria.
- Synthetic test set only. Production data is messier. Swap to real samples within month 1.
- LLM-as-judge with no human anchor. Judges have biases. Validate weekly.
- No rule-based evals because LLM-as-judge "feels more sophisticated." Rule-based checks catch the cheapest-to-detect failures at the lowest cost.
- Test scores not tied to model/prompt versions. A score without context is unactionable.
- No rollback path. If you can't roll back in 30 seconds, you don't have a workflow.
- Skipping production monitoring. Offline tests miss the regressions that production traffic exposes.
Tools you can use
- Respan — rule-based + LLM-as-judge + human review, online + offline, integrated with traces
- Braintrust — eval-first with strong scoring functions
- Langfuse — open source, solid evals
- LangSmith — LangChain-native evaluators
- Promptfoo — open-source CLI, great for CI integration
Specific guidance by model type
Testing a single LLM call: rule-based + LLM-as-judge cover 90% of value. Human review for the long tail.
Testing a multi-step agent: add per-step scores (was the right tool called? did the right argument get passed?). Score the agent run as a whole AND each step.
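A sketch of a per-step check for tool calls; the trace format is hypothetical, so adapt it to whatever your agent framework records:

```python
def score_step(step: dict, expected: dict) -> dict:
    # One rule-based score per agent step: right tool, right arguments.
    return {
        "correct_tool": int(step["tool_name"] == expected["tool_name"]),
        "correct_args": int(step["arguments"] == expected["arguments"]),
    }

# Usage (hypothetical trace structure):
# step_scores = [score_step(s, e) for s, e in zip(trace["steps"], expected_steps)]
```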
Testing a fine-tuned model: test set should include held-out examples not seen in training. Score against same criteria as the base model for direct comparison.
Testing a RAG system: split into retrieval evals (was the right doc retrieved?) and generation evals (did the answer use the retrieved doc faithfully?). Failure modes are different.
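For the retrieval half, a simple recall@k check works; it assumes your traces record retrieved doc IDs and a human has labeled the relevant ones:

```python
def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str],
                     k: int = 5) -> float:
    # Fraction of the known-relevant docs that appear in the top-k retrieved.
    if not relevant_ids:
        return 1.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)
```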
How to start tomorrow
If you're starting from zero:
- Today: pick 3 user complaints. Turn each into a criterion.
- This week: build a 50-example test set from production traces.
- Next week: wire LLM-as-judge against the criteria. Run on every prompt change.
- Week 3: add online eval on production traffic. Alert on score drops.
- Week 4: random-sample human review weekly. Track agreement vs judge.
A month from now you'll have an eval pipeline. The investment compounds — every future model swap, prompt change, and architectural decision becomes a measurement instead of a guess.
FAQ
How do I test an LLM I don't control (e.g., GPT-5.5)? Same pipeline. You're not testing the model; you're testing your application's quality with the model. The model is one variable in your test.
How big should my test set be? 50-200 examples for a focused eval. Quality of examples matters more than size — use real production data, not synthetic.
How often should I re-run tests? Offline: every prompt or model change. Online: continuous (every request or sampled).
Can I run tests in CI? Yes — Promptfoo is built for this. Most observability platforms have CI integrations.
Should I trust LLM-as-judge? For breadth — yes. As your only signal — no. Validate weekly with human review.
What about benchmarks like MMLU? Use benchmarks to pick a base model. Use evals to ship a product.
How much does AI testing cost? Rule-based: free. LLM-as-judge with cheap judge models: a few cents per 1k requests. Human review: the real cost is people's time. Most teams' total testing cost is under 5% of the LLM bill.