Testing an AI model isn't like testing a function. There's no input that produces a single deterministic correct output you can assert == against. Outputs vary across runs. Quality is contested. Edge cases are infinite. The classic test pyramid breaks against this reality, and most teams' first instinct — "we'll just unit-test the prompt" — fails within months.
This guide is the pipeline that actually works in production. It's what we run on Respan and what we help customers run on their stacks. The principles apply whether you're testing a single LLM call, a multi-step agent, or a fine-tuned model.
The four-step pipeline
- Define quality criteria specific to your use case (3-5, not one)
- Build a test set from real production data (50-200 examples)
- Score outputs with rule-based + LLM-as-judge + sampled human review
- Wire alerts and rollback so regressions are caught before users notice
That's it. Everything else is implementation detail. Skip any step and the pipeline doesn't work.
Step 1: Define quality criteria
The single biggest mistake in AI testing is testing for "quality" as if it were one thing. Quality is not a single number. It's 3-5 orthogonal criteria, and they move independently.
Read the last 30 days of customer feedback. Each common complaint becomes a criterion. Examples:
| Product type | Common criteria |
|---|---|
| Customer support agent | Faithfulness to docs, empathy tone, escalation accuracy, format compliance, response length |
| RAG system | Faithfulness, citation correctness, completeness, relevance ranking |
| Code generation agent | Compilability, correctness vs spec, style guide compliance, security |
| Summarization | Faithfulness, completeness, conciseness, key fact preservation |
Rules:
- 3-5 criteria, no more
- One criterion per evaluator (don't conflate faithfulness and tone)
- Use ordinal scales (1-5) not binary, except for genuinely binary criteria
- Anchor every score with an example ("5 = like this real example")
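Criteria are easier to enforce when they live in code next to the eval suite. A minimal sketch of what that can look like; the criterion names, scales, and anchor texts here are illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str            # one orthogonal quality dimension
    scale: tuple         # ordinal anchors (1-5), or (0, 1) for genuinely binary checks
    anchor_example: str  # a real output that earns the top score

# Hypothetical criteria for a customer support agent.
CRITERIA = [
    Criterion("faithfulness", (1, 2, 3, 4, 5),
              "Reply where every claim cites a doc passage."),
    Criterion("escalation_accuracy", (0, 1),
              "Reply that correctly escalated a refund dispute."),
]
```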
Step 2: Build a test set
Test sets fail when they're synthetic — real production traffic is messier than any test set you'd hand-write. Sample from production traces:
1. Pull 500 random production traces from the last week
2. Cluster by user intent or feature (sketched in code below)
3. Sample 10-20 from each cluster
4. Have a human label expected behavior for each
Target size: 50-200 examples. Quality matters more than size.
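A minimal sketch of steps 1-3, assuming trace inputs are already exported as text; sentence-transformers and scikit-learn are stand-ins here, and any embedding model plus clustering method works:

```python
import random
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def sample_test_set(trace_inputs: list[str], n_clusters: int = 10,
                    per_cluster: int = 15) -> list[str]:
    # Embed each trace's input, cluster by rough intent, sample per cluster.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(trace_inputs)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    sampled = []
    for cluster in range(n_clusters):
        members = [t for t, lab in zip(trace_inputs, labels) if lab == cluster]
        sampled.extend(random.sample(members, min(per_cluster, len(members))))
    return sampled  # next: a human labels expected behavior for each
```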
For each example, record:
- Input (prompt + context)
- Expected behavior (notes, not exact output)
- Edge case category (so you can slice by category)
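One way to store each example as a record (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class TestExample:
    input: str               # prompt plus any retrieved/context data
    expected_behavior: str   # human notes on what a good output does
    edge_case_category: str  # lets you slice scores by category later
    metadata: dict = field(default_factory=dict)

example = TestExample(
    input="Customer: 'Why was I charged twice?' + billing context",
    expected_behavior="Acknowledge, verify the duplicate charge, offer the refund path.",
    edge_case_category="billing_dispute",
)
```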
If you can't sample from production yet (pre-launch), use representative synthetic data but recognize it'll miss real-world ugliness. Plan to swap to production data within month 1 of launch.
Step 3: Score outputs
Three flavors of scoring, each with a different role.
Rule-based (always run first)
Deterministic checks: schema validity, regex match, length bounds, profanity filter, banned-phrase detection, citation count, JSON structural correctness.
```python
import json

def schema_compliance(output: str) -> int:
    # Binary check: does the output parse as JSON and match the expected schema?
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0
    return 1 if validate_schema(data) else 0  # validate_schema: your app's schema checker
```
Cheap, fast, deterministic. Always run rule-based checks first — they catch the cheapest-to-detect failures at the lowest cost.
LLM-as-a-judge (run on every test)
A separate LLM scores outputs against a rubric. Example rubric:
```
You are evaluating a customer support reply for faithfulness to retrieved docs.
Score 1-5:
5 = every claim is directly supported
4 = mostly supported, one minor inferential leap
3 = mostly supported, one unsupported claim
2 = several claims have no support
1 = the reply contradicts or invents facts
Reply: {output}
Retrieved docs: {context}
Output JSON: {"reason": "<one sentence>", "score": <int>}
```
Use a cheap judge model (GPT-5.4 nano, Claude Haiku 4.5) — judge cost is per-call so cheap models compound favorably.
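A minimal judge call, sketched with the OpenAI Python client; the model name is a placeholder, and any provider works the same way:

```python
import json
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(rubric: str, output: str, context: str) -> dict:
    # str.replace (not str.format) so the literal JSON braces in the
    # rubric template don't break substitution.
    prompt = rubric.replace("{output}", output).replace("{context}", context)
    response = client.chat.completions.create(
        model="gpt-5.4-nano",  # placeholder: use whichever cheap judge you picked
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)  # {"reason": ..., "score": ...}
```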
Validate the judge weekly: randomly sample 50 production traces, score them with humans, and compare the human scores to the judge's. Track agreement. If agreement drops below 85%, your judge needs work.
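Agreement can be as simple as the fraction of samples where judge and human land on the same score; some teams loosen this to within-one-point agreement on a 1-5 scale:

```python
def judge_agreement(human_scores: list[int], judge_scores: list[int],
                    tolerance: int = 0) -> float:
    # Fraction of sampled traces where the judge lands within `tolerance`
    # points of the human score. Investigate if this drops below 0.85.
    pairs = list(zip(human_scores, judge_scores, strict=True))
    return sum(abs(h - j) <= tolerance for h, j in pairs) / len(pairs)
```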
Human review (sampled)
Three human review patterns:
- Random sample — review N items per day from production
- Disagreement-triggered — when LLM-as-judge and rule-based scores disagree, route to humans (sketched below)
- Edge case curation — humans label the long tail you sample from production traces
Human review is slow and expensive. The goal is anchoring, not exhaustive coverage.
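A sketch of disagreement-triggered routing; the scoring scale and the review-queue call are stand-ins for whatever your tooling uses:

```python
def needs_human_review(rules_passed: bool, judge_score: int) -> bool:
    # Disagreement between the two automated signals is the routing trigger:
    # a hard rule failed but the judge scored high, or vice versa.
    if not rules_passed and judge_score >= 4:
        return True
    if rules_passed and judge_score <= 2:
        return True
    return False

# Usage (hypothetical result objects and review tooling):
# if needs_human_review(result.rules_passed, result.judge_score):
#     review_queue.add(result)
```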
Step 4: Wire alerts and rollback
Test runs are useful only if they trigger action.
Offline (development) tests:
- Run on every prompt or model change
- Block deploy if scores drop more than 5% on any criterion (see the gate sketch after this list)
- Surface diffs between old version scores and new version scores
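A minimal version of that gate, assuming per-criterion scores are already averaged over the test set (loading them is left out):

```python
import sys

def gate(old: dict[str, float], new: dict[str, float],
         max_drop: float = 0.05) -> None:
    # Fail CI if any criterion's score regressed by more than max_drop (relative).
    for criterion, old_score in old.items():
        new_score = new[criterion]
        drop = (old_score - new_score) / old_score if old_score else 0.0
        print(f"{criterion}: {old_score:.2f} -> {new_score:.2f}")
        if drop > max_drop:
            print(f"BLOCKED: {criterion} dropped {drop:.1%}")
            sys.exit(1)  # non-zero exit blocks the deploy
```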
Online (production) monitoring:
- Run evaluators continuously on a sample of production traffic
- Alert when scores drop more than 5pp week-over-week on any criterion
- Alert when retry rate spikes
- Alert when cost per active user spikes
Rollback path:
- Every prompt or model version must support one-click rollback
- Don't tie rollback to a code deploy
A typical workflow
What "test before ship" actually looks like in production:
1. Engineer changes prompt v22 → v23
2. CI runs offline eval suite against v23
3. Compare v23 scores to v22 scores per criterion
4. If pass: deploy to staging
5. Staging soak; online evals running on staging traffic
6. Production deploy with traffic split — 5% to v23, 95% to v22
7. Online evals score v23 traffic continuously
8. After 24-48h with stable scores: promote v23 to 100%
9. Continuous monitoring; roll back if scores drop
This is the workflow we see at teams shipping reliable AI. Without these steps, regressions reach customers.
Common AI testing mistakes
- One "quality" score that conflates everything. Decompose into 3-5 criteria.
- Synthetic test set only. Production data is messier. Swap to real samples within month 1.
- LLM-as-judge with no human anchor. Judges have biases. Validate weekly.
- No rule-based evals because LLM-as-judge "feels more sophisticated." Rule-based checks catch the cheapest-to-detect failures at the lowest cost.
- Test scores not tied to model/prompt versions. A score without context is unactionable.
- No rollback path. If you can't roll back in 30 seconds, you don't have a workflow.
- Skipping production monitoring. Offline tests miss the regressions that production traffic exposes.
Tools you can use
- Respan — rule-based + LLM-as-judge + human review, online + offline, integrated with traces
- Braintrust — eval-first with strong scoring functions
- Langfuse — open source, solid evals
- LangSmith — LangChain-native evaluators
- Promptfoo — open-source CLI, great for CI integration
Specific guidance by model type
Testing a single LLM call: rule-based + LLM-as-judge cover 90% of value. Human review for the long tail.
Testing a multi-step agent: add per-step scores (was the right tool called? did the right argument get passed?). Score the agent run as a whole AND each step.
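A sketch of a per-step check for tool calls; the trace format is hypothetical, so adapt it to whatever your agent framework records:

```python
def score_step(step: dict, expected: dict) -> dict:
    # One rule-based score per agent step: right tool, right arguments.
    return {
        "correct_tool": int(step["tool_name"] == expected["tool_name"]),
        "correct_args": int(step["arguments"] == expected["arguments"]),
    }

# Usage (hypothetical trace structure):
# step_scores = [score_step(s, e) for s, e in zip(trace["steps"], expected_steps)]
```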
Testing a fine-tuned model: test set should include held-out examples not seen in training. Score against same criteria as the base model for direct comparison.
Testing a RAG system: split into retrieval evals (was the right doc retrieved?) and generation evals (did the answer use the retrieved doc faithfully?). Failure modes are different.
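For the retrieval half, a simple recall@k check works; it assumes your traces record retrieved doc IDs and a human has labeled the relevant ones:

```python
def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str],
                     k: int = 5) -> float:
    # Fraction of the known-relevant docs that appear in the top-k retrieved.
    if not relevant_ids:
        return 1.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)
```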
How to start tomorrow
If you're starting from zero:
- Today: pick 3 user complaints. Turn each into a criterion.
- This week: build a 50-example test set from production traces.
- Next week: wire LLM-as-judge against the criteria. Run on every prompt change.
- Week 3: add online eval on production traffic. Alert on score drops.
- Week 4: random-sample human review weekly. Track agreement vs judge.
A month from now you'll have an eval pipeline. The investment compounds — every future model swap, prompt change, and architectural decision becomes a measurement instead of a guess.
FAQ
How do I test an LLM I don't control (e.g., GPT-5.5)? Same pipeline. You're not testing the model; you're testing your application's quality with the model. The model is one variable in your test.
How big should my test set be? 50-200 examples for a focused eval. Quality of examples matters more than size — use real production data, not synthetic.
How often should I re-run tests? Offline: every prompt or model change. Online: continuous (every request or sampled).
Can I run tests in CI? Yes — Promptfoo is built for this. Most observability platforms have CI integrations.
Should I trust LLM-as-judge? For breadth — yes. As your only signal — no. Validate weekly with human review.
What about benchmarks like MMLU? Use benchmarks to pick a base model. Use evals to ship a product.
How much does AI testing cost? Rule-based: free. LLM-as-judge with cheap judge models: a few cents per 1k requests. Human review: the real cost is people's time. Most teams' total testing cost is under 5% of the LLM bill.