Prompt evaluation is how you turn "this answer feels right" into a number. It's the practice of running prompts against test cases and scoring their outputs against criteria — automatically, at scale, before and after shipping. Without it, every prompt change is a guess and every customer complaint is a surprise.
If you're shipping LLM features, prompt evaluation is the single highest-leverage practice you can adopt. It's also the one most teams skip because "we'll add it later" — and then later never comes.
TL;DR
Prompt evaluation means:
- Defining 3-5 quality criteria specific to your use case (faithfulness, format compliance, tone, accuracy)
- Building a test set of 50-200 example inputs with known good behavior
- Running new prompt versions against the test set automatically
- Scoring outputs with rule-based checks, LLM-as-judge, or human review
- Tracking scores over time so regressions are visible
- Running the same evaluators on production traffic for online monitoring
The teams that ship reliable AI all have eval pipelines. The teams that don't, ship and pray.
Why prompt evaluation matters
Three things you can't do without it:
1. Catch regressions before customers do
A prompt change ships. It fixes the edge case it was meant to. It also breaks a subtle behavior nobody tested for. Without evals, you find out from a customer complaint a week later. With evals, the test set catches it before deploy.
2. Compare prompt versions empirically
Which prompt produces better answers — A or B? Without evals, this is opinion ("I think A reads more naturally"). With evals, it's a measurement ("A scores 4.2/5 on faithfulness, B scores 3.7"). Empirical comparison lets you ship the better version with confidence.
3. Compare models empirically
Should we switch from GPT-5.4 to Claude Sonnet 4.6? With evals on your specific use case, you can answer this in a week of work. Without them, you're choosing based on benchmarks that don't match your data.
The three types of prompt evaluation
1. Rule-based evaluation
Deterministic checks: schema validation, regex match, length bounds, profanity filter, JSON validity, structured output compliance.
Pros: Fast (microseconds), cheap (free), reliable (no judge variance). Cons: Only works for criteria with clear right/wrong answers.
Use for: anything with a binary right/wrong answer, such as JSON validity, format compliance, and banned-phrase detection.
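A minimal sketch of what these checks look like in practice, in Python. The function names and the banned-phrase list are illustrative, not a prescribed API:

```python
import json
import re

BANNED_PHRASES = ["as an AI language model", "please consult a professional"]  # illustrative list

def json_valid(output: str) -> bool:
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length(output: str, max_chars: int = 2000) -> bool:
    """Pass if the output respects the length bound."""
    return len(output) <= max_chars

def no_banned_phrases(output: str) -> bool:
    """Pass if no banned phrase appears, case-insensitively."""
    return not any(re.search(re.escape(p), output, re.IGNORECASE) for p in BANNED_PHRASES)

def run_rule_checks(output: str) -> dict[str, bool]:
    """Run every deterministic check and report pass/fail per criterion."""
    return {
        "json_valid": json_valid(output),
        "within_length": within_length(output),
        "no_banned_phrases": no_banned_phrases(output),
    }
```

Each check runs in microseconds, so there's no reason not to run all of them on every output.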
2. LLM-as-a-judge
A separate LLM scores outputs against a rubric you write. Example for a customer support reply:
```
Score the reply on faithfulness from 1-5:
5 = every claim is directly supported by retrieved docs
1 = the reply contradicts or invents facts not in the docs
Reply: {output}
Retrieved docs: {context}
Return: {"reason": "<one sentence>", "score": <int>}
```
Pros: Fast (sub-second), cheap (cents per 1k requests), scales to every production request, strong correlation with human judgment for objective criteria. Cons: Inherits the judge model's biases. Less reliable on subjective criteria like tone. Must be validated against humans periodically.
Use for: breadth — running on every output to catch obvious quality drops. Anchor with sampled human review.
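Wiring that rubric to a judge takes a dozen lines. A minimal sketch using the OpenAI Python SDK; the judge model name is illustrative, and production code would handle parse failures rather than assume clean JSON:

```python
import json
from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """Score the reply on faithfulness from 1-5:
5 = every claim is directly supported by retrieved docs
1 = the reply contradicts or invents facts not in the docs
Reply: {output}
Retrieved docs: {context}
Return: {{"reason": "<one sentence>", "score": <int>}}"""

def judge_faithfulness(output: str, context: str, model: str = "gpt-4o-mini") -> dict:
    """Score one output against the faithfulness rubric; returns {reason, score}."""
    response = client.chat.completions.create(
        model=model,  # any cheap judge model works; this name is illustrative
        temperature=0,  # keep scoring as deterministic as the model allows
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(output=output, context=context),
        }],
    )
    # A sketch: real code should handle judge replies that fail to parse as JSON.
    verdict = json.loads(response.choices[0].message.content)
    assert 1 <= verdict["score"] <= 5, "judge returned an out-of-range score"
    return verdict
```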
3. Human review
Sampled or full review by domain experts.
Pros: Ground truth for high-stakes domains. Catches what LLM-as-judge misses. Cons: Slow (minutes per item). Expensive (real people).
Use for: edge case curation, weekly judge validation, high-stakes domains where reasonable humans disagree.
How to design an eval rubric
Six rules that hold up:
- Start with user complaints, not abstract criteria. Read the last 30 days of customer feedback. Each common complaint becomes an eval criterion.
- One criterion per eval. Don't mix faithfulness and tone in the same prompt — the judge can't disentangle them.
- Use an ordinal scale (1-5) rather than binary pass/fail, except when the criterion is genuinely binary.
- Anchor every score with an example. "5 = like this real example. 1 = like this real example."
- Force the judge to reason before scoring. Output {reason, score}, not just {score}.
- Validate the judge weekly. Randomly sample 50 production traces, score them with humans, and compare to the judge's scores. Track agreement over time (a minimal sketch follows this list).
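The weekly validation step reduces to comparing two paired score lists. A sketch, assuming human and judge scores are collected per trace; exact-match and within-one agreement are two simple metrics you might track:

```python
def judge_agreement(human_scores: list[int], judge_scores: list[int]) -> dict[str, float]:
    """Compare paired human and judge scores on the same sampled traces."""
    assert len(human_scores) == len(judge_scores), "scores must be paired per trace"
    n = len(human_scores)
    exact = sum(h == j for h, j in zip(human_scores, judge_scores)) / n
    within_one = sum(abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores)) / n
    return {"exact_agreement": exact, "within_one_agreement": within_one}

# Weekly run over traces scored by both humans and the judge:
# judge_agreement([5, 4, 2, 5], [5, 3, 2, 4])
# -> {"exact_agreement": 0.5, "within_one_agreement": 1.0}
```

If agreement drifts down, fix the rubric or the judge before trusting another week of scores.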
Online vs offline evals
Offline evals run during development against a frozen test set. Fast feedback — change a prompt, run the suite, see if it improved or regressed.
Online evals run continuously against live production traffic. Slower signal but catches the regressions that don't show up in your test set, because production traffic is messier than any test set you'd write.
Production teams run both. Offline is the dev loop; online is the canary.
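One way to wire online evals without adding request latency: sample a fraction of production traffic and score it off the hot path. A sketch; the sampling rate is illustrative, `judge_faithfulness` is the judge sketch from earlier, and `record_score` is a hypothetical sink for wherever your scores live:

```python
import queue
import random
import threading

SAMPLE_RATE = 0.05  # score 5% of production traffic; tune to volume and budget
eval_queue: queue.Queue = queue.Queue()

def maybe_enqueue_for_eval(request_id: str, output: str, context: str) -> None:
    """Hot path: a cheap coin flip, no model call, no added latency."""
    if random.random() < SAMPLE_RATE:
        eval_queue.put((request_id, output, context))

def eval_worker() -> None:
    """Background worker: scores sampled traffic off the request path."""
    while True:
        request_id, output, context = eval_queue.get()
        verdict = judge_faithfulness(output, context)  # judge sketch from earlier
        record_score(request_id, "faithfulness", verdict["score"])  # hypothetical score sink

threading.Thread(target=eval_worker, daemon=True).start()
```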
Prompt evaluation in a typical workflow
1. Engineer changes prompt (v23)
2. Offline eval pipeline runs new prompt against test set
3. Scores compared to v22 — pass/fail per criterion
4. If pass: deploy to staging
5. Staging soak with eval still running on staging traffic
6. Production deploy with traffic split (5% to v23, 95% to v22)
7. Online evals score v23 traffic continuously
8. After 24-48h with stable scores: promote v23 to 100%
9. Continuous monitoring; rollback if scores drop
This is what a mature LLM team's prompt change workflow looks like. The eval pipeline runs at every step. It's the single most important practice for shipping reliable AI.
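The gate at step 3 is simple to express in code. A sketch of a per-criterion pass/fail comparison; the threshold is illustrative and would normally be set per criterion:

```python
MAX_DROP = 0.2  # fail if any criterion's mean score drops by more than this; illustrative

def regression_gate(baseline: dict[str, float], candidate: dict[str, float]) -> bool:
    """Per-criterion pass/fail: compare the candidate version against the baseline."""
    passed = True
    for criterion, old in baseline.items():
        new = candidate[criterion]
        ok = new >= old - MAX_DROP
        passed = passed and ok
        print(f"{criterion}: baseline={old:.2f} candidate={new:.2f} {'PASS' if ok else 'FAIL'}")
    return passed

# e.g. v22 -> v23:
# regression_gate({"faithfulness": 4.2, "format_compliance": 4.8},
#                 {"faithfulness": 4.3, "format_compliance": 4.4})
# format_compliance FAILs: 4.4 < 4.8 - 0.2, so the deploy stops at step 4.
```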
Common prompt evaluation mistakes
- One "quality" score that conflates everything. Decompose into 3-5 criteria.
- LLM-as-judge with no human anchor. Sample-validate weekly or you're measuring judge bias.
- Synthetic test set only. Synthetic test sets miss the ugly edge cases real users produce. Sample from production traces.
- No eval before shipping prompt changes. The whole point of evals is to catch regressions before users do.
- Eval scores not tied to prompt versions. A score without the version context is unactionable.
- Skipping rule-based evals because LLM-as-judge "feels more sophisticated." Rule-based checks catch the most obvious failures at the lowest cost. Always run them.
Tools that handle prompt evaluation
- Respan — rule-based + LLM-as-judge + human review, online + offline, scores attached to traces
- Braintrust — eval-first product, deepest scoring functions library
- Langfuse — solid eval support, open source
- LangSmith — LangChain-native, evaluator library tied to LCEL
- Promptfoo (open source) — CLI-first prompt evaluation in CI
For a deep dive on evals specifically, see our LLM Evals: The Complete Guide.
How to start
Three-step path to working prompt evaluation:
- Pick 3 user complaints from the last month. Turn each into an eval criterion.
- Build a 50-100 example test set with expected behavior. Sample from production traces if you can.
- Wire LLM-as-judge against the criteria. Run on the test set on every prompt change. Alert when scores drop.
After that you can layer in human review, online evals on production traffic, dataset curation from interesting traces, and per-version score tracking.
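Tying the three steps together is a short loop. A minimal sketch, assuming a JSONL test set, the `judge_faithfulness` sketch from earlier, and a hypothetical `generate_reply` that calls the prompt under test; the alert threshold is illustrative:

```python
import json

ALERT_THRESHOLD = 4.0  # mean faithfulness below this fails the run; tune to your rubric

def run_eval_suite(test_set_path: str) -> float:
    """Score every test case with the judge and return the mean score."""
    with open(test_set_path) as f:
        cases = [json.loads(line) for line in f]  # JSONL rows: {"input": ..., "context": ...}
    scores = []
    for case in cases:
        output = generate_reply(case["input"])  # hypothetical: your prompt under test
        verdict = judge_faithfulness(output, case["context"])  # judge sketch from earlier
        scores.append(verdict["score"])
    mean = sum(scores) / len(scores)
    if mean < ALERT_THRESHOLD:
        print(f"ALERT: mean faithfulness {mean:.2f} is below {ALERT_THRESHOLD}")
    return mean
```

Run it on every prompt change; everything else in this post layers on top of this loop.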
FAQ
What's the difference between prompt evaluation and LLM evaluation? Prompt evaluation specifically scores how well a prompt produces good outputs. LLM evaluation is broader: it can include model comparison, fine-tune evaluation, and agent evaluation. They overlap heavily, and the practices are similar.
Should I trust LLM-as-judge? Yes for breadth, no as your only signal. LLM-as-judge agrees with humans far more reliably on objective criteria like faithfulness than on subjective ones like tone. Validate weekly with human review.
How big should my test set be? 50-200 examples for a focused eval. More if your domain has many edge cases. Quality of the test set matters more than size — sample from real production traces, not synthetic data.
What about benchmarks like MMLU? Benchmarks measure model capability on standardized tasks. Evals measure your application's quality on your data. Use benchmarks to pick a base model; use evals to ship.
Can I run evals in CI? Yes. Promptfoo is built for this. Respan, Langfuse, LangSmith, and Braintrust all have CI integrations.
How much do prompt evals cost? Rule-based: free. LLM-as-judge: a few cents per 1k requests with cheap judge models (Haiku, GPT-5.4 nano). Human review: the cost of reviewers' time. Most teams' total eval cost is under 5% of their LLM bill.
Should I evaluate before or after shipping? Both. Offline evals before shipping; online evals on production traffic. Skip either side and you have a blind spot.