By section 4 you have a tracing system that records every step of every request. The next risk is changing things and silently breaking them. Evaluators are the gate that catches regressions before they reach customers.
What is an evaluator
An evaluator is code or a grading prompt that automatically scores whether an LLM output is good or bad, across a sample of inputs.
Without evaluators, the loop is:
- Change a prompt
- Ship to production
- Wait for customers to complain
- Realize the change broke something a week later
With evaluators, the loop is:
- Change a prompt
- Run the evaluator on a sample of inputs
- See the score go up or down
- Decide whether to ship based on the number
Same loop, run before deploy instead of after.
The four pieces of an eval
- Prompt or model: what you are testing. The new prompt version, or a different model.
- Dataset: a list of test inputs (and optionally expected outputs). Usually called a "golden set" or "test set."
- Evaluator: code or an LLM that scores each output.
- Experiment: the run that combines all three and produces a score per output plus an aggregate score.
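As a concrete sketch, the four pieces fit together roughly like this. Everything here is illustrative: the prompt, dataset row, model call, and grader are stand-ins, not a specific SDK.

```python
# Illustrative sketch of the four pieces wired together; the prompt, dataset,
# model call, and grader are all stand-ins, not a specific SDK.

def call_model(prompt: str) -> str:
    """Stand-in for your actual LLM call."""
    return "Returns are accepted within 30 days of delivery, so a 45-day return is outside the window."

PROMPT_V2 = (  # the prompt version under test
    "Answer the customer using only the KB excerpt.\n"
    "KB: {kb}\nCustomer: {question}"
)

dataset = [  # the "golden set": inputs plus optional expected outputs
    {
        "question": "Can I return an item after 45 days?",
        "kb": "Returns are accepted within 30 days of delivery.",
        "expected_substring": "30 days",
    },
    # ... 50-200 more rows, ideally sampled from production
]

def evaluator(output: str, row: dict) -> float:
    """Code grader: 1.0 if the reply states the policy window, else 0.0."""
    return 1.0 if row["expected_substring"] in output else 0.0

def run_experiment(prompt: str) -> float:
    """The experiment: score every row and return the aggregate."""
    scores = [evaluator(call_model(prompt.format(**row)), row) for row in dataset]
    return sum(scores) / len(scores)

print(run_experiment(PROMPT_V2))  # compare this number across prompt versions
```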
Two kinds of evaluators
LLM-as-judge graders
The most common evaluator type. You write a rubric in plain English, pin a grading model (often a stronger model than the one you are evaluating), and let the judge score each output.
Example rubric for a customer support agent:
Mark `pass` only if every policy claim in the reply is supported by the retrieved KB excerpts. If a policy claim has no retrieval support, mark `fail` and quote the unsupported claim.
That grader runs over 200 sampled customer support conversations. You get a pass rate per prompt version. When a new prompt drops the pass rate, you know before deploying.
LLM-as-judge graders are flexible (they can score any subjective quality dimension) but cost a model call per evaluation. For a 200-input dataset that is still fast and cheap, usually under a dollar.
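A minimal sketch of that grader as code, assuming the OpenAI Python SDK; the judge model name, rubric wording, and pass/fail parsing are illustrative choices, not a prescribed setup.

```python
# Minimal LLM-as-judge grader, assuming the OpenAI Python SDK; the judge model,
# rubric wording, and pass/fail parsing are illustrative choices.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Mark pass only if every policy claim in the reply is supported by the "
    "retrieved KB excerpts. If a policy claim has no retrieval support, mark "
    "fail and quote the unsupported claim. Answer 'pass' or 'fail', then a short reason."
)

def judge(reply: str, kb_excerpts: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # pin the grading model, often stronger than the one under test
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"KB excerpts:\n{kb_excerpts}\n\nReply:\n{reply}"},
        ],
    )
    verdict = response.choices[0].message.content.strip()
    return {"passed": verdict.lower().startswith("pass"), "comment": verdict}
```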
Code graders
A deterministic Python function. Use this when the check is objective.
```python
def main(eval_inputs):
    output = eval_inputs["output"]
    # custom logic, return a score
    if "I'm sorry" in output and len(output) < 500:
        return {"score": 1.0, "passed": True}
    return {"score": 0.0, "passed": False}
```

Use code graders for:
- Format validation (is the output valid JSON?)
- Length constraints (under 500 characters?)
- Regex matching (does it cite a source URL?)
- Schema checks (does the response have the required fields?)
- Anything that does not need judgment, just rules.
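Two hedged sketches of such graders, reusing the `eval_inputs` shape from the example above:

```python
# Sketches of two code graders, reusing the eval_inputs shape from the example above.
import json
import re

def grade_valid_json(eval_inputs):
    """Format validation: is the output parseable JSON?"""
    try:
        json.loads(eval_inputs["output"])
        return {"score": 1.0, "passed": True}
    except (ValueError, TypeError):
        return {"score": 0.0, "passed": False}

def grade_cites_source(eval_inputs):
    """Regex matching: does the reply cite at least one URL?"""
    passed = bool(re.search(r"https?://\S+", eval_inputs["output"]))
    return {"score": 1.0 if passed else 0.0, "passed": passed}
```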
A typical eval suite has 60% LLM-as-judge graders and 40% code graders.
Where the dataset comes from
Two common sources:
- From a CSV you upload: hand-written test cases, useful for adversarial inputs you definitely want to cover.
- From sampling production traces: once you have tracing (section 4), you can sample 200 real production conversations into a dataset with one click. This is usually the better option because the dataset looks like real traffic.
You can also mix: 100 real production samples + 50 hand-written adversarial cases.
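A sketch of assembling that mixed set; `load_sampled_traces` is a hypothetical stand-in for however your tracing tool exports sampled production conversations, and the CSV path is a placeholder.

```python
# Sketch of assembling a mixed golden set; load_sampled_traces is a hypothetical
# stand-in for however your tracing tool exports sampled production conversations.
import csv

def load_csv_cases(path: str) -> list[dict]:
    """Hand-written adversarial cases from a CSV with one row per test input."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_sampled_traces(n: int) -> list[dict]:
    """Hypothetical: pull n real production conversations from your trace store."""
    return []

golden_set = load_sampled_traces(100) + load_csv_cases("adversarial_cases.csv")[:50]
```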
Offline vs online evals
Offline evals run on a fixed dataset before deploying a change. Block CI if the score drops below threshold. This catches regressions before they reach production.
Online evals run on a sample of live production traffic. Sample 1% of customer-facing replies into the same evaluator suite. Catch regressions in days, not weeks.
Both are valuable. Offline catches what you anticipated. Online catches what you did not.
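A sketch of the offline CI gate; the threshold and the eval runner are placeholders for your own baseline and experiment code.

```python
# Sketch of the offline CI gate; the threshold and the eval runner are
# placeholders for your own baseline and experiment code.
import sys

THRESHOLD = 0.90  # set this from the current prompt's score, not a round number you like

def run_offline_eval() -> float:
    """Stand-in for running the experiment on the golden dataset and returning the aggregate score."""
    return 0.93

if __name__ == "__main__":
    score = run_offline_eval()
    print(f"eval score: {score:.2f} (threshold {THRESHOLD})")
    if score < THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the merge
```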
Score types
Evaluator scores come in four shapes:
| Type | Example | When to use |
|---|---|---|
| Numerical | 1-5 rating | Quality dimensions like helpfulness |
| Boolean | pass/fail | Binary checks like "is this safe" |
| Categorical | ["polite", "concise"] | Multi-label classification |
| Comment | free-text reasoning | Qualitative feedback for review |
Most evaluators produce one numerical score plus a comment explaining why.
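For illustration, the four shapes might come back as dictionaries like these; the field names are assumptions, not a fixed schema.

```python
# Illustrative return shapes for each score type; the field names are assumptions,
# not a fixed schema.
numerical = {"score": 4, "comment": "Helpful, but missed the refund deadline."}        # 1-5 rating
boolean = {"passed": True, "comment": "No unsafe content detected."}                    # pass/fail
categorical = {"labels": ["polite", "concise"], "comment": "Tone matches the brand."}   # multi-label
comment_only = {"comment": "Accurate, but the closing reads as abrupt."}                # free-text
```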
A working setup
The shortest viable eval workflow:
- One golden dataset: 50-200 inputs that look like real production traffic.
- Two LLM-as-judge graders on the failure modes you care about most. (For a customer support agent: hallucinated policy claims, missed escalations.)
- One experiment per significant prompt change. Compare against the previous prompt version on the same dataset.
- One CI gate. A PR that drops the eval score below threshold cannot merge.
- Online sampling at 1% of production. Daily dashboard alert if the score drops.
That setup is the difference between confident deploys and the Klarna 2024 reversal (which had no online quality monitoring on edge cases).
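A sketch of the online sampling piece, with the judge and the dashboard write stubbed out; in practice `judge_reply` would call the same grader you run offline.

```python
# Sketch of online sampling: roughly 1% of live replies go through the same
# grader used offline. judge_reply and the dashboard write are stand-ins.
import random

SAMPLE_RATE = 0.01

def judge_reply(reply: str, kb_excerpts: str) -> dict:
    """Stand-in for the same LLM-as-judge grader used in offline experiments."""
    return {"passed": True, "comment": "stub"}

def on_reply_sent(reply: str, kb_excerpts: str) -> None:
    """Call from the production path after a customer-facing reply goes out."""
    if random.random() < SAMPLE_RATE:
        result = judge_reply(reply, kb_excerpts)
        print(result)  # stand-in for writing the score to your eval dashboard / alerting
```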
Where to set up evals
In Respan, datasets live on the Datasets page and evaluators live on the Evaluators page. You can also push results into the same trace tree from section 4 so a span shows its eval scores inline.
The full flow: create the prompt → create a dataset → create an evaluator with one or more graders → run an experiment → iterate.
What you have at the end of section 5
- A small golden dataset that represents real traffic.
- Two or more evaluators that score on the failure modes you fear.
- A score per change. You know before shipping whether quality went up or down.
- A CI gate that blocks regressions.
- An online sample that catches what offline evals missed.
Next: agents and tool use
The next section, Agents and tool use, covers when one LLM call is not enough, when a workflow is enough, and when you actually need a true agent.
Or back to the Chapter 1 hub.
