By section 4 you have a tracing system that records every step of every request. The next risk is changing things and silently breaking them. Evaluators are the gate that catches regressions before they reach customers.
What is an evaluator
An evaluator is code or a grading prompt that automatically scores whether an LLM output is good or bad, across a sample of inputs.
Without evaluators, the loop is:
- Change a prompt
- Ship to production
- Wait for customers to complain
- Realize the change broke something a week later
With evaluators, the loop is:
- Change a prompt
- Run the evaluator on a sample of inputs
- See the score go up or down
- Decide whether to ship based on the number
Same loop, run before deploy instead of after.
The four pieces of an eval
- Prompt or model: what you are testing. The new prompt version, or a different model.
- Dataset: a list of test inputs (and optionally expected outputs). Usually called a "golden set" or "test set."
- Evaluator: code or an LLM that scores each output.
- Experiment: the run that combines all three and produces a score per output plus an aggregate score.
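As a concrete sketch, the four pieces fit together roughly like this. Everything here is illustrative: the prompt, dataset row, model call, and grader are stand-ins, not a specific SDK.

```python
# Illustrative sketch of the four pieces wired together; the prompt, dataset,
# model call, and grader are all stand-ins, not a specific SDK.

def call_model(prompt: str) -> str:
    """Stand-in for your actual LLM call."""
    return "Returns are accepted within 30 days of delivery, so a 45-day return is outside the window."

PROMPT_V2 = (  # the prompt version under test
    "Answer the customer using only the KB excerpt.\n"
    "KB: {kb}\nCustomer: {question}"
)

dataset = [  # the "golden set": inputs plus optional expected outputs
    {
        "question": "Can I return an item after 45 days?",
        "kb": "Returns are accepted within 30 days of delivery.",
        "expected_substring": "30 days",
    },
    # ... 50-200 more rows, ideally sampled from production
]

def evaluator(output: str, row: dict) -> float:
    """Code grader: 1.0 if the reply states the policy window, else 0.0."""
    return 1.0 if row["expected_substring"] in output else 0.0

def run_experiment(prompt: str) -> float:
    """The experiment: score every row and return the aggregate."""
    scores = [evaluator(call_model(prompt.format(**row)), row) for row in dataset]
    return sum(scores) / len(scores)

print(run_experiment(PROMPT_V2))  # compare this number across prompt versions
```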
Two kinds of evaluators
LLM-as-judge graders
The most common evaluator type. You write a rubric in plain English, pin a grading model (often a stronger model than the one you are evaluating), and let the judge score each output.
Example rubric for a customer support agent:
Mark `pass` only if every policy claim in the reply is supported by the retrieved KB excerpts. If a policy claim has no retrieval support, mark `fail` and quote the unsupported claim.
That grader runs over 200 sampled customer support conversations. You get a pass rate per prompt version. When a new prompt drops the pass rate, you know before deploying.
LLM-as-judge graders are flexible (they can score any subjective quality dimension) but cost a model call per evaluation. For a 200-input dataset that is still fast and cheap, usually under a dollar.
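A minimal sketch of that grader as code, assuming the OpenAI Python SDK; the judge model name, rubric wording, and pass/fail parsing are illustrative choices, not a prescribed setup.

```python
# Minimal LLM-as-judge grader, assuming the OpenAI Python SDK; the judge model,
# rubric wording, and pass/fail parsing are illustrative choices.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Mark pass only if every policy claim in the reply is supported by the "
    "retrieved KB excerpts. If a policy claim has no retrieval support, mark "
    "fail and quote the unsupported claim. Answer 'pass' or 'fail', then a short reason."
)

def judge(reply: str, kb_excerpts: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # pin the grading model, often stronger than the one under test
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"KB excerpts:\n{kb_excerpts}\n\nReply:\n{reply}"},
        ],
    )
    verdict = response.choices[0].message.content.strip()
    return {"passed": verdict.lower().startswith("pass"), "comment": verdict}
```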
Code graders
A deterministic Python function. Use this when the check is objective.
```python
def main(eval_inputs):
    output = eval_inputs["output"]
    # custom logic, return a score
    if "I'm sorry" in output and len(output) < 500:
        return {"score": 1.0, "passed": True}
    return {"score": 0.0, "passed": False}
```

Use code graders for:
- Format validation (is the output valid JSON?)
- Length constraints (under 500 characters?)
- Regex matching (does it cite a source URL?)
- Schema checks (does the response have the required fields?)
- Anything that does not need judgment, just rules.
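Two hedged sketches of such graders, reusing the `eval_inputs` shape from the example above:

```python
# Sketches of two code graders, reusing the eval_inputs shape from the example above.
import json
import re

def grade_valid_json(eval_inputs):
    """Format validation: is the output parseable JSON?"""
    try:
        json.loads(eval_inputs["output"])
        return {"score": 1.0, "passed": True}
    except (ValueError, TypeError):
        return {"score": 0.0, "passed": False}

def grade_cites_source(eval_inputs):
    """Regex matching: does the reply cite at least one URL?"""
    passed = bool(re.search(r"https?://\S+", eval_inputs["output"]))
    return {"score": 1.0 if passed else 0.0, "passed": passed}
```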
A typical eval suite has 60% LLM-as-judge graders and 40% code graders.
Where the dataset comes from
Two common sources:
- From a CSV you upload: hand-written test cases, useful for adversarial inputs you definitely want to cover.
- From sampling production traces: once you have tracing (section 4), you can sample 200 real production conversations into a dataset with one click. This is usually the better option because the dataset looks like real traffic.
You can also mix: 100 real production samples + 50 hand-written adversarial cases.
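A sketch of assembling that mixed set; `load_sampled_traces` is a hypothetical stand-in for however your tracing tool exports sampled production conversations, and the CSV path is a placeholder.

```python
# Sketch of assembling a mixed golden set; load_sampled_traces is a hypothetical
# stand-in for however your tracing tool exports sampled production conversations.
import csv

def load_csv_cases(path: str) -> list[dict]:
    """Hand-written adversarial cases from a CSV with one row per test input."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_sampled_traces(n: int) -> list[dict]:
    """Hypothetical: pull n real production conversations from your trace store."""
    return []

golden_set = load_sampled_traces(100) + load_csv_cases("adversarial_cases.csv")[:50]
```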
Offline vs online evals
Offline evals run on a fixed dataset before deploying a change. Block CI if the score drops below threshold. This catches regressions before they reach production.
Online evals run on a sample of live production traffic. Sample 1% of customer-facing replies into the same evaluator suite. Catch regressions in days, not weeks.
Both are valuable. Offline catches what you anticipated. Online catches what you did not.
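A sketch of the offline CI gate; the threshold and the eval runner are placeholders for your own baseline and experiment code.

```python
# Sketch of the offline CI gate; the threshold and the eval runner are
# placeholders for your own baseline and experiment code.
import sys

THRESHOLD = 0.90  # set this from the current prompt's score, not a round number you like

def run_offline_eval() -> float:
    """Stand-in for running the experiment on the golden dataset and returning the aggregate score."""
    return 0.93

if __name__ == "__main__":
    score = run_offline_eval()
    print(f"eval score: {score:.2f} (threshold {THRESHOLD})")
    if score < THRESHOLD:
        sys.exit(1)  # non-zero exit fails the CI job and blocks the merge
```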
Score types
Evaluator scores come in four shapes:
| Type | Example | When to use |
|---|---|---|
| Numerical | 1-5 rating | Quality dimensions like helpfulness |
| Boolean | pass/fail | Binary checks like "is this safe" |
| Categorical | ["polite", "concise"] | Multi-label classification |
| Comment | free-text reasoning | Qualitative feedback for review |
Most evaluators produce one numerical score plus a comment explaining why.
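For illustration, the four shapes might come back as dictionaries like these; the field names are assumptions, not a fixed schema.

```python
# Illustrative return shapes for each score type; the field names are assumptions,
# not a fixed schema.
numerical = {"score": 4, "comment": "Helpful, but missed the refund deadline."}        # 1-5 rating
boolean = {"passed": True, "comment": "No unsafe content detected."}                    # pass/fail
categorical = {"labels": ["polite", "concise"], "comment": "Tone matches the brand."}   # multi-label
comment_only = {"comment": "Accurate, but the closing reads as abrupt."}                # free-text
```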
A working setup
The shortest viable eval workflow:
- One golden dataset: 50-200 inputs that look like real production traffic.
- Two LLM-as-judge graders on the failure modes you care about most. (For a customer support agent: hallucinated policy claims, missed escalations.)
- One experiment per significant prompt change. Compare against the previous prompt version on the same dataset.
- One CI gate. A PR that drops the eval score below threshold cannot merge.
- Online sampling at 1% of production. Daily dashboard alert if the score drops.
That setup is the difference between confident deploys and the Klarna 2024 reversal (which had no online quality monitoring on edge cases).
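A sketch of the online sampling piece, with the judge and the dashboard write stubbed out; in practice `judge_reply` would call the same grader you run offline.

```python
# Sketch of online sampling: roughly 1% of live replies go through the same
# grader used offline. judge_reply and the dashboard write are stand-ins.
import random

SAMPLE_RATE = 0.01

def judge_reply(reply: str, kb_excerpts: str) -> dict:
    """Stand-in for the same LLM-as-judge grader used in offline experiments."""
    return {"passed": True, "comment": "stub"}

def on_reply_sent(reply: str, kb_excerpts: str) -> None:
    """Call from the production path after a customer-facing reply goes out."""
    if random.random() < SAMPLE_RATE:
        result = judge_reply(reply, kb_excerpts)
        print(result)  # stand-in for writing the score to your eval dashboard / alerting
```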
Where to set up evals
In Respan, datasets live on the Datasets page and evaluators live on the Evaluators page. You can also push results into the same trace tree from section 4 so a span shows its eval scores inline.
The full flow: create the prompt → create a dataset → create an evaluator with one or more graders → run an experiment → iterate.
What you have at the end of section 5
- A small golden dataset that represents real traffic.
- Two or more evaluators that score on the failure modes you fear.
- A score per change. You know before shipping whether quality went up or down.
- A CI gate that blocks regressions.
- An online sample that catches what offline evals missed.
Next: agents and tool use
The next section, Agents and tool use, covers when one LLM call is not enough, when a workflow is enough, and when you actually need a true agent.
Or back to the Chapter 1 hub.
