TL;DR
LLM evals are how you turn "this answer feels right" into a number. Three types matter — rule-based (fast, deterministic), LLM-as-judge (cheap, scales to every request), and human review (slow, ground truth). The teams that ship reliable AI run all three: rule-based on every output, LLM-as-judge on every output, sampled human review on the long tail. The teams that ship one number called "quality" never figure out what's actually wrong.
What are LLM evals?
An LLM eval is a function that takes a model output (and usually the input that produced it) and returns a score against a criterion. The criterion can be objective ("does this output match the JSON schema?"), semi-objective ("is this faithful to the retrieved documents?"), or subjective ("is the tone empathetic?"). The score can be binary, ordinal, or continuous.
Together, evals turn LLM quality into something you can chart over time, alert on, and ship against. Without them, every prompt change is a guess and every model swap is a leap of faith. With them, "quality regressed 7% on faithfulness over the last 24 hours" is a signal you can act on before the support tickets arrive.

A trace in Respan with three evaluators attached. Online evals run on every production request.
The three types of evals
1. Rule-based
Deterministic checks: schema validation, regex match, length bounds, profanity filter, banned-phrase detection, JSON validity, structured-output compliance. They're fast (microseconds), cheap (free), and reliable (no judge variance).
Use them for anything that has a binary correct/wrong answer. Most teams underuse rule-based evals because LLM-as-judge feels more sophisticated, but rule-based checks catch the simplest failures at the lowest cost.
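The checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the banned-phrase list and the 2,000-character bound are placeholder assumptions you'd replace with your own.

```python
import json
import re

# Placeholder banned-phrase pattern; swap in your own list.
BANNED = re.compile(r"\b(guarantee|refund immediately)\b", re.IGNORECASE)

def rule_based_checks(output: str, max_chars: int = 2000) -> dict:
    """Deterministic pass/fail checks: JSON validity, length, banned phrases."""
    is_json = True
    try:
        json.loads(output)
    except ValueError:
        is_json = False
    return {
        "valid_json": is_json,
        "within_length": len(output) <= max_chars,
        "no_banned_phrases": BANNED.search(output) is None,
    }

checks = rule_based_checks('{"answer": "Your order ships Monday."}')
# Every key maps to True for this output.
```

Each check runs in microseconds and never disagrees with itself, which is exactly why they belong on every output.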
2. LLM-as-a-judge
A separate LLM scores outputs against a rubric you write. Example rubric for a customer support reply:
Score the reply on faithfulness from 1-5:
5 = every claim is directly supported by the retrieved docs
4 = mostly supported, one minor inferential leap
3 = mostly supported, but one unsupported claim
2 = several claims have no support
1 = the reply contradicts or invents facts not in the docs
Reply: {output}
Retrieved docs: {context}
Return: {"reason": "<one sentence>", "score": <int>}
LLM-as-judge is fast (sub-second), cheap (a few cents per 1k requests), and scales to every production request. The catch: the judge has its own biases, blind spots, and failure modes. Use it for breadth, anchor with humans.
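Whatever schema you ask for, judges occasionally return malformed JSON, so parse defensively. A minimal sketch, assuming the reason-first schema above; `parse_judge_response` and the clamp-to-rubric behavior are illustrative choices, not a prescribed API:

```python
import json

def parse_judge_response(raw: str) -> tuple[int, str]:
    """Parse the judge's JSON reply; clamp the score into the 1-5 rubric range."""
    try:
        data = json.loads(raw)
        score = int(data["score"])
        reason = str(data.get("reason", ""))
    except (ValueError, KeyError, TypeError):
        # Malformed judge output: flag it rather than guessing a score.
        return 0, "unparseable judge response"
    return max(1, min(5, score)), reason

score, reason = parse_judge_response('{"reason": "one minor leap", "score": 4}')
# score == 4
```

Returning a sentinel score of 0 for unparseable replies makes judge failures visible in your dashboards instead of silently polluting the average.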
3. Human review
Sampled or full review by domain experts. Slow (minutes per item) and expensive (real people), but the ground truth for high-stakes domains. Three patterns work:
- Random sample: review N items per day. Catches drift you'd miss otherwise.
- Disagreement-triggered: when LLM-as-judge and rule-based scores disagree, route to humans. Highest signal-to-cost.
- Edge case curation: humans label the long tail you sample from production traces. The labeled examples become your gold-set.
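The disagreement-triggered pattern reduces to one comparison. A sketch, assuming a pass threshold of 4 on the 1-5 judge scale (the threshold is an assumption, not a standard):

```python
def needs_human_review(rule_pass: bool, judge_score: int,
                       pass_threshold: int = 4) -> bool:
    """Route to humans when rule-based and judge signals disagree.

    Disagreement means rules pass but the judge scores low, or rules
    fail but the judge scores high. pass_threshold=4 is an assumption.
    """
    judge_pass = judge_score >= pass_threshold
    return rule_pass != judge_pass

# Rules passed but the judge flagged the output: send it to a human.
assert needs_human_review(rule_pass=True, judge_score=2) is True
# Both signals agree the output is good: skip human review.
assert needs_human_review(rule_pass=True, judge_score=5) is False
```

When both evaluators agree, a human look rarely changes the verdict; the disagreements are where the reviewer minutes buy the most signal.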

Define evals before you ship
The fastest way to accumulate eval debt is to define quality after you have a problem. Every team I've watched do this gets the same outcome: a prompt change ships, customer complaints arrive a week later, and the team scrambles to bolt on an evaluator that catches the specific class of failure that just happened. By then the bad data is in the model's training loop, the customer's mental model, and three angry support threads. Define your evaluators before the first prompt ships, not after the first complaint.
The other near-universal mistake: a single "quality" score. Quality is 3-5 orthogonal criteria, and they move independently. From the audits we've run against human reviewers, LLM-as-judge agrees with humans far more reliably on objective criteria like faithfulness than on subjective ones like tone — often by ten or more percentage points. Collapse those into one score and you lose the ability to act on any of them. Decompose or stay confused.
Online vs offline evals
Offline evals run during development against a frozen test set. Fast feedback loop — change a prompt, run the suite, see whether the change improved or regressed. The unit is a controlled experiment.
Online evals run continuously against live production traffic. Slower signal — the metrics aggregate over hours or days — but they catch the regressions that don't show up in your test set, because production traffic is messier than any test set you'd write.
The two loops feed each other. Production traces with low scores get curated back into the offline test set; offline-validated prompt changes ship to production. Offline is the inner-loop dev experience; online is the canary that catches what offline missed. Skip either side and you have a blind spot.
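The offline loop usually ends in a gate: block the prompt change if the suite's mean score regresses too far. A sketch under stated assumptions — the 0.2-point tolerance is a placeholder, and real gates often check per-criterion and per-percentile rather than a single mean:

```python
def regression_gate(baseline_scores: list[int], candidate_scores: list[int],
                    max_drop: float = 0.2) -> bool:
    """Allow a prompt change only if the mean eval score holds up.

    max_drop=0.2 (on a 1-5 scale) is an illustrative tolerance.
    """
    baseline_mean = sum(baseline_scores) / len(baseline_scores)
    candidate_mean = sum(candidate_scores) / len(candidate_scores)
    return candidate_mean >= baseline_mean - max_drop

# Candidate regressed from mean 4.0 to 3.5: the gate blocks the change.
assert regression_gate([4, 4, 4, 4], [4, 3, 4, 3]) is False
```

Wire this into CI so a failing gate blocks the merge, and the "regression test before shipping" rule below enforces itself.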
How to design an eval rubric
Six rules that hold up across the eval rubrics I've seen win:
- Start with user complaints, not abstract criteria. Read the last 30 days of customer feedback. Each common complaint becomes a criterion. "Replies sound robotic" → tone eval. "Wrong dates" → factual accuracy eval.
- One criterion per eval. Don't mix faithfulness and tone in the same prompt — the judge can't disentangle them. One scope, one rubric, one score.
- Use an ordinal scale, not binary, except when the criterion is genuinely binary. Faithfulness on a 1-5 scale gives you a smooth dial; pass/fail throws away signal.
- Anchor every score with an example. "5 = like this real example. 1 = like this real example." Drastically reduces judge variance.
- Force the judge to reason before scoring. Output schema: {reason: string, score: int}. Reason first, score second. The chain of thought matters.
- Validate the judge weekly. Random-sample 50 production traces, score them with humans, compare to judge scores. Track agreement; if it drops below 85%, your judge needs work.
How to instrument
from respan import Respan
respan = Respan(api_key="...")
# Define an eval
faithfulness_eval = respan.eval.create(
name="faithfulness",
type="llm_as_judge",
judge_model="gpt-4o",
prompt="""
Score the reply on faithfulness 1-5 against the retrieved docs.
Return: {"reason": "<one sentence>", "score": <int>}
""",
)
# Attach to live traffic — runs on every request matching the filter
faithfulness_eval.attach_online(filter={"feature.id": "support_agent"})
# Or run offline against a dataset
results = faithfulness_eval.run_offline(dataset_id="support_v3_test")
print(results.aggregate())  # {"mean": 4.2, "p25": 3, "p75": 5}
Common eval mistakes
- One "quality" score that conflates everything. Decompose into 3-5 criteria.
- LLM-as-judge with no human anchor. Sample-validate weekly or you're measuring judge bias.
- Test set built from synthetic data only. Synthetic test sets miss the ugly edge cases real users produce. Sample from production traces.
- No regression test before shipping prompt changes. The whole point of evals is to catch regressions before users do.
- Rule-based evals dismissed as "too simple." They catch the simplest failures at the lowest cost. Always run them.
- Eval scores not tied to traces. A score without the trace context is unactionable. The platform should link them automatically.
Eval platforms compared
Most LLM observability platforms include evals. The differentiators: judge flexibility, online vs offline support, dataset versioning, and how tightly evals integrate with traces and prompt management. See the full feature matrix on the observability pillar.
- Respan: rule-based + LLM-as-judge + human review, online + offline, datasets versioned, scores attached to every trace.
- Braintrust: eval-first product, deepest scoring functions library, slightly less integrated with tracing.
- Langfuse: solid eval support, open source, strong dataset features.
- LangSmith: LangChain-native, evaluator library tied to LCEL.
- Helicone: basic evals, third-party integrations for advanced.
Head of DevRel at Respan (YC W24). Working alongside the team running the infrastructure that handles 80M+ LLM requests a day.
Connect on LinkedIn →
Score every LLM output, online and offline
Rule-based, LLM-as-judge, and human review — all integrated with traces, prompts, and gateway in one platform.
Related guides: LLM observability · LLM tracing · LLM gateway