Search "AI agent evaluation" and the top results all say the same thing. They cover GAIA, AgentBench, SWE-bench, TauBench, and BFCL (the Berkeley Function Calling Leaderboard). They explain what each benchmark tests. They tell you how the top models rank.
None of them tell you how to evaluate your agent on your data with your users.
That's the eval that matters. Benchmarks are useful for picking a base model. They tell you almost nothing about whether your customer support agent, your code review bot, or your sales-prospecting workflow actually does its job on production traffic.
This is the playbook for the eval that actually matters: production AI agent evaluation.
Why benchmarks are the wrong metric for your agent
The benchmark canon (GAIA, AgentBench, SWE-bench Verified, TauBench, BFCL) was designed by researchers for researchers. These benchmarks standardize tasks so model labs can publish comparable scores. That's a real and useful function. It is not the same function as "tell me whether my production agent is working."
Three reasons benchmarks fail at the production-eval job:
The distribution is wrong. Your users don't ask SWE-bench questions. They ask the messy, off-distribution questions no benchmark anticipated. A model can score 95% on SWE-bench Verified and still fall over on your queries, because your queries aren't SWE-bench.
The success criteria are wrong. SWE-bench scores binary pass/fail on patch acceptance. Your customers care about response time, tone, whether your agent admits uncertainty, and whether it called the right tool. None of those are in the benchmark.
The cost dimension is missing. Benchmarks score capability, not capability-per-dollar. Production agents that complete the task in 47 LLM calls when 5 would do are failing — just not in a way benchmarks measure.
Gartner's May 12 prediction named the gap directly. Padraig Byrne, Gartner VP Analyst: "Unlike traditional software, AI's decision making is often hidden, making it hard to explain or trust, yet errors can cause substantial financial loss, reputational damage and regulatory scrutiny."
That sentence is true about your specific agent. It is not solved by another leaderboard.
What production AI agent evaluation actually means
A production agent evaluator is a function:
score = evaluator(trace)
Where trace is the full record of an agent's run — every LLM call, tool invocation, intermediate state, retry, and final output. And score is one of a small set of numbers measuring something that matters to your users.
The evaluator runs continuously, on every production trace, in parallel to the agent. The scores get attached to the trace so when you investigate a failure, the scores travel with it.
This is fundamentally different from the offline-benchmark loop:
| Offline benchmark | Production agent eval |
|---|---|
| Runs on a frozen test set | Runs on live production traffic |
| Pass/fail or single capability score | 3-5 orthogonal criteria |
| Run once per model upgrade | Runs on every request |
| Tells you what a model can do | Tells you what your agent did |
| Useful for model selection | Useful for shipping reliably |
You need both. Most teams have neither. The ones that fail in production have only the first.
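To make the score = evaluator(trace) shape concrete, here is a minimal sketch in plain Python. The `Span` and `Trace` classes, their field names, and `run_evaluators` are assumptions made for illustration, not the schema of any particular tracing tool.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Span:
    """One step of an agent run (hypothetical schema, for illustration)."""
    kind: str   # e.g. "llm_call", "tool_call", "retry"
    name: str
    output: str = ""


@dataclass
class Trace:
    """Full record of one agent run; scores are stored alongside the spans."""
    trace_id: str
    spans: list[Span]
    final_output: str
    scores: dict[str, float] = field(default_factory=dict)


# An evaluator is any function that maps a trace to a single number.
Evaluator = Callable[[Trace], float]


def llm_call_count(trace: Trace) -> float:
    """Example evaluator: how many LLM calls the run needed."""
    return float(sum(1 for span in trace.spans if span.kind == "llm_call"))


def run_evaluators(trace: Trace, evaluators: dict[str, Evaluator]) -> Trace:
    """Score the trace and attach each score to the trace itself."""
    for name, evaluate in evaluators.items():
        trace.scores[name] = evaluate(trace)
    return trace


trace = Trace(
    trace_id="t-1",
    spans=[
        Span("llm_call", "plan"),
        Span("tool_call", "search"),
        Span("llm_call", "answer"),
    ],
    final_output="done",
)
run_evaluators(trace, {"llm_calls": llm_call_count})
print(trace.scores)
```

Because the scores live on the trace object, whoever opens the trace during an investigation sees them without a second lookup.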
Five criteria for production agent evaluation
The 3-5 evaluators that matter are specific to your product and your users. The most common pattern looks like this:
1. Task completion rate
The simplest and most underused. Did the agent finish the task the user asked for? Not "did it return a response" — did it actually complete the task end to end.
How to score: define what "done" looks like as a downstream signal (the support ticket was closed, the code review was approved, the meeting was scheduled). Backfill that signal to the trace. Report the per-feature completion rate.
What it catches: agents that look like they're succeeding because they return responses, but actually punt on hard cases, ask the user to "try again," or escalate to humans.
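As a rough sketch of the scoring step, assume each trace has already been joined with its downstream "done" signal. The record shape here (`feature` and `completed` keys) is hypothetical.

```python
from collections import defaultdict


def completion_rate_by_feature(records):
    """Return {feature: share of traces whose downstream 'done' signal fired}.

    Each record is a dict with:
      "feature"   - which product feature the trace belongs to
      "completed" - True if the backfilled downstream signal says the task finished
    """
    totals = defaultdict(int)
    done = defaultdict(int)
    for record in records:
        totals[record["feature"]] += 1
        if record["completed"]:
            done[record["feature"]] += 1
    return {feature: done[feature] / totals[feature] for feature in totals}


records = [
    {"feature": "support_agent", "completed": True},   # ticket was closed
    {"feature": "support_agent", "completed": False},  # escalated to a human
    {"feature": "code_review", "completed": True},     # review was approved
]
rates = completion_rate_by_feature(records)
print(rates)
```

A trace that returned a response but was escalated counts as not completed, which is the case this metric exists to expose.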
2. Tool-use efficiency
How many LLM calls did the agent take to complete the task? How many tool invocations? Compare to the minimum that should have been needed.
How to score: classify each task by complexity tier (simple/medium/complex). Set an expected LLM-call budget per tier. Compute actual-vs-budget ratio. Anything above 2.5× the budget is investigated.
What it catches: tool-call loops, repeated re-searches, agents that re-read context they already have, model regressions that change reasoning chain length.
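The budget check can be sketched in a few lines. The per-tier budgets below are made-up numbers; only the 2.5× investigation threshold comes from the text.

```python
# Hypothetical expected LLM-call budgets per complexity tier.
BUDGETS = {"simple": 3, "medium": 8, "complex": 20}

# Ratio above which a trace gets investigated (from the rule above).
INVESTIGATE_ABOVE = 2.5


def budget_ratio(tier: str, llm_calls: int) -> float:
    """Actual LLM calls divided by the expected budget for the task's tier."""
    return llm_calls / BUDGETS[tier]


def needs_investigation(tier: str, llm_calls: int) -> bool:
    """True when the trace used more than 2.5x its budget."""
    return budget_ratio(tier, llm_calls) > INVESTIGATE_ABOVE


print(needs_investigation("simple", 8))   # 8 / 3 is about 2.67, over the threshold
print(needs_investigation("medium", 20))  # 20 / 8 is exactly 2.5, not over it
```

Tracking the ratio rather than the raw call count keeps simple and complex tasks comparable on one dashboard.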
3. Recursion depth
For agents that dispatch sub-agents. How deep did the call stack go? At each level, was the budget exhausted before the leaf agent got to work?
How to score: track maximum depth per trace. Alert when depth exceeds the architectural max (most agents shouldn't go beyond 3-4 levels of sub-agent dispatch). Track per-level token consumption.
What it catches: cascading sub-agent dispatches, where each level of the agent tree consumes more budget than its parent intended.
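One way to compute depth and per-level token consumption, assuming each span records its parent. The span fields (`id`, `parent_id`, `tokens`) and the `MAX_DEPTH` value are assumptions for this sketch.

```python
from collections import defaultdict

MAX_DEPTH = 4  # assumed architectural limit; set this to your own design's maximum


def depth_profile(spans):
    """Return (max depth, {depth: total tokens}) for one trace.

    Each span is a dict with "id", "parent_id" (None for the root agent),
    and "tokens". The root sits at depth 1.
    """
    by_id = {span["id"]: span for span in spans}

    def depth_of(span):
        depth = 1
        while span["parent_id"] is not None:
            span = by_id[span["parent_id"]]
            depth += 1
        return depth

    tokens_per_level = defaultdict(int)
    for span in spans:
        tokens_per_level[depth_of(span)] += span["tokens"]
    return max(tokens_per_level, default=0), dict(tokens_per_level)


spans = [
    {"id": "root", "parent_id": None, "tokens": 100},
    {"id": "a", "parent_id": "root", "tokens": 200},
    {"id": "b", "parent_id": "root", "tokens": 50},
    {"id": "a1", "parent_id": "a", "tokens": 400},
]
max_depth, tokens_per_level = depth_profile(spans)
alert = max_depth > MAX_DEPTH
print(max_depth, tokens_per_level, alert)
```

In this example the deepest level consumes more tokens than its parents combined, which is the cascading pattern to watch for even when the depth alert stays quiet.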
4. Retry-to-success ratio
Number of failed attempts before a successful response. If your agent retries 4 times before succeeding, you have a brewing problem even though the user got the right answer.
How to score: count retries per trace, group by trace outcome, and report a rolling 24-hour median.
What it catches: flaky tools, model drift (when a model upgrade silently increases retries), context-length issues that surface as partial failures, prompt regressions after a change.
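A minimal sketch of the rolling median, assuming each trace record carries a finish time, an outcome label, and a retry count (all hypothetical field names):

```python
from datetime import datetime, timedelta
from statistics import median


def rolling_median_retries(traces, now, window=timedelta(hours=24)):
    """Median retry count per outcome over the window ending at `now`."""
    grouped = {}
    for trace in traces:
        if now - window <= trace["finished_at"] <= now:
            grouped.setdefault(trace["outcome"], []).append(trace["retries"])
    return {outcome: median(values) for outcome, values in grouped.items()}


now = datetime(2025, 1, 2, 12, 0)
traces = [
    {"finished_at": datetime(2025, 1, 2, 11, 0), "outcome": "success", "retries": 0},
    {"finished_at": datetime(2025, 1, 2, 9, 0), "outcome": "success", "retries": 1},
    {"finished_at": datetime(2025, 1, 1, 20, 0), "outcome": "success", "retries": 4},
    {"finished_at": datetime(2025, 1, 2, 10, 0), "outcome": "failure", "retries": 3},
    # Older than 24 hours, so it is excluded from the window.
    {"finished_at": datetime(2024, 12, 30, 8, 0), "outcome": "success", "retries": 9},
]
medians = rolling_median_retries(traces, now)
print(medians)
```

A median resists a single pathological trace; pairing it with a high percentile would surface the long tail if that matters for your agent.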
5. Faithfulness (for retrieval-grounded agents)
If your agent retrieves documents and reasons over them, the most important metric is whether the response actually reflects what was retrieved.
How to score: LLM-as-judge with a rubric prompt that compares the response's claims to the retrieved context. Validate weekly against human reviewers.
What it catches: hallucination, agents inventing context that wasn't in the docs, retrieved-but-ignored documents.
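The judge wiring might look roughly like the sketch below. The rubric wording, the 1-5 scale for this criterion, and the function names are assumptions. The judge is passed in as a plain callable so the example runs without any model API, and `stub_judge` stands in for a real model call.

```python
import json
from typing import Callable

# Hypothetical rubric; in practice, anchor each score with a real example.
RUBRIC = (
    "Score from 1-5 how faithfully the response reflects the retrieved context.\n"
    "5 = every claim is supported by the context\n"
    "1 = the main claims are not supported by the context\n"
    'Return: {"reason": "<one sentence>", "score": <int>}\n'
)


def build_prompt(response: str, context_docs: list[str]) -> str:
    """Combine the rubric, the retrieved documents, and the agent's response."""
    context = "\n---\n".join(context_docs)
    return f"{RUBRIC}\nRetrieved context:\n{context}\n\nResponse:\n{response}\n"


def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and reject out-of-range scores."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return {"reason": str(data["reason"]), "score": score}


def score_faithfulness(response: str, context_docs: list[str],
                       judge: Callable[[str], str]) -> dict:
    """Ask the judge for a verdict on one response."""
    return parse_verdict(judge(build_prompt(response, context_docs)))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same verdict."""
    return '{"reason": "The 2-day claim contradicts the retrieved policy.", "score": 2}'


verdict = score_faithfulness(
    "Refunds arrive in 2 days.",
    ["Refund policy: refunds take 5 business days."],
    stub_judge,
)
print(verdict)
```

Rejecting malformed or out-of-range verdicts at parse time keeps a misbehaving judge from quietly polluting the score history.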
These five aren't a universal set. They're the most common subset across the production AI deployments I see. Yours might add: tone, format compliance, refusal-when-appropriate, escalation correctness. The principle is "3-5 orthogonal criteria, each tied to a real failure pattern."
How to design a production agent evaluator
Six rules that hold up across the eval rubrics that ship and stay shipped:
Start with user complaints, not abstract criteria. Read the last 30 days of support tickets, Discord screenshots, Twitter mentions. Each common complaint becomes a criterion. "The agent gave wrong dates" → factual accuracy eval. "It sounded robotic" → tone eval.
One criterion per evaluator. Don't mix tone and faithfulness in the same LLM-as-judge prompt — the judge can't disentangle them. One scope, one rubric, one score.
Use ordinal, not binary, except when binary is genuine. A 1-5 scale gives you a smooth dial. Pass/fail throws away signal.
Anchor each score with an example. "5 = like this real example. 1 = like this real example." This single change drops judge variance dramatically.
Force the judge to reason before scoring. Output schema: {reason: string, score: int}. Reason first, score second. The chain of thought matters.
Validate the judge weekly. Randomly sample 50 production traces, score them with humans, and compare to the judge's scores. Track agreement. If it drops below 85%, the judge needs work.
These rules apply equally to single-LLM evals and agent evals. What's specific to agents is the level you evaluate at: not just the final answer, but tool-use efficiency, recursion behavior, retry patterns, span-level checks.
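The weekly judge validation can be sketched as a sampling step plus an agreement calculation. Treating agreement as an exact score match is an assumption here (the `tolerance` parameter relaxes it), as are the function names.

```python
import random

AGREEMENT_FLOOR = 0.85  # the 85% threshold from the rule above


def sample_traces(trace_ids, k=50, seed=None):
    """Pick up to k trace IDs at random for human review."""
    rng = random.Random(seed)
    return rng.sample(trace_ids, min(k, len(trace_ids)))


def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Share of traces where judge and human scores differ by at most `tolerance`.

    Both arguments map trace ID to score; only traces scored by both count.
    """
    shared = judge_scores.keys() & human_scores.keys()
    if not shared:
        raise ValueError("no traces were scored by both judge and humans")
    agreeing = sum(
        1 for trace_id in shared
        if abs(judge_scores[trace_id] - human_scores[trace_id]) <= tolerance
    )
    return agreeing / len(shared)


judge = {"t1": 5, "t2": 3, "t3": 4, "t4": 2}
human = {"t1": 5, "t2": 3, "t3": 4, "t4": 4}
rate = agreement_rate(judge, human)
judge_needs_work = rate < AGREEMENT_FLOOR
print(rate, judge_needs_work)
```

On a 1-5 scale, exact match is a strict definition; if reviewers themselves often disagree by one point, `tolerance=1` may be the fairer baseline.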
Implementing production agent evaluation
A complete implementation requires three things to live in the same system:
Traces capture what happened. Per-span: model, prompt, tool call, output, timing, cost.
Evaluators score what happened. Rule-based for the cheap checks (JSON validity, length bounds, schema compliance), LLM-as-judge for the fuzzy ones (faithfulness, tone, task completion).
Datasets curate what happened well or badly. Production traces with low eval scores get pulled into evaluation datasets. Offline runs against those datasets validate prompt/model changes before they hit production.
With Respan's evals attached to traces, this looks like:
from respan import Respan

respan = Respan(api_key="...")

# Define a task-completion evaluator
task_completion = respan.eval.create(
    name="task_completion",
    type="llm_as_judge",
    judge_model="claude-3-7-sonnet",
    prompt="""
Score whether the agent completed the user's stated task from 1-5:
5 = task fully completed end to end
4 = task completed but with one minor gap
3 = partial completion, user needs to take action
2 = task attempted but failed, user is blocked
1 = no progress made on stated task
User request: {input}
Agent final output: {output}
Trace summary: {trace_summary}
Return: {"reason": "<one sentence>", "score": <int>}
""",
)

# Attach to live traffic — runs on every request matching the filter
task_completion.attach_online(filter={"feature.id": "support_agent"})

# Curate low-scoring traces into an evaluation dataset
respan.dataset.curate_from_traces(
    eval_name="task_completion",
    score_below=3,
    dataset_id="support_agent_failures_v3",
)

The dataset that builds up over weeks becomes the most valuable artifact your team has: real failure cases curated from production, ready to test against on every prompt change.
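The Respan example covers an LLM-as-judge evaluator. The rule-based checks mentioned earlier (JSON validity, length bounds, schema compliance) need no judge at all. A minimal standard-library sketch follows; the function name, the default length limit, and the required keys are assumptions.

```python
import json


def check_output(raw: str, required_keys, max_chars: int = 2000) -> dict:
    """Run cheap rule-based checks on an agent's raw output string."""
    results = {
        "valid_json": False,
        "within_length": len(raw) <= max_chars,
        "has_required_keys": False,
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return results
    results["valid_json"] = True
    # Schema compliance here means "is an object containing the required keys".
    results["has_required_keys"] = (
        isinstance(data, dict) and set(required_keys).issubset(data)
    )
    return results


good = check_output('{"reason": "done", "score": 4}', {"reason", "score"})
bad = check_output("Sorry, please try again.", {"reason", "score"})
print(good)
print(bad)
```

Checks like these cost microseconds, so they can run on every trace before any judge model is called.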
Common production-eval mistakes
Treating a single "quality" score as enough. Quality is 3-5 orthogonal criteria, and they move independently. The team that ships "quality dropped 7%" can't act on it. The team that ships "task completion dropped 7% while tone stayed flat" has a thing to fix.
Wiring LLM-as-judge without anchoring to humans. A judge with no human validation will drift away from your users' actual experience.
Test sets built only from synthetic data. Synthetic test sets miss the ugly edge cases real users produce. Curate from production traces.
Eval scores not attached to traces. A score without the trace context is unactionable. The platform should link them automatically.
Skipping rule-based evals because "LLM-as-judge can do it." Rule-based catches the cheapest failures cheapest. Always run them.
The 2028 question
Gartner predicts 40% of organizations deploying AI will have dedicated AI observability — which includes production evaluation — by 2028. The companies that have it earlier will compound the advantage. Every production trace becomes a labeled dataset. Every customer complaint becomes an evaluator. Every model upgrade is verified against historical performance before it ships.
The other 60% will discover their agent is broken when their users tell them.
How Respan fits
Respan ships tracing, evals, gateway, and prompt management as one platform. Eval scores are attached to traces. Datasets are curated from production. Online and offline runs share schema. Free to try, no credit card.
FAQ
What's the difference between AI agent evaluation and LLM evaluation? LLM evaluation scores individual model outputs. AI agent evaluation scores agent runs — multi-step sequences with tool calls, retries, and sub-agent dispatches. Agent-specific criteria like tool-use efficiency and recursion depth don't apply to single-call evaluation.
How does AI agent evaluation differ from benchmarks like GAIA or AgentBench? Benchmarks score capability on a fixed test set, useful for choosing a model. Agent evaluation scores behavior on your live traffic, useful for shipping reliably. Both have their place; production teams need both.
Can I use LLM-as-a-judge for agent evaluation? Yes, for fuzzy criteria like task completion and faithfulness. Anchor weekly with human review. Use rule-based for binary checks (tool was called, format is valid).
How many evaluators should I have? 3-5 for most products. One catch-all "quality" score hides everything. Ten criteria are more than anyone can keep in their head.
How fast can I implement production agent evaluation? A first version with 2-3 evaluators on one agent: one week. Full coverage of all production agents with anchored judges and curated datasets: a quarter.
Related guides: AI Evals · AI Tracing · Agent Observability · AI Observability



