Why we rebuilt the agent

Our v1 agent was a router-of-specialists: a top-level dispatcher that handed each user message to one of several domain-specialist agents (logs, evaluators, automations, etc.). It worked, but it had three structural problems.

1. Context collapses at every handoff. Multi-agent systems sound powerful, but in practice they break context in subtle ways. When sub-agents are wrapped as tools, the router only sees the summary the specialist returns — not the full conversation, the reasoning, or the intermediate decisions. Each handoff is a lossy compression step: subtle constraints, partial intent, and the path to a conclusion all flatten into a one-paragraph result, and the router decides what to do next based on that compressed summary, causing quality to degrade with every hop. Users feel this directly: a "create an evaluator" request turns into a stream of tool calls followed by "I've created it," with no visible reasoning in between. That missing inner loop — how the system interpreted the request and why it took each step — is exactly what users need to trust the result, and it disappears at every handoff.

2. Multi-agent was over-engineering for today's frontier models. The reflex to split work across specialists made sense when models were weaker — narrow each agent's surface, keep it on-rails, make sure each piece does its small job correctly. But for product AI on a current-gen model, the agent is already capable enough to handle the whole task end-to-end. What it actually needs is full context, not isolation; structuring the work around the model's limits is now structuring around a limit that's mostly gone. (We saw this directly during the rewrite — there's a section on the surprising emergent behavior below.)

3. Five prompts, five blast radii. The router carried its own ~10K-char prompt, and each of the four specialists carried a ~50K-char domain prompt. Adding a new tool meant editing the router (so it knew which specialist to route to), the specialist (to teach the new tool), and the routing rules. A typo in any one of those five surfaces shipped a regression.

v2 collapses all three into one architectural change: a single agent with a single context, on the same model and settings as v1. No router. No specialists. No handoff.

How we measured "is v2 actually better?"

We built a regression net using our own product. Every run of the test slate dogfoods the exact same flow a customer uses: dataset → evaluator pipeline → experiment → score.

Three evaluators, all judging on a 1–100 scale as the default judge:

Helpfulness & Completeness — does the final assistant message actually answer the user's question? Did the agent take the action when asked, or stop at "Want me to proceed?" Caps at 40 if the agent stalled at confirmation when the user clearly asked for an action.
Tool-Use Efficiency — count function_call items, classify question complexity (pure-knowledge / simple / medium / complex), penalize redundancy (duplicate creates, scratchpad spam, repeat list calls without filter changes). Rewards direct paths.
Hallucination & Grounding — extract every concrete claim from the final message (numbers, IDs, named resources), then check whether each one is traceable to an earlier function_call_output. Heavy penalty for invented UUIDs or numbers that don't appear in any tool result.

A note on the 1–100 scale: it's strictness-anchored, not a quality percentage. The rubric only awards 80+ when every must-have from the ground truth is hit, and caps at 40 when the agent stops at "Want me to proceed?" instead of taking action, even when the request is already clear. Most decent agent runs land in 40–70 — that's the band where the answer is useful but missing one specific.

The internal regression net

The first layer of the slate is a fixed set of internal probes — same questions every time we change the agent prompt. They're written to cover the core tools and functionalities the agent has to get right (taxonomy of evaluators / graders / pipelines, ID asymmetries, multi-step create/update flows, automation and monitor composition) — the surfaces customers depend on most. Locking those into a regression net is how we keep the agent sustainable over many prompt iterations: a future prompt edit can't silently break the things the platform is built around, because the slate would catch it before we ship.

The probes span four categories, each targeting a different class of bug:

Taxonomy

Does the agent keep evaluator vs grader vs pipeline straight? Does it know that passing workflow_id (family ID) to experiment_create returns 404, and id (version PK) is the right one?

Example:

"I tried evaluator_workflow_ids: [<two valid-looking UUIDs>] and got WorkflowVersions not found. What did I do wrong?"

The expected answer: identify that the user passed family IDs instead of version PKs, verify by listing pipelines, and return the correct version IDs.

Actions

Can the agent execute multi-step create/update flows end-to-end?

Example:

"Create a 'Conciseness 1-100' evaluator … make sure it shows up on the Evaluators page when you're done."

The expected flow is evaluator_create → evaluator_run → evaluator_commit → evaluation_pipeline_create — skipping the last step is the most common failure mode (you create a grader but never wrap it, so nothing shows on the UI).

Cross-domain

Does the same ID-discipline transfer across domains?

Example:

"Set up an automation that runs my Hallucination evaluator on every failed request log and Slack-alerts me if the score drops below 70."

The pipeline ID lives inside an automation task config, the notification_method_id is a foreign key, and the agent has to find an existing Slack target instead of inventing one.

Automation / Monitor

Multi-task workflow chains, dotpath fluency, honest-failure probes.

Example:

"Create an automation that runs my evaluator every Monday at 9am ET."

The honest answer is "Respan workflows trigger on events, not on a cron schedule — the closest options are X or Y." A grounding-violating agent silently invents a "scheduled" trigger type.

The data: v2 wins on every dimension

The Respan experiment-comparison view for the internal slate, scored by the default judge (claude-haiku-4-5):

Dimension	v1	v2	Δ
Tool-Use Efficiency	43.3	55.8	+12.5 (+28.8%)
Helpfulness & Completeness	50.1	58.1	+8.0 (+16.0%)
Hallucination & Grounding	55.6	61.0	+5.3 (+9.6%)
Latency (per turn)	19.8s	38.8s	+18.9s

Experiment comparison on the internal slate, scored by claude-haiku-4-5. v2 wins every dimension

v2 is slower per turn because it actually completes the action end-to-end; v1's "fast" responses were mostly stops at "Want me to proceed?".

Real customer questions (out-of-distribution check)

The internal regression net guarantees we don't regress on known failure modes — but it doesn't tell us whether the improvements generalize to questions we haven't pre-anticipated. That's what the second layer is for: a smaller set of real customer questions paraphrased from our support inbox, deliberately kept out of the regression net so they stay novel.

Examples:

"What's my current monthly request volume and how is it trending?"
"I want to set up an automation: every time a request fails, I want it added to my Failed Requests dataset."
"My error rate jumped this morning — find me the failing logs and tell me what's going wrong."

The data: improvements transfer

Real customer questions paraphrased from the support inbox, deliberately kept out of the regression net. The same pattern holds

On these out-of-distribution questions, v2 follows the same pattern as the internal slate: it actually finishes the multi-step actions (creates the automation, runs the experiment, returns concrete numbers and IDs traceable to tool output). The Helpfulness & Completeness gain on this set is the most visible — because the consequence of stopping at confirmation on a real customer question is "the customer didn't get what they asked for." The internal slate's improvements transfer to questions the slate has never seen.

That's the case for shipping v2. The regression net is intact, the cross-judge ranking holds, and the gains generalize to real customer language.

What changed inside v2

One prompt instead of five. The router-and-specialist split was a coordination tax we no longer pay. The whole agent is one prompt — same model, same settings, way less total prompt mass.
SDK-native two-tier tool loading. v2 loads 21 core tools at startup (the most-used surface: log_*, dataset_*, experiment_*, etc.) and defers ~70 long-tail tools behind tool_search. The agent fetches the schema only when it needs it. Same full coverage as v1, much smaller resting context.
Tool-namespace grouping for argument accuracy. Tools are grouped by domain (evaluators, workflows, datasets, prompts) and the agent picks the namespace before the tool — which gave us measurable gains on argument shape correctness, especially on the ID-asymmetry traps (passing id vs workflow_id to the right field).
Task tools for self-organization. task_create / task_update give the agent a place to plan multi-step work visibly instead of carrying everything in reasoning tokens. (More on the user-facing version of this below.)
Hard rules on ID asymmetry. "Version PK is what experiment_create wants, not the family ID" lives in the prompt now, with a pointer to the diagnose-not-retry pattern when an error mentions WorkflowVersions not found.
List-no-match means ask, not invent. When notification_method_list returns nothing matching the user's described Slack channel, v2 asks the user instead of silently creating one.
Async-aware. Long-running operations (experiments, dataset imports) get launched and reported as "running async" instead of polled in the same turn — avoiding a class of "the agent says it's done but it's actually still running" bugs.

The unplanned magic: the loop reads the docs on its own

Back when v2's prompt was still a rough draft, we noticed the agent doing something we hadn't told it to. When it hit an unfamiliar request or an unexpected tool error, it would call docs_search on its own, multiple times — verifying the current schema, double-checking which ID type a field expected, working out why a tool had failed. Nothing in the prompt said "look up the docs when unsure." The loop just had the room.

v1 couldn't do this. A specialist that hit an unfamiliar error returned a string to the router, which had to decide on that summary without re-entering the loop. v2's loop takes the error, reads the docs, retries with the right shape, and only then turns back to the user — all in one pass, no state lost.

The lesson: a clean loop with full context isn't just a tidier engineering surface. It changes what the model is capable of, because every recovery path stays open.

What's next

Making the agent's process visible

v2's loop does more interesting work than the chat UI shows — users only see the wall of text at the end. Three improvements coming, all in the same direction:

A live plan checklist. v2 already uses task_create / task_update internally to structure multi-step work; those calls just aren't surfaced yet. The next iteration promotes them to first-class deferred tools that render as a live checklist in the conversation — the agent commits to a plan up front and ticks off steps as it executes them, so the user sees what's happening in real time.
Inline tool outputs. A tool call currently shows up as a one-line "called X" pill. We're going to surface a structured, expandable view of each tool's actual return value — the dataset row that came back, the count of failed logs, the exact UUID the agent looked up — so the user sees the same evidence the agent is reasoning over instead of trusting a final summary.
Visible reasoning blocks. v2 emits reasoning items between tool calls (the agent's internal thinking about what to do next). We collapse those by default today; the next version makes them a togglable detail under each step, so a user who wants to audit the logic — "why did you pick the duplicate workflow?" — can read the chain instead of guessing.

Going deeper on log-investigation questions

The other gap is depth on investigative questions. When a user asks "why did my error rate spike this morning?" or "which requests are slow today?", v2 today returns the right counts and the right time bucket — and stops. That's a triage starting point, not an answer.

The next iteration teaches the agent to drill when the question is investigative: open the failing logs, read the actual error messages, walk the trace tree, group by error class, and surface concrete root causes — "three of the five failures were rate-limit errors from Anthropic on claude-sonnet-4-6; the other two were malformed prompts missing the messages field" — instead of stopping at "5 logs failed." High-level analytics are how a user decides whether to investigate; once they have, they want the investigation and the fix. The agent should know the difference.

Why we rebuilt the agent

v2 collapses all three into one architectural change: a single agent with a single context, on the same model and settings as v1. No router. No specialists. No handoff.

How we measured "is v2 actually better?"

We built a regression net using our own product. Every run of the test slate dogfoods the exact same flow a customer uses: dataset → evaluator pipeline → experiment → score.

Three evaluators, all judging on a 1–100 scale as the default judge:

Helpfulness & Completeness — does the final assistant message actually answer the user's question? Did the agent take the action when asked, or stop at "Want me to proceed?" Caps at 40 if the agent stalled at confirmation when the user clearly asked for an action.
Tool-Use Efficiency — count function_call items, classify question complexity (pure-knowledge / simple / medium / complex), penalize redundancy (duplicate creates, scratchpad spam, repeat list calls without filter changes). Rewards direct paths.
Hallucination & Grounding — extract every concrete claim from the final message (numbers, IDs, named resources), then check whether each one is traceable to an earlier function_call_output. Heavy penalty for invented UUIDs or numbers that don't appear in any tool result.