Evaluators

Set up graders and evaluator workflows to score your AI outputs.
What is an evaluator?

An evaluator is a workflow built from graders (the scoring units) connected by conditions. Create evaluators on the Evaluators page, then trigger them from experiments or online evals.

[Figure: Evaluator builder: create grader, edit grader, use blocks, create workflow, deploy]

The evaluator builder has five steps:

  1. Create grader: define your scoring metrics in the grader section. Each grader has a name, output data type, score range, and passing score.
  2. Edit grader: configure the scoring logic with an LLM definition, Python code, a human definition, or any combination of these. A single grader can hold all evaluation types at once.
  3. Use blocks: drag blocks from the palette onto the canvas. Blocks include markers, conditions, graders, compute, metrics, and constants.
  4. Create evaluator workflow: connect blocks on the canvas to define the evaluation flow. Chain graders with conditions to build routing logic.
  5. Deploy evaluator: test the full flow with Test run, then click Deploy to publish as a versioned evaluator.
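
Step 2 mentions Python code as one way to define scoring logic. This page does not show the function signature Respan expects from a code grader, so the sketch below is illustrative only: `format_check` and its `output` parameter are hypothetical names. It returns a Boolean for a "Format Check" style grader that passes when the output parses as a JSON object.

```python
import json


def format_check(output: str) -> bool:
    """Hypothetical code grader: True when `output` is a JSON object.

    The name and signature are assumptions for illustration, not
    Respan's documented code-grader interface.
    """
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    # Only a top-level JSON object counts as well formatted here.
    return isinstance(parsed, dict)
```

A Boolean result like this would pair with the Boolean output data type; a numeric grader would return a value inside its score range instead.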

Example

Simple: Original input -> LLM grader -> Final result.

Advanced: In this example, the “Response Quality” grader is configured with both an LLM definition and code evaluation. The workflow checks cost first: if cost > 5, it runs the LLM grader and routes low scores (< 3) to a human reviewer. Otherwise, it averages the LLM and code grader scores as the final result.

Evaluator builder workflow:

Original input
|
If (Cost > 5)
|
|- Then:
|    If (Response Quality [LLM] < 3)
|    |- Then: Final result = Response Quality [Human]
|
|- Else:
     Final result = Average of 2 values:
     Response Quality [LLM], Response Quality [Code]
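
The routing in the advanced workflow above can be sketched as plain Python. This is not how Respan executes a workflow; `final_result` and its parameters are hypothetical names used only to make the branches explicit. The diagram shows no branch for a cost above 5 combined with an LLM score of 3 or more, so the sketch returns `None` there instead of guessing.

```python
from typing import Optional


def final_result(
    cost: float,
    llm_score: float,
    code_score: float,
    human_score: Optional[float] = None,
) -> Optional[float]:
    """Mirror the advanced example's routing (illustrative only)."""
    if cost > 5:
        if llm_score < 3:
            # Low LLM score on a costly call: defer to the human reviewer.
            return human_score
        # The diagram shows no branch for this case, so the sketch
        # returns None rather than inventing a final result.
        return None
    # Else branch: "Average of 2 values" over the LLM and code scores.
    return (llm_score + code_score) / 2
```

For example, `final_result(2.0, 4.0, 2.0)` takes the Else branch and averages the two grader scores.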

Graders

A grader defines what to measure and how to score it. Click + in the Graders section to create one. Configure these fields:

  • Grader name: what this grader measures (e.g. “Response Quality”, “Format Check”)
  • Description (optional): helps human annotators understand the scoring criteria
  • Output data type: Number, Boolean, Categorical, or Comment
  • Score range: min and max values (e.g. 0-5)
  • Passing score: the minimum score to pass (e.g. 3)
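
To make the relationship between these fields concrete, here is a minimal sketch that models them as a Python dataclass. `GraderConfig` and its attribute names are hypothetical and are not part of any Respan SDK. The check that the passing score lies inside the score range is also an assumption, since this page does not say whether the builder enforces it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GraderConfig:
    """Hypothetical container for the grader fields listed above."""

    name: str
    output_data_type: str  # "Number", "Boolean", "Categorical", or "Comment"
    min_score: float
    max_score: float
    passing_score: float
    description: Optional[str] = None

    def __post_init__(self) -> None:
        # Assumed sanity checks; the page does not state these rules.
        if self.min_score > self.max_score:
            raise ValueError("min_score must not exceed max_score")
        if not self.min_score <= self.passing_score <= self.max_score:
            raise ValueError("passing_score must lie inside the score range")

    def passes(self, score: float) -> bool:
        # The passing score is the minimum score to pass, so it is inclusive.
        return score >= self.passing_score
```

With a 0-5 range and a passing score of 3, a score of exactly 3 passes and 2.9 does not.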

A single grader can include LLM, code, and human definitions. During a run, Respan uses whichever definition matches the evaluation mode.

LLM evaluation

Under LLM evaluation, write a definition that tells the judge model what to measure and how to score it.

The definition must include {{output}}; the other variables are optional:

| Variable | Description |
| --- | --- |
| {{input}} | The original input sent to the application |
| {{output}} | The generated output being graded (required) |
| {{expected_output}} | The reference or expected output, when provided |
| {{metadata}} | Custom metadata attached to the run |
| {{metrics}} | System metrics captured during the run |
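
As a rough illustration of how these variables fit into a definition, the sketch below checks that {{output}} is present and substitutes values into the placeholders. `render_definition` is a hypothetical helper, not a Respan function. Rendering an optional variable with no supplied value as an empty string is an assumption, since this page does not describe that case.

```python
import re

KNOWN_VARIABLES = {"input", "output", "expected_output", "metadata", "metrics"}
_PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render_definition(definition: str, values: dict) -> str:
    """Fill {{variable}} placeholders in an LLM grader definition (sketch)."""
    found = set(_PLACEHOLDER.findall(definition))
    if "output" not in found:
        raise ValueError("definition must include {{output}}")
    unknown = found - KNOWN_VARIABLES
    if unknown:
        raise ValueError(f"unknown variables: {sorted(unknown)}")
    # Assumption: optional variables without a value render as "".
    return _PLACEHOLDER.sub(lambda m: str(values.get(m.group(1), "")), definition)


definition = "Rate the answer {{output}} to the question {{input}} from 0 to 5."
prompt = render_definition(definition, {"input": "What is 2 + 2?", "output": "4"})
```

Validating the template before a run catches a missing {{output}} early, before any judge model is called.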

Click the pencil icon to configure the judge model and settings like temperature.

Test run the grader against sample inputs to verify scoring behavior before using it.


Blocks

Blocks are the building pieces of the evaluator workflow. Drag them from the palette and connect them on the canvas.

| Block | Options | Description |
| --- | --- | --- |
| Markers | Original input, Final result | Entry and exit points. Every evaluator starts with Original input and ends with Final result or a condition. |
| Graders | Your custom graders (e.g. “Response Quality”) | Run LLM, code, or human grading. Each grader block can be assigned to a specific evaluation mode (LLM, Code, or a human reviewer). |
| Conditions | If ... Then, If ... Then ... Else | Branch the flow based on values. Operators: >, <, >=, <=, ==, !=. |
| Compute | Average of N values, Weighted average of N values | Aggregate scores from multiple graders. Set custom weights for weighted averages. |
| Metrics | Completion tokens, Cost, Latency, Model, Prompt tokens, Total tokens | Built-in span values. Use in conditions or compute blocks. |
| Constants | Number, Text, True, False | Fixed values for thresholds, labels, or flags. |
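
The condition and compute blocks correspond to ordinary comparisons and averages. The sketch below shows one way to express them in Python; the function names are hypothetical and do not reflect Respan internals. Dividing by the sum of the weights is an assumption, because this page does not say whether custom weights are normalized.

```python
import operator
from typing import Any, Callable, Dict, Sequence

# The six comparison operators listed for condition blocks.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def condition(left: Any, op: str, right: Any) -> bool:
    """Evaluate an If condition such as Cost > 5."""
    return OPERATORS[op](left, right)


def average(values: Sequence[float]) -> float:
    """Average of N values."""
    return sum(values) / len(values)


def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of N values (assumes weights are normalized by their sum)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total
```

For instance, `condition(6, ">", 5)` is true, and weighting an LLM score of 4 three times as heavily as a code score of 2 gives `weighted_average([4, 2], [3, 1])`.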

Deploy

Once the evaluator is ready:

  • Test run: validate the full flow against sample data
  • Deploy: publish as a new version
  • Versions: view history and load older versions back into the editor