An experiment runs evaluations over a dataset and produces scores. It generates outputs by running every row in your dataset through a prompt or model, then runs the evaluator workflow over each output, and aggregates the results. Go to Experiments and click New experiment to get started.
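The generate, evaluate, and aggregate flow described above can be pictured with a small sketch. This is not Respan's API: `run_experiment`, the `generate` callable, the evaluator signature, and the choice of a plain mean for aggregation are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable

Row = dict[str, str]
Generator = Callable[[Row], str]          # hypothetical: turns a row into an output
Evaluator = Callable[[Row, str], float]   # hypothetical: scores one output


def run_experiment(dataset: list[Row], generate: Generator,
                   evaluators: dict[str, Evaluator]):
    """Generate one output per row, score it, then aggregate per evaluator."""
    results = []
    for row in dataset:
        output = generate(row)
        scores = {name: evaluate(row, output)
                  for name, evaluate in evaluators.items()}
        results.append({"row": row, "output": output, "scores": scores})
    # Aggregation is assumed to be a plain mean; a real platform may differ.
    summary = ({name: mean(r["scores"][name] for r in results)
                for name in evaluators} if results else {})
    return results, summary


# Toy stand-ins for a model and an evaluator.
dataset = [
    {"question": "hi", "expected": "HI"},
    {"question": "yo", "expected": "no"},
]
results, summary = run_experiment(
    dataset,
    generate=lambda row: row["question"].upper(),
    evaluators={"exact_match": lambda row, out: 1.0 if out == row["expected"] else 0.0},
)
```

Swapping only the `generate` callable while keeping `dataset` and `evaluators` fixed mirrors the idea of changing one variable per experiment.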
When creating an experiment, you choose a task type that determines how outputs are generated. Pick the type that matches what you want to test.
Use Prompt when you have a saved prompt template with variables like {{question}} and want to test how it performs across a dataset.
Respan fills the template with each row’s variables, generates an output for every row, and runs the evaluators on the results.
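As a rough picture of what filling a template with a row's variables involves, here is a minimal sketch. The `fill_template` helper is hypothetical, and raising an error on a missing variable is an assumption; how Respan actually treats missing variables is not stated here.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_template(template: str, row: dict[str, str]) -> str:
    """Replace each {{variable}} with the matching value from a dataset row."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            # Assumption: fail loudly rather than leave the placeholder in place.
            raise KeyError(f"dataset row has no value for variable {name!r}")
        return str(row[name])

    return PLACEHOLDER.sub(substitute, template)


prompt = fill_template("Answer briefly: {{question}}",
                       {"question": "What is 2 + 2?"})
```

A check like this is a cheap way to confirm that every variable in a template has a matching column in the dataset before running an experiment.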
To compare prompt versions, create multiple experiments with the same dataset and evaluators but different prompt versions.

Use this task type when you want to compare models or generation settings. No prompt template is needed.
Respan sends each row’s input directly to the model, generates outputs, and runs the evaluators.
To compare models, create multiple experiments with the same dataset and evaluators but different model configurations.

Use this task type when your dataset already contains outputs and you only want to score them without calling a model.
No generation happens. Respan runs the evaluators directly on the outputs stored in your dataset.
After an experiment finishes, inspect the generated outputs and evaluator scores per row.

The Analytics tab compares evaluator score distributions across experiments. The histogram groups results into score ranges so you can spot patterns and compare runs side by side.
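To make the idea of grouping scores into ranges concrete, here is a small sketch that bins two runs the same way so they can be compared. The `score_histogram` helper, the 0 to 1 score range, and the five equal-width bins are assumptions for illustration; the Analytics tab may use different ranges.

```python
def score_histogram(scores: list[float], bins: int = 5,
                    low: float = 0.0, high: float = 1.0) -> list[int]:
    """Count how many scores fall into each equal-width range."""
    width = (high - low) / bins
    counts = [0] * bins
    for score in scores:
        index = int((score - low) / width)
        # Clamp so out-of-range scores and score == high land in an edge bin.
        index = max(0, min(index, bins - 1))
        counts[index] += 1
    return counts


# Hypothetical evaluator scores from two experiments on the same dataset.
run_a = [0.1, 0.15, 0.5, 0.95, 1.0]
run_b = [0.5, 0.7, 0.9, 0.95, 1.0]

hist_a = score_histogram(run_a)
hist_b = score_histogram(run_b)
```

Using identical bin edges for both runs is what makes the counts comparable side by side.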

To improve your outputs, run multiple experiments and compare the results. Repeat until you are satisfied with the quality, then deploy the winning prompt or model to production.