Experiments lets you run repeatable evaluations over a dataset and inspect outputs, evaluator scores, and run status. Choose a workflow type based on your use case.

Prompt workflow

Render a saved prompt template with dataset variables, then run LLM calls automatically.
1. Go to Experiments and click New experiment.
2. Select the dataset you want to run on.
3. Pick Prompt as the task type, then choose the prompt and version.
4. Select evaluators to score outputs, then click Create. Inspect outputs and scores once the run finishes.
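Conceptually, the rendering in step 3 substitutes each dataset row's variables into the saved template before the LLM call. The sketch below illustrates this with Python string formatting; the `{variable}` placeholder syntax is an assumption, so check your prompt editor for the exact syntax your templates use.

```python
# Illustrative only: how a saved prompt template is combined with one
# dataset row. The platform does this automatically for every row.

def render_prompt(template: str, row: dict) -> str:
    """Substitute dataset-row variables into a prompt template."""
    return template.format(**row)

template = "Summarize the following ticket:\n{ticket_text}"
row = {"ticket_text": "App crashes when uploading a photo."}
prompt = render_prompt(template, row)
```

Each row in the dataset must supply every variable the template references, or rendering fails for that row.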

Completion workflow

Run direct LLM completions on dataset messages automatically — no prompt templates needed.
1. Go to Experiments and click New experiment.
2. Select the dataset you want to run on.
3. Pick LLM generation (chat completion) as the task type, then configure the model and parameters (temperature, max tokens).
4. Select evaluators, then click Create. Inspect outputs and scores once the run finishes.
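For each dataset row, the completion workflow sends the row's messages directly to the model with the parameters configured in step 3. The request shape below is a hypothetical sketch modeled on common chat-completion APIs; the field names (`model`, `temperature`, `max_tokens`, `messages`) and the model id are assumptions, not the platform's actual schema.

```python
# Hypothetical per-row request body for the completion workflow.

def build_completion_request(messages, model="your-model-id",
                             temperature=0.2, max_tokens=512):
    return {
        "model": model,
        "temperature": temperature,  # sampling randomness; lower = more deterministic
        "max_tokens": max_tokens,    # cap on generated tokens
        "messages": messages,        # taken directly from the dataset row
    }

req = build_completion_request([{"role": "user", "content": "Hi"}])
```

A low temperature makes repeated experiment runs easier to compare, since outputs vary less between runs.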

Custom workflow

Fetch inputs, run your own code/model, then submit outputs back for automatic evaluation.
1. Go to Experiments and click New experiment.
2. Select the dataset you want to run on.
3. Pick Custom as the task type, select evaluators, then click Create.
4. Submit outputs via the API. The system creates placeholder rows; use the API to submit your outputs, and evaluators run automatically. The UI is used to monitor progress and review results.
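The fetch/run/submit loop in step 4 can be sketched as follows. The endpoint paths and payload fields below are assumptions for illustration; consult the API reference for the real routes. `http_get` and `http_post` stand in for whatever HTTP client you use.

```python
# Hedged sketch of the custom workflow: fetch the placeholder rows,
# run your own code/model on each input, and submit the outputs back
# so evaluators can run automatically.

def build_output_payload(row_id, output):
    """Body for submitting one row's output back for evaluation.
    Field names are illustrative, not the platform's actual schema."""
    return {"row_id": row_id, "output": output}

def run_custom_experiment(experiment_id, my_model, http_get, http_post):
    rows = http_get(f"/experiments/{experiment_id}/rows")  # hypothetical route
    for row in rows:
        output = my_model(row["input"])  # your own code or model here
        http_post(f"/experiments/{experiment_id}/outputs",  # hypothetical route
                  build_output_payload(row["id"], output))
```

Submitting outputs row by row lets you watch scores appear incrementally in the UI while your job is still running.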

Troubleshooting

If an experiment shows no results yet, it may still be processing: wait 5–10 seconds and retry. Also check that your dataset is not empty.
If evaluator scores are missing, confirm the evaluator slug exists and is accessible. Evaluators run asynchronously, so poll the log detail endpoint after submission.
If output fields appear truncated, use the detail endpoint to retrieve the full span tree and untruncated fields.
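Because evaluators run asynchronously, polling the log detail endpoint is the reliable way to wait for scores after submission. A minimal polling sketch, assuming a `get_log` function wrapping your API call and a `"scores"` field in the response (both are assumptions):

```python
import time

def wait_for_scores(get_log, log_id, attempts=10, delay=5.0):
    """Poll the log detail endpoint until evaluator scores appear."""
    for _ in range(attempts):
        log = get_log(log_id)      # your call to the log detail endpoint
        if log.get("scores"):      # "scores" field name is an assumption
            return log["scores"]
        time.sleep(delay)          # evaluators run asynchronously
    raise TimeoutError(f"evaluators did not finish for log {log_id}")
```

A 5-second delay matches the "wait 5–10 seconds and retry" guidance above; increase `attempts` for slow evaluators rather than shortening the delay.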