Set up Respan
- Sign up — Create an account at platform.respan.ai
- Create an API key — Generate one on the API keys page
- Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page
Overview
The best evaluation datasets come from real production data. This cookbook walks through the full evaluation loop:
- Find failing or low-quality logs in production
- Curate them into a dataset
- Create an evaluator to score responses
- Run experiments to compare prompt versions
- Deploy the winning version
1. Identify problem logs
Start by reviewing production logs on the Logs page. Look for:
- Logs with low evaluation scores
- Logs with error status codes
- Logs flagged by users (negative feedback)
- Logs with specific metadata tags (e.g., `feature: "support_chat"`)
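As a sketch of this triage step, a client-side filter over a batch of fetched logs might look like the following. The field names (`eval_score`, `status_code`, `feedback`, `metadata`) are illustrative assumptions, not Respan's actual log schema:

```python
# Filter a batch of production logs for likely failure cases.
# Field names here are illustrative, not Respan's actual log schema.

def find_problem_logs(logs, score_threshold=3):
    problems = []
    for log in logs:
        low_score = log.get("eval_score") is not None and log["eval_score"] < score_threshold
        errored = log.get("status_code", 200) >= 400
        flagged = log.get("feedback") == "negative"
        tagged = log.get("metadata", {}).get("feature") == "support_chat"
        if low_score or errored or flagged or tagged:
            problems.append(log)
    return problems

logs = [
    {"id": "a", "eval_score": 5, "status_code": 200},
    {"id": "b", "eval_score": 2, "status_code": 200},
    {"id": "c", "status_code": 500},
    {"id": "d", "feedback": "negative", "status_code": 200},
]
print([log["id"] for log in find_problem_logs(logs)])  # -> ['b', 'c', 'd']
```

In practice you would do this filtering with the Logs page filters rather than in code, but the same criteria apply.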
2. Save logs to a dataset
Once you’ve identified representative failure cases, save them as a testset for offline evaluation.
Via the platform
- Select logs on the Logs page
- Click Save to dataset
- Map the log fields to testset columns:
  - Log input → testset variables (match your prompt template variables)
  - Log output → `ideal_output` (or manually write better expected outputs)
Via the API
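The same mapping can be done programmatically. The sketch below only illustrates the field mapping described above; the log shape, testset payload, and dataset name are assumptions for illustration, not Respan's documented API:

```python
# Map saved logs into testset rows: log input variables become testset
# columns, and the log output becomes ideal_output. The log/payload
# shapes here are illustrative assumptions, not Respan's documented API.
import json

def log_to_testset_row(log):
    row = dict(log["input_variables"])   # keys match prompt template variables
    row["ideal_output"] = log["output"]  # or replace with a hand-written answer
    return row

logs = [
    {"input_variables": {"question": "How do I reset my password?"},
     "output": "Go to Settings > Security and click Reset password."},
]
rows = [log_to_testset_row(log) for log in logs]
payload = json.dumps({"name": "support-chat-failures", "rows": rows})
print(payload)
```

The resulting payload would then be sent to the dataset-creation endpoint with your API key.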
3. Create an evaluator
Set up an evaluator that scores the quality criteria you care about. Go to Evaluation > Evaluators > + New evaluator.
Example: Helpfulness evaluator

| Field | Value |
|---|---|
| Name | Helpfulness |
| Type | LLM |
| Model | gpt-4o |
| Score type | Numerical (1-5) |
| Definition | Rate how helpful the response is to the user’s question. Consider: Does it directly answer the question? Is the information accurate? Is it actionable? Score 1 = unhelpful, 5 = excellent. |
The evaluator definition can reference these template variables:
- `{{llm_input}}` — the input sent to the LLM
- `{{llm_output}}` — the LLM’s response
- `{{ideal_output}}` — the expected output (if provided in the dataset)
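To make the substitution concrete, here is a minimal sketch of how those template variables get filled in before the judge model sees the definition. Respan performs this rendering server-side; the values below are invented:

```python
# Minimal sketch of template-variable substitution in an evaluator
# definition. Respan does this server-side; values are invented.

definition = (
    "Rate how helpful the response is to the user's question.\n"
    "Input: {{llm_input}}\n"
    "Response: {{llm_output}}\n"
    "Expected: {{ideal_output}}"
)

def render(template, values):
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template

prompt = render(definition, {
    "llm_input": "How do I reset my password?",
    "llm_output": "Go to Settings > Security.",
    "ideal_output": "Visit Settings > Security and click Reset password.",
})
print(prompt)
```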
4. Run experiments
Now test different prompt versions against your dataset:
- Go to Experiments > + New experiment
- Select your prompt and choose the versions to compare (e.g., v1 current vs v2 candidate)
- Load your testset
- Run the experiment
- Run evaluations on the outputs
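Once the evaluations finish, comparing versions comes down to aggregating scores per version. A minimal sketch, with invented score data (the platform computes and displays these aggregates for you):

```python
# Compare evaluation scores across two prompt versions after an
# experiment run. Score data is invented for illustration; the
# platform computes these aggregates for you.
from statistics import mean

scores = {
    "v1": [3, 4, 2, 3, 4],  # current version
    "v2": [4, 5, 4, 4, 5],  # candidate version
}
averages = {version: mean(vals) for version, vals in scores.items()}
winner = max(averages, key=averages.get)
print(averages, "winner:", winner)  # v2 averages higher here
```

Beyond the mean, it is worth eyeballing the per-row scores: a candidate that wins on average but regresses badly on a few rows may still not be safe to ship.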
5. Deploy the winner
Once you’ve identified the better prompt version:
- Go to Prompts and set the winning version as the active version
- Your production code automatically picks up the new version (if using the Prompts API)
- Set up online evaluation to continuously monitor the deployed version
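For online evaluation, scoring every production request is often unnecessary; a common pattern is to deterministically sample a fraction of traffic. The hash-based sampler below is a generic sketch of that pattern, not a Respan feature; the 10% rate is an arbitrary choice:

```python
# Deterministically sample a fraction of live requests for online
# evaluation. Generic pattern, not a Respan feature; 10% is arbitrary.
import hashlib

def should_evaluate(request_id, sample_rate=0.10):
    """Hash the request ID into [0, 1) and keep it below sample_rate,
    so the same request always gets the same decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") / 65536
    return bucket < sample_rate

sampled = sum(should_evaluate(f"req-{i}") for i in range(10000))
print(sampled)  # roughly 1000 of 10000 requests
```

Deterministic sampling beats `random.random()` here because retried or replayed requests get a stable decision, which keeps the evaluated sample consistent.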
The full loop
This loop gets faster over time. As your dataset grows and evaluators mature, each iteration requires less manual work.