Prerequisites

  1. Sign up — Create an account at platform.respan.ai
  2. Create an API key — Generate one on the API keys page
  3. Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page

Overview

The best evaluation datasets come from real production data. This cookbook walks through the full evaluation loop:
  1. Find failing or low-quality logs in production
  2. Curate them into a dataset
  3. Create an evaluator to score responses
  4. Run experiments to compare prompt versions
  5. Deploy the winning version

1. Identify problem logs

Start by reviewing production logs on the Logs page. Look for:
  • Logs with low evaluation scores
  • Logs with error status codes
  • Logs flagged by users (negative feedback)
  • Logs with specific metadata tags (e.g., feature: "support_chat")
Use saved views to create a filter for logs you want to investigate.
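Once logs are fetched, the same kind of filtering can also be done client-side. A minimal sketch — the `eval_score` and `metadata` field names here are assumptions, so match them to your actual log schema:

```python
# Hypothetical log records; "eval_score" and "metadata" are assumed
# field names -- adjust to your actual log schema.
logs = [
    {"id": 1, "eval_score": 2.1, "metadata": {"feature": "support_chat"}},
    {"id": 2, "eval_score": 4.8, "metadata": {"feature": "search"}},
    {"id": 3, "eval_score": 1.5, "metadata": {"feature": "support_chat"}},
]

# Keep low-scoring support-chat logs for investigation
flagged = [
    log for log in logs
    if log["eval_score"] < 3 and log["metadata"].get("feature") == "support_chat"
]

print([log["id"] for log in flagged])  # [1, 3]
```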

2. Save logs to a dataset

Once you’ve identified representative failure cases, save them as a testset for offline evaluation.

Via the platform

  1. Select logs on the Logs page
  2. Click Save to dataset
  3. Map the log fields to testset columns:
    • Log input → testset variables (match your prompt template variables)
    • Log output → ideal_output (or manually write better expected outputs)

Via the API

import requests

headers = {"Authorization": "Bearer YOUR_RESPAN_API_KEY"}

# 1. Fetch recent logs with issues
logs_response = requests.get(
    "https://api.respan.ai/api/request-logs/",
    headers=headers,
    params={
        "status": "error",
        "limit": 50,
    },
)
logs_response.raise_for_status()
logs = logs_response.json()["data"]

# 2. Transform into testset rows
test_cases = []
for log in logs:
    test_cases.append({
        "user_message": log["input"],  # Map to your prompt variables
        "ideal_output": "",            # Fill in manually or leave empty
    })

# Use these test cases in your experiment
print(f"Created {len(test_cases)} test cases from production logs")
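Before uploading the rows as a testset, it can help to persist them for review, for example so `ideal_output` can be filled in by hand. A small sketch writing the curated rows to JSONL (the filename is arbitrary):

```python
import json

# Example rows in the shape built above; in practice these come
# from the transformed production logs.
test_cases = [
    {"user_message": "How do I reset my password?", "ideal_output": ""},
    {"user_message": "Why was I charged twice?", "ideal_output": ""},
]

# Persist the curated rows so ideal_output can be reviewed and
# filled in manually before creating the testset.
with open("testset.jsonl", "w") as f:
    for row in test_cases:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm the round trip
with open("testset.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(f"Wrote {len(rows)} rows")
```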

3. Create an evaluator

Set up an evaluator that scores the quality criteria you care about. Go to Evaluation > Evaluators > + New evaluator.

Example: Helpfulness evaluator
  • Name: Helpfulness
  • Type: LLM
  • Model: gpt-4o
  • Score type: Numerical (1-5)
  • Definition: Rate how helpful the response is to the user’s question. Consider: Does it directly answer the question? Is the information accurate? Is it actionable? Score 1 = unhelpful, 5 = excellent.
The evaluator has access to these variables in the definition:
  • {{llm_input}} — the input sent to the LLM
  • {{llm_output}} — the LLM’s response
  • {{ideal_output}} — the expected output (if provided in the dataset)
See LLM evaluators for more.
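To make the variable substitution concrete, here is a minimal sketch of how a definition template with `{{llm_input}}`, `{{llm_output}}`, and `{{ideal_output}}` placeholders gets rendered before being sent to the evaluator model (the substitution logic shown is an illustration, not the platform's actual implementation):

```python
# Evaluator definition using the available template variables
definition = (
    "Rate how helpful the response is to the user's question.\n"
    "Question: {{llm_input}}\n"
    "Response: {{llm_output}}\n"
    "Expected: {{ideal_output}}\n"
    "Score 1 = unhelpful, 5 = excellent."
)

# Values pulled from one log / testset row
variables = {
    "llm_input": "How do I cancel my subscription?",
    "llm_output": "Go to Settings > Billing and click Cancel.",
    "ideal_output": "",
}

# Substitute each placeholder into the definition
rendered = definition
for name, value in variables.items():
    rendered = rendered.replace("{{" + name + "}}", value)

print(rendered)
```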

4. Run experiments

Now test different prompt versions against your dataset:
  1. Go to Experiments > + New experiment
  2. Select your prompt and choose the versions to compare (e.g., v1 current vs v2 candidate)
  3. Load your testset
  4. Run the experiment
  5. Run evaluations on the outputs
The experiment results show scores side-by-side for each test case across prompt versions.
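If you pull the per-case scores out of an experiment, comparing versions is a small aggregation. A sketch with hypothetical helpfulness scores (your real numbers come from the experiment run):

```python
from statistics import mean

# Hypothetical helpfulness scores, one per test case per prompt version
scores = {
    "v1": [3, 2, 4, 3],
    "v2": [4, 4, 5, 3],
}

# Compare mean score per version
for version, vals in scores.items():
    print(f"{version}: mean helpfulness = {mean(vals):.2f}")

# Pick the version with the highest mean
winner = max(scores, key=lambda v: mean(scores[v]))
print(f"Winner: {winner}")
```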

5. Deploy the winner

Once you’ve identified the better prompt version:
  1. Go to Prompts and set the winning version as the active version
  2. Your production code automatically picks up the new version (if using the Prompts API)
  3. Set up online evaluation to continuously monitor the deployed version
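The "automatic pickup" in step 2 works because production code fetches the active prompt version at request time rather than hard-coding it. A minimal sketch — the endpoint path and response fields below are assumptions, so check the Prompts API reference for the exact shapes:

```python
def get_active_prompt(prompt_id, api_key, fetch):
    """Fetch the currently active version of a prompt.

    `fetch` is an HTTP GET callable (e.g. requests.get). The endpoint
    path and response fields here are assumptions -- consult the
    Prompts API reference for the exact shapes.
    """
    resp = fetch(
        f"https://api.respan.ai/api/prompts/{prompt_id}/",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    resp.raise_for_status()
    return resp.json()
```

Because the lookup happens per request (or behind a short-lived cache), flipping the active version in the dashboard takes effect without a redeploy.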

The full loop

Production logs → Identify failures → Curate dataset
        ↓
Create/update evaluators
        ↓
Run experiments (offline)
        ↓
Deploy winning prompt
        ↓
Monitor with online eval (back to production logs)

This loop gets faster over time. As your dataset grows and evaluators mature, each iteration requires less manual work.

Next steps