Prerequisites

  1. Sign up — Create an account at platform.respan.ai
  2. Create an API key — Generate one on the API keys page
  3. Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page

Overview

The best evaluation datasets come from real production data. This cookbook walks through the full evaluation loop:
  1. Find failing or low-quality logs in production
  2. Curate them into a dataset
  3. Create an evaluator to score responses
  4. Run experiments to compare prompt versions
  5. Deploy the winning version

1. Identify problem logs

Start by reviewing production logs on the Logs page. Look for:
  • Logs with low evaluation scores
  • Logs with error status codes
  • Logs flagged by users (negative feedback)
  • Logs with specific metadata tags (e.g., feature: "support_chat")
Use saved views to create a filter for logs you want to investigate.
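Once logs are fetched, the same kind of filtering can also be done client-side. A minimal sketch — the `eval_score` and `metadata` field names here are assumptions, so match them to your actual log schema:

```python
# Hypothetical log records; "eval_score" and "metadata" are assumed
# field names -- adjust to your actual log schema.
logs = [
    {"id": 1, "eval_score": 2.1, "metadata": {"feature": "support_chat"}},
    {"id": 2, "eval_score": 4.8, "metadata": {"feature": "search"}},
    {"id": 3, "eval_score": 1.5, "metadata": {"feature": "support_chat"}},
]

# Keep low-scoring support-chat logs for investigation
flagged = [
    log for log in logs
    if log["eval_score"] < 3 and log["metadata"].get("feature") == "support_chat"
]

print([log["id"] for log in flagged])  # [1, 3]
```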

2. Save logs to a dataset

Once you’ve identified representative failure cases, save them as a testset for offline evaluation.

Via the platform

  1. Select logs on the Logs page
  2. Click Save to dataset
  3. Map the log fields to testset columns:
    • Log input → testset variables (match your prompt template variables)
    • Log output → ideal_output (or manually write better expected outputs)

Via the API

import requests

headers = {"Authorization": "Bearer YOUR_RESPAN_API_KEY"}

# 1. Fetch recent logs with issues
logs_response = requests.get(
    "https://api.respan.ai/api/request-logs/",
    headers=headers,
    params={
        "status": "error",
        "limit": 50,
    },
)
logs_response.raise_for_status()
logs = logs_response.json()["data"]

# 2. Transform into testset rows
test_cases = []
for log in logs:
    test_cases.append({
        "user_message": log["input"],  # Map to your prompt variables
        "ideal_output": "",            # Fill in manually or leave empty
    })

# Use these test cases in your experiment
print(f"Created {len(test_cases)} test cases from production logs")
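Before uploading the rows as a testset, it can help to persist them for review, for example so `ideal_output` can be filled in by hand. A small sketch writing the curated rows to JSONL (the filename is arbitrary):

```python
import json

# Example rows in the shape built above; in practice these come
# from the transformed production logs.
test_cases = [
    {"user_message": "How do I reset my password?", "ideal_output": ""},
    {"user_message": "Why was I charged twice?", "ideal_output": ""},
]

# Persist the curated rows so ideal_output can be reviewed and
# filled in manually before creating the testset.
with open("testset.jsonl", "w") as f:
    for row in test_cases:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm the round trip
with open("testset.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(f"Wrote {len(rows)} rows")
```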

3. Create an evaluator

Set up an evaluator that scores the quality criteria you care about. Go to Evaluation > Evaluators > + New evaluator.

Example: Helpfulness evaluator
  • Name: Helpfulness
  • Type: LLM
  • Model: gpt-4o
  • Score type: Numerical (1-5)
  • Definition: Rate how helpful the response is to the user’s question. Consider: Does it directly answer the question? Is the information accurate? Is it actionable? Score 1 = unhelpful, 5 = excellent.
The evaluator has access to these variables in the definition:
  • {{llm_input}} — the input sent to the LLM
  • {{llm_output}} — the LLM’s response
  • {{ideal_output}} — the expected output (if provided in the dataset)
See LLM evaluators for more.
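To make the variable substitution concrete, here is a minimal sketch of how a definition template with `{{llm_input}}`, `{{llm_output}}`, and `{{ideal_output}}` placeholders gets rendered before being sent to the evaluator model (the substitution logic shown is an illustration, not the platform's actual implementation):

```python
# Evaluator definition using the available template variables
definition = (
    "Rate how helpful the response is to the user's question.\n"
    "Question: {{llm_input}}\n"
    "Response: {{llm_output}}\n"
    "Expected: {{ideal_output}}\n"
    "Score 1 = unhelpful, 5 = excellent."
)

# Values pulled from one log / testset row
variables = {
    "llm_input": "How do I cancel my subscription?",
    "llm_output": "Go to Settings > Billing and click Cancel.",
    "ideal_output": "",
}

# Substitute each placeholder into the definition
rendered = definition
for name, value in variables.items():
    rendered = rendered.replace("{{" + name + "}}", value)

print(rendered)
```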

4. Run experiments

Now test different prompt versions against your dataset:
  1. Go to Experiments > + New experiment
  2. Select your prompt and choose the versions to compare (e.g., v1 current vs v2 candidate)
  3. Load your testset
  4. Run the experiment
  5. Run evaluations on the outputs
The experiment results show scores side-by-side for each test case across prompt versions.
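If you pull the per-case scores out of an experiment, comparing versions is a small aggregation. A sketch with hypothetical helpfulness scores (your real numbers come from the experiment run):

```python
from statistics import mean

# Hypothetical helpfulness scores, one per test case per prompt version
scores = {
    "v1": [3, 2, 4, 3],
    "v2": [4, 4, 5, 3],
}

# Compare mean score per version
for version, vals in scores.items():
    print(f"{version}: mean helpfulness = {mean(vals):.2f}")

# Pick the version with the highest mean
winner = max(scores, key=lambda v: mean(scores[v]))
print(f"Winner: {winner}")
```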

5. Deploy the winner

Once you’ve identified the better prompt version:
  1. Go to Prompts and set the winning version as the active version
  2. Your production code automatically picks up the new version (if using the Prompts API)
  3. Set up online evaluation to continuously monitor the deployed version
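The "automatic pickup" in step 2 works because production code fetches the active prompt version at request time rather than hard-coding it. A minimal sketch — the endpoint path and response fields below are assumptions, so check the Prompts API reference for the exact shapes:

```python
def get_active_prompt(prompt_id, api_key, fetch):
    """Fetch the currently active version of a prompt.

    `fetch` is an HTTP GET callable (e.g. requests.get). The endpoint
    path and response fields here are assumptions -- consult the
    Prompts API reference for the exact shapes.
    """
    resp = fetch(
        f"https://api.respan.ai/api/prompts/{prompt_id}/",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    resp.raise_for_status()
    return resp.json()
```

Because the lookup happens per request (or behind a short-lived cache), flipping the active version in the dashboard takes effect without a redeploy.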

The full loop

Production logs → Identify failures → Curate dataset
        ↓
Create/update evaluators
        ↓
Run experiments (offline)
        ↓
Deploy winning prompt
        ↓
Monitor with online eval (back to production logs)

This loop gets faster over time. As your dataset grows and evaluators mature, each iteration requires less manual work.

Next steps