A/B test prompts in production
Set up Respan
- Sign up — Create an account at platform.respan.ai
- Create an API key — Generate one on the API keys page
- Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page
Overview
Prompt changes can have unpredictable effects on output quality. Instead of deploying a new prompt version to all users at once, you can A/B test it: route a percentage of traffic to the new version, compare evaluation scores, and promote the winner.
This cookbook walks through:
- Creating two prompt versions
- Routing traffic by customer segment
- Comparing results with evaluators
1. Create prompt versions
On the Prompts page, create a prompt with two versions:
- v1 (current): Your existing production prompt
- v2 (candidate): The new prompt you want to test
Each version can have different system instructions, templates, models, or parameters. See Version control for details.
2. Fetch prompts in code
Use the Respan SDK to fetch prompt versions at runtime:
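The exact client surface depends on your SDK version, so the sketch below uses plain HTTP instead. The base URL, the `/prompts/{id}/versions/{version}` route, and the `RESPAN_API_KEY` environment variable are all assumptions for illustration; check the Respan API reference for the real paths and auth scheme.

```python
import json
import os
import urllib.request

RESPAN_BASE_URL = "https://platform.respan.ai/api"  # assumed base URL


def build_prompt_request(prompt_id: str, version: str) -> urllib.request.Request:
    """Build (but do not send) a request for one prompt version.

    The endpoint path here is hypothetical; consult the Respan API
    reference for the actual route.
    """
    url = f"{RESPAN_BASE_URL}/prompts/{prompt_id}/versions/{version}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {os.environ.get('RESPAN_API_KEY', '')}"},
    )


def fetch_prompt(prompt_id: str, version: str) -> dict:
    """Fetch a prompt version at runtime (performs the network call)."""
    req = build_prompt_request(prompt_id, version)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Fetching at runtime (rather than hard-coding the prompt text) is what lets you change the traffic split later without redeploying.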
3. Route traffic by segment
Split users between prompt versions using customer_identifier or any segmentation logic:
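A common, platform-agnostic way to do this is a deterministic hash split on customer_identifier, so the same customer always lands on the same variant across requests. A minimal sketch, with the 20% rollout fraction as an example value:

```python
import hashlib

ROLLOUT_FRACTION = 0.2  # share of traffic routed to the v2 candidate (example value)


def pick_prompt_version(customer_identifier: str) -> str:
    """Deterministically assign a customer to v1 or v2.

    Hashing the identifier (instead of calling random.random() per
    request) keeps each customer on a single variant for the whole test.
    """
    digest = hashlib.sha256(customer_identifier.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "v2" if bucket < ROLLOUT_FRACTION else "v1"
```

Record the chosen version in each request's metadata (e.g. metadata.prompt_version) so you can filter on it when comparing results.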
4. Evaluate both variants
Set up an online evaluation automation to score both variants automatically:
- Create an evaluator that scores response quality (e.g., helpfulness, accuracy)
- Create a condition that matches logs with metadata.experiment = "support_prompt_ab_test"
- Create an automation that runs the evaluator on matched logs
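For the condition to match, every logged request from the experiment must carry that metadata. The experiment and prompt_version field names come from this cookbook; the dict shape and how you attach it to a request are assumptions that depend on your client:

```python
def experiment_metadata(prompt_version: str) -> dict:
    """Metadata to attach to each logged request so the automation's
    condition (metadata.experiment = "support_prompt_ab_test") matches.

    Pass this as the request's metadata when calling Respan; the exact
    parameter name depends on your SDK version.
    """
    return {
        "experiment": "support_prompt_ab_test",
        "prompt_version": prompt_version,
    }
```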
5. Compare results
Filter the Dashboard by metadata.prompt_version to compare:
- Average evaluation scores per variant
- Cost per variant
- Latency per variant
- User feedback per variant
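The Dashboard does this aggregation for you, but if you export logs for offline analysis, the same comparison is a group-by on metadata.prompt_version. A sketch over an assumed list-of-dicts export shape (the eval_score, cost, and latency field names are illustrative):

```python
from collections import defaultdict
from statistics import mean


def summarize_by_version(logs: list[dict]) -> dict:
    """Group exported logs by prompt version and average the numeric fields.

    Assumes each row carries metadata.prompt_version plus numeric
    eval_score, cost, and latency fields (an illustrative export shape).
    """
    buckets = defaultdict(list)
    for row in logs:
        buckets[row["metadata"]["prompt_version"]].append(row)
    return {
        version: {
            "avg_eval_score": mean(r["eval_score"] for r in rows),
            "avg_cost": mean(r["cost"] for r in rows),
            "avg_latency": mean(r["latency"] for r in rows),
            "n": len(rows),
        }
        for version, rows in buckets.items()
    }
```

Keep an eye on the per-variant sample size (n) as well as the averages; a small bucket can make one variant look better by chance.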
Once you have enough data, promote the winning version and update your rollout to 100%.