GEPA: self-improving prompt optimization

A Generate-Evaluate-Promote-Analyze loop for systematically improving prompts with data.

Before you start, you need a Respan account with credits or a connected provider:

  1. Sign up — Create an account at platform.respan.ai
  2. Create an API key — Generate one on the API keys page
  3. Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page

Overview

GEPA (Generate, Evaluate, Promote, Analyze) is a self-improving workflow for prompt optimization. Instead of guessing which prompt changes will work, you run a structured loop that uses evaluation data to drive improvements.

┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
│  Generate  │────▶│  Evaluate  │────▶│  Promote   │────▶│  Analyze   │
│ candidates │     │ with data  │     │ the winner │     │  failures  │
└────────────┘     └────────────┘     └────────────┘     └─────┬──────┘
      ▲                                                        │
      └────────────────────────────────────────────────────────┘
                            Next iteration

Each cycle produces a measured comparison, so every change is backed by evidence rather than intuition. Over multiple iterations, you converge on an optimized prompt.

The GEPA loop

1. Generate — Create prompt candidates

Start with your current prompt and create variants. Variations can target:

  • Instructions: Rewrite the system message with different framing
  • Examples: Add, remove, or change few-shot examples
  • Constraints: Add guardrails, output format rules, or tone guidelines
  • Models: Test the same prompt on different models

Create these as prompt versions in Respan:

  1. Go to Prompts
  2. Open your prompt
  3. Create new versions with your variations

Example: For a support chatbot prompt, you might create:

  • v1 (baseline): Simple instructions
  • v2: Added few-shot examples of good responses
  • v3: Added explicit constraints (“never say I don’t know”)
  • v4: Restructured as step-by-step reasoning
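One lightweight way to keep candidates honest is to record each version's single change next to its text, so every later comparison isolates one variable. This is purely illustrative (the version names and system messages below are made up, not a Respan schema):

```python
# Illustrative candidate registry: one deliberate change per version,
# recorded alongside the prompt text so experiments stay interpretable.
candidates = {
    "v1": {
        "change": "baseline",
        "system": "You are a helpful support agent.",
    },
    "v2": {
        "change": "few-shot examples",
        "system": (
            "You are a helpful support agent.\n\n"
            "Example Q: How do I reset my password?\n"
            "Example A: Go to Settings > Security and click Reset password."
        ),
    },
    "v3": {
        "change": "explicit constraints",
        "system": (
            "You are a helpful support agent. If you are unsure, "
            "offer to escalate instead of saying you don't know."
        ),
    },
}
```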

2. Evaluate — Test with data

Run all candidates against the same dataset using experiments:

  1. Go to Experiments > + New experiment
  2. Select your prompt and check all versions to compare
  3. Load your testset (curated from production logs or manually created)
  4. Run the experiment
  5. Run your evaluator(s) on all outputs

Key evaluators to set up:

Evaluator              What it measures                              Score type
Task accuracy          Does the output correctly complete the task?  Numerical (1-5)
Instruction following  Does it follow all prompt constraints?        Boolean
Tone                   Is the tone appropriate for the use case?     Numerical (1-5)

Use at least 20-30 test cases per experiment; with fewer, the score differences between versions may be noise rather than a real improvement.
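Once evaluator scores come back, the per-version comparison boils down to simple aggregation. A minimal sketch, assuming each result row carries a version number, a 1-5 accuracy score, and a boolean instruction-following flag (these field names are illustrative, not a Respan schema):

```python
from collections import defaultdict


def summarize(rows):
    """Aggregate evaluator results per prompt version.

    rows: [{"version": int, "accuracy": int (1-5), "followed": bool}]
    Returns {version: {"avg_accuracy": float, "pass_rate": float, "n": int}}.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["version"]].append(row)

    return {
        version: {
            "avg_accuracy": sum(r["accuracy"] for r in group) / len(group),
            "pass_rate": sum(1 for r in group if r["followed"]) / len(group),
            "n": len(group),
        }
        for version, group in buckets.items()
    }
```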

3. Promote — Deploy the winner

Compare experiment results across all versions:

  • Average scores per evaluator per version
  • Pass rate for boolean evaluators
  • Cost and latency differences between versions/models

If a candidate clearly outperforms the baseline:

  1. Set it as the active version in the Prompts page
  2. Your production code automatically picks up the new version (if using the Prompts API)
  3. Monitor with online evaluation to confirm the improvement holds in production

If no candidate is clearly better, move to the Analyze step.
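The promote decision itself can be made mechanical. A sketch of one possible promotion rule, assuming a per-version summary of average scores and a hypothetical minimum relative lift over the baseline before switching:

```python
def pick_winner(summary, baseline, min_lift=0.05):
    """Return the version to promote, or None if no clear winner.

    summary: {version: {"avg_accuracy": float}}
    min_lift: relative improvement over baseline required to promote
              (5% here; a placeholder, tune to your own tolerance).
    """
    base = summary[baseline]["avg_accuracy"]
    best = max(summary, key=lambda v: summary[v]["avg_accuracy"])
    lift = (summary[best]["avg_accuracy"] - base) / base
    if best != baseline and lift >= min_lift:
        return best
    return None  # no clear winner: move to the Analyze step
```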

4. Analyze — Learn from failures

Look at the test cases where even the best version scored poorly:

  1. Filter experiment results for low-scoring rows
  2. Read the inputs, outputs, and evaluator reasoning
  3. Identify patterns:
    • Are there specific question types the prompt handles poorly?
    • Are there edge cases not covered by the instructions?
    • Is the model struggling with certain reasoning tasks?

These insights feed the next Generate step. Each pattern you identify becomes a targeted improvement in your next prompt candidate.
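Pattern-finding can start as simple counting. A sketch, assuming each test case carries an illustrative category label (for example, a question type you tag manually) and a 1-5 score:

```python
from collections import Counter


def failure_patterns(rows, threshold=3):
    """Count failure categories among low-scoring rows.

    rows: [{"score": int (1-5), "category": str}], where "category" is
    a label you attach to each test case (hypothetical, not a Respan field).
    Returns (category, count) pairs, most frequent first.
    """
    low = [r for r in rows if r["score"] < threshold]
    return Counter(r["category"] for r in low).most_common()
```

The most frequent categories become the targeted improvements in your next Generate step.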

Implementing GEPA with Respan

Here’s a practical implementation using Respan’s features:

Setup (one-time)

```python
from openai import OpenAI
import requests

client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)
headers = {"Authorization": "Bearer YOUR_RESPAN_API_KEY"}


def get_prompt(name, version=None):
    params = {"prompt_name": name}
    if version:
        params["version"] = version
    resp = requests.get(
        "https://api.respan.ai/api/prompts/",
        headers=headers,
        params=params,
    )
    resp.raise_for_status()  # surface auth or not-found errors early
    return resp.json()
```

Run the loop

```python
def gepa_iteration(prompt_name: str, test_cases: list[dict]):
    """One GEPA iteration."""

    # 1. GENERATE — Fetch all versions
    # (Create versions manually in the platform first)
    versions = [1, 2, 3]  # Your prompt versions to compare

    # 2. EVALUATE — Test each version
    results = {}
    for version in versions:
        prompt = get_prompt(prompt_name, version=version)
        version_results = []

        for case in test_cases:
            response = client.chat.completions.create(
                model=prompt["model"],
                messages=[
                    {"role": "system", "content": prompt["messages"][0]["content"]},
                    {"role": "user", "content": case["input"]},
                ],
                extra_body={
                    "metadata": {
                        "gepa_iteration": "1",
                        "prompt_version": f"v{version}",
                    },
                },
            )
            version_results.append({
                "input": case["input"],
                "output": response.choices[0].message.content,
                "ideal": case.get("ideal_output", ""),
            })

        results[version] = version_results

    # 3. PROMOTE — Compare scores on the platform
    # View experiment results in the Respan dashboard

    # 4. ANALYZE — Review failures
    # Filter for low-scoring outputs, identify patterns
    # Use patterns to create v4, v5... for next iteration

    return results
```

Automate the monitoring

After promoting a version, set up continuous monitoring:

  1. Online evaluation: Automatically score 10-20% of production traffic
  2. Alert on regression: Set up an automation that alerts when average scores drop below your threshold
  3. Collect new failures: Periodically export low-scoring production logs to add to your test dataset
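The regression alert can be approximated with a rolling average over recent online-eval scores. A minimal sketch (the window size and threshold are placeholders for your own values; this is client-side logic, not a Respan API):

```python
from collections import deque


class RegressionMonitor:
    """Track a rolling average of evaluator scores and flag drops."""

    def __init__(self, window=50, threshold=3.5):
        self.scores = deque(maxlen=window)  # keep only the most recent scores
        self.threshold = threshold          # alert when the rolling avg falls below

    def record(self, score):
        """Add a score; return True if the rolling average signals a regression.

        Note: with few samples early on, the signal is noisy; in practice,
        wait for a full window before acting on alerts.
        """
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg < self.threshold
```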

Example: 3 iterations of GEPA

Iteration         Change                                                    Avg score  Improvement
Baseline (v1)     Simple instructions                                       3.2 / 5
Iteration 1 (v2)  Added few-shot examples                                   3.8 / 5    +19%
Iteration 2 (v4)  Added edge case handling from failure analysis            4.1 / 5    +8%
Iteration 3 (v6)  Switched to step-by-step reasoning for complex questions  4.4 / 5    +7%

Each iteration is small, measurable, and evidence-based.

Tips

  • Keep a changelog: Document what changed in each version and why
  • Don’t change too much at once: Isolate variables so you know what caused the improvement
  • Grow your dataset: Add production failures after every iteration
  • Set a target: Define what score counts as “good enough” so you know when to stop optimizing
  • Automate what you can: Use online evals and automations to reduce manual review work

Next steps