GEPA: self-improving prompt optimization

A Generate-Evaluate-Promote-Analyze loop for systematically improving prompts with data.

Before you start, you need a Respan account with credits or a connected provider:

  1. Sign up — Create an account at platform.respan.ai
  2. Create an API key — Generate one on the API keys page
  3. Add credits or a provider key — Add credits on the Credits page or connect your own provider key on the Integrations page

Overview

GEPA (Generate, Evaluate, Promote, Analyze) is a self-improving workflow for prompt optimization. Instead of guessing which prompt changes will work, you run a structured loop that uses evaluation data to drive improvements.

┌────────────┐     ┌────────────┐     ┌────────────┐     ┌────────────┐
│  Generate  │────▶│  Evaluate  │────▶│  Promote   │────▶│  Analyze   │
│ candidates │     │ with data  │     │ the winner │     │  failures  │
└────────────┘     └────────────┘     └────────────┘     └─────┬──────┘
      ▲                                                        │
      └────────────────────────────────────────────────────────┘
                            Next iteration

Each cycle produces a measured comparison, so every change is backed by evidence rather than intuition. Over multiple iterations, you converge on an optimized prompt.

The GEPA loop

1. Generate — Create prompt candidates

Start with your current prompt and create variants. Variations can target:

  • Instructions: Rewrite the system message with different framing
  • Examples: Add, remove, or change few-shot examples
  • Constraints: Add guardrails, output format rules, or tone guidelines
  • Models: Test the same prompt on different models

Create these as prompt versions in Respan:

  1. Go to Prompts
  2. Open your prompt
  3. Create new versions with your variations

Example: For a support chatbot prompt, you might create:

  • v1 (baseline): Simple instructions
  • v2: Added few-shot examples of good responses
  • v3: Added explicit constraints (“never say I don’t know”)
  • v4: Restructured as step-by-step reasoning
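One lightweight way to keep candidates honest is to record each version's single change next to its text, so every later comparison isolates one variable. This is purely illustrative (the version names and system messages below are made up, not a Respan schema):

```python
# Illustrative candidate registry: one deliberate change per version,
# recorded alongside the prompt text so experiments stay interpretable.
candidates = {
    "v1": {
        "change": "baseline",
        "system": "You are a helpful support agent.",
    },
    "v2": {
        "change": "few-shot examples",
        "system": (
            "You are a helpful support agent.\n\n"
            "Example Q: How do I reset my password?\n"
            "Example A: Go to Settings > Security and click Reset password."
        ),
    },
    "v3": {
        "change": "explicit constraints",
        "system": (
            "You are a helpful support agent. If you are unsure, "
            "offer to escalate instead of saying you don't know."
        ),
    },
}
```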

2. Evaluate — Test with data

Run all candidates against the same dataset using experiments:

  1. Go to Experiments > + New experiment
  2. Select your prompt and check all versions to compare
  3. Load your testset (curated from production logs or manually created)
  4. Run the experiment
  5. Run your evaluator(s) on all outputs

Key evaluators to set up:

Evaluator              What it measures                              Score type
Task accuracy          Does the output correctly complete the task?  Numerical (1-5)
Instruction following  Does it follow all prompt constraints?        Boolean
Tone                   Is the tone appropriate for the use case?     Numerical (1-5)

Use at least 20-30 test cases per experiment; with fewer, the score differences between versions may be noise rather than a real improvement.
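Once evaluator scores come back, the per-version comparison boils down to simple aggregation. A minimal sketch, assuming each result row carries a version number, a 1-5 accuracy score, and a boolean instruction-following flag (these field names are illustrative, not a Respan schema):

```python
from collections import defaultdict


def summarize(rows):
    """Aggregate evaluator results per prompt version.

    rows: [{"version": int, "accuracy": int (1-5), "followed": bool}]
    Returns {version: {"avg_accuracy": float, "pass_rate": float, "n": int}}.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["version"]].append(row)

    return {
        version: {
            "avg_accuracy": sum(r["accuracy"] for r in group) / len(group),
            "pass_rate": sum(1 for r in group if r["followed"]) / len(group),
            "n": len(group),
        }
        for version, group in buckets.items()
    }
```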

3. Promote — Deploy the winner

Compare experiment results across all versions:

  • Average scores per evaluator per version
  • Pass rate for boolean evaluators
  • Cost and latency differences between versions/models

If a candidate clearly outperforms the baseline:

  1. Set it as the active version in the Prompts page
  2. Your production code automatically picks up the new version (if using the Prompts API)
  3. Monitor with online evaluation to confirm the improvement holds in production

If no candidate is clearly better, move to the Analyze step.
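The promote decision itself can be made mechanical. A sketch of one possible promotion rule, assuming a per-version summary of average scores and a hypothetical minimum relative lift over the baseline before switching:

```python
def pick_winner(summary, baseline, min_lift=0.05):
    """Return the version to promote, or None if no clear winner.

    summary: {version: {"avg_accuracy": float}}
    min_lift: relative improvement over baseline required to promote
              (5% here; a placeholder, tune to your own tolerance).
    """
    base = summary[baseline]["avg_accuracy"]
    best = max(summary, key=lambda v: summary[v]["avg_accuracy"])
    lift = (summary[best]["avg_accuracy"] - base) / base
    if best != baseline and lift >= min_lift:
        return best
    return None  # no clear winner: move to the Analyze step
```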

4. Analyze — Learn from failures

Look at the test cases where even the best version scored poorly:

  1. Filter experiment results for low-scoring rows
  2. Read the inputs, outputs, and evaluator reasoning
  3. Identify patterns:
    • Are there specific question types the prompt handles poorly?
    • Are there edge cases not covered by the instructions?
    • Is the model struggling with certain reasoning tasks?

These insights feed the next Generate step. Each pattern you identify becomes a targeted improvement in your next prompt candidate.
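Pattern-finding can start as simple counting. A sketch, assuming each test case carries an illustrative category label (for example, a question type you tag manually) and a 1-5 score:

```python
from collections import Counter


def failure_patterns(rows, threshold=3):
    """Count failure categories among low-scoring rows.

    rows: [{"score": int (1-5), "category": str}], where "category" is
    a label you attach to each test case (hypothetical, not a Respan field).
    Returns (category, count) pairs, most frequent first.
    """
    low = [r for r in rows if r["score"] < threshold]
    return Counter(r["category"] for r in low).most_common()
```

The most frequent categories become the targeted improvements in your next Generate step.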

Implementing GEPA with Respan

Here’s a practical implementation using Respan’s features:

Setup (one-time)

```python
from openai import OpenAI
import requests

client = OpenAI(
    base_url="https://api.respan.ai/api/",
    api_key="YOUR_RESPAN_API_KEY",
)
headers = {"Authorization": "Bearer YOUR_RESPAN_API_KEY"}


def get_prompt(name, version=None):
    params = {"prompt_name": name}
    if version:
        params["version"] = version
    resp = requests.get(
        "https://api.respan.ai/api/prompts/",
        headers=headers,
        params=params,
    )
    resp.raise_for_status()  # surface auth or not-found errors early
    return resp.json()
```

Run the loop

```python
def gepa_iteration(prompt_name: str, test_cases: list[dict]):
    """One GEPA iteration."""

    # 1. GENERATE — Fetch all versions
    # (Create versions manually in the platform first)
    versions = [1, 2, 3]  # Your prompt versions to compare

    # 2. EVALUATE — Test each version
    results = {}
    for version in versions:
        prompt = get_prompt(prompt_name, version=version)
        version_results = []

        for case in test_cases:
            response = client.chat.completions.create(
                model=prompt["model"],
                messages=[
                    {"role": "system", "content": prompt["messages"][0]["content"]},
                    {"role": "user", "content": case["input"]},
                ],
                extra_body={
                    "metadata": {
                        "gepa_iteration": "1",
                        "prompt_version": f"v{version}",
                    },
                },
            )
            version_results.append({
                "input": case["input"],
                "output": response.choices[0].message.content,
                "ideal": case.get("ideal_output", ""),
            })

        results[version] = version_results

    # 3. PROMOTE — Compare scores on the platform
    # View experiment results in the Respan dashboard

    # 4. ANALYZE — Review failures
    # Filter for low-scoring outputs, identify patterns
    # Use patterns to create v4, v5... for next iteration

    return results
```

Automate the monitoring

After promoting a version, set up continuous monitoring:

  1. Online evaluation: Automatically score 10-20% of production traffic
  2. Alert on regression: Set up an automation that alerts when average scores drop below your threshold
  3. Collect new failures: Periodically export low-scoring production logs to add to your test dataset
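The regression alert can be approximated with a rolling average over recent online-eval scores. A minimal sketch (the window size and threshold are placeholders for your own values; this is client-side logic, not a Respan API):

```python
from collections import deque


class RegressionMonitor:
    """Track a rolling average of evaluator scores and flag drops."""

    def __init__(self, window=50, threshold=3.5):
        self.scores = deque(maxlen=window)  # keep only the most recent scores
        self.threshold = threshold          # alert when the rolling avg falls below

    def record(self, score):
        """Add a score; return True if the rolling average signals a regression.

        Note: with few samples early on, the signal is noisy; in practice,
        wait for a full window before acting on alerts.
        """
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg < self.threshold
```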

Example: 3 iterations of GEPA

Iteration         Change                                                    Avg score  Improvement
Baseline (v1)     Simple instructions                                       3.2 / 5
Iteration 1 (v2)  Added few-shot examples                                   3.8 / 5    +19%
Iteration 2 (v4)  Added edge case handling from failure analysis            4.1 / 5    +8%
Iteration 3 (v6)  Switched to step-by-step reasoning for complex questions  4.4 / 5    +7%

Each iteration is small, measurable, and evidence-based.

Tips

  • Keep a changelog: Document what changed in each version and why
  • Don’t change too much at once: Isolate variables so you know what caused the improvement
  • Grow your dataset: Add production failures after every iteration
  • Set a target: Define what score counts as “good enough” so you know when to stop optimizing
  • Automate what you can: Use online evals and automations to reduce manual review work

Next steps