The first time a prompt change causes an incident, every engineering team has the same realization. The prompt lived in a Python string literal, the change passed code review because the diff looked harmless, and now users are getting wrong answers because the model interpreted "concise" differently than the previous "brief." There is no rollback button, no A/B test, no way to know which prompt version was running when a specific customer ticket was filed. The fix is prompt versioning.
This is the engineering pattern we recommend after watching thousands of teams ship and roll back prompt changes through Respan in production. Not theory. The actual schema, the actual versioning workflow, the gotchas that bite, and the features worth paying for in a prompt management tool.
TL;DR
- Prompts in source code are a regression machine. Treat prompts as configuration with versions, not as string literals.
- Five things every prompt deserves: a version number, commit-style change notes, a hash of the rendered prompt, a deployment stage label (dev/staging/prod), and a creator.
- A/B testing prompts on live traffic is the right way to validate a change. Offline evals catch known regressions. A/B traffic catches the new shape of customer questions you did not anticipate.
- Rollback should be one click. If your "rollback" requires a code deploy, you do not have prompt versioning, you have prompts in code.
- The cost of prompt versioning is small. A registry, a hash, a deployment-aware fetch in your application code, and a UI for non-engineers to read the history.
Why prompt versioning matters
A prompt is a contract between your application and the model. The contract says "given these instructions and this format, produce output in this shape." When the contract changes, behavior changes. The behavior change can be intentional (you wanted the model to be more cautious about citations) or accidental (you tweaked an example and the model now picks up a stylistic tic you did not intend).
Without versioning:
- You cannot answer "what was the prompt running when this customer's bad answer happened in production?"
- You cannot roll back a prompt change without a code deploy.
- You cannot run two prompt variants in parallel to see which one performs better on real traffic.
- A new teammate has no record of why the current prompt looks the way it does, only that it does.
With versioning, all of these become trivial. The cost is one configuration table and a slightly different way to fetch prompts in your application code.
The schema that works
Eight fields. Five of them do most of the work; the rest are the template body itself and audit metadata.
    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class PromptVersion:
        template_id: str      # stable identifier, e.g. "chat-assistant"
        version: int          # incremented per change
        template_body: str    # the prompt template with variables
        change_notes: str     # what changed and why
        rendered_hash: str    # sha256 of the template body (caching key)
        stage: str            # "dev" | "staging" | "prod" | "archived"
        created_by: str
        created_at: datetime

Why each one matters:
- template_id is the logical identity of the prompt. The chat-assistant prompt has one template_id regardless of how many versions exist.
- version is a monotonic integer. Easier than semver for prompts because changes are usually atomic rather than backward-compatible.
- change_notes is a free-text field. Treat it like a commit message. "Added explicit citation-format instruction after [Linear ticket]" beats "updated prompt."
- rendered_hash is the sha256 of the template body. Use it as the cache key for the provider prompt cache (see LLM cache layers). Two versions with identical rendered bodies share cache. Two versions that differ even by whitespace do not, which is the correct behavior.
- stage labels where this version is live. The fetch path uses it to pick the right version per environment.
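As a rough illustration of how the hash behaves, the sketch below fills in rendered_hash from the template body with hashlib.sha256. Field names follow the schema above; the body_hash and new_version helpers, the sample bodies, and the creator names are hypothetical and not part of any SDK.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PromptVersion:
    template_id: str
    version: int
    template_body: str
    change_notes: str
    rendered_hash: str
    stage: str
    created_by: str
    created_at: datetime


def body_hash(template_body: str) -> str:
    """Return the sha256 hex digest of the template body."""
    return hashlib.sha256(template_body.encode("utf-8")).hexdigest()


def new_version(template_id, version, template_body, change_notes, created_by, stage="dev"):
    # Hypothetical helper: computes the hash and timestamp so callers cannot forget them.
    return PromptVersion(
        template_id=template_id,
        version=version,
        template_body=template_body,
        change_notes=change_notes,
        rendered_hash=body_hash(template_body),
        stage=stage,
        created_by=created_by,
        created_at=datetime.now(timezone.utc),
    )


v17 = new_version("chat-assistant", 17, "Answer briefly.\n{message}", "initial", "alice")
v18 = new_version("chat-assistant", 18, "Answer briefly.\n{message}", "no-op re-save", "bob")
v19 = new_version("chat-assistant", 19, "Answer briefly. \n{message}", "trailing space", "bob")

print(v17.rendered_hash == v18.rendered_hash)  # identical bodies share a hash
print(v17.rendered_hash == v19.rendered_hash)  # a single added space changes the hash
```

Computing the hash inside the constructor helper, rather than accepting it as an argument, is one way to keep the stored hash from drifting away from the stored body.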
In our customer base, the implementations that work all have these five fields under different names. The ones that struggle are missing two or more.
The fetch pattern that does not break
Application code should never have a prompt string literal. It should fetch the prompt by template_id and environment:
    # `prompts` is the prompt registry client and `llm_call` is your model
    # wrapper; both are assumed to be defined elsewhere in the application.
    def call_assistant(user_message: str):
        prompt = prompts.get("chat-assistant", stage="prod")
        return llm_call(prompt.render(message=user_message))

The prompts.get(...) call resolves to the latest version with stage="prod". Cache the result locally for the lifetime of the process and refresh on a schedule (every few minutes). Do not refresh on every call. Two reasons:
- Latency. A prompt fetch on every call adds a network hop.
- Determinism. A long-running request that started under version 17 should finish under version 17 even if version 18 ships mid-call.
The right shape is "cache locally, refresh on a timer, log the version on every call." That last part matters for the next section.
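The "cache locally, refresh on a timer" shape might look roughly like the sketch below. The CachedPrompts class, the fetch callable's signature, and the dict-shaped prompt are all assumptions made for illustration, not a real client API. Note that this version refreshes lazily on the first call after the interval elapses instead of running a background timer, which is simpler but means that one call pays for the fetch.

```python
import time
from typing import Callable, Dict, Tuple


class CachedPrompts:
    """Hypothetical client-side cache: serve from memory, refetch after ttl_seconds."""

    def __init__(self, fetch: Callable[[str, str], dict], ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # assumed registry call: (template_id, stage) -> dict
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so tests do not have to sleep
        self._cache: Dict[Tuple[str, str], Tuple[float, dict]] = {}

    def get(self, template_id: str, stage: str) -> dict:
        key = (template_id, stage)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # fresh enough: no network hop
        prompt = self._fetch(template_id, stage)
        self._cache[key] = (now, prompt)
        return prompt


# Usage with a fake registry and a fake clock so the behavior is easy to follow.
calls = []
live_version = {"n": 17}


def fake_fetch(template_id, stage):
    calls.append((template_id, stage))
    return {"template_id": template_id, "version": live_version["n"], "body": "Answer: {message}"}


fake_now = [0.0]
prompts = CachedPrompts(fake_fetch, ttl_seconds=300.0, clock=lambda: fake_now[0])

first = prompts.get("chat-assistant", "prod")    # fetches version 17
live_version["n"] = 18                           # version 18 ships
second = prompts.get("chat-assistant", "prod")   # still served from cache: version 17
fake_now[0] = 301.0                              # refresh interval elapses
third = prompts.get("chat-assistant", "prod")    # refetches: version 18
```

A request that calls get once and holds on to the returned prompt finishes under the version it started with, even if the cache refreshes in the meantime.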
Logging the version with every LLM call
Every span that records an LLM call should carry the prompt template_id and version it used. With Respan, this is automatic when you fetch prompts through the SDK. Manually, you attach two attributes:
    # `span` is the active tracing span for this LLM call
    span.set_attribute("prompt_template_id", "chat-assistant")
    span.set_attribute("prompt_version", 17)

This is the smallest change that solves the biggest debug problem. "Did the prompt change between Monday and Tuesday" stops being a guess. "Which prompt version produced this bad answer" stops being a guess. The combination of template_id + version + rendered_hash gives you precise replay capability: given any production trace, you can pull the exact prompt that ran, the exact context that was in it, and rerun it in a notebook against the pinned model. See agent debugging for why this matters for production debugging.
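A replay lookup could look something like the sketch below. It assumes the registry can be read as a mapping keyed by (template_id, version) and that spans may also carry a prompt_rendered_hash attribute; the REGISTRY dict, the prompt_for_trace function, and that third attribute name are illustrative assumptions, not a Respan API.

```python
import hashlib

# Assumed in-memory stand-in for the registry: (template_id, version) -> template body.
REGISTRY = {
    ("chat-assistant", 17): "Answer briefly.\n{message}",
    ("chat-assistant", 18): "Answer concisely.\n{message}",
}


def prompt_for_trace(span_attributes: dict) -> str:
    """Return the template body a trace ran with, verifying the logged hash if present."""
    key = (span_attributes["prompt_template_id"], span_attributes["prompt_version"])
    body = REGISTRY[key]
    logged_hash = span_attributes.get("prompt_rendered_hash")
    if logged_hash is not None:
        actual = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if actual != logged_hash:
            raise ValueError(f"registry body for {key} does not match the logged hash")
    return body


# Attributes as they would be read back from a stored production span.
trace = {"prompt_template_id": "chat-assistant", "prompt_version": 17}
replayed = prompt_for_trace(trace).format(message="How do I reset my password?")
print(replayed)
```

The hash check is optional, but it catches the case where someone edited a version's body in place instead of creating a new version.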
A/B testing prompts on live traffic
Offline evals catch known regressions. They cannot catch the regression that only shows up on the new shape of user question your product surface just introduced. Live A/B is the answer.
The pattern that works:
    import hashlib

    def call_assistant(user_message: str, user_id: str):
        # Deterministic bucketing by user_id, 90/10 split. Python's built-in
        # hash() is salted per process for strings, so use a stable digest.
        digest = hashlib.sha256(user_id.encode("utf-8")).digest()
        bucket = "control" if int.from_bytes(digest[:8], "big") % 10 < 9 else "treatment"
        version = 17 if bucket == "control" else 18
        prompt = prompts.get("chat-assistant", version=version)
        return llm_call(prompt.render(message=user_message))

Log the bucket on the span. Run online evals (sampled production traffic, asynchronous LLM-as-judge) on both buckets. Wait for a statistically meaningful sample (usually 2-7 days for high-traffic apps, longer for low-traffic). Compare faithfulness, answer relevance, or whatever the relevant metric is. Ship the winner.
The trap to avoid: shipping the new version because the metrics look better after 6 hours. Bucketed A/B tests need real sample sizes. A 2% difference in faithfulness on 200 samples is noise.
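To put a number on "noise", here is a minimal sketch of a pooled two-proportion z-test using only the standard library. The 80% and 82% pass rates and the even split between buckets are made-up figures for illustration, and the test relies on a normal approximation, so treat it as a sanity check and not as a full experiment-analysis pipeline.

```python
import math


def two_proportion_p_value(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two pass rates (pooled z-test)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    # Standard error under the null hypothesis that both buckets share one rate.
    # This is zero (and the division below fails) if every sample passes or fails.
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    return math.erfc(abs(z) / math.sqrt(2))


# 80% vs 82% faithful answers with 100 samples per bucket (200 total).
small = two_proportion_p_value(80, 100, 82, 100)
# The same rates with 20,000 samples per bucket.
large = two_proportion_p_value(16000, 20000, 16400, 20000)

print(f"200 samples:    p = {small:.3f}")
print(f"40,000 samples: p = {large:.2e}")
```

With 200 samples the p-value comes out far above the conventional 0.05 threshold, so the 2-point gap is indistinguishable from chance; the same gap on 40,000 samples is not.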
For the broader telemetry pattern, see RAG observability and LLM evaluation in production.
Rollback in one click
If rolling back a prompt requires a code deploy, you do not have prompt versioning. You have prompts in code with a registry on top. The rollback button is the test.
The right shape:
- Each version has a stage field. stage="prod" means "this is live."
- Setting an older version's stage to prod and the current version's stage to archived is the rollback.
- Your application's prompt cache refreshes within a few minutes (your refresh interval), and the old version starts serving.
This works because nothing in your application code referenced the version number directly. It always asked for "latest prod" and got it.
For Respan users, the rollback is a UI action in the prompt registry. For self-built versioning, it is an UPDATE on the prompts table. Either way, no engineering involvement at the moment of rollback.
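For the self-built case, the rollback could be sketched as below with sqlite3. The table layout, the live_version and rollback function names, and the choice to archive every currently live row are assumptions for illustration; a real prompts table would also carry the body, notes, hash, and audit columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE prompts (template_id TEXT, version INTEGER, stage TEXT, "
    "PRIMARY KEY (template_id, version))"
)
conn.executemany(
    "INSERT INTO prompts VALUES (?, ?, ?)",
    [("chat-assistant", 17, "archived"), ("chat-assistant", 18, "prod")],
)
conn.commit()


def live_version(conn, template_id):
    """Return the highest version currently marked prod, or None if there is none."""
    row = conn.execute(
        "SELECT MAX(version) FROM prompts WHERE template_id = ? AND stage = 'prod'",
        (template_id,),
    ).fetchone()
    return row[0]


def rollback(conn, template_id, to_version):
    """Archive whatever is live and promote to_version in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE prompts SET stage = 'archived' WHERE template_id = ? AND stage = 'prod'",
            (template_id,),
        )
        cur = conn.execute(
            "UPDATE prompts SET stage = 'prod' WHERE template_id = ? AND version = ?",
            (template_id, to_version),
        )
        if cur.rowcount != 1:
            raise ValueError(f"no version {to_version} for {template_id}")


before = live_version(conn, "chat-assistant")
rollback(conn, "chat-assistant", 17)
after = live_version(conn, "chat-assistant")
print(before, after)
```

Running both updates in one transaction means a typo in the target version leaves the current prod version live instead of leaving no prod version at all.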
Gotchas we have seen
Mistakes the teams that struggle make, in rough order of frequency:
- Prompt body still in code as a fallback. "We have versioning AND a literal fallback for safety." The fallback becomes the actual prompt the day someone forgets to update it. Delete the fallback.
- Version field not logged on spans. When a regression appears, you cannot tell which version caused it. Always log it.
- Refreshing the prompt cache on every call. Adds 50-200ms per request and creates inconsistency mid-conversation. Cache locally, refresh on a timer.
- Treating major prompt rewrites as small versions. If you rewrote the entire system message, that is a v18 to v25 jump, not v18 to v19. Use change_notes to explain.
- No A/B traffic before shipping prompt changes. "It looked fine in dev" is what people say before the incident.
- Prompt registry that only engineers can edit. Product, support, and content people often have the best judgment about prompt tone and wording. Give them edit access with a review workflow.
- No connection between prompt versions and eval results. Each prompt version should accumulate eval scores over time so you can see which versions performed best on real metrics.
Build vs buy
The decision is the same as for any internal tool.
Build your own prompt versioning when:
- Your team is small and the surface area is one or two prompts.
- You have unusual requirements (regulated industry, custom approval workflows).
- You enjoy maintaining the tool.
Use a managed prompt management tool (Respan, Langfuse, Promptfoo, etc.) when:
- You have more than 5 prompts to manage.
- Non-engineers should be able to read and edit prompts.
- You want A/B testing, eval scoring, and observability connected in one platform.
Respan's prompt management ships with versioning, an editor, A/B deployment, and the same span data model as the rest of the platform, so prompt versions show up automatically on the traces they ran inside. The free tier covers most teams' early-phase usage. See the prompt management quickstart for the wiring.
FAQ
Should I version prompts in Git or in a separate system? A separate system. Git's strength is code review on small commits. Prompts change often, by non-engineers, and need a UI for diffing rendered output across versions. Git is a poor fit.
What about secrets in prompts? Do not put secrets in prompts. Variables that reference environment-specific configuration (API endpoints, customer identifiers) are fine. Actual secrets (keys, passwords) should never appear in a prompt registry because anyone with read access to the registry could see them.
How often should we ship prompt changes? As often as you have a reason. Daily is fine if you are iterating on quality. The point of versioning is that frequent changes become safe.
Does prompt caching break with versioning? No. The rendered hash is the cache key. Two prompt versions with identical rendered bodies share cache. Two that differ do not. See LLM cache layers for the cache mechanics.
How do I roll back if the new version has been live for a week and lots of state depends on it? Same as any other rollback. If new state is incompatible with the old prompt, you have a migration problem regardless of prompt versioning. Most production prompt rollbacks are recent (within days) and the state implications are minimal.
Should the prompt registry have approval workflows? For regulated industries, yes. For everyone else, depends on team trust. We see most teams operate without approvals and rely on the rollback path when something goes wrong. Some add review for prompts that affect specific high-stakes surfaces (compliance answers, legal responses).
Can I export the prompt history? Any production-grade registry should support export. Check before signing.