Caching cuts LLM cost and latency by 30-90%. Cache invalidation is what turns that win into an incident. A cached response served after the underlying model, prompt, or knowledge has changed is, at best, an outdated answer and, at worst, a hallucination your monitoring is treating as a cache hit.
Cache invalidation in traditional web systems is hard because state changes are hard to track. Cache invalidation in LLM applications is harder because there are more sources of state, and several of them are invisible to the application code that controls the cache. Most production incidents we see come from teams that ship caching and forget to wire invalidation to one or more of the six triggers below.
This article is the practical playbook: the six triggers that should invalidate an LLM cache, the cache-key design that makes invalidation cheap, the TTL recommendations per content class, the special problem with semantic caches, and how the Respan gateway handles invalidation per-customer.
TL;DR
- Six things should invalidate an LLM cache. Model version changes, prompt template changes, tool/function schema changes, RAG corpus updates, system prompt changes, and user/customer state changes.
- Most teams handle 2 of the 6. TTL and model name. The other four are silent bugs waiting to surface as a wrong answer in production.
- Cache key design beats invalidation logic. If your cache key includes every input that can change the output (model, prompt hash, tool-schema hash, RAG-corpus version, customer ID), most invalidation happens automatically because the key changes.
- TTL is your safety net, not your strategy. Use TTLs (24h-30d depending on content volatility) so bugs in versioned invalidation are bounded in blast radius.
- Semantic caches are harder to invalidate than exact-match caches because keys are vectors, not strings. Plan invalidation up front or skip semantic caching.
Why LLM cache invalidation is different
Traditional cache invalidation handles two cases: time passes (TTL), or some application event signals "this data is now stale" (event-based, e.g. user updates their profile, invalidate their profile cache).
LLM applications have a third case: the output of the same input changes silently from sources outside your application. The user query "what's our refund policy" produces output X today. Tomorrow your team:
- Switches the assistant from gpt-5.0 to gpt-5.4. Output changes.
- Updates the system prompt to add a tone instruction. Output changes.
- Adds a new tool to the function-calling list. Output may now call that tool.
- Reindexes the RAG corpus with the new refund policy doc. Output changes.
- Updates the prompt template (now passes user's tier). Output changes per-customer.
If your cache key was just hash(user_query), every one of these changes silently serves yesterday's answer. You won't see it as a cache bug. You'll see it as a "the assistant is giving wrong answers to customers" support ticket weeks later, when nobody remembers which change went out or when.
The fix is to make the cache key include every dimension that can change the output, plus to wire explicit invalidation for the dimensions you can't include cheaply.
The 6 triggers that should invalidate an LLM cache
Trigger 1: Model version
Different model versions produce different outputs for the same input. Even minor version bumps (gpt-5.0 → gpt-5.0-2026-04-15) shift behavior on edge cases. Major bumps (gpt-5.0 → gpt-5.4) shift behavior everywhere.
The pattern: include the full model identifier in the cache key. cache_key = hash(model_id, query, params). When you upgrade, the cache key changes, old entries expire by TTL, no manual invalidation needed.
Gotcha: if you alias model names in your gateway config (e.g. default-fast → gpt-5.4-mini), include the resolved model, not the alias. Otherwise switching what default-fast points to silently serves old cached answers.
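A minimal sketch of the alias gotcha, assuming aliases live in a plain dictionary. `MODEL_ALIASES`, `resolve_model`, and `model_scoped_key` are hypothetical names for illustration, not part of any gateway API. The point is only that the key is built from the resolved model, so repointing the alias changes the key.

```python
import hashlib

# Hypothetical alias table; in practice this would come from gateway config.
MODEL_ALIASES = {"default-fast": "gpt-5.4-mini"}


def resolve_model(name: str, aliases: dict[str, str]) -> str:
    """Follow alias links until a concrete model ID is reached."""
    seen = set()
    while name in aliases:
        if name in seen:
            raise ValueError(f"alias cycle at {name!r}")
        seen.add(name)
        name = aliases[name]
    return name


def model_scoped_key(model_or_alias: str, query: str, aliases: dict[str, str]) -> str:
    """Build a cache key from the resolved model ID, never the alias."""
    resolved = resolve_model(model_or_alias, aliases)
    digest = hashlib.sha256(query.encode()).hexdigest()
    return f"{resolved}:{digest}"
```

With this shape, changing what `default-fast` points to yields a different key for the same query, so old entries simply stop being hit.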
Trigger 2: Prompt template version
When you change the system prompt or the prompt template, every cached response that was generated under the old prompt is now stale. The user's question hasn't changed, but the assistant's behavior contract has.
The pattern: version your prompts and include the version in the cache key. cache_key = hash(prompt_version, model_id, query, params). Tools like Respan prompt management version prompts automatically; if you store prompts in code, use the git SHA of the prompt file or a manually-bumped version constant.
Gotcha: if your prompt template includes injected variables (user tier, locale, time of day), the hash needs to cover the resolved template, not the unbound template.
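One way to hash the resolved template rather than the unbound one, sketched with the standard library's `string.Template`. The function name and the `$tier` / `$locale` placeholders are assumptions for illustration; your templating system will differ.

```python
import hashlib
from string import Template


def resolved_prompt_hash(template: str, variables: dict[str, str]) -> str:
    """Hash the prompt after variables are substituted, not the raw template."""
    # substitute() raises KeyError on a missing variable, which is safer than
    # silently hashing a half-rendered prompt.
    rendered = Template(template).substitute(variables)
    return hashlib.sha256(rendered.encode()).hexdigest()[:16]
```

Note the tradeoff: every injected variable becomes a key dimension, so a high-cardinality variable such as time of day will cut the hit rate sharply.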
Trigger 3: Tool / function schema
For function-calling and tool-using assistants, the set of available tools influences every response. Add a new tool (refund_order) and the assistant's behavior on refund questions changes. Remove a tool and the assistant may fall back to text it would never have generated before.
The pattern: hash the tool list (names, descriptions, parameter schemas) and include the hash in the cache key. cache_key = hash(prompt_version, tools_schema_hash, model_id, query, params).
Gotcha: tool descriptions are part of the assistant's behavior. Editing a description (even for "clarity") changes how often the assistant uses the tool. Include descriptions in the hash, not just names and parameter schemas.
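A small sketch of a tool-schema hash that covers names, descriptions, and parameter schemas. `tools_schema_hash` is a hypothetical helper and the tool dictionaries are illustrative; the list order is deliberately kept as-is, on the assumption that tool order may itself influence the model.

```python
import hashlib
import json


def tools_schema_hash(tools: list[dict]) -> str:
    """Hash the full tool definitions: names, descriptions, parameter schemas."""
    # sort_keys and fixed separators give a canonical encoding, so two dicts
    # with the same content but different key order hash identically.
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Because the whole definition is serialized, a description edit made "for clarity" changes the hash and therefore the cache key.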
Trigger 4: RAG corpus updates
Retrieval-augmented generation pulls context from a knowledge base. When you reindex the corpus, the same user query retrieves different documents and produces different answers. Cached responses generated against the old corpus are stale.
The pattern: version your corpus. Every reindex bumps a corpus_version. Include it in the cache key. cache_key = hash(prompt_version, corpus_version, tools_schema_hash, model_id, query, params).
Gotcha: if your RAG retrieval is non-deterministic (cosine similarity ties broken by vector store ordering), even identical inputs can produce different cached answers across runs. Either freeze the random seed or accept that RAG-backed responses cache less effectively than non-RAG ones.
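As an illustration of reducing retrieval nondeterminism, the sketch below swaps in a deterministic tie-break (by document ID) instead of seed freezing. It assumes the vector store returns `(doc_id, similarity)` pairs and that re-ranking happens in application code; `rank_hits` is a hypothetical helper.

```python
def rank_hits(hits: list[tuple[str, float]], k: int) -> list[str]:
    """Return the top-k doc IDs, breaking similarity ties by doc ID."""
    # Sort by similarity descending, then doc_id ascending, so equal scores
    # always come back in the same order regardless of store ordering.
    ordered = sorted(hits, key=lambda hit: (-hit[1], hit[0]))
    return [doc_id for doc_id, _ in ordered[:k]]
```

This only helps if the candidate set itself is stable; if the store truncates to k before returning, ties at the cutoff can still vary, so it may be worth fetching more than k candidates and trimming here.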
Trigger 5: System prompt or persona changes
A subtype of prompt versioning, but worth calling out because most teams version their user-facing prompt template and forget about the assistant's system prompt or persona. Changing "You are a helpful assistant" to "You are a concise technical assistant" changes responses dramatically.
The pattern: include the system prompt in prompt_version or hash it separately. Treat persona changes as a cache-invalidating event, the same way you'd treat a code deployment.
Trigger 6: User / customer state
If the assistant's response depends on the user (their account tier, their preferences, their purchase history), the cache cannot be shared across users without producing wrong answers.
The pattern: include a customer_identifier in the cache key. The Respan gateway exposes this as cache_options.cache_by_customer. Set it to true and the gateway scopes the cache per-customer automatically.
Gotcha: the inverse problem is worse. If a response IS shareable across users but you key by customer, no request can ever reuse another customer's entry, so your cross-customer hit rate is 0. The judgment call is: would I be comfortable serving this user's cached response to a different user? If yes, don't key by customer. If no, do.
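The per-user judgment call can be made explicit per route. The sketch below is one possible shape, assuming a hand-maintained route table; `ROUTE_IS_PERSONALIZED` and `cache_options_for` are hypothetical, while `cache_enabled` and `cache_options.cache_by_customer` are the gateway parameters named in this article.

```python
# Hypothetical route table: True means the response depends on the customer.
ROUTE_IS_PERSONALIZED = {
    "refund_policy": False,  # same answer for everyone, safe to share
    "order_status": True,    # depends on the customer's own orders
}


def cache_options_for(route: str) -> dict:
    """Build the cache portion of extra_body for a given route."""
    # Unknown routes default to per-customer scoping: a lower hit rate is
    # safer than serving one customer's answer to another.
    personalized = ROUTE_IS_PERSONALIZED.get(route, True)
    return {
        "cache_enabled": True,
        "cache_options": {"cache_by_customer": personalized},
    }
```

The returned dictionary is meant to be merged into the `extra_body` of the gateway call for that route.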
Cache key design (the heart)
Good invalidation flows from good keys. The cache key for an LLM response should hash, at minimum:

    import hashlib
    import json


    def cache_key(
        model_id: str,
        prompt_version: str,
        tools_schema: list[dict],
        corpus_version: str,
        customer_id: str | None,
        user_message: str,
        params: dict,
    ) -> str:
        # Canonical JSON (sorted keys) so dict ordering never changes the hash.
        tools_hash = hashlib.sha256(
            json.dumps(tools_schema, sort_keys=True).encode()
        ).hexdigest()[:12]
        params_hash = hashlib.sha256(
            json.dumps(params, sort_keys=True).encode()
        ).hexdigest()[:12]
        key_parts = [
            model_id,  # pass the resolved model ID, not a gateway alias
            prompt_version,
            tools_hash,
            corpus_version,
            customer_id or "shared",  # None means the entry is shared across customers
            params_hash,
            hashlib.sha256(user_message.encode()).hexdigest(),
        ]
        return ":".join(key_parts)

With keys structured this way, four of the six triggers (model, prompt, tools, corpus) invalidate automatically by changing the key. The remaining two (system prompt, customer state) are folded into prompt_version and customer_id respectively.
The cost is hit-rate dilution. Each dimension in the key reduces the chance two requests share a cache entry. The tradeoff worth making: a 20% hit rate that always returns correct answers beats a 40% hit rate that returns wrong answers 5% of the time.
TTL playbook by content class
TTL is your safety net for whatever versioning logic missed. Recommended bounds:
| Content class | TTL recommendation | Reasoning |
|---|---|---|
| Pricing, policy, compliance answers | 1-6 hours | High update frequency, high cost of stale |
| Product feature explanations | 6-24 hours | Moderate update frequency |
| Conceptual / educational content | 24-72 hours | Rarely changes day-to-day |
| Generic FAQ ("how do I reset my password") | 7-30 days | Almost never changes |
| Internal-tool queries | 1-24 hours | Match to data refresh cadence |
| RAG-backed answers | corpus refresh interval | Bound to next reindex |
| Anything with model_id in key | 30 days max | Bounded by next model deprecation |
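The table can be carried into code as a lookup, sketched below. The class names and the `ttl_for` helper are hypothetical; the bounds are transcribed from the rows above, and the fallback to the tightest class for unknown content is an assumption, not a recommendation from the table. The RAG and model_id rows are omitted because they depend on your reindex and deprecation schedules.

```python
HOUR = 3600
DAY = 24 * HOUR

# (min_ttl, max_ttl) in seconds, transcribed from the table above.
TTL_BOUNDS = {
    "policy": (1 * HOUR, 6 * HOUR),
    "product_feature": (6 * HOUR, 24 * HOUR),
    "educational": (24 * HOUR, 72 * HOUR),
    "generic_faq": (7 * DAY, 30 * DAY),
    "internal_tool": (1 * HOUR, 24 * HOUR),
}


def ttl_for(content_class: str, conservative: bool = True) -> int:
    """Return a TTL in seconds; unknown classes get the tightest bounds."""
    low, high = TTL_BOUNDS.get(content_class, TTL_BOUNDS["policy"])
    return low if conservative else high
```

For example, `ttl_for("policy")` gives 3600, which can be passed as `cache_ttl` at the call site.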
The Respan gateway's cache_ttl parameter defaults to 30 days. For policy-sensitive workloads, set it lower at the gateway call site:

    from openai import OpenAI

    # Point the OpenAI client at the Respan gateway.
    client = OpenAI(
        api_key="YOUR_RESPAN_API_KEY",
        base_url="https://api.respan.ai/api/",
    )

    resp = client.chat.completions.create(
        model="gpt-5.4",
        messages=[
            {"role": "system", "content": "You are a refund policy assistant."},
            {"role": "user", "content": "How do I get a refund?"},
        ],
        # Gateway-specific cache settings travel in extra_body.
        extra_body={
            "cache_enabled": True,
            "cache_ttl": 3600,  # 1 hour - policy content
            "cache_options": {
                "cache_by_customer": False,  # Same policy for everyone
            },
        },
    )

For prompt-cache layers (OpenAI, Anthropic), TTL is provider-controlled: OpenAI's basic cache has a 5-10 minute TTL with up to 24 hours extended, and Anthropic uses 5-minute and 1-hour breakpoints. See our prompt caching guide for the details. Those layers refresh too frequently for staleness to be your top concern; your application-level cache (gateway or in-process) is where invalidation logic lives.
Invalidation patterns
Four patterns, ranked by complexity:
Pattern 1: TTL only. Pick a tight enough TTL that staleness is acceptable. Easiest to reason about, no invalidation logic needed. Fails when an urgent correction needs to ship faster than the TTL window. Use as the default and the fallback for everything else.
Pattern 2: Key versioning. Bump a version constant (prompt_version, corpus_version) on deploy, old keys age out. No active invalidation, no pub/sub, but old entries linger in the cache store until TTL expires. Effective when (a) you can afford the cache misses on the cold start after the bump, and (b) you have TTL discipline so old keys don't accumulate forever.
Pattern 3: Explicit delete by key prefix. On deploy, fire a job that scans the cache store for keys matching the retired prefix (for example gpt-5.0:* when the model ID leads the key) and deletes them. Requires the cache store to support prefix scans (Redis SCAN, most KV stores). Faster cleanup than key-versioning alone, but adds operational complexity.
Pattern 4: Pub/sub invalidation broadcast. On an event (model swap, prompt deploy, corpus reindex), publish to a topic that all application instances subscribe to and flush relevant entries. Highest complexity, lowest latency. Worth it for workloads where (a) staleness window is measured in seconds, not hours, and (b) you have the infra to operate a pub/sub layer. Most LLM applications don't.
For most teams, the right answer is TTL + key versioning. Skip the operational complexity of patterns 3 and 4 unless you have a specific reason.
The special problem with semantic caches
Semantic caches make invalidation dramatically harder than exact-match caches. The reason: keys are not strings. They are vectors. You cannot prefix-scan a vector index for "all entries where prompt_version was X" because the version isn't in the key, the key is the embedding of the query.
Three options for semantic-cache invalidation, none great:
- Maintain a parallel metadata store keyed by entry ID, with prompt_version, model_id, corpus_version. On invalidation event, scan the metadata, find the affected entries, delete them by ID from the vector store. Doubles the storage cost and adds a join.
- Tag entries at write time with the dimensions you might want to invalidate by, if your vector store supports metadata filtering (Pinecone, Qdrant, Weaviate do; raw pgvector requires hybrid query setup). Then invalidate with a filtered delete.
- Just flush the whole semantic cache on any invalidation event. Cold-start cost but operationally simple. Acceptable for low-traffic workloads.
This is one of the reasons we recommend exact-match caching at the gateway as the default, with semantic caching as an opt-in layer only when the cost math justifies the operational complexity.
How the Respan gateway handles invalidation
The Respan gateway ships exact-match caching with a small set of invalidation knobs that map to the patterns above:
- cache_ttl (per-call, seconds, default 30 days). Bound your staleness window. Set lower for policy-sensitive content.
- cache_options.cache_by_customer (boolean). When true, scopes the cache per-customer using your customer_identifier. Use for personalized responses; skip for shareable ones.
- cache_options.is_cached_by_model (boolean). When true, includes the model in the cache key. Defaults true. Off only when you genuinely want responses cached across models (rare).
- cache_options.omit_log (boolean). Doesn't invalidate but worth knowing. Keeps the cache hit from appearing in traces, which is useful for cost-attribution but can hide cache bugs from your observability. Default off; flip on cautiously.
Cache hits show up as the respan/cache model tag in your traces. When you suspect an invalidation bug ("we shipped the new prompt but users are still getting old answers"), filter traces by model=respan/cache to see exactly which queries are being served from cache and verify their age. See LLM observability for the trace patterns.
For prompt version and tool schema invalidation, there is no separate knob. You change the prompt or tool list at the call site (adjusting customer_identifier or cache_ttl if the new behavior calls for it), the gateway hashes the resolved request, and old keys age out naturally.
Common gotchas
Mistakes we see most often:
- Cache key doesn't include model ID. Switching models silently serves old answers. Always include the resolved model name.
- Prompt version constant gets forgotten. Team ships a prompt change in code, forgets to bump the version constant, cache returns yesterday's responses. Tie the version constant to the prompt file's git SHA, or use a prompt-management system that tracks versions automatically.
- Tool schema included by name only, not parameter schema. Edit a tool's parameter description, behavior changes, cache returns old. Hash the full schema, not just the tool name.
- cache_by_customer set globally instead of per-route. Result: either 0% hit rate (everything keyed by customer when it shouldn't be) or wrong-customer answers (nothing keyed by customer when some routes should be). Decide per-route.
- No upper-bound TTL. Cache entries from 6 months ago linger because nothing ever evicts them and the cache store has no max size set. Always set a TTL; always set a max cache size at the store level.
- Invalidating on the wrong event. "Invalidate when the user updates their account" is too coarse. Most account fields don't influence assistant responses. Invalidate when fields that the prompt or tools depend on change, not on every update.
- No way to flush a specific entry urgently. Customer reports "the assistant told me the wrong refund window." You have no way to invalidate just that response. Build a single admin path for emergency flush-by-key from day one.
- No staleness telemetry. You can't fix invalidation bugs you don't measure. Log cache age on every hit and alert when responses are older than your expected TTL bound.
FAQ
What's the most common LLM cache invalidation bug?
Forgetting to invalidate on prompt version changes. Engineers ship a system prompt update, the cache is still keyed by an older prompt_version, and users continue receiving responses generated under the old prompt for as long as the TTL window. Catch it by tying prompt_version to the git SHA of the prompt file.
Should I invalidate cache on every deploy?
No. Most deploys don't change LLM behavior. Invalidate when the inputs to your cache key change: model swap, prompt update, tool schema edit, corpus reindex, system prompt rewrite. Code deploys that don't touch any of these don't need invalidation.
How do I invalidate a single cached answer?
Build an admin endpoint that accepts a cache key and deletes it from the cache store. You'll need it the first time a customer reports a wrong cached answer. The Respan gateway exposes the cache key in trace metadata; copy it from the trace and pass it to your admin endpoint.
Can I cache RAG responses safely?
Yes, but include the corpus version in the cache key and either freeze the retrieval seed or accept that identical inputs may return different cached answers across runs depending on which corpus snapshot was active when each was cached.
How long should the cache TTL be?
Bound by the staleness you can tolerate. Policy / pricing / compliance: 1-6 hours. Conceptual / educational: 24-72 hours. Generic FAQ: 7-30 days. The Respan gateway default is 30 days; lower it per-call for high-volatility content.
Does prompt caching (OpenAI, Anthropic) require its own invalidation?
No, the provider manages it. OpenAI's basic cache has a 5-10 minute TTL with up to 24 hours extended; Anthropic has 5-minute and 1-hour breakpoints. Both refresh automatically. Your application-level cache (in-process or gateway) is where invalidation logic lives.
What happens to old cache entries after I bump prompt_version?
They age out by TTL. With key-versioning alone, they'll sit in the cache store consuming memory until the TTL expires. For tighter cleanup, either set a TTL aligned with your prompt deploy cadence, or scan and delete entries with the old prefix on deploy.
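A sketch of the scan-and-delete cleanup, written against any client that exposes `scan_iter(match=..., count=...)` and `delete(*names)`, which is the shape of the redis-py client. The helper name, the batch size, and the assumption that the retired version sits at the front of the key are all illustrative.

```python
def delete_by_prefix(store, prefix: str, batch_size: int = 500) -> int:
    """Delete every key that starts with prefix; returns the number deleted."""
    # Note: the prefix is used inside a glob pattern, so characters such as
    # * ? [ ] in the prefix would be interpreted as wildcards.
    deleted = 0
    batch: list[str] = []
    for key in store.scan_iter(match=f"{prefix}*", count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += store.delete(*batch)  # delete in batches, not per key
            batch.clear()
    if batch:
        deleted += store.delete(*batch)
    return deleted
```

Running this as a post-deploy job with the old prefix frees memory immediately instead of waiting for TTL expiry.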
Should I use pub/sub for cache invalidation in LLM apps?
Probably not. The operational complexity of running a pub/sub layer for cache invalidation is only worth it when staleness windows under 1 minute matter. For most LLM applications, TTL plus key-versioning is sufficient.
Related
- Semantic Cache for LLMs: When to Ship, When to Skip. Why semantic caches are harder to invalidate.
- Prompt Caching: OpenAI, Anthropic + Gateway. The provider-side cache layer with provider-controlled invalidation.
- LLM Cache Layers. The 3-layer cache architecture (provider, exact-match, semantic).
- LLM Gateway: The Complete Guide. Where cache invalidation fits in the overall architecture.
- LLM Observability. How to wire cache hits and cache age into your traces.
- How to Reduce OpenAI API Costs. The broader cost playbook caching plugs into.