Search for "llm cache" and the first ten results are talking about at least three different things. KV-cache libraries that speed up inference inside a model server. Prompt caches that providers like Anthropic and OpenAI bill at a discount. Semantic caches that match user queries to past answers via embedding similarity. They are all called "cache." They solve different problems. And the wrong one for your workload will cost you money or burn correctness instead of saving either.
This is the practical version for engineers building on top of LLM APIs. Three layers, what each does, when each pays off, the hit rate to expect, and the production gotcha that will trip you up the first time. Skip the wrong-fit cache and you save 70 to 90% on the bill that matters.
TL;DR
- Three cache layers that matter for application code: provider prompt cache, exact-match cache, semantic cache.
- Provider prompt cache (Claude, OpenAI) cuts input-token cost roughly 10x for repeated prefixes. Closest to free wins, do this first.
- Exact-match cache keyed on a prompt hash is the safest and cheapest. It only helps when the exact same prompt actually repeats, which is a smaller share of traffic than most teams expect.
- Semantic cache matches similar-but-not-identical queries. Highest hit rate, highest risk of returning the wrong answer. Only ship behind a confidence threshold and only for tolerant-of-wrong-answer use cases.
- Hit rate is your top metric. If your cache hit rate is under 15% in production, you are spending more on infrastructure than you save on inference.
Why "cache" means three different things
The three layers operate at different points in the request path. A typical production call goes:
user query
-> your code: build prompt
-> cache check (your code: exact or semantic)
-> LLM API call
-> provider-side prompt cache check (Anthropic, OpenAI)
-> model inference
-> response
Provider prompt cache runs inside the LLM service. Your exact-match and semantic caches run in your code, before the API call. Each layer can hit independently. The math compounds. A query that hits your exact-match cache costs zero. A query that misses your exact cache but hits the provider's prompt cache for the system prompt costs roughly 10% of input price for the cached portion plus full price for the rest.
If you only set up one of the three, set up the provider prompt cache. It is the closest to free and works for the cases the other two can't catch.
Layer 1: Provider prompt cache
What it is. Anthropic and OpenAI both let you mark stable prefixes of a prompt as cacheable. On the next call within a short TTL with the same prefix, the model reads those tokens from cache at roughly 10% of input-token price.
When it helps. Anywhere you reuse long context: a 15K-token system prompt, a 50K-token RAG context block, conversation history that accumulates across turns. The pattern hits for any workload where most of the prompt repeats.
Hit rate to expect. With sensible breakpoint placement, you should see cache reads on 80 to 95% of input tokens in a high-volume application. For Claude specifically, we walked through the math and the code patterns in detail in Claude prompt caching pricing. For OpenAI's equivalent (Automatic Prompt Caching), the behavior is similar but the controls are different.
The gotchas.
- Cache breakpoint position is exact. A single whitespace change in the prefix breaks the cache.
- Tool definitions count as part of the prefix. Reordering or renaming a tool busts the cache for everything after it.
- The default TTL is 5 minutes from last hit. Bursty workloads with 10-minute gaps fall out of cache. Anthropic's 1-hour TTL is worth the higher write multiplier for shared system prompts.
The contrarian take. Most teams place breakpoints based on what seems "stable" without measuring. The correct way is to log cache_creation_input_tokens and cache_read_input_tokens per call, plot the ratio over a week, and only adjust breakpoints when the read ratio drops below 0.8.
Layer 2: Exact-match cache
What it is. Hash the full prompt (system + tools + messages + tool definitions). Look up in Redis or another KV store. If hit, return the cached completion. If miss, call the API and store the result.
When it helps. Repeat queries with the same exact prompt. Classification tasks where the same input arrives many times. Eval pipelines that re-run on the same golden set. Anything with low input diversity.
Hit rate to expect. Depends entirely on your workload. We see 5 to 20% for general chat applications. 40 to 70% for narrow classification or extraction tasks where the input distribution is small. Sub-5% for open-ended user queries (don't bother).
A minimal implementation in Python:
import hashlib
import json
from redis import Redis
redis = Redis(decode_responses=True)
def cache_key(model: str, messages: list, tools: list | None) -> str:
payload = json.dumps({"model": model, "messages": messages, "tools": tools or []}, sort_keys=True)
return "llm:" + hashlib.sha256(payload.encode()).hexdigest()
def cached_call(client, model, messages, tools=None, ttl=3600):
key = cache_key(model, messages, tools)
cached = redis.get(key)
if cached:
return json.loads(cached)
resp = client.messages.create(model=model, max_tokens=2048, messages=messages, tools=tools)
redis.setex(key, ttl, json.dumps(resp.model_dump()))
return resp.model_dump()The gotchas.
- The key must include the model name. Cache hits across model versions will return stale answers.
- Temperature above 0 makes exact-match cache philosophically weird. You're returning a single sampled response for what is supposed to be a distribution. Fine for many use cases, broken for tasks that rely on diversity.
- The key must include the system prompt and tools. We have seen teams omit these because "they don't change much" and then ship a prompt update that quietly returns yesterday's answers.
The contrarian take. Exact-match cache has bad PR. It looks unsophisticated next to semantic search. Skip semantic and add exact-match first. For most production workloads, exact-match plus provider prompt cache catches 90% of the savings semantic would catch, at zero correctness risk.
Layer 3: Semantic cache
What it is. Embed the incoming query. Search a vector store of past queries plus their cached responses. If the cosine similarity to a past query exceeds a threshold (commonly 0.93 to 0.97), return the cached response. Otherwise call the model.
When it helps. High-traffic user-facing systems where users ask the same question phrased many ways. Customer support FAQs. Documentation search assistants. "How do I do X?" patterns.
Hit rate to expect. With a 0.95 threshold, we see 20 to 40% hit rate on conversational and support workloads. Higher than exact-match, but the hits include some borderline matches.
The gotchas. This is where most semantic cache deployments break.
- The threshold is everything. Set it at 0.97 and your hit rate drops to 5%. Set it at 0.90 and you start returning answers to questions that were similar-sounding but meant different things. "How do I cancel my subscription?" and "How do I cancel my order?" both about cancellation, totally different answers.
- Single-turn assumption. Most semantic cache implementations embed only the user query, not the conversation. A follow-up question in a multi-turn chat will match cached responses from totally different contexts.
- Drift. The corpus of cached responses ages. A question about a feature that was removed last month will still hit cache for weeks.
- No safety net. Unlike exact-match, a wrong semantic match is invisible. The user gets a confidently wrong answer with no easy way to detect it.
A minimal implementation:
from openai import OpenAI
import numpy as np
client = OpenAI()
SIMILARITY_THRESHOLD = 0.95
def embed(text: str) -> np.ndarray:
r = client.embeddings.create(model="text-embedding-3-small", input=text)
return np.array(r.data[0].embedding)
def semantic_cache_lookup(query: str, cache_store):
q_vec = embed(query)
# cache_store: list of {"query": str, "vec": np.ndarray, "response": str}
if not cache_store:
return None
sims = [(np.dot(q_vec, item["vec"]) / (np.linalg.norm(q_vec) * np.linalg.norm(item["vec"])), item) for item in cache_store]
sims.sort(key=lambda x: x[0], reverse=True)
best_sim, best = sims[0]
if best_sim >= SIMILARITY_THRESHOLD:
return best["response"]
return NoneIn real production you would use a proper vector index (Pinecone, pgvector, or similar) and store cache entries with TTLs.
The contrarian take. Semantic cache is the most-talked-about and the least-deployed cache layer for a reason. The correctness risk is real and hard to monitor. Most teams that ship it end up disabling it for any high-stakes use case (medical, legal, financial). Use it only where a confidently-wrong-but-related answer is acceptable.
The decision matrix
When to add each layer:
| Workload | Provider prompt cache | Exact-match cache | Semantic cache |
|---|---|---|---|
| RAG with long retrieved context | Yes, always | Maybe, if queries repeat | Risky, false matches likely |
| Classification or extraction | Yes if system prompt is long | Yes, often 50%+ hit rate | No, overkill |
| Customer support chat | Yes | Yes, for canned questions | Maybe, behind a confidence gate |
| Code generation | Yes | Rarely repeats | No, code intent varies |
| Eval pipelines | Yes | Yes, huge wins on re-runs | No |
| Open-ended chat | Yes | Low hit rate | Maybe, with care |
If you are starting from zero today, the order I would ship in is: provider prompt cache, then exact-match cache for any narrow workload, then semantic cache last and only with measurement infrastructure ready.
Wiring caches into your observability
The metric that tells you whether a cache is working is hit rate, broken down by layer. Without it you cannot tell which layer is paying off.
Three attributes to attach per request:
cache.provider_read_tokens(from the API response)cache.exact_hit(boolean)cache.semantic_hit_similarity(float, or null on miss)
Then dashboard them. A healthy stack shows provider cache reads as the bulk of input tokens, exact-match catching a steady fraction, and (if deployed) semantic hits sitting at a known similarity range. See LLM observability for the broader telemetry picture and LLM gateway for where to host the cache layers (your gateway is the right place for both exact-match and semantic).
Production gotchas
Mistakes we have watched teams make:
- Caching at temperature > 0 without telling users. Two users get the same response to the same query when one of them expected diversity (creative writing, brainstorming). Either set temperature=0 for cached calls, or skip cache for those endpoints.
- No TTL on the cache. Stale answers persist for months. Set a TTL appropriate to your content (1 hour to 7 days depending on volatility).
- Caching errored responses. A 429 or a malformed JSON gets stored and returned on next hit. Always check the response shape before storing.
- Inconsistent cache keys across deploys. A change to how you build the cache key invalidates the entire cache silently. Version the keying logic.
- Forgetting that prompt cache breakpoints are positional. Reorder one sentence in your system prompt and the provider cache drops to 0%.
- Measuring cost saved without measuring correctness lost. Semantic cache costs go down. So do the bug reports about it returning wrong answers. Wire eval scores on cached vs. uncached responses.
FAQ
What's the difference between LLM cache and KV cache? KV cache is internal to the model server (vLLM, TGI, lmcache.ai). It speeds up token generation by reusing attention key-value pairs across tokens. It is invisible to your application code. The three layers in this article all live above the KV cache.
Is provider prompt cache enabled by default?
For OpenAI's Automatic Prompt Caching, yes, on supported models. For Anthropic, no. You must mark cache breakpoints explicitly with cache_control. See Claude prompt caching pricing for details.
How long should cache TTL be? Provider prompt cache: 5 minutes default, 1 hour extended. Exact-match cache: 1 hour to 24 hours for most workloads, longer for eval pipelines. Semantic cache: shorter, 1 to 6 hours, because the corpus drifts.
Will caching break my A/B testing? Yes, if the prompt variants share a cache and you don't include the variant ID in the cache key. Always include experiment flags in your hash.
Can I cache streaming responses? Yes. Store the assembled response, replay it on cache hit. Most clients will not notice the difference. If you must preserve streaming, replay chunks with a small artificial delay so the consumer code paths exercise.
Should I cache tool calls? Carefully. Tool calls have side effects. A cached "send email" call is a bug. Only cache pure-read tool calls (lookups, searches) and put the cache inside the tool implementation, not at the LLM-response layer.
What's the highest-ROI single cache I can add right now? Provider prompt cache on your longest stable prefix. Usually a 30-minute change for a 70 to 90% cut on input-token cost for that prefix.