Semantic caching is the LLM cost-reduction technique most teams pick up second, after provider prompt caching disappoints them. The pitch is obvious: instead of caching exact-string matches, cache "queries that mean the same thing." A user asks "how do I cancel my subscription?" and a previous user asked "cancel my plan how?" Both should return the same cached answer.
Done well, it cuts cost 30-50% on conversational workloads and improves latency to sub-100 ms. Done poorly, it confidently answers a question the user never asked, fails silently because nothing logs a "false positive cache hit," and erodes user trust.
The difference between done-well and done-poorly is one number: the similarity threshold. This article is the practical guide we recommend at Respan after watching customers ship and (occasionally) unship semantic cache in production. It covers what semantic cache is, the threshold trade-off in detail, the workloads where it wins and where it backfires, code for two implementations, and how to measure whether yours is hurting your users.
TL;DR
- Semantic cache embeds each incoming query, searches a vector store of past queries plus their cached responses, and returns the cached response if cosine similarity to the closest match exceeds a threshold.
- The threshold is the whole decision. 0.97 = high precision, low hit rate (5-10%). 0.93 = balanced (25-40% hit rate). 0.88 = high hit rate but real correctness risk. Most teams that ship and fail land between 0.88 and 0.91.
- Workloads where it wins: customer support FAQs, internal-tool canned queries, documentation Q&A. Conversational repeat patterns with stable canonical answers.
- Workloads where it backfires: open-ended generation, code completion, anything where small input differences produce meaningfully different outputs, regulated industries where confident-wrong is a compliance event.
- You can't ship semantic cache without an eval loop. Unlike exact-match or provider prompt cache (where wrong hits are mostly impossible), semantic cache failures are silent. You need to sample and grade cache hits to detect false positives before users do.
- For most teams the right path is: exact-match cache first (zero correctness risk, easy wins), provider prompt cache for long prefixes (free win), and only add semantic cache when (a) exact-match isn't catching enough and (b) you have the eval discipline to monitor it.
What semantic cache actually is
Three steps per request:
- Embed the incoming query. Use a text-embedding model (text-embedding-3-small, all-MiniLM-L6-v2, etc) to convert the user's question into a vector.
- Vector-search the cache. Compare the query vector against vectors of all previously-cached queries. Get the closest match by cosine similarity.
- Decide: hit or miss. If similarity exceeds a threshold (typically 0.93-0.97), return the cached response and skip the LLM call. If below, call the LLM, store query + response in the cache, return the fresh answer.
In code, the loop looks roughly like:

    import numpy as np
    from openai import OpenAI

    client = OpenAI()
    SIMILARITY_THRESHOLD = 0.95
    embed_model = "text-embedding-3-small"

    def embed(text: str) -> np.ndarray:
        # One embedding API call per query
        resp = client.embeddings.create(model=embed_model, input=text)
        return np.array(resp.data[0].embedding)

    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def semantic_get(query: str, cache_store: list[dict]) -> str | None:
        if not cache_store:
            return None
        q_vec = embed(query)
        # Brute-force scan: score every cached entry, keep the closest
        scores = [(cosine(q_vec, item["vec"]), item) for item in cache_store]
        scores.sort(key=lambda x: x[0], reverse=True)
        best_sim, best_match = scores[0]
        if best_sim >= SIMILARITY_THRESHOLD:
            return best_match["response"]
        return None

    def semantic_set(query: str, response: str, cache_store: list[dict]) -> None:
        cache_store.append({"query": query, "vec": embed(query), "response": response})

In production you'd swap the list-of-dicts for a vector store like pgvector, Redis with vector search, Pinecone, Qdrant, or Weaviate.
The threshold problem (the heart of the trade-off)
The similarity threshold is the parameter that decides everything: cost savings, latency, AND whether your users get wrong answers. Here is what we have measured across production deployments at different threshold settings.
| Threshold | Hit rate (customer support workload) | False positive rate |
|---|---|---|
| 0.99 | 1-3% | under 0.1% |
| 0.97 | 5-10% | ~0.5% |
| 0.95 | 15-25% | 1-3% |
| 0.93 | 25-40% | 3-7% |
| 0.90 | 35-55% | 7-15% |
| 0.85 | 45-70% | 15-30% |
The trade-off is brutal. To get above 30% hit rate (the typical break-even where the cost of running the embedding model pays off), you have to operate around 0.93. At that threshold, 3-7% of cache hits return the wrong answer to the user.
3% sounds small. In a customer support bot handling 10,000 queries per day at a 30% cache hit rate, that's 3,000 cached responses, 90-210 of which are wrong. That is 90-210 users per day getting confidently wrong answers your system thinks are correct.
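The arithmetic above is easy to rerun with your own traffic numbers. The sketch below is a hypothetical helper (the function name and parameters are illustrative, not from any library) that multiplies daily queries by hit rate and by a false-positive range; 3% and 7% of 3,000 hits come to 90 and 210.

```python
def wrong_answers_per_day(
    queries_per_day: int,
    hit_rate: float,
    fp_rate_low: float,
    fp_rate_high: float,
) -> tuple[int, int]:
    """Return the (low, high) range of wrong cached answers served per day."""
    cache_hits = queries_per_day * hit_rate
    return round(cache_hits * fp_rate_low), round(cache_hits * fp_rate_high)


# The customer support example: 10,000 queries/day, 30% hit rate, 3-7% FP rate
low, high = wrong_answers_per_day(10_000, 0.30, 0.03, 0.07)
print(f"{low}-{high} wrong cached answers per day")
```

Plugging in your own hit rate and measured false-positive rate gives a quick sanity check before deciding whether a threshold is acceptable.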
This is why semantic cache is harder to operate than provider prompt cache. With provider prompt cache, wrong cache hits are roughly impossible (exact prefix match). With semantic cache, you trade a guaranteed 90% input-token savings on the prefix for a probabilistic hit on the whole response. The math only works if your false-positive rate is acceptable for your use case.
For comparison: our LLM prompt caching guide covers provider-level caching that has zero correctness risk. Use that first, semantic second.
When semantic cache wins
Three workload shapes where semantic cache is clearly worth the effort:
Customer support FAQ patterns. Users phrase the same question 50 different ways. "How do I reset my password" / "I forgot my password" / "password reset link" / "can't log in to my account." A semantic cache turns 50 unique strings into 1 cached answer. Expected hit rate 30-45% at threshold 0.93-0.95.
Documentation Q&A bots. Same dynamic: the corpus of legitimate user questions is large, the corpus of legitimate answers is small. Semantic cache compresses many-to-one. Threshold 0.95 is typically safe because doc questions tend to be specific enough that paraphrases are genuinely equivalent.
Internal tools (BizOps assistants, analytics queries). Same employees asking similar questions repeatedly ("how many signups last week?" vs "how many new accounts last 7 days?"). The cost of a wrong answer is low (employees catch it). The cost of latency or per-query fees adds up. Semantic cache is great here.
When semantic cache backfires
Four workload shapes where shipping semantic cache is actively harmful:
Open-ended generation. "Write me a haiku about cats" and "write me a poem about my cat Whiskers" are semantically close (high similarity) but should produce completely different outputs. A semantic cache returns the cat-Whiskers poem to the haiku request, user gets confused, trust drops.
Code generation and code completion. Small input differences should produce meaningfully different outputs. "Sort an array" vs "sort an array in descending order" are 0.94 similar but require different code. Caching one for the other is a bug.
Regulated industries. Healthcare, legal, financial-advice assistants. A confidently-wrong cached answer in these spaces is not just a UX issue, it is a compliance event. The 3-7% false positive rate at usable thresholds is unacceptable.
Any product where users expect diversity. Brainstorming tools, creative writing assistants, anything where the implicit contract is "give me something new." Cached responses break that contract.
Implementation 1: Redis with vector search
For teams that already run Redis, the Redis vector search feature provides a production-ready semantic cache without adding a new service. The pattern:

    import redis
    from redis.commands.search.field import TextField, VectorField
    from redis.commands.search.indexDefinition import IndexDefinition, IndexType
    from redis.commands.search.query import Query

    r = redis.Redis(host="localhost", port=6379)
    INDEX_NAME = "semantic_cache_idx"
    DIM = 1536  # text-embedding-3-small

    # Create the index once at startup
    def create_index():
        schema = (
            TextField("query"),
            TextField("response"),
            VectorField("vec", "HNSW", {"TYPE": "FLOAT32", "DIM": DIM, "DISTANCE_METRIC": "COSINE"}),
        )
        try:
            r.ft(INDEX_NAME).create_index(
                schema, definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH)
            )
        except redis.ResponseError as exc:
            # Ignore only the "already exists" case; surface any other failure
            if "Index already exists" not in str(exc):
                raise

    # query_vec is the query embedding serialized as float32 bytes
    def cache_get(query_vec: bytes, threshold: float = 0.95) -> str | None:
        q = (
            Query("*=>[KNN 1 @vec $vec AS score]")
            .return_fields("response", "score")
            .dialect(2)
            .paging(0, 1)
        )
        results = r.ft(INDEX_NAME).search(q, query_params={"vec": query_vec})
        if not results.docs:
            return None
        top = results.docs[0]
        sim = 1 - float(top.score)  # Redis returns distance, convert to similarity
        if sim >= threshold:
            return top.response
        return None

The full Redis blog post on this pattern is worth reading: what is semantic caching.
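The lookup above reads hashes stored under the `cache:` prefix, but the write path is not shown. Here is a minimal sketch of one way to write entries, assuming a redis-py style client that exposes `hset` and `expire`. The `cache_set` and `to_float32_bytes` names, the random key scheme, and the 24-hour default TTL are illustrative assumptions, not part of the original pattern.

```python
import struct
import uuid


def to_float32_bytes(vec: list[float]) -> bytes:
    """Pack a vector as little-endian float32 bytes for a FLOAT32 vector field."""
    return struct.pack(f"<{len(vec)}f", *vec)


def cache_set(client, query: str, response: str, vec: list[float],
              ttl_seconds: int = 86_400) -> str:
    """Store one cache entry as a hash under the 'cache:' prefix, with a TTL."""
    key = f"cache:{uuid.uuid4().hex}"  # hypothetical key scheme
    client.hset(
        key,
        mapping={
            "query": query,
            "response": response,
            "vec": to_float32_bytes(vec),
        },
    )
    # Expire the entry so stale answers age out on their own
    client.expire(key, ttl_seconds)
    return key
```

The same `to_float32_bytes` helper can serialize the query embedding passed to `cache_get`. Passing the client in as an argument keeps the function easy to exercise against a stub in tests.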
Implementation 2: GPTCache
If you don't already run Redis or want a simpler abstraction, GPTCache is the most popular open-source library for LLM semantic caching. Drop-in replacement for the OpenAI client:
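
    from gptcache import cache
    from gptcache.adapter import openai
    from gptcache.embedding import OpenAI
    from gptcache.manager import CacheBase, VectorBase, get_data_manager
    from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

    openai_embedding = OpenAI()
    # Vector-backed storage so lookups are similarity searches
    # (SQLite for the cached payloads, FAISS for the vectors)
    data_manager = get_data_manager(
        CacheBase("sqlite"),
        VectorBase("faiss", dimension=openai_embedding.dimension),
    )
    cache.init(
        embedding_func=openai_embedding.to_embeddings,
        data_manager=data_manager,
        similarity_evaluation=SearchDistanceEvaluation(),
    )
    cache.set_openai_key()

    # Now your existing OpenAI calls get semantic-cached transparently
    # (the adapter mirrors the legacy, pre-1.0 openai ChatCompletion interface)
    response = openai.ChatCompletion.create(
        model="gpt-5.4",
        messages=[{"role": "user", "content": "How do I reset my password?"}],
    )

GPTCache handles the embedding, the vector store, the similarity comparison, and the cache hit/miss logic. Configuration is in the cache.init() call. Threshold defaults are conservative; tune them based on the eval loop below.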
How to actually measure whether yours is working
Semantic cache without an eval loop is a footgun. Here is the minimum:
1. Sample 1-5% of cache hits for blind human or LLM-as-judge grading. Every time you serve a cached response, optionally log it for review. A reviewer (human or strong LLM) compares the cached response to what the LLM would have actually returned for that specific query, and flags the cache hit as "appropriate" or "wrong."
2. Compute false-positive rate per week. False positives / total cache hits sampled. If it climbs above your tolerance threshold (we use 2% for non-regulated, 0.5% for regulated), raise the similarity threshold.
3. Track per-segment FP rate. If the aggregate FP rate is 3% but the FP rate on legal questions is 12%, semantic cache is unsafe for legal queries even if the average looks fine. Segment by intent class.
4. Compare drift over time. Re-grade old cache entries periodically. The corpus drifts. A response that was correct in February may be wrong in May if the underlying product or policy changed.
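The sampling and per-segment steps above can be sketched in a few lines. This is a minimal illustration, assuming each graded cache hit is a dict with a `segment` label and a `verdict` of "appropriate" or "wrong"; those field names and the helper functions are hypothetical, and the grading itself (human or LLM-as-judge) happens elsewhere.

```python
import random
from collections import defaultdict


def should_sample(sample_rate: float = 0.02) -> bool:
    """Decide whether to log this cache hit for later grading."""
    return random.random() < sample_rate


def fp_rate_by_segment(graded_hits: list[dict]) -> dict[str, float]:
    """Compute the false-positive rate per intent segment from graded samples."""
    totals: dict[str, int] = defaultdict(int)
    wrong: dict[str, int] = defaultdict(int)
    for hit in graded_hits:
        totals[hit["segment"]] += 1
        if hit["verdict"] == "wrong":
            wrong[hit["segment"]] += 1
    return {segment: wrong[segment] / totals[segment] for segment in totals}


def segments_over_tolerance(rates: dict[str, float], tolerance: float = 0.02) -> list[str]:
    """List the segments whose FP rate exceeds the tolerance."""
    return sorted(segment for segment, rate in rates.items() if rate > tolerance)
```

Segments returned by `segments_over_tolerance` are candidates for a higher threshold or for bypassing the semantic cache entirely.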
This is the same online-eval pattern we describe in RAG observability and LLM evaluation in production. Same loop, applied to cache hits instead of LLM outputs.
Common gotchas
Mistakes we see most often:
- Threshold set by intuition, not data. "0.9 feels right" is not a setting. Run the threshold sweep against your own golden set of paired queries (where you know what should match and what shouldn't) and pick the threshold that maximizes hit rate while keeping FP rate under your tolerance.
- Embedding model treated as a constant. Upgrading your embedding model (text-embedding-3-small → text-embedding-3-large) shifts the similarity distribution. Your old threshold is no longer calibrated. Re-tune after any embedding change.
- No staleness handling. A cached answer about pricing from 6 months ago will return today even if pricing changed. Add TTLs (24h-30d depending on volatility) and track per-cache-entry age in your dashboards.
- Caching personalized responses across users. If responses depend on the user (their account state, their previous answers), don't share cache across users. Partition the cache by customer_identifier or skip caching for personalized endpoints.
- Including timestamps or session IDs in cacheable inputs. Strip them before embedding. Otherwise no two queries ever match.
- Counting embedding latency as "free." Each cache lookup adds 50-200 ms for the embedding step. At low hit rates this is pure overhead vs just calling the LLM. Make sure embedding-call cost + latency is below the savings from cache hits at your real hit rate.
- No fallback when vector store is down. If your vector store fails, the cache should fail open (call the LLM) not fail closed (return errors). Wrap the cache lookup in a try/except with a hard timeout.
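For the last gotcha, here is a minimal sketch of a fail-open lookup with a hard timeout, using only the standard library. The `lookup_fail_open` name, the 250 ms default, and the thread-pool size are assumptions for illustration; `lookup` stands in for whatever function queries your vector store.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Optional

# Shared pool so each lookup does not pay thread start-up cost
_executor = ThreadPoolExecutor(max_workers=4)


def lookup_fail_open(
    lookup: Callable[[str], Optional[str]],
    query: str,
    timeout_s: float = 0.25,
) -> Optional[str]:
    """Run a cache lookup; treat a timeout or any error as a cache miss."""
    future = _executor.submit(lookup, query)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread may keep running; we simply stop waiting for it
        future.cancel()
        return None
    except Exception:
        # Vector store down, network error, bad payload: fall through to the LLM
        return None
```

A `None` result means the caller goes on to call the LLM, so a vector store outage costs cache savings but never returns errors to users.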
What we recommend at Respan
Respan's gateway ships exact-match caching by default. Same conversation in, same conversation out, zero correctness risk. That covers classification, extraction, eval pipelines, and any workload with predictable repeated inputs. See the gateway caching docs for the exact configuration.
We do not ship a built-in semantic cache. The reason is the correctness-risk tradeoff above. We have watched teams ship semantic cache and then ship hot-fixes for "the assistant gave the wrong refund policy to a user" enough times that we believe the right default is exact-match plus provider prompt cache, with semantic as an opt-in layer for teams that have done the eval discipline to operate it safely.
If you do need semantic caching, the cleanest architecture is: GPTCache or Redis vector running alongside the Respan gateway. The gateway handles routing, exact-match cache, tracing, prompt management. The semantic cache layer (yours) handles the embedding-similarity logic. Trace both layers through Respan's observability so you can see when each one hits, see false-positive rates per segment, and ship rollback fast if something breaks. See LLM observability for the trace patterns.
FAQ
What's the difference between semantic cache and exact-match cache?
Exact-match cache returns a cached response only when the new request is byte-identical to a prior request. Semantic cache returns a cached response when the new request is "similar enough" to a prior request, measured by embedding cosine similarity. Exact-match has zero correctness risk; semantic has a false-positive rate of roughly 0.5-15% at thresholds from 0.97 down to 0.90.
What similarity threshold should I start with?
0.97 for any first deployment. It is conservative, hit rate will be low (5-10%), but false positive rate will be under 0.5%. Once you have an eval loop running, you can lower toward 0.93-0.95 only if your FP rate stays under your tolerance.
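Once you have a golden set of paired queries, lowering the threshold can be driven by a sweep rather than intuition. The sketch below assumes each golden pair has already been scored, giving a list of `(similarity, should_match)` tuples; the function names and row format are hypothetical.

```python
def sweep_thresholds(
    scored_pairs: list[tuple[float, bool]],
    thresholds: list[float],
) -> list[dict]:
    """For each threshold, compute hit rate and false-positive rate on the golden set."""
    rows = []
    for threshold in thresholds:
        hits = [(sim, ok) for sim, ok in scored_pairs if sim >= threshold]
        false_positives = sum(1 for _, ok in hits if not ok)
        rows.append({
            "threshold": threshold,
            "hit_rate": len(hits) / len(scored_pairs),
            "fp_rate": false_positives / len(hits) if hits else 0.0,
        })
    return rows


def pick_threshold(rows: list[dict], max_fp_rate: float):
    """Pick the threshold with the highest hit rate whose FP rate is within tolerance."""
    acceptable = [row for row in rows if row["fp_rate"] <= max_fp_rate]
    if not acceptable:
        return None
    # On equal hit rates, prefer the stricter (higher) threshold
    best = max(acceptable, key=lambda row: (row["hit_rate"], row["threshold"]))
    return best["threshold"]
```

The golden set needs both pairs that should match and near-miss pairs that should not; without the near-misses every threshold looks safe.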
Which embedding model is best for semantic caching?
text-embedding-3-small is the standard default. text-embedding-3-large gives better similarity precision but costs more and adds latency. For most semantic cache workloads, small is enough. For high-stakes workloads (legal, healthcare) consider large plus stricter thresholds.
Does semantic cache work with streaming responses?
Yes. Store the assembled response on first generation, replay it on cache hit. Most clients will not notice the difference. If you need streaming preserved on cache hits, replay chunks with a small artificial delay.
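A minimal sketch of the replay approach, assuming the cached response is stored as a single string. The `replay_as_stream` name, chunk size, and delay are illustrative choices, not values from any particular client.

```python
import time
from typing import Iterator


def replay_as_stream(
    cached_text: str,
    chunk_chars: int = 24,
    delay_s: float = 0.02,
) -> Iterator[str]:
    """Yield a cached response in small chunks to mimic token streaming."""
    for start in range(0, len(cached_text), chunk_chars):
        yield cached_text[start:start + chunk_chars]
        if delay_s > 0:
            # Small artificial delay so the UI renders like a live stream
            time.sleep(delay_s)
```

Setting `delay_s=0` returns the chunks as fast as the consumer can read them, which is usually fine for non-interactive clients.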
Can I cache tool calls or function calls semantically?
Generally no. Tool calls have side effects. Two queries that semantically match might still need DIFFERENT tool calls (different arguments, different state). Semantic cache the model's textual responses, not the tool calls themselves.
How do I detect false positives in production?
Sample 1-5% of cache hits and grade with an LLM-as-judge (or human reviewer) against the actual fresh response the LLM would have produced. Aggregate by week. If FP rate climbs above your tolerance, raise the threshold.
Should I run semantic cache and exact-match cache together?
Yes. Hit exact-match first (zero risk, fast). On miss, hit semantic (risk-adjusted, still faster than LLM). On miss, call the LLM. This minimizes risk while maximizing cost savings.
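The three-tier order described above can be sketched as a single function. This is a hypothetical illustration: `semantic_lookup` and `call_llm` are caller-supplied functions, the exact-match cache is a plain dict keyed by a SHA-256 of the raw query bytes, and the returned label is only there to make the serving tier visible in traces.

```python
import hashlib
from typing import Callable, Optional


def exact_key(query: str) -> str:
    """Key for byte-identical matching: hash of the raw query bytes."""
    return hashlib.sha256(query.encode("utf-8")).hexdigest()


def answer(
    query: str,
    exact_cache: dict,
    semantic_lookup: Callable[[str], Optional[str]],
    call_llm: Callable[[str], str],
) -> tuple[str, str]:
    """Return (response, tier), where tier is 'exact', 'semantic', or 'llm'."""
    key = exact_key(query)
    if key in exact_cache:
        return exact_cache[key], "exact"

    cached = semantic_lookup(query)
    if cached is not None:
        # Not copied into the exact cache: a semantic hit may be a false positive
        return cached, "semantic"

    fresh = call_llm(query)
    exact_cache[key] = fresh  # the semantic store would also be written here
    return fresh, "llm"
```

Logging the tier alongside each response makes it straightforward to sample only semantic hits for grading.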
Is GPTCache production-ready?
For low-to-moderate traffic workloads, yes. For very high throughput, the bottleneck often becomes the embedding model latency or the vector store. Run load tests on your specific workload before assuming any library will scale.
Related
- Prompt Caching: OpenAI, Anthropic + Gateway. Provider-side caching with zero correctness risk. Use first.
- LLM Cache Layers. The broader 3-layer overview (provider, exact-match, semantic).
- RAG Evaluation in Production. The eval loop that semantic cache requires.
- RAG Observability. How to wire cache hits into your trace tree.
- LLM Gateway: The Complete Guide. Where caching fits in the architecture.
- Claude Prompt Caching Pricing. Provider-side caching, the safest layer.
- How to Reduce OpenAI API Costs. Broader cost-reduction playbook.