Cache LLM responses by semantic similarity rather than exact match. Near-duplicate questions return cached answers instantly. GPTCache, Redis with vector search, and custom embeddings all implement this.
Traditional exact-match caching is useless for LLM apps — users phrase the same question in dozens of different ways. "What is RAG?" and "Can you explain Retrieval Augmented Generation?" should return the same cached answer. Semantic caching embeds each query and looks for existing cached responses with high cosine similarity. On cache hit, return the cached response immediately. On miss, call the LLM and cache the result.
Hit rates for real applications: 30–60% for customer support chatbots (many repeated questions), 10–30% for general-purpose assistants. At $0.01 per GPT-4o call, a 40% hit rate means 400 of every 1,000 queries skip the API — saving about $4 per 1,000 queries.
A semantic cache sits between your application and the LLM API: every incoming query is embedded and compared against cached query embeddings before the model is ever called.
Cache store options: Redis with vector search (redis-py + RediSearch), Qdrant/Weaviate/Pinecone as vector stores, or in-memory with FAISS for development.
A minimal in-memory implementation:

```python
import openai
import numpy as np
from dataclasses import dataclass

client = openai.OpenAI()

@dataclass
class CacheEntry:
    query: str
    embedding: np.ndarray
    response: str
    hits: int = 0

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.entries: list[CacheEntry] = []

    def _embed(self, text: str) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(resp.data[0].embedding)

    def _cosine_sim(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        q_emb = self._embed(query)
        best_sim, best_entry = 0.0, None
        for entry in self.entries:
            sim = self._cosine_sim(q_emb, entry.embedding)
            if sim > best_sim:
                best_sim, best_entry = sim, entry
        if best_sim >= self.threshold and best_entry:
            best_entry.hits += 1
            return best_entry.response
        return None  # cache miss

    def set(self, query: str, response: str):
        emb = self._embed(query)
        self.entries.append(CacheEntry(query=query, embedding=emb, response=response))

    def cached_completion(self, messages: list[dict], model="gpt-4o-mini", **kwargs) -> tuple[str, bool]:
        # First user message as the cache key (single-turn assumption)
        user_msg = next(m["content"] for m in messages if m["role"] == "user")
        cached = self.get(user_msg)
        if cached is not None:
            return cached, True  # (response, from_cache)
        resp = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        ).choices[0].message.content
        self.set(user_msg, resp)
        return resp, False

cache = SemanticCache(similarity_threshold=0.95)
answer, from_cache = cache.cached_completion(
    [{"role": "user", "content": "What is retrieval augmented generation?"}]
)
answer2, from_cache2 = cache.cached_completion(
    [{"role": "user", "content": "Explain RAG to me"}]
)
print(f"Second query from cache: {from_cache2}")  # True (sim ~0.96)
```
GPTCache packages the same pattern as a drop-in wrapper:

```bash
pip install gptcache
```

```python
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Configure GPTCache with ONNX embedding + SQLite + FAISS
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    similarity_threshold=0.9,
)
cache.set_openai_key()

# Drop-in replacement — same API as openai
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is a transformer?"}],
)

# Second call with similar query hits cache automatically
response2 = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain transformer models"}],
)
print(response2["gptcache"])  # {"cache_hit": True}
```
A production semantic cache also needs instrumentation — track hit rate, cost saved, and latency saved so you can tune the threshold:
```python
class CacheMetrics:
    def __init__(self):
        self.total_requests = 0
        self.cache_hits = 0
        self.cost_saved = 0.0
        self.latency_saved_ms = 0.0

    def record(self, from_cache: bool, tokens: int = 0, latency_ms: float = 0):
        self.total_requests += 1
        if from_cache:
            self.cache_hits += 1
            self.cost_saved += tokens * 0.00001  # illustrative per-token rate; substitute your model's pricing
            self.latency_saved_ms += max(0, latency_ms - 5)  # vs ~5 ms cache lookup

    @property
    def hit_rate(self) -> float:
        return self.cache_hits / max(1, self.total_requests)

    def report(self):
        print(f"Hit rate: {self.hit_rate:.1%}")
        print(f"Cost saved: ${self.cost_saved:.4f}")
        print(f"Requests served: {self.total_requests}")
```
Semantic caching stores LLM responses indexed by the semantic meaning of the query rather than its exact text, so that similar but not identical queries can retrieve cached responses. Unlike exact-match caches that require bit-perfect key equality, semantic caches use embedding similarity to find cache hits, enabling cache reuse across paraphrases, synonyms, and minor query variations.
| Approach | Cache Key | Hit Condition | Hit Rate | Risk |
|---|---|---|---|---|
| Exact match | Query string | String equality | Low | None |
| Semantic (high threshold) | Embedding | Similarity > 0.97 | Medium | Low false hits |
| Semantic (medium threshold) | Embedding | Similarity > 0.90 | High | Occasional wrong cache |
| Cluster-based | Query cluster ID | Same cluster | High | Cluster boundary errors |
Similarity threshold tuning is the most sensitive configuration decision in semantic cache deployment. A threshold that is too high results in low cache hit rates and minimal cost savings. A threshold that is too low causes cache collisions where queries with related but distinct intents receive the wrong cached response — for instance, "What is the capital of France?" and "What is the largest city in France?" might map to the same cache entry despite requiring different answers. Per-topic threshold tuning, where queries about time-sensitive topics use higher thresholds, provides better precision than a single global threshold.
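Per-topic thresholds can be sketched as a simple routing table. The topic labels, threshold values, and the `classify_topic` heuristic below are illustrative placeholders — in practice you would use a lightweight classifier or keyword rules tuned to your domain:

```python
# Per-topic similarity thresholds: factual lookups need stricter matching
# because nearby queries ("capital of France" vs "largest city in France")
# have distinct answers; how-to paraphrases are safer to merge.
TOPIC_THRESHOLDS = {
    "factual_lookup": 0.97,
    "how_to": 0.92,
    "general_chat": 0.90,
}
DEFAULT_THRESHOLD = 0.95

def classify_topic(query: str) -> str:
    """Toy keyword-based topic classifier (placeholder for a real one)."""
    q = query.lower()
    if q.startswith(("what is", "who is", "when", "where")):
        return "factual_lookup"
    if q.startswith(("how do i", "how to", "how can")):
        return "how_to"
    return "general_chat"

def threshold_for(query: str) -> float:
    return TOPIC_THRESHOLDS.get(classify_topic(query), DEFAULT_THRESHOLD)
```

The cache then compares the best similarity score against `threshold_for(query)` instead of a single global constant.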
Cache invalidation is more complex for semantic caches than exact-match caches. When a fact in the knowledge base changes — a product price updates, a policy is revised — all cached responses that reference that fact need to be invalidated, even if they were cached under different query phrasings. Tag-based invalidation strategies attach metadata to each cache entry indicating which knowledge sources it draws from, enabling targeted invalidation of all entries that reference a specific source document when that document is updated.
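One way to sketch tag-based invalidation — entry shape and method names are illustrative, not a specific library's API:

```python
from dataclasses import dataclass, field

@dataclass
class TaggedEntry:
    query: str
    response: str
    source_tags: set[str] = field(default_factory=set)  # e.g. source doc IDs

class TaggedCache:
    def __init__(self):
        self.entries: list[TaggedEntry] = []

    def set(self, query: str, response: str, tags: set[str]):
        self.entries.append(TaggedEntry(query, response, set(tags)))

    def invalidate_tag(self, tag: str) -> int:
        """Drop every entry that drew on the given source; return count removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if tag not in e.source_tags]
        return before - len(self.entries)
```

When `pricing.md` is updated, `invalidate_tag("pricing.md")` removes every cached answer that cited it — regardless of how the original questions were phrased.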
Session-scoped semantic caches restrict cache lookups to the current conversation context, preventing responses from one user's session from serving as cache hits for another user's queries. This is important for applications where personalization matters — the cached response to "summarize my recent emails" from user A should never be returned to user B. Namespacing cache entries by user ID or session ID adds a prefix to the similarity search scope, maintaining cache hit rates within-session while enforcing strict isolation between sessions.
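A minimal sketch of namespacing, assuming an underlying cache object with `get`/`set` methods (like the `SemanticCache` earlier); the wrapper keeps one independent cache per user or session ID:

```python
class NamespacedCache:
    """Routes cache lookups to a per-namespace cache so similarity
    search never crosses user or session boundaries."""

    def __init__(self, cache_factory):
        self.factory = cache_factory       # callable returning a fresh cache
        self.caches: dict[str, object] = {}

    def _cache_for(self, namespace: str):
        if namespace not in self.caches:
            self.caches[namespace] = self.factory()
        return self.caches[namespace]

    def get(self, namespace: str, query: str):
        return self._cache_for(namespace).get(query)

    def set(self, namespace: str, query: str, response: str):
        self._cache_for(namespace).set(query, response)
```

With shared vector stores (Redis, Qdrant), the same effect is usually achieved with a metadata filter on user/session ID rather than separate indexes.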
Monitoring semantic cache health requires tracking metrics beyond the standard hit/miss rate. False positive rate — cache hits that returned incorrect responses — matters more than hit rate, since false positives actively degrade response quality rather than just missing an optimization opportunity. Logging sampled cache hits with their similarity scores and submitting them to periodic human review is the most reliable way to detect false positive clusters and tune the threshold accordingly. Alert on any sustained false positive rate above 1–2%, as this will noticeably degrade user experience.
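Sampled hit logging might look like the sketch below; the sample rate, field names, and in-memory queue are illustrative stand-ins for a real review pipeline:

```python
import random

class HitSampler:
    """Samples a fraction of cache hits for offline human review."""

    def __init__(self, sample_rate: float = 0.05, rng=None):
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.samples: list[dict] = []   # stand-in for a review queue

    def maybe_log(self, query: str, cached_query: str,
                  similarity: float, response: str):
        if self.rng.random() < self.sample_rate:
            self.samples.append({
                "query": query,
                "matched_query": cached_query,
                "similarity": round(similarity, 4),
                "response": response[:200],  # truncate for the review UI
            })
```

A reviewer scanning these records for mismatched query pairs near the threshold is what surfaces false positive clusters.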
Time-to-live (TTL) policies for semantic cache entries must account for the staleness characteristics of the underlying content. Responses about current events, prices, or status information become stale quickly and should have short TTLs (minutes to hours). Responses about stable factual content — explanations of mathematical concepts, historical events, product documentation — can safely use longer TTLs (days to weeks). Implementing topic-based TTL policies, where different cache namespaces have different expiry rules, provides better cache efficiency than a single global TTL applied uniformly across all response types.