Cache LLM responses by semantic similarity rather than exact match. Near-duplicate questions return cached answers instantly. GPTCache, Redis with vector search, and custom embeddings all implement this.
Traditional exact-match caching is useless for LLM apps — users phrase the same question in dozens of different ways. "What is RAG?" and "Can you explain Retrieval Augmented Generation?" should return the same cached answer. Semantic caching embeds each query and looks for existing cached responses with high cosine similarity. On cache hit, return the cached response immediately. On miss, call the LLM and cache the result.
Hit rates for real applications: 30–60% for customer support chatbots (many repeated questions), 10–30% for general-purpose assistants. At $0.01 per GPT-4o call, a 40% hit rate means 400 of every 1,000 queries skip the API — saving about $4 per 1,000 queries.
A semantic cache sits between your application and the LLM API: every incoming query is embedded and compared against cached query embeddings before the model is ever called.
Cache store options: Redis with vector search (redis-py + RediSearch), Qdrant/Weaviate/Pinecone as vector stores, or in-memory with FAISS for development.
A minimal in-memory implementation:

```python
import openai
import numpy as np
from dataclasses import dataclass

client = openai.OpenAI()

@dataclass
class CacheEntry:
    query: str
    embedding: np.ndarray
    response: str
    hits: int = 0

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.95):
        self.threshold = similarity_threshold
        self.entries: list[CacheEntry] = []

    def _embed(self, text: str) -> np.ndarray:
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return np.array(resp.data[0].embedding)

    def _cosine_sim(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query: str) -> str | None:
        q_emb = self._embed(query)
        best_sim, best_entry = 0.0, None
        for entry in self.entries:
            sim = self._cosine_sim(q_emb, entry.embedding)
            if sim > best_sim:
                best_sim, best_entry = sim, entry
        if best_sim >= self.threshold and best_entry:
            best_entry.hits += 1
            return best_entry.response
        return None  # cache miss

    def set(self, query: str, response: str):
        emb = self._embed(query)
        self.entries.append(CacheEntry(query=query, embedding=emb, response=response))

    def cached_completion(self, messages: list[dict], model="gpt-4o-mini", **kwargs) -> tuple[str, bool]:
        # First user message as the cache key (single-turn assumption)
        user_msg = next(m["content"] for m in messages if m["role"] == "user")
        cached = self.get(user_msg)
        if cached is not None:
            return cached, True  # (response, from_cache)
        resp = client.chat.completions.create(
            model=model, messages=messages, **kwargs
        ).choices[0].message.content
        self.set(user_msg, resp)
        return resp, False

cache = SemanticCache(similarity_threshold=0.95)
answer, from_cache = cache.cached_completion(
    [{"role": "user", "content": "What is retrieval augmented generation?"}]
)
answer2, from_cache2 = cache.cached_completion(
    [{"role": "user", "content": "Explain RAG to me"}]
)
print(f"Second query from cache: {from_cache2}")  # True (sim ~0.96)
```
GPTCache packages the same pattern as a drop-in wrapper:

```bash
pip install gptcache
```

```python
from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

# Configure GPTCache with ONNX embedding + SQLite + FAISS
onnx = Onnx()
data_manager = get_data_manager(
    CacheBase("sqlite"),
    VectorBase("faiss", dimension=onnx.dimension),
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    similarity_threshold=0.9,
)
cache.set_openai_key()

# Drop-in replacement — same API as openai
response = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is a transformer?"}],
)

# Second call with similar query hits cache automatically
response2 = openai.ChatCompletion.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain transformer models"}],
)
print(response2["gptcache"])  # {"cache_hit": True}
```
A production semantic cache also needs instrumentation — track hit rate, cost saved, and latency saved so you can tune the threshold:
```python
class CacheMetrics:
    def __init__(self):
        self.total_requests = 0
        self.cache_hits = 0
        self.cost_saved = 0.0
        self.latency_saved_ms = 0.0

    def record(self, from_cache: bool, tokens: int = 0, latency_ms: float = 0):
        self.total_requests += 1
        if from_cache:
            self.cache_hits += 1
            self.cost_saved += tokens * 0.00001  # illustrative per-token rate; substitute your model's pricing
            self.latency_saved_ms += max(0, latency_ms - 5)  # vs ~5 ms cache lookup

    @property
    def hit_rate(self) -> float:
        return self.cache_hits / max(1, self.total_requests)

    def report(self):
        print(f"Hit rate: {self.hit_rate:.1%}")
        print(f"Cost saved: ${self.cost_saved:.4f}")
        print(f"Requests served: {self.total_requests}")
```
Semantic caching stores LLM responses indexed by the semantic meaning of the query rather than its exact text, so that similar but not identical queries can retrieve cached responses. Unlike exact-match caches that require bit-perfect key equality, semantic caches use embedding similarity to find cache hits, enabling cache reuse across paraphrases, synonyms, and minor query variations.
| Approach | Cache Key | Hit Condition | Hit Rate | Risk |
|---|---|---|---|---|
| Exact match | Query string | String equality | Low | None |
| Semantic (high threshold) | Embedding | Similarity > 0.97 | Medium | Low false hits |
| Semantic (medium threshold) | Embedding | Similarity > 0.90 | High | Occasional wrong cache |
| Cluster-based | Query cluster ID | Same cluster | High | Cluster boundary errors |
Similarity threshold tuning is the most sensitive configuration decision in semantic cache deployment. A threshold that is too high results in low cache hit rates and minimal cost savings. A threshold that is too low causes cache collisions where queries with related but distinct intents receive the wrong cached response — for instance, "What is the capital of France?" and "What is the largest city in France?" might map to the same cache entry despite requiring different answers. Per-topic threshold tuning, where queries about time-sensitive topics use higher thresholds, provides better precision than a single global threshold.
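Per-topic thresholds can be sketched as a simple routing table. The topic labels, threshold values, and the `classify_topic` heuristic below are illustrative placeholders — in practice you would use a lightweight classifier or keyword rules tuned to your domain:

```python
# Per-topic similarity thresholds: factual lookups need stricter matching
# because nearby queries ("capital of France" vs "largest city in France")
# have distinct answers; how-to paraphrases are safer to merge.
TOPIC_THRESHOLDS = {
    "factual_lookup": 0.97,
    "how_to": 0.92,
    "general_chat": 0.90,
}
DEFAULT_THRESHOLD = 0.95

def classify_topic(query: str) -> str:
    """Toy keyword-based topic classifier (placeholder for a real one)."""
    q = query.lower()
    if q.startswith(("what is", "who is", "when", "where")):
        return "factual_lookup"
    if q.startswith(("how do i", "how to", "how can")):
        return "how_to"
    return "general_chat"

def threshold_for(query: str) -> float:
    return TOPIC_THRESHOLDS.get(classify_topic(query), DEFAULT_THRESHOLD)
```

The cache then compares the best similarity score against `threshold_for(query)` instead of a single global constant.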
Cache invalidation is more complex for semantic caches than exact-match caches. When a fact in the knowledge base changes — a product price updates, a policy is revised — all cached responses that reference that fact need to be invalidated, even if they were cached under different query phrasings. Tag-based invalidation strategies attach metadata to each cache entry indicating which knowledge sources it draws from, enabling targeted invalidation of all entries that reference a specific source document when that document is updated.
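One way to sketch tag-based invalidation — entry shape and method names are illustrative, not a specific library's API:

```python
from dataclasses import dataclass, field

@dataclass
class TaggedEntry:
    query: str
    response: str
    source_tags: set[str] = field(default_factory=set)  # e.g. source doc IDs

class TaggedCache:
    def __init__(self):
        self.entries: list[TaggedEntry] = []

    def set(self, query: str, response: str, tags: set[str]):
        self.entries.append(TaggedEntry(query, response, set(tags)))

    def invalidate_tag(self, tag: str) -> int:
        """Drop every entry that drew on the given source; return count removed."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if tag not in e.source_tags]
        return before - len(self.entries)
```

When `pricing.md` is updated, `invalidate_tag("pricing.md")` removes every cached answer that cited it — regardless of how the original questions were phrased.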
Session-scoped semantic caches restrict cache lookups to the current conversation context, preventing responses from one user's session from serving as cache hits for another user's queries. This is important for applications where personalization matters — the cached response to "summarize my recent emails" from user A should never be returned to user B. Namespacing cache entries by user ID or session ID adds a prefix to the similarity search scope, maintaining cache hit rates within-session while enforcing strict isolation between sessions.
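A minimal sketch of namespacing, assuming an underlying cache object with `get`/`set` methods (like the `SemanticCache` earlier); the wrapper keeps one independent cache per user or session ID:

```python
class NamespacedCache:
    """Routes cache lookups to a per-namespace cache so similarity
    search never crosses user or session boundaries."""

    def __init__(self, cache_factory):
        self.factory = cache_factory       # callable returning a fresh cache
        self.caches: dict[str, object] = {}

    def _cache_for(self, namespace: str):
        if namespace not in self.caches:
            self.caches[namespace] = self.factory()
        return self.caches[namespace]

    def get(self, namespace: str, query: str):
        return self._cache_for(namespace).get(query)

    def set(self, namespace: str, query: str, response: str):
        self._cache_for(namespace).set(query, response)
```

With shared vector stores (Redis, Qdrant), the same effect is usually achieved with a metadata filter on user/session ID rather than separate indexes.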
Monitoring semantic cache health requires tracking metrics beyond the standard hit/miss rate. False positive rate — cache hits that returned incorrect responses — matters more than hit rate, since false positives actively degrade response quality rather than just missing an optimization opportunity. Logging sampled cache hits with their similarity scores and submitting them to periodic human review is the most reliable way to detect false positive clusters and tune the threshold accordingly. Alert on any sustained false positive rate above 1–2%, as this will noticeably degrade user experience.
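Sampled hit logging might look like the sketch below; the sample rate, field names, and in-memory queue are illustrative stand-ins for a real review pipeline:

```python
import random

class HitSampler:
    """Samples a fraction of cache hits for offline human review."""

    def __init__(self, sample_rate: float = 0.05, rng=None):
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.samples: list[dict] = []   # stand-in for a review queue

    def maybe_log(self, query: str, cached_query: str,
                  similarity: float, response: str):
        if self.rng.random() < self.sample_rate:
            self.samples.append({
                "query": query,
                "matched_query": cached_query,
                "similarity": round(similarity, 4),
                "response": response[:200],  # truncate for the review UI
            })
```

A reviewer scanning these records for mismatched query pairs near the threshold is what surfaces false positive clusters.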
Time-to-live (TTL) policies for semantic cache entries must account for the staleness characteristics of the underlying content. Responses about current events, prices, or status information become stale quickly and should have short TTLs (minutes to hours). Responses about stable factual content — explanations of mathematical concepts, historical events, product documentation — can safely use longer TTLs (days to weeks). Implementing topic-based TTL policies, where different cache namespaces have different expiry rules, provides better cache efficiency than a single global TTL applied uniformly across all response types.