Semantic Caching for LLM Applications: What the Benchmarks Don't Tell You
Every vendor selling an LLM gateway will show you a slide with "95% cache hit rate." What that slide won't show you is the fine print: that number refers to match accuracy when a hit is found, not how often a hit is found in the first place. Real production systems see 20–45% hit rates — and that gap between marketing and reality is where most teams get burned.
Semantic caching is a genuinely useful technique. But deploying it without understanding its failure modes is how you end up returning wrong answers to users with high confidence, wondering why your support queue doubled.
What Semantic Caching Actually Does
Traditional caching is deterministic: hash the request, look up the hash, return the stored response. This works well when the same bytes arrive repeatedly, but LLM queries are rarely byte-identical. "How do I reset my password?" and "steps to reset my password" are different strings with the same intent. An exact cache that has stored the answer to the first still misses on the second.
Semantic caching solves this by converting each query into an embedding vector and comparing new queries against stored vectors using cosine similarity. When the similarity score exceeds a threshold, the system returns the cached response without calling the LLM.
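To make the comparison concrete, here is a minimal sketch of the similarity check in plain Python. The function names and the 0.92 default are placeholders chosen for illustration, and treating a zero vector as "matches nothing" is an assumption of this sketch, not a standard.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector never counts as a match
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec: Sequence[float],
                 cached_vec: Sequence[float],
                 threshold: float = 0.92) -> bool:
    """True when the similarity score exceeds the threshold."""
    return cosine_similarity(query_vec, cached_vec) > threshold
```

In a real deployment the vectors come from an embedding model and have hundreds of dimensions; the arithmetic is the same.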
The architecture looks like this:
- Layer 1 (exact cache): Hash the full query string. If there's a hit, return immediately. This catches 15–30% of traffic in most production systems — automated pipelines and user retries create more exact duplicates than you'd expect.
- Layer 2 (semantic cache): Embed the query, search a vector index, evaluate cosine similarity. If the score exceeds your threshold, return the cached response.
- Miss path: Forward to the LLM, cache the query-response pair for future lookups.
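The three paths above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a production design: `TwoLayerCache`, `toy_embed`, and `fake_llm` are hypothetical names, the linear scan stands in for a real vector index, and the bag-of-words embedding exists only so the example runs without a model.

```python
import hashlib
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TwoLayerCache:
    def __init__(self, embed: Callable[[str], Vector],
                 call_llm: Callable[[str], str],
                 threshold: float = 0.92) -> None:
        self.embed = embed
        self.call_llm = call_llm
        self.threshold = threshold
        self.exact: Dict[str, str] = {}               # Layer 1: hash -> response
        self.entries: List[Tuple[Vector, str]] = []   # Layer 2: (vector, response)

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get(self, query: str) -> Tuple[str, str]:
        """Return (response, source); source is 'exact', 'semantic', or 'llm'."""
        key = self._key(query)
        if key in self.exact:                         # Layer 1: exact match
            return self.exact[key], "exact"

        vec = self.embed(query)                       # Layer 2: nearest stored vector
        best_score, best_response = 0.0, None
        for stored_vec, response in self.entries:     # linear scan, not a real index
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score > self.threshold:
            return best_response, "semantic"

        response = self.call_llm(query)               # Miss path: call and store
        self.exact[key] = response
        self.entries.append((vec, response))
        return response, "llm"


# Toy stand-ins so the sketch runs without a model or an API key.
VOCAB = ["reset", "password", "refund", "order"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().replace("?", "").split()
    return [float(words.count(term)) for term in VOCAB]


def fake_llm(query: str) -> str:
    return f"answer to: {query}"


cache = TwoLayerCache(toy_embed, fake_llm)
for q in ["How do I reset my password?", "How do I reset my password?",
          "steps to reset my password", "refund my order"]:
    print(cache.get(q)[1])  # prints llm, exact, semantic, llm (one per line)
```

The exact layer runs first because a hash lookup costs almost nothing, while every semantic lookup pays for an embedding call.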
The embedding model behind Layer 2 is typically lighter and faster than the generation model: something like a sentence-transformer or a small bi-encoder. The cache lookup path needs to stay under ~50ms, or its overhead eats into the latency you save by skipping the LLM call.
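One way to keep yourself honest about that budget is to time the lookup path directly. The helper below is a hypothetical sketch: `timed_lookup` and its `budget_ms` parameter are names invented here, and `lookup` stands for whatever function performs your embed-and-search step.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_lookup(lookup: Callable[[str], T], query: str,
                 budget_ms: float = 50.0) -> Tuple[T, float, bool]:
    """Run a cache lookup and report (result, elapsed_ms, within_budget)."""
    start = time.perf_counter()
    result = lookup(query)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms, elapsed_ms <= budget_ms
```

A single measurement says little. In practice you would record `elapsed_ms` for every request and watch the tail of the distribution, not the average.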
The Hit Rate Reality
Research papers routinely report 60–70% cache hit rates. Production traffic is messier. Hit rates in real deployments break down roughly by application type:
- FAQ and support bots: 40–60% — this is where semantic caching shines. Users rephrase the same handful of questions.
- Classification tasks: 50–70% — discrete input space, limited query variety.
- RAG-backed Q&A: 15–25% — users ask questions across a wide factual domain; true duplicates are rarer.
- Open-ended chat: 10–20% — nearly every turn is unique. Semantic caching is basically inert here.
- Agentic tool calls: 5–15% — the query depends on prior context and current state; the same surface-level question might need a completely different response.
The gap between FAQ performance and agentic performance matters because teams often evaluate semantic caching on their simplest use case and then deploy it globally. If your system handles multiple query types, a blended hit rate of 25% is more realistic than 60%.
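A blended rate is just a traffic-weighted average, which makes it easy to estimate before you deploy. In the sketch below the hit rates are the midpoints of the ranges listed above, while the traffic shares are made up for illustration. Substitute your own mix.

```python
from typing import Dict, Tuple


def blended_hit_rate(mix: Dict[str, Tuple[float, float]]) -> float:
    """Traffic-weighted average of per-category hit rates.

    `mix` maps a category name to (traffic_share, hit_rate).
    """
    total_share = sum(share for share, _ in mix.values())
    if total_share <= 0:
        raise ValueError("traffic shares must sum to a positive number")
    return sum(share * rate for share, rate in mix.values()) / total_share


# Hit rates: midpoints of the ranges above. Traffic shares: hypothetical.
traffic_mix = {
    "faq":            (0.20, 0.50),
    "classification": (0.10, 0.60),
    "rag_qa":         (0.30, 0.20),
    "open_chat":      (0.25, 0.15),
    "agentic":        (0.15, 0.10),
}

print(f"blended hit rate: {blended_hit_rate(traffic_mix):.3f}")
```

With this mix the blended rate comes out at about 0.27, even though the FAQ slice alone hits 0.50.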
A 25% hit rate on $5,000 in monthly LLM spend saves roughly $1,250/month before infrastructure costs, assuming cached queries cost about as much as the average request. That's real money, but it's worth comparing against the engineering time required to tune and maintain the cache correctly, which brings us to the hard part.
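The savings estimate is simple enough to keep in a small function and rerun as your numbers change. This sketch assumes cached requests cost about as much as the average request, which will not hold if your cache mostly catches short, cheap queries. The function names and the `infra_cost` parameter are illustrative.

```python
def monthly_savings(llm_spend: float, hit_rate: float,
                    infra_cost: float = 0.0) -> float:
    """Net monthly savings: avoided LLM spend minus cache infrastructure cost.

    Assumes a cache hit avoids an average-cost request.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return llm_spend * hit_rate - infra_cost


def break_even_hit_rate(llm_spend: float, infra_cost: float) -> float:
    """Hit rate at which the cache exactly pays for its own infrastructure."""
    if llm_spend <= 0:
        raise ValueError("llm_spend must be positive")
    return infra_cost / llm_spend


gross = monthly_savings(llm_spend=5000.0, hit_rate=0.25)
print(f"gross monthly savings: ${gross:,.2f}")
```

Passing a nonzero `infra_cost` (embedding calls, vector store hosting) gives the net figure, and `break_even_hit_rate` tells you how low the hit rate can fall before the cache costs more than it saves.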
The Threshold Is a Knob, Not a Setting
The cosine similarity threshold determines whether your cache is useful or dangerous. It's also the most frequently misconfigured component in semantic caching deployments.
Too low a threshold (below 0.80): You return cached responses for queries that are only superficially similar. The danger is not confined to very low settings, either. In one financial services deployment, a customer saying "I don't want this business account anymore" was routed to automatic payment cancellation procedures at a cosine similarity of 0.887, when the query should have triggered an account closure review. Any threshold below that score would have served the wrong answer. The queries are related, but the required responses diverge completely.
Too high a threshold (above 0.95): You approach the behavior of exact caching. You get the infrastructure complexity without the semantic matching benefit.
The commonly recommended sweet spot is 0.92, but that's a global default, and global defaults are wrong for heterogeneous workloads. In a dense code embedding space, "sort_ascending" and "sort_descending" sit so close together that a single threshold can treat them as the same query, while that same threshold misses valid paraphrases in sparse conversational spaces.
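Because the threshold is a knob, it helps to sweep it against labeled examples from your own traffic instead of trusting a default. The sketch below assumes you have a set of (similarity score, should-match) pairs judged by a human. The scores here are invented for illustration, except 0.887, which mirrors the account-closure example above.

```python
from typing import List, NamedTuple, Sequence, Tuple


class SweepRow(NamedTuple):
    threshold: float
    hit_rate: float        # share of pairs served from cache
    wrong_hit_rate: float  # share of those hits that should not have matched


def sweep_thresholds(pairs: Sequence[Tuple[float, bool]],
                     thresholds: Sequence[float]) -> List[SweepRow]:
    """Evaluate each threshold against labeled (score, should_match) pairs."""
    rows = []
    total = len(pairs)
    for t in thresholds:
        hits = [should_match for score, should_match in pairs if score > t]
        wrong = sum(1 for should_match in hits if not should_match)
        rows.append(SweepRow(
            threshold=t,
            hit_rate=len(hits) / total if total else 0.0,
            wrong_hit_rate=wrong / len(hits) if hits else 0.0,
        ))
    return rows


# Hypothetical labeled pairs: (cosine similarity, is the cached answer correct?)
labeled_pairs = [
    (0.97, True), (0.94, True), (0.93, True), (0.90, True),
    (0.887, False), (0.86, True), (0.84, False), (0.78, False),
]

for row in sweep_thresholds(labeled_pairs, [0.80, 0.85, 0.92, 0.95]):
    print(f"{row.threshold:.2f}  hits={row.hit_rate:.3f}  wrong={row.wrong_hit_rate:.3f}")
```

On this toy data, lowering the threshold from 0.92 to 0.80 more than doubles the hit rate but lets two wrong answers through, which is the trade-off you are tuning.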
Better approaches:
- Per-category thresholds: Tag queries by intent or topic (classification, coding, factual lookup, open-ended) and apply different thresholds. Classification tasks tolerate 0.85; technical queries need 0.95 or higher.
- Namespace isolation: Partition your cache by query category. A hit against the wrong partition is more dangerous than a miss.
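Both ideas fit in one small data structure: keep a separate partition per category and look up the threshold by category at query time. The sketch below is hypothetical. `PartitionedSemanticCache` is a name invented here, the category label is assumed to come from an upstream intent classifier that is not shown, and the linear scan again stands in for a vector index. The 0.85 and 0.95 values are the ones quoted above.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PartitionedSemanticCache:
    def __init__(self, thresholds: Dict[str, float],
                 default_threshold: float = 0.92) -> None:
        self.thresholds = dict(thresholds)
        self.default_threshold = default_threshold
        # One independent list of (vector, response) entries per category.
        self._partitions: Dict[str, List[Tuple[Vector, str]]] = {}

    def put(self, category: str, vector: Vector, response: str) -> None:
        self._partitions.setdefault(category, []).append((vector, response))

    def lookup(self, category: str, vector: Vector) -> Optional[str]:
        """Search only the query's own partition, using that category's threshold."""
        threshold = self.thresholds.get(category, self.default_threshold)
        best_score, best_response = 0.0, None
        for stored, response in self._partitions.get(category, []):
            score = cosine(vector, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > threshold else None


cache = PartitionedSemanticCache({"classification": 0.85, "technical": 0.95})
cache.put("technical", [1.0, 0.0], "technical answer")

print(cache.lookup("technical", [1.0, 0.2]))       # close enough for 0.95
print(cache.lookup("technical", [1.0, 0.4]))       # too far for 0.95: None
print(cache.lookup("classification", [1.0, 0.0]))  # other partition: None
```

The last lookup misses even though an identical vector exists, because entries in one partition are invisible to queries tagged with another.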
