Semantic Caching for LLM Applications: What the Benchmarks Don't Tell You
Every vendor selling an LLM gateway will show you a slide with "95% cache hit rate." What that slide won't show you is the fine print: that number refers to match accuracy when a hit is found, not how often a hit is found in the first place. Real production systems see 20–45% hit rates — and that gap between marketing and reality is where most teams get burned.
Semantic caching is a genuinely useful technique. But deploying it without understanding its failure modes is how you end up returning wrong answers to users with high confidence, wondering why your support queue doubled.
What Semantic Caching Actually Does
Traditional caching is deterministic: hash the request, look up the hash, return the stored response. This works well when the same bytes arrive repeatedly. LLM queries are rarely byte-identical. "How do I reset my password?" and "steps to reset my password" are different strings but identical queries. Exact caching misses both after the first one.
Semantic caching solves this by converting each query into an embedding vector and comparing new queries against stored vectors using cosine similarity. When the similarity score exceeds a threshold, the system returns the cached response without calling the LLM.
The architecture looks like this:
- Layer 1 (exact cache): Hash the full query string. If there's a hit, return immediately. This catches 15–30% of traffic in most production systems — automated pipelines and user retries create more exact duplicates than you'd expect.
- Layer 2 (semantic cache): Embed the query, search a vector index, evaluate cosine similarity. If the score exceeds your threshold, return the cached response.
- Miss path: Forward to the LLM, cache the query-response pair for future lookups.
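As a concrete sketch, the two layers fit in a few dozen lines. The bag-of-words `embed` function below is a deterministic stand-in for a real sentence-transformer, and the linear scan stands in for a proper vector index — both are illustrative, not production choices:

```python
import hashlib

import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size bag-of-words vector.
    A real deployment would call a sentence-transformer here."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % 64] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class TwoLayerCache:
    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.exact = {}    # sha256(query) -> response
        self.entries = []  # (unit-norm embedding, response)

    def get(self, query: str):
        # Layer 1: exact match on the hashed query string.
        key = hashlib.sha256(query.encode()).hexdigest()
        if key in self.exact:
            return self.exact[key], "exact"
        # Layer 2: cosine similarity against stored embeddings
        # (dot product suffices because vectors are unit-norm).
        q = embed(query)
        for vec, response in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:
                return response, "semantic"
        return None, "miss"

    def put(self, query: str, response: str):
        key = hashlib.sha256(query.encode()).hexdigest()
        self.exact[key] = response
        self.entries.append((embed(query), response))

cache = TwoLayerCache()
cache.put("How do I reset my password?", "Go to Settings > Security > Reset.")
print(cache.get("How do I reset my password?")[1])  # exact
```

On the miss path, `put` writes to both layers, so an exact retry of the same query never pays for an embedding lookup.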
The embedding model that powers Layer 2 is typically lighter and faster than the generation model — something like a sentence-transformer or a small bi-encoder. The entire cache lookup path needs to stay under roughly 50 ms, or the lookup itself starts to erode the latency benefit of skipping the LLM call.
The Hit Rate Reality
Research papers routinely report 60–70% cache hit rates. Production traffic is messier. The true range across real deployments breaks down roughly by application type:
- FAQ and support bots: 40–60% — this is where semantic caching shines. Users rephrase the same handful of questions.
- Classification tasks: 50–70% — discrete input space, limited query variety.
- RAG-backed Q&A: 15–25% — users ask questions across a wide factual domain; true duplicates are rarer.
- Open-ended chat: 10–20% — nearly every turn is unique. Semantic caching is basically inert here.
- Agentic tool calls: 5–15% — the query depends on prior context and current state; the same surface-level question might need a completely different response.
The gap between FAQ performance and agentic performance matters because teams often evaluate semantic caching on their simplest use case and then deploy it globally. If your system handles multiple query types, a blended hit rate of 25% is more realistic than 60%.
A 25% hit rate with $5,000 in monthly LLM spend saves roughly $1,250/month in model fees before infrastructure costs. That's real money, but it's worth comparing against the engineering time required to tune and maintain the cache correctly — which brings us to the hard part.
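The arithmetic is just hit rate times spend, minus whatever the cache infrastructure costs; real savings land below the gross figure because cache hits still pay for embedding and lookup:

```python
def monthly_savings(llm_spend: float, hit_rate: float,
                    cache_infra_cost: float = 0.0) -> float:
    """Gross savings scale with hit rate; subtract cache running costs."""
    return llm_spend * hit_rate - cache_infra_cost

# A blended 25% hit rate on $5,000/month of LLM spend, before infra costs:
print(monthly_savings(5000, 0.25))  # 1250.0
```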
The Threshold Is a Knob, Not a Setting
The cosine similarity threshold determines whether your cache is useful or dangerous. It's also the most frequently misconfigured component in semantic caching deployments.
Too low a threshold (below 0.80): You return cached responses for queries that are only superficially similar. A financial services deployment encountered this directly — a customer saying "I don't want this business account anymore" was matched at 0.887 cosine similarity to automatic payment cancellation procedures, when the query needed to trigger account closure review. The queries are related, but the required responses diverge completely.
Too high a threshold (above 0.95): You approach the behavior of exact caching. You get the infrastructure complexity without the semantic matching benefit.
The commonly recommended sweet spot is 0.92 — but that's a global default, and global defaults are wrong for heterogeneous workloads. A single threshold treats "sort_ascending" and "sort_descending" identically in a dense code embedding space (they're very similar vectors), while missing valid paraphrases in sparse conversational spaces.
Better approaches:
- Per-category thresholds: Tag queries by intent or topic (classification, coding, factual lookup, open-ended) and apply different thresholds. Classification tasks tolerate 0.85; technical queries need 0.95 or higher.
- Namespace isolation: Partition your cache by query category. A hit against the wrong partition is more dangerous than a miss.
- Confidence decay: Track per-entry hit counts. Entries with many hits and positive feedback signals can use relaxed thresholds; new entries default to conservative ones.
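The per-category idea reduces to a simple lookup. The category names and threshold values below are illustrative, chosen to match the ranges discussed above, not prescriptive:

```python
# Hypothetical per-category thresholds following the ranges discussed above.
THRESHOLDS = {
    "classification": 0.85,  # discrete label space tolerates looser matching
    "faq": 0.90,
    "technical": 0.95,       # near-identical vectors can mean opposite code
}
DEFAULT_THRESHOLD = 0.92     # conservative fallback for untagged queries

def accept_hit(category: str, similarity: float) -> bool:
    """Accept a semantic hit only if it clears the category's own bar."""
    return similarity >= THRESHOLDS.get(category, DEFAULT_THRESHOLD)

print(accept_hit("classification", 0.87))  # True
print(accept_hit("technical", 0.93))       # False
```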
The threshold problem compounds when you upgrade your embedding model. A new model will produce different vector representations for the same queries. All your stored embeddings become invalid — they can no longer be compared meaningfully to new queries. Plan for full cache invalidation and rebuild time whenever you change the embedding model.
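One low-effort safeguard is to namespace cache keys by embedding model version, so old entries become unreachable the moment you upgrade rather than silently mismatched. The version string below is a hypothetical tag:

```python
import hashlib

EMBEDDING_MODEL_VERSION = "all-MiniLM-L6-v2@2024-01"  # hypothetical version tag

def namespace_key(query: str,
                  model_version: str = EMBEDDING_MODEL_VERSION) -> str:
    """Prefix every cache key with the embedding model version; a model
    upgrade then invalidates the old namespace wholesale."""
    digest = hashlib.sha256(query.encode()).hexdigest()[:16]
    return f"{model_version}:{digest}"

old = namespace_key("reset my password", model_version="v1")
new = namespace_key("reset my password", model_version="v2")
print(old != new)  # True — the upgrade naturally orphans old entries
```

Purging the old namespace then becomes a background cleanup job instead of a correctness-critical migration.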
Failure Modes That Practitioners Miss
The headline failure mode (wrong answers) gets attention. There are four others that don't:
Cache poisoning from hallucinated responses: When the LLM produces a hallucinated or incorrect response, you cache that error. Future queries that match semantically will receive the wrong answer without hitting the LLM again. The error propagates and compounds. You need quality validation before caching — either confidence scoring from the model or periodic sampling against fresh responses to detect drift.
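A minimal write-side gate might look like this sketch. `logprob_avg` stands in for whatever confidence signal your model exposes (it is a hypothetical parameter, not a standard API field), and the audit flag marks a random sample of entries for later re-checking against a fresh response:

```python
import random

def cache_if_trustworthy(query: str, response: str, logprob_avg: float,
                         cache: dict, min_logprob: float = -0.5,
                         sample_rate: float = 0.05) -> str:
    """Gate writes into the cache: skip low-confidence generations, and tag
    a random sample for periodic comparison against fresh responses."""
    if logprob_avg < min_logprob:
        return "skipped"  # too uncertain to memoize
    cache[query] = {
        "response": response,
        "audit": random.random() < sample_rate,  # re-check this entry later
    }
    return "cached"

store = {}
print(cache_if_trustworthy("q1", "answer", logprob_avg=-0.2, cache=store))  # cached
print(cache_if_trustworthy("q2", "answer", logprob_avg=-1.3, cache=store))  # skipped
```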
Stale data with no expiration logic: A cached response that was correct yesterday may be wrong today. Product pricing, policy details, availability — these change. A TTL-only invalidation strategy fails when changes are irregular. The better approach: attach metadata to cached entries indicating what factual domains they depend on, and invalidate by domain when underlying data changes.
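A sketch of domain-tagged invalidation, assuming an in-memory store; a production version would persist the tags alongside the entries:

```python
class DomainTaggedCache:
    """Each entry records the factual domains it depends on, so a change
    in, say, pricing invalidates every dependent entry at once."""

    def __init__(self):
        self.entries = {}  # key -> (response, set of domains)

    def put(self, key, response, domains):
        self.entries[key] = (response, set(domains))

    def get(self, key):
        entry = self.entries.get(key)
        return entry[0] if entry else None

    def invalidate_domain(self, domain):
        """Drop every entry that depends on the changed domain."""
        stale = [k for k, (_, doms) in self.entries.items() if domain in doms]
        for k in stale:
            del self.entries[k]
        return len(stale)

cache = DomainTaggedCache()
cache.put("q1", "Pro plan costs $20/mo", {"pricing"})
cache.put("q2", "Support hours are 9-5", {"policy"})
print(cache.invalidate_domain("pricing"))  # 1
print(cache.get("q1"))                     # None
```

The hook that calls `invalidate_domain` belongs in your content-update pipeline, so a pricing change purges dependent entries at publish time rather than at TTL expiry.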
Streaming breakage: If your application uses streaming responses (tokens appearing progressively), naive semantic caching breaks the contract. Returning a pre-cached full response as a stream requires a different code path — you're replaying stored content character by character rather than consuming a live token stream. Teams that add semantic caching to an existing streaming pipeline and don't account for this introduce subtle UX regressions.
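One way to keep the streaming contract intact is to replay the cached text through the same async-iterator interface a live token stream uses, so downstream code needs no special case. Chunk size and pacing here are arbitrary illustrations:

```python
import asyncio

async def replay_cached_stream(cached_text: str, chunk_size: int = 8,
                               delay: float = 0.0):
    """Replay a stored full response through the async-iterator contract
    that a live token stream satisfies."""
    for i in range(0, len(cached_text), chunk_size):
        if delay:
            await asyncio.sleep(delay)  # optional pacing to mimic generation
        yield cached_text[i:i + chunk_size]

async def consume():
    return [c async for c in replay_cached_stream("cached answer text",
                                                  chunk_size=6)]

print(asyncio.run(consume()))  # ['cached', ' answe', 'r text']
```

Whether to pace the replay is a UX decision: instant delivery reveals the hit, while artificial delay wastes the latency you just saved.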
Context window contamination: In multi-turn conversations, the user and the assistant accumulate shared session state. The same query in turn 3 might need a completely different response than in turn 7, depending on what happened in between. Caching based on the standalone query text while ignoring conversation context produces responses that are semantically correct for the query in isolation but contextually wrong for the user.
This last one is the reason semantic caching fundamentally doesn't belong in conversational AI flows without major architectural work. The cache key needs to represent the full context, not just the current turn — and at that point, the probability of a cache hit collapses toward zero.
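For completeness, here is what a context-aware key looks like, and why it defeats the cache: fold every prior turn into the hash, and almost no two conversations collide:

```python
import hashlib

def context_cache_key(turns: list[str], current_query: str) -> str:
    """Fold the whole conversation into the key. Correct, but nearly every
    distinct history yields a distinct key, so hit probability collapses."""
    h = hashlib.sha256()
    for turn in turns:
        h.update(turn.encode())
        h.update(b"\x00")  # separator so ["ab","c"] differs from ["a","bc"]
    h.update(current_query.encode())
    return h.hexdigest()

keys = (
    context_cache_key(["cancel my card"], "what are my options?"),
    context_cache_key(["upgrade my plan"], "what are my options?"),
)
print(keys[0] != keys[1])  # True — same question, different keys
```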
When to Skip It Entirely
Semantic caching makes sense when:
- Your queries come from a bounded domain (support, FAQ, classification)
- Users consistently rephrase the same set of underlying questions
- Responses aren't user-specific or context-dependent
- You have infrastructure to monitor hit rate and quality
Semantic caching doesn't belong in your stack when:
- Queries are generated dynamically with user-specific context
- Your application uses multi-turn conversation state
- Responses need to reflect real-time data (pricing, inventory, live events)
- You're caching tool call results in an agentic pipeline where prior tool outputs affect the current query's correct answer
- Query volume is too low to amortize infrastructure costs
A quick decision test: if you look at 100 queries from the past week and fewer than 20 are semantically similar to another query in that set, semantic caching will have a negligible impact on your costs.
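The decision test can be approximated offline. This sketch uses token-set Jaccard overlap as a crude stand-in for embedding similarity (a real check would embed each query and compare cosine scores); the 0.3 threshold is calibrated to this proxy, not to cosine similarity:

```python
def token_jaccard(a: str, b: str) -> float:
    """Crude lexical proxy for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def similar_fraction(queries: list[str], threshold: float = 0.3) -> float:
    """Fraction of queries with at least one near-duplicate in the sample."""
    flagged = 0
    for i, q in enumerate(queries):
        if any(token_jaccard(q, other) >= threshold
               for j, other in enumerate(queries) if j != i):
            flagged += 1
    return flagged / len(queries) if queries else 0.0

sample = [
    "how do i reset my password",
    "steps to reset my password",
    "what is your refund policy",
    "tell me a joke",
]
print(similar_fraction(sample))  # 0.5 — half the sample has a near-duplicate
```

Run this over the 100 queries from the decision test; a fraction below 0.2 is the signal to skip semantic caching.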
A Practical Deployment Pattern
If you've decided semantic caching is the right fit, the two-layer architecture reduces risk:
Start with exact caching only for the first month. Instrument your hit rate, latency, and response quality. Understand what your actual duplicate traffic looks like before introducing semantic matching.
Add semantic caching in shadow mode: compute similarity scores and log would-be hits, but don't return cached responses. After two weeks of shadow traffic, analyze the would-be hits manually. What fraction are genuinely the same question? What fraction are superficially similar but require different answers? This calibration step will tell you the right threshold for your specific workload far more reliably than benchmarks.
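Shadow mode amounts to logging the would-be hit and always falling through to the LLM. In this sketch the 0.85 logging threshold is deliberately loose so the log captures borderline matches for manual review:

```python
import time

SHADOW_THRESHOLD = 0.85  # log generously; calibrate the live threshold later

shadow_log = []

def record_shadow_hit(query: str, matched_query: str, similarity: float):
    """Log the would-be cache hit, but never serve from cache in shadow mode."""
    if similarity >= SHADOW_THRESHOLD:
        shadow_log.append({
            "ts": time.time(),
            "query": query,
            "matched": matched_query,
            "similarity": round(similarity, 4),
        })
    return None  # caller always proceeds to the real LLM call

record_shadow_hit("reset my password", "how do i reset my password?", 0.94)
record_shadow_hit("tell me a joke", "reset my password", 0.41)
print(len(shadow_log))  # 1
```

The manual review then sorts the logged pairs by similarity band, which shows directly where genuine paraphrases stop and superficial matches begin.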
When you go live, instrument for quality, not just hits. Track a metric called "corrected cache rate" — the fraction of cache hits that were later corrected by a human or detected as wrong by your evaluation pipeline. If this number exceeds 1%, your threshold is too aggressive.
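The metric itself is a one-liner; the counts below are illustrative:

```python
def corrected_cache_rate(cache_hits: int, corrected_hits: int) -> float:
    """Fraction of cache hits later flagged wrong by humans or eval pipelines."""
    return corrected_hits / cache_hits if cache_hits else 0.0

rate = corrected_cache_rate(cache_hits=4_000, corrected_hits=60)
print(rate, "threshold too aggressive" if rate > 0.01 else "ok")
# 0.015 threshold too aggressive
```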
Finally, plan for invalidation. Every cached entry should have a TTL even if it's long. Create invalidation hooks tied to your content update events. Build a way to purge by embedding model version. These feel like over-engineering until the moment they aren't.
Semantic caching is the right tool for a specific class of problem: high-volume, repeated, closed-domain queries where the cost per token is the binding constraint. For most LLM applications in production today — RAG pipelines, conversational assistants, agentic workflows — it's either ineffective or actively dangerous without careful instrumentation. The infrastructure complexity is real; size the benefit against your actual workload before committing to it.
