Semantic Caching for LLMs: The Cost Tier Most Teams Skip

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM applications know about prompt caching — the prefix-reuse mechanism that API providers offer to discount repeated input tokens. Far fewer have deployed the layer above it: semantic caching, which eliminates LLM calls entirely for queries that mean the same thing but are phrased differently. The gap isn't laziness; it's a widespread misunderstanding of what "95% accuracy" means in semantic caching vendor documentation.

That 95% figure refers to match correctness on cache hits, not to how often the cache actually gets hit. Real production hit rates range from 10% for open-ended chat to 70% for structured FAQ systems — and the math that determines which side of that range you're on should happen before you write any cache code.

What Semantic Caching Actually Is

Traditional caching operates on exact string matches. Hash "What's the capital of France?" with SHA-256 and you get a cached response only if someone sends that exact string again. This works for structured database queries and API calls, but user-facing LLM applications don't repeat exact strings — they repeat intent.

Semantic caching operates at the intent level. The pipeline works as follows: incoming queries are converted to vector embeddings, those embeddings are searched against a vector store of previously cached query-embedding pairs using cosine similarity, and if the similarity score exceeds a configured threshold, the cached response is returned without any LLM call. Below threshold, the LLM is called, the response is stored, and the query embedding is indexed for future matches.
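The pipeline above can be sketched in a few dozen lines. This is a minimal in-memory version with a linear scan standing in for a real vector index; the `embed_fn` and `llm_fn` callables are assumptions (your embedding model and inference call), and 0.88 is just the starting threshold discussed later.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, embed_fn, llm_fn, threshold=0.88):
        self.embed_fn = embed_fn   # query text -> embedding vector
        self.llm_fn = llm_fn       # query text -> response (full inference)
        self.threshold = threshold
        self.entries = []          # list of (embedding, response)

    def query(self, text):
        emb = self.embed_fn(text)
        # Linear scan stands in for a real vector store (FAISS, Redis, etc.)
        best_score, best_resp = -1.0, None
        for stored_emb, resp in self.entries:
            score = cosine(emb, stored_emb)
            if score > best_score:
                best_score, best_resp = score, resp
        if best_score >= self.threshold:
            return best_resp, "hit"   # no LLM call on a hit
        # Miss: call the LLM, then index the new pair for future matches
        resp = self.llm_fn(text)
        self.entries.append((emb, resp))
        return resp, "miss"
```

A real deployment swaps the list for a vector database and batches embeddings, but the hit/miss control flow is exactly this.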

"How do I reset my password?" and "I forgot my password, what do I do?" and "Can't log in, help" all hash to different strings but can match the same semantic cache entry. The vector search adds 5–20ms overhead but eliminates the 1–5 second LLM round trip on hits.

This is categorically different from prompt/prefix caching, which operates inside the LLM inference layer. The two are complementary, not competing, and understanding the boundary is essential for reasoning about cost.

The Two Caching Layers

Prompt/prefix caching (offered by OpenAI and Anthropic) works by reusing the computed key-value attention states for repeated prompt prefixes. If you have a 100,000-token system prompt, the LLM processes it once and caches the result. Subsequent requests with the same prefix skip reprocessing those tokens — you still make an LLM API call, you still generate output tokens, but input token costs drop 50–90%.

Semantic caching eliminates the LLM call entirely on a hit. The response was generated in a previous request; the cache just returns it. No inference happens.

The difference in failure modes follows from this. A stale prefix cache returns a correct answer generated from old context — the LLM is still reasoning freshly. A stale semantic cache returns the old answer verbatim. That distinction becomes critical when you start thinking about time-sensitive content, personalization, and what "cache invalidation" actually requires.

A sensible multi-layer architecture looks like:

Request → [Exact hash match] → [Semantic similarity match] → [Prefix cache] → Full LLM inference

Exact hash matching is free to implement and typically covers ~18% of real production traffic (users and integrations that repeat identical requests). Semantic caching extends coverage to paraphrases and near-duplicates. Prefix caching reduces costs on the remaining LLM calls by reusing shared context. Teams that implement all three typically see 70–80%+ reduction in effective token spend compared to naive inference on every request.
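The layered lookup reads naturally as a cheapest-first fallthrough. A sketch, assuming a dict for the exact layer and a semantic cache object with hypothetical `match`/`store` methods; prefix caching needs no application code because it happens inside the provider's API:

```python
import hashlib

def lookup(query, exact_cache, semantic_cache, llm_with_prefix_cache):
    """Check caches cheapest-first; fall through to inference on misses.

    exact_cache: dict mapping sha256(query) -> response
    semantic_cache: object with match(query) -> response-or-None and store()
    llm_with_prefix_cache: callable running full inference (the provider
    applies prefix caching transparently on repeated prompt prefixes)
    """
    key = hashlib.sha256(query.encode()).hexdigest()
    if key in exact_cache:                    # layer 1: free, exact match
        return exact_cache[key]
    hit = semantic_cache.match(query)         # layer 2: 5-20ms vector search
    if hit is not None:
        return hit
    response = llm_with_prefix_cache(query)   # layers 3-4: paid inference
    exact_cache[key] = response               # index for both cache layers
    semantic_cache.store(query, response)
    return response
```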

The Hit Rate Reality Check

The number you see in most semantic caching vendor content — "95%" — is match accuracy: the fraction of cache hits that return a correct response. It is not the fraction of requests that hit the cache.

Actual production hit rates vary enormously by use case:

  • FAQ and customer support applications: 40–70%
  • EdTech and tutoring platforms: ~45%
  • Classification and intent routing tasks: 40–60%
  • General RAG pipelines: 18–60% (wide range, highly traffic-dependent)
  • Open-ended conversational chat: 10–20%
  • Code generation: 5–20%

The distribution of your traffic determines your ceiling. Analysis of production query logs across multiple systems shows roughly 18% of requests are exact duplicates, and 60–70% of real queries are genuinely unique. If your application sits in that upper band — developers asking novel coding questions, users exploring varied research topics — semantic caching may never pay for itself.

The right starting point is a week of query logs. Embed your queries, run pairwise similarity, and count how many fall above 0.85. That number is your theoretical ceiling. If it's 20%, your realistic ceiling after threshold tuning is closer to 15%.
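That log analysis is a short script. A sketch of the ceiling measurement, assuming your queries are already embedded (the O(n²) pairwise scan is fine for a week of logs sampled down to a few thousand queries):

```python
import math

def hit_rate_ceiling(embeddings, threshold=0.85):
    """Fraction of queries with at least one EARLIER query above `threshold`
    cosine similarity -- i.e. the queries that would have been cache hits
    if every prior query had been cached. This is the theoretical ceiling."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    hits = 0
    for i, q in enumerate(embeddings):
        if any(cosine(q, prior) >= threshold for prior in embeddings[:i]):
            hits += 1
    return hits / len(embeddings) if embeddings else 0.0
```

Comparing each query only against earlier ones (rather than all pairs) mirrors how a cache actually fills: the first occurrence of an intent is always a miss.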

At $5,000/month in LLM API spend, a 20% hit rate saves roughly $1,000/month. A 45% hit rate saves $2,250/month. Vector database and embedding infrastructure costs typically run under 5% of the savings. The break-even point is usually reached within a few weeks for applications with repetitive traffic.
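The break-even arithmetic is simple enough to encode directly. A sketch using the figures above; `infra_cost_ratio` defaults to the "under 5% of savings" estimate from the text:

```python
def monthly_savings(llm_spend, hit_rate, infra_cost_ratio=0.05):
    """Net monthly savings from a semantic cache.

    llm_spend: current monthly LLM API spend in dollars
    hit_rate: fraction of requests served from cache
    infra_cost_ratio: vector DB + embedding costs as a fraction of gross
    savings (the text's 'typically under 5%' figure)
    """
    gross = llm_spend * hit_rate
    return gross * (1 - infra_cost_ratio)
```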

Threshold Selection and the Grey Zone Problem

The similarity threshold is the most consequential configuration decision in semantic caching, and it has no single right answer.

Common guidance suggests starting at 0.88 cosine similarity and tuning from there. The intuition is sound, but the underlying geometry creates a problem that threshold tuning alone cannot solve.

Similarity score distributions for correct cache hits (where the query really is equivalent to the cached entry) and incorrect cache hits (where the query is adjacent but genuinely different) overlap heavily between roughly 0.85 and 0.92. In this range, no single threshold cleanly separates paraphrases from distinct intents. Increasing the threshold to 0.95 cuts incorrect hits but also misses valid paraphrases. Decreasing to 0.85 catches more paraphrases but introduces wrong answers.

Several practical responses to this:

Domain-specific embeddings shrink the grey zone. Generic embedding models (all-MiniLM-L6-v2, text-embedding-ada-002) achieve 64–78% precision at standard thresholds. Models fine-tuned on domain-specific query pairs achieve 84–92% precision on the same thresholds, while also achieving lower embedding latency. The investment in training or adopting a domain-specific model is usually worthwhile for any production system with meaningful traffic.

LLM reranking adds a confidence layer. For high-stakes use cases, route the top semantic candidates through a cheap model (GPT-4o-mini or similar) that explicitly judges whether the cached query and the incoming query are equivalent before returning the cached response. This adds ~100ms and a small marginal cost, but eliminates the grey zone for the fraction of queries that land in it.

Tiered thresholds by response type. Use stricter thresholds (0.92–0.97) for factual queries where wrong answers cause harm, and more permissive thresholds (0.85–0.90) for FAQ or support queries where the cost of a slightly imprecise answer is low. AWS's verified cache approach uses three tiers: above 80% similarity returns the cached answer directly; 60–80% uses the cached answer as a few-shot example but still calls the LLM; below 60% falls through to standard inference.
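The three-tier routing described above is a small piece of control flow. A sketch after AWS's verified-cache pattern; `llm_fn` is a hypothetical inference call that accepts optional few-shot examples, and the 0.80/0.60 cut points are the tiers from the text, not universal constants:

```python
def route(similarity, cached_answer, llm_fn, query):
    """Three-tier verified-cache routing.

    >= 0.80: return the cached answer directly, no inference.
    0.60-0.80: still call the LLM, but seed it with the cached answer
               as a few-shot example (the grey zone).
    <  0.60: standard inference with no cache assistance.
    """
    if similarity >= 0.80:
        return cached_answer
    if similarity >= 0.60:
        return llm_fn(query, examples=[cached_answer])
    return llm_fn(query, examples=[])
```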

Cache Invalidation: The Four Hard Problems

The maxim about cache invalidation being one of the two hard problems in computer science is more literal for semantic caches than for most systems.

Time-sensitive content. Any response that was correct yesterday may be wrong today — prices, availability, news, model behavior after a system prompt change. TTL-based expiration is the standard approach: 15–30 minutes for real-time data, hours for business data, days-to-weeks for stable reference content. But TTLs require you to know the freshness window in advance, and real content doesn't divide cleanly into categories.
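A TTL policy keyed by content type is the usual encoding of those freshness windows. A sketch using the tiers above; the category names and exact durations are illustrative:

```python
import time

# Freshness windows from the tiers above (illustrative values)
TTL_SECONDS = {
    "realtime": 15 * 60,      # prices, availability: 15-30 minutes
    "business": 6 * 3600,     # business data: hours
    "reference": 14 * 86400,  # stable reference content: days-to-weeks
}

def is_fresh(entry, now=None):
    """entry: {'stored_at': epoch seconds, 'content_type': TTL_SECONDS key}.
    Expired entries are treated as cache misses and re-fetched."""
    now = time.time() if now is None else now
    return now - entry["stored_at"] < TTL_SECONDS[entry["content_type"]]
```

The weakness the text notes applies here directly: this only works if you can classify a response's content type at store time, which real content often resists.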

Personalization. Semantic caches are global by default. If two users ask "what's my account balance?", their embeddings will be nearly identical, and without scoping, the second user gets the first user's answer. The fix is to include tenant or user identifiers as metadata filters on cache lookups — supported in most production implementations with negligible overhead. The architectural implication is that personalized responses should not be cached globally, which immediately shrinks the cacheable fraction of your traffic.
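The metadata-filter fix looks like this in miniature. A sketch with a list of dicts standing in for the vector store; production stores express the same filter as a metadata predicate on the search call:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def scoped_lookup(entries, query_emb, tenant_id, threshold=0.88):
    """entries: list of {'emb': [...], 'tenant': str, 'response': str}.
    Only entries matching tenant_id are eligible -- the filter that keeps
    one user's cached answer from leaking to another."""
    candidates = [e for e in entries if e["tenant"] == tenant_id]
    best = max(candidates, key=lambda e: cosine(query_emb, e["emb"]),
               default=None)
    if best is not None and cosine(query_emb, best["emb"]) >= threshold:
        return best["response"]
    return None
```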

Embedding model upgrades. Old embeddings and new embeddings are incommensurable. When you upgrade your embedding model, similarity scores computed against old embeddings using a new model are meaningless — a 0.9 similarity score no longer means what it used to mean. The entire cache must be invalidated or versioned when the embedding model changes. This is consistently cited by practitioners as the failure mode teams discover the hard way after an upgrade causes a spike in wrong cached responses.
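The versioned-cache variant of that invalidation is straightforward: key the index by embedding-model version so an upgrade starts cold instead of comparing incommensurable vectors. A minimal sketch:

```python
class VersionedCache:
    """One index per embedding-model version. An upgrade simply stops
    reading the old index rather than mixing old and new vectors."""

    def __init__(self, model_version):
        self.model_version = model_version
        self.indexes = {}  # version -> list of (embedding, response)

    def index(self):
        # Lookups and stores always go through the current version's index
        return self.indexes.setdefault(self.model_version, [])

    def upgrade(self, new_version, drop_old=True):
        if drop_old:
            # Old embeddings are meaningless under the new model
            self.indexes.pop(self.model_version, None)
        self.model_version = new_version
```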

Cached hallucinations. If the original LLM response was incorrect, every future query semantically similar to that original query will receive the same wrong answer. The cache doesn't just preserve correct responses — it amplifies errors. The practical mitigations are: apply quality gates before caching (minimum response length, format validation, confidence scores for structured outputs), implement user feedback loops that trigger eviction on flagged responses, and never cache responses that failed validation.

Security: Cache Poisoning at Scale

Semantic caching introduces a security surface that purely server-side caching doesn't have: an adversary who can influence what gets cached can poison future responses across many users.

Research demonstrated a cache poisoning attack where crafted adversarial prompts cause the LLM to generate malicious responses, which are then indexed in the semantic cache. Because semantic matching is fuzzy, subsequent legitimate queries that are semantically similar to the injected prompt retrieve the malicious cached response. In agentic scenarios where tools are invoked based on cached content, this attack achieved a 90.6% hit rate for injected responses.

The defense-in-depth approach involves: validating cached responses against allowed formats and content policies before storing; applying output boundary enforcement (don't cache raw LLM outputs that haven't passed content filtering); scoping cache entries with user/tenant metadata to limit blast radius; and monitoring cache hit patterns for anomalies (sudden spikes in hits for specific query clusters can indicate injection).

When Semantic Caching Doesn't Pay Off

The specific workloads where semantic caching reliably fails to justify the operational complexity:

Open-ended conversational chat. Users in an ongoing conversation are generating genuinely novel queries at each turn. Prior conversation context means that two queries with identical text can require different responses. Cache hit rates of 10–20% in this category rarely cover infrastructure and maintenance costs.

Code generation. Developer queries are often long, specific, and unique. Even when they're topically similar ("write a Python function that..."), the required outputs differ enough that semantic matching produces incorrect responses at any reasonable threshold.

Complex multi-hop reasoning. Questions that require integrating multiple pieces of information rarely have semantically equivalent prior queries, and cached responses for similar-seeming questions are frequently wrong on the specific details that matter.

Rapidly evolving domains. If your underlying content changes frequently enough that most cached responses have short effective lifetimes, you're paying the operational cost of the cache without capturing much of its benefit.

The correct decision process is: measure your traffic first, calculate the theoretical hit rate ceiling, model the cost savings at that ceiling, subtract infrastructure and maintenance costs, and only then decide whether to build. For many applications, prompt/prefix caching alone — which is zero operational overhead at the application layer — achieves 50–90% of the cost reduction that semantic caching would add.

A Practical Starting Architecture

If your traffic analysis shows a meaningful fraction of semantically similar queries, the minimal viable implementation is:

A lookup path: embed the incoming query, search the vector store with a cosine threshold around 0.88, apply metadata filters for tenant/user scope, return the cached response on a hit.

A store path: after receiving an LLM response, run it through quality validation, then store the response along with the embedded query, a TTL appropriate to the content type, and relevant metadata.

Don't start with a custom vector store. GPTCache, Redis LangCache, and Upstash Semantic Cache each provide production-ready implementations that handle the embedding, storage, and retrieval plumbing. The engineering investment should go into threshold calibration, metadata scoping, and TTL policy — the parts that can't be solved by a library.

Measure hit rate, false positive rate (cache hits that returned a wrong response), and latency distribution separately. A system with 40% hit rate and 5% false positives is probably worth running. A system with 15% hit rate and 8% false positives is probably not, regardless of what the vendor documentation says about accuracy.
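Those three metrics fall out of a single pass over request events. A sketch, assuming each request is logged as a `(hit, wrong, latency_ms)` tuple where `wrong` marks cache hits that returned an incorrect response:

```python
def cache_metrics(events):
    """events: list of (hit: bool, wrong: bool, latency_ms: float).
    Returns hit rate, false-positive rate (wrong answers among hits),
    and p50/p95 latency -- the three numbers to track separately."""
    hits = [e for e in events if e[0]]
    hit_rate = len(hits) / len(events) if events else 0.0
    fp_rate = sum(1 for e in hits if e[1]) / len(hits) if hits else 0.0
    lat = sorted(e[2] for e in events)
    p50 = lat[len(lat) // 2]
    p95 = lat[min(len(lat) - 1, int(len(lat) * 0.95))]
    return {"hit_rate": hit_rate, "false_positive_rate": fp_rate,
            "p50_ms": p50, "p95_ms": p95}
```

Keeping the false-positive rate separate from hit rate is the point: a rising hit rate with a rising false-positive rate usually means the threshold drifted too low, not that the cache got better.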
