
Semantic Cache Is a Safety Problem, Not a Perf Win

12 min read
Tian Pan
Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative (there's a "70% bill reduction" deck floating around every AI engineering Slack) and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that Jane's "I can't log into my account, my email is …" and Bob's otherwise-identical query both vectorized within threshold of "I can't log into my account," and the cache cheerfully served Bob the response originally generated for Jane, including the password-reset link she had requested.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."

Why Embedding Similarity Is the Wrong Equivalence Class

A traditional cache key is a hash. Two inputs are "the same" iff they hash to the same value, and good hash functions guarantee that small input changes produce uncorrelated outputs. That avalanche property is what makes hash-based caching safe: nothing about the cache lookup mechanism can confuse two semantically distinct queries unless they are byte-identical.

Semantic caching deliberately throws this property away. The whole point is that "How do I reset my password?" and "I forgot my password" should hit the same cache entry. The lookup mechanism rewards locality — small input changes produce small embedding changes — and a similarity threshold (typically 0.85–0.95 cosine) decides what counts as "close enough." This is incompatible with the cryptographic notion of collision resistance. A 2026 paper formalized exactly this trade-off and showed that an attacker can craft inputs that sit on the razor's edge of the threshold: semantically distinct enough to carry a different intent, mathematically similar enough to hit a target query's cache entry. Their CacheAttack framework reports 86% success rates on response hijacking against major industrial implementations.
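
To make the contrast concrete, here is a toy sketch of the two equivalence tests side by side. The vectors are hand-written stand-ins for real embeddings and the threshold is illustrative; nothing here is a specific library's API.

```python
import hashlib
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Plain cosine similarity; stands in for whatever the vector store computes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

q1 = "How do I reset my password?"
q2 = "I forgot my password"

# Hash equivalence: any byte difference produces an uncorrelated digest,
# so two distinct queries can only share an entry if they are byte-identical.
print(hashlib.sha256(q1.encode()).hexdigest()[:16])
print(hashlib.sha256(q2.encode()).hexdigest()[:16])  # unrelated to the first

# Semantic equivalence: nearby meanings produce nearby vectors, and the
# threshold decides what counts as "the same question."
e1 = [0.71, 0.70, 0.05]  # pretend embedding of q1
e2 = [0.69, 0.72, 0.06]  # pretend embedding of q2, nearby by construction
THRESHOLD = 0.90         # illustrative; deployments typically sit in 0.85-0.95

if cosine(e1, e2) >= THRESHOLD:
    print("cache hit: both queries resolve to the same entry")
```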

The point isn't that adversarial inputs are the main risk. The main risk is benign collisions — ordinary users, no malice, asking similar questions whose answers depend on context the cache key didn't capture. Authorization scope is context. Tenant identity is context. User personalization is context. None of it lives inside the embedding of the natural-language query, because the embedding doesn't know any of those things exist. The cache key is what the embedding thinks the user asked. The right answer depends on what the user is allowed to know.

The Failure Modes That Don't Show Up in Hit-Rate Dashboards

Cache observability is built around hit rate, latency, and cost saved. None of those metrics will alert on a cross-user leak. A leak is a 0.5ms cache hit that returns a stored response. The dashboard turns green.

Here are the leak shapes I have personally watched reach production, across semantic-cache deployments at three different teams:

Authorization scope collapse. A query like "show me my open tickets" gets cached with the response — the actual ticket list — keyed only on the query's embedding. The next user to ask "show me my open tickets" hits the cache and sees the first user's tickets. This sounds too dumb to ship, and yet I have seen it ship twice, both times because the team was treating semantic cache as "Redis with vectors" and forgot that the response was personalized.

Personalization spillover. "Recommend a workout" returns a cached response that mentions the previous user's age, injury history, and equipment. The query embedding doesn't know any of that; the response leaks it. This one is hard to detect because both queries are legitimately the same intent — the cache should match. The bug is that the response shouldn't have been cached, or it should have been cached with the personalization variables stripped.

Entitlement bypass. A free-tier user asks "summarize this earnings call" and gets back the response that was cached for a paid-tier user yesterday — the one that included the proprietary analyst commentary the free tier doesn't have access to. The cache happily serves it because access tier wasn't part of the key. The user got a feature upgrade for free and the product team's metrics showed unusual engagement on the free tier that nobody connected back to the cache.

Stale-after-revoke. User A shares a document with User B, then revokes access. User B asks a question about the document; the cached answer from before the revoke is served. The retrieval layer correctly excluded the document from the new query — but the cache short-circuited retrieval entirely. The user can read content they no longer have access to until the cache entry expires.

Cross-tenant linguistic collision. Tenant Acme and Tenant Globex both ask "what's our Q4 revenue forecast?" The embeddings of those two queries are nearly identical because the natural-language form is identical, even though the documents, models, and answers behind them are completely different. With a global cache and no tenant prefix, one tenant's forecast number is served to the other.
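
A compressed sketch of that last failure and its structural fix. The dict-backed store and exact-match lookup are toy stand-ins for a vector store and its similarity search; the point is only where the namespace boundary sits.

```python
# namespace -> {query -> cached response}; a real store would hold embeddings
# and do ANN search, but only ever inside one namespace.
cache: dict[str, dict[str, str]] = {}

def store(namespace: str, query: str, response: str) -> None:
    cache.setdefault(namespace, {})[query] = response

def lookup(namespace: str, query: str) -> str | None:
    # Exact match stands in for the similarity search within the namespace.
    return cache.get(namespace, {}).get(query)

q = "what's our Q4 revenue forecast?"

# Global namespace: Acme writes, Globex reads. This is the leak.
store("global", q, "Acme internal: $41M")
print(lookup("global", q))          # Globex receives Acme's number

# Tenant-prefixed namespaces: Globex misses by construction.
store("tenant:acme", q, "Acme internal: $41M")
print(lookup("tenant:globex", q))   # None -> miss, regenerate for Globex
```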

Every one of these is invisible in the standard cache dashboard. They show up in support tickets, compliance audits, and — in the worst cases — the news.

Cache Keys Are Authorization Predicates in Disguise

The fix is structural, not statistical. A semantic cache key is a contract about which other queries can hit this entry. That contract should be designed with the same scrutiny as a SQL WHERE clause that filters by user ID, because functionally that's what it is.

The minimum viable safe key for a semantic cache in any multi-user system has three required components and one optional one:

  1. Tenant ID as a mandatory hard prefix. Not part of the embedding — a literal string prepended to the cache namespace. Cross-tenant queries must miss the cache by construction, never by similarity score. If you're tempted to share an entry across tenants, write it down in a design doc and have someone who isn't on the perf team read it.
  2. Authorization scope as a mandatory hard prefix. Roles, entitlements, and feature flags that affect the response shape. Free tier and paid tier are different namespaces. Admin and member are different namespaces. A response generated under one set of permissions cannot be served under another, period.
  3. User ID as a mandatory hard prefix for any response that can include user-specific content. The default should be per-user partitioning; promotion to a shared partition is an explicit, reviewed decision that requires the response to be provably user-agnostic.
  4. Prompt version, model version, and retrieval index version as suffix components. This isn't a safety property, but it interacts: when you upgrade the embedding model, every existing entry is in a different vector space and lookups silently degrade. Versioning forces a controlled rollover instead of slow corruption.

The semantic similarity match happens inside the partition defined by the hard prefixes, never across it. This is the single most important inversion of intuition: the cache is many small per-scope caches that happen to share infrastructure, not one big cache with optional scoping. Vendors that frame the scope as a vary-by filter or an optional metadata tag have it backwards — scope is the namespace, similarity is the lookup within the namespace.
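
Here is a minimal sketch of that key contract. The field names and namespace layout are illustrative, not any particular library's API; the one load-bearing property is that the similarity search receives the namespace string, so it can never see vectors outside it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CachePartition:
    tenant_id: str       # hard prefix 1: cross-tenant hits impossible by construction
    auth_scope: str      # hard prefix 2: e.g. "tier=paid,role=member"
    user_id: str | None  # hard prefix 3: None only for reviewed, user-agnostic entries
    model_version: str   # suffix components: force controlled rollover on upgrades,
    prompt_version: str  # not a safety property in themselves
    index_version: str

    def namespace(self) -> str:
        user = self.user_id or "shared"
        return (f"{self.tenant_id}/{self.auth_scope}/{user}/"
                f"{self.model_version}/{self.prompt_version}/{self.index_version}")

p = CachePartition("acme", "tier=paid,role=member", "user-7",
                   "emb-v3", "prompt-v12", "idx-2026-01-15")

# The similarity search happens only inside the namespace, e.g. (hypothetical API):
#   candidates = vector_store.search(p.namespace(), query_embedding, k=1)
# A query from another tenant, tier, or user resolves to a different namespace
# string, so it cannot even see this partition's vectors.
print(p.namespace())
```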

Entitlement-Aware Invalidation Is Not Optional

Even with perfect cache keys, entries grow stale in ways that traditional caches never had to handle. Three triggers that need wiring up before semantic caching ships:

Permission change. Whenever a user's role, entitlement, or document access changes, every cache entry whose key prefix includes that user — and every entry generated under their now-revoked permissions — must be invalidated. In practice this means cache entries need to record their authorization fingerprint at write time so invalidation can target them later. Most semantic cache libraries do not do this; they store the embedding and the response and nothing else.

Document or knowledge-base change. RAG-backed responses are bound to a specific snapshot of the retrieval corpus. When a source document is edited, deleted, or reclassified, every cached response derived from it is stale — possibly dangerously so if the change was a redaction. Provenance metadata at cache-write time (which document IDs and versions contributed to this response) makes this tractable. Without it, the only safe option is a wholesale flush, which destroys your hit rate.

Model change. A new model version may have different safety boundaries, different refusal patterns, and different factual knowledge. Cached responses generated under the old model don't get re-evaluated by the new one's policy. If the new model would have refused the query, the cache is now serving a response the current model wouldn't produce — which is exactly the kind of inconsistency that gets caught by an external red-team and not by your eval suite.

The pattern that makes this manageable is a write-time provenance envelope: every cached entry stores not just (query_embedding, response) but (query_embedding, response, tenant_id, user_id, auth_scope, doc_versions, model_version, prompt_version, generated_at). Invalidation becomes a query against the envelope: "delete every entry where user_id = X," "where doc_versions ∋ Y," "where model_version < Z." Without the envelope, you have no precise invalidation, only TTL — and TTL is the wrong tool for "this user just got fired and their cached responses contain confidential strategy."
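
A sketch of the envelope and those three invalidation queries, assuming a backing store that can scan or index the envelope fields. The list-backed store and predicate scan here are toy stand-ins; the field names follow the tuple above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CacheEntry:
    query_embedding: list[float]
    response: str
    tenant_id: str
    user_id: str
    auth_scope: str
    doc_versions: dict[str, int]  # doc_id -> version that fed this response
    model_version: str
    prompt_version: str
    generated_at: float = field(default_factory=time.time)

entries: list[CacheEntry] = []  # toy stand-in for the real store

def invalidate(predicate: Callable[[CacheEntry], bool]) -> int:
    # Delete every entry matching the predicate; returns how many were removed.
    global entries
    before = len(entries)
    entries = [e for e in entries if not predicate(e)]
    return before - len(entries)

# "This user's permissions just changed": target by identity, not TTL.
invalidate(lambda e: e.user_id == "user-7")

# "doc-42 was edited or redacted": target by provenance.
invalidate(lambda e: "doc-42" in e.doc_versions)

# "We rolled the model": everything not generated by the current version is stale.
invalidate(lambda e: e.model_version != "emb-v3")
```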

The Audit Trail That Distinguishes Hits From Leaks

The other half of the problem is that current semantic-cache observability cannot tell you whether a cache hit was correct. A "hit" event in most implementations records: query embedding, matched entry's embedding, similarity score, latency. None of that tells you whether the response was appropriate for the requesting user.

The audit log a security review will actually accept needs three additional fields per hit (a sketch of one such record follows the list):

  • Authorization decision diff. What scope did the requesting user have? What scope was the cached entry generated under? If they differ at all, the hit is suspect by default and should be either blocked, re-validated by calling the model, or flagged for review.
  • Provenance reconciliation. Are the documents the entry was generated from still readable by this user? Are they still the current version? A hit that bypasses retrieval should still log what would have been retrieved, so divergence is visible.
  • Similarity-band classification. A hit at 0.99 cosine is qualitatively different from a hit at 0.86. The latter is operating in the band where adversarial collisions live, and aggregating hits by similarity band lets you spot patterns — for example, a sudden spike in 0.86 hits on a single tenant is the signature of someone probing the threshold.
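
A minimal sketch of one such hit record; the scope and provenance inputs and the print-to-stdout sink are hypothetical plumbing, and the band boundaries are illustrative.

```python
import json
import time
from dataclasses import dataclass, asdict

def similarity_band(score: float) -> str:
    # Coarse bands make threshold-probing visible in aggregates.
    if score >= 0.97:
        return "near-exact"
    if score >= 0.90:
        return "high"
    return "edge"  # the band where adversarial collisions live

@dataclass
class CacheHitAudit:
    tenant_id: str
    requester_scope: str                  # scope of the user receiving the response
    entry_scope: str                      # scope the entry was generated under
    scope_diff: bool                      # any mismatch makes the hit suspect
    entry_doc_versions: dict[str, int]    # what the entry was built from
    current_doc_versions: dict[str, int]  # what retrieval would return today
    provenance_diverged: bool
    similarity: float
    band: str
    served_at: float

def audit_hit(tenant, req_scope, entry_scope, entry_docs, current_docs, score):
    record = CacheHitAudit(
        tenant_id=tenant,
        requester_scope=req_scope,
        entry_scope=entry_scope,
        scope_diff=(req_scope != entry_scope),
        entry_doc_versions=entry_docs,
        current_doc_versions=current_docs,
        provenance_diverged=(entry_docs != current_docs),
        similarity=score,
        band=similarity_band(score),
        served_at=time.time(),
    )
    print(json.dumps(asdict(record)))  # stand-in for the real audit sink
    return record

# A 0.86 hit where scopes differ: block or re-validate instead of serving.
r = audit_hit("acme", "tier=free", "tier=paid", {"doc-42": 3}, {"doc-42": 4}, 0.86)
if r.scope_diff or r.provenance_diverged:
    print("suspect hit: fall through to the model, do not serve cached response")
```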

Few teams instrument any of this. Most cache hits are a single counter increment. The first time you can produce that audit trail is usually after the incident, because the incident is what made someone ask whether you could.

Treat the Cache Key Review Like a SQL Predicate Review

The mental shift that makes semantic caching survivable in production is to stop thinking of it as a performance optimization with a security caveat and start thinking of it as an authorization-bearing data path with a performance benefit.

You wouldn't ship a SQL query that returns user data without a WHERE user_id = ? predicate, and you'd review that predicate at code-review time the same way you review a permission check. The semantic cache key is the same predicate, expressed in a different syntax. It defines who is allowed to receive this response. It just happens to be specified as (tenant_prefix, scope_prefix, user_prefix, embedding) instead of WHERE tenant_id = ? AND user_id = ?.

Three minimums that should be table stakes before any semantic cache reaches production traffic: (1) scope and identity are hard prefixes, never optional vary-by tags; (2) every entry has a provenance envelope sufficient to drive entitlement-aware invalidation; (3) every hit emits an audit record with authorization-diff and similarity-band fields, kept long enough for incident reconstruction. The 70% cost win is real and the perf team isn't wrong about that. The win just isn't yours to keep if you're saving it by occasionally serving the wrong user's data.

Engineers who already operate retrieval systems with row-level security have most of the muscle memory for this. The translation is straightforward: the cache key is the predicate, the provenance envelope is the audit log, and the similarity threshold is the soft filter that runs after the hard authorization check, never instead of it. Build it that way and the safety property is "this cannot happen by construction." Build it the other way — embedding first, scope as metadata — and the safety property is "this hasn't happened yet."
