
Semantic Cache Is a Safety Problem, Not a Perf Win

12 min read
Tian Pan
Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative (there's a "70% bill reduction" deck floating around every AI engineering Slack) and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that Jane's "I can't log into my account, my email is …" and Bob's otherwise-identical query both vectorized within threshold of "I can't log into my account," and the cache cheerfully served Bob the response originally generated for Jane, including the password-reset link she had requested.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."

Why Embedding Similarity Is the Wrong Equivalence Class

A traditional cache key is a hash. Two inputs are "the same" iff they hash to the same value, and good hash functions guarantee that small input changes produce uncorrelated outputs. That avalanche property is what makes hash-based caching safe: nothing about the cache lookup mechanism can confuse two semantically distinct queries unless they are byte-identical.

Semantic caching deliberately throws this property away. The whole point is that "How do I reset my password?" and "I forgot my password" should hit the same cache entry. The lookup mechanism rewards locality — small input changes produce small embedding changes — and a similarity threshold (typically 0.85–0.95 cosine) decides what counts as "close enough." This is incompatible with the cryptographic notion of collision resistance. A 2026 paper formalized exactly this trade-off and showed that an attacker can craft inputs that sit on the razor's edge of the threshold: semantically distinct enough to carry a different intent, mathematically similar enough to hit a target query's cache entry. Their CacheAttack framework reports 86% success rates on response hijacking against major industrial implementations.
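
To make the contrast concrete, here is a toy sketch of the two equivalence tests side by side. The vectors are hand-written stand-ins for real embeddings and the threshold is illustrative; nothing here is a specific library's API.

```python
import hashlib
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Plain cosine similarity; stands in for whatever the vector store computes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

q1 = "How do I reset my password?"
q2 = "I forgot my password"

# Hash equivalence: any byte difference produces an uncorrelated digest,
# so two distinct queries can only share an entry if they are byte-identical.
print(hashlib.sha256(q1.encode()).hexdigest()[:16])
print(hashlib.sha256(q2.encode()).hexdigest()[:16])  # unrelated to the first

# Semantic equivalence: nearby meanings produce nearby vectors, and the
# threshold decides what counts as "the same question."
e1 = [0.71, 0.70, 0.05]  # pretend embedding of q1
e2 = [0.69, 0.72, 0.06]  # pretend embedding of q2, nearby by construction
THRESHOLD = 0.90         # illustrative; deployments typically sit in 0.85-0.95

if cosine(e1, e2) >= THRESHOLD:
    print("cache hit: both queries resolve to the same entry")
```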

The point isn't that adversarial inputs are the main risk. The main risk is benign collisions — ordinary users, no malice, asking similar questions whose answers depend on context the cache key didn't capture. Authorization scope is context. Tenant identity is context. User personalization is context. None of it lives inside the embedding of the natural-language query, because the embedding doesn't know any of those things exist. The cache key is what the embedding thinks the user asked. The right answer depends on what the user is allowed to know.

The Failure Modes That Don't Show Up in Hit-Rate Dashboards

Cache observability is built around hit rate, latency, and cost saved. None of those metrics will alert on a cross-user leak. A leak is a 0.5ms cache hit that returns a stored response. The dashboard turns green.

Here are the leak shapes I have personally watched reach production, across semantic-cache deployments at three different teams:

Authorization scope collapse. A query like "show me my open tickets" gets cached with the response — the actual ticket list — keyed only on the query's embedding. The next user to ask "show me my open tickets" hits the cache and sees the first user's tickets. This sounds too dumb to ship, and yet I have seen it ship twice, both times because the team was treating semantic cache as "Redis with vectors" and forgot that the response was personalized.

Personalization spillover. "Recommend a workout" returns a cached response that mentions the previous user's age, injury history, and equipment. The query embedding doesn't know any of that; the response leaks it. This one is hard to detect because both queries are legitimately the same intent — the cache should match. The bug is that the response shouldn't have been cached, or it should have been cached with the personalization variables stripped.

Entitlement bypass. A free-tier user asks "summarize this earnings call" and gets back the response that was cached for a paid-tier user yesterday — the one that included the proprietary analyst commentary the free tier doesn't have access to. The cache happily serves it because access tier wasn't part of the key. The user got a feature upgrade for free and the product team's metrics showed unusual engagement on the free tier that nobody connected back to the cache.

Stale-after-revoke. User A shares a document with User B, then revokes access. User B asks a question about the document; the cached answer from before the revoke is served. The retrieval layer correctly excluded the document from the new query — but the cache short-circuited retrieval entirely. The user can read content they no longer have access to until the cache entry expires.

Cross-tenant linguistic collision. Tenant Acme and Tenant Globex both ask "what's our Q4 revenue forecast?" The embeddings of those two queries are nearly identical because the natural-language form is identical, even though the documents, models, and answers behind them are completely different. With a global cache and no tenant prefix, one tenant's forecast number is served to the other.
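
A compressed sketch of that last failure and its structural fix. The dict-backed store and exact-match lookup are toy stand-ins for a vector store and its similarity search; the point is only where the namespace boundary sits.

```python
# namespace -> {query -> cached response}; a real store would hold embeddings
# and do ANN search, but only ever inside one namespace.
cache: dict[str, dict[str, str]] = {}

def store(namespace: str, query: str, response: str) -> None:
    cache.setdefault(namespace, {})[query] = response

def lookup(namespace: str, query: str) -> str | None:
    # Exact match stands in for the similarity search within the namespace.
    return cache.get(namespace, {}).get(query)

q = "what's our Q4 revenue forecast?"

# Global namespace: Acme writes, Globex reads. This is the leak.
store("global", q, "Acme internal: $41M")
print(lookup("global", q))          # Globex receives Acme's number

# Tenant-prefixed namespaces: Globex misses by construction.
store("tenant:acme", q, "Acme internal: $41M")
print(lookup("tenant:globex", q))   # None -> miss, regenerate for Globex
```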

Every one of these is invisible in the standard cache dashboard. They show up in support tickets, compliance audits, and — in the worst cases — the news.

Cache Keys Are Authorization Predicates in Disguise

The fix is structural, not statistical. A semantic cache key is a contract about which other queries can hit this entry. That contract should be designed with the same scrutiny as a SQL WHERE clause that filters by user ID, because functionally that's what it is.

The minimum viable safe key for a semantic cache in any multi-user system has three required components and one optional one:

  1. Tenant ID as a mandatory hard prefix. Not part of the embedding — a literal string prepended to the cache namespace. Cross-tenant queries must miss the cache by construction, never by similarity score. If you're tempted to share an entry across tenants, write it down in a design doc and have someone who isn't on the perf team read it.
  2. Authorization scope as a mandatory hard prefix. Roles, entitlements, and feature flags that affect the response shape. Free tier and paid tier are different namespaces. Admin and member are different namespaces. A response generated under one set of permissions cannot be served under another, period.
  3. User ID as a mandatory hard prefix for any response that can include user-specific content. The default should be per-user partitioning; promotion to a shared partition is an explicit, reviewed decision that requires the response to be provably user-agnostic.
  4. Prompt version, model version, and retrieval index version as suffix components. This isn't a safety property, but it interacts: when you upgrade the embedding model, every existing entry is in a different vector space and lookups silently degrade. Versioning forces a controlled rollover instead of slow corruption.

The semantic similarity match happens inside the partition defined by the hard prefixes, never across it. This is the single most important inversion of intuition: the cache is many small per-scope caches that happen to share infrastructure, not one big cache with optional scoping. Vendors that frame the scope as a vary-by filter or an optional metadata tag have it backwards — scope is the namespace, similarity is the lookup within the namespace.
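
Here is a minimal sketch of that key contract. The field names and namespace layout are illustrative, not any particular library's API; the one load-bearing property is that the similarity search receives the namespace string, so it can never see vectors outside it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CachePartition:
    tenant_id: str       # hard prefix 1: cross-tenant hits impossible by construction
    auth_scope: str      # hard prefix 2: e.g. "tier=paid,role=member"
    user_id: str | None  # hard prefix 3: None only for reviewed, user-agnostic entries
    model_version: str   # suffix components: force controlled rollover on upgrades,
    prompt_version: str  # not a safety property in themselves
    index_version: str

    def namespace(self) -> str:
        user = self.user_id or "shared"
        return (f"{self.tenant_id}/{self.auth_scope}/{user}/"
                f"{self.model_version}/{self.prompt_version}/{self.index_version}")

p = CachePartition("acme", "tier=paid,role=member", "user-7",
                   "emb-v3", "prompt-v12", "idx-2026-01-15")

# The similarity search happens only inside the namespace, e.g. (hypothetical API):
#   candidates = vector_store.search(p.namespace(), query_embedding, k=1)
# A query from another tenant, tier, or user resolves to a different namespace
# string, so it cannot even see this partition's vectors.
print(p.namespace())
```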

Entitlement-Aware Invalidation Is Not Optional

Even with perfect cache keys, entries grow stale in ways that traditional caches never had to handle. Three triggers that need wiring up before semantic caching ships:

Permission change. Whenever a user's role, entitlement, or document access changes, every cache entry whose key prefix includes that user — and every entry generated under their now-revoked permissions — must be invalidated. In practice this means cache entries need to record their authorization fingerprint at write time so invalidation can target them later. Most semantic cache libraries do not do this; they store the embedding and the response and nothing else.

Document or knowledge-base change. RAG-backed responses are bound to a specific snapshot of the retrieval corpus. When a source document is edited, deleted, or reclassified, every cached response derived from it is stale — possibly dangerously so if the change was a redaction. Provenance metadata at cache-write time (which document IDs and versions contributed to this response) makes this tractable. Without it, the only safe option is a wholesale flush, which destroys your hit rate.

Model change. A new model version may have different safety boundaries, different refusal patterns, and different factual knowledge. Cached responses generated under the old model don't get re-evaluated by the new one's policy. If the new model would have refused the query, the cache is now serving a response the current model wouldn't produce — which is exactly the kind of inconsistency that gets caught by an external red-team and not by your eval suite.

The pattern that makes this manageable is a write-time provenance envelope: every cached entry stores not just (query_embedding, response) but (query_embedding, response, tenant_id, user_id, auth_scope, doc_versions, model_version, prompt_version, generated_at). Invalidation becomes a query against the envelope: "delete every entry where user_id = X," "where doc_versions ∋ Y," "where model_version < Z." Without the envelope, you have no precise invalidation, only TTL — and TTL is the wrong tool for "this user just got fired and their cached responses contain confidential strategy."
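
A sketch of the envelope and those three invalidation queries, assuming a backing store that can scan or index the envelope fields. The list-backed store and predicate scan here are toy stand-ins; the field names follow the tuple above.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CacheEntry:
    query_embedding: list[float]
    response: str
    tenant_id: str
    user_id: str
    auth_scope: str
    doc_versions: dict[str, int]  # doc_id -> version that fed this response
    model_version: str
    prompt_version: str
    generated_at: float = field(default_factory=time.time)

entries: list[CacheEntry] = []  # toy stand-in for the real store

def invalidate(predicate: Callable[[CacheEntry], bool]) -> int:
    # Delete every entry matching the predicate; returns how many were removed.
    global entries
    before = len(entries)
    entries = [e for e in entries if not predicate(e)]
    return before - len(entries)

# "This user's permissions just changed": target by identity, not TTL.
invalidate(lambda e: e.user_id == "user-7")

# "doc-42 was edited or redacted": target by provenance.
invalidate(lambda e: "doc-42" in e.doc_versions)

# "We rolled the model": everything not generated by the current version is stale.
invalidate(lambda e: e.model_version != "emb-v3")
```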

The Audit Trail That Distinguishes Hits From Leaks

The other half of the problem is that current semantic-cache observability cannot tell you whether a cache hit was correct. A "hit" event in most implementations records: query embedding, matched entry's embedding, similarity score, latency. None of that tells you whether the response was appropriate for the requesting user.

The audit log a security review will actually accept needs three additional fields per hit (a sketch of one such record follows the list):

  • Authorization decision diff. What scope did the requesting user have? What scope was the cached entry generated under? If they differ at all, the hit is suspect by default and should be either blocked, re-validated by calling the model, or flagged for review.
  • Provenance reconciliation. Are the documents the entry was generated from still readable by this user? Are they still the current version? A hit that bypasses retrieval should still log what would have been retrieved, so divergence is visible.
  • Similarity-band classification. A hit at 0.99 cosine is qualitatively different from a hit at 0.86. The latter is operating in the band where adversarial collisions live, and aggregating hits by similarity band lets you spot patterns — for example, a sudden spike in 0.86 hits on a single tenant is the signature of someone probing the threshold.
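
A minimal sketch of one such hit record; the scope and provenance inputs and the print-to-stdout sink are hypothetical plumbing, and the band boundaries are illustrative.

```python
import json
import time
from dataclasses import dataclass, asdict

def similarity_band(score: float) -> str:
    # Coarse bands make threshold-probing visible in aggregates.
    if score >= 0.97:
        return "near-exact"
    if score >= 0.90:
        return "high"
    return "edge"  # the band where adversarial collisions live

@dataclass
class CacheHitAudit:
    tenant_id: str
    requester_scope: str                  # scope of the user receiving the response
    entry_scope: str                      # scope the entry was generated under
    scope_diff: bool                      # any mismatch makes the hit suspect
    entry_doc_versions: dict[str, int]    # what the entry was built from
    current_doc_versions: dict[str, int]  # what retrieval would return today
    provenance_diverged: bool
    similarity: float
    band: str
    served_at: float

def audit_hit(tenant, req_scope, entry_scope, entry_docs, current_docs, score):
    record = CacheHitAudit(
        tenant_id=tenant,
        requester_scope=req_scope,
        entry_scope=entry_scope,
        scope_diff=(req_scope != entry_scope),
        entry_doc_versions=entry_docs,
        current_doc_versions=current_docs,
        provenance_diverged=(entry_docs != current_docs),
        similarity=score,
        band=similarity_band(score),
        served_at=time.time(),
    )
    print(json.dumps(asdict(record)))  # stand-in for the real audit sink
    return record

# A 0.86 hit where scopes differ: block or re-validate instead of serving.
r = audit_hit("acme", "tier=free", "tier=paid", {"doc-42": 3}, {"doc-42": 4}, 0.86)
if r.scope_diff or r.provenance_diverged:
    print("suspect hit: fall through to the model, do not serve cached response")
```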

Few teams instrument any of this. Most cache hits are a single counter increment. The first time you can produce that audit trail is usually after the incident, because the incident is what made someone ask whether you could.

Treat the Cache Key Review Like a SQL Predicate Review

The mental shift that makes semantic caching survivable in production is to stop thinking of it as a performance optimization with a security caveat and start thinking of it as an authorization-bearing data path with a performance benefit.

You wouldn't ship a SQL query that returns user data without a WHERE user_id = ? predicate, and you'd review that predicate at code-review time the same way you review a permission check. The semantic cache key is the same predicate, expressed in a different syntax. It defines who is allowed to receive this response. It just happens to be specified as (tenant_prefix, scope_prefix, user_prefix, embedding) instead of WHERE tenant_id = ? AND user_id = ?.

Three minimums that should be table stakes before any semantic cache reaches production traffic: (1) scope and identity are hard prefixes, never optional vary-by tags; (2) every entry has a provenance envelope sufficient to drive entitlement-aware invalidation; (3) every hit emits an audit record with authorization-diff and similarity-band fields, kept long enough for incident reconstruction. The 70% cost win is real and the perf team isn't wrong about that. The win just isn't yours to keep if you're saving it by occasionally serving the wrong user's data.

Engineers who already operate retrieval systems with row-level security have most of the muscle memory for this. The translation is straightforward: the cache key is the predicate, the provenance envelope is the audit log, and the similarity threshold is the soft filter that runs after the hard authorization check, never instead of it. Build it that way and the safety property is "this cannot happen by construction." Build it the other way — embedding first, scope as metadata — and the safety property is "this hasn't happened yet."
