Cache Invalidation for AI: Why Every Cache Layer Gets Harder When the Answer Can Change

10 min read
Tian Pan
Software Engineer

Phil Karlton's famous quip — "There are only two hard things in Computer Science: cache invalidation and naming things" — was coined before language models entered production. Add AI to the stack and cache invalidation doesn't just get harder; it gets harder at every layer simultaneously, for fundamentally different reasons at each one.

Traditional caches store deterministic outputs: the database row, the rendered HTML, the computed price. When the source changes, you invalidate the key, and the next request fetches fresh data. The contract is simple because the answer is a fact.

AI caches store something different: responses to queries where the "correct" answer depends on context, recency, model behavior, and the source documents the model was given. Stale here doesn't mean outdated — it means semantically wrong in ways your monitoring won't catch until a user notices.

The Four Cache Tiers You're Probably Not Treating Differently

A production LLM application typically has at least four distinct cache layers, each with a different invalidation profile:

Prompt caches (offered by Anthropic and OpenAI) store the KV computation for a prefix of your prompt. These are the closest to traditional caching — they're exact-match on token sequences, and any change to the cached prefix invalidates everything downstream. The invalidation rule is mechanical but unforgiving: restructuring your system prompt to move a frequent change higher up the document evicts the entire cache.
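To see why prompt ordering matters, it helps to remember that the cacheable portion is just the longest shared token prefix across requests. The toy sketch below (tokens simplified to whitespace-split words; the prompt strings are hypothetical) shows how placing volatile content early shrinks that prefix:

```python
# Toy illustration of prefix-based prompt caching: the reusable
# portion is the longest shared token prefix between requests, so
# volatile content placed early shrinks it. Tokens are simplified
# to words; real prompt caches operate on model tokens.
def cacheable_prefix_len(prev_tokens, next_tokens):
    n = 0
    for a, b in zip(prev_tokens, next_tokens):
        if a != b:
            break
        n += 1
    return n

# Static system prompt first: only the final user turn differs.
good_a = "SYSTEM RULES ... user: hi".split()
good_b = "SYSTEM RULES ... user: bye".split()

# Volatile date first: the shared prefix ends almost immediately.
bad_a = "date: 2024-01-01 SYSTEM RULES ...".split()
bad_b = "date: 2024-01-02 SYSTEM RULES ...".split()
```

Here `cacheable_prefix_len(good_a, good_b)` is 4 tokens while `cacheable_prefix_len(bad_a, bad_b)` is 1: moving the date above the system prompt threw away nearly the whole cacheable prefix.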

Semantic caches (tools like GPTCache, Redis semantic cache) store LLM responses keyed by embedding similarity rather than exact string match. A new query within a cosine distance threshold of a cached query gets the cached response. These are the most dangerous caches to reason about because there is no sharp invalidation boundary — the same cached response may be served for thousands of semantically similar but not identical queries.
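A minimal sketch of the lookup logic makes the "no sharp boundary" problem concrete. All names below are illustrative, not the API of GPTCache or Redis; the point is that a single threshold decides whether a cached response is reused:

```python
import math

# Minimal semantic-cache sketch: entries are keyed by query embedding
# and served when cosine similarity clears a threshold. Illustrative
# only; real systems use an ANN index instead of a linear scan.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def get(self, query_emb):
        # Return the closest cached response above threshold, else None.
        best, best_sim = None, self.threshold
        for emb, response in self.entries:
            sim = cosine(emb, query_emb)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best

    def put(self, query_emb, response):
        self.entries.append((query_emb, response))
```

Every query embedding that lands within the threshold of a cached one gets that one stored response, which is exactly why a single bad entry can answer thousands of distinct queries.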

RAG retrieval caches store the results of vector similarity searches: given a query embedding, return these documents. When source documents update, the cached retrieval results may point to stale document versions or omit newly relevant content.

Embedding indexes are the bottom layer — the precomputed vector representations of your documents. These are rebuilt when documents change, but the rebuild is expensive and typically asynchronous, meaning there's a window where your retrieval runs against stale embeddings.

Each layer has a different failure mode, a different appropriate consistency model, and a different set of production incidents waiting to happen.

Why Traditional Invalidation Breaks at Each Layer

The core problem isn't technical — it's semantic. Traditional caches store facts. AI caches store responses that were appropriate given a particular model, a particular set of source documents, and a particular moment in time. TTL and tag-based invalidation assume you can draw a clean line between "still valid" and "stale." AI caches don't offer that line.

Semantic caches amplify hallucinations. If an LLM generates an incorrect response to a query, every future semantically similar query gets that hallucination served from cache. The error doesn't decay — it compounds. A standard cache might serve stale data until you push a fix; a semantic cache serves wrong data until you identify that particular cluster of queries and invalidate it explicitly.

Embedding model upgrades are cache-busting events you won't notice. When you upgrade your embedding model — moving from an older model to a newer, more capable one — the new embeddings are numerically incompatible with the old ones. A similarity score of 0.9 under model A means something completely different under model B. If you don't treat the model version as part of your cache key, you'll silently mix embeddings from different models and get retrieval results that are quietly wrong.

Semantic drift makes TTL meaningless. The word "Python" in queries submitted to your developer documentation system had one distribution in 2022 and a different one now. Product names change meaning. Regulatory terminology evolves. A cached embedding that was a correct representation of your document six months ago may now map to the wrong region of the embedding space — not because the document changed, but because the language around it did.

Document updates don't propagate through the cache stack. A legal tech firm learned this the hard way: a lawyer updated a contract, searched for it again, and was served the cached version of the old document through the RAG retrieval cache. The retrieval cache had no knowledge that the source document had changed. Traditional cache invalidation would have been straightforward — you know which key to invalidate when the document changes. But the retrieval cache was keyed on query embeddings, not document IDs, so there was no clean invalidation path.

Matching Consistency Models to Cache Tiers

Not every cache tier needs the same consistency guarantee, and trying to apply strong consistency everywhere will kill your latency without meaningfully improving correctness.

Strong consistency — every read reflects the most recent write — makes sense for fact-based caches where errors have clear downstream consequences: financial data, compliance-critical policy documents, authentication decisions. It's rarely appropriate for semantic caches because the cost in coordination overhead eliminates most of the latency benefit you were caching for in the first place.

Eventual consistency works for semantic caches and most RAG retrieval caches because users can tolerate brief windows of slightly-stale responses, and the update frequency is low relative to query frequency. The risk is unbounded staleness — eventual consistency doesn't tell you how eventual. You need to bound it.

Bounded staleness is the right model for most production RAG systems. The guarantee is: "This cache entry is no older than X seconds." You implement it with Change Data Capture streams from your document store — when a document updates, you emit an event that triggers invalidation of retrieval cache entries that referenced it, with a maximum lag bound. This gives you the performance of eventual consistency with a worst-case staleness window you can reason about and communicate to users.
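A sketch of how the two mechanisms combine, with illustrative names (a real implementation would consume a CDC stream from Debezium or your document store's changelog): each cache entry records which documents it referenced and when it was written, the CDC handler invalidates by document, and reads enforce the staleness bound as a backstop.

```python
import time

# Bounded-staleness sketch: retrieval cache entries record the
# documents they referenced and their write time. A CDC-style update
# event drops entries referencing the changed document; reads reject
# anything older than max_staleness seconds as a backstop.
class RetrievalCache:
    def __init__(self, max_staleness=300.0):
        self.max_staleness = max_staleness
        self.entries = {}  # query_key -> (results, doc_ids, written_at)

    def put(self, query_key, results, doc_ids):
        self.entries[query_key] = (results, set(doc_ids), time.time())

    def get(self, query_key, now=None):
        now = time.time() if now is None else now
        hit = self.entries.get(query_key)
        if hit is None:
            return None
        results, _, written_at = hit
        if now - written_at > self.max_staleness:  # staleness bound
            del self.entries[query_key]
            return None
        return results

    def on_document_update(self, doc_id):
        # CDC event handler: drop every entry that referenced doc_id.
        stale = [k for k, (_, docs, _) in self.entries.items()
                 if doc_id in docs]
        for k in stale:
            del self.entries[k]
```

The `max_staleness` read check matters because CDC delivery itself can lag; it converts "eventual" into a worst case you can state in an SLA.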

Session consistency — where a user always sees their own writes immediately — applies naturally to prompt caches. A user's conversation should always be self-consistent; the prompt cache should be warmed for their session before they notice a cache miss.

Four Design Patterns That Actually Work

Versioned embeddings with namespace epochs. Embed the embedding model version directly in your cache key: key = hash(query, embedding_model_v3). When you upgrade your embedding model, old cache keys naturally go cold via TTL without requiring explicit bulk deletion. During the migration window, you run both versions simultaneously, gradually warming the new cache while the old one drains. This also prevents the worst failure mode — mixing incompatible embeddings in your vector index.
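The key construction itself is a one-liner; the version strings below are placeholders for whatever identifiers your deployment uses:

```python
import hashlib

# Versioned cache key: the embedding model version is part of the
# key, so upgrading the model makes every old key miss naturally.
# Version strings here are illustrative placeholders.
def cache_key(query: str, embedding_model: str) -> str:
    payload = f"{embedding_model}\x00{query}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

The same query under `"embedding_model_v2"` and `"embedding_model_v3"` produces different keys, so the two namespaces can never collide while both run during migration.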

Document fingerprinting for retrieval cache invalidation. Content-addressed caching is the solution to the document-update problem. Include a hash of the source document's content in the retrieval cache key. When a document changes, its fingerprint changes, and all cache entries that reference it naturally miss — no explicit invalidation sweep needed. The same approach that makes browser assets cache correctly works in RAG: you're computing hash(document_content) and using it as part of the cache identity.
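One way to realize this at read time, sketched with illustrative names: each cache entry stores the fingerprints of the documents it drew from, and a lookup misses if any of them no longer matches the live content.

```python
import hashlib

# Content-addressed invalidation sketch: entries remember the content
# hash of each source document; a lookup misses if any referenced
# document's hash has changed. Names are illustrative.
def document_fingerprint(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()[:16]

class FingerprintedRetrievalCache:
    def __init__(self, current_fingerprint):
        # current_fingerprint(doc_id) -> fingerprint of the live document
        self.current_fingerprint = current_fingerprint
        self.entries = {}  # query -> (results, {doc_id: fingerprint})

    def put(self, query, results, doc_ids):
        fps = {d: self.current_fingerprint(d) for d in doc_ids}
        self.entries[query] = (results, fps)

    def get(self, query):
        hit = self.entries.get(query)
        if hit is None:
            return None
        results, fps = hit
        # Miss if any referenced document's content hash changed.
        if any(self.current_fingerprint(d) != fp for d, fp in fps.items()):
            del self.entries[query]
            return None
        return results
```

No sweep ever runs: the updated contract from the legal-tech example would simply stop validating, and the next lookup would fall through to fresh retrieval.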

Tiered architecture with explicit freshness requirements per tier. Maintain a static tier and a dynamic tier in your semantic cache. The static tier contains offline-vetted, curated Q&A pairs where you're confident the answers don't change frequently — these you can serve with high similarity thresholds and long TTLs. The dynamic tier contains fresh content with aggressive eviction. Queries hit the static tier first; misses fall through to the dynamic tier; dynamic misses go to full LLM inference. The cost savings compound: each tier catches a different class of queries, and the freshness properties of each tier are explicit and auditable.
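The lookup path above can be sketched as a simple fall-through, with plain dicts standing in for the two tiers and `llm_call` standing in for full inference (all names illustrative):

```python
# Tiered lookup sketch: static tier first, then dynamic tier, then
# full LLM inference, with the fresh answer written back to the
# dynamic tier. Tiers are plain dicts here for illustration; real
# tiers would be semantic caches with their own thresholds and TTLs.
def answer(query, static_tier, dynamic_tier, llm_call):
    hit = static_tier.get(query)
    if hit is not None:
        return hit, "static"
    hit = dynamic_tier.get(query)
    if hit is not None:
        return hit, "dynamic"
    fresh = llm_call(query)
    dynamic_tier[query] = fresh  # warm the dynamic tier for next time
    return fresh, "llm"
```

Returning the tier name alongside the answer is deliberate: logging which tier served each response is what makes the freshness properties auditable in practice.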

Incremental index updates rather than full reindexing. A full reindex of a large document corpus is expensive and slow — one benchmark showed 12,000 files taking 22 minutes and significant API cost. Incremental indexing tracks document lineage: when a document changes, only recompute its embedding; when a document is deleted, remove its embedding; when a document is unchanged, reuse the cached embedding. The same benchmark showed incremental updates completing in 45 seconds. The engineering investment is in building the change detection layer — a CDC stream, a document changelog, or a version-tracked document store.
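The core of the change-detection layer fits in a short function. This sketch diffs current document hashes against the last indexed state; `embed()` is an illustrative stand-in for a real embedding call, which is exactly the expensive step being avoided:

```python
import hashlib

# Incremental reindex sketch: compare current document content hashes
# against the last indexed state and only (re)embed what changed.
# embed() stands in for a real (expensive) embedding API call.
def incremental_reindex(documents, index, embed):
    # documents: {doc_id: content_str}
    # index:     {doc_id: (content_hash, embedding)}, mutated in place
    changed, removed, reused = [], [], []
    for doc_id, content in documents.items():
        h = hashlib.sha256(content.encode("utf-8")).hexdigest()
        prev = index.get(doc_id)
        if prev is not None and prev[0] == h:
            reused.append(doc_id)                 # unchanged: keep embedding
        else:
            index[doc_id] = (h, embed(content))   # new or updated: recompute
            changed.append(doc_id)
    for doc_id in list(index):
        if doc_id not in documents:
            del index[doc_id]                     # deleted: drop embedding
            removed.append(doc_id)
    return changed, removed, reused
```

On a corpus where one file in thousands changed, only that one file pays the embedding cost; everything else is a hash comparison.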

The Thundering Herd Problem Is Worse for Semantic Caches

Traditional cache stampedes happen when many entries expire simultaneously, overwhelming the origin. Semantic cache stampedes are subtler and worse.

When you invalidate a semantic cache — say, after a product policy update — you're not invalidating individual keys. You're invalidating clusters of queries that touched a particular region of the embedding space. You don't know which queries those are without scanning the entire cache. And when those queries start missing, they all hit the LLM simultaneously, with a latency spike that scales with the semantic surface area of the changed content.

Adding random jitter to TTLs reduces the simultaneous expiry problem but introduces semantic inconsistency: similar queries may get responses that are different ages, meaning some users see the old policy and some see the new one during the jitter window. For most content, this is tolerable. For compliance-critical or legal content, it's not.

The solution is to scope your invalidation precisely: when a document changes, emit an event that invalidates only the cached queries that retrieved that specific document. This requires tracking, per cached query, which documents contributed to the response. It's more bookkeeping, but it converts O(cache size) invalidation to O(queries that touched the document).
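The bookkeeping amounts to maintaining a reverse index from document ID to the cached queries that touched it, sketched below with illustrative names:

```python
from collections import defaultdict

# Scoped invalidation sketch: alongside the query-keyed cache, keep a
# reverse index from document id to the cached queries whose responses
# drew on it, so a document change invalidates only those queries.
class ScopedCache:
    def __init__(self):
        self.responses = {}                      # query -> cached response
        self.queries_by_doc = defaultdict(set)   # doc_id -> {queries}

    def put(self, query, response, source_doc_ids):
        self.responses[query] = response
        for doc_id in source_doc_ids:
            self.queries_by_doc[doc_id].add(query)

    def get(self, query):
        return self.responses.get(query)

    def invalidate_document(self, doc_id):
        # O(queries that touched doc_id), not O(cache size):
        # no scan of the full cache, no embedding-space guesswork.
        for query in self.queries_by_doc.pop(doc_id, set()):
            self.responses.pop(query, None)
```

The write path gets slightly heavier, but invalidation becomes an exact set lookup instead of a question about embedding-space neighborhoods.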

What This Means for Your Next Architecture Decision

If you're caching LLM responses in production and haven't thought through each layer separately, you're almost certainly making one of these mistakes:

  • Using TTL as your only invalidation mechanism on semantic caches, which means you're either serving stale responses too long or paying for unnecessary LLM calls when entries expire correctly.
  • Upgrading your embedding model without versioning your cache keys, which means your retrieval quality is quietly degrading as incompatible embeddings coexist.
  • Not tracking which documents contributed to which cached responses, which means you can't invalidate precisely when documents change.
  • Applying the same consistency model to all cache tiers, which means you're either over-engineering your semantic cache or under-engineering your fact-based caches.

The mental model shift is this: traditional cache invalidation asks "when is this data stale?" AI cache invalidation asks "when is this response wrong?" — and those are fundamentally different questions. Data staleness has a clear timestamp. Response wrongness is a function of the model, the source documents, the query distribution, and the user's context.

The teams that build reliable AI caching systems are the ones that stop treating their semantic cache like a Redis TTL problem and start treating it like a distributed system with explicit consistency contracts per tier. The contracts aren't complicated — but you have to define them before your first production incident forces the question.
