
Agent Memory Is a Cache With No Invalidation Policy

Tian Pan
Software Engineer

Every agent framework now ships "long-term memory" as a headline feature, and every team adopts it as an unambiguous good. The agent remembers the user's preferences, prior decisions, project context, and the corrections it was given last week, so each session starts warmer than the last. The demo is irresistible: a user says "set up the project the way I like it" and the agent just does it. Nobody asks the obvious question, because the framing of the feature actively discourages it.

The question is: when does any of that stop being true?

A memory store is a cache. It holds facts about a world that does not hold still. The agent recorded "the user prefers Postgres" eight months ago, and the team has since migrated to a different database. The agent remembers "the user is on the growth team," and the user changed roles in March. The agent stored a tidy summarized conclusion from a conversation whose premises were corrected two messages later. And the memory layer surfaces all of it with exactly the same confident freshness as a fact written this morning. We have spent fifty years learning that a cache without an invalidation policy is a correctness bug. Then we built agent memory and shipped it without one.

The feature framing hides the cache

Phil Karlton's line — "there are only two hard things in computer science: cache invalidation and naming things" — is a joke because it is true. Cache invalidation is hard because it is not a feature you build; it is a discipline you sustain. You have to know what each cached entry depends on, detect when the dependency changes, and propagate that change before anyone reads the stale value. Get any of the three wrong and the cache silently lies.

Agent memory has all three problems and inherits none of the discipline, because it was never introduced as a cache. It was introduced as memory — a word that carries a human, sympathetic, obviously-good connotation. You would never ship a cache layer with no TTL, no eviction, and no dependency tracking and call it production-ready. But "long-term memory" with those same three holes ships every day, because the name smuggled the cache past review.

The 2026 surveys of agent memory frameworks are blunt about where this lands. Across the widely adopted systems (Mem0, Zep, Letta, LangMem, MemOS), retrieval quality and write extraction have improved fast, but staleness handling remains, in Mem0's own words, "unresolved" in most frameworks. The store grows monotonically. Nothing is ever evicted, because the framing of "memory" makes deletion feel like data loss rather than cache hygiene.

Stale is not the same as outdated

The instinct, once you accept the cache analogy, is to reach for a TTL. Decay everything: a memory's confidence drops on a schedule unless something re-confirms it. That helps, and you should do it — but it solves the wrong half of the problem.
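As a rough illustration of schedule-based decay, here is a minimal sketch that assumes an exponential half-life model. The `Memory` class, the function names, and the 30-day half-life are all hypothetical choices made for this example, not the API of any framework mentioned here.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    confidence: float          # confidence at the last confirmation
    last_confirmed_day: float  # day number of the last re-confirmation

def current_confidence(m: Memory, today: float, half_life_days: float = 30.0) -> float:
    """Halve confidence every `half_life_days` since the last re-confirmation."""
    age = max(0.0, today - m.last_confirmed_day)
    return m.confidence * 0.5 ** (age / half_life_days)

def reconfirm(m: Memory, today: float) -> None:
    """Reset the decay clock when something re-confirms the memory."""
    m.confidence = 1.0
    m.last_confirmed_day = today

note = Memory("asked about f-string syntax", confidence=1.0, last_confirmed_day=0)
print(current_confidence(note, today=30))  # 0.5 after one half-life
```

Note what this sketch does not do: nothing in it looks at whether the fact is still true. Age is the only signal it has.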

TTL handles low-relevance entries gracefully. The throwaway note about a syntax question tied to a temporary project ages out in 7 or 30 days and nobody misses it. The dangerous memories are the opposite: high-relevance facts that are retrieved constantly and correctly, right up until the moment they are not. "The user's employer is Acme" is the single most-retrieved fact in the store. It is correct on a thousand consecutive turns. Then the user changes jobs, and on turn 1001 it becomes not stale-and-low-confidence but confidently, specifically wrong — and it still has the highest retrieval score in the store, because it always did.

This is the failure mode a recent benchmark on memory validity made uncomfortably concrete. When researchers tested whether frontier models and specialized memory frameworks could recognize that their stored memories were no longer valid, the best system scored around 55% — barely better than a coin flip. Models routinely accept the outdated assumption embedded in a user's own question, and they fail to propagate a change: when the user updates one fact, the agent does not invalidate the related facts that depended on it. Tell the agent you moved cities and it will not reconsider the "favorite local coffee shop" it helpfully remembered.

A pure decay model cannot catch this, because the stale fact never looked low-confidence. Invalidation is not a timer. It is a dependency graph.
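One way to picture invalidation as a dependency graph is the sketch below. It assumes each memory declares which other memories it was derived from; the `MemoryGraph` class and the keys are hypothetical, chosen to mirror the moved-cities example above.

```python
from collections import defaultdict, deque

class MemoryGraph:
    def __init__(self):
        self.valid = {}                     # key -> is the memory still valid?
        self.dependents = defaultdict(set)  # key -> memories derived from it

    def add(self, key, depends_on=()):
        self.valid[key] = True
        for parent in depends_on:
            self.dependents[parent].add(key)

    def invalidate(self, key):
        """Invalidate `key` and everything downstream; return affected keys."""
        affected = []
        queue = deque([key])
        while queue:
            k = queue.popleft()
            if not self.valid.get(k, False):
                continue  # unknown or already invalid: nothing to propagate
            self.valid[k] = False
            affected.append(k)
            queue.extend(sorted(self.dependents[k]))
        return affected

g = MemoryGraph()
g.add("home_city")
g.add("favorite_coffee_shop", depends_on=["home_city"])
g.add("employer")
print(g.invalidate("home_city"))  # ['home_city', 'favorite_coffee_shop']
```

The unrelated `employer` memory is untouched. A timer would have treated all three entries the same way.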

Four disciplines a memory store actually needs

If you treat the memory layer as the cache it is, the operational requirements are not exotic. They are the same four things any cache needs, translated into the agent's vocabulary.

Provenance, so a memory can be invalidated. Every stored memory should link back to the source event that produced it — the message, the document, the tool result. Without that link, you can never answer "is this still true?" because you do not know what this was derived from. With it, a memory whose source was deleted, corrected, or superseded can be invalidated mechanically. The frameworks that take this seriously — Mem0's scoped provenance, Zep's episode-level provenance, MemOS's versioned provenance metadata — are not adding bureaucracy. They are adding the dependency edges that make invalidation possible at all. A memory with no provenance is a cache entry with no key: you can read it, but you can never safely expire it.
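To make the mechanical part concrete, here is a minimal sketch of source-linked invalidation. It assumes every memory stores the ID of the event it was extracted from; `SourcedMemory`, `invalidate_by_source`, and the `msg-...` IDs are hypothetical and do not reflect how Mem0, Zep, or MemOS implement provenance.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SourcedMemory:
    text: str
    source_id: str   # hypothetical ID of the message, document, or tool result
    valid: bool = True

def invalidate_by_source(store: List[SourcedMemory], source_id: str) -> List[SourcedMemory]:
    """Invalidate every still-valid memory derived from one source event."""
    hit = [m for m in store if m.source_id == source_id and m.valid]
    for m in hit:
        m.valid = False
    return hit

store = [
    SourcedMemory("user prefers Postgres", source_id="msg-101"),
    SourcedMemory("project database is Postgres", source_id="msg-101"),
    SourcedMemory("user's name is Sam", source_id="msg-007"),
]
# The user later corrects message msg-101, so everything derived from it goes.
retired = invalidate_by_source(store, "msg-101")
print([m.text for m in retired])
```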

A write policy that separates durable facts from situational ones. "The user's name is Sam" and "the user is frustrated today" should not enter the same store with the same lifetime. Most extraction pipelines flatten both into "a memory" and let them rot together. A useful write policy classifies on the way in — identity facts, stable preferences, project state, transient mood — and assigns each class a different default lifetime and a different confidence-decay curve. The cheapest invalidation is the write that never happened: a lot of what agents currently persist is context that should have died with the session.
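A write policy along these lines could be sketched as a lookup from memory class to default lifetime. The four classes come from the paragraph above; the specific day counts are placeholder assumptions, and the classifier that assigns a class to an incoming fact is assumed to exist upstream.

```python
from enum import Enum
from typing import Optional, Tuple

class MemoryClass(Enum):
    IDENTITY = "identity"
    STABLE_PREFERENCE = "stable_preference"
    PROJECT_STATE = "project_state"
    TRANSIENT_MOOD = "transient_mood"

# Hypothetical defaults in days. None means no time-based expiry;
# 0 means the fact should die with the session and never be written.
DEFAULT_TTL_DAYS = {
    MemoryClass.IDENTITY: None,
    MemoryClass.STABLE_PREFERENCE: 180,
    MemoryClass.PROJECT_STATE: 30,
    MemoryClass.TRANSIENT_MOOD: 0,
}

def write_decision(memory_class: MemoryClass) -> Tuple[bool, Optional[int]]:
    """Return (should_persist, ttl_days) for a classified fact."""
    ttl = DEFAULT_TTL_DAYS[memory_class]
    should_persist = ttl is None or ttl > 0
    return should_persist, ttl

print(write_decision(MemoryClass.IDENTITY))        # (True, None)
print(write_decision(MemoryClass.TRANSIENT_MOOD))  # (False, 0)
```

Even a table this crude forces the question the flattened pipeline never asks: how long is this kind of fact expected to stay true?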

A conflict-resolution policy, not a conflict-surfacing one. A naive vector search retrieves "the user prefers Postgres" and "the user prefers a graph database" with equal enthusiasm and hands both to the model, which is now trapped between contradictory instructions. Retrieval needs an adjudication rule before it returns: bias toward recency with an explicit time-decay penalty on the similarity score, prefer facts the user asserted directly over facts the agent inferred, and prefer higher-authority sources over lower ones. The goal is to return the current state of truth, not a transcript of every state it has ever been in.
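One possible shape for such an adjudication rule is sketched below. It assumes the candidates have already been identified as conflicting claims about the same subject, and it folds the "asserted versus inferred" and "source authority" preferences into a single weight. The weights, the 90-day half-life, and all names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical authority weights: direct user statements outrank inferences.
SOURCE_WEIGHT = {"user_asserted": 1.0, "agent_inferred": 0.6}

@dataclass
class Candidate:
    text: str
    similarity: float  # raw vector-search similarity in [0, 1]
    age_days: float
    source: str        # a key of SOURCE_WEIGHT

def adjudicated_score(c: Candidate, half_life_days: float = 90.0) -> float:
    """Similarity, penalized by age and scaled by source authority."""
    recency = 0.5 ** (c.age_days / half_life_days)
    return c.similarity * recency * SOURCE_WEIGHT[c.source]

def resolve(conflicting: List[Candidate]) -> Candidate:
    """Return the single candidate that best represents current truth."""
    return max(conflicting, key=adjudicated_score)

old = Candidate("user prefers Postgres", 0.92, age_days=240, source="user_asserted")
new = Candidate("user prefers a graph database", 0.90, age_days=5, source="user_asserted")
print(resolve([old, new]).text)  # user prefers a graph database
```

On raw similarity alone the eight-month-old preference would have ranked first. The decay term is what lets the newer statement win.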

Tombstones instead of deletes. When a memory is invalidated, do not silently drop it. Mark it INVALID, keep the record, and keep an audit trail. This matters for the same reason it matters in a database: an immutable history lets you debug "why did the agent do that," lets you recover from a bad invalidation, and lets a future engineer see that a fact was once true and was deliberately retired — rather than rediscovering it as a fresh surprise.
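A minimal in-memory sketch of the tombstone pattern follows. The `TombstoneStore` class and its method names are hypothetical; a real store would persist the audit log and record timestamps.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Record:
    value: str
    status: str = "VALID"        # "VALID" or "INVALID"
    reason: Optional[str] = None

class TombstoneStore:
    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}
        self.audit_log: List[Tuple[str, str, str]] = []  # (action, key, detail)

    def put(self, key: str, value: str) -> None:
        self._records[key] = Record(value)
        self.audit_log.append(("put", key, value))

    def invalidate(self, key: str, reason: str) -> None:
        """Mark the record INVALID but keep it for debugging and recovery."""
        rec = self._records[key]
        rec.status, rec.reason = "INVALID", reason
        self.audit_log.append(("invalidate", key, reason))

    def restore(self, key: str) -> None:
        """Undo a bad invalidation."""
        rec = self._records[key]
        rec.status, rec.reason = "VALID", None
        self.audit_log.append(("restore", key, ""))

    def get(self, key: str) -> Optional[str]:
        """Retrieval only ever sees valid records."""
        rec = self._records.get(key)
        return rec.value if rec is not None and rec.status == "VALID" else None

    def history(self, key: str) -> List[Tuple[str, str, str]]:
        return [entry for entry in self.audit_log if entry[1] == key]

s = TombstoneStore()
s.put("employer", "Acme")
s.invalidate("employer", reason="user changed jobs")
print(s.get("employer"))      # None
print(s.history("employer"))  # both the put and the invalidation survive
```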

Why this falls through the org chart

The technical fixes are not the hard part. The hard part is that "memory" was adopted as a product feature and never assigned to anyone as an operational responsibility.

When a team ships a Redis cache, somebody owns its eviction policy, its hit rate, its memory ceiling, and the alert that fires when it misbehaves. When the same team ships agent memory, the product side owns "the agent remembers things" as a capability, and nobody owns the staleness. There is no dashboard for the percentage of retrieved memories that are stale, because staleness is invisible until a user re-corrects something they already corrected — and that re-correction shows up as a vague "the agent feels dumb sometimes" complaint, not a cache-miss metric.

So the discipline that exists for every other cache in the stack — the TTLs, the eviction alarms, the invalidation tests — never gets built for the one cache that was named to look like it didn't need them. The fix is partly cultural: stop calling it memory in design docs and start calling it the memory cache, so the review questions that protect every other cache get asked about this one too. Ask "what is the eviction policy" in the design review. Ask "what invalidates this entry" for each class of stored fact. Ask "how do we know the hit rate is healthy" and accept that the honest answer today is usually "we don't."

Test against stale memory, not just recall

One last gap, because it is the one that lets the others survive contact with a green dashboard. Most teams that evaluate their memory layer at all evaluate recall: given that the agent stored a fact, does it retrieve and use that fact later? That is necessary and it is the easy half.

The eval slice almost nobody builds is the adversarial one: seed the store with a fact, then change the world so the fact is now wrong, and measure whether the agent notices. Does it act on the stale preference, or does it re-confirm? When two stored memories conflict, does it pick the recent one or surface both and freeze? When the user's query embeds an outdated assumption, does the agent inherit the assumption or correct it? A memory system that scores 95% on recall and was never tested on staleness has been certified for the half of the problem that was never going to hurt you.
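The shape of that eval slice can be sketched in a few lines. This is a toy harness under strong assumptions: the "agent" is any callable that receives the seeded memory, the later world update, and the fact being probed, and returns its answer as a string. The two stub agents stand in for a real system and exist only to show what the score separates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class StalenessCase:
    seeded_memory: Dict[str, str]  # what the store held before the change
    world_update: Dict[str, str]   # what became true afterwards
    key: str                       # the fact under test

Agent = Callable[[Dict[str, str], Dict[str, str], str], str]

def staleness_score(agent: Agent, cases: List[StalenessCase]) -> float:
    """Fraction of cases where the answer reflects the updated world."""
    passed = sum(
        1 for c in cases
        if agent(c.seeded_memory, c.world_update, c.key) == c.world_update[c.key]
    )
    return passed / len(cases)

def recall_only_agent(memory, update, key):
    return memory[key]                   # trusts the store unconditionally

def update_aware_agent(memory, update, key):
    return update.get(key, memory[key])  # prefers newer evidence when present

cases = [
    StalenessCase({"database": "Postgres"}, {"database": "graph database"}, "database"),
    StalenessCase({"team": "growth"}, {"team": "platform"}, "team"),
]
print(staleness_score(recall_only_agent, cases))   # 0.0
print(staleness_score(update_aware_agent, cases))  # 1.0
```

The recall-only stub would pass every recall test on the seeded facts and still score zero here.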

The reframe is the whole point. Long-term memory is not a new kind of intelligence the agent acquired. It is a cache of facts about a world that keeps moving, and the longer the agent runs, the more of that cache is quietly wrong. A cache without an invalidation strategy is not a feature. It is a slow-motion correctness bug — and "memory" is just the name that kept it out of code review.
