Agent Memory Eviction: Why LRU Survives a Model Upgrade and Salience Doesn't
The team that ships an agent with salience-weighted memory eviction has, without realizing it, signed up for a memory migration project at every model upgrade. The eviction policy looks like a quality lever: pick the smartest scoring approach, get the best recall. It is secretly a versioning contract. When the scoring model changes, the agent's effective past changes too. None of the tooling that teams build around prompts and evals catches this, because the artifact that drifted is not a prompt or an eval. It is a sequence of decisions about what to forget, made months ago, by a model that no longer exists.
LRU and LFU don't have this problem. They are deterministic, model-independent, and survive upgrades cleanly. They also throw away information that a thoughtful judge would have kept. That is the tradeoff most teams accept once, on day one, when a demo recall metric is the only thing being measured — and it is the tradeoff that bites quarterly for the rest of the agent's lifetime.
The Three Patterns Teams Actually Choose Between
Long-running agents accumulate observations, tool outputs, and user statements faster than any context window can absorb them. Even with a million-token model, conversations that span weeks blow past the budget, and inference cost scales roughly linearly with what you keep in context. Some kind of working-memory store with an eviction policy becomes mandatory.
In practice, three patterns dominate.
LRU by access time evicts the entry that was read or written the longest ago. Every retrieval refreshes the recency stamp; entries that never get pulled into context drift toward the eviction frontier. It is what you get if you wire a memory store to a standard cache library and stop thinking about it.
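A minimal sketch of that pattern, assuming nothing beyond the Python standard library; the string key/value shapes are illustrative, not any particular store's schema:

```python
from collections import OrderedDict

class LRUMemoryStore:
    """Evict whichever entry was read or written longest ago."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._entries: OrderedDict[str, str] = OrderedDict()

    def put(self, key: str, memory: str) -> None:
        if key in self._entries:
            self._entries.move_to_end(key)     # a write refreshes recency
        self._entries[key] = memory
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop the stalest entry

    def get(self, key: str) -> str | None:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)         # a read refreshes recency too
        return self._entries[key]
```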
LFU by reference count evicts entries that have been accessed the fewest times. Entries that get pulled into many turns become sticky; one-off mentions get displaced. Redis and most production cache discussions warn that pure LFU has a known failure mode — workload shifts cause hit-ratio degradation of up to 40% because the long-tail counters anchor stale items in place.
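The same kind of sketch for LFU, with the anchoring problem visible in the code: the counter only ever goes up, so an entry that was popular in week one outranks everything new in week twenty even if nothing reads it again.

```python
import itertools

class LFUMemoryStore:
    """Evict the least-accessed entry; break ties by insertion age."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._clock = itertools.count()  # monotone counter for tie-breaking
        # key -> (memory, access_count, insertion_order)
        self._entries: dict[str, tuple[str, int, int]] = {}

    def put(self, key: str, memory: str) -> None:
        if key in self._entries:
            _, count, born = self._entries[key]
            self._entries[key] = (memory, count + 1, born)
            return
        if len(self._entries) >= self.capacity:
            # Victim is the entry with the lowest count. Counts never
            # decay, so stale-but-once-popular entries stay anchored.
            victim = min(self._entries, key=lambda k: self._entries[k][1:])
            del self._entries[victim]
        self._entries[key] = (memory, 1, next(self._clock))

    def get(self, key: str) -> str | None:
        if key not in self._entries:
            return None
        memory, count, born = self._entries[key]
        self._entries[key] = (memory, count + 1, born)
        return memory
```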
Salience-weighted by an LLM judge scores each candidate memory with a small prompt — "rate from 0 to 1 how important this fact is to remember about the user" — and evicts the lowest-scoring entries. Mem0 markets a version of this as "extract salient facts, discard the rest" and reports 80% token reduction. Letta's Recall Memory layer does something adjacent: an LLM decides what gets summarized into Core Memory and what gets paged out.
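A sketch of the judge pattern, to make the shape of the later argument concrete. This is not Mem0's or Letta's actual interface; `complete` stands in for whichever LLM client you use, assumed here to take a prompt string and return the model's text reply.

```python
from typing import Callable

JUDGE_PROMPT = (
    "Rate from 0 to 1 how important this fact is to remember "
    "about the user. Reply with only the number.\n\nFact: {fact}"
)

def salience_score(fact: str, complete: Callable[[str], str]) -> float:
    """Ask the judge model to score one candidate memory."""
    reply = complete(JUDGE_PROMPT.format(fact=fact))
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # unparseable judge output is treated as lowest salience

def evict_to_capacity(memories: dict[str, str], capacity: int,
                      complete: Callable[[str], str]) -> dict[str, str]:
    """Keep the `capacity` highest-scoring entries and drop the rest."""
    ranked = sorted(memories.items(),
                    key=lambda kv: salience_score(kv[1], complete),
                    reverse=True)
    return dict(ranked[:capacity])
```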
Day-one demo quality runs roughly: salience > LFU > LRU. Salience makes the right calls about which entries are load-bearing for future questions because the judge actually reads the content. LRU is dumb. LFU is cleverer but biased toward early popular items. The recall metric on a fresh eval set will rank them in that order, and that is usually how the choice gets made.
Salience Is Coupled to a Model Version You Don't Control
Here is what nobody scopes when picking salience: the policy is a function call. score(memory, user_context) → float. That function is implemented by an LLM, and the LLM has a version. When the version changes — because your provider rolled out a new snapshot, because you swapped to a cheaper model, because someone bumped the system prompt for the judge — the function changes.
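Written honestly, the signature carries parameters nobody puts in the design doc. A sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JudgeVersion:
    model_snapshot: str  # provider snapshot serving the judge
    prompt_hash: str     # hash of the judge's scoring prompt

# The signature the team believes it chose:
#     score(memory, user_context) -> float
# The signature it actually chose:
def score(memory: str, user_context: str, judge: JudgeVersion) -> float:
    # Change judge.model_snapshot or judge.prompt_hash and you have a
    # different function, and therefore a different set of survivors.
    raise NotImplementedError("backed by the judge LLM")
```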
The Stanford and UC Berkeley study that tracked GPT-4 between March and June 2023 found accuracy on identifying prime numbers dropped from 84% to 51% on the same prompts. That study became the canonical evidence for what the LLM observability crowd now calls "prompt drift" or "silent versioning" — the model behind your endpoint changes without your endpoint changing. A 2026 multi-model study on judge inconsistency made the problem more concrete: even at temperature 0, completeness scores fluctuate substantially across models, and cross-model strictness varies systematically. Two judges looking at the same memory will disagree on whether it is worth keeping.
Now apply that to a memory store that has been running a salience policy for six months. Half the memories were scored and retained by judge v1. The other half were scored and retained by judge v2. The store as a whole reflects neither judge's worldview. The agent's "what it remembers about you" is a stratigraphy of judge versions, with no clean way to know which layer was responsible for keeping or discarding what.
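One way to see the stratigraphy is to tag each retained entry with the judge version that last kept it. The `kept_by` field below is hypothetical; the point is that without something like it, the layering cannot even be audited.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class RetainedMemory:
    text: str
    kept_by: str  # hypothetical: judge version that last scored this entry

def stratigraphy(store: list[RetainedMemory]) -> Counter:
    """Count which judge version is responsible for each survivor."""
    return Counter(m.kept_by for m in store)

store = [
    RetainedMemory("prefers metric units",  kept_by="judge-v1/2026-01"),
    RetainedMemory("works night shifts",    kept_by="judge-v2/2026-06"),
    RetainedMemory("has a peanut allergy",  kept_by="judge-v1/2026-01"),
]
print(stratigraphy(store))
# Counter({'judge-v1/2026-01': 2, 'judge-v2/2026-06': 1})
```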
Worse, the failures are silent. The agent does not stop working. It just remembers different things. A user who told the agent about a peanut allergy in March 2026 might find that information evicted in July when judge v2 — which prioritizes recent conversational context over background facts — re-evaluates the eviction queue. There is no error log, no failed test, no regression alarm. The eval set, which gets refreshed against current production state, reflects the new memory shape and looks normal.
LRU and LFU Don't Have a Hidden Dependency
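An LRU or LFU decision is a pure function of the access trace. No model is consulted, so there is no version to pin and nothing to drift: replay the same trace after any upgrade and the store comes out identical. A minimal illustration of that replay property:

```python
from collections import OrderedDict

def replay_lru(trace: list[str], capacity: int) -> list[str]:
    """Replay an access trace through LRU eviction.

    Returns the surviving keys, least to most recently used. A pure
    function of the trace: no judge, no snapshot, no drift.
    """
    entries: OrderedDict[str, None] = OrderedDict()
    for key in trace:
        if key in entries:
            entries.move_to_end(key)
            continue
        if len(entries) >= capacity:
            entries.popitem(last=False)
        entries[key] = None
    return list(entries)

# The same trace yields the same memory state in March and in July.
assert replay_lru(["a", "b", "c", "a", "d"], capacity=3) == ["c", "a", "d"]
```

No equivalent function can be written for salience: replaying its decisions would require the retired judge.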
References
- https://arxiv.org/abs/2601.01885
- https://www.letta.com/blog/agent-memory
- https://mem0.ai/blog/state-of-ai-agent-memory-2026
- https://redis.io/blog/lfu-vs-lru-how-to-choose-the-right-cache-eviction-policy/
- https://arxiv.org/abs/2603.04417
- https://www.digitalocean.com/community/tutorials/model-silent-versioning-problem
- https://agenta.ai/blog/prompt-drift
- https://arxiv.org/html/2504.02220v1
