Hierarchical Memory Compaction: The Four Tiers Your Agent Memory Is Missing
Most agent memory systems collapse a four-layer problem into two layers and then act surprised when the seams show. There is the conversation buffer that gets truncated when it overflows the context window, and there is the vector store of "long-term memory" that everything older than the buffer gets dumped into. That is not a memory architecture. That is a queue and a junk drawer.
The agent that re-asks a regular user the same onboarding question three Mondays in a row is not failing because the model is bad. It is failing because there is no place in the system that holds "things this user has told me across sessions" with a different lifetime than "things every user has ever told me about how the product works." Those are different memories. They have different access patterns, different privacy contracts, and different rules for when to forget. Conflating them is the architectural mistake — and it has a fix.
The Four Tiers, And Why Two Is Not Enough
Cognitive architecture work on language agents — the CoALA taxonomy is the cleanest articulation — splits memory into working, episodic, semantic, and procedural. For most product agents you can leave procedural memory inside the model and the codebase, and focus on the other three plus a session tier that sits between working and episodic. Four tiers, each with a job no other tier can do.
Working memory is the current task. It is the scratchpad: the user's latest message, the last few tool outputs, the partial plan the agent is executing. It resets when the task ends. Its lifetime is measured in seconds to a few minutes. Everything in it is in the prompt right now, and most of it should not survive the task.
Session memory is the current sitting. The user opened the assistant fifteen minutes ago and has been bouncing between three related tasks. They named a project early on and have been referring to it as "the project" since. The working memory of task three has no idea what "the project" was — that fact lived in task one. Session memory holds it: a compact, structured record of what happened earlier in this session that survives across tasks within a sitting and dies when the user closes the conversation.
Episodic memory is the per-user history. Things this specific user has told the agent — preferences, ongoing projects, decisions made last week, tone they like — retrieved when relevant to the current task. Lifetime is weeks to forever. Scope is one user. This is the tier that lets the agent stop re-asking questions.
Semantic memory is the cross-user knowledge. Facts about the product, the domain, the integrations, the way one tool composes with another — distilled from many sessions and many users, deduplicated and verified. Lifetime is forever. Scope is everyone.
The reason two tiers are not enough is that every pair of these tiers differs in some fundamental way. A session-memory fact and an episodic-memory fact look identical at the moment of capture ("user said the project is called Atlas") but have different decay rules — one dies at session end, the other persists until the user changes it. A semantic-memory fact and an episodic-memory fact look identical at retrieval time ("the email integration uses OAuth") but have different privacy contracts — one is shared across the tenant, the other must never leak. Lump them together and you get the worst behaviors of both: the system either treats per-user facts as global and leaks them, or treats global facts as per-user and re-derives them on every session.
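One way to make the contrast concrete is to write the per-tier contract down as data. A minimal sketch: the tier names come from the taxonomy above, but the `TierPolicy` fields and their values are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Tier(Enum):
    WORKING = auto()   # current task, lives in the prompt
    SESSION = auto()   # current sitting, dies at session end
    EPISODIC = auto()  # per-user history, weeks to forever
    SEMANTIC = auto()  # cross-user knowledge, forever

@dataclass(frozen=True)
class TierPolicy:
    scope: str     # who may read facts in this tier
    lifetime: str  # how long facts in this tier live
    eviction: str  # what removes a fact from this tier

# One policy per tier; no two rows agree on every column, which is the point.
POLICIES = {
    Tier.WORKING:  TierPolicy("this task",    "seconds to minutes", "task end, prompt budget"),
    Tier.SESSION:  TierPolicy("this sitting", "minutes to hours",   "session-end TTL"),
    Tier.EPISODIC: TierPolicy("one user",     "weeks to forever",   "contradiction, inactivity, user request"),
    Tier.SEMANTIC: TierPolicy("all users",    "forever",            "supersession, not age"),
}
```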
Promotion And Eviction Are The Architecture
A four-tier memory is not four storage buckets. It is four storage buckets plus the rules for what moves between them. Those rules — promotion (data moves up a tier into longer-lived storage) and eviction (data drops out) — are what make the architecture work. Without explicit rules, the tiers fill up with noise and the retrieval quality at every layer collapses.
Promotion has to be selective. Most of working memory should evaporate at task end. The plan the agent assembled, the intermediate tool outputs, the false starts — none of that belongs in session memory, much less episodic. What should be promoted is the small set of durable facts produced during the task: a decision the user made, a preference they expressed, a project name they introduced. The same selectivity applies up the stack. Most of session memory should evaporate at sitting end. What gets promoted to episodic is the subset that is about the user rather than about this conversation — facts that the user would expect the system to remember next week.
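The task-end promotion step is small enough to write as a single function. In this sketch, `extract_durable_facts` is a stand-in for whatever extraction you use (an LLM call, a rule set), and the session store is simplified to a list; both are assumptions for illustration.

```python
from typing import Callable

def end_of_task_promotion(
    task_transcript: str,
    extract_durable_facts: Callable[[str], list[str]],  # hypothetical extractor
    session_store: list[str],
) -> None:
    # Promote only the durable facts produced during the task; plans,
    # intermediate tool outputs, and false starts evaporate with it.
    for fact in extract_durable_facts(task_transcript):
        if fact not in session_store:  # crude dedup, enough for a sketch
            session_store.append(fact)
```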
Promotion from episodic to semantic is the hardest of the three and the one most teams skip. It requires de-personalization: a fact that was specific to one user becomes generic enough to apply to others. "User reported that the export-to-CSV button is broken" is episodic; the bug it points at, once verified across users, becomes a semantic fact about the product. This is the tier most worth automating with periodic LLM-driven consolidation jobs that scan episodic patterns and propose semantic facts — but it is also where you absolutely need a verification step, because a bad promotion poisons the shared store for every user downstream.
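A consolidation job can make that verification step concrete by requiring independent support across users before anything is even proposed for the shared store. This is a sketch under assumed inputs: the facts arrive already de-personalized, and the three-user threshold is arbitrary.

```python
from collections import defaultdict

def propose_semantic_facts(
    episodic_facts: list[tuple[str, str]],  # (user_id, de-personalized fact)
    min_distinct_users: int = 3,            # verification threshold, illustrative
) -> list[str]:
    # Propose a cross-user fact only when independently observed by enough
    # distinct users. The output goes to a review queue, not straight into
    # the semantic store: one bad promotion poisons every user downstream.
    supporters: dict[str, set[str]] = defaultdict(set)
    for user_id, fact in episodic_facts:
        supporters[fact].add(user_id)
    return [f for f, users in supporters.items() if len(users) >= min_distinct_users]
```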
Eviction is the symmetric problem. The episodic store will grow indefinitely if nothing ever leaves. Time-based expiry is the wrong default — useful facts get older without becoming less useful, and most users will hate an assistant that "forgets" their company's name after ninety days. Better defaults: evict on contradiction (the user said something new that supersedes an older fact), evict on irrelevance (a project the user never mentions anymore decays in retrieval weight before it disappears), and evict on user request (an explicit "forget that" should actually delete, not just mark hidden). Each tier needs its own eviction policy. Working has hard limits on prompt budget. Session has time-bounded TTL. Episodic has activity-based decay. Semantic evicts on supersession, not on age.
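The episodic defaults fit in a few lines. The sketch below uses a plain list and a `topic` string as a crude contradiction key; a real store would use embedding similarity or structured slots for that, so treat every name here as illustrative.

```python
import time
from dataclasses import dataclass, field

@dataclass
class EpisodicFact:
    text: str
    topic: str  # crude key for contradiction detection, illustrative only
    last_used: float = field(default_factory=time.time)

class EpisodicStore:
    def __init__(self) -> None:
        self.facts: list[EpisodicFact] = []

    def upsert(self, fact: EpisodicFact) -> None:
        # Evict on contradiction: a new fact on the same topic supersedes the old.
        self.facts = [f for f in self.facts if f.topic != fact.topic]
        self.facts.append(fact)

    def forget(self, topic: str) -> None:
        # Evict on user request: actually delete, don't just mark hidden.
        self.facts = [f for f in self.facts if f.topic != topic]

    def retrieval_weight(self, fact: EpisodicFact, half_life_days: float = 30) -> float:
        # Evict on irrelevance: decay retrieval weight before the fact disappears.
        age_days = (time.time() - fact.last_used) / 86400
        return 0.5 ** (age_days / half_life_days)
```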
Tier-Aware Retrieval Is Cheaper And Better
The naive retrieval pattern asks one question — "fetch everything relevant to this query" — and runs it against the union of all stored memories. That pattern is wasteful and dangerous. Wasteful because it queries semantic memory for facts that are already in working memory two sentences up, and pays for vector lookups the conversation buffer would have answered for free. Dangerous because the union query can surface a stale episodic copy of a fact the user restated minutes ago, and because every query against the shared store is one missing tenant filter away from another user's data.
The right pattern is to query the closest tier first and only fall back when the local context is insufficient. Start with working memory, which is already in the prompt — if the answer is there, no retrieval call happens. Fall through to session memory, a small structured store that responds in single-digit milliseconds. Then episodic memory, scoped to this user, which is a vector lookup with a tight tenant filter. Only then semantic memory, which is the broadest and most expensive search.
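The fall-through reads naturally as an ordered loop. The tier lookups below are stand-in callables rather than any real client API; what the sketch shows is the ordering and the early return, which is where the savings live.

```python
from typing import Callable, Optional

def tiered_retrieve(
    query: str,
    in_working: Callable[[str], Optional[str]],   # the prompt buffer, free
    in_session: Callable[[str], Optional[str]],   # small structured store, ~ms
    in_episodic: Callable[[str], Optional[str]],  # per-user vector lookup
    in_semantic: Callable[[str], Optional[str]],  # broadest, most expensive search
) -> Optional[str]:
    # Query the closest tier first; fall through only when it comes up empty.
    for lookup in (in_working, in_session, in_episodic, in_semantic):
        answer = lookup(query)
        if answer is not None:
            return answer  # the cheapest tier is the one you didn't query
    return None
```

In practice "insufficient" is a confidence judgment rather than a `None` check, but the control flow stays the same.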
This pattern matters for cost: the cheapest tier is the one you didn't query. It also matters for correctness. If the user just told the agent something five turns ago and the system is reaching into a vector store to maybe-find a stale version of the same fact, the freshness ordering is backwards. Working memory is by definition the freshest. Treating it as a first-class retrieval source — not just as the prompt — eliminates a class of bugs where the agent contradicts itself within a single session because it pulled a stale episodic fact over a fresh working one.
The eval that catches the wrong layering is straightforward to write: take a question whose answer was given by the user three turns ago, and verify it is answered from working or session memory rather than from semantic memory. If the trace shows a vector retrieval against a tenant-wide store for a fact that lived in the conversation buffer, the layering is broken regardless of whether the answer happened to be correct.
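As a sketch, assuming a hypothetical `run_agent_with_trace` harness that returns the answer plus the (tier, query) retrieval events recorded during the turn:

```python
def test_recent_fact_answered_locally():
    # `run_agent_with_trace` is a hypothetical eval harness, not a real API.
    history = [
        ("user", "The project is called Atlas."),
        ("assistant", "Got it."),
        ("user", "Rename the staging branch to match."),
    ]
    answer, trace = run_agent_with_trace(history, "What is the project called?")
    assert "Atlas" in answer
    tiers_queried = {tier for tier, _ in trace}
    # Correct layering: no tenant-wide vector search for a buffer-local fact.
    assert "semantic" not in tiers_queried
```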
The Privacy Contract Is Tier-Local Or It Is Broken
The most consequential property of the four-tier model is that privacy boundaries align with tier boundaries. Working memory is in the user's session and lives in the prompt; it cannot leak by construction because no other user's prompt can read it. Session memory is the same. Episodic memory is per-user but durable; it must be queried with a tenant filter that cannot be bypassed. Semantic memory is shared across users by design; nothing user-specific should ever live there.
When teams collapse episodic and semantic into a single vector store with a user_id field that the query layer is supposed to filter on, they create the cross-session-leak vulnerability that has become a regular finding in red-team reports of multi-tenant LLM systems. A semantically similar piece of content from another user's history gets retrieved because the embedding similarity score outranked the tenant filter — or because a developer later wrote a query that forgot the filter. The defense is not "remember to filter every query." The defense is to put episodic and semantic in different stores so that querying the wrong tier requires connecting to the wrong database, not just dropping a WHERE clause.
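In code, the defense looks like two clients with two connection strings, with the user scope fixed when the episodic client is constructed rather than passed per query. Everything here is a skeleton with assumed names, backed by in-memory lists so it runs:

```python
class EpisodicClient:
    """Per-user memory. The user scope is fixed at construction, not per query."""

    def __init__(self, conn_string: str, user_id: str) -> None:
        self._conn = conn_string    # points at the episodic database, nothing else
        self._user_id = user_id     # no later query can widen this scope
        self._rows: list[str] = []  # stand-in for this user's partition

    def search(self, query: str) -> list[str]:
        # Every lookup runs inside this user's partition; there is no
        # "forgot the WHERE clause" failure mode left to write.
        return [r for r in self._rows if query.lower() in r.lower()]

class SemanticClient:
    """Shared knowledge. A different store, a different connection, no user data."""

    def __init__(self, conn_string: str) -> None:
        self._conn = conn_string
        self._rows: list[str] = []

    def search(self, query: str) -> list[str]:
        return [r for r in self._rows if query.lower() in r.lower()]
```

Querying the wrong tier now means constructing the wrong client against the wrong connection string, which is the kind of mistake code review actually catches.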
The same principle scales to organizational boundaries. If your agent is multi-tenant in the enterprise sense — different companies use the same deployment — the episodic tier itself needs to be partitioned per tenant, with the query path hard-coded to a single partition. Semantic facts that would otherwise promote to a shared store stay tenant-local unless explicitly approved for global promotion. The promotion pipeline becomes the place where privacy review happens, because it is the only place where data crosses a boundary. Centralizing it there is far easier to audit than scattering tenant filters across every query in the application.
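Centralizing the crossing makes it one function. The list-backed stores and the approval flag are assumptions for illustration; what matters is that this is the only code path where data leaves a tenant.

```python
def promote_to_semantic(
    candidate_fact: str,
    tenant_store: list[str],          # this tenant's semantic partition
    global_store: list[str],          # the shared store; crossing in is the audit point
    approved_for_global: bool = False,
) -> None:
    # The only code path where a fact leaves its tenant. Because every
    # promotion funnels through here, privacy review is one audit of one
    # function, not a hunt for tenant filters across the codebase.
    if approved_for_global:
        global_store.append(candidate_fact)  # explicit approval required
    else:
        tenant_store.append(candidate_fact)  # default: stay tenant-local
```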
This is why "agent memory" is a dangerous abstraction when treated monolithically. The phrase implies a single system with a single privacy posture, when what you actually have is four systems with four different postures. The most secure deployments treat them as four separate services with four separate auth scopes, and the consolidation pipelines that move data between them as the high-trust boundaries they actually are.
Building It Without Buying Into A Framework
You do not need to adopt a memory framework to get this architecture. The tier model maps to primitives most teams already have: working memory is the prompt buffer, session memory is a TTL'd row in the cache, episodic memory is a per-user partition of the vector store, and semantic memory is the global partition. The architectural commitment is to keep them separate at the storage layer and to write the promotion and retrieval pipelines as first-class code rather than as ad hoc summarization calls.
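For the session tier in particular, the primitive really is that small. A sketch assuming a local Redis and illustrative key names:

```python
import json

import redis  # session tier: a TTL'd row in the cache you already run

r = redis.Redis()  # assumes a local Redis; swap in your own connection

def save_session_memory(session_id: str, facts: list[str], ttl_seconds: int = 3600) -> None:
    # The TTL is the eviction policy: session memory dies with the sitting.
    r.set(f"session-memory:{session_id}", json.dumps(facts), ex=ttl_seconds)

def load_session_memory(session_id: str) -> list[str]:
    raw = r.get(f"session-memory:{session_id}")
    return json.loads(raw) if raw else []
```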
Start with the explicit boundary, not with sophisticated promotion logic. Define what counts as session vs. episodic in plain language and write a small classifier — even a hand-crafted prompt that runs at session end, as sketched below, is better than the implicit "everything goes to the vector store" default. Define the eviction policy per tier before the stores fill up, because retrofitting eviction onto a vector store with two years of accumulated noise is a much harder project than getting the policy right early. Build retrieval as a tiered query plan rather than a single flat lookup — the function that answers "what does the user want" should consult the tiers in order, and that ordering is where most of the retrieval cost gets amortized away.
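The session-end classifier can start embarrassingly simple. In this sketch, `call_llm` is a hypothetical stand-in for whatever model client you already use, and the one-verdict-per-line contract is an assumption a production version would replace with structured output:

```python
SESSION_END_CLASSIFIER = """\
You are sorting memory at the end of a session. For each fact below, answer
on its own line: EPISODIC if the user would expect it remembered next week
(preferences, projects, decisions about them), or DISCARD if it only
mattered to this conversation.

Facts:
{facts}
"""

def classify_session_facts(facts: list[str], call_llm) -> list[str]:
    # `call_llm` is a hypothetical stand-in for your model client.
    prompt = SESSION_END_CLASSIFIER.format(
        facts="\n".join(f"- {fact}" for fact in facts)
    )
    verdicts = call_llm(prompt).splitlines()
    # Assumes one verdict per fact, in order; brittle, but fine for a first cut.
    return [f for f, v in zip(facts, verdicts) if v.strip().upper().startswith("EPISODIC")]
```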
The agent memory problem is not solved by a bigger context window or a better embedder. Both of those help, but neither addresses the fact that an agent needs four memories doing four different jobs, and the team that ships only two is going to keep paying for the missing two in retrieval cost, in re-asked questions, and eventually in the kind of cross-tenant leak that ends up on the front page. Build the four tiers, write the promotion rules, scope the privacy contract per tier — and then most of the "agent forgot" and "agent leaked" failure modes stop being mysterious.
- https://arxiv.org/abs/2310.08560
- https://arxiv.org/abs/2309.02427
- https://arxiv.org/abs/2504.19413
- https://www.letta.com/blog/agent-memory
- https://docs.letta.com/advanced/memory-management/
- https://atlan.com/know/episodic-memory-ai-agents/
- https://towardsdatascience.com/a-practical-guide-to-memory-for-autonomous-llm-agents/
- https://arxiv.org/abs/2512.13564
- https://arxiv.org/html/2602.11510v1
- https://www.giskard.ai/knowledge/cross-session-leak-when-your-ai-assistant-becomes-a-data-breach
- https://aiagentmemory.org/articles/llm-hierarchical-memory/
- https://mem0.ai/blog/memory-in-agents-what-why-and-how
