Agent Memory Garbage Collection: Engineering Strategic Forgetting at Scale

· 10 min read
Tian Pan
Software Engineer

Every production agent team eventually builds the same thing: a memory store that grows without bound, retrieval that degrades silently, and a frantic sprint to add forgetting after users report that the agent is referencing their old job, a deprecated API, or a project that was cancelled three months ago. The industry has poured enormous effort into giving agents memory. The harder engineering problem — garbage collecting that memory — is where the real production reliability lives.

The parallel to software garbage collection is more than metaphorical. Agent memory systems face the same fundamental tension: you need to reclaim resources (context budget, retrieval relevance) without destroying data that's still reachable (semantically relevant to future queries). The algorithms that solve this look surprisingly similar to the ones your runtime already uses.

Why Memory Accumulation Is the Default

Agent memory systems almost universally follow an append-only pattern: detect something important, embed it, store it, retrieve it later. This works beautifully for the first hundred interactions. By the thousandth, the failure mode emerges — not as a crash, but as a quiet degradation in response quality.

The numbers are stark. Agents using an "add-all" memory strategy accumulated over 2,400 records while their accuracy on medical reasoning tasks dropped to 13%. The same agents with active memory management maintained 248 records and achieved 39% accuracy. Storing less produced a 3x performance improvement.

This happens because LLMs exhibit an experience-following property: they replicate the style and quality of whatever context they receive. When your memory store is polluted with stale, contradictory, or low-quality entries, the agent faithfully reproduces those characteristics. The memory system doesn't have a quality problem — it has a garbage collection problem.

The GC Algorithms That Apply

Borrowing from runtime garbage collection gives us a useful taxonomy for agent memory management strategies. Each addresses a different failure mode, and production systems typically need all of them running concurrently.

Generational Collection: Time-Based Decay Tiers

The most effective production pattern mirrors generational GC. Memories are born in a "young generation" with aggressive decay, and only promote to longer-lived tiers if they prove useful.

The implementation uses a Weibull decay function: w(Δτ) = exp(-(Δτ/η)^κ), where elapsed time since last retrieval determines a memory's relevance score. Memories falling below a freshness threshold are pruned before they reach the agent's context window. But the key insight is that not all memories should decay at the same rate.

Production systems assign different time-to-live values based on semantic category:

  • Immutable facts (user identity, system constraints): infinite TTL, never collected
  • Procedural knowledge (how to use a specific tool, workflow patterns): long TTL, weeks to months
  • Preference information (communication style, formatting choices): medium TTL, days to weeks
  • Transient context (current task state, in-progress debugging): short TTL, hours to days

This mirrors how human memory works — the Ebbinghaus forgetting curve shows exponential decay for most information, with meaningful content persisting through rehearsal. Agent memory systems that implement access-frequency reinforcement, boosting a memory's relevance score each time it's successfully retrieved and used, create the same effect as spaced repetition. Useful memories keep refreshing their own TTL.

Mark-and-Sweep: Semantic Deduplication

The most overlooked GC operation in agent memory is deduplication. As agents interact over time, they accumulate near-identical memories: "User prefers Python" stored alongside "User's primary language is Python" and "When coding, use Python." These aren't identical strings, so hash-based dedup misses them. But they consume three retrieval slots to deliver one piece of information.

Semantic deduplication operates as a mark-and-sweep pass over the memory store. The mark phase computes pairwise semantic similarity within topic clusters. The sweep phase merges memories above a similarity threshold, preserving the most complete version and discarding redundant fragments.
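A minimal sketch of that pass, assuming embeddings are precomputed and clustering has already grouped memories by topic. The 0.9 threshold and the "longest text wins" survivor rule are illustrative choices:

```python
from itertools import combinations

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def dedup_cluster(memories: list[tuple[str, list[float]]],
                  threshold: float = 0.9) -> list[tuple[str, list[float]]]:
    """Mark-and-sweep over one topic cluster: mark pairs above the
    similarity threshold into duplicate groups, sweep by keeping the
    most complete (here: longest) member of each group."""
    # Mark: union near-duplicate pairs into groups (union-find).
    parent = list(range(len(memories)))
    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(memories)), 2):
        if cosine(memories[i][1], memories[j][1]) >= threshold:
            parent[find(i)] = find(j)
    # Sweep: one survivor per group.
    survivors: dict[int, tuple[str, list[float]]] = {}
    for idx, mem in enumerate(memories):
        root = find(idx)
        if root not in survivors or len(mem[0]) > len(survivors[root][0]):
            survivors[root] = mem
    return list(survivors.values())
```

A production version would merge rather than discard, folding unique details from the losers into the survivor before deleting them.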

Production implementations run this as a background job, not inline with writes. The cost matters: one system reported achieving 51 tokens from a 7,327-token baseline through aggressive subatom extraction and semantic dedup — a 99.3% compression ratio. Even modest deduplication, operating on obvious near-duplicates, typically reduces memory store size by 30-40% without any information loss.

The organizational pattern matters too. In multi-agent systems, deduplication needs to operate across agent boundaries — different agents observing the same event will generate overlapping memories. Coordinated forgetting protocols identify and discard noise while preserving team-critical information that no single agent's dedup pass would surface.

Reference Counting: Contradiction Detection

The subtlest GC problem is contradiction. When a user changes jobs, the old employer memory doesn't become irrelevant — it becomes actively harmful. A highly-retrieved memory about a user's employer is highly relevant until the moment it isn't, at which point it becomes confidently wrong rather than just outdated.

Truth maintenance systems address this by running a consistency check before consolidating any new memory. The check extracts subject-predicate-object triples and tests for four types of contradiction:

  • Negation: "User does not use AWS" vs. stored "User uses AWS"
  • Temporal supersession: "User works at Company B" supersedes "User works at Company A"
  • Value conflicts: "Project deadline is March 15" vs. stored "Project deadline is February 28"
  • Antonym detection: "User prefers verbose output" vs. stored "User prefers concise output"

When a contradiction is detected, the system must decide: update the existing memory, or flag for human review? The safest production pattern rejects updates that contradict core facts (ΔM ∧ M_core ⊧ ⊥) and routes them through a reconciliation layer. This prevents both legitimate updates from being lost and adversarial memory poisoning from corrupting the store.
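The negation and value-conflict cases reduce to simple checks over stored triples. A sketch under the assumption that each fact is a (subject, predicate, object, affirmed) tuple and that `CORE_PREDICATES` marks the facts that must route to reconciliation; temporal supersession and antonym detection need richer reasoning than this:

```python
CORE_PREDICATES = {"identity", "employer"}  # illustrative core-fact set

def check_contradiction(new: tuple[str, str, str, bool],
                        store: list[tuple[str, str, str, bool]]) -> str:
    """Returns 'consistent', 'supersede' (safe to update), or 'review'
    (contradicts a core fact; route to the reconciliation layer)."""
    subj, pred, obj, affirmed = new
    for s, p, o, a in store:
        if s != subj or p != pred:
            continue
        negation = (o == obj and a != affirmed)          # "does not use AWS"
        value_conflict = (o != obj and a and affirmed)   # two different values
        if negation or value_conflict:
            return "review" if pred in CORE_PREDICATES else "supersede"
    return "consistent"
```

Routing core-fact conflicts to review rather than auto-updating is what blocks both accidental data loss and adversarial memory poisoning.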

The hard open problem is distinguishing genuine updates from contextual variation. "I hate Python" said during a frustrating debugging session doesn't actually contradict "User prefers Python." Detecting whether a statement reflects a permanent state change or a transient emotional expression requires reasoning that current systems handle poorly.

The Compression-vs-Retrieval Tradeoff

When an agent's memory exceeds its context budget, you face a fundamental architectural choice: compress the memories to fit, or retrieve selectively and accept incomplete context.

Full-context approaches — stuffing everything into the prompt — achieve the highest accuracy on benchmarks. But they're categorically unusable in production. At 100K tokens per request, each call costs around $0.50. Scale to 10,000 agent interactions per day and you're spending $5,000 daily on context alone. Worse, models exhibit "lost in the middle" degradation where information buried in long contexts is effectively invisible.

Selective retrieval — embedding queries, searching the memory store, returning the top-K results — trades a small accuracy penalty for massive cost and latency savings. Research comparing these approaches found that selective memory pipelines accept only a 6-percentage-point accuracy loss versus full context, in exchange for 91% lower p95 latency and 90% fewer tokens.

Adaptive compression represents the middle ground. Observational memory systems run background agents that continuously compress agent observations, achieving 26-54% peak token reduction while preserving 95%+ task accuracy. The key distinction is that compression happens asynchronously — it doesn't block user interactions — and it's lossy by design. The system decides what information density justifies context budget.

The production decision framework:

  • Under 10K memories, latency-insensitive: full context may work, monitor for lost-in-the-middle effects
  • 10K-100K memories, real-time serving: selective retrieval with semantic dedup, target top-20 retrievals
  • 100K+ memories or multi-agent: tiered architecture with compression, retrieval, and an eviction policy
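The framework above collapses into a small routing function; the thresholds come straight from the tiers in the text:

```python
def choose_memory_architecture(n_memories: int, realtime: bool,
                               multi_agent: bool) -> str:
    """Map store size and serving constraints to an architecture tier."""
    if multi_agent or n_memories >= 100_000:
        return "tiered: compression + retrieval + eviction policy"
    if n_memories >= 10_000 or realtime:
        return "selective retrieval + semantic dedup, top-20"
    return "full context, monitor lost-in-the-middle"
```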

Building the GC Pipeline

A production-ready memory GC system runs four processes concurrently, each on its own schedule.

Write-time filtering is the first line of defense. Before a memory enters the store, evaluate whether it adds information not already represented, meets a minimum quality threshold, and doesn't contradict core facts without resolution. Research shows that strict write-time filtering alone produces a 10% absolute performance gain over naive memory growth, even without any deletion mechanism.
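A toy version of that gate, using `difflib.SequenceMatcher` as a stand-in for embedding similarity; the length and novelty thresholds are illustrative, and a real gate would also run the contradiction check before admitting:

```python
from difflib import SequenceMatcher

def admit(candidate: str, store: list[str],
          min_len: int = 8, novelty_threshold: float = 0.85) -> bool:
    """Write-time gate: reject memories that are too short to carry
    information or too similar to something already stored."""
    if len(candidate.strip()) < min_len:
        return False  # fails the minimum quality threshold
    for existing in store:
        sim = SequenceMatcher(None, candidate.lower(), existing.lower()).ratio()
        if sim >= novelty_threshold:
            return False  # adds no information not already represented
    return True
```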

Background deduplication runs on a schedule — hourly for high-volume agents, daily for lower-volume ones. It clusters memories by topic, computes intra-cluster similarity, and merges near-duplicates. This is computationally expensive (O(n²) pairwise similarity within clusters) but tolerates batching because dedup is eventually consistent.

Decay sweeps run continuously, decrementing relevance scores based on time since last access and semantic category. Memories below threshold are soft-deleted first (excluded from retrieval but not destroyed) and hard-deleted after a grace period. The grace period matters — it's your safety net against over-aggressive collection.

Consistency audits run periodically against the entire store, checking for contradictions that write-time filtering missed. These catch the hard cases: gradual semantic drift where no single update triggers a contradiction, but the cumulative effect makes the memory store internally inconsistent.

The Metrics That Tell You Your GC Is Working

Most teams instrument memory systems with storage metrics: count, size, write rate. These are necessary but insufficient. The metrics that actually tell you whether your GC is working are retrieval-quality metrics.

Retrieval precision at K measures what fraction of the top-K retrieved memories are actually relevant to the current query. If this number drops over time while your store grows, your GC is too passive.

Stale retrieval rate tracks how often retrieved memories contain demonstrably outdated information. This requires ground-truth labeling, which is expensive, but even sampling 1% of retrievals and manually reviewing gives you a leading indicator before users notice.

Context utilization ratio measures what percentage of the context window filled by retrieved memories actually influences the agent's response. Low utilization means you're paying for context that the model ignores — a signal that your memories are too verbose or too numerous.

Contradiction rate counts how often retrieved memories contain mutually contradictory facts. Any nonzero rate here is a direct input to agent unreliability. If you're seeing contradictions in retrieval results, your truth maintenance system has gaps.
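Two of these metrics are cheap to compute online. A sketch, assuming memories are tracked by id, relevance labels come from sampling, and contradictory pairs come from the truth maintenance system:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved memory ids that are relevant."""
    top = retrieved[:k]
    return sum(1 for m in top if m in relevant) / len(top) if top else 0.0

def contradiction_rate(retrievals: list[list[str]],
                       contradicting_pairs: set[frozenset[str]]) -> float:
    """Fraction of retrieval results containing at least one known
    mutually contradictory pair of memory ids."""
    def has_conflict(result: list[str]) -> bool:
        ids = set(result)
        return any(pair <= ids for pair in contradicting_pairs)
    return (sum(1 for r in retrievals if has_conflict(r)) / len(retrievals)
            if retrievals else 0.0)
```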

Why Memory Management Beats Memory Storage

The agent memory ecosystem in 2026 reveals an instructive pattern. Five major memory frameworks accumulated over 80,000 combined GitHub stars in Q1 alone, each solving the problem differently — verbatim storage, filesystem abstractions, knowledge graphs, single-binary simplicity, multimodal lifelong memory. They fundamentally disagree on storage mechanism, decision-making authority, and specialization.

What they agree on is that storage is the solved problem. Every framework can persist memories reliably. The unsolved problem — and the one that determines whether your agent actually improves over time or gradually degrades — is memory management. Specifically: what to forget, when to forget it, and how to verify that forgetting didn't break anything.

The teams that ship reliable long-running agents are not the ones with the most sophisticated embedding models or the fastest vector databases. They're the ones that treat memory as a managed resource with a lifecycle — creation, promotion, consolidation, deprecation, and deletion. In other words, they're the ones that built a garbage collector.

The parallel to software engineering is exact. We spent decades building faster allocators before we realized that automatic memory management — having the runtime decide what to free — was the unlock that made complex software possible. Agent memory is at the same inflection point. The frameworks that win won't be the ones that store the most. They'll be the ones that forget the best.
