Ghost Context: How Contradictory Beliefs Break Long-Running Agent Memory
Your agent has talked to the same user 400 times. Six months ago she said she preferred Python. Three months ago her team migrated to Go. Last week she mentioned a new TypeScript project. All three facts are sitting in your vector store right now — semantically similar, chronologically unordered, equally weighted. The next time she asks for code help, your agent retrieves all three, hands a contradictory mess to the model, and confidently generates Python with Go idioms for a TypeScript context.
This is ghost context: stale beliefs that never die, retrieved alongside their replacements, silently corrupting agent reasoning.
The problem is underappreciated because it doesn't produce visible errors. The agent doesn't crash. It doesn't refuse to respond. It produces fluent, confident output that's just subtly, expensively wrong.
Why Memory Systems Are Built Wrong by Default
Most agent memory implementations are append-only logs: a new fact arrives, gets embedded, and is stored. Older facts stay indexed indefinitely. When retrieval happens, the vector search returns whatever is semantically close — which frequently includes both the current state of the world and everything the agent believed three months ago.
The core technical failure is that vector stores are temporally blind. A fact stored in January and a fact stored in April occupy the same flat namespace. Similarity search has no concept of which fact supersedes the other. The model sees "user prefers Python" and "user works in Go" with equal confidence and equal position in the retrieved context window, and it has to guess which one to trust.
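Here is the pattern in miniature. This is a toy Python sketch, not any particular vector store: `embed` and `similarity` stand in for a real embedding model and index, and the point is that nothing in the stored record says which fact is current.

```python
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    text: str
    embedding: set[str]   # toy "embedding": the token set, standing in for a dense vector

def embed(text: str) -> set[str]:
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b)   # Jaccard overlap as a stand-in for cosine similarity

store: list[MemoryRecord] = []

def write(text: str) -> None:
    store.append(MemoryRecord(text, embed(text)))   # append-only: old facts are never retired

def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(store, key=lambda r: similarity(q, r.embedding), reverse=True)
    return [r.text for r in ranked[:k]]

write("user prefers python for backend services")      # six months ago
write("user's team migrated backend services to go")   # three months ago
write("user started a new typescript project")         # last week

# All three come back with comparable scores; the retriever has no notion of supersession.
print(retrieve("which language does the user prefer for backend work?"))
```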
Empirical studies confirm the damage. LongMemEval (ICLR 2025), which benchmarks agents across 500 questions spanning multi-session chat histories, found that commercial assistants drop 30% in accuracy on tasks requiring knowledge updates — precisely the case where a user's preferences or facts have changed and the old record survives. The failure isn't retrieval failing to find information; it's retrieval finding too much information, including the stale version.
A related problem, called proactive interference, is even harder to fix. Research shows that as the number of contradictory updates in memory increases, retrieval accuracy declines log-linearly — even when the correct, most recent answer is positioned near the end of the context. Larger model parameter counts are the primary factor affecting robustness; context window length is not. You cannot window your way out of this.
The Four Failure Modes Ghost Context Produces
Understanding the specific ways contradictory memory degrades agent behavior helps prioritize what to fix first.
Update blindness. The agent acts on the old version of a fact because it was retrieved alongside the new one and the model weighted the older entry more heavily. This is particularly common with user preferences, environment configuration, and anything that changes gradually rather than explicitly. The user said "I moved to Go" once; Python appears in 15 older memories.
Self-reinforcing errors. An agent that incorrectly concludes "API endpoint X fails with parameter Y" adds that belief to memory. Future retrievals surface the belief, the agent avoids the code path, and no new evidence is ever collected to refute it. The error becomes load-bearing.
Over-generalization bleed. Lessons from one context contaminate reasoning in a different context. "Python is slow for ML inference" (directionally true for some workloads) gets retrieved when the user asks about Python for a network service, where it's irrelevant. Context similarity in embedding space does not imply contextual relevance to the current task.
Multi-turn compound failure. In multi-agent systems, the failure compounds. Agent A retrieves contradictory memory and produces a subtly wrong output. Agent B takes that output as input, retrieves more memory, and makes decisions based on a corrupted premise. By the time the error surfaces, it has passed through three reasoning steps and is very difficult to trace back to the original contradictory retrieval.
What Production Systems Do Differently
The systems that handle this well share a common insight: the problem is primarily in the write path, not the read path. Most teams invest in retrieval optimization — better embeddings, hybrid search, reranking. The contradiction problem is upstream of that.
Temporal ownership per fact. Production memory systems like Zep's Graphiti engine assign every stored fact a set of timestamps: when the event occurred in reality, when it was ingested, and when it was last updated. Graph edges carry valid-from and valid-until properties. When a user says "I moved to Shanghai," the system records the new fact and explicitly marks "I live in Beijing" as temporally superseded — it's still in the graph, but flagged as invalid. Retrieval respects these markers. The result in Zep's published benchmarks is an 18.5% accuracy improvement over vector-only baselines on knowledge-update tasks, with latency dropping 90% because the retrieval index is smaller and more coherent.
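The idea is simple to express in code. The sketch below is illustrative Python, not Graphiti's actual API: field names and the close-out logic are assumptions, but they show how asserting a new value supersedes the old one without deleting it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    valid_from: datetime                  # when the fact became true in the world
    ingested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    valid_until: datetime | None = None   # None means still believed to be current

class FactStore:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, predicate: str, value: str,
                    valid_from: datetime) -> Fact:
        # Close out any currently-valid fact for the same (subject, predicate).
        # It stays in the store for history, but retrieval will skip it.
        for old in self.facts:
            if (old.subject == subject and old.predicate == predicate
                    and old.valid_until is None):
                old.valid_until = valid_from
        new = Fact(subject, predicate, value, valid_from)
        self.facts.append(new)
        return new

    def current(self, subject: str, predicate: str) -> Fact | None:
        for f in self.facts:
            if f.subject == subject and f.predicate == predicate and f.valid_until is None:
                return f
        return None

store = FactStore()
store.assert_fact("user", "lives_in", "Beijing", datetime(2024, 1, 10, tzinfo=timezone.utc))
store.assert_fact("user", "lives_in", "Shanghai", datetime(2024, 6, 2, tzinfo=timezone.utc))
print(store.current("user", "lives_in").value)   # Shanghai; Beijing is retained but superseded
```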
Explicit contradiction detection at write time. Rather than appending every new memory, Mem0's graph-enhanced variant runs a conflict-detection step before committing a fact. The system decides whether the incoming fact should trigger an ADD (no semantic equivalent exists), an UPDATE (it complements an existing memory), a DELETE (it contradicts and replaces an existing memory), or a NOOP (it adds nothing new). This turns the memory write from a passive log append into an active curation step: the system self-edits rather than accumulating contradictions.
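A hedged sketch of that write-path decision follows. Mem0 makes this call with an LLM; the keyword heuristic below is only a stand-in so the control flow is visible, and the operation names mirror the ones described above rather than any library's API.

```python
from enum import Enum

class WriteOp(Enum):
    ADD = "add"        # no related memory exists
    UPDATE = "update"  # new fact complements an existing one
    DELETE = "delete"  # new fact contradicts and replaces an existing one
    NOOP = "noop"      # already known

def classify_write(new_fact: dict, existing: list[dict]) -> tuple[WriteOp, dict | None]:
    """Entries look like {"key": "preferred_language", "value": "go"}."""
    for old in existing:
        if old["key"] != new_fact["key"]:
            continue
        if old["value"] == new_fact["value"]:
            return WriteOp.NOOP, old
        return WriteOp.DELETE, old   # same key, different value: supersede the old fact
    # UPDATE (complementary information) needs semantic judgment an LLM would supply;
    # this toy heuristic never emits it.
    return WriteOp.ADD, None

def commit(new_fact: dict, memory: list[dict]) -> None:
    op, target = classify_write(new_fact, memory)
    if op is WriteOp.DELETE and target is not None:
        memory.remove(target)        # retire the contradicted fact before writing
    if op in (WriteOp.ADD, WriteOp.DELETE):
        memory.append(new_fact)
    # NOOP: write nothing; the store stays clean instead of accumulating duplicates.

memory: list[dict] = [{"key": "preferred_language", "value": "python"}]
commit({"key": "preferred_language", "value": "go"}, memory)
print(memory)   # [{'key': 'preferred_language', 'value': 'go'}]
```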
Differentiated temporal decay. Not all facts have the same freshness profile. Preferences change slowly; actions and events become irrelevant quickly. A-MAC, a framework for agentic memory evaluation, models this with type-specific decay rates: preferences decay at 0.5% daily (persisting for years), facts at 2% daily, actions at 20% daily. At retrieval time, recency weighting using exponential moving averages reduces the score of older memories relative to recent ones. This doesn't solve the structural problem — contradictory facts are still in the store — but it changes the retrieval probability distribution toward recency without removing older information that may still be genuinely relevant.
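Here is how type-specific decay changes scoring, as a sketch. The daily rates are the A-MAC figures quoted above; the multiplicative formula is an illustration of the idea, not necessarily the framework's exact formulation.

```python
import math
from datetime import datetime, timezone

DAILY_DECAY = {
    "preference": 0.005,  # persists for years
    "fact": 0.02,
    "action": 0.20,       # stale within days
}

def decayed_score(similarity: float, memory_type: str, stored_at: datetime,
                  now: datetime | None = None) -> float:
    now = now or datetime.now(timezone.utc)
    age_days = max((now - stored_at).total_seconds() / 86400.0, 0.0)
    rate = DAILY_DECAY.get(memory_type, 0.02)
    # Exponential decay: each day multiplies the retrieval weight by (1 - rate).
    return similarity * math.pow(1.0 - rate, age_days)

now = datetime(2024, 7, 1, tzinfo=timezone.utc)
old_pref = decayed_score(0.9, "preference", datetime(2024, 1, 1, tzinfo=timezone.utc), now)
old_action = decayed_score(0.9, "action", datetime(2024, 6, 1, tzinfo=timezone.utc), now)
print(f"6-month-old preference: {old_pref:.2f}, 1-month-old action: {old_action:.4f}")
```

A six-month-old preference keeps most of its weight while a month-old action is effectively silenced, which is exactly the asymmetry the type-specific rates are meant to encode.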
Dual-buffer memory promotion. Newly formed memories should not go directly into long-term storage with permanent retrieval eligibility. TiMem's approach places new memories in a "hot" buffer during a probation period, where they undergo re-verification, deduplication against existing memories, and importance scoring before promotion. This catches contradictions before they propagate and filters out low-quality extractions that would otherwise corrupt the long-term store.
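A minimal sketch of that two-stage write path, under assumed values: the 24-hour probation window, the importance threshold, and the exact-match dedup check are placeholders for illustration, not TiMem's published parameters.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

PROBATION = timedelta(hours=24)
IMPORTANCE_THRESHOLD = 0.5

@dataclass
class Candidate:
    text: str
    importance: float   # e.g. from an LLM or heuristic scorer
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

hot_buffer: list[Candidate] = []
long_term: list[str] = []

def write(text: str, importance: float) -> None:
    hot_buffer.append(Candidate(text, importance))   # new memories land in the hot buffer

def promote(now: datetime | None = None) -> None:
    now = now or datetime.now(timezone.utc)
    for cand in list(hot_buffer):
        if now - cand.created_at < PROBATION:
            continue                                 # still on probation
        hot_buffer.remove(cand)
        if cand.importance < IMPORTANCE_THRESHOLD:
            continue                                 # low-value extraction: drop it
        if cand.text in long_term:
            continue                                 # exact duplicate: drop it
        long_term.append(cand.text)                  # survived probation: promote

write("user mentioned they might try rust someday", importance=0.2)
write("user's team migrated backend services to go", importance=0.9)
promote(datetime.now(timezone.utc) + PROBATION)      # simulate the window elapsing
print(long_term)   # only the high-importance, non-duplicate memory graduates
```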
The Temporal Knowledge Graph as the Right Primitive
- https://arxiv.org/html/2501.13956v1
- https://arxiv.org/html/2410.10813v1
- https://arxiv.org/html/2603.07670v1
- https://arxiv.org/html/2504.19413v1
- https://arxiv.org/html/2601.07468v1
- https://arxiv.org/html/2601.02845
- https://blog.getzep.com/state-of-the-art-agent-memory/
- https://mem0.ai/blog/state-of-ai-agent-memory-2026
- https://towardsdatascience.com/a-practical-guide-to-memory-for-autonomous-llm-agents/
