
Context Bloat: The AI Memory Leak You Cannot Grep For

Tian Pan · Software Engineer · 12 min read

A long-running agent session that opened with a 2K context is now paying for 40K tokens of mostly-dead state. The retrieval results from turn three, the directory listing the agent already navigated past, the JSON dump from a tool call whose answer was a single integer — all of it is still riding shotgun on every subsequent inference call, billed in full, dragging on attention. The pattern is structurally identical to a memory leak: unbounded growth of unreferenced data. But no profiler will surface it, because the leak does not live in process memory. It lives inside the conversation history, and most agent frameworks ship without a collector.

The cost shows up in two places at once. The token bill grows quadratically — a 20-step loop where each step contributes 1,000 tokens produces roughly 210,000 cumulative input tokens, not 20,000, because every prior turn is rebilled on every subsequent call. And the model itself starts to degrade: by 50K tokens of accumulated noise, even a model with a 1M-token window has already lost double-digit points of accuracy on the actual task. You are paying more, to think worse, about a problem the model was already past three turns ago.

This post is about treating that history like a heap: how to do per-turn token attribution that distinguishes load-bearing context from accumulated debris, when to run reachability analysis on tool outputs, what a "context GC" pass actually looks like in production, and why the architectural fix is realizing that conversation history is mutable state — not an audit log.

The Leak Is Real, And It Costs Twice

The first cost is dollars. Transformer inference is stateless: every call ships the entire context, every prior message, every tool result, the full system prompt. If a 4KB JSON blob landed in turn three, it occupies tokens on every subsequent call, at full input price. Practitioners who model each turn's cost in isolation consistently underestimate multi-step workflow cost by 3x to 5x once context accumulation is properly accounted for. The triangular series n(n+1)/2 is the right mental model, not n.
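To make the triangular series concrete, here is a minimal sketch in plain Python of the cumulative input tokens for an n-step loop in which every call re-sends the full accumulated history:

```python
# Minimal sketch: cumulative input tokens for an n-step agent loop where
# every call re-sends the full history (stateless transformer inference).
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    # Call k carries k steps' worth of history, so the total is the
    # triangular series: tokens_per_step * n(n+1)/2, not tokens_per_step * n.
    return tokens_per_step * steps * (steps + 1) // 2

print(cumulative_input_tokens(20, 1_000))  # 210,000 -- not 20,000
```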

The second cost is quality. Chroma's 2025 study tested 18 frontier models across input lengths and found that every one — GPT-4.1, Claude Opus 4, Gemini 2.5 — degrades as input grows, regardless of advertised window size. The classic lost-in-the-middle effect produces 30%+ accuracy drops on information buried in the body of a long context. A single distractor measurably reduces baseline performance; four distractors compound. A 1M-token window still rots at 50K tokens. The number on the model card is not the number you actually get.

The structural insight is that those two costs are coupled. The cheapest token is one you never sent; the cleanest reasoning is on context the model can actually attend to. Pruning is not a cost optimization that trades off against quality — it is a quality optimization that happens to also reduce cost. Teams that frame it as a cost project end up under-investing because the dollar savings on a single product feature rarely justify dedicated engineering. Teams that frame it as a reliability project end up doing the work because the alternative is an agent that gets dumber with every turn.

Why You Cannot Grep For It

Traditional memory leaks are findable because the data sits in a process you can attach a profiler to. You can dump the heap, sort by retained size, and the leaked object stares back at you. Conversation-history leaks have none of those affordances. The "object" is a turn in a JSON array on the server you are talking to. There is no profiler, no GC root set, no retained-size column. The agent framework hands you a messages: [...] array and your job is to keep appending to it.

Worse, the leak is silent in the metrics most teams already track. Latency goes up gradually, but each individual turn still completes. Cost shows up on the monthly bill as a fixed multiplier on traffic, which looks like growth, not waste. The first signal is usually a quality regression — the agent forgets a constraint it was given on turn one, ignores an instruction the user repeated, or hallucinates from retrieval results that contradict the user's stated preference. By the time someone correlates that to context length, the team has been overpaying for months.

The diagnostic move that breaks the silence is per-turn token attribution. Instrument the conversation builder to tag every message with its source: system_prompt, user_turn, tool_result:<tool_name>, agent_reasoning, retrieval:<source>. Then on every inference call, log the token count by tag. Within a week of running this in production, you will see a histogram that looks suspiciously like a memory leak graph — one or two tag categories grow monotonically, while the user_turn category stays roughly constant. That is your leak.
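A minimal sketch of that instrumentation, assuming a simple message schema where a source tag is attached to each message as it is appended; the field names and the stand-in token counter are placeholders for whatever your framework and tokenizer actually use:

```python
from collections import defaultdict

def count_tokens(text: str) -> int:
    # Placeholder heuristic (~4 characters per token); swap in the real
    # tokenizer for your model before trusting the absolute numbers.
    return max(1, len(text) // 4)

def log_context_usage(history: list[dict]) -> dict[str, int]:
    # Each message carries a source tag assigned when it was appended:
    # "system_prompt", "user_turn", "tool_result:list_files",
    # "retrieval:docs", "agent_reasoning", ...
    usage: dict[str, int] = defaultdict(int)
    for msg in history:
        usage[msg["tag"]] += count_tokens(msg["content"])
    # Emit this on every inference call; the tags whose totals grow
    # monotonically while user_turn stays flat are the leak.
    print(dict(sorted(usage.items(), key=lambda kv: -kv[1])))
    return usage
```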

The categories that bloat are predictable. Tool results from broad operations — list_files, search_codebase, read_documentation — are worst, because they return more than the agent needs and the agent only used a slice. Retrieval results bloat next, because vector search returns top-k regardless of whether top-1 was sufficient. Agent reasoning traces bloat third, because chain-of-thought from earlier steps almost never informs later ones. The system prompt itself rarely bloats but is often badly factored — every conditional instruction is paid for on every call whether or not the condition fires.

Reachability Analysis For Tool Outputs

Once you can see the leak, the next question is which entries are dead. The garbage-collection analogy is exact: a tool result is "live" if it is still informing decisions, "dead" if it is not. The trick is that liveness in conversation history is not statically determinable — it is a property of how the agent is using the result, not what the result contains.

The crude version of reachability is recency: drop anything older than N turns. This works for chat assistants where the user's recent message is almost always the most relevant signal, but it fails for agents that have to remember a constraint stated at the start of the session. The fix is a recency policy with pinned exceptions: messages explicitly tagged as constraints (the system prompt, user-stated preferences, task definitions) are never eligible for collection regardless of age.
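A sketch of that policy, assuming each message carries a turn index alongside its source tag; the pinned tag names are illustrative:

```python
# Recency collection with pinned exceptions: constraints survive regardless
# of age, everything else older than max_age_turns is dropped.
PINNED_TAGS = {"system_prompt", "user_constraint", "task_definition"}

def collect_by_recency(history: list[dict], current_turn: int,
                       max_age_turns: int = 5) -> list[dict]:
    kept = []
    for msg in history:
        pinned = msg["tag"] in PINNED_TAGS
        fresh = current_turn - msg["turn"] <= max_age_turns
        if pinned or fresh:
            kept.append(msg)
    return kept
```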

The better version of reachability is reference counting. When the agent's next turn references a prior tool result — by quoting it, by reasoning about it, by passing its content to another tool — increment a counter on that turn. After M turns with no references, the entry is collectible. This requires the agent framework to track citations, which most do not, so most teams approximate by checking whether chunks of the tool result text reappear verbatim in any subsequent assistant turn. The check is noisy: verbatim matching misses paraphrased references and over-counts incidental overlaps. In practice it is still accurate enough to safely drop the long tail of one-shot tool results that the agent looked at once and moved past.
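A sketch of that approximation under the same assumed message schema; the overlap heuristic and thresholds are illustrative, not a recommendation:

```python
def is_referenced(tool_result: str, later_assistant_turns: list[str]) -> bool:
    # Crude liveness check: any non-trivial line of the tool result that is
    # quoted verbatim in a later assistant turn counts as a reference.
    # Paraphrased references are missed, so this over-collects slightly.
    lines = [ln.strip() for ln in tool_result.splitlines() if len(ln.strip()) > 20]
    later_text = "\n".join(later_assistant_turns)
    return any(ln in later_text for ln in lines)

def collect_unreferenced(history: list[dict], current_turn: int,
                         max_idle_turns: int = 3) -> list[dict]:
    kept = []
    for msg in history:
        is_tool = msg["tag"].startswith("tool_result")
        stale = current_turn - msg["turn"] > max_idle_turns
        if is_tool and stale:
            later = [m["content"] for m in history
                     if m["role"] == "assistant" and m["turn"] > msg["turn"]]
            if not is_referenced(msg["content"], later):
                continue  # old and never quoted downstream: collectible
        kept.append(msg)
    return kept
```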
