Compaction Traps: Why Long-Running Agents Forget What They Already Tried

· 9 min read
Tian Pan
Software Engineer

An agent calls a file-writing tool. The tool fails with a permission error. The agent records this, moves on to a different approach, and eventually runs long enough that the runtime triggers context compaction. The summary reads: "the agent has been working on writing output files." What it drops: that the permission error ever happened, and why the original approach was abandoned. Three hundred tokens later, the agent tries the same write again.

This pattern — call it the compaction trap — is one of the most persistent reliability failures in production agent systems. It's not a model bug. It's an architecture mismatch between how compaction works and what agents actually need to stay coherent across long sessions.

What Compaction Actually Discards

Context compaction exists because agent tasks routinely outlive context windows. A code review agent, a multi-step data pipeline, or an autonomous research task can generate hundreds of tool calls before completing. Something has to give.

The dominant approaches are summarization (an LLM compresses older turns), truncation (older messages are dropped), or a combination. These work well for conversational state — preserving goals, recent progress, and general direction. They fail at a specific category of information: negative constraints and failure records.

When a tool call fails, the agent needs to remember three things: that it failed, what specifically went wrong, and what that means for future actions. A summary might capture "the agent encountered a tool error" but reliably loses the causal chain — which constraint caused the error, why a reattempt would produce the same result, and which paths are now closed. These "why-not records" are the difference between an agent that makes forward progress and one that cycles.

A benchmark evaluating context-managed agents on code-centric tasks found that agents operating under compacted context scored around 4.0 out of 10, versus 6.0 for agents with intact context. The gap wasn't from losing high-level goal awareness — it was from re-reading files they'd already processed and re-attempting operations that had already failed.

The Timing Problem Makes It Worse

There's an additional structural issue: compaction typically only runs between turns, not mid-turn. An agent starting a turn with 170,000 tokens of context can expand to 400,000 or more through successive tool calls, all without a compaction pass. By the time the turn ends and compaction runs, the context is so large that the summarization covers enormous amounts of history in a single pass — and lossy compression at scale loses more than lossy compression applied incrementally.

This also means there's no reliable point at which an agent can detect "I'm about to lose state — let me externalize it first." The compaction happens after the fact, on content the agent is no longer actively managing.

The lost-in-the-middle effect compounds this further. When context is summarized or truncated, remaining content gets repositioned — content that was near the end ends up in the middle of the new context. LLMs have well-documented U-shaped attention patterns, with better recall for content near the beginning and end. State that was fresh and salient before compaction becomes middle content with degraded retrieval accuracy afterward, even if it technically survived the compression.

Three Failure Modes Engineers Observe in Practice

Infinite retry loops. The agent calls a tool, receives an error or ambiguous result, and calls it again. Without a global representation of attempt count or a record of the error condition, each local context state looks like "I should try this." The loop continues until the context fills, the process is killed, or a human intervenes. This is structurally similar to thrashing in virtual memory systems — repeated attempts to access a resource without the metadata needed to recognize that the attempts are futile.

Re-doing already-completed work. An agent tracks which files it has modified, which API calls it has made, and which sub-tasks are complete. After compaction, the summary preserves the goal ("update these five files") but drops the completion records. The agent re-reads files it already processed, re-runs transformations it already applied. Beyond wasted compute, this creates correctness problems: applied-twice transformations, double-written records, duplicate notifications.

Constraint amnesia. An agent discovers mid-task that a certain approach is blocked — a rate limit, a missing permission, a schema constraint that rules out a class of queries. This constraint was learned through exploration, not available in the original system prompt. After compaction, the constraint disappears. The agent reverts to treating the blocked approach as viable, and the exploration cost is paid again.

What Actually Helps

The core insight is that context window is not a reliable memory system for agentic state — it's a computation surface. Anything that needs to survive long enough to prevent repeated work needs to live outside the context window.

Externalize critical state to a persistent store. Tool call history, retry counts, discovered constraints, and operation status belong in a key-value store (Redis, DynamoDB) or structured log that the agent queries at the start of each turn. "Have I already tried this?" becomes a lookup, not a recall task. The agent doesn't need to remember what it tried; it needs to know how to ask a system that never forgets.

The pattern is: before calling any destructive or non-idempotent tool, the agent checks an external log for prior attempts on that operation key. After each tool call, it writes the outcome. Compaction can do whatever it wants to the context window — the prior-attempt record persists.
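A minimal sketch of that pattern, using an in-memory dict as a stand-in for the external store (the `AttemptLedger` and `guarded_call` names are illustrative, not from any particular framework):

```python
import json
import time

class AttemptLedger:
    """External record of tool attempts. Backed by a plain dict here;
    in production this would be Redis, DynamoDB, or a structured log."""
    def __init__(self):
        self._store = {}

    def _key(self, tool, args):
        # Stable operation key: tool name plus canonicalized arguments
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def prior_attempts(self, tool, args):
        return self._store.get(self._key(tool, args), [])

    def record(self, tool, args, outcome, error=None):
        self._store.setdefault(self._key(tool, args), []).append(
            {"ts": time.time(), "outcome": outcome, "error": error}
        )

ledger = AttemptLedger()

def guarded_call(tool_name, args, tool_fn, max_failures=2):
    """Check the ledger before executing; record the outcome after."""
    failures = [a for a in ledger.prior_attempts(tool_name, args)
                if a["outcome"] == "failed"]
    if len(failures) >= max_failures:
        return {"skipped": True,
                "reason": f"{len(failures)} prior failures: {failures[-1]['error']}"}
    try:
        result = tool_fn(**args)
        ledger.record(tool_name, args, "ok")
        return {"skipped": False, "result": result}
    except Exception as e:
        ledger.record(tool_name, args, "failed", error=str(e))
        return {"skipped": False, "error": str(e)}
```

Note that the ledger survives compaction by construction: the skip decision never depends on anything in the context window.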

Use sticky context slots for invariant constraints. Some agent runtimes support a small block of context (~500 tokens) that survives every compaction pass and is re-injected on each turn. This is the right place for system-level constraints: "never retry operation X," "halt if Y condition is detected," "these files must not be modified." The goal is to make invariants literally incompressible rather than relying on summarization to preserve them.
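If the runtime doesn't offer this natively, the same effect can be approximated in the turn-assembly layer. A hypothetical sketch (the slot content and function name are made up for illustration):

```python
# A sticky block the assembly layer prepends verbatim on every turn,
# bypassing compaction entirely. Contents here are example invariants.
STICKY_SLOT = "\n".join([
    "## Invariants (survive compaction; do not summarize)",
    "- Never retry writes to /var/data/out.csv: permission denied.",
    "- Halt if the billing API returns HTTP 429 twice in a row.",
    "- Files under /etc/ must not be modified.",
])

def build_turn_messages(compacted_history, recent_messages):
    """Assemble model input for a turn. Compaction may have rewritten
    compacted_history; the sticky block is prepended verbatim regardless."""
    return (
        [{"role": "system", "content": STICKY_SLOT}]
        + list(compacted_history)
        + list(recent_messages)
    )
```

Because the block is injected from outside the context window on every turn, no summarization pass can drop it.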

Checkpoint behavioral state at task boundaries. Every time the agent completes a meaningful sub-task — processed a batch, finished a research phase, completed a file group — write a structured checkpoint: goals achieved, steps completed, constraints discovered, what remains. If post-compaction drift is detected (the agent starts re-doing work it checkpointed), restore from the last known-good checkpoint by re-injecting the full checkpoint block.

This is distinct from the LLM-generated summary that compaction produces. Checkpoint data is structured and agent-controlled; it contains exactly the fields the agent needs to resume correctly. The compaction summary is general-purpose text that may or may not capture operationally relevant state.
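A minimal checkpoint sketch, assuming a local JSON file as the store (field names are illustrative):

```python
import json
from pathlib import Path

def write_checkpoint(path, *, steps_done, constraints, remaining):
    """Persist a structured checkpoint at a sub-task boundary."""
    Path(path).write_text(json.dumps({
        "steps_done": steps_done,
        "constraints": constraints,
        "remaining": remaining,
    }, indent=2))

def checkpoint_block(path):
    """Render the last checkpoint as a context block, for re-injection
    when post-compaction drift is detected."""
    ckpt = json.loads(Path(path).read_text())
    lines = ["## Restored checkpoint (authoritative over any summary)"]
    lines += [f"- DONE: {s}" for s in ckpt["steps_done"]]
    lines += [f"- CONSTRAINT: {c}" for c in ckpt["constraints"]]
    lines += [f"- TODO: {r}" for r in ckpt["remaining"]]
    return "\n".join(lines)
```

The rendered block is deliberately structured (DONE / CONSTRAINT / TODO) so the agent can resume from exact fields rather than from free-text summary prose.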

Enforce step count limits as a hard infrastructure control. This is the simplest and most reliable defense against infinite loops: a hard cap on total actions before the run terminates. The right number depends on the task, but having no cap is the wrong answer. Agents don't self-terminate loops based on recognizing they're stuck; they reason locally from each context state and conclude another step is warranted. A step limit forces a circuit breaker that doesn't depend on agent self-awareness.

Pair this with loop detection via operation fingerprinting: hash (tool_name + input + recent_output) and halt if the same fingerprint appears three times. This catches non-obvious loops where the agent is making slightly different but equally futile attempts.
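Both controls fit in a few lines. A sketch combining the hard step cap with fingerprint-based loop detection (the `RunGuard` class and thresholds are illustrative defaults, not from any framework):

```python
import hashlib

class RunGuard:
    """Hard step cap plus loop detection by operation fingerprint."""
    def __init__(self, max_steps=50, max_repeats=3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = {}

    def check(self, tool_name, tool_input, recent_output):
        """Call once per tool invocation; returns a halt reason or None."""
        self.steps += 1
        if self.steps > self.max_steps:
            return f"step limit {self.max_steps} exceeded"
        # Fingerprint on tool + input + a prefix of the output, so
        # "same call, same result" repeats are caught even when the
        # agent's surrounding reasoning text differs each time.
        blob = f"{tool_name}|{tool_input}|{str(recent_output)[:200]}"
        fp = hashlib.sha256(blob.encode()).hexdigest()
        self.seen[fp] = self.seen.get(fp, 0) + 1
        if self.seen[fp] >= self.max_repeats:
            return f"loop detected: fingerprint repeated {self.seen[fp]} times"
        return None
```

The guard lives in the orchestration loop, outside the model's control, so it works even when every individual context state looks like progress to the agent.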

Design tool outputs to be compact and reference-based. If a tool returns 50,000 tokens of log output and that output gets included in context, it dominates the window and creates enormous compaction pressure. Store large outputs in external state and return a short reference ("log_ref_abc123: 50k lines, key findings: X, Y, Z"). The agent can retrieve the full output if needed; the context window sees the summary. This reduces the rate at which context fills and reduces the pressure for aggressive compaction.
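The wrapping layer can be a thin shim between tool execution and context assembly. A sketch with a dict standing in for the blob store (names and the 4,000-character threshold are illustrative):

```python
import uuid

BLOB_STORE = {}  # stand-in for S3 / Redis; illustrative only

def wrap_large_output(raw, summarize, threshold=4000):
    """Pass small outputs through unchanged; store large ones externally
    and return a short reference plus a summary for the context window."""
    if len(raw) <= threshold:
        return raw
    ref = f"log_ref_{uuid.uuid4().hex[:8]}"
    BLOB_STORE[ref] = raw
    return f"{ref}: {len(raw)} chars stored externally. Summary: {summarize(raw)}"

def fetch_full_output(ref):
    """Explicit retrieval tool the agent can call when the summary
    isn't enough."""
    return BLOB_STORE[ref]
```

Exposing `fetch_full_output` as a tool keeps the full data reachable on demand while the context window only ever carries the reference line.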

Memory Hierarchies, Not Magic Models

The MemGPT/Letta architecture from UC Berkeley makes this concrete with an OS-inspired analogy: treat the context window as RAM and external storage as disk. The agent explicitly manages what moves between them. Core memory (in-context) holds the hot working set; archival memory (external database) holds everything else. When context fills, content is evicted with LRU-style policies and stored with summaries. When the agent needs evicted content, it retrieves it explicitly.
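The two-tier idea reduces to a very small core mechanism. A toy sketch of the analogy, not Letta's actual API (class and method names are invented here):

```python
from collections import OrderedDict

class TwoTierMemory:
    """Toy RAM/disk memory hierarchy: a bounded in-context working set
    with LRU eviction into an unbounded external archive."""
    def __init__(self, core_limit=4):
        self.core_limit = core_limit
        self.core = OrderedDict()   # "RAM": the in-context working set
        self.archive = {}           # "disk": external store

    def put(self, key, content):
        self.core[key] = content
        self.core.move_to_end(key)
        while len(self.core) > self.core_limit:
            old_key, old_content = self.core.popitem(last=False)  # LRU evict
            self.archive[old_key] = old_content

    def recall(self, key):
        if key in self.core:
            self.core.move_to_end(key)   # touching keeps it hot
            return self.core[key]
        return self.archive.get(key)     # explicit retrieval from archive
```

The essential property is that eviction moves content rather than destroying it; compaction-by-summarization, by contrast, is eviction with no archive.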

This is more infrastructure than most teams want to build from scratch, but it points at the right mental model: the context window is managed cache, not the source of truth. Production agents need the same separation between working state and persistent state that any other stateful system requires.

Frameworks are converging on similar patterns. LangGraph's checkpointing layer handles per-turn state persistence. LangMem adds semantic retrieval for long-term memory across sessions. Google's ADK and Anthropic's managed agent infrastructure both include compaction management as first-class concerns. The tool support exists; the gap is usually engineers treating context as durable storage when it isn't.

The Infrastructure Argument

One framing that helps with organizational buy-in: context compaction failures are not model quality problems. Improving the base model does not fix an agent that retries failed operations after compaction drops the failure record. The fix requires external state, not better summarization.

This matters because "fix the model" or "improve the prompt" are common first responses to agent reliability failures. They are the wrong intervention for compaction-class problems. An agent that loses why-not records during compaction will keep losing them regardless of model capability, because the loss is structural. The information is genuinely gone.

The teams that shipped reliable long-running agents in 2025 overwhelmingly converged on the same set of infrastructure decisions: hard step limits, external state for tool call history, sticky context for invariant constraints, and structured checkpoints at task boundaries. These are engineering decisions, not AI decisions, and they belong in the same architectural review as any other stateful distributed system.

The agents that failed in production — and there were notable ones — typically failed because someone assumed the context window would take care of state. It doesn't. It never did. It just took long enough tasks to make that visible.
