Context Engineering: Memory, Compaction, and Tool Clearing for Production Agents

10 min read
Tian Pan
Software Engineer

Most production AI agent failures don't happen because the model ran out of context. They happen because the model drifted long before it hit the limit. Forrester has named "agent drift" the silent killer of AI-accelerated development, and its 2025 research shows that nearly 65% of enterprise AI failures trace back to context drift or memory loss during multi-step reasoning, not raw token exhaustion.

The distinction matters. A hard context limit is clean: the API rejects the request, the agent stops, you get an error you can handle. Context rot is insidious: the model keeps running, keeps generating output, but performance quietly degrades. GPT-4's accuracy drops from 98.1% to 64.1% based solely on where in the context window information is positioned. You don't get an error signal — you get subtly wrong answers.

This post covers the three primary tools for managing context in production agents — compaction, tool-result clearing, and external memory — along with the practical strategies for applying them before your agent drifts.

Why Context Management Is Not Just About Length

The naive mental model: context window = token budget. If you're under the limit, you're fine.

The reality: models have an effective context window that's considerably smaller than the advertised limit. Beyond ~128K tokens for most current models, attention starts to degrade — information technically present in the window receives proportionally less weight. This is the mechanism behind the accuracy drop cited above.

There's also a compounding reliability effect that's easy to underestimate. At 95% per-step reliability over a 20-step workflow, combined success drops to just 36%. A 2% misalignment introduced early — because a critical instruction was lost in a growing history — compounds into a 40% failure rate by the end of the run. Your 95% reliable agent is completing only a third of its tasks.
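The compounding is worth seeing in one line:

```python
def workflow_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step of a multi-step workflow succeeds,
    assuming independent per-step reliability."""
    return per_step ** steps

# 95% per-step reliability over 20 steps: only ~36% of runs finish cleanly.
rate = workflow_success_rate(0.95, 20)
print(f"{rate:.0%}")  # → 36%
```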

The economics reinforce this. Production agents consume roughly 100 input tokens per 1 output token. An unconstrained software-engineering agent runs $5–8 per task. At 10,000 daily interactions, an unoptimized customer-service agent costs over $250K/year; compressing context by 60% brings that down to roughly $102K. Context management and cost management are the same problem.
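That arithmetic is easy to sanity-check with a back-of-envelope estimator. The per-million-token rates and per-task token counts below are illustrative placeholders, not real pricing:

```python
def annual_cost(daily_tasks, input_tokens, output_tokens,
                input_rate=3.0, output_rate=15.0):
    """Annual spend in dollars. Rates are assumed $/Mtok placeholders;
    input_tokens and output_tokens are per-task averages."""
    per_task = (input_tokens * input_rate
                + output_tokens * output_rate) / 1e6
    return per_task * daily_tasks * 365

# ~100:1 input:output ratio, 10K interactions/day.
baseline = annual_cost(10_000, 20_000, 200)
trimmed  = annual_cost(10_000, 8_000, 200)   # input compressed by 60%
```

Because input tokens dominate at a 100:1 ratio, a 60% input compression cuts the total bill by nearly 60%.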

Three Tools, Three Different Tradeoffs

Modern agent platforms — and the Anthropic API in particular — expose three distinct mechanisms for managing context, each with a different cost/fidelity profile.

Compaction: Lossy Summarization with Inference Cost

Compaction runs the entire conversation history through an LLM and replaces all prior turns with a typed summary block. It fires at a configurable token threshold (default 150K tokens, minimum 50K). You pay inference cost. The result is a high-fidelity but lossy representation of everything that happened before.
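A minimal sketch of the mechanism, assuming a list of role/content message dicts, a pluggable summarize callable standing in for the LLM call, and len standing in for a real tokenizer:

```python
def compact(messages, summarize, threshold=150_000, count=len):
    """Once the history crosses the token threshold, replace all prior
    turns with a single typed summary block. `summarize` is any callable
    mapping the full message list to a summary string (in production,
    an LLM call); `count` is a stand-in tokenizer (len counts chars)."""
    if sum(count(m["content"]) for m in messages) < threshold:
        return messages  # under threshold: leave history untouched
    return [{"role": "user",
             "content": "[conversation summary]\n" + summarize(messages)}]
```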

The tradeoff is real. A 335,279-token research agent context — 96% file-read results, less than 2% actual reasoning — compresses to a ~2,800-token summary. That's roughly 120:1 compression on tool outputs. But information is lost. What survives depends on how the summarization prompt is written.

This is where custom instructions matter. The default summarization prompt makes general decisions about salience. If your task requires preserving every quantitative figure with its source, or every decision branch considered, you need to specify that explicitly. Custom instructions replace the default prompt entirely, so they also bear full responsibility for covering what the default would have caught.

One important finding from JetBrains Research (SWE-bench Verified, Dec 2025): LLM summarization can paradoxically lengthen trajectories by 13–15%. Summaries obscure natural stopping signals, causing agents to keep attempting past their optimal stopping point. Summarization costs in that study exceeded 7% of total per-instance expenses. This doesn't mean avoid compaction — but it means compaction alone isn't a complete strategy, and you should instrument trajectory length when you enable it.

Tool-Result Clearing: Mechanical Pruning, No Inference Cost

Tool-result clearing is simpler and cheaper: it replaces old tool results with the placeholder [cleared to save context] while preserving the tool call record. The agent retains awareness that a call was made, but the full result is gone. No LLM involved, no inference cost.

Configure it with:

  • trigger: When to fire (typically a token threshold)
  • keep: How many recent results to preserve in full
  • clear_at_least: Minimum tokens cleared per firing, to ensure cache invalidation ROI
  • exclude_tools: Tools whose results should never be cleared (memory, auth tokens, task state)

The critical design decision is which tools to exclude. File-read results, API responses, search results — all re-fetchable on demand, safe to clear. Stateful results — memory tool outputs, credentials, accumulated task context — must be protected.
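A sketch of the clearing pass using the same four knobs; the message shape, the tool field, and len-as-tokenizer are illustrative assumptions, not any particular API's schema:

```python
CLEARED = "[cleared to save context]"

def clear_tool_results(messages, trigger=100_000, keep=3,
                       clear_at_least=10_000, exclude_tools=("memory",),
                       count=len):
    """Mechanically prune tool results oldest-first once the context
    crosses `trigger` tokens. The newest `keep` results stay intact,
    excluded tools are never touched, and clearing stops after at
    least `clear_at_least` tokens are recovered.
    Returns (messages, tokens_recovered)."""
    if sum(count(m["content"]) for m in messages) < trigger:
        return messages, 0
    candidates = [i for i, m in enumerate(messages)
                  if m.get("role") == "tool"
                  and m.get("tool") not in exclude_tools]
    recovered = 0
    for i in (candidates[:-keep] if keep else candidates):
        if recovered >= clear_at_least:
            break
        recovered += count(messages[i]["content"]) - count(CLEARED)
        messages[i] = {**messages[i], "content": CLEARED}
    return messages, recovered
```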

JetBrains' finding is instructive here: observation masking (a rolling window keeping only the last N tool results, replacing older ones with placeholders) achieved 50%+ cost savings and, for Qwen3-Coder 480B, a 2.6% higher solve rate at 52% lower cost compared to full-context approaches. The current best practice appears to be: use tool-result clearing as the primary mechanism, reserve compaction for when you need to preserve reasoning across long dialogues.

External Memory: Persistent State Outside the Window

The memory tool exposes agent-initiated read/write access to external storage. Operations typically include view, create, replace, insert, delete, and rename. The agent decides what to store; the storage survives context resets and session restarts.

This is the mechanism behind patterns like the NOTES.md file that Claude Code uses in long-running sessions. The agent periodically writes important decisions, discoveries, and state to an external file, then retrieves it at the start of new sessions or after context resets. The working memory window stays bounded; the episodic record is unbounded.
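A minimal sketch of the pattern with a markdown notes file; remember and recall are hypothetical helper names, not part of any agent framework:

```python
from pathlib import Path

DEFAULT_NOTES = Path("NOTES.md")  # the session-notes file

def remember(note: str, path: Path = DEFAULT_NOTES) -> None:
    """Append one durable note; the file survives context resets
    and session restarts."""
    with path.open("a", encoding="utf-8") as f:
        f.write(f"- {note}\n")

def recall(path: Path = DEFAULT_NOTES) -> str:
    """Load the episodic record at session start (empty if none yet)."""
    return path.read_text(encoding="utf-8") if path.exists() else ""
```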

The three mechanisms are orthogonal and designed to compose:

  • External memory captures learnings and persistent state across sessions
  • Tool-result clearing keeps the in-session window bounded as work accumulates
  • Compaction summarizes long conversational histories before they rot

You typically need all three in a production agent that runs multi-step tasks over extended time periods.

Memory Architecture: Beyond the Three Tools

Understanding the three API mechanisms is necessary but not sufficient. Behind them is a broader taxonomy of memory that shapes how you architect your agent's state management.

In-context working memory is the live context window — the fastest access, but bounded. Everything above is about managing this layer.

External episodic memory is file-based or database-backed storage. The agent explicitly reads and writes it via the memory tool. This is where long-horizon state lives: decisions made, files modified, APIs called, user preferences discovered. Mem0 is currently the most mature standalone framework for this pattern in production.

Hierarchical semantic memory goes further, organizing state as a graph — nodes for entities (people, files, tasks, decisions), edges for relationships. MemGPT pioneered this architecture: the model itself manages the graph, deciding what to store, what to summarize, and what to forget. Memories strengthen on recall and decay when unused — a deliberate human-like forgetting mechanism. Amazon Bedrock AgentCore Memory, launched mid-2025, abstracts this pattern for production deployments without requiring teams to implement the graph machinery themselves.

The general principle: in-context memory is RAM, external memory is disk. Tool-result clearing and compaction are your memory management layer. You need both hardware and an operating system.

Practical Configuration Strategies

A few concrete guidelines from production deployments:

Set pre-rot thresholds, not hard limits. Don't trigger compaction at context exhaustion — trigger it at ~70% of your effective window (around 90K–100K tokens for models with 128K effective windows). Once context rot sets in, the summary will itself be degraded because the model generating it is already impaired.

Allocate token budgets explicitly. A reasonable starting distribution for a general task agent:

  • System instructions: 10–15%
  • Tool definitions and schemas: 15–20%
  • Retrieved knowledge context: 30–40%
  • Conversation history: 20–30%
  • Buffer reserve: 10–15%

The buffer reserve is not optional. Without it you have no margin before degradation begins.
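In code, assuming a 128K effective window and the midpoints of the ranges above (a starting point, not a rule):

```python
EFFECTIVE_WINDOW = 128_000
BUDGET = {
    "system_instructions": 0.12,
    "tool_definitions":    0.18,
    "retrieved_context":   0.35,
    "conversation":        0.25,
    "buffer_reserve":      0.10,
}
assert abs(sum(BUDGET.values()) - 1.0) < 1e-9  # must account for 100%

tokens = {k: round(v * EFFECTIVE_WINDOW) for k, v in BUDGET.items()}
# Fire compaction at ~70% of the effective window, before the buffer
# reserve is ever touched.
COMPACT_AT = round(0.70 * EFFECTIVE_WINDOW)    # 89,600 tokens
```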

Minimize tool count. Research consistently shows that beyond a threshold, additional tools hurt more than they help: in the same context window, 19 well-designed tools outperform 46 for quantized models. Decision paralysis from irrelevant tool options consumes attention budget that should go to task reasoning.

Use just-in-time retrieval over front-loading. Rather than loading all potentially-relevant files into context at task start, maintain lightweight identifiers (paths, URLs, query terms) and fetch on demand. This enables progressive disclosure: load content when you need it, clear it when you're done.
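A sketch of the idea: keep lightweight identifiers, fetch lazily, release when done. LazyContext and its fetch callable are illustrative names, not a library API:

```python
class LazyContext:
    """Progressive disclosure: hold identifiers (paths, URLs, query
    terms), fetch content only on demand, drop it when finished.
    `fetch` is any callable — file read, HTTP GET, search query."""
    def __init__(self, fetch):
        self.fetch = fetch
        self.loaded = {}

    def get(self, ref):
        if ref not in self.loaded:          # fetch once, then reuse
            self.loaded[ref] = self.fetch(ref)
        return self.loaded[ref]

    def release(self, ref):
        self.loaded.pop(ref, None)          # free the context budget
```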

Cache static elements aggressively. System prompts, tool definitions, and other static context that appears in every turn should be structured to hit prompt cache. Well-engineered caches achieve 95%+ hit rates, reducing input costs by roughly 90% and prefill latency by 75%. The critical implementation detail: if you're dynamically injecting tool definitions via RAG (selecting which tools to include per turn), you're breaking cache locality. Stable tool sets in stable positions outperform dynamic selection.

Apply the sub-agent pattern for isolated work. Sub-agents consume 10,000+ tokens for focused deep work but return 1,000–2,000 token summaries to the orchestrator. The orchestrator's context stays bounded; the knowledge from specialized work is captured in the summary. This is the MapReduce pattern applied to agentic context management.
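The orchestrator side of the pattern can be sketched in a few lines, with worker and summarize standing in for LLM calls:

```python
def run_subagent(task, worker, summarize, budget_chars=2_000):
    """Isolate deep work in a sub-agent: the worker burns its own
    context producing a long trace; only a bounded summary flows back
    to the orchestrator. Both callables stand in for LLM calls."""
    trace = worker(task)            # long; never enters parent context
    summary = summarize(trace)
    return summary[:budget_chars]   # hard cap on what returns
```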

The Measurement Problem

None of these strategies work without instrumentation. Context management decisions are optimization problems; optimization without measurement is guesswork.

What to instrument:

  • Token usage per turn, per session, per agent type
  • Compaction frequency and compression ratios
  • Tool result clearing frequency and tokens recovered
  • Task success rate segmented by context size at decision point
  • Trajectory length before and after enabling compaction (watch for the summarization-lengthens-trajectories effect)
  • Cost per task, tracked at the same granularity as success rate

The most important derived metric: task completion rate as a function of context size at the midpoint of the task. This tells you exactly where your agent's effective window ends — not the API limit, but the point where task success starts to degrade.
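That derived metric is a few lines of aggregation once the instrumentation exists. Assuming runs is a list of (midpoint_tokens, succeeded) pairs collected from production:

```python
from collections import defaultdict

def completion_by_context_bucket(runs, bucket_size=20_000):
    """Group task runs by context size at the task midpoint and compute
    the success rate per bucket. The bucket where the rate starts to
    fall marks the agent's effective window."""
    buckets = defaultdict(lambda: [0, 0])        # bucket -> [wins, total]
    for tokens, ok in runs:
        b = (tokens // bucket_size) * bucket_size
        buckets[b][0] += int(ok)
        buckets[b][1] += 1
    return {b: wins / n for b, (wins, n) in sorted(buckets.items())}
```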

Research frameworks like ACE (Agentic Context Engineering, Stanford/Berkeley) and ACON treat context management as an optimization problem and solve it empirically — paired trajectory analysis identifying what information was lost when the agent failed, then iterating on summarization guidelines. ACE reports average gains of 10.6% on agent benchmarks vs. strong baselines. The approach is more sophisticated than most production teams need at first, but the underlying principle — measure context-related failures and iterate on your pruning strategy — applies at any scale.

Getting Started

If you're early in building an agent that will run multi-step tasks:

  1. Add token budget tracking to every LLM call from day one
  2. Use tool-result clearing for all re-fetchable tool outputs, with a 60–70% window threshold
  3. Add external memory (even just a markdown file) for any state that needs to survive context resets
  4. Reserve compaction for scenarios where conversational reasoning must be preserved — not as the primary mechanism
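Step 1 can be as small as a ledger wrapped around every call site; TokenLedger is a hypothetical name, not an existing library:

```python
class TokenLedger:
    """Day-one instrumentation: record tokens per LLM call, roll up
    per session. Call record() around every LLM invocation."""
    def __init__(self):
        self.calls = []

    def record(self, session_id, input_tokens, output_tokens):
        self.calls.append((session_id, input_tokens, output_tokens))

    def session_totals(self, session_id):
        ins = sum(i for s, i, _ in self.calls if s == session_id)
        outs = sum(o for s, _, o in self.calls if s == session_id)
        return ins, outs
```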

If you have an agent in production that's failing on long tasks, start with context size at failure point. If failures cluster around tasks that exceeded 60–70% context fill, you have a context management problem. If they're distributed evenly, you have a different problem.

The context window is not a resource you can ignore and trust the model to handle. It's a first-class architectural component of your agent, and the difference between agents that drift on step 12 and agents that complete reliably is almost always how intentionally it's managed.
