Four Strategies for Engineering Agent Context That Actually Scales
There's a failure mode in production agents that most engineers discover the hard way: your agent works well on the first few steps, then starts hallucinating halfway through a task, misses details it was explicitly given at the start, or issues a tool call that contradicts instructions it received twenty steps ago. The model didn't change. The task didn't get harder. The context did.
Long-running agents accumulate history the way browser tabs accumulate memory — silently, relentlessly, until something breaks. Every tool response, observation, and intermediate reasoning trace gets appended to the window. The model sees all of it, which means it has to reason through all of it on every subsequent step. As context grows, precision drops, reasoning weakens, and the model misses information it should catch. This is context rot, and it's one of the most common failure modes in production agents.
Context engineering — the discipline of deciding what goes into the agent's context window, when, and in what form — has become the core competency of agentic AI development. One widely-shared framing describes it as "effectively the #1 job of engineers building AI agents." There are four fundamental strategies for tackling this problem: write context to external storage, select context from that storage, compress context that's already in the window, and isolate context across separate agent processes. Each strategy addresses different failure modes. Understanding when to use which — and how to combine them — is what separates agents that degrade at step 50 from agents that hold up at step 500.
Strategy 1: Write Context Out of the Window
The most direct way to prevent context rot is to not let it accumulate in the first place. Write information to external storage and keep only a lightweight reference in the active context.
This takes several forms. The simplest is a scratchpad — a dedicated notes file or structured state field where the agent writes down key findings as it works. Claude Code implements this with to-do lists and working notes files. The agent doesn't need to re-read every prior tool response to remember what it learned; it reads its own notes instead.
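A minimal sketch of the scratchpad idea, assuming a filesystem-backed notes file (the `Scratchpad` class and its method names are illustrative, not any framework's API):

```python
# A toy scratchpad: the agent appends findings as it works and later
# re-reads its own notes instead of replaying raw tool responses.
import tempfile
from pathlib import Path

class Scratchpad:
    def __init__(self, path: Path):
        self.path = path
        self.path.write_text("")  # start each task with a clean pad

    def note(self, finding: str) -> None:
        # Append one distilled finding; the verbose tool output that
        # produced it never re-enters the context window.
        with self.path.open("a") as f:
            f.write(f"- {finding}\n")

    def recall(self) -> str:
        # What the agent actually sees on later steps: its own notes.
        return self.path.read_text()

pad = Scratchpad(Path(tempfile.mkdtemp()) / "agent_notes.md")
pad.note("auth bug is in session.py, triggered after the refactor")
pad.note("tests fail only when TZ is unset")
```

The payoff is that step 40 reads two short bullet points rather than the thousands of tokens of tool output they were distilled from.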
For large tool outputs, the threshold-based approach is more effective. LangChain's Deep Agents framework triggers automatic offloading whenever a tool response exceeds 20,000 tokens — the content gets saved to the filesystem, and the context receives a file path plus the first ten lines as a preview. The agent can retrieve the full content when needed, but doesn't carry it in every subsequent step.
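The offloading behavior described above can be sketched as follows. The 20,000-token threshold and ten-line preview come from the article; the function names and the chars-divided-by-four token estimate are illustrative stand-ins, not the Deep Agents API:

```python
# Threshold-based offloading: large tool outputs go to disk, and the
# context receives only a file path plus a short preview.
import tempfile
from pathlib import Path

TOKEN_LIMIT = 20_000   # threshold cited in the article
PREVIEW_LINES = 10

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough chars-per-token heuristic

def offload_if_large(tool_output: str, workdir: Path) -> str:
    """Return the string that actually enters the agent's context."""
    if estimate_tokens(tool_output) <= TOKEN_LIMIT:
        return tool_output
    path = workdir / "tool_output.txt"
    path.write_text(tool_output)
    preview = "\n".join(tool_output.splitlines()[:PREVIEW_LINES])
    return f"[saved to {path}]\n{preview}"

workdir = Path(tempfile.mkdtemp())
small = offload_if_large("ok", workdir)             # passes through
big = offload_if_large("line\n" * 50_000, workdir)  # offloaded to disk
```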
Manus, an autonomous general-purpose agent, takes a similar approach through filesystem operations. When an agent browses a web page, it can save the relevant content and then clear the web page from context — retaining the URL as a lightweight pointer rather than the full text. This "delete but preserve the pointer" pattern is deceptively powerful. Information remains recoverable without occupying attention budget.
The mental model here is RAM versus disk in a traditional operating system. Fast, finite, expensive memory for active reasoning; slower, cheaper storage for everything else. Agents that treat their context window like RAM instead of an append-only log stay sharp much longer.
Strategy 2: Select Context Just in Time
The write strategy removes content from the window; the select strategy brings the right subset back when it's needed.
The naive version is retrieval-augmented generation: embed stored content, query by similarity when needed, inject the top results. This works for straightforward lookups but struggles with complex agent tasks where what's relevant depends on reasoning that hasn't happened yet.
More sophisticated selection keeps lightweight identifiers — file paths, document IDs, URLs — in context and retrieves content on demand. Rather than pre-loading every potentially relevant document at session start, the agent discovers what it needs as it works and fetches it with targeted tool calls. This is the just-in-time retrieval pattern.
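In code, just-in-time retrieval is the difference between injecting content and injecting handles. A minimal sketch, with a hypothetical in-memory store standing in for a real document system:

```python
# Just-in-time retrieval: the context holds only lightweight document
# IDs; a fetch tool pulls full content when reasoning demands it.
STORE = {
    "doc:auth-design": "Full text of the auth design doc ...",
    "doc:incident-42": "Full text of the incident report ...",
}

def list_documents() -> list[str]:
    # Cheap: only identifiers enter the context window.
    return sorted(STORE)

def fetch(doc_id: str) -> str:
    # Expensive: full content, loaded on demand via a tool call.
    return STORE[doc_id]

context = {"available_docs": list_documents()}  # a handful of tokens
detail = fetch("doc:incident-42")               # fetched when needed
```

The session starts with a token cost proportional to the number of identifiers, not the size of the corpus.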
For coding agents, hybrid retrieval combining traditional grep and abstract syntax tree parsing with semantic similarity tends to outperform pure embedding search. The agent can search by structure (find all callers of this function) as well as meaning (find code related to authentication), and re-ranking reconciles the two approaches.
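The re-ranking step can be illustrated with a toy scorer. Both scoring functions here are deliberate simplifications: the lexical score stands in for grep/AST matching, and the token-overlap ratio stands in for embedding similarity:

```python
# Toy hybrid re-ranking: combine a lexical (grep-style) score with a
# stand-in "semantic" score, weighted per candidate snippet.
def lexical_score(query: str, snippet: str) -> float:
    return float(query.lower() in snippet.lower())

def semantic_score(query: str, snippet: str) -> float:
    # Stand-in for embedding similarity: token overlap ratio.
    q, s = set(query.lower().split()), set(snippet.lower().split())
    return len(q & s) / len(q) if q else 0.0

def rerank(query: str, snippets: list[str], w: float = 0.5) -> list[str]:
    return sorted(
        snippets,
        key=lambda s: w * lexical_score(query, s)
                      + (1 - w) * semantic_score(query, s),
        reverse=True,
    )

hits = rerank("verify token", [
    "renders the login page template",
    "verify token signature and expiry",
    "check user credentials in db",
])
```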
Tool descriptions themselves are a selection problem. When an agent has access to dozens of tools, presenting all descriptions at every step can consume 50,000 tokens before the user's first message. Progressive disclosure solves this: tools are registered with short discovery descriptions, and full schemas are loaded only when the agent invokes a tool for the first time. Anthropic standardized this pattern in late 2025, and it was rapidly adopted by OpenAI, Google, GitHub, and Cursor.
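A sketch of progressive disclosure, assuming a hypothetical registry shape (this is not any vendor's actual protocol, just the lazy-loading idea):

```python
# Progressive disclosure: tools advertise a one-line discovery
# description; the full schema loads only on first use.
class ToolRegistry:
    def __init__(self):
        self._tools = {}     # name -> (short_desc, schema_loader)
        self._schemas = {}   # populated lazily, on first invocation

    def register(self, name, short_desc, schema_loader):
        self._tools[name] = (short_desc, schema_loader)

    def discovery_prompt(self) -> str:
        # What every step sees: one short line per tool.
        return "\n".join(f"{n}: {d}" for n, (d, _) in self._tools.items())

    def full_schema(self, name) -> dict:
        # Loaded once, only when the agent first calls this tool.
        if name not in self._schemas:
            self._schemas[name] = self._tools[name][1]()
        return self._schemas[name]

reg = ToolRegistry()
reg.register("web_search", "search the web",
             lambda: {"query": "string", "max_results": "int"})
reg.register("read_file", "read a file from disk",
             lambda: {"path": "string"})

prompt = reg.discovery_prompt()          # two short lines, every step
schema = reg.full_schema("web_search")   # paid only when invoked
```

With dozens of tools, the per-step cost drops from full schemas everywhere to one line each, with the full schema paid once per tool actually used.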
Strategy 3: Compress What's Already There
Sometimes content can't be offloaded because the agent needs access to it — but it also can't stay in its original form without overwhelming the window. Compression bridges the gap.
The obvious approach is LLM summarization: ask the model to condense prior conversation history before continuing. This works, but benchmarking reveals a counterintuitive cost. Pure summarization can increase total execution time by 13 to 15 percent. The reason is trajectory elongation — the summarization step consumes tokens and time, the agent has to reason through the summary, and the shorter summary often loses enough detail that subsequent reasoning takes more steps. You save tokens per step but add steps.
Observation masking outperforms pure summarization in most long-running agent benchmarks. The technique works differently: instead of rewriting old content, it replaces older observations with placeholder tokens — the agent's own reasoning and actions remain intact, but verbose tool outputs from earlier in the trajectory disappear. The agent can still see what it decided and why; it just can't see the raw web page content or test logs from fifteen steps ago.
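Masking is mechanically simple, which is part of its appeal. A sketch over a generic message list (the message shape and placeholder text are assumptions, not a specific framework's format):

```python
# Observation masking: replace older tool observations with a
# placeholder while keeping the agent's reasoning and actions intact.
def mask_observations(history: list[dict], keep_last: int = 2) -> list[dict]:
    # Indices of tool observations, oldest first.
    tool_idx = [i for i, m in enumerate(history) if m["role"] == "tool"]
    keep = set(tool_idx[-keep_last:]) if keep_last else set()
    out = []
    for i, m in enumerate(history):
        if m["role"] == "tool" and i not in keep:
            out.append({"role": "tool", "content": "[observation masked]"})
        else:
            out.append(m)  # assistant reasoning survives verbatim
    return out

history = [
    {"role": "assistant", "content": "I'll run the tests."},
    {"role": "tool", "content": "500 lines of pytest output ..."},
    {"role": "assistant", "content": "Two failures; reading the file."},
    {"role": "tool", "content": "full contents of session.py ..."},
    {"role": "tool", "content": "git diff output ..."},
]
trimmed = mask_observations(history, keep_last=2)
```

Note that no model call is needed: unlike summarization, masking adds no tokens, no latency, and no lossy rewrite of the agent's own decisions.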
The benchmark results are instructive. On SWE-bench-Verified using a large coding model, observation masking alone reduced costs by over 50 percent. A hybrid approach — masking as the primary mechanism, summarization applied selectively for critical sections — reduced costs by an additional 7 percent versus masking alone and 11 percent versus pure summarization. Task success rates improved by roughly 2.6 percent over unmanaged baselines. Across a full benchmark run, the cost difference between optimized and unoptimized context management ran to tens of dollars — which scales dramatically across thousands of production agent runs.
Claude Code triggers automatic context compaction when usage hits 95 percent of the available window, applying a version of this hybrid approach. The key insight that emerged from several independent systems: give the model control over compression decisions rather than applying rule-based heuristics. Models trained to identify what's safe to compress outperform fixed window-size rules, because what matters for a task isn't always what's recent.
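A trigger of this shape can be sketched as follows. The 95 percent threshold comes from the article; the window size, token estimate, and mask-then-summarize helpers are placeholders for illustration:

```python
# Sketch of a compaction trigger: fire at 95% of the window, mask
# observations first, reserve summarization for what masking misses.
WINDOW = 200_000
TRIGGER = 0.95

def estimate_tokens(messages: list[str]) -> int:
    return sum(len(m) // 4 for m in messages)  # rough heuristic

def maybe_compact(messages: list[str]) -> tuple[list[str], bool]:
    if estimate_tokens(messages) < TRIGGER * WINDOW:
        return messages, False
    # Step 1: mask everything but the most recent message.
    compacted = ["[masked]"] * (len(messages) - 1) + messages[-1:]
    # Step 2 (not shown): selectively summarize critical sections
    # if the masked history is still too large.
    return compacted, True

msgs = ["x" * 4_000] * 200              # ~200k tokens: over the trigger
compacted, fired = maybe_compact(msgs)
```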
Strategy 4: Isolate Context Across Agents
Some tasks genuinely require more information than any single context window can hold, regardless of compression. Multi-agent architectures address this by distributing the load.
The standard pattern: a coordinator agent decomposes a task and delegates subproblems to specialized workers. Each worker runs with a focused, bounded context — just the information relevant to its slice of the problem. Only results flow back to the coordinator, not the full reasoning traces and intermediate observations that filled the worker's window. The coordinator's context stays manageable because it receives summaries, not transcripts.
This is, at heart, divide-and-conquer applied to attention. A research task that would overwhelm a single agent with ten simultaneous document analyses becomes tractable when each document analysis runs in its own isolated context.
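The coordinator/worker flow can be sketched with a stub standing in for the LLM call — the point is what crosses the boundary, not the analysis itself:

```python
# Coordinator/worker isolation: each worker sees only its slice of
# the task, and only a short summary flows back to the coordinator.
def worker(subtask: str, document: str) -> dict:
    # Stub for an isolated agent run with its own bounded context.
    trace = [f"read {len(document)} chars", f"analyzed for: {subtask}"]
    summary = f"{subtask}: found {document.count('error')} matches"
    return {"summary": summary, "trace": trace}  # trace stays local

def coordinator(task: str, documents: list[str]) -> list[str]:
    results = []
    for i, doc in enumerate(documents):
        out = worker(f"{task} in doc {i}", doc)
        results.append(out["summary"])   # only the summary flows up
    return results

docs = ["error at line 3\nerror at line 9", "all checks passed"]
report = coordinator("count errors", docs)
```

The coordinator's window grows by one line per subtask, regardless of how many tokens each worker burned internally.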
State schemas provide a more fine-grained version of isolation. Rather than exposing an agent's full runtime state in every prompt, Pydantic models define which fields are visible to the LLM at each step. An agent processing a large document might have access to only the current section, the accumulated findings, and the next objective — not the full document or complete prior history. The schema is a filter, not just a container.
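The article names Pydantic for this; the same filtering idea can be shown with stdlib dataclasses (field names here are hypothetical):

```python
# State-schema filtering: the full runtime state carries everything,
# while an explicit view function exposes only LLM-visible fields.
from dataclasses import dataclass

@dataclass
class FullState:
    document: str          # large; never shown to the model
    history: list          # complete trajectory; never shown
    current_section: str
    findings: list
    next_objective: str

def llm_view(state: FullState) -> dict:
    # The filter: only these three fields ever reach the prompt.
    return {
        "current_section": state.current_section,
        "findings": state.findings,
        "next_objective": state.next_objective,
    }

state = FullState(
    document="50,000 words ...",
    history=["step 1", "step 2"],
    current_section="3.2 Token budgets",
    findings=["budget is per-step, not per-task"],
    next_objective="summarize section 3.3",
)
visible = llm_view(state)
```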
The caveat with isolation is coordination overhead. Multi-agent systems introduce their own failure modes: context poisoning at the coordinator level when subagent results contradict each other, synchronization errors when agents have different versions of shared state, and compounding latency when subagent calls chain sequentially. Isolation solves the context length problem but creates a consistency problem. The architecture needs both.
Putting the Strategies Together
These four strategies aren't alternatives — they're complementary layers of a complete approach.
Write context for everything that might be needed later but doesn't need to be in the active window now. Select context just in time, maintaining lightweight pointers instead of pre-loaded full content. Compress what accumulates despite the other strategies, using observation masking as the primary tool with selective summarization for critical reasoning. Isolate context when tasks genuinely exceed single-agent capacity.
The common thread across all four is intentionality. An unmanaged agent context is an append-only log that grows until the model can't reason through it clearly. A managed context is more like working memory — finite, curated, and maintained actively across the entire task.
Engineers who treat context as a first-class architectural concern — not an implementation detail to address after the agent "works" — build systems that hold up in production. The ones who don't find that their agent's performance ceiling is whatever task length their context window tolerates before it starts to rot.
References
- https://arxiv.org/abs/2510.00615
- https://arxiv.org/abs/2510.04618
- https://blog.langchain.com/context-management-for-deepagents/
- https://blog.jetbrains.com/research/2025/12/efficient-context-management/
- https://www.newsletter.swirlai.com/p/state-of-context-engineering-in-2026
- https://martinfowler.com/articles/exploring-gen-ai/context-engineering-coding-agents.html
- https://www.getmaxim.ai/articles/context-window-management-strategies-for-long-context-ai-agents-and-chatbots/
- https://airbyte.com/agentic-data/ai-context-window-optimization-techniques/
- https://redis.io/blog/context-window-management-llm-apps-developer-guide/
- https://medium.com/the-ai-forum/automatic-context-compression-in-llm-agents-why-agents-need-to-forget-and-how-to-help-them-do-it-43bff14c341d
- https://google.github.io/adk-docs/context/compaction/
- https://www.flowhunt.io/blog/context-engineering-for-ai-agents/
