Tokens Are a Finite Resource: A Budget Allocation Framework for Complex Agents
Frontier models now advertise context windows of 200K, 1M, even 2M tokens. Engineering teams treat this as a solved problem and move on: the number is so large, surely we'll never hit it.
Then, six hours into an autonomous research task, the agent starts hallucinating file paths it edited three hours ago. A coding agent confidently opens a function it deleted in turn four. A document analysis pipeline begins contradicting conclusions it drew from the same document earlier in the session. These are not model failures. They are context budget failures — predictable, measurable, and almost entirely preventable if you treat the context window as the scarce compute resource it actually is.
The Maximum Effective Context Window Is Not What You Think
A 2025 analysis of 12 frontier models, including GPT-4o, Claude Opus, and Gemini 2.5, found that 11 of the 12 dropped below 50% of baseline performance at 32K tokens. GPT-4o fell from 99.3% to 69.7% accuracy as its context filled up. Claude models degraded more gracefully but were not immune.
Research published in September 2025 introduced the concept of the Maximum Effective Context Window (MECW): the maximum context length at which a model can still reliably complete a task. The MECW is dramatically smaller than the marketed Maximum Context Window. For some tasks, models failed with as little as 100 tokens in context. Most showed severe degradation by 1,000 tokens in specific task configurations.
The foundational mechanism is well understood. The 2024 paper "Lost in the Middle" (Liu et al., Transactions of the ACL) demonstrated a consistent finding: model performance peaks when relevant information is at the beginning or end of context. With 20 documents, accuracy dropped 30%+ when the relevant document sat in positions 5–15. Transformers allocate attention across all tokens, and as context grows, the signal-to-noise ratio for any specific piece of information degrades.
A 2025 study from Chroma made this concrete with a controlled experiment: even when all distractor tokens were replaced with whitespace — giving the model perfect retrieval — performance still degraded with length. Llama-3.1-8B showed up to 85% accuracy drops at 30K tokens. Claude 3.5 showed a 67.6% MMLU accuracy drop at 30K. The length itself is the problem, not the content.
The production implication: you cannot rely on "the model will find what it needs in a large context." Agents that passively accumulate context are slowly degrading as they run.
How Context Budgets Fail in Practice
When an agent runs out of context mid-task, there are three distinct failure modes depending on the infrastructure:
Hard error: Newer Claude models (Sonnet 3.7+) return validation errors rather than silently truncating. This is actually the most recoverable failure mode — at least you know something went wrong.
Silent truncation: Many configurations and legacy models drop oldest messages without warning when the window fills. The agent continues, but with a hole in its working memory. Older tool results vanish. Previous decisions disappear. The agent reasons from an increasingly amputated version of its own history.
Context poisoning: The most dangerous failure mode. When context nears capacity, a hallucination — caused by the model struggling to track everything — gets appended to the conversation and treated as fact. That hallucination then influences every subsequent turn. A 2025 study of LLM agent games documented exactly this: near-full context caused the model to hallucinate game state, which was then fed back into subsequent reasoning, compounding across turns until the agent's world model bore no relation to reality.
The aggregate cost is significant. Industry data from 2025 attributed 65% of enterprise AI pipeline failures to context drift or memory loss during multi-step reasoning — not raw context exhaustion, but gradual degradation as context quality eroded.
A Four-Tier Allocation Model
Treating context as a budget means deciding, explicitly, how tokens get distributed across the components that compete for space. The practical consensus from production deployments is a priority-ordered allocation with four tiers:
Tier 1 — Static anchors (never evict): System prompt and tool schemas. These go at the start of the prompt, they are stable across requests, and they are the most cache-eligible content. These should consume a fixed, known slice of your budget — ideally under 10–15% of total context. If your system prompt is 40K tokens, you have already consumed 20% of a 200K window before a single user message arrives.
Tier 2 — Active retrieved context: RAG results, injected memories, and relevant documents. This tier gets the second-largest allocation. The key discipline here is just-in-time retrieval: do not pre-load large documents into context. Maintain identifiers and load content dynamically through tool calls, returning only the relevant excerpts. A tool call that returns 20,000 tokens of raw JSON to an agent with 22,000 tokens remaining will stall the task.
Tier 3 — Conversation history: This is where most systems fail. History grows unboundedly if left unmanaged. The correct model is rolling compression: oldest exchanges are summarized as new content arrives, rather than keeping raw transcripts until they hit the limit.
Tier 4 — Scratch space: Intermediate reasoning, chain-of-thought traces, and tool call outputs that have already been acted on. This is the most expendable tier. Claude's API automatically strips extended thinking blocks from context accumulation between turns, which is architecturally significant for long agent sessions — you get the reasoning benefit without paying the accumulation cost.
The allocation percentages depend on your task type. A coding agent session needs more conversation history (referencing earlier code). A document analysis pipeline needs more retrieved context. The key is making the allocation explicit rather than emergent.
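The tiers above can be made concrete as a small budget table with per-tier checks. A minimal sketch, assuming a 200K window; the percentages are illustrative, not prescriptive, and should shift with task type as just described:

```python
CONTEXT_WINDOW = 200_000  # total window; model-dependent

# Illustrative split: a coding agent would shift weight toward
# conversation_history, a document pipeline toward retrieved_context.
TIER_BUDGETS = {
    "static_anchors": 0.12,        # Tier 1: system prompt + tool schemas
    "retrieved_context": 0.35,     # Tier 2: RAG results, memories, documents
    "conversation_history": 0.38,  # Tier 3: rolling-compressed history
    "scratch_space": 0.15,         # Tier 4: expendable intermediate output
}

def tier_limit(tier: str, window: int = CONTEXT_WINDOW) -> int:
    """Absolute token ceiling for one tier."""
    return int(TIER_BUDGETS[tier] * window)

def over_budget(usage: dict) -> list:
    """Tiers whose measured usage exceeds their allocation."""
    return [t for t, used in usage.items() if used > tier_limit(t)]
```

With this split, a 40K-token system prompt shows up as a Tier 1 violation before the request is ever sent, which is the point: the overrun is visible up front rather than surfacing later as degraded answers.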
Enforcement: Hard Limits and Soft Compaction
Explicit allocation requires enforcement. There are two patterns:
Token budget injection (explicit, model-side): Research on Token-Budget-Aware LLM Reasoning (TALE) showed that injecting the budget directly into the prompt — "complete this task in approximately 500 tokens" — produced a 67% reduction in output tokens and a 59% cost reduction while maintaining competitive performance. The model self-regulates when it has a visible budget. Anthropic's context awareness API formalizes this: Claude 4.5+ models can receive live token counters injected after each tool call, letting the model adjust verbosity as the session progresses.
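Budget injection requires nothing more than prompt construction. A hedged sketch, where the exact wording and the 500-token figure are illustrative (TALE's actual template and budget-search procedure differ):

```python
def with_token_budget(task: str, budget: int) -> str:
    """Append an explicit output budget to the task prompt so the
    model can self-regulate its verbosity, TALE-style."""
    return f"{task}\n\nPlease complete this task using approximately {budget} tokens."

# Hypothetical usage: cap a summarization step at ~500 output tokens.
prompt = with_token_budget("Summarize the three failure modes above.", 500)
```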
Compaction (explicit, infrastructure-side): When context crosses a threshold — typically around 80–85% of window capacity — a structured summarization pass replaces raw history with a compressed representation. The key implementation detail is anchored iterative summarization: rather than compressing the full history every time, maintain a running structured summary and only compress the newly-dropped span when truncation triggers. This avoids exponential compute costs and preserves the structure of earlier decisions.
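Anchored iterative compaction can be sketched as follows. This is an assumption-laden skeleton: `count_tokens` stands in for your tokenizer, `summarize` for a model call that folds the dropped span into the running summary, and the 85% threshold follows the range above:

```python
COMPACTION_THRESHOLD = 0.85  # trigger at ~85% of window capacity

def maybe_compact(history, running_summary, count_tokens, summarize, window):
    """Anchored iterative summarization: when total usage crosses the
    threshold, fold only the oldest half of history into the running
    summary instead of re-compressing the entire transcript."""
    total = count_tokens(running_summary) + sum(count_tokens(m) for m in history)
    if total < COMPACTION_THRESHOLD * window:
        return history, running_summary  # under budget: no-op
    cut = max(1, len(history) // 2)
    dropped, kept = history[:cut], history[cut:]
    # Compress only the newly dropped span, anchored to the prior summary.
    return kept, summarize(running_summary, dropped)
```

Because each trigger compresses only the newly dropped span, compute per compaction stays roughly constant instead of growing with total session length.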
The critical unsolved problem with compaction is artifact tracking. A 2025 evaluation of eight agent frameworks found that all of them scored between 2.19 and 2.45 out of 5.0 on reliably remembering which files had been modified across compaction boundaries. When the agent's working memory gets compressed, "what did I already change?" becomes an unreliable answer. For coding agents, this is the specific failure mode that causes duplicate edits, lost modifications, and contradictory code changes in long sessions.
KV Cache and Budget Design Are Coupled
The physical substrate of context is the key-value cache: during inference, every token's attention matrices are computed and stored in GPU memory. Provider-side prompt caching reuses these precomputed tensors when prompts share common prefixes. Anthropic charges cache reads at 10% of the base input price (a 90% discount); Google charges 10% of standard input for Gemini 2.5+ cached tokens.
The budget design constraint that follows: static content must be placed at the start of the prompt, and it must stay stable across requests. A 2026 study found that adding a single token to a static prefix could drop cache hit rates to zero for all subsequent requests in a session. Your system prompt and tool schemas are your cache anchor. If they change between turns, you pay full price for every request.
The implication for allocation is that your Tier 1 content is not just about correctness — it is your primary cost lever. A well-structured, stable 15K-token system prompt that stays at the start of every request amortizes its cost heavily across a session. A 50K system prompt that shifts between requests defeats the cache and drives per-request cost up proportionally.
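The amortization is straightforward arithmetic. A back-of-envelope sketch, where the $3/Mtok base price is a placeholder and cache-write surcharges are ignored for simplicity:

```python
def prefix_cost(prefix_tokens, turns, base_per_mtok, cache_read_discount=0.10):
    """Cost of re-sending a stable prefix every turn, uncached vs. cached
    (cache reads billed at 10% of the base input price)."""
    uncached = prefix_tokens * turns * base_per_mtok / 1e6
    cached = (prefix_tokens * base_per_mtok / 1e6                  # turn 1: full price
              + prefix_tokens * (turns - 1) * cache_read_discount
                * base_per_mtok / 1e6)                             # later turns: reads
    return uncached, cached

# A stable 15K-token prompt over a 50-turn session:
uncached, cached = prefix_cost(15_000, turns=50, base_per_mtok=3.0)
```

At these placeholder prices the cached session pays roughly 12% of the uncached figure for its prefix; a prompt that shifts between requests forfeits the entire discount.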
The Sub-Agent Pattern as a Context Firewall
For multi-agent architectures, the most effective structural solution to context accumulation is preventing it at the architectural level. Sub-agents handle focused tasks and return condensed summaries (typically 1,000–2,000 tokens) to the orchestrator, rather than passing full tool output chains back up the hierarchy.
This pattern prevents the compounding context growth that characterizes naive agent loops: each sub-agent gets a fresh context window, does its work, and returns a summary. The orchestrator never sees the sub-agent's tool call history. The cost and performance characteristics of the overall system improve because each agent operates in the early, high-performance portion of its context window, not the degraded tail.
The tradeoff is summarization fidelity: important details can be lost in the sub-agent's summary. The mitigation is structured output formats for sub-agent responses, specifying exactly what fields must be preserved (changed files, decisions made, current state, blockers) rather than asking the sub-agent to summarize freeform.
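A structured response contract can be as simple as a dataclass. The field names below mirror the ones listed above, but the class itself and its crude length check are a hypothetical sketch:

```python
from dataclasses import dataclass, field

@dataclass
class SubAgentReport:
    """What a sub-agent returns to the orchestrator instead of its full
    tool-call history: the fields that must survive summarization."""
    changed_files: list
    decisions: list
    current_state: str
    blockers: list = field(default_factory=list)

    def within_budget(self, max_tokens=2_000, chars_per_token=4):
        """Rough length check against the 1,000-2,000 token summary budget."""
        text = " ".join([*self.changed_files, *self.decisions,
                         self.current_state, *self.blockers])
        return len(text) // chars_per_token <= max_tokens
```

Validating the report against a schema like this forces the sub-agent to preserve the fields the orchestrator will actually need, rather than leaving preservation to freeform summarization.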
Making It Operational
The shift from treating context as "available space" to treating it as an explicit budget requires a few concrete practices:
- Instrument token usage per component. Know, before each request, how many tokens your system prompt, tool schemas, retrieved documents, and history are consuming. This is table stakes.
- Set per-tier soft limits. When retrieved context exceeds its allocation, compress before injecting. When conversation history hits its limit, trigger summarization.
- Test at context capacity, not just at nominal load. A session that runs for 50 turns with aggressive tool use will have a fundamentally different context profile than a 3-turn test case. Your eval suite should include long-running scenarios that stress context limits.
- Design tool outputs for token efficiency. A tool that returns 20K tokens of raw JSON when the relevant answer is 200 tokens is a context budget liability. Wrap external APIs in thin layers that return structured, minimal responses.
- Track artifact state explicitly, not through history recall. For coding agents and document-editing agents, maintain an explicit state store (files changed, decisions made, current task status) outside the context window and inject it as structured context at the start of each major turn.
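Several of these practices compose into a single pre-request assembly step. A minimal sketch, with `count_tokens` standing in for a real tokenizer and the component names assumed:

```python
def build_request(components: dict, count_tokens, window=200_000):
    """Assemble a request with per-component token accounting; fail
    loudly before sending rather than let anything truncate silently."""
    usage = {name: count_tokens(text) for name, text in components.items()}
    total = sum(usage.values())
    if total > window:
        raise ValueError(f"context over budget ({total}/{window}): {usage}")
    return usage

# Hypothetical usage, with the explicit artifact state injected up front
# instead of relying on history recall:
usage = build_request(
    {
        "system_prompt": "You are a careful coding agent.",
        "artifact_state": "changed: src/auth.py; task: refactor login",
        "history": "summary of turns 1-12: ...",
    },
    count_tokens=lambda s: len(s) // 4,  # crude stand-in tokenizer
)
```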
The 1M-token context window is a capability ceiling, not an operational baseline. The systems that genuinely perform well at that scale are the exception. For most production agents, the effective working budget is far smaller, and the systems that treat it as such are the ones that stay reliable across long-horizon tasks.
Context management is not a feature you add. It is the foundation you design for.
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://platform.claude.com/docs/en/build-with-claude/context-windows
- https://aclanthology.org/2024.tacl-1.9/
- https://research.trychroma.com/context-rot
- https://arxiv.org/abs/2510.05381
- https://arxiv.org/abs/2509.21361
- https://arxiv.org/abs/2412.18547
- https://arxiv.org/abs/2601.06007
- https://aclanthology.org/2024.acl-long.428/
