
Token Budgets Are a Scheduling Problem, Not a Prompt Problem

Tian Pan
Software Engineer

When an agent gives a worse answer than it did last week, the first instinct is to blame the prompt. Someone reworks the system instructions, trims a few sentences, adds an example, and ships. Sometimes it helps. Often it does nothing, because the prompt was never the problem. The problem is that a single verbose tool result quietly consumed 18,000 tokens, pushed the actual task instructions into the low-attention middle of the context window, and left the model reasoning over a transcript that is 70% noise.

That is not a wording problem. That is a resource-allocation problem. And resource allocation has a name in systems engineering: scheduling. The context window is a fixed-size resource, multiple consumers compete for it, and right now most agent stacks "schedule" it the way a 1960s batch system scheduled memory — first come, first served, until it runs out.

The reframe matters because it changes what you build. If token usage is a prompt problem, the fix is editing. If it is a scheduling problem, the fix is an allocator: a component that assigns budget to each consumer, enforces ceilings, decides what gets evicted under pressure, and makes the whole thing observable. Operating systems solved this for memory decades ago. Agent frameworks are rediscovering it the hard way.

The Context Window Is RAM, and Nobody Wrote a Memory Manager

Think about what actually competes for space in a single inference call. The system prompt. Tool and function schemas. Retrieved chunks from a vector store. The running conversation history. Tool outputs from prior steps. Few-shot examples. Scratchpad reasoning from sub-agents. Each of these is a process asking for memory, and the context window is the RAM they share.

In a real operating system, no process gets to call malloc() without bound. There is a memory manager that tracks allocation, enforces limits, and pages things out when pressure builds. In most agent code, there is no equivalent. The retriever grabs as many chunks as its top_k says. The tool returns whatever the API returned. History grows by one turn every turn. Nothing arbitrates between them. The "allocation policy" is whatever order the strings happened to get concatenated in.

This works until it doesn't. The breaking point is rarely a hard context-limit error — modern windows are large enough that you usually have room. The breaking point is quality. Chroma's context-rot research tested 18 frontier models and found that every one of them degrades as input length grows, even when the relevant information is technically present. Performance follows a U-shaped curve: high accuracy for content at the start and end of the window, and a 30%-plus accuracy drop for content stranded in the middle. A context window is not a uniform resource. Tokens have positional value, and an allocator that ignores position is leaving correctness on the floor.
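One way to make position part of the allocation decision is to order blocks so the most load-bearing ones sit at the edges of the window and the least important ones absorb the middle. The sketch below assumes a hypothetical `Block` type with a numeric priority. Real chat APIs constrain where system and user messages can go, so treat it as an illustration of the idea rather than a drop-in layout.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    priority: int  # hypothetical scale: higher means more load-bearing


def order_for_position(blocks: list[Block]) -> list[Block]:
    """Put the highest-priority blocks at the edges of the window."""
    ranked = sorted(blocks, key=lambda b: b.priority, reverse=True)
    front: list[Block] = []
    back: list[Block] = []
    for rank, block in enumerate(ranked):
        # Alternate sides so importance falls off toward the middle.
        (front if rank % 2 == 0 else back).append(block)
    # Reverse the back half so its most important block comes last.
    return front + back[::-1]


blocks = [
    Block("old_tool_output", 1),
    Block("system", 10),
    Block("history", 5),
    Block("task", 9),
]
layout = [b.name for b in order_for_position(blocks)]
print(layout)
```

With these made-up priorities, the system prompt lands first, the task description lands last, and the stale tool output ends up in the middle where a drop in attention costs the least.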

So the failure is silent. You do not get an exception. You get a slightly worse answer, then another, and a dashboard that still says green because nothing technically broke. That is the signature of an unmanaged resource.

Per-Component Budget Ceilings

The first thing a scheduler does is cap each consumer. You decide, ahead of time, the maximum share of the window each component may take, and you enforce it.

A workable starting split for an agent looks roughly like this: system instructions get 10–15%, tool schemas get 15–20%, retrieved knowledge gets 30–40%, and the rest is reserved for conversation history and the model's own reasoning. The exact numbers matter less than the existence of the ceiling. A ceiling converts an unbounded malloc() into a fixed allocation, which means one component can no longer starve the others by accident.
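As a rough sketch, the split above can be turned into hard numbers in a few lines. The window size, the component names, and the choice to use the upper end of each range are all assumptions made for illustration.

```python
WINDOW_TOKENS = 100_000  # hypothetical window size

# Upper ends of the ranges in the text, as whole percentages.
CEILING_PERCENT = {
    "system": 15,
    "tool_schemas": 20,
    "retrieved": 40,
}


def token_ceilings(window: int, percents: dict[str, int]) -> dict[str, int]:
    """Convert percentage shares into per-component token ceilings."""
    if sum(percents.values()) >= 100:
        raise ValueError("shares must leave room for history and reasoning")
    ceilings = {name: window * pct // 100 for name, pct in percents.items()}
    # Whatever is left is reserved for conversation history and reasoning.
    ceilings["history_and_reasoning"] = window - sum(ceilings.values())
    return ceilings


def clamp(component: str, requested: int, ceilings: dict[str, int]) -> int:
    """Grant a component at most its ceiling, however much it asks for."""
    return min(requested, ceilings[component])


ceilings = token_ceilings(WINDOW_TOKENS, CEILING_PERCENT)
granted = clamp("retrieved", 60_000, ceilings)
print(ceilings, granted)
```

Here the retriever asks for 60,000 tokens and is granted 40,000. What to do with the excess is a separate decision. The point is that the request can no longer succeed by default.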

The component that needs a ceiling most urgently is tool output. In one published context-engineering benchmark, 30,400 of 48,400 total tokens came from tool results alone, and between 40% and 60% of those tokens were removable with no measurable performance loss. Tool output is the memory leak of agent systems. An API returns a 200-row JSON blob, the agent needed three of those rows, and the other 197 now sit in the window degrading every subsequent step. Production harnesses have started addressing this directly: Claude Code token-counts every tool result against a budget and persists oversized results to disk instead of inlining them, specifically catching outputs that slip under a byte cap but are token-dense.

The ceiling for tool output should not be a polite suggestion. It should be a hard truncate-or-offload boundary applied before the result ever enters the window. Whatever exceeds it goes to disk, to a summary, or to a follow-up retrieval call — never to the prompt by default.
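A minimal sketch of that boundary, assuming a crude four-characters-per-token estimate in place of a real tokenizer and a temp directory as the offload target. The function name `admit_tool_result` and the stub format are hypothetical, and this is not how any particular harness implements it.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: about 4 characters per token."""
    return (len(text) + 3) // 4


def admit_tool_result(
    text: str, max_tokens: int, spill_dir: Optional[Path] = None
) -> Tuple[str, Optional[Path]]:
    """Return (text allowed into the window, path of the offloaded copy or None)."""
    if count_tokens(text) <= max_tokens:
        return text, None

    # Over budget: persist the full result, then inline only a preview.
    spill_dir = spill_dir or Path(tempfile.gettempdir())
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    path = spill_dir / f"tool-result-{digest}.txt"
    path.write_text(text, encoding="utf-8")

    # The preview takes about half the budget; the pointer line is
    # assumed to fit comfortably in the other half.
    preview = text[: max_tokens * 2]
    stub = (
        f"{preview}\n"
        f"[truncated: {count_tokens(text)} tokens total, full result saved to {path}]"
    )
    return stub, path
```

The check runs on estimated tokens rather than bytes, so a result that is small on disk but token-dense still gets caught.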

Priority and Eviction: What Gets Paged Out Under Pressure

Ceilings handle the steady state. Eviction handles the crisis. When the sum of what everyone wants exceeds the window, something has to leave, and "something" should be a decision, not an accident of string order.

Assign each component a priority that reflects how load-bearing it is for the current step. System instructions and the active task description are non-evictable — they are the kernel; if they go, nothing works. Tool schemas for tools the agent is actively using are high priority. Recent conversation turns are medium. Old tool outputs and stale retrieved chunks are low — they were useful three steps ago and are now mostly noise.

Eviction then walks from the bottom. Sliding-window strategies in production harnesses do exactly this: when usage crosses a threshold, drop the oldest 30% of messages, then keep dropping in 10% increments until you are back under budget. That is a page-replacement algorithm with a different name. The refinement worth adding is that recency is a crude proxy for value. An old tool result that established a key fact may matter more than a recent turn of pleasantries. A good evictor scores by priority and relevance, not just by age — closer to LRU-with-hints than pure FIFO.
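The scoring idea can be sketched as follows. The `Segment` fields, the priority scale, and the relevance hint are hypothetical, and a real evictor would need some way to compute relevance in the first place.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    name: str
    tokens: int
    priority: int       # higher means more load-bearing
    relevance: float    # 0.0 to 1.0 hint for the current step
    age: int            # steps since the segment entered the window
    evictable: bool = True


def evict(segments: List[Segment], budget: int) -> Tuple[List[Segment], List[Segment]]:
    """Drop low-priority, low-relevance, old segments until under budget."""
    kept = list(segments)
    evicted: List[Segment] = []
    total = sum(s.tokens for s in kept)
    # Lowest priority first, then lowest relevance, then oldest.
    candidates = sorted(
        (s for s in segments if s.evictable),
        key=lambda s: (s.priority, s.relevance, -s.age),
    )
    for victim in candidates:
        if total <= budget:
            break
        kept.remove(victim)
        evicted.append(victim)
        total -= victim.tokens
    # If only non-evictable segments remain, the result may still be over budget.
    return kept, evicted


window = [
    Segment("system", 500, priority=3, relevance=1.0, age=9, evictable=False),
    Segment("old_tool_key_fact", 400, priority=1, relevance=0.9, age=6),
    Segment("old_tool_noise", 2000, priority=1, relevance=0.1, age=5),
    Segment("recent_turn", 300, priority=2, relevance=0.2, age=1),
]
kept, evicted = evict(window, budget=1500)
print([s.name for s in kept], [s.name for s in evicted])
```

In this made-up window the older tool result that carries a key fact survives, while the newer but irrelevant one is evicted, which is the behavior a pure age-based policy would get wrong.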

Crucially, eviction is not deletion. A paged-out memory frame goes to disk and can be faulted back in. The agent equivalent is to summarize an evicted block, store the full version externally with a handle, and let the agent re-retrieve it if a later step needs it. The agent should be able to fault its own context back in. That single property — evict to a recoverable place, not to the void — is what separates a scheduler from a guillotine.
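Here is a small sketch of evict-to-a-recoverable-place, assuming an in-memory dictionary as the external store and a first-line placeholder as the summarizer. The `EvictionStore` class and its handle format are invented for illustration. A real system would more likely write to disk or a database and summarize with a model.

```python
from typing import Callable, Dict, Tuple


def first_line(text: str) -> str:
    """Placeholder summarizer; a real system might call a model here."""
    return text.splitlines()[0] if text else ""


class EvictionStore:
    """Holds evicted blocks so they can be faulted back in by handle."""

    def __init__(self, summarize: Callable[[str], str] = first_line) -> None:
        self._blocks: Dict[str, str] = {}
        self._summarize = summarize

    def page_out(self, text: str) -> Tuple[str, str]:
        """Store the full text; return (handle, short stub for the window)."""
        handle = f"ctx-{len(self._blocks) + 1}"
        self._blocks[handle] = text
        stub = f"[evicted {handle}] {self._summarize(text)}"
        return handle, stub

    def fault_in(self, handle: str) -> str:
        """Return the full text for a previously evicted block."""
        return self._blocks[handle]


store = EvictionStore()
handle, stub = store.page_out("Invoice 1042 is overdue.\nFull payment history follows...")
restored = store.fault_in(handle)
```

The stub stays in the window in place of the original block. Exposing `fault_in` to the agent as a tool is one way to let it pull the full version back when a later step needs it.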

The Starvation Failure Mode

Every scheduling system has a characteristic failure, and for token budgets it is starvation: one greedy consumer takes so much of the resource that another consumer cannot make progress.

The concrete version is the verbose tool result that starves the reasoning budget. The agent calls a tool. The tool returns 18,000 tokens. Those tokens are technically "in context," so nothing errors. But the model now has far less effective attention left for the actual reasoning step, and the task instructions have drifted into the U-curve's low-attention middle. The model is not dumber than it was yesterday. It is starved. It is being asked to think in the few hundred tokens of attention that the tool result did not consume.

This is exactly the bug a scheduler exists to prevent. An OS does not let a single process consume all of RAM and wedge every other process; it has limits and a scheduler that balances fairness against priority. The agent equivalent is a reasoning-budget reservation: a guaranteed slice of the window that retrieval and tool output are never allowed to touch, no matter how much they want. You reserve it the way a real-time system reserves CPU for a critical task. The reservation is what guarantees the model always has room to think.
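The reservation itself is simple arithmetic. The sketch below assumes the caller already knows the window size, the fixed costs (system prompt, schemas, task), and how much a tool result or retriever is asking for. The function name and the example numbers are hypothetical.

```python
def admissible_tokens(
    window: int, reserved_for_reasoning: int, fixed_tokens: int, requested: int
) -> int:
    """Tokens a variable consumer may take without touching the reservation."""
    available = window - reserved_for_reasoning - fixed_tokens
    return max(0, min(requested, available))


# A 32k window with 8k reserved for reasoning and 10k of fixed costs:
# an 18k tool result is cut down to the 14k that is actually free.
granted = admissible_tokens(
    window=32_000, reserved_for_reasoning=8_000, fixed_tokens=10_000, requested=18_000
)
print(granted)
```

The reserved slice is subtracted before any variable consumer is considered, so no request, however large, can shrink it.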

Make Context Pressure Observable

The deepest reason token budgets get treated as a prompt problem is that pressure is invisible. You cannot see a context window the way you see CPU in top. So degradation gets diagnosed as "the model seems worse lately" and the team goes prompt-fiddling.

A scheduler fixes this by emitting telemetry. The numbers to put on a dashboard are not exotic:

  • Utilization per component — how much of the window each consumer actually took this step, against its ceiling.
  • Eviction events — what got paged out, when, and from which priority tier.
  • Reasoning headroom — how many tokens were left for the model after fixed costs, per step.
  • Starvation flags — any step where a single tool result exceeded, say, 25% of the window.

With those four metrics, "the agent feels worse" becomes "tool output utilization jumped from 35% to 70% after the schema change on Tuesday, and reasoning headroom is now routinely under 5%." The second is a diagnosable incident. The first is a vibe. Observability is what converts silent quality decay into a chart someone can act on before the bill or the support tickets do it for them.
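A per-step record covering the four metrics listed above could look something like this. The `StepTelemetry` class, its field names, and the 25% default threshold are assumptions, and how the record reaches a dashboard is left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StepTelemetry:
    window: int
    used: Dict[str, int]        # tokens each component took this step
    ceilings: Dict[str, int]    # configured ceiling per component
    largest_tool_result: int = 0
    # Eviction events as (block name, priority tier) pairs.
    evictions: List[Tuple[str, str]] = field(default_factory=list)

    def utilization(self) -> Dict[str, float]:
        """Share of the window each component took."""
        return {name: tokens / self.window for name, tokens in self.used.items()}

    def over_ceiling(self) -> List[str]:
        """Components that exceeded their configured ceiling."""
        return [n for n, t in self.used.items() if t > self.ceilings[n]]

    def reasoning_headroom(self) -> int:
        """Tokens left for the model after everything else is placed."""
        return self.window - sum(self.used.values())

    def starvation_flag(self, threshold: float = 0.25) -> bool:
        """True if one tool result took more than `threshold` of the window."""
        return self.largest_tool_result > threshold * self.window


step = StepTelemetry(
    window=100_000,
    used={"system": 10_000, "tool_output": 70_000},
    ceilings={"system": 15_000, "tool_output": 35_000},
    largest_tool_result=30_000,
    evictions=[("old_tool_noise", "low")],
)
print(step.utilization(), step.over_ceiling(), step.reasoning_headroom(), step.starvation_flag())
```

Emitting one of these per inference call is enough to plot utilization and headroom over time and line them up against deploys.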

There is a cost dimension here too. LLM APIs bill for the entire context on every call, so an agent loop that lets history grow unchecked sees its cumulative token cost grow quadratically with the number of steps: a 20-step run can consume more than 10x the tokens a naive estimate predicts. A scheduler that caps and evicts is not only protecting answer quality. It is the same control surface that keeps the invoice linear.
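The arithmetic behind that multiplier is easy to check under simplifying assumptions: every step adds the same number of tokens to the history, every call resends the full history, and nothing is evicted or discounted by prompt caching. The function names and the 1,000-tokens-per-step figure are made up for the example.

```python
def billed_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Input tokens billed when call i resends all i steps of history."""
    return sum(tokens_per_step * i for i in range(1, steps + 1))


def naive_estimate(steps: int, tokens_per_step: int) -> int:
    """What you get if you count each step's tokens only once."""
    return tokens_per_step * steps


billed = billed_input_tokens(steps=20, tokens_per_step=1_000)
naive = naive_estimate(steps=20, tokens_per_step=1_000)
ratio = billed / naive
print(billed, naive, ratio)
```

For 20 steps the billed total is 1,000 × (1 + 2 + ... + 20) = 210,000 tokens against a naive 20,000, a ratio of 10.5. In general the ratio is (n + 1) / 2, so it keeps growing with run length unless something caps the history.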

Start With the Allocator, Not the Prompt

The practical move is to stop editing prompts in response to quality regressions and instead build the smallest possible allocator. It does not need to be sophisticated to beat what most stacks have today, which is nothing.

A minimum viable token scheduler does four things. It assigns a budget ceiling to each context component. It enforces those ceilings before assembly, truncating or offloading anything over the line. It evicts by priority when total demand exceeds the window, sending evicted content somewhere recoverable. And it logs utilization so pressure is visible. That is a few hundred lines of code, and it sits between your components and the model as a real arbitration layer.

The mental shift is the whole point. "Just trim the prompt" treats the context window as a document to be edited. Treating it as a scheduled resource — with ceilings, priorities, eviction, reservations, and telemetry — treats it as what it actually is: shared memory with many competing consumers and, until you build one, no manager. Operating systems learned that lesson half a century ago. Agent systems are large enough now that they have to learn it too, and the teams that build the allocator will stop debugging their context windows by reading tea leaves.
