The Token Budget Is a Product Decision, Not a Config Value
Somewhere in your codebase there is a line that looks like retriever.search(query, top_k=8). An engineer wrote that 8 in an afternoon. It was never reviewed by anyone outside the team, never appeared in a spec, and has never been revisited. That single integer decides how much of your context window goes to retrieved documents instead of conversation history, how much each request costs, how slow the response feels, and — because of how language models actually behave at length — how accurate the answer is.
That is a product decision. It is sitting in a keyword argument, a few lines from the f-string that assembles your prompt.
The context window is a fixed, scarce budget. Every request spends it across competing line items: the system prompt, conversation history, retrieved documents, accumulated tool output, and the response the model still has to generate. More of one means less of another. Yet almost nobody treats this allocation as a decision with an owner. It gets made implicitly, by whoever last touched the prompt assembly code, and then it ossifies. The result is a product surface — the single most important one in an AI feature — that no product manager has ever seen.
The Context Window Is Zero-Sum, and Nobody Drew the Pie Chart
Start with the physical reality. A model's context window is a fixed number of tokens. Everything you send shares it: instructions, few-shot examples, history, retrieved chunks, tool results, and the model's own output, which has to fit in whatever is left. This is not a soft constraint you can engineer around. It is a budget, and budgets are zero-sum.
In practice the allocation is a quiet tug-of-war. A chatbot with long conversation history wants to keep every prior turn. A RAG feature wants to stuff in ten relevant documents — at 1,500 tokens each, that is 15,000 tokens spent before the model has read the actual question. An agent wants to retain raw tool output: a weather check costs 200 tokens, a database query 3,000, an API call 5,000, and if you append every raw result turn after turn the context balloons within a single session.
Each of these consumers was built by a different person solving a different problem. The history-retention logic lives in one module, the retrieval call in another, the tool-result handling in a third. No single file says "history gets 30%, retrieval gets 40%, tool output gets 20%, the response gets 10%." The pie chart exists — it has to, the tokens are finite — but nobody ever drew it. It is the emergent sum of three independent local decisions, and emergent budgets are always wrong, because no one optimized the whole.
The first move is simply to make the pie chart real. Write down, per feature, where the tokens go on a representative request. The number is almost never what the team guesses. Teams routinely discover that 60% of a "RAG answer" is conversation history nobody reads, or that a tool's raw JSON output is three times the size of the reasoning it informs.
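A minimal sketch of that accounting, in Python. The slot names and the estimate_tokens helper are assumptions for illustration, and the four-characters-per-token rule is only a rough heuristic for English text, so swap in your model's real tokenizer before trusting the percentages.

```python
from typing import Dict


def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (heuristic only)."""
    return max(1, len(text) // 4) if text else 0


def token_shares(slots: Dict[str, str]) -> Dict[str, float]:
    """Return each slot's share of the request's input tokens, as a fraction."""
    counts = {name: estimate_tokens(text) for name, text in slots.items()}
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: count / total for name, count in counts.items()}


def format_pie(shares: Dict[str, float]) -> str:
    """Render the shares as a plain-text table, largest slot first."""
    rows = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return "\n".join(f"{name:<16}{share:6.1%}" for name, share in rows)
```

Feed it the actual strings from a logged request (system prompt, history, retrieved chunks, tool output) and print format_pie(token_shares(...)). The point is less the tool than the habit: one table per feature, built from a real request rather than a guess.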
"Add More Chunks" Is a Product Call Wearing Engineering Clothes
Here is the move that hides the decision. A quality complaint comes in — "the assistant missed something it should have known." The obvious fix is to retrieve more: bump top_k from 8 to 12, keep more history, widen the chunk size. It feels like tuning. It is a config change, it ships in a small PR, and the eval score nudges up. Everyone moves on.
But "retrieve more" is not a neutral knob. It spends three currencies at once.
It spends money. Input tokens are billed. One widely cited comparison: a customer service bot handling 20,000 queries a day with a naive 150k-token history approach costs roughly $9,000 per day; the same bot with context filtered down to the relevant 4,000 tokens costs about $240. That is not a rounding error. It is a roughly 37x swing decided by how aggressively you trim.
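A sketch of the arithmetic behind that comparison. The flat price of $3 per million input tokens is an assumption, chosen because it reproduces both daily figures; it is not a quoted rate, and the calculation ignores output tokens and caching.

```python
def daily_input_cost(queries_per_day: int, tokens_per_query: int,
                     usd_per_million_tokens: float) -> float:
    """Daily spend on input tokens, ignoring output tokens and caching."""
    total_tokens = queries_per_day * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed price, back-solved from the two daily figures in the text.
ASSUMED_USD_PER_MILLION = 3.0

naive = daily_input_cost(20_000, 150_000, ASSUMED_USD_PER_MILLION)    # 9000.0
trimmed = daily_input_cost(20_000, 4_000, ASSUMED_USD_PER_MILLION)    # 240.0
ratio = naive / trimmed                                               # 37.5
```

Because price and query volume cancel out of the ratio, the swing depends only on tokens per query: 150,000 divided by 4,000.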
It spends latency. More input means more prefill. At the extreme, prefill latency for a maximum-length context can exceed two minutes on current hardware, which quietly rules out interactive use. Even far from the limit, every thousand tokens you add is time the user waits before the first token appears.
It spends quality — and this is the counterintuitive part. More context does not reliably mean better answers. It often means worse ones.
So "add more chunks" trades cost, latency, and quality against each other. Deciding how to make that trade is the definition of a product decision. The person who owns the feature's cost-per-request, its latency SLO, and its accuracy target should own it. Instead it is made by whoever happened to be in the retrieval file that week, and it is made invisibly, because a top_k bump does not look like a product change. It looks like a config value.
Context Rot: Why "More" Actively Costs You Accuracy
The assumption underneath "just retrieve more" is that extra context is at worst harmless — wasted money, maybe, but not wrong answers. The research says otherwise.
The "lost in the middle" effect, documented in 2023 by researchers at Stanford and their collaborators, showed that models attend well to the beginning and end of their input and poorly to the middle. With around 20 retrieved documents, accuracy on a fact placed in the middle dropped 15 to 20 percentage points compared to the same fact at the start or end. Position alone, not relevance and not correctness, moved the score that much.
It gets sharper. A 2025 study from Chroma tested 18 frontier models, including the latest from the major labs, and found every one degraded as input length grew, even when the context window was nowhere near full. They named it context rot: more tokens in, worse tokens out. And a late-2025 arXiv paper, pointedly titled "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval," found that even with 100% perfect retrieval — every relevant token present, irrelevant ones replaced with harmless whitespace — performance still degraded as length increased, anywhere from 14% to 85% depending on the task.
Sit with that. Even when retrieval is perfect, length itself is a tax on accuracy. So the engineer who bumps top_k to fix a quality complaint may be making quality worse on every other query — diluting attention, burying the load-bearing fact in the middle, and feeding the model semantically similar distractors that actively mislead it. The eval that "improved" measured the query class that prompted the complaint. The regression is spread thinly across everything else and never shows up as a single broken request.
This is precisely why the decision cannot live in config. A config knob has a "more is safer" intuition baked into it. The token budget has the opposite shape: there is a peak, and both too little and too much fall off it. Only someone reasoning about the whole feature can find the peak.
Measure the Marginal Value of the Last Thousand Tokens
If the token budget is a product decision, it needs the thing every product decision needs: a measurement. Not "is more context good?" but "what did the last thousand tokens buy us?"
This is a marginal-value question, and you can answer it empirically. Hold a feature's eval set fixed and sweep one slot. Run retrieval at top_k of 4, 6, 8, 10, 12 and plot accuracy, cost, and p95 latency at each point. The curve is rarely a line. One production team found that the gain per retrieved passage flattened past four passages; they cut top_k from 6 to 4, held quality steady, and dropped retrieval cost 22%. Independent evaluations tend to find the useful range for retrieved chunks sits somewhere between 4 and 10, with noise and dilution taking over beyond it.
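One way to read such a sweep, sketched in Python. SweepPoint, marginal_gains, and smallest_sufficient_top_k are hypothetical names, and the threshold is a choice you make per feature; the measurements themselves have to come from your own eval runs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SweepPoint:
    top_k: int
    accuracy: float        # fraction correct on the fixed eval set, 0..1
    cost_usd: float        # mean cost per request
    p95_latency_s: float   # 95th-percentile latency in seconds


def marginal_gains(points: List[SweepPoint]) -> List[float]:
    """Accuracy gained per extra retrieved chunk between consecutive settings.

    Assumes every point has a distinct top_k.
    """
    ordered = sorted(points, key=lambda p: p.top_k)
    return [
        (b.accuracy - a.accuracy) / (b.top_k - a.top_k)
        for a, b in zip(ordered, ordered[1:])
    ]


def smallest_sufficient_top_k(points: List[SweepPoint],
                              min_gain_per_chunk: float) -> Optional[int]:
    """Smallest top_k beyond which every further step gains less than the threshold."""
    ordered = sorted(points, key=lambda p: p.top_k)
    if not ordered:
        return None
    gains = marginal_gains(ordered)
    for i, point in enumerate(ordered[:-1]):
        # Every step from this setting onward must fall below the threshold.
        if all(g < min_gain_per_chunk for g in gains[i:]):
            return point.top_k
    # The curve was still climbing at the largest setting that was measured.
    return ordered[-1].top_k
```

If the function returns the largest top_k you measured, the sweep did not reach the flat part of the curve and needs more points. Cost and p95 latency are carried on each point so the same records can feed the plot.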
Do the same sweep for every slot. How much history actually changes the answer? Is turn 10 still earning its tokens, or could you summarize everything older than the last three turns? How much of each tool's raw output does the model need? Could you return a 200-token digest instead of 5,000 tokens of JSON? You can even apply this to the response: research on token-budget-aware reasoning reported an example in which simply telling a model to use a reasonable budget cut chain-of-thought output from 258 tokens to 86, a 67% reduction, while still reaching the correct answer.
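The history variant of that experiment could look like the sketch below. compact_history is a hypothetical helper, and summarize is whatever summarizer you supply (a model call or a template); the cutoff of three turns is the value the sweep is meant to test, not a recommendation.

```python
from typing import Callable, List


def compact_history(turns: List[str],
                    summarize: Callable[[List[str]], str],
                    keep_last: int = 3) -> List[str]:
    """Keep the most recent turns verbatim and replace older ones with one summary."""
    if keep_last < 0:
        raise ValueError("keep_last must be non-negative")
    if len(turns) <= keep_last:
        return list(turns)
    split = len(turns) - keep_last
    older, recent = turns[:split], turns[split:]
    # One summary entry stands in for every turn older than the cutoff.
    return [summarize(older)] + recent
```

Run the eval set at several values of keep_last and compare accuracy against tokens spent. The same shape works for tool output: pass the raw result through a digest function and measure what the model loses.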
Each of those is a measured trade, not a guess. And once you have the curves, the allocation stops being a matter of taste. You can say: this feature's history slot has near-zero marginal value past three turns, so we cap it and spend the freed budget on retrieval, where the curve is still climbing. That is a budget meeting. It is the budget meeting that should have happened before the f-string was written.
Give Product Owners a Dial They Understand
The reason the token budget hides is that it is expressed in the wrong units. "top_k=8, 4,000-token history window, raw tool output" is engineering vocabulary. A product owner cannot reason about it, so they do not, so it stays with engineering by default.
Translate it. Re-express each feature's budget in the three currencies product already understands: cost per request, latency at p95, and accuracy on the eval set. Now the conversation is one a product owner can actually have. "This feature costs 11 cents and answers in 3 seconds at 84% accuracy. We can spend down to 6 cents and 1.8 seconds by trimming history, and the eval barely moves. Or we can push accuracy to 88% by widening retrieval, at 15 cents and 4 seconds. Which feature is this?"
A support deflection bot and a financial-analysis assistant will answer that differently, and they should — one is latency-and-cost sensitive at high volume, the other will gladly pay for accuracy. That divergence is the whole point. It is a product judgment about what the feature is for, and product owners make those judgments well when you hand them a dial labeled in their language instead of a config file labeled in yours.
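To make the dial concrete, here is a small sketch that picks an operating point from limits a product owner might state. OperatingPoint and pick_operating_point are hypothetical names, and the selection rule (most accurate point within the limits) is one assumption among several reasonable ones.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OperatingPoint:
    label: str
    cost_cents: float       # cost per request
    p95_latency_s: float    # 95th-percentile latency in seconds
    accuracy: float         # fraction correct on the eval set, 0..1


def pick_operating_point(points: List[OperatingPoint],
                         max_cost_cents: float,
                         max_p95_latency_s: float) -> Optional[OperatingPoint]:
    """Most accurate point that fits the cost and latency limits, or None."""
    feasible = [
        p for p in points
        if p.cost_cents <= max_cost_cents and p.p95_latency_s <= max_p95_latency_s
    ]
    if not feasible:
        return None
    # Break accuracy ties in favour of the cheaper point.
    return max(feasible, key=lambda p: (p.accuracy, -p.cost_cents))
```

With the two fully specified points from the example above (11 cents, 3 seconds, 84% and 15 cents, 4 seconds, 88%), a feature capped at 12 cents and 3 seconds gets the first, and one allowed 20 cents and 5 seconds gets the second. A None result means the limits cannot be met by any measured configuration, which is itself something the product owner needs to hear.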
Concretely: set an explicit per-feature token budget, the way you set a latency SLO. Give each consumer of the context window — history, retrieval, tool output — a named allocation within it. Make a context-budget line part of the design review for any new AI feature, alongside the latency and cost lines that are already there. And put a regression check on it, so a future top_k bump that quietly doubles cost or trips context rot shows up as a reviewable change, not an invisible one.
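A sketch of what an explicit budget and its regression check might look like. ContextBudget, budget_violations, and the slot names are illustrative assumptions, not an existing library; the measured token counts would come from your own request logs or test fixtures.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ContextBudget:
    feature: str
    total_tokens: int
    allocations: Dict[str, int]  # named slots, e.g. history, retrieval, tool_output

    def __post_init__(self) -> None:
        # Reject a budget whose slots promise more than the total.
        allocated = sum(self.allocations.values())
        if allocated > self.total_tokens:
            raise ValueError(
                f"{self.feature}: slots allocate {allocated} tokens, "
                f"budget is {self.total_tokens}"
            )


def budget_violations(budget: ContextBudget, measured: Dict[str, int]) -> List[str]:
    """List every measured slot that exceeds, or lacks, an allocation."""
    problems = []
    for slot, used in measured.items():
        limit = budget.allocations.get(slot)
        if limit is None:
            problems.append(f"{slot}: no allocation in budget for {budget.feature}")
        elif used > limit:
            problems.append(f"{slot}: used {used} tokens, allocated {limit}")
    return problems
```

Run budget_violations in CI against a representative request and fail the build on a non-empty list. Raising top_k then requires editing the allocation in the same change, which is what puts it in front of a reviewer.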
The F-String Was Always a Spec
None of this requires new infrastructure. The budget already exists — the context window is finite whether or not you acknowledge it, and every request is already spending it. The only question is whether the spending is deliberate.
Right now, for most teams, it is not. It is the residue of three engineers solving three local problems, frozen into a magic number, and protected from review by the fact that it looks like config. The fix is not a framework. It is recognizing that the line allocating your context window is a product spec that got typed into an f-string, and moving the decision to the person who owns the feature's cost, speed, and quality.
Draw the pie chart. Measure the marginal token. Label the dial in cost, latency, and accuracy. Then have the budget meeting — before the f-string, not after the incident.
- https://www.trychroma.com/research/context-rot
- https://arxiv.org/abs/2307.03172
- https://www.morphllm.com/context-rot
- https://www.understandingai.org/p/context-rot-the-emerging-challenge
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://www.getmaxim.ai/articles/5-ways-to-optimize-costs-and-latency-in-llm-powered-applications/
- https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/
- https://blog.bytebytego.com/p/a-guide-to-context-engineering-for
- https://www.augmentcode.com/guides/ai-agent-loop-token-cost-context-constraints
