The Summary Tax: When Compaction Eats More Tokens Than It Saves
A long-running agent crosses its compaction threshold every twelve turns. Each pass costs an LLM call sized to the running window — first 8K tokens, then 14K, then 22K — because the span being summarized grows with every trigger. By turn sixty, the user has spent more tokens watching the agent re-summarize itself than they spent on the actual reasoning that mattered. The cost dashboard reads "user inference cost" as a single number, blissfully unaware that half of it paid for compression of context the user will never look at again.
This is the summary tax: a class of overhead that scales with conversation length, fires invisibly between user turns, and shows up as a single line item that conflates the work the user paid for with the bookkeeping the system did to manage itself. It is the closest thing modern agent architectures have to garbage-collection pause time — and most teams are running production with -verbose:gc turned off.
The shape of the problem is mechanical, not exotic. Production agents accumulate state. State exceeds budget. The system condenses. Each condensation is itself a model call billed at the same rates as the user-facing work, and the unit costs compound in ways the per-turn dashboard cannot see. Once you know to look for it, the numbers are striking; once you have measured it once, you stop trusting any cost report that doesn't split it out.
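To make the mechanics concrete, here is a minimal sketch of that loop. It is not any particular framework's API; `llm`, `count_tokens`, and the 32K budget are hypothetical stand-ins for whatever client, tokenizer, and threshold a real stack uses.

```python
# Minimal sketch of the accumulate / exceed-budget / condense loop described above.
# `llm` and `count_tokens` are hypothetical stand-ins, not a real SDK.

def render(history):
    """Flatten a message list into a single prompt string."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in history)

def run_turn(history, user_msg, llm, count_tokens, budget=32_000):
    history.append({"role": "user", "content": user_msg})

    # The compaction pass: an extra model call, billed at the same per-token
    # rate as the user-facing work, fired between turns where no dashboard
    # line item attributes it to anything.
    if count_tokens(render(history)) > budget:
        summary = llm("Summarize this conversation so far:\n" + render(history))
        history = [{"role": "system", "content": summary}]

    # The call the user actually asked for.
    reply = llm(render(history))
    history.append({"role": "assistant", "content": reply})
    return history, reply
```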
How the Tax Compounds
Naive compaction is implemented as "summarize everything before the cutoff each time we cross the threshold." The problem with that policy is quadratic. The span requiring summarization grows with each trigger, so per-pass summarization cost and latency increase linearly with conversation length, and the cumulative summarization spend across a session grows like a triangular sum.
Several teams have published the math behind the compounding. Cost growth in long agent sessions scales roughly with n(n+1)/2 where n is the number of context-extending turns, not linearly with turns; teams that model per-turn costs independently underprice their systems by three to five times once accumulation is factored in. A 50% increase in average context length translates directly to a 50% increase in inference cost on the user-visible call — and also feeds into a larger summarization span next time the threshold fires.
The quadratic piece is what wrecks intuition. A first-pass mental model says "compaction costs one extra LLM call now and then." A measured one says "compaction costs one extra LLM call sized to the running history, fired at a frequency proportional to the context-fill rate, growing across the session." Those two models diverge by an order of magnitude on a thirty-turn session.
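A back-of-the-envelope simulation shows the gap. Every number here (fresh tokens per turn, trigger frequency, session length) is an illustrative assumption, not a measurement; the point is that the naive policy's cumulative spend follows the triangular sum from above, while the flat mental model only grows linearly.

```python
# Compare the two mental models under illustrative assumptions.
FRESH_TOKENS_PER_TURN = 1_500   # new user/assistant/tool tokens added per turn
TRIGGER_EVERY = 6               # turns between compaction passes
TURNS = 30

flat_estimate = 0       # "one extra call now and then": every pass costs what the first did
measured_estimate = 0   # naive policy: every pass re-reads the whole transcript so far

transcript_tokens = 0
first_pass_size = FRESH_TOKENS_PER_TURN * TRIGGER_EVERY
for turn in range(1, TURNS + 1):
    transcript_tokens += FRESH_TOKENS_PER_TURN
    if turn % TRIGGER_EVERY == 0:
        flat_estimate += first_pass_size
        measured_estimate += transcript_tokens

# The measured total is the triangular sum: increment * k(k+1)/2 for k triggers.
k = TURNS // TRIGGER_EVERY
assert measured_estimate == first_pass_size * k * (k + 1) // 2

print(f"flat mental model: {flat_estimate:,} summarization input tokens")
print(f"measured policy:   {measured_estimate:,} summarization input tokens")
# The ratio between the two is (k + 1) / 2 for k triggers: modest for a handful
# of passes, roughly an order of magnitude once heavy tool output forces a pass
# every turn or two.
```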
Summary-of-Summaries Drifts
Hierarchical summarization is the obvious response: instead of re-summarizing the original transcript every time, summarize the previous summary plus the new turns. It bounds the per-pass cost. It also bleeds fidelity on every iteration, and the bleeding is asymmetric.
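The pattern, sketched minimally below with a hypothetical `llm` callable and illustrative prompt wording: each pass reads only the previous summary plus the turns written since it, so input per pass stays bounded no matter how long the session runs.

```python
# Sketch of a rolling (summary-of-summaries) pass. `llm` is a hypothetical
# model client; the prompt wording and the 500-word cap are illustrative.
def roll_summary(prev_summary, new_turns, llm):
    prompt = (
        "You maintain a running summary of a long conversation.\n\n"
        f"Current summary:\n{prev_summary or '(none yet)'}\n\n"
        "Turns since that summary was written:\n"
        + "\n".join(new_turns)
        + "\n\nRewrite the summary to fold in the new turns. Keep it under 500 words."
    )
    # Bounded input: one summary plus one window of turns, regardless of session age.
    # The cost: anything the previous summary already dropped can never come back.
    return llm(prompt)
```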
Each pass throws away detail. A small fact dropped on pass three cannot be recovered on pass seven; the model is now summarizing a summary that already lost it. Long-running sessions develop a characteristic decay curve where early facts get progressively more abstract until they're gone, and the agent's behavior stops matching what the user said two hours ago. The published tells are familiar: "if a bad fact enters the summary, it can poison future behavior"; "over very long conversations the summary might drift from the original intent or lose important foundational context." Models with around a million-token window still can't single-pass-summarize the longest sessions, so multi-stage chunking compounds the latency and the drift together.
What you end up with is a memory system that's cheap but slowly lying. The cheap part is real — the per-pass cost stays bounded. The lying part shows up as a slow-rising rate of "the agent forgot what I told it" tickets, classified by the prompt team as a model-quality issue and shipped to the prompt-tuning queue, where it lives forever because the prompt isn't actually the bug.
The Per-Session Ledger Most Teams Don't Keep
Cost dashboards in most production agent stacks have one column for inference. That column is the dominant lie. A useful ledger splits it into at least three lines:
- Primary inference: tokens spent on calls that produce a user-visible response or progress an agent step.
- Compaction overhead: tokens spent on summarization, summary-of-summaries, structured-memory extraction, or any other "manage the context window" call.
- Tool payload accumulation: tokens spent re-sending the cumulative tool-output history on each subsequent turn, which is invisible at the call level but visible per session.
Once you draw that split, two derived metrics fall out, and they're more diagnostic than any per-token price. The first is the summary-token ratio: compaction tokens divided by primary tokens, per session. Sessions in the long tail of duration almost always have a ratio that should worry you; north of one is not unheard of. The second is marginal utility per pass: for each summarization, did downstream turns actually re-load less context because of it, or did the agent re-fetch what was summarized via a tool call anyway? Compaction that doesn't reduce later context is pure tax.
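A sketch of what that ledger can look like, assuming every model call is logged with a purpose tag and a count of re-sent tool output. The field names and figures are hypothetical, and marginal utility per pass would need a before/after context measurement on top of this.

```python
# Per-session ledger sketch. Assumes each model call is logged with a purpose
# tag ("primary" or "compaction") and, for primary calls, how much of the prompt
# was re-sent tool output. Field names and figures are hypothetical.
from dataclasses import dataclass

@dataclass
class Call:
    purpose: str                    # "primary" or "compaction"
    input_tokens: int               # full prompt size for this call
    output_tokens: int
    replayed_tool_tokens: int = 0   # portion of the prompt that is re-sent tool output

def session_ledger(calls: list[Call]) -> dict:
    primary = compaction = tool_replay = 0
    for c in calls:
        spent = c.input_tokens + c.output_tokens
        if c.purpose == "compaction":
            compaction += spent
        else:
            primary += spent
        tool_replay += c.replayed_tool_tokens
    return {
        "primary": primary,
        "compaction": compaction,
        "tool_payload_accumulation": tool_replay,
        "summary_token_ratio": compaction / max(primary, 1),
    }

# A session where the bookkeeping outweighed the work the user asked for:
calls = [
    Call("primary", 6_000, 800, replayed_tool_tokens=2_500),
    Call("compaction", 14_000, 900),
    Call("primary", 9_000, 700, replayed_tool_tokens=4_000),
    Call("compaction", 22_000, 1_000),
]
print(session_ledger(calls))   # summary_token_ratio comes out above 2 here
```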
Sources
- https://learn.microsoft.com/en-us/agent-framework/agents/conversations/compaction
- https://factory.ai/news/compressing-context
- https://kargarisaac.medium.com/the-fundamentals-of-context-management-and-compaction-in-llms-171ea31741a2
- https://blog.jetbrains.com/research/2025/12/efficient-context-management/
- https://mem0.ai/blog/llm-chat-history-summarization-guide-2025
- https://www.morphllm.com/llm-cost-optimization
- https://arxiv.org/html/2510.00615v1
- https://dev.to/waxell/ai-agent-context-window-cost-the-compounding-math-your-architecture-is-hiding-2227
- https://wasnotwas.com/writing/context-compaction/
- https://snap-research.github.io/locomo/
