The Sliding-Window Tax: Why a 30-Turn Conversation Costs More Than 30× a Single Turn
The conversation looks healthy on the dashboard. Average tokens per call is sane, the p50 input length is comfortably inside the cached prefix, the provider invoice ticks up at the rate finance approved. Then someone exports a single 200-turn coding session and the line item for that one user is larger than the rest of the team's daily traffic combined. The dashboard wasn't lying — it was averaging. The bill comes from the long tail, and the long tail does not scale linearly with turn count.
Every multi-turn AI feature eventually meets this surprise. The per-call token count is the wrong unit of measurement, because the cost of a 30-turn conversation is not 30 times the cost of a single turn — it's something between 50× and 200×, depending on how the history is structured, how the prompt cache decays, and what tier the request lands in once the input crosses 200K tokens. The team that priced the feature off the per-call number is underwriting a tail it never modeled.
This is the sliding-window tax. It's structural, it's invisible on per-call dashboards, and it's mostly governed by three couplings the team rarely surfaces: cache locality decays as the window slides, attention cost grows with context length, and retries pay for the full history, not the marginal turn. Each of those couplings deserves a name, because a team that can't name the failure mode can't budget for it.
Why each turn re-sends the whole conversation
The mental model most engineers ship with is "the model has memory of the conversation." It does not. Every turn re-sends the entire history (or a windowed subset of it) as input tokens. Turn 1 sends the system prompt plus one user message. Turn 28 sends the system prompt plus 27 prior exchanges plus the new user message, and gets billed for every token of it.
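A minimal sketch of that loop makes the mechanism hard to un-see. The stub below stands in for a real chat-completions call, and the four-characters-per-token estimate is a crude assumption used only to show the billed input growing while each new message stays small:

```python
# What "multi-turn" means on the wire: the client keeps the transcript and
# re-sends all of it as input on every call. call_model is a stub standing in
# for a real chat API; the 4-characters-per-token estimate is an assumption.

history = []  # the client owns the transcript; the API does not remember it

def call_model(messages) -> str:
    """Stub for a chat call: everything in `messages` is billed as input tokens."""
    billed_input = sum(len(m["content"]) // 4 for m in messages)
    turn = sum(m["role"] == "user" for m in messages)
    print(f"turn {turn:2d}: ~{billed_input:,} input tokens billed")
    return "ok"  # canned reply in place of the model's answer

def take_turn(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # pays for the entire history, not just user_text
    history.append({"role": "assistant", "content": reply})
    return reply

for i in range(5):
    take_turn(f"message {i}: " + "x" * 800)
```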
If the average exchange is 800 tokens, then by turn 28 the per-call input carries roughly 21,600 tokens of accumulated history, plus the new message and whatever the system prompt costs, billed again in full on every turn. The cumulative input across 28 turns is the triangle number, roughly 800 × (27 × 28 / 2) ≈ 300K tokens of history alone, not 28 × 800. Coding agents make this worse: a single agentic session can make 50–100 API calls, each carrying an expanding conversation history with accumulated tool outputs, and the 50th call routinely includes 150K tokens of context.
The naive cost model, "tokens per call × call count," multiplies a per-call snapshot by the number of calls and gets a believable answer. The honest cost model treats the conversation as a single object whose total input grows roughly quadratically with turn count, and prices the feature against that curve. Most pricing decisions in the industry are still made against the naive model, which is why the long-session tail keeps surprising people.
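To see how far apart the two models drift, here is a back-of-the-envelope version of both. The 2,000-token system prompt is an assumed figure; the 800-token exchange matches the example above, and caching and trimming are deliberately ignored:

```python
# Back-of-the-envelope input-token bill for an n-turn conversation.
# Assumptions: a 2,000-token system prompt, 800 tokens per exchange,
# no caching, no trimming, no tool output.

SYSTEM = 2_000    # system prompt tokens, re-sent on every turn
EXCHANGE = 800    # average tokens per user+assistant exchange

def per_turn_input(turn: int) -> int:
    """Input billed on one turn: system prompt + all prior exchanges + the new one."""
    return SYSTEM + (turn - 1) * EXCHANGE + EXCHANGE

def naive_total(turns: int) -> int:
    """The dashboard model: a typical early call multiplied by call count."""
    return turns * per_turn_input(1)

def actual_total(turns: int) -> int:
    """The billed model: every turn pays for the whole history again."""
    return sum(per_turn_input(t) for t in range(1, turns + 1))

for n in (1, 10, 30):
    print(f"turns={n:3d}  naive={naive_total(n):>8,}  actual={actual_total(n):>8,}  "
          f"vs one turn={actual_total(n) / per_turn_input(1):6.1f}x")
```

At 30 turns the billed total is roughly five times what the naive model predicts and about 150 times the cost of a single turn, which is the neighborhood the 50× to 200× claim comes from.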
Prompt cache locality decays as the window slides
The standard mitigation for re-sending the prefix every turn is prompt caching. Cache the system prompt and the early conversation, pay the cached-read rate on the next turn instead of full input price for the prefix, problem solved. That works on turn 2. By turn 28 it has quietly stopped working, for reasons that are not in the marketing material.
Three things break cache hits in a long session:
- TTL expiry. Anthropic's default prompt cache TTL is 5 minutes. Any pause longer than that — a user walks away, an agent waits on a tool — and the cached entry expires. The next turn re-uploads the full prefix at the write rate (1.25× input) instead of reading it at the read rate (0.1× input). The cache change from a 1-hour TTL to 5 minutes in early March 2026 was not announced; teams discovered it through a 30–60% cost spike and the realization that their 5-hour quota now ran out mid-day for the first time.
- Sliding-window drift. When old messages are trimmed to fit the token budget, the trim point shifts by a few messages each turn. Every message index in the context changes, the cached prefix no longer matches byte for byte, and the entire cache is invalidated. The fix is non-obvious: pin the trim boundary at a stable position, summarize rather than truncate the middle, and accept that "trim two more messages" is a more expensive operation than it looks (a sketch follows this list).
- Content-block limits. Anthropic's documented caveat: if your prompt has more than 20 content blocks before a cache breakpoint and you modify anything earlier than those 20 blocks, you don't get a cache hit. Multi-tool agents blow past this routinely.
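The drift fix is easier to see in code than in prose. A rough sketch of a cache-friendly trimming policy, assuming a prefix-matching cache like the one described above; count_tokens and summarize are stand-ins for whatever tokenizer and compaction step you already run, and the ten-message block size is arbitrary:

```python
BLOCK = 10  # move the trim boundary only in steps of this many messages

def naive_trim(messages, budget, count_tokens):
    """Drops the minimum number of old messages each turn. The trim point
    drifts by a message or two per call, the prefix changes every time,
    and the prompt cache never gets a second hit."""
    while sum(count_tokens(m) for m in messages) > budget:
        messages = messages[1:]
    return messages

def stable_trim(messages, budget, count_tokens, summarize):
    """Cuts whole blocks instead of single messages and replaces each dropped
    block with one summary message. Between boundary moves the prefix stays
    byte-identical turn over turn, so cached reads keep hitting."""
    while sum(count_tokens(m) for m in messages) > budget and len(messages) > BLOCK:
        dropped, messages = messages[:BLOCK], messages[BLOCK:]
        messages = [summarize(dropped)] + messages
    return messages
```

The point is not the block size; it's that the prefix changes once per block instead of once per turn, so most turns stay cache reads and only the boundary-moving turn pays the write rate.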
A prefix that hits the cache 80% of the time on turn 2 routinely hits only 30% of the time by turn 28. The per-token price did not change. The effective cost per turn doubled or tripled, and the dashboard that averages cache hit rate across all calls is the worst place to notice.
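The arithmetic behind that multiplier takes only a few lines. The 0.1× read and 1.25× write multipliers are the ones quoted above; the $3-per-million base rate, the 22,000-token cached prefix, and the 800 new tokens per turn are illustrative assumptions:

```python
# Expected input cost of one turn as the prompt-cache hit rate decays.
BASE = 3.00 / 1_000_000   # $ per input token, assumed Sonnet-class rate
READ, WRITE = 0.1, 1.25   # cache-read and cache-write multipliers on the base rate

def cost_per_turn(prefix_tokens: int, new_tokens: int, hit_rate: float) -> float:
    """The cached prefix is either read (0.1x) or re-written (1.25x);
    the new tokens are always billed at the full input rate."""
    prefix_rate = hit_rate * READ + (1 - hit_rate) * WRITE
    return BASE * (prefix_tokens * prefix_rate + new_tokens)

for hit in (0.8, 0.5, 0.3):
    print(f"hit rate {hit:.0%}: ${cost_per_turn(22_000, 800, hit):.4f} per turn")
```

Dropping from 80% to 30% hits takes this example from about 2.4 cents to about 6.2 cents per turn, without a single price change from the provider.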
Context-length surcharges are a step function, not a slope
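This is the tier mentioned at the top. On the long-context tiers that offer it, the surcharge typically applies to the entire request once its input crosses the 200K-token line, not just to the tokens past the threshold, so a 201K-token call can cost roughly double a 199K-token call. A minimal sketch; the 200K cutoff matches the published tiers, while the dollar rates are illustrative assumptions rather than a quote of any price sheet:

```python
# A long-context surcharge is a step function: crossing the threshold
# re-prices every token in the request, not just the overage.
THRESHOLD = 200_000
BASE_RATE = 3.00 / 1_000_000   # $/input token under the threshold (assumed)
LONG_RATE = 6.00 / 1_000_000   # $/input token once the request crosses it (assumed)

def input_cost(tokens: int) -> float:
    """Bill every token at the tier the whole request lands in."""
    rate = LONG_RATE if tokens > THRESHOLD else BASE_RATE
    return tokens * rate

for t in (199_000, 200_000, 201_000):
    print(f"{t:>7,} input tokens -> ${input_cost(t):.2f}")
```

For a coding agent whose 50th call already carries 150K tokens of accumulated context, that step is not a corner case; it is a few tool calls away, and every turn after the first crossing pays the surcharge on the whole history.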
