The KV Cache Eviction Your Provider Called Cache Pressure and Your Bill Called a Doubled Prefix Charge
Your application opens a long conversation with a forty-thousand-token system prompt and a full tool inventory. Turn 1 pays the prefix at write rates and the provider's KV cache warms up. Turn 2 arrives ninety seconds later. You assume it's a cache hit. Sometimes it is. Sometimes the same forty thousand tokens land on your invoice again at uncached prices, and nothing in your code changed between turn 1 and turn 2.
The thing that changed was somebody else's traffic. The KV cache is shared infrastructure. Your tenant was colocated on a serving node that, in the ninety seconds between your two turns, took on enough other tenants to evict your prefix from memory. The provider's dashboard will describe this as "cache pressure." Your finance team will describe it as a line item that doubled. Both descriptions are accurate. Neither is in your code.
The architectural assumption underneath most prompt-caching deployments is that the cache is a discount you can plan against — sign up for the rate, structure your prompts for prefix stability, and the savings show up. The honest version is that the cache is a discount you have to forecast probabilistically, because the input that determines whether you get it is the load profile of strangers sharing your serving node.
The Two Ledgers That Should Agree And Don't
There are two ledgers in any prompt-caching system, and they live on opposite sides of a wall you can't see through.
The provider's ledger tracks "cache hit rate": the fraction of tokens that matched a warm prefix and got served at the discounted read rate. This is a useful operational metric for the people running the inference cluster, and most providers expose it per response in the usage object (prompt_tokens_details.cached_tokens on OpenAI, cache_read_input_tokens on Anthropic) so you can see it on your side too.
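To compare the two providers' per-response fields on one scale, it can help to fold both usage shapes into a single record. The following is a minimal sketch that works on plain dicts rather than on either SDK's response objects; the PrefixUsage record and the normalize_usage helper are hypothetical names introduced here, and the sketch assumes the documented accounting in which Anthropic's input_tokens excludes cached tokens while OpenAI's prompt_tokens includes them.

```python
from dataclasses import dataclass


@dataclass
class PrefixUsage:
    cached_read: int   # tokens served from a warm prefix
    cache_write: int   # tokens written to the cache (Anthropic shape only)
    uncached: int      # input tokens billed at the base input rate


def normalize_usage(usage: dict) -> PrefixUsage:
    """Map an OpenAI- or Anthropic-shaped usage dict onto one record."""
    if "cache_read_input_tokens" in usage or "cache_creation_input_tokens" in usage:
        # Anthropic shape: input_tokens excludes cache reads and cache writes.
        return PrefixUsage(
            cached_read=usage.get("cache_read_input_tokens") or 0,
            cache_write=usage.get("cache_creation_input_tokens") or 0,
            uncached=usage.get("input_tokens") or 0,
        )
    # OpenAI shape: cached_tokens is a subset of prompt_tokens.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    total = usage.get("prompt_tokens") or 0
    return PrefixUsage(cached_read=cached, cache_write=0, uncached=total - cached)
```

Logging this record for every response gives you a per-request view that can later be reconciled against the invoice.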
Your ledger is your invoice. It tracks tokens billed at write rates, tokens billed at read rates, and the total dollar amount you owe at the end of the month. The two should agree: high cache hit rate, low write charges, healthy bill. They often don't, and the gap is the thing nobody planned for.
The gap exists because a hit rate counts every request equally, while the bill weights each request by what it cost. A request that hits the cache reports a high cached_tokens value and is cheap. A request that misses reports zero cached tokens and pays for the full prefix at write rates. The provider's dashboard shows you the average across both. The bill is dominated by the requests that missed, because each of those costs several times what a cached request does.
If your hit rate drops from 80% to 60% on a high-volume workload, that sounds like a 25% degradation. But the 20% of requests that flipped from hit to miss aren't 25% more expensive each. On the prefix they are 12.5x more expensive, because on Anthropic's published pricing cache reads are billed at 0.1x the base input rate and five-minute cache writes at 1.25x. Expected prefix cost per request goes from 0.33x to 0.56x of the base rate, an increase of roughly 70%. So a modest-looking hit-rate move is a multiplicative move on the bill, and the bill is the only ledger that pays rent.
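The arithmetic can be checked in a few lines. This sketch uses the 0.1x read and 1.25x write multipliers quoted above and assumes, for simplicity, that every miss pays the write rate on the whole prefix; it ignores non-prefix input tokens and output tokens. The function name is illustrative.

```python
READ_MULT = 0.10   # cache read, as a multiple of the base input rate
WRITE_MULT = 1.25  # cache write paid on a miss


def prefix_cost_per_request(hit_rate: float) -> float:
    """Expected prefix cost per request, in units of one prefix at the base rate."""
    return hit_rate * READ_MULT + (1.0 - hit_rate) * WRITE_MULT


before = prefix_cost_per_request(0.80)   # 0.08 + 0.25 = 0.33
after = prefix_cost_per_request(0.60)    # 0.06 + 0.50 = 0.56
increase = after / before - 1.0          # roughly 0.70, i.e. about 70% more
miss_vs_hit = WRITE_MULT / READ_MULT     # 12.5: one miss costs as much as 12.5 hits
```

A 20-point drop in hit rate therefore raises the prefix spend by about 70% under these assumptions, not by 25%.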
KV Cache As Infrastructure You Don't Own
To understand why your hit rate moves without you touching anything, you have to look at where the KV cache actually lives.
In a modern LLM serving stack — vLLM, TensorRT-LLM, the proprietary forks the major providers run — the KV cache is GPU memory carved into fixed-size blocks. Each request reserves blocks as it generates tokens, and the serving system shares prefix blocks across requests that have a common starting context. When a request finishes, its non-shared blocks are freed. When the serving node is under pressure, the eviction policy (usually a variant of LRU) reclaims blocks from cached prefixes that haven't been touched recently.
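The mechanics can be illustrated with a toy model. This is a deliberately simplified sketch, not how vLLM or any provider implements it: it evicts whole prefixes rather than individual blocks, models a single node, and uses an assumed block size of 16 tokens. ToyPrefixCache and the tenant names are hypothetical.

```python
from collections import OrderedDict


class ToyPrefixCache:
    """Toy model: one node, fixed block budget, whole-prefix LRU eviction."""

    def __init__(self, capacity_blocks: int, block_tokens: int = 16):
        self.capacity = capacity_blocks
        self.block_tokens = block_tokens
        self.entries = OrderedDict()  # prefix key -> blocks held, oldest first
        self.used = 0

    def request(self, prefix_key: str, prefix_tokens: int) -> bool:
        """Return True on a cache hit, False on a miss that writes the prefix."""
        if prefix_key in self.entries:
            self.entries.move_to_end(prefix_key)  # a hit refreshes recency
            return True
        blocks = -(-prefix_tokens // self.block_tokens)  # ceiling division
        while self.entries and self.used + blocks > self.capacity:
            _, freed = self.entries.popitem(last=False)  # evict least recently used
            self.used -= freed
        if blocks <= self.capacity:
            self.entries[prefix_key] = blocks
            self.used += blocks
        return False


node = ToyPrefixCache(capacity_blocks=5000)
node.request("you", 40_000)          # turn 1: miss, 2500 blocks written
node.request("tenant-a", 30_000)     # neighbor arrives: 1875 blocks, still fits
node.request("tenant-b", 30_000)     # no room left: the oldest entry ("you") is evicted
hit = node.request("you", 40_000)    # turn 2: False, the prefix is written again
```

Production schedulers evict at block granularity and deduplicate shared blocks across requests, which this toy omits.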
This is a beautifully engineered system. It also means three things from your seat:
First, the cache lives in GPU memory on a specific node, not in some abstract provider-wide store. Your prefix is warm on the node that just served you and cold everywhere else. If the provider's load balancer routes your next request to a different node — because the original is busy, or because of a deployment, or because traffic patterns shifted — you take a cold start regardless of how recently you used the cache.
Second, the eviction horizon depends on traffic from tenants you can't see. Your prefix occupies a finite number of blocks. Every other tenant on the same node is competing for the same memory. On a quiet weekend, your prefix may sit warm in cache for the full TTL. On a busy weekday, the same prefix may evict in under a minute even though the provider's nominal cache TTL is five or sixty minutes.
Third, the cache TTL on the contract isn't the cache TTL in practice. Anthropic's published five-minute window is a minimum lifetime: entries are deleted promptly but not immediately after expiration, and each hit resets the timer. But the guarantee runs one way only: nothing in the contract says your entry will survive the full TTL under load. Earlier in 2026, a quiet shift in default TTL behavior caused a documented step-change in many teams' bills, and the change was visible only by reading post-mortems and dev.to writeups, not by reading code or release notes.
You don't control any of this. The provider does, and the provider's incentive is to maximize aggregate cluster throughput, not to give your specific workload predictable cache residency.
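The second point, that the eviction horizon is set by other tenants' traffic, can be put into a back-of-envelope estimate. The sketch below assumes pure LRU, under which an idle entry is evicted once neighbors have brought in roughly one cache's worth of fresher blocks. The function name and all numbers are illustrative assumptions, not measurements from any provider.

```python
def expected_warm_seconds(capacity_blocks: int,
                          neighbor_blocks_per_sec: float,
                          ttl_seconds: float) -> float:
    """Rough time an untouched prefix stays resident on one node under pure LRU."""
    if neighbor_blocks_per_sec <= 0:
        return ttl_seconds  # no competing traffic: only the TTL ends residency
    # Whichever comes first: the TTL, or neighbors cycling through the cache.
    return min(ttl_seconds, capacity_blocks / neighbor_blocks_per_sec)


quiet = expected_warm_seconds(100_000, 50, ttl_seconds=300)      # TTL-bound: 300 s
busy = expected_warm_seconds(100_000, 2_500, ttl_seconds=300)    # load-bound: 40 s
```

With a ninety-second gap between turns, the quiet case is a hit and the busy case is a miss, even though the nominal TTL is the same in both.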
The Failure Mode That Has No Alert
The shape of the failure is annoying because it has no alarm. Nothing breaks. No latency spike crosses a threshold. No error rate climbs. The product still works. The model still answers. The conversation still completes.
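Since nothing in the request path raises an alarm, the signal has to be built from the usage fields themselves. One possible shape for such a check, offered as a sketch: track the rolling share of prompt tokens that were not served from cache and flag when it crosses a threshold. The MissShareMonitor class, the window size, and the 30% threshold are all assumptions to be tuned per workload.

```python
from collections import deque


class MissShareMonitor:
    """Rolling share of prompt tokens that were NOT served from cache."""

    def __init__(self, window: int = 500, threshold: float = 0.30):
        self.samples = deque(maxlen=window)  # (cached_tokens, total_prompt_tokens)
        self.threshold = threshold

    def record(self, cached_tokens: int, total_prompt_tokens: int) -> bool:
        """Add one response; return True when the window breaches the threshold."""
        self.samples.append((cached_tokens, total_prompt_tokens))
        total = sum(t for _, t in self.samples)
        if total == 0:
            return False
        cached = sum(c for c, _ in self.samples)
        return (1.0 - cached / total) > self.threshold


monitor = MissShareMonitor(window=4, threshold=0.30)
monitor.record(40_000, 40_000)            # hit
monitor.record(40_000, 40_000)            # hit
alert = monitor.record(0, 40_000)         # miss: a third of recent tokens uncached
```

Weighting by tokens rather than by request count keeps the signal closer to what the invoice will show.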
- https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- https://developers.openai.com/api/docs/guides/prompt-caching
- https://openai.com/index/api-prompt-caching/
- https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-caching.html
- https://arxiv.org/abs/2309.06180
- https://arxiv.org/pdf/2509.04377
- https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
- https://dev.to/whoffagents/anthropic-silently-dropped-prompt-cache-ttl-from-1-hour-to-5-minutes-16ao
- https://dev.to/whoffagents/claude-prompt-caching-in-2026-the-5-minute-ttl-change-thats-costing-you-money-4363
- https://github.com/anthropics/claude-code/issues/46829
- https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching
- https://www.morphllm.com/llm-cost-optimization
