
The Slow Turn That Wasn't Yours: KV Cache Eviction Mid-Conversation

Tian Pan
Software Engineer

A conversation has been moving along on a single Claude session for forty minutes. Eleven turns, each averaging 800ms time-to-first-token, each cheap because the 28,000-token prefix is hitting the prompt cache. Turn twelve arrives and TTFT is 3.4 seconds. The transcript hasn't changed shape. The model didn't switch. The network is fine. Cached input tokens drop from 27,800 to 0. The turn's prefill bill is paid in full, from the first token.

You go looking for the cause in your traces and find nothing that names it. There is no event in your logs labeled "another tenant's burst evicted you." The only honest reading of the spike is that some other customer's prompt, somewhere on the same GPU pool, made the scheduler decide your warm prefix was the cheapest thing to drop. You cannot replay the turn. You cannot prove the eviction. The cache state at that moment was a function of strangers' traffic, and that traffic is not in your trace because it was never yours to see.

This is the failure mode that the words "prompt caching" hide. The discount is real when it works, but the cache is shared infrastructure your per-request metrics treat as private. Your SLO promises a latency budget that depends on a state your client never owned and your provider does not guarantee. Long-running conversations make the gap visible because they are the workloads that pay the most for cache hits and notice the most when one disappears.

The cache you cannot see

Prompt caching at the provider level is a KV cache: the key and value tensors produced during prefill, held in GPU memory so the next request with the same prefix can skip the prefill compute. Anthropic prices reads at 10% of base input; OpenAI also discounts cached input, though the rate varies by model. The math is straightforward when you hit. At a base input rate of $3 per million tokens, a 28,000-token system prompt that costs about $0.084 per turn at full price costs about $0.008 with a cache hit. Multiply by hundreds of turns a day across thousands of conversations and the cache is the difference between a viable product and one that bleeds money.
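The arithmetic is worth keeping as a function you can rerun when prices change. This is a minimal sketch: the $3 per million tokens base rate is an assumption (it is the rate implied by the $0.084 figure for a 28,000-token prefix), and `input_cost_usd` is a hypothetical helper, not part of any SDK.

```python
def input_cost_usd(prefix_tokens: int, price_per_mtok: float,
                   cached: bool, read_multiplier: float = 0.10) -> float:
    """Input cost of one turn's prefix, in dollars.

    price_per_mtok is the base input price per million tokens (assumed).
    read_multiplier is the cache-read discount (10% of base, per the article).
    """
    rate = price_per_mtok * (read_multiplier if cached else 1.0)
    return prefix_tokens * rate / 1_000_000


cold = input_cost_usd(28_000, 3.00, cached=False)  # full prefill
warm = input_cost_usd(28_000, 3.00, cached=True)   # cache hit
print(f"cold ${cold:.4f}  warm ${warm:.4f}  penalty per miss ${cold - warm:.4f}")
```

The number to watch is the penalty per miss, not the discount per hit: it is what a single eviction adds to a turn.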

The trouble is that you do not own the memory the cache lives in. The GPU pool is shared. The cache is filled with whatever recent prefixes have come through, and capacity is finite. When pressure rises (a launch by a larger customer, a regional traffic spike, a long batch of long prompts), entries get evicted. The provider has an eviction policy and does not publish it. In practice it behaves like LRU along the time dimension, with Anthropic documenting a five-minute minimum TTL refreshed on every hit, and OpenAI describing automatic clears after five to ten minutes of inactivity that can stretch to an hour during off-peak. Both leave the actual decision to internal scheduling that responds to load you cannot observe.
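The documented part of that behavior, a TTL refreshed on every hit, is simple enough to model on the client side. The sketch below assumes a plain sliding window and nothing else; `first_expired_turn` is a hypothetical helper, and it says nothing about capacity-driven eviction, which can strike well inside the window.

```python
from typing import Optional, Sequence


def first_expired_turn(turn_times_s: Sequence[float],
                       ttl_s: float = 300.0) -> Optional[int]:
    """Return the 1-based index of the first turn that arrives after the
    TTL has lapsed since the previous turn, or None if no gap is that long.

    Assumes every turn refreshes the TTL (sliding window).
    """
    for i in range(1, len(turn_times_s)):
        if turn_times_s[i] - turn_times_s[i - 1] > ttl_s:
            return i + 1
    return None


# Turn timestamps in seconds since the conversation started.
times = [0, 60, 200, 520, 560]
print(first_expired_turn(times))               # the 320 s pause precedes turn 4
print(first_expired_turn(times, ttl_s=3600))   # no gap exceeds one hour
```

If a miss shows up on a turn this function does not flag, the TTL is not the explanation, and the remaining suspect is load you cannot see.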

Anthropic's March 2026 change made the boundary sharper: the default TTL was silently dropped from one hour to five minutes. Teams that built around the old window discovered their cache hit rate collapsed without their code changing. The lesson is not about the specific number. It is that the cache contract is not a contract — it is a knob the provider turns when the economics demand it, and the only thing your code holds is a hope that the prefix will still be warm when the next turn arrives.

What re-prefill actually costs

When the cache misses on turn twelve, you are not paying a little more. You are paying for the prefill of every token in the transcript, all at once, from the first token of the system prompt to the last token of the conversation history. On a 28,000-token prefix at Claude Sonnet's base input rate, that is roughly $0.084 of input cost for the single turn, about ten times what the same turn costs on a hit. TTFT scales roughly linearly with prefill length on most production setups, so the cold prefill time of the whole prefix becomes the floor on first-token latency for that turn. In the opening example, that is the gap between 800ms and 3.4 seconds.

The user-visible effect is worse than the cost. Eleven turns at sub-second TTFT trained the user to expect responsiveness; turn twelve takes more than three seconds before any token streams. The conversational rhythm breaks. If the agent is voice-driven, the silence is uncanny. If the agent is in a chat surface, the user starts another turn before the first one returns. If the agent is running unattended in a CI loop, the timeout you set generously around the cached path now fires, and the orchestrator restarts the conversation from scratch, paying the prefill again on the retry, with even less reason to believe the cache will warm in time.

The cost is invisible until you measure it. Most token dashboards report cached and uncached input as one bucket, or report cached tokens as a positive metric to optimize rather than a hit-rate baseline to defend. The right view is hit-rate per turn within a conversation, plotted against turn index, segmented by hour of day. Conversations that show a cliff somewhere past turn ten are revealing where eviction is biting. If you do not have that plot, the bill arrives at the end of the month and the engineer who answers for it has nothing to point at.
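A minimal sketch of that view, assuming you already log cached and total input tokens for each turn. The `TurnRecord` fields are hypothetical names for whatever your logging pipeline captures; the hit rate here is token-weighted, so long prefixes count for more than short ones.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class TurnRecord:
    conversation_id: str
    turn_index: int           # 1-based position within the conversation
    hour_of_day: int          # 0-23, in the provider region's local time
    cached_input_tokens: int
    total_input_tokens: int


def hit_rate_by_turn_and_hour(
    records: Iterable[TurnRecord],
) -> Dict[Tuple[int, int], float]:
    """Token-weighted cache hit rate keyed by (turn_index, hour_of_day)."""
    sums = defaultdict(lambda: [0, 0])  # key -> [cached, total]
    for r in records:
        bucket = sums[(r.turn_index, r.hour_of_day)]
        bucket[0] += r.cached_input_tokens
        bucket[1] += r.total_input_tokens
    return {key: cached / total
            for key, (cached, total) in sums.items() if total > 0}


records = [
    TurnRecord("a", 11, 14, 27_000, 27_200),
    TurnRecord("a", 12, 14, 0, 28_000),       # the cliff
    TurnRecord("b", 12, 14, 27_800, 28_000),
]
rates = hit_rate_by_turn_and_hour(records)
```

Plot the resulting values with turn index on one axis and hour on the other; a cliff shows up as a band of low values, not as a single bad request.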

The eviction policies your provider does not name

Providers do not publish the rules that determine which prefix gets dropped first. Inference scheduling research from 2026 — Continuum's KV-cache TTL work, vLLM's tail-optimized LRU proposal, MIRAGE's dynamic remapping — makes the implicit policies explicit and lets you guess the shape of what closed-source providers are doing. The common patterns:

  • LRU on prefix blocks. Standard baseline. The prefix that has gone longest without a hit is the cheapest to evict because, by recency, it is the least likely to be reused. Works well when traffic is steady. Fails on conversations with long inter-turn intervals, where a thinking pause makes you look stale to the cache even though your next turn is moments away.
  • Priority by re-prefill cost. Some schedulers prefer to keep prefixes that would be expensive to recompute. A 30,000-token prefix is more valuable to retain than a 2,000-token one. This is good for you if you have long conversations and bad for you if the scheduler is biased toward optimizing throughput rather than tail latency.
  • Capacity-pressure eviction. When GPU memory is tight, the scheduler aggressively drops entries to make room for incoming prefills. The pressure signal is set by total tenant traffic, not yours. A bursty neighbor can push your prefix out even if your conversation is still active.
  • Conversation-aware retention. The frontier policies, like vLLM's T-LRU, track which conversations are likely to make their next request soon and pin their prefixes against eviction. This is rare in production today and not exposed at the API layer of major providers. Your best signal that something like this exists is anecdotal: cache hit rates that hold steady across heavy load.

You will not learn which of these your provider uses by reading the docs. You will learn it by instrumenting hit rate and correlating dips against your provider's status page, regional load patterns, and your own request timing. The trace at the moment of the miss will not name the cause; the pattern across many misses will.
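To make the first and third patterns concrete, here is a toy simulation of a token-budgeted LRU cache shared between your conversation and a bursty neighbor. It is an illustration of the mechanism only: no provider has confirmed that its scheduler works this way, and the capacity and prefix sizes are invented for the example.

```python
from collections import OrderedDict


class LruPrefixCache:
    """Toy shared prefix cache with a token budget and LRU eviction."""

    def __init__(self, capacity_tokens: int):
        self.capacity_tokens = capacity_tokens
        self.entries = OrderedDict()  # prefix_id -> tokens, oldest first

    def request(self, prefix_id: str, tokens: int) -> bool:
        """Return True on a hit. On a miss, insert the prefix and evict
        least-recently-used entries until the budget is respected."""
        if prefix_id in self.entries:
            self.entries.move_to_end(prefix_id)  # refresh recency
            return True
        self.entries[prefix_id] = tokens
        while (sum(self.entries.values()) > self.capacity_tokens
               and len(self.entries) > 1):
            self.entries.popitem(last=False)     # drop the oldest entry
        return False


cache = LruPrefixCache(capacity_tokens=100_000)
cache.request("yours", 28_000)                   # turn one: cold write
warm = cache.request("yours", 28_000)            # turn two: hit
for i in range(4):                               # a neighbor's burst
    cache.request(f"neighbor-{i}", 25_000)
after_burst = cache.request("yours", 28_000)     # your next turn
print(warm, after_burst)
```

Nothing in your own request stream changed between the hit and the miss. The only difference is four requests that would never appear in your trace.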

Designing for cache you cannot guarantee

The right posture is to treat cache hits as an optimization, not an invariant. Code that crashes when the cache misses has misunderstood what the cache is. The design patterns that survive the eviction-mid-conversation failure mode are the ones that make the cold path acceptable and the warm path observable.

  • Pin the prefix where the API allows it. Anthropic exposes a five-minute and a one-hour TTL on cache control blocks. The one-hour option exists exactly because long-running conversations were getting evicted under the default. Use it for the system prompt and any persistent context that will be reread across many turns. Pay the cache-write premium once and amortize across the conversation. Anthropic charges 1.25x base input for a five-minute write and 2x for the one-hour write. With reads at 0.1x, the five-minute write pays for itself on the first hit and the one-hour write on the second.
  • Budget conversation length against cache hold. If your average conversation runs forty minutes and your default TTL is five minutes, every quiet stretch is a chance to lose the prefix. Either upgrade the TTL, or design the agent to keep the cache warm with low-cost keepalive requests during pauses. Keepalive is unattractive, because it pays for cache reads you may not need, but it is more predictable than a cliff.
  • Build a graceful cold-prompt fallback. When TTFT exceeds a threshold, the cache has very likely missed, and the cached-token count in the response usage confirms it. Your client should know what to do: stream a "still thinking" affordance to the user instead of silence, extend the downstream timeout so a retry does not stack on top of the slow request, and log the event with enough conversation metadata to find the eviction pattern in aggregate. The cold path should not be an exception. It should be a documented mode with its own SLO.
  • Reserve capacity for conversations that cannot tolerate misses. Both major providers offer dedicated capacity tiers. The pitch is throughput; the more interesting property for production agents is cache predictability, because a reserved tier evicts on a smaller, more controllable footprint. The economics rarely work for hobby workloads and often work for B2B agents whose end users will not retry on a four-second pause.
  • Measure hit rate per turn within a conversation. Not per request. Not per tenant. Per turn, indexed by position in the conversation, so you can see exactly where cache cliffs appear. Pair the plot with the same hour-of-day cut from your provider's region, and the noisy-neighbor effect will surface as a daily pattern, not noise.
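The write-premium breakeven from the first item can be checked with a few lines. The sketch assumes the multipliers quoted above (1.25x and 2x for writes, 0.1x for reads) and compares one cached write followed by n hits against n + 1 uncached requests; `breakeven_hits` is a hypothetical helper.

```python
import math


def breakeven_hits(write_multiplier: float,
                   read_multiplier: float = 0.10) -> int:
    """Smallest number of cache hits n for which caching is no more
    expensive than not caching.

    Cached:   write_multiplier + n * read_multiplier   (in units of base cost)
    Uncached: 1 + n
    Solving for n gives n >= (write_multiplier - 1) / (1 - read_multiplier).
    """
    return math.ceil((write_multiplier - 1.0) / (1.0 - read_multiplier))


print(breakeven_hits(1.25))  # five-minute write
print(breakeven_hits(2.0))   # one-hour write
```

A conversation that is expected to see fewer hits than this within the TTL window is paying the premium for nothing, which is the case to look for before turning on the longer TTL everywhere.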

The deeper move is to stop trusting "cached" as a binary in your telemetry. The metric you need is the ratio of cached input tokens to total input tokens per turn, plotted as a distribution rather than an average. A mean of 90% looks fine until you notice the bottom decile is at 0% — and that bottom decile is the worst turn of someone's conversation, the one they remember.
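One way to compute that distribution, as a rough sketch. The 0.5 cutoff for calling a turn cold is an arbitrary assumption, and `cached_ratio` and `summarize` are hypothetical helpers operating on token counts you already log.

```python
import statistics
from typing import Dict, Sequence


def cached_ratio(cached_input_tokens: int, total_input_tokens: int) -> float:
    """Share of a turn's input tokens that were served from cache."""
    if total_input_tokens == 0:
        return 0.0
    return cached_input_tokens / total_input_tokens


def summarize(ratios: Sequence[float],
              cold_threshold: float = 0.5) -> Dict[str, float]:
    """Mean, mean of the worst 10% of turns, and share of cold turns."""
    if not ratios:
        raise ValueError("need at least one turn")
    ordered = sorted(ratios)
    k = max(1, len(ordered) // 10)  # number of turns in the bottom decile
    return {
        "mean": statistics.fmean(ordered),
        "bottom_decile_mean": statistics.fmean(ordered[:k]),
        "cold_share": sum(r < cold_threshold for r in ordered) / len(ordered),
    }


# Nine fully cached turns and one complete miss.
summary = summarize([1.0] * 9 + [0.0])
print(summary)
```

The mean in this example is 0.9 and looks healthy; the bottom-decile mean is 0.0, which is the number that matches what the user experienced.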

What the architecture is really telling you

Multi-tenant inference made conversation latency a function of strangers' traffic. No amount of careful prompt engineering changes that. The KV cache is shared infrastructure presented behind an API that hides the sharing. Your client treats the cache as private state; the provider treats it as a fungible resource to allocate against incoming load. The two views meet at exactly the point where your conversation gets billed for a re-prefill that your application has no language to explain.

The honest framing for the team is: the cache is not yours, and the latency contract you wrote against it is implicitly a contract with every other customer of your inference provider. You can buy more predictability with reserved capacity or longer TTLs. You can absorb the loss with graceful degradation. You can negotiate with the provider for cache pinning if your volume warrants it. What you cannot do is pretend the warm path is the only path. The cold path exists at the rate of other people's bursts, and the agent that does not have a story for it is shipping a UX that the next noisy neighbor will break.
