The Cache Stampede That Hit Your Model Provider Instead of Your Database

· 10 min read
Tian Pan
Software Engineer

The pager went off at 14:02 UTC. Not for latency, not for errors, but for spend. The cost dashboard showed a vertical line: three minutes of input-token billing at roughly nine times the trailing hourly average, then back to normal. No regression had shipped. No tenant had onboarded. Traffic was flat to the minute. The only thing that had changed was that a single prompt prefix (the 14K-token system message that every agent in the fleet shared) had quietly expired on the provider side, and a thousand workers had all decided, within the same 200ms window, that they were the ones who needed to write it back.

This is a cache stampede. It is the same bug operators have been writing post-mortems about since memcached shipped in 2003. What is new in 2026 is that the cache it stampedes is no longer yours. It lives inside your model provider, you cannot inspect its state, and every miss costs real money instead of a few extra database queries. The synchronization bug that database engineers learned to jitter away two decades ago has quietly reappeared on a bill line item nobody thought to defend.

Why the LLM version is worse than the database one

In the database era, a cache stampede was a capacity event. A thousand workers miss the same key, a thousand identical queries hit Postgres, the connection pool saturates, p99 latency goes vertical, and either the database recovers in seconds or the load balancer sheds traffic until it does. Costly in goodwill, rarely costly in dollars.

The LLM version inverts that. Capacity is not the constraint; the provider absorbs the burst. The constraint is the pricing curve. Cache reads cost roughly a tenth of the base input rate. Cache writes cost a quarter more than the base input rate, or twice the base rate for extended-TTL writes. The cost ratio between a hit and a miss is therefore not 1:1; it is 1:12.5 at the standard multipliers and 1:20 with extended-TTL writes. A single coordinated miss across a fleet does not page anyone. It just silently multiplies your input-token spend for the duration of the warm-up window, and the next time you read the bill you find a quiet six-figure surprise that nobody owns.

The other inversion: you cannot see the cache. The provider does not tell you which prefix is in which GPU's KV cache, or which routing hash a request landed on, or when the TTL clock started. You see two numbers per response — cache_creation_input_tokens and cache_read_input_tokens — and from those you have to reconstruct what happened. The cache is a remote, opaque, stochastic resource whose state your code influences but does not own.

How the herd gets synchronized

Stampedes do not happen randomly. They happen because something synchronizes the herd, and LLM workloads have more synchronizers than the average database workload.

The fleet rollout. A deploy goes out at 14:00 UTC. Every worker process restarts within a 90-second window. Every worker's first request misses the prompt cache because the routing hash on the provider side keys partly on tenant or session ID and the provider's load balancer just spread the fleet across new machines. A thousand workers, a thousand cache writes, all stamped within two minutes.

The cron-scheduled batch job. A nightly evaluation run kicks off at 00:00:00 UTC and fires ten thousand evaluation prompts in parallel against the same model. Each evaluation shares the same 8K-token rubric prefix. The first batch of workers writes the cache; the next batch arrives in the same second and misses it because routing has not yet converged. The Google SRE book has been warning about midnight-cron synchronization since 2016 — what changed is that the resource being stampeded is now priced per token.

The TTL itself. Anthropic's default cache TTL is five minutes. If your traffic pattern is bursty (one request every seven minutes for a given prefix), every request you send pays the write premium and never enjoys a read. Worse, if several tenants share a prefix and they all happen to be busy in the same minute, they all write the same cache content because the provider treats their requests as independent cache populators. (Whether tenants' caches are actually shared is a provider-dependent detail you cannot rely on; assume the worst.)
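A small simulation makes the bursty case concrete. This is a sketch under a simplified model, not a description of any provider's internals: it assumes a single cache entry per prefix and that every request, hit or miss, resets the TTL clock. The function name `classify_requests` is hypothetical.

```python
def classify_requests(arrival_times_s, ttl_s=300.0):
    """Label each request 'write' (miss) or 'read' (hit) under a simple TTL model.

    Assumption: every request, hit or miss, leaves the entry fresh for ttl_s.
    """
    labels = []
    expires_at = None  # no cache entry exists before the first request
    for t in arrival_times_s:
        if expires_at is not None and t < expires_at:
            labels.append("read")
        else:
            labels.append("write")
        expires_at = t + ttl_s  # the entry is fresh again after any request
    return labels


# One request every seven minutes: each one arrives after the TTL has lapsed.
print(classify_requests([i * 420 for i in range(5)]))

# One request every four minutes: only the first pays the write premium.
print(classify_requests([i * 240 for i in range(5)]))
```

The seven-minute pattern never produces a read, which is why a keep-alive shorter than the TTL can be cheaper than letting the entry expire between requests.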

The autoscaler. Traffic spikes at 09:00 local time. The orchestrator scales from twenty workers to two hundred in ninety seconds. The hundred and eighty new workers all hit the model with the same system prompt before any of them has a warm cache, and the provider's prefix-routing layer cannot help because each new worker is a new TCP-level client with a new prompt-cache key.

The common factor: uniformity. The herd is created by identical timers, identical TTLs, identical deploy windows, identical routing keys. Every textbook cache-stampede mitigation starts by breaking that uniformity.

The cost shape, in numbers

Suppose your service runs an agent loop with a 12K-token system prompt and a small per-turn user message. At the steady state, every request is a cache read: 12K tokens × $0.30 per million × 0.1 = $0.00036 per request on the cached portion. Cheap.

Now collapse all reads into writes for two minutes. Each request becomes 12K tokens × $0.30 per million × 1.25 = $0.0045. The per-request cost rose 12.5×. If your fleet is doing a thousand requests per second during the warm-up window, that is 120,000 requests in 120 seconds: $540 in cache writes for a prefix that would have cost about $43 to read had the cache been warm, or roughly $497 of extra spend. An order of magnitude is not a thumb-on-the-scale rounding; it is the entire ROI of your caching strategy disappearing for the time it takes a deploy to converge.

Now imagine the extended-TTL variant: writes at 2× base input rate, reads still at 0.1×. The stampede multiplier jumps to 20×. The longer the TTL you bought to amortize writes, the more expensive each unexpected write becomes.

This is the calculation nobody runs when they enable prompt caching for the first time. The marketing math says "save 90% on input tokens." The operational math says "save 90% in steady state, pay 1,150% more than the cached rate during synchronized misses." Whether the average comes out ahead depends entirely on how often you stampede.
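The arithmetic in this section can be checked in a few lines. The rate, prefix size, and multipliers are the ones used in the text; the 1,000 requests per second over a 120-second window is the same illustrative scenario, and the helper name `prefix_cost` is made up for this sketch.

```python
BASE_RATE_PER_MTOK = 0.30   # dollars per million input tokens (example rate from the text)
PREFIX_TOKENS = 12_000      # shared system prompt size


def prefix_cost(tokens, multiplier, base_rate=BASE_RATE_PER_MTOK):
    """Dollar cost of the cached prefix for one request at a given price multiplier."""
    return tokens * base_rate / 1_000_000 * multiplier


read_cost = prefix_cost(PREFIX_TOKENS, 0.1)             # cache hit
write_cost = prefix_cost(PREFIX_TOKENS, 1.25)           # standard cache write
extended_write_cost = prefix_cost(PREFIX_TOKENS, 2.0)   # extended-TTL cache write

requests = 1_000 * 120  # 1,000 requests per second for a 120-second warm-up window
warm_total = requests * read_cost
stampede_total = requests * write_cost

print(f"per request: read ${read_cost:.5f}, write ${write_cost:.5f}")
print(f"miss/hit multiplier: {write_cost / read_cost:.1f}x standard, "
      f"{extended_write_cost / read_cost:.1f}x extended TTL")
print(f"window: ${warm_total:.2f} warm, ${stampede_total:.2f} stampeding, "
      f"${stampede_total - warm_total:.2f} extra")
```

Changing `requests` to match your own peak traffic and deploy convergence time gives a first estimate of what one synchronized miss costs.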

What an LLM-aware stampede defense actually looks like

The mitigations are not new. They are just being re-discovered by people who never had to think about cache coherency before.

Jitter the TTL refreshes. If you are warming caches with a keep-alive ping every four minutes (to stay inside a five-minute TTL), add ±30 seconds of random jitter per worker. A thousand workers pinging at exactly 14:04:00 produces a stampede; a thousand workers pinging anywhere in 14:03:30–14:04:30 does not. This is a one-line fix and it eliminates roughly 90% of self-inflicted stampedes.
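A minimal sketch of the jittered keep-alive interval, assuming each worker schedules its own pings. The function name `next_ping_delay` and its parameters are illustrative; the defaults mirror the four-minute ping, ±30 seconds of jitter, and five-minute TTL described above.

```python
import random


def next_ping_delay(base_interval_s=240.0, jitter_s=30.0, ttl_s=300.0, rng=random):
    """Seconds to wait before the next keep-alive ping: base interval plus uniform jitter."""
    if base_interval_s + jitter_s >= ttl_s:
        # The slowest possible ping must still land before the entry expires.
        raise ValueError("worst-case delay must stay inside the TTL")
    return base_interval_s + rng.uniform(-jitter_s, jitter_s)


# A thousand workers that last pinged at the same instant now spread their
# next ping across a 60-second window instead of landing together.
delays = sorted(next_ping_delay() for _ in range(1000))
print(f"earliest: {delays[0]:.1f}s, latest: {delays[-1]:.1f}s")
```

Drawing a fresh delay before every ping, rather than once at startup, keeps workers from drifting back into lockstep after a synchronized restart.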

Probabilistic early expiration. The XFetch algorithm — proven optimal in a 2015 paper that everyone forgot — has each worker independently decide to refresh the cache before its nominal expiry, with probability rising as expiry approaches. The math distributes refreshes over an interval rather than clustering them at the boundary. For LLM prompt caches, this looks like: on each request, with probability p(remaining_ttl), issue a deliberate cache-write request even though the cache is still valid. The expected cost of the over-refresh is small; the expected cost of a synchronized miss is large.
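The XFetch decision rule is short enough to sketch. It refreshes when `now - delta * beta * ln(rand()) >= expiry`, where `delta` is the time a refresh takes and `beta` (1.0 by default) tunes how eagerly to refresh. Applying it to a provider-side prompt cache carries assumptions of this sketch: the provider does not expose the expiry, so the worker estimates `expires_at` as its own last touch plus the TTL, and `delta_s` is taken to be the latency of a cache-writing request.

```python
import math
import random
import time


def should_refresh_early(expires_at, delta_s, beta=1.0, now=None, rng=random):
    """XFetch-style check: True if this worker should refresh before nominal expiry.

    expires_at: the worker's own estimate of when the entry expires (epoch seconds).
    delta_s:    how long a refresh takes; larger values start refreshing earlier.
    beta:       values above 1.0 favor earlier refreshes, below 1.0 later ones.
    """
    now = time.time() if now is None else now
    u = 1.0 - rng.random()  # lies in (0, 1], so log() never receives zero
    # -log(u) is an exponentially distributed, non-negative head start.
    return now - delta_s * beta * math.log(u) >= expires_at


# Example: the worker last touched the prefix 280 s ago under a 300 s TTL.
last_touch = time.time() - 280
if should_refresh_early(expires_at=last_touch + 300, delta_s=2.0):
    pass  # issue the deliberate cache-refresh request here
```

Because each worker draws its own random number, the refreshes spread out over the seconds before expiry, and once the estimated expiry has passed the check always returns True.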

Single-flight at the gateway. Run an LLM gateway in front of the provider that coalesces concurrent requests. If a hundred workers hit the gateway in the same 50ms window with an identical prompt, only the first one triggers a provider call; the other ninety-nine wait on the in-flight result. If they share only the prefix, the gateway can let the first request through to write the cache and hold the rest until it returns, so that they are more likely to arrive as reads. This is harder for streaming responses (you have to fan the stream out to all waiters) but it is the right shape of fix for the deploy-rollout scenario.
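A minimal single-flight sketch using `asyncio`, assuming non-streaming requests that are identical for a given key. The class name `SingleFlight`, the key `"prefix-hash"`, and `fake_provider_call` are placeholders for illustration, not part of any gateway product.

```python
import asyncio


class SingleFlight:
    """Coalesce concurrent calls that share a key into one in-flight call."""

    def __init__(self):
        self._in_flight = {}  # key -> asyncio.Task for the call in progress

    async def do(self, key, call):
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key: start the real call and publish it.
            task = asyncio.create_task(call())
            self._in_flight[key] = task
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        # shield() keeps one cancelled waiter from cancelling the shared call.
        return await asyncio.shield(task)


async def demo():
    calls = 0

    async def fake_provider_call():  # stand-in for the real provider request
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)
        return "response"

    flight = SingleFlight()
    results = await asyncio.gather(
        *(flight.do("prefix-hash", fake_provider_call) for _ in range(100))
    )
    return calls, len(results)


print(asyncio.run(demo()))
```

All hundred callers receive the same result from one upstream call. When requests share only a prefix, the same structure can gate followers on the leader's completion and then let each send its own request.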

Stagger the deploy. Instead of restarting the fleet in a 90-second window, restart it over fifteen minutes with a 1% canary that holds for two minutes. The first warm worker writes the cache; the rest read it. Most platform teams already do staggered deploys for capacity reasons — extend the same logic to cover prompt-cache warm-up, which now has a dollar cost attached to convergence.

Pre-warm explicitly. Before the autoscaler spins up new workers, fire a single warm-up request from a privileged caller to populate the cache. Then let the new workers join. This trades one deliberate cache write for hundreds of accidental ones.
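One way to sequence this is sketched below. The callables `send_warmup_request` and `start_worker` are hypothetical hooks you would supply: the first is assumed to send one request carrying the shared prefix and return that response's usage counters as a dict, the second to launch a worker. Starting workers even when the warm-up fails is a design choice of this sketch, on the assumption that serving traffic matters more than avoiding the write premium.

```python
def prewarm_then_scale(send_warmup_request, start_worker, new_worker_count, max_attempts=3):
    """Try to populate the prompt cache once, then start the new workers.

    Returns True if a warm-up response showed the prefix being written or read.
    """
    warmed = False
    for _ in range(max_attempts):
        usage = send_warmup_request()
        wrote = usage.get("cache_creation_input_tokens", 0)
        read = usage.get("cache_read_input_tokens", 0)
        if wrote > 0 or read > 0:
            warmed = True  # the prefix is now cached, or already was
            break
    # Scale out regardless; a cold cache costs money, a missing fleet costs more.
    for worker_id in range(new_worker_count):
        start_worker(worker_id)
    return warmed


# Usage with stand-in callables:
started = []
ok = prewarm_then_scale(
    send_warmup_request=lambda: {"cache_creation_input_tokens": 12_000},
    start_worker=started.append,
    new_worker_count=3,
)
print(ok, started)
```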

Watch the write-to-read ratio, not the absolute count. Each response reports cache_creation_input_tokens and cache_read_input_tokens. The ratio of the two is your stampede gauge. A healthy steady state has writes well under 5% of reads; a stampede shows up as a sudden inversion. Alert on the ratio, not on the absolute spend: by the time absolute spend triggers, the stampede is already over.
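A rolling gauge for that ratio might look like the sketch below. The class name `CacheRatioMonitor`, the window size, and the dict-shaped `usage` argument are assumptions of this example; the 5% threshold comes from the steady-state figure above.

```python
from collections import deque


class CacheRatioMonitor:
    """Track cache-write to cache-read tokens over the most recent responses."""

    def __init__(self, window=1000, threshold=0.05):
        self.threshold = threshold
        self._samples = deque(maxlen=window)  # (write_tokens, read_tokens) pairs

    def record(self, usage):
        """Record the usage counters from one response (missing fields count as 0)."""
        self._samples.append((
            usage.get("cache_creation_input_tokens", 0),
            usage.get("cache_read_input_tokens", 0),
        ))

    def ratio(self):
        writes = sum(w for w, _ in self._samples)
        reads = sum(r for _, r in self._samples)
        if reads == 0:
            # No reads at all: any write means the window is entirely misses.
            return float("inf") if writes else 0.0
        return writes / reads

    def is_stampeding(self):
        return self.ratio() > self.threshold


monitor = CacheRatioMonitor(window=100)
monitor.record({"cache_creation_input_tokens": 12_000})
for _ in range(99):
    monitor.record({"cache_read_input_tokens": 12_000})
print(monitor.ratio(), monitor.is_stampeding())
```

A short window reacts within seconds of a synchronized miss, which is the point of alerting on the ratio rather than on spend.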

The deeper bug: nobody owns the seam

The most uncomfortable thing about this failure mode is who is responsible for it. The application team did not write a stampede. They wrote sensible code that issues a request when a request is needed. The infrastructure team did not provision a thundering herd; they configured an autoscaler that does what autoscalers are supposed to do. The cost team did not approve a spike; they signed off on a pricing model that, in steady state, saved 90% on input tokens.

The bug lives at the seam where these three perspectives meet, and that seam is the provider's cache, which nobody owns. The application team cannot see it. The infrastructure team cannot tune it. The cost team cannot model it because the provider's eviction policy is undocumented and changes without notice. (Anthropic silently dropped the default TTL from one hour to five minutes in early March; many customers found out from their bill, not from a changelog.)

The fix is partly technical — jitter, coalescing, staggered deploys — and partly organizational. Somebody has to own the prompt-cache contract end-to-end: which prefixes are cached, what TTL they hold, what the expected hit rate is, and what happens to spend when the herd synchronizes. In most teams in 2026, that role does not exist. The traffic SRE owns capacity, the FinOps engineer owns spend, the application engineer owns the prompt, and the provider's cache happily sits between all of them being collectively neglected until the next pager event.

If you have a prompt-caching strategy and you cannot tell me, today, what your write-to-read ratio was during the last fleet deploy — you do not have a strategy. You have an assumption. The cache stampede that hit your model provider is the bill arriving to disabuse you of it.
