
Prompt Cache Thrashing: When Your Largest Tenant's Launch Triples Everyone's Bill

10 min read
Tian Pan
Software Engineer

The bill arrives on the first of the month and it is three times what your spreadsheet said it would be. Nobody pushed a system prompt change. The dashboard says request volume is flat. p95 latency looks normal. The token-per-correct-task ratio is unchanged. And yet you owe the inference vendor an extra forty thousand dollars, and the only signal in the observability stack that even hints at why is a metric most teams never alarm on: cache hit rate, which dropped from 71% to 18% somewhere in the second week of the billing cycle, on a Tuesday, at 9:47 AM Pacific, which is when your largest tenant's customer-success team kicked off a coordinated onboarding push for two hundred new users.

Welcome to prompt cache thrashing — the multi-tenant failure mode that the SaaS playbook was supposed to have eliminated a decade ago, reintroduced through the back door by your inference provider's shared prefix cache. The provider's cache is shared across your organization's traffic. Your tenants share that cache with each other whether you want them to or not, and a single tenant whose prefix shape shifts overnight can evict the prefixes everyone else's unit economics depended on. The bill spikes for tenants who did nothing differently. Finance pages engineering. Engineering points at the dashboard, which shows nothing wrong, because the dashboard isn't measuring the thing that broke.

Prompt Cache Is a Shared Resource You Don't Own

The discount is real and the math is seductive. Anthropic's cached input tokens cost roughly a tenth of uncached input tokens — $0.30 per million cached versus $3.00 per million standard for Claude Sonnet-class pricing — which means a stable system prompt prefix run across a million requests turns a five-figure line item into a four-figure one. OpenAI's automatic prefix caching is less aggressive at a flat 50% discount, but it also requires no explicit opt-in: any prompt over 1,024 tokens with a stable prefix gets the discount automatically when the request is routed to a server that has seen the prefix recently.
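
To make the sensitivity to hit rate concrete, here is a back-of-the-envelope sketch using the Sonnet-class rates above. The 5,000-token prefix and one million requests per month are illustrative numbers, and the cache-write surcharge is ignored because it amortizes to noise at this volume:

```python
# Blended monthly cost of a cached prefix at a given hit rate.
PRICE_UNCACHED = 3.00 / 1_000_000  # $ per uncached input token
PRICE_CACHED   = 0.30 / 1_000_000  # $ per cache-read token

def monthly_prefix_cost(prefix_tokens: int, requests: int, hit_rate: float) -> float:
    prefix_volume = prefix_tokens * requests
    return (prefix_volume * hit_rate * PRICE_CACHED
            + prefix_volume * (1 - hit_rate) * PRICE_UNCACHED)

print(monthly_prefix_cost(5_000, 1_000_000, 0.71))  # ~$5,415
print(monthly_prefix_cost(5_000, 1_000_000, 0.18))  # ~$12,570
```

The 71%-to-18% drop from the opening anecdote more than doubles this single line item, with zero change in request volume.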

The mechanics that make these discounts possible are also what make them brittle in multi-tenant deployments. The cache is not a perpetual artifact you control. It is a short-lived, in-memory record on the provider's serving infrastructure with a TTL between five and sixty minutes depending on the tier you pay for. When the TTL expires, or when the cache fills up and the provider's eviction policy reaches your prefix, the next request that hits that prefix pays the full uncached rate. Cache entries are isolated between organizations — your competitor cannot see your prefixes — but within your organization, the cache is shared across every tenant, every feature, every code path that calls the inference API.

This is the part the unit-economics spreadsheet usually misses. The forecast assumes "we have N tenants, each with prefix P, each making M requests per minute, so cache hit rate ≈ 90%." The forecast does not model what happens when tenant A's traffic mix shifts from M requests per minute to 5M requests per minute on prefix Q, which is different from prefix P, and Q now occupies the cache slots that P used to live in. Tenants B through Z are still sending requests on P. They were getting cache hits last week. They are getting cache misses this week, and the only thing that changed is a tenant they have no relationship with launched a feature.
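
A toy simulation shows the displacement mechanic. Everything here is invented for illustration: real provider caches are far larger and their eviction policies aren't published, but the qualitative collapse is the same under any LRU-adjacent policy.

```python
import random
from collections import OrderedDict

class LRUCache:
    """Toy LRU prefix cache; the capacity is a made-up illustrative number."""
    def __init__(self, capacity: int):
        self.capacity, self.entries = capacity, OrderedDict()

    def access(self, prefix: str) -> bool:
        if prefix in self.entries:            # hit: refresh recency
            self.entries.move_to_end(prefix)
            return True
        self.entries[prefix] = True           # miss: populate cold
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return False

cache = LRUCache(capacity=50)
steady = [f"P_{i}" for i in range(25)]    # tenants B..Z, one stable prefix each
launch = [f"Q_{i}" for i in range(200)]   # tenant A's new prefix family

def hit_rate(prefixes) -> float:
    return sum(cache.access(p) for p in prefixes) / len(prefixes)

for _ in range(5):
    hit_rate(steady)                          # warm the steady state
print("before launch:", hit_rate(steady))     # 1.0

for _ in range(5):
    hit_rate(random.choices(launch, k=100))   # tenant A floods the cache
print("after launch: ", hit_rate(steady))     # 0.0
```

Tenants B through Z didn't change a byte of their prompts; their hit rate went to zero anyway.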

The Anatomy of a Thrashing Event

Cache thrashing is not a single failure mode. It is a family of incidents, all of which look identical on the cost dashboard but require different fixes. The four shapes worth naming:

Volume displacement. The largest tenant's QPS spikes — a marketing event, a viral product moment, a customer-success-driven onboarding wave — and the sheer volume of their requests dominates cache occupancy. Smaller tenants get evicted not because their prefixes are different but because the provider's LRU-adjacent policy is keeping the most-recently-touched entries hot, and the largest tenant is now touching different ones constantly.

Prefix shape drift. A tenant ships a feature that introduces a new system prompt — a fine-tuned variant, an A/B test, a different tool schema — and their prefix hash changes. Their old prefix ages out of cache, the new prefix has to populate cold, and during the populate window their bill spikes. If the tenant is large enough, the new prefix's population also competes with everyone else's prefixes for cache slots.

Schedule synchronization. Tenants whose workloads run on the hour — cron jobs, scheduled email batches, end-of-day reconciliation tasks — all hit the cache simultaneously at predictable times. The cache that was 80% warm at 09:59 is 30% warm at 10:01 because a thundering herd of tenants is asking for slightly different prefixes that all need to populate. The bill spike isn't continuous; it's a sawtooth that maps perfectly to your scheduler config, and the dashboard rolls it up into a daily average that hides the structure.

Silent reformatting. A library upgrade changes the JSON serialization order of a tool schema. A timestamp gets injected into the system prompt for "debugging." A retrieval block migrates from below the cache breakpoint to above it. The prefix is byte-different from yesterday's, even though the prompt is logically identical. Every tenant who consumes that code path now misses the cache, and the cost spike is global rather than per-tenant.

The unifying thread is that none of these events show up on the request-volume dashboard, the latency dashboard, the error-rate dashboard, or the model-quality dashboard. They show up on exactly one signal — cache hit rate, denominated per tenant — and most teams don't graph that signal at all, let alone alert on it.

Per-Tenant Cache Hit Rate Is the SLO Nobody Has

The first instinct when a team starts to take cache seriously is to graph the global cache hit rate. This is better than nothing and worse than it sounds. Aggregate hit rate is a weighted mean across tenants, dominated by whichever tenant has the most traffic. If your largest tenant has a 95% hit rate and your smallest twenty tenants are at 12%, the aggregate looks fine and the small tenants are paying eight times what they should be. Aggregation hides the noisy-neighbor signature.
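
The arithmetic is worth spelling out, with illustrative volumes:

```python
# Aggregate hit rate is a traffic-weighted mean, so the big tenant dominates.
big   = {"requests": 900_000, "hit_rate": 0.95}
small = [{"requests": 5_000, "hit_rate": 0.12}] * 20   # twenty starved tenants

hits = big["requests"] * big["hit_rate"] + sum(t["requests"] * t["hit_rate"] for t in small)
reqs = big["requests"] + sum(t["requests"] for t in small)
print(f"{hits / reqs:.0%}")  # 87% -- the global dashboard looks healthy
```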

The metric that surfaces thrashing is cache hit rate denominated per tenant, with a per-tenant SLO. Pick a target — 70% is a reasonable starting point for most SaaS workloads where the system prompt is meaningfully larger than the user message — and alert when any tenant's hit rate falls below it for more than fifteen minutes. The alert fires before the bill arrives. The alert names the tenant whose prefixes are being evicted, which lets you walk the dependency in either direction: was it tenant X's traffic shape that changed (push side), or was it tenant Y's volume surge that crowded X out (pull side)?
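
A minimal version of that alert check, assuming you already have per-tenant, per-minute hit-rate samples in a metrics store; fetch_minutely_hit_rates is a hypothetical helper standing in for whatever query API you actually have:

```python
SLO_HIT_RATE   = 0.70  # the starting-point target suggested above
WINDOW_MINUTES = 15

def tenant_in_breach(tenant_id: str) -> bool:
    # Hypothetical helper: one hit-rate sample per minute, newest last.
    series = fetch_minutely_hit_rates(tenant_id, WINDOW_MINUTES)
    # Require a sustained breach so one cold-start minute doesn't page anyone.
    return len(series) == WINDOW_MINUTES and all(r < SLO_HIT_RATE for r in series)

def tenants_to_page(tenant_ids) -> list:
    return [t for t in tenant_ids if tenant_in_breach(t)]
```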

The data is not free. The major providers return cache metadata on every response — Anthropic exposes cache_creation_input_tokens and cache_read_input_tokens; OpenAI exposes usage.prompt_tokens_details.cached_tokens — and you have to plumb that metadata through your gateway, tag it with the tenant ID, and aggregate it. The tagging is the load-bearing piece. Hit rate without tenant-tagging is a global blur. Hit rate with tenant-tagging is the only signal that distinguishes "the system prompt got changed" from "tenant X had a launch."
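
The gateway-side plumbing is short once you know where the fields live. The field paths below are the providers' real usage fields; the metrics client and its emit signature are placeholders for your own telemetry stack:

```python
def record_cache_usage(tenant_id: str, provider: str, response, metrics) -> None:
    if provider == "anthropic":
        read = response.usage.cache_read_input_tokens or 0
        # Cache-creation tokens are the cold-populate cost, i.e. misses.
        missed = response.usage.cache_creation_input_tokens or 0
    elif provider == "openai":
        read = response.usage.prompt_tokens_details.cached_tokens or 0
        missed = response.usage.prompt_tokens - read  # uncached remainder
    else:
        return
    tags = {"tenant": tenant_id, "provider": provider}
    metrics.emit("cache_read_tokens", read, tags=tags)    # hypothetical client
    metrics.emit("cache_miss_tokens", missed, tags=tags)
    # Downstream hit rate: cache_read_tokens / (cache_read_tokens + cache_miss_tokens)
```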

Capacity Planning for the Worst-Case Tenant

Once you can see thrashing, the next question is what to do about it. The default answer the inference provider offers is "use longer TTLs" — Anthropic's 1-hour extended caching, for example — and this is genuinely useful but not sufficient. A longer TTL only helps if your cache entries survive that long against eviction, and eviction in a multi-tenant deployment is driven by other tenants' traffic, not by clock time.

The architectural moves that actually help, ordered by how much engineering investment they require:

Reserve cache headroom for the worst case, not the average. If your largest tenant can plausibly 5x their volume during a launch, your capacity plan needs to assume that happens during the same hour as everyone else's normal traffic. The aggregate cache hit rate spreadsheet that uses average tenant volume is forecasting against a scenario that doesn't survive contact with a single customer-success milestone.

Pre-warm prefixes for tenants whose workloads are predictable. If tenant X always runs their nightly batch at 02:00 UTC, fire a single throwaway request against their prefix at 01:55 to seed the cache. The cost is one cached-write penalty per scheduled job. The benefit is that the batch's first hundred requests don't pay the cold-cache tax.
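
The warm-up request itself is trivial. A minimal sketch with the Anthropic SDK; the model name and system prompt are placeholders, and the system block must be byte-identical to the one the batch will send or the seed buys nothing:

```python
import anthropic

client = anthropic.Anthropic()

def prewarm(system_prompt: str) -> None:
    # One throwaway request seeds the prefix cache. max_tokens=1 keeps
    # the completion negligible; you pay the cache-write surcharge once.
    client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1,
        system=[{
            "type": "text",
            "text": system_prompt,  # byte-identical to the batch's prefix
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": "warm-up"}],
    )
```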

Stabilize the prefix shape that shouldn't change. The thrashing events that come from "silent reformatting" are entirely avoidable by treating the cached region of the prompt like a schema. Pin the JSON serialization order. Move any dynamic content (timestamps, session IDs, user-specific metadata) below the cache breakpoint. Treat changes to the cached region as a release-gated event, the same way you'd treat a database migration, and run a hit-rate regression check before merging, as sketched below.
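
Concretely, that can be canonical serialization plus a pinned hash asserted in CI. SYSTEM_PROMPT, TOOL_SCHEMA, and EXPECTED_PREFIX_SHA are placeholders here; the point is that a silent reformat fails the build instead of the cache:

```python
import hashlib
import json

def canonical_prefix(system_prompt: str, tool_schema: dict) -> str:
    # sort_keys + fixed separators pins the byte layout, so a library
    # upgrade can't silently reorder the schema serialization.
    tools = json.dumps(tool_schema, sort_keys=True, separators=(",", ":"))
    return f"{system_prompt}\n{tools}"

def test_cached_region_is_stable():
    digest = hashlib.sha256(
        canonical_prefix(SYSTEM_PROMPT, TOOL_SCHEMA).encode()
    ).hexdigest()
    # Updating this pinned constant is the release-gated event.
    assert digest == EXPECTED_PREFIX_SHA, "cached prefix changed byte-wise"
```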

Route prefix-divergent tenants separately when economics warrant it. If you have a single very large tenant whose prefix shape is fundamentally different from the rest of your fleet, the right architectural answer may be to send their traffic through a different deployment — different organization at the provider, different API key, different billing cohort. This is expensive in coordination but cheap in cache contention, because each cohort gets its own cache budget. The trigger condition is when one tenant's traffic alone is large enough to evict the rest.
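
The routing itself can be as blunt as a client-per-cohort lookup. Tenant IDs and environment-variable names here are illustrative, and the isolated key must belong to a separate provider organization, since cache entries are shared across an organization:

```python
import os
import anthropic

# The isolated key must live in a separate provider organization;
# a second key inside the same org still shares the same cache.
COHORT_ENV_KEYS = {
    "default":  "ANTHROPIC_API_KEY",
    "isolated": "ANTHROPIC_API_KEY_WHALE",
}
ISOLATED_TENANTS = {"tenant-whale"}  # illustrative tenant ID

def client_for(tenant_id: str) -> anthropic.Anthropic:
    cohort = "isolated" if tenant_id in ISOLATED_TENANTS else "default"
    return anthropic.Anthropic(api_key=os.environ[COHORT_ENV_KEYS[cohort]])
```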

What the Architecture Diagram Should Have Said

There is a load-bearing assumption in most multi-tenant SaaS architecture diagrams that the inference call is a leaf node — a stateless function whose cost depends only on its inputs. Prompt caching breaks this assumption. The inference call now has shared state with every other inference call your organization makes in the next five to sixty minutes, and the state is mediated by a cache you don't control, denominated in tokens you don't see, evicted by a policy you can't introspect.

The team that hasn't drawn this dependency on their architecture diagram is going to learn about it the way the SaaS industry has been learning about noisy-neighbor problems for fifteen years: by getting paged, finding nothing wrong with their service, and only later realizing the shared resource was the thing they paid the platform to handle for them. Cache hit rate is now a tenant-coupling surface. It belongs in the noisy-neighbor section of the architecture review, not the cost-optimization section of the engineering wiki.

The teams that are ahead on this in 2026 are the ones treating cache hit rate the way mature platforms treat database connections: as a first-class budget, alarmed per tenant, capacity-planned for the worst case, and audited the same way you audit any other shared resource. The teams that aren't are paying a tax denominated in surprise invoices, and the invoice arrives weeks after the incident. The shared cache reintroduced a class of problem the multi-tenant playbook was supposed to have already solved. Solve it again, on the same axes — isolation, observability, capacity — or pay the bill.
