Skip to main content

Prompt Caching's Hidden Tax: When a Cache Hit Serves the Wrong User's Context

· 11 min read
Tian Pan
Software Engineer

Prompt caching is sold as a free win. Cache the long shared prefix — your system prompt, your tool definitions, your retrieved context — pay full price only for the short tail that changes, and watch the bill drop. The numbers are real: a cache read costs roughly a tenth of a fresh input token, so a workload with a heavy stable prefix can see its input cost fall by 80% or more. Teams adopt it for that reason, tune it for that reason, and report on it with a single metric: cache hit rate, trending up.

What that framing hides is that the boundary you just drew — the line between the cached prefix and the uncached tail — is not a billing knob. It is a correctness boundary. Everything above the cache breakpoint is content the system has decided is interchangeable across requests. If you draw that line to maximize hit rate, you are letting a finance metric decide which facts in your prompt are allowed to be shared between users, between tenants, and across time. That is an isolation decision, and it deserves to be made on purpose.

The failure mode is quiet because it never throws. A cache hit that serves one user's context shaped by another user's profile returns a perfectly well-formed response. A cache hit that serves personalization that was true when the prefix was warmed and false by the time it is reused returns a confident, coherent, wrong answer. Nothing in your latency graph or your error rate moves. The only signal is a hit rate that looks great — because the key is too coarse.

The cache key is a correctness boundary, not a billing knob

Provider-side prompt caching works by hashing the prompt prefix up to a marked breakpoint. On a subsequent request, if the prefix hash matches an existing entry, the model reuses the cached state instead of recomputing it. The match has to be exact — every token, in order, up to and including the breakpoint. Change one character above the line and you get a different hash and a cache miss.

That exactness is what makes the mechanism feel safe. "It only hits when the prefix is byte-identical, so how could it serve the wrong thing?" The answer is that you decide what goes above the line. The provider does not know that the paragraph reading User: Jane Doe, plan: Enterprise, region: EU is per-user. It sees tokens. If those tokens sit above your breakpoint, they are part of the shared prefix, and any other request whose prefix is byte-identical up to that point will hit the same entry — which is exactly what you asked for when you put them there.

So the real question is never "will the hash match." It is "what did I decide is allowed to be identical." Every fact you place above the breakpoint is a fact you have declared interchangeable. Put the system prompt there: fine, it genuinely is shared. Put the tool schema there: fine. Put the current user's entitlements there to pad the prefix and lift the hit rate: now you have declared that one user's entitlements may stand in for another's whenever the surrounding tokens line up. The cache did not leak anything. You drew the boundary in the wrong place.

Two ways a coarse key goes wrong

A boundary drawn for cost rather than correctness fails in two distinct directions, and they need different fixes.

Cross-tenant bleed. This is the spatial failure: the cached segment contains a fact that belongs to a specific user or tenant, and a request from a different principal reuses it. In application-level caching this happens when you build the prefix by concatenating a shared template with per-user data and then cache the whole thing with one breakpoint — two users with similar profiles can collide on the prefix and one inherits the other's context. At the infrastructure level it is worse than a logic bug: research presented at NDSS 2025 demonstrated PROMPTPEEK, a side channel in multi-tenant LLM serving where shared key-value caches let an adversary reconstruct other users' prompts by timing which prefixes are already warm. The lesson generalizes: a cache shared across trust boundaries is an information channel across those boundaries, whether you meant it to be one or not.

Stale personalization. This is the temporal failure: the cached segment was true when the prefix was first written and false by the time it is reused. A prefix that embeds a user's preferences, their recent activity, or a tenant's current policy is a snapshot. With a 5-minute TTL it is a snapshot that can be served for five minutes; with a 1-hour TTL, for an hour. If the underlying fact changes inside that window — the user updates a setting, the policy is revised, a long-running agent compacts its own history — the cache keeps serving the stale version, and the model has no way to tell stale context from current context. This is not hypothetical. A recent regression in a widely used coding agent did exactly this: a change raised hit rates on stale prefix entries without adding invalidation for compaction events, so the model began receiving outdated context it could not distinguish from the live state.

The two failures share a root cause — a mutable or principal-specific fact placed above the breakpoint — but they do not share a fix. Cross-tenant bleed is fixed by keying: making the principal part of what must match. Stale personalization is fixed by invalidation: making the fact's lifetime shorter than the cache entry's, or versioning it so a change forces a miss. A team that notices one and not the other ships a system that is half-correct.

The hit rate is the symptom, not the scoreboard

Cache hit rate is the metric every prompt-caching rollout reports, and it is genuinely useful — for diagnosing the opposite problem. A hit rate near zero almost always means something dynamic is rotating inside your prefix: a timestamp, a request ID, a non-deterministic JSON key order. That is a real bug and the hit rate catches it.

But a hit rate that is climbing is not automatically good news, and treating it as the scoreboard is how coarse keys get shipped. Consider what raises the number. You can raise it by genuinely stabilizing your prefix — extracting volatile tokens below the breakpoint so the stable part is cleanly cacheable. That is the win you wanted. Or you can raise it by widening the shared segment: pulling more content above the line, including content that should have been per-user or per-request. The number goes up the same amount. The dashboard cannot tell the two apart, because hit rate measures how often the prefix matched — not whether the things that matched were safe to match.

This is why hit rate belongs next to a second metric, not alone. Pair it with a correctness probe: an eval slice that deliberately issues requests from different tenants and different freshness states and asserts that no response carries another principal's facts or an out-of-date value. If hit rate goes up and that slice stays green, you stabilized the prefix. If hit rate goes up and the slice starts failing, you widened the boundary. Same dashboard movement, opposite outcomes — and only the second metric tells you which one happened.

Designing the key: decide what rides in the cached segment

The practical discipline is an audit. For every block above your cache breakpoint, answer one question: is this fact identical for every principal who could hit this entry, and stable for the entire TTL? If the answer is yes for both, it belongs in the cached segment. If it is no for either, it belongs below the breakpoint, in the uncached tail — or the breakpoint needs to move.

That audit usually produces a layered structure rather than one big cached blob. Real prompts have content that changes at different rates:

  • Truly static, truly shared — the system prompt, tool definitions, few-shot examples. This is the high-value cache target. It is identical across every user and changes only when you deploy.
  • Per-tenant, slow-moving — a tenant's policy, branding, or configuration. Shareable within a tenant but never across tenants. This needs the tenant identity to be part of the cache key, which most providers give you implicitly: the prefix only matches if the tenant block is byte-identical, so as long as the tenant block sits above the breakpoint and is distinct per tenant, two tenants cannot collide. The mistake is templating it away to "improve" matching.
  • Per-user, mutable — preferences, recent history, entitlements. This is the dangerous layer. It is tempting to cache because it is bulky and "feels like setup," but it is neither shared nor stable. It belongs below the breakpoint, or behind its own short-TTL breakpoint that you invalidate explicitly when the user changes something.
  • Per-request, volatile — the actual query, timestamps, request IDs. Always below the breakpoint. These exist to be different every time.

Most providers let you place several breakpoints — typically up to four — precisely so you can cut the prompt at these natural seams. Use them. One breakpoint after the static block, one after the per-tenant block, and the per-user and per-request content left uncached gives you most of the cost savings with none of the cross-tenant exposure. The cost difference between caching the per-user layer and not caching it is small; the correctness difference is the whole ballgame.

TTL is a truth-decay budget, not a cost dial

The choice between a 5-minute and a 1-hour TTL is usually framed as economics: the longer TTL costs more to write (a 2x multiplier on the write versus 1.25x for the short one) but keeps the cache warm across a slow request cadence so you eat fewer cold-start writes. That framing is correct and incomplete.

A TTL is also the maximum age of any fact in the cached segment. If you cache anything mutable, the TTL is a promise that you are willing to serve that fact for that long after it was true. For a static system prompt that promise is free — it does not change inside an hour anyway. For a per-tenant policy, an hour might be acceptable, or it might not, depending on how fast policy changes need to take effect. For per-user state it is almost never acceptable: a user who flips a setting and immediately sees the old behavior because a warm prefix outlived their change experiences it as a plain bug.

So pick the TTL per layer, not per request. Long TTLs for the static, shared content where age does not matter. Short TTLs — or no caching — for anything whose truth has a shelf life shorter than the window. And where a fact can change on a discrete event rather than just aging out, do not rely on the TTL at all: version the cached block. Append a content hash or a version counter to the block, and when the underlying fact changes, the version changes, the prefix hash changes, and the next request cleanly misses and rewrites. Time-based expiry is a blunt instrument for "this got old"; a version key is the precise instrument for "this specifically changed."

The boundary is yours to draw

Prompt caching is not a cost optimization with a correctness footnote. It is a correctness decision — which facts may be shared across principals and across time — that happens to come with a cost benefit. The provider gives you the mechanism and an exact-match guarantee; it does not and cannot decide which of your tokens are safe to treat as interchangeable. That is your design.

Draw the boundary by hand. Audit every block above the breakpoint for "identical for all principals" and "stable for the whole TTL." Layer the prompt so the static, the per-tenant, the per-user, and the per-request content sit at their own breakpoints with their own lifetimes. Report a cross-tenant-and-staleness eval slice next to the hit rate so a rising number cannot quietly mean a widening boundary. Do that, and prompt caching is exactly the free win it advertises. Skip it, and you have outsourced an isolation boundary to a billing dashboard — and the dashboard will tell you everything is fine right up until a user reads someone else's context.

References:Let's stay in touch and Follow me for more thoughts and updates