Prompt Cache as Covert Channel: TTFT Probing Leaks Cross-Tenant Prompts
Prompt caching is the optimization that pays for itself the moment you turn it on. A long system prompt is hashed once, the KV state lives in GPU memory, and every subsequent request that reuses the prefix skips the prefill cost. Providers report up to 80% latency reduction and up to 90% input-cost reduction on cached requests, and at scale the math is irresistible: a single shared prefix amortized across millions of calls turns a line item into a rounding error.
The mechanism that makes the savings work is a shared resource whose hit-or-miss state is observable as latency. That observability is the side channel. A cache hit and a cache miss are distinguishable from outside the network, the difference is large and repeatable, and the optimization that earned its place on the cost dashboard has a second job nobody scoped: it leaks information about what other tenants on the same provider are doing right now.
The interesting part is not that someone might exploit this. The interesting part is that the side channel is structural to the optimization, not a bug — patching it to zero means giving up the cache reuse the savings depended on, and most production teams have not had the conversation that names this trade-off explicitly. The cost dashboard says one thing. The threat model, if it mentions caching at all, says another. The bridge between those two artifacts is where the security review should have happened and didn't.
How TTFT Becomes a Side Channel
The mechanics are straightforward and the math is public. When a request lands at an inference server with prefix caching enabled, the runtime hashes the prompt's leading blocks and looks them up against KV state already in memory. A hit means the prefill skips ahead to the suffix; time-to-first-token drops from hundreds of milliseconds to tens. A miss means the model recomputes the prefix from scratch. The latency difference is large, repeatable across runs, and measurable from a network client over an ordinary HTTP connection.
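To make the mechanism concrete, here is a schematic of that lookup in a few lines of Python. It mirrors the chained block-hash scheme vLLM documents for its prefix caching, but the block size constant, hash construction, and names are illustrative rather than lifted from any engine's source.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per full KV block (vLLM's default block size)
kv_cache: dict[str, object] = {}  # block hash -> resident KV state (stub)

def leading_block_hashes(token_ids: list[int]) -> list[str]:
    """Chain-hash the prompt's full leading blocks, as the runtime does on arrival."""
    hashes, prev = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = ",".join(map(str, token_ids[i:i + BLOCK_SIZE]))
        prev = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
        hashes.append(prev)
    return hashes

def cached_prefix_blocks(token_ids: list[int]) -> int:
    """Count leading blocks already resident: each hit skips that block's prefill."""
    n = 0
    for h in leading_block_hashes(token_ids):
        if h not in kv_cache:
            break
        n += 1
    return n
```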
An attacker who suspects another tenant is using a particular prefix sends the candidate prefix and times the response. A short TTFT means the prefix matched something already cached, which is evidence that another principal recently submitted a request whose prefix overlapped. Repeat the probe, vary the candidate, and the cache becomes a yes-or-no oracle for "has anyone here submitted a request beginning with X."
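A minimal sketch of that probe, assuming an OpenAI-compatible streaming endpoint; the URL, model name, and hit/miss threshold are placeholders, and a real measurement would calibrate the threshold against known-hit and known-miss baselines for the specific deployment.

```python
import time
import requests  # third-party: pip install requests

# Hypothetical target and threshold, not measured values.
API_URL = "https://inference.example.com/v1/chat/completions"
HIT_THRESHOLD_S = 0.15  # TTFT below this is read as "prefix was cached"

def measure_ttft(candidate_prefix: str) -> float:
    """Stream one token for the candidate prefix and time the first event back."""
    payload = {
        "model": "example-model",
        "messages": [{"role": "user", "content": candidate_prefix}],
        "stream": True,
        "max_tokens": 1,
    }
    start = time.monotonic()
    with requests.post(API_URL, json=payload, stream=True, timeout=30) as resp:
        for line in resp.iter_lines():
            if line:  # first non-empty SSE line is the first streamed event
                return time.monotonic() - start
    return float("inf")

def probe(candidate_prefix: str) -> bool:
    """Yes/no oracle: did something with this prefix get cached recently?"""
    return measure_ttft(candidate_prefix) < HIT_THRESHOLD_S
```

One wrinkle the sketch ignores: the probe itself warms the cache, so a given candidate is only informative the first time it is sent, or after the entry has been evicted.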
The 2024 paper The Early Bird Catches the Leak showed this is not theoretical. Researchers detected global cache sharing across organizations on seven commercial API providers and used an incremental search algorithm to recover system prompts token-by-token at 92.3% accuracy with an average of 234 queries per token. I Know What You Asked (NDSS 2025) demonstrated end-to-end prompt reconstruction against vLLM and SGLang multi-tenant deployments using nothing but TTFT measurements. Auditing Prompt Caching in Language Model APIs applied statistical hypothesis tests to the public APIs of major providers and detected per-user, per-organization, and global caching behaviors that contradicted the providers' own documentation. The attack literature has moved fast enough that most defense literature is one cycle behind.
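The shape of that incremental search is easy to sketch on top of the probe oracle above. Everything here is illustrative rather than taken from the papers' code: candidate_tokens is a hypothetical helper that ranks plausible next tokens, and the published attacks additionally contend with cache-block granularity, timing noise, and their own probes polluting the cache.

```python
def recover_prefix(seed: str, candidate_tokens, max_tokens: int = 200) -> str:
    """Extend a known prefix one token at a time using the probe() oracle."""
    recovered = seed
    for _ in range(max_tokens):
        hit = None
        for tok in candidate_tokens(recovered):
            # A hit on recovered + tok is evidence the victim's cached
            # prompt continues with this token.
            if probe(recovered + tok):
                hit = tok
                break
        if hit is None:
            break  # no candidate extended the prefix; recovery stalls here
        recovered += hit
    return recovered
```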
The capability the attacker gains is not "decrypt cross-tenant data." It is closer to "ask the cache yes-or-no questions about what other tenants have recently sent." That capability is enough to recover system prompts whose contents the operator believed were confidential, to confirm whether a competitor is running a workflow whose prefix matches a guess, or to fingerprint the in-context examples a SaaS feature uses to steer its model. None of that requires breaking the model. It requires that the cache's hit-or-miss state be observable, which it is by construction.
The Optimization Couples Tenants Through a Shared Resource
The cleanest way to see the problem is to notice that prefix caching is a multi-tenancy primitive the way a shared CDN cache or a shared database connection pool is. Two tenants are no longer isolated the moment a piece of state belonging to either one of them affects the latency observed by the other. That is a coupling. The coupling is the entire point of the optimization — that's where the savings come from — and it is also the surface a tenancy review should have flagged.
The cost-vs-security trade-off here is sharper than most. If you partition the cache so that no two tenants can share a block, the side channel closes. The hit rate also collapses — every tenant pays full prefill cost on every novel prefix. Providers built prompt caching as a feature precisely because shared prefixes (system prompts, tool schemas, few-shot examples) repeat across millions of customer calls. The asset that makes the cache worth keeping is exactly the asset that makes it leaky. There is no obvious knob that gives you the cost savings of full sharing and the isolation of full partitioning at the same time.
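The partitioning knob is visible in the key structure itself. Continuing the lookup sketch from earlier, seeding the hash chain with a tenant identifier is enough to stop identical text from colliding across tenants; this is a schematic of the idea, not vLLM's code, and seeding with a constant instead takes you straight back to full sharing.

```python
import hashlib

BLOCK_SIZE = 16  # same block size as the earlier sketch

def tenant_block_hashes(token_ids: list[int], tenant_id: str) -> list[str]:
    """Same chained block hash as before, but the chain starts from the tenant
    id, so two tenants can never map the same text to the same cache key."""
    hashes, prev = [], tenant_id
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = ",".join(map(str, token_ids[i:i + BLOCK_SIZE]))
        prev = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
        hashes.append(prev)
    return hashes
```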
What providers have actually done is pick a granularity. Anthropic and OpenAI now scope cache reuse at the organization level: requests inside one organization can reuse each other's cached prefixes, but cross-organization sharing is disabled. Anthropic's docs explicitly say cache entries are isolated between organizations. OpenAI made similar changes after the 2024 disclosures and now states caching is "isolated at the organization level and is not shared across accounts." Azure OpenAI's documentation describes the same posture. These changes landed after researchers demonstrated cross-tenant leakage on the very APIs whose documentation had presented the behavior as safe. The disclosure cycle worked the way disclosure cycles usually work: the threat model updated when someone published the exploit.
Inside a single organization, the cache is still shared. That is fine for many enterprises (one organization is one trust boundary) and not fine for SaaS vendors whose own customers are tenants of their application running on top of an inference provider. If your product multiplexes customers onto a single inference-provider organization, your customers are sharing a cache with each other, and the provider's organization-scoped isolation does not give you per-customer isolation. You have to build that yourself.
What "Build It Yourself" Looks Like
The vLLM project has been working through this in public, and the design space the contributors have surfaced is the most useful artifact a production team can read on the topic. RFC #16016 proposes a cache_salt field on the request, injected into the hash of the first KV block, so that two requests with different salts cannot share cached state regardless of how identical their textual prefixes are. PR #17045 implemented the single-barrier version of the design and shipped in vLLM's stable release.
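For the SaaS vendor from the previous section, the building block is a stable, secret salt per customer, folded into every request that carries that customer's traffic. Below is a sketch against a self-hosted vLLM with the feature enabled; the secret handling, endpoint, and model name are placeholders, and the exact placement of the cache_salt field may vary by vLLM version, so check your server's request schema.

```python
import hashlib
import hmac
import requests  # third-party: pip install requests

# Hypothetical server-side secret, never derived from customer-controlled data.
CACHE_SALT_SECRET = b"load-this-from-your-secret-manager"

def customer_cache_salt(customer_id: str) -> str:
    """Stable per-customer salt: a customer keeps reusing its own cached
    prefixes, but two customers can never collide on one."""
    return hmac.new(CACHE_SALT_SECRET, customer_id.encode(), hashlib.sha256).hexdigest()

def chat(customer_id: str, system_prompt: str, user_msg: str) -> dict:
    # cache_salt is folded into the hash of the first KV block, so requests
    # carrying different salts cannot share cached prefix state even when
    # their text is identical.
    payload = {
        "model": "example-model",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        "cache_salt": customer_cache_salt(customer_id),
    }
    resp = requests.post("http://vllm.internal:8000/v1/chat/completions",
                         json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()
```

Deriving the salt with an HMAC keyed by a server-side secret keeps customers from guessing each other's salts and probing across the boundary they were supposed to be behind; rotating that secret invalidates all existing reuse, which is the per-customer version of the same cost-versus-isolation trade-off.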
References
- https://arxiv.org/html/2409.20002v1
- https://arxiv.org/abs/2502.07776
- https://arxiv.org/abs/2508.08438
- https://arxiv.org/html/2508.09442v1
- https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
- https://github.com/vllm-project/vllm/issues/16016
- https://github.com/vllm-project/vllm/pull/17045
- https://docs.vllm.ai/en/stable/design/prefix_caching/
- https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- https://developers.openai.com/api/docs/guides/prompt-caching
- https://learn.microsoft.com/en-us/answers/questions/2117496/how-does-azure-openai-ensure-prompt-caching-privac
- https://developer.nvidia.com/blog/structuring-applications-to-secure-the-kv-cache/
- https://aisecurityhandbook.com/chapter-3/prompt-leakage-kv-cache-sharing.html
- https://www.schneier.com/blog/archives/2026/02/side-channel-attacks-against-llms.html
