
Per-Tenant Inference Isolation: When Shared Cache, Fine-Tunes, and Embeddings Leak Across Customers

· 12 min read
Tian Pan
Software Engineer

Multi-tenant SaaS solved data isolation a decade ago. Row-level security in Postgres, per-tenant encryption keys, S3 bucket policies scoped to tenant prefixes — by 2018 the playbook was so well-rehearsed that an auditor asking "show me how customer A's data cannot reach customer B" had a one-page answer with a citation per layer. AI features quietly reintroduced the question and the answer is no longer one page.

The interesting part is not that AI broke isolation. The interesting part is where it broke isolation: not at the data layer the audit team has been guarding for ten years, but at four new layers nobody put on the diagram. Prompt cache prefixes share KV state across requests in ways that turn time-to-first-token into a side channel. Fine-tunes trained on aggregated customer data memorize tenant-specific phrasing and surface it back to the wrong customer. Embedding indexes get partitioned logically by query filter when the threat model demands physical separation. KV-cache reuse across requests creates timing channels that nobody threat-modeled when "shared inference is fine" was a reasonable shortcut.

This post is about what changed and what the discipline looks like once you take the problem seriously.

The Shared-Inference Posture Ages Badly

Shared inference was fine when the only thing two requests shared was a GPU and a model checkpoint. The model was a pure function of input tokens. Cache was per-request. Embedding indexes belonged to one customer because there was only one. Fine-tunes were single-purpose. None of those assumptions survives a 2026 production stack.

Modern serving runtimes — vLLM, SGLang, TGI, and the closed equivalents inside the major API providers — share state aggressively because that's where the cost wins are. Automatic prefix caching reuses KV blocks across requests with the same prompt prefix. Continuous batching mixes tenants in the same forward pass. Speculative decoding shares draft model state. Embedding indexes consolidate millions of tenants into a single ANN graph so the recall numbers stay flat as the customer count grows. Fine-tunes pool data because gradient signal is expensive to collect.

Every one of those decisions is correct in isolation and wrong in aggregate. The 2025 NDSS paper I Know What You Asked showed that vLLM's and SGLang's KV-cache sharing — the same feature that drops your inference cost by 40% — lets a malicious tenant reconstruct another tenant's prompts by issuing crafted queries and timing the response. The attack does not require breaking encryption. It does not require a model bug. It requires that two tenants share a runtime with prefix caching enabled, which is the default everywhere.

The pattern repeats at every layer where AI features traded isolation for performance. The fix is not to turn off the optimizations. The fix is to pay the isolation tax deliberately at the layers where the threat model demands it, and to know which layers those are.

Layer One: Prompt Caching as a Side Channel

Prompt caching is the cleanest example because the leak is provable and the math is public. When a request comes in, the runtime hashes the prompt prefix and checks whether the KV state is already in GPU memory. A hit means time-to-first-token drops from hundreds of milliseconds to tens. A miss means the full prefill cost. The difference is large, deterministic, and measurable from a network client.

If two tenants share that cache, an attacker can probe the cache by sending prompts with guessed prefixes. A short TTFT means the prefix matched something cached. With enough probes — and the attack literature now has reinforcement-learning variants that need surprisingly few — you reconstruct another tenant's prompt content.
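
To make the signal concrete, here is roughly what a probe measures. A minimal sketch assuming an OpenAI-style streaming completions endpoint; the URL, payload shape, and auth header are illustrative, not any specific provider's API:

```python
import time

import requests


def ttft(endpoint: str, api_key: str, prompt: str) -> float:
    """Measure time-to-first-token over a streaming request. A markedly
    short TTFT means prefill was skipped, i.e. the prompt prefix was
    already resident in the runtime's KV cache."""
    start = time.monotonic()
    with requests.post(
        endpoint,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"prompt": prompt, "max_tokens": 1, "stream": True},
        stream=True,
        timeout=30,
    ) as resp:
        for line in resp.iter_lines():
            if line:  # first non-empty chunk = first token on the wire
                return time.monotonic() - start
    raise RuntimeError("stream ended before any token arrived")
```

An attacker runs this in a loop over guessed prefixes and clusters the timings. The defenses below differ in whether they remove the hit/miss gap, hide it, or restrict when it can exist.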

The mitigations split into three families, and they are not equivalent:

  • Full per-tenant isolation disables sharing at the cache layer. Every tenant gets their own cache namespace; cross-tenant probes always miss (sketched after this list). This is the only mitigation that holds against a determined attacker, and it is the most expensive one because you give back the cache reuse you were paying for.
  • Timing obfuscation injects noise into TTFT to mask the cache-hit signal. It is cheaper but not airtight — if the noise distribution is bounded, enough probes will recover the signal anyway, and the attack literature is moving faster than the defense literature.
  • Selective isolation shares cache for prompts classified as non-secret and isolates everything else. Schemes like CacheSolidarity monitor cross-user reuse, flag suspicious sharing, and isolate prefixes adaptively. This is the current research frontier and the one most production teams should track because it preserves most of the cost win.
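
Mechanically, full per-tenant isolation means folding the tenant identity into the prefix hash itself, so identical prefixes from different tenants can never map to the same cached KV blocks. A minimal sketch, loosely modeled on the chained block hashing that vLLM-style runtimes use; the tokenization and block size are stand-ins:

```python
import hashlib


def prefix_block_keys(tenant_id: str, token_ids: list[int], block_size: int = 16) -> list[str]:
    """Chained per-block cache keys seeded with the tenant ID. Two tenants
    sending byte-identical prefixes get disjoint key chains, so a
    cross-tenant probe is structurally guaranteed to miss."""
    keys: list[str] = []
    running = hashlib.sha256(tenant_id.encode())
    for i in range(0, len(token_ids), block_size):
        block = ",".join(map(str, token_ids[i : i + block_size]))
        running = hashlib.sha256(running.digest() + block.encode())
        keys.append(running.hexdigest())
    return keys
```

The cost is exactly the one named above: two tenants with identical system prompts now pay prefill twice.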

The operational implication is that "we use prompt caching" is no longer a complete answer. The complete answer is "we use prompt caching, scoped to a per-tenant namespace, and our threat model documents what would have to break for tenant A's cache to influence tenant B's TTFT." If you cannot say that sentence, your prompt cache is a side channel waiting for a researcher to publish.

Layer Two: Fine-Tunes That Memorize Their Training Set

The second layer is the one that bites slowest and hurts the most. A fine-tune trained on aggregated customer data memorizes pieces of that data — not in a vague distributional sense, but in the literal sense that an extraction prompt can pull verbatim sequences out. The 2024 research on extracting training data from fine-tuned LLMs put the recoverable fraction at over 50% in natural settings. Newer 2026 work on LoRA fine-tuning shows the leakage is smaller than with full fine-tuning but still nonzero, and that it concentrates in the upper layers of the model.

The failure mode is not that an attacker pulls a customer's API key out of the model. The failure mode is more boring and more legally fraught: tenant A's fine-tune is trained on a corpus that includes phrasing, names, internal terminology, or document fragments from tenants B, C, and D. The model now produces output that surfaces tenant B's vocabulary in tenant A's session. Nobody attacked anything. The training pipeline did its job. The contract said tenant data would not influence other tenants' outputs and the model violated that contract by construction.

The discipline that has to land is training-data lineage. Every example in the fine-tuning set has a tenant tag. The default policy excludes any cross-tenant aggregation; opt-in is by named contractual permission, not by absence of objection. The training pipeline emits a manifest that names every tenant whose data influenced the resulting weights. When a customer asks "did my data train your model," the answer is a query against the manifest, not a guess.
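
A sketch of what that looks like in the pipeline. The names and schema here are hypothetical; the point is that exclusion is the default and the manifest is a durable artifact emitted alongside the weights:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Example:
    tenant_id: str
    text: str


def build_training_set(examples: list[Example], opted_in: set[str]):
    """Default-deny aggregation: only tenants with a named contractual
    opt-in reach the fine-tuning set, and the manifest records exactly
    whose data influenced the resulting weights."""
    kept = [ex for ex in examples if ex.tenant_id in opted_in]
    manifest = {
        "tenants": sorted({ex.tenant_id for ex in kept}),
        "num_examples": len(kept),
        "dataset_sha256": hashlib.sha256(
            "\n".join(ex.text for ex in kept).encode()
        ).hexdigest(),
    }
    return kept, manifest  # "did my data train this?" = tenant in manifest["tenants"]
```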

The architectural realization here is that a fine-tune is a stored derivative of training data with the same compliance properties as a database — you would never aggregate every customer's transactional data into one shared table without a contract permitting it, and a fine-tune is the same shape of question. Most teams reach this realization the first time legal audits the training pipeline.

Layer Three: Embedding Indexes and the Logical-Versus-Physical Trap

Vector databases support multi-tenancy in two postures and the difference matters. Logical multi-tenancy stores all tenants in the same index and filters by a tenant_id predicate at query time. Physical multi-tenancy gives each tenant its own index, namespace, or collection — a separate ANN graph, separate memory, separate query path.

Logical isolation is cheap and load-balances well across millions of small tenants. It also fails in three predictable ways. The first is the missing-filter bug, where a query path that should have included tenant_id = $current doesn't, and ANN search returns neighbors from across the boundary. Code review catches some of these; production traffic catches the rest. The second is the metadata-leak path, where the filter excludes vectors but the listing API or admin interface returns vector IDs and metadata across tenants. The third — and this one is unintuitive — is organic cross-tenant similarity, where one tenant's documents are semantically close to another's because the world contains shared topics, and a benign query returns leaked content even when the filter is correct, because the filter was applied after retrieval rather than as a pre-filter inside the ANN traversal.
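
The pre-filter versus post-filter distinction is worth seeing in code. The client interface below is hypothetical, not any particular vector database's API:

```python
def post_filtered_search(index, query_vec, tenant_id: str, k: int):
    # Anti-pattern: retrieve first, filter after. The ANN traversal ran
    # over every tenant's vectors, and if the filter line is ever dropped
    # on one code path, cross-tenant neighbors reach the caller directly.
    hits = index.search(query_vec, k=k)
    return [h for h in hits if h.metadata["tenant_id"] == tenant_id]


def pre_filtered_search(index, query_vec, tenant_id: str, k: int):
    # The filter is part of the traversal itself: out-of-tenant vectors
    # are never candidates, and the caller still gets k results even
    # when another tenant's documents happen to be closer.
    return index.search(query_vec, k=k, filter={"tenant_id": tenant_id})
```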

Physical isolation gives every tenant its own index. The cross-tenant query is now structurally impossible, not just policy-prohibited. The cost is that small tenants pay overhead, and patterns with very large tenant counts hit the per-index limits of the underlying engine. Pinecone's namespaces, Weaviate's native multi-tenancy, and Qdrant's collection-per-tenant patterns exist because the industry concluded that the logical-isolation failure modes were too expensive to keep paying.

The decision is threat-model-driven, not preference-driven. If your tenants are competitors of each other, or if any of them are regulated and the auditor will literally ask the cross-tenant question, the answer is physical. If your tenants are independent users of a low-stakes feature and the worst case is a few rare leaks across a large customer base, logical with audited filters can be defensible. The mistake teams make is choosing logical because it was the default and discovering at audit time that the threat model was always physical.

Layer Four: The KV-Cache Side Channels Nobody Threat-Modeled

The fourth layer is the one most teams have never looked at. Modern inference runtimes share KV cache state across requests for reasons beyond explicit prompt caching: continuous batching packs multiple tenants into the same forward pass, speculative decoding shares draft model state, and paged attention reuses physical memory blocks across logical sequences. Each of these is a performance win and a potential information channel.

The 2024 paper The Early Bird Catches the Leak documented timing channels in vLLM-style serving where the time to schedule a request reveals queue depth, which reveals other tenants' traffic patterns. The 2025 NDSS work extended this to prompt content reconstruction. The 2026 Selective KV-Cache Sharing paper formalized which sharing patterns are safe and which are not, and the answer is roughly "sharing across a tenant boundary is unsafe by default."

The mitigation is not to disable batching — that would surrender the GPU economics that make AI features viable. The mitigation is batching scoped to a security domain. Requests from tenants in the same trust class can be batched together; requests across trust classes are serviced from separate batches, separate queues, or separate replicas. This costs you GPU utilization. It buys you the ability to answer the auditor's question.
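
A minimal sketch of domain-scoped batching. Here trust_class is a hypothetical attribute standing in for whatever maps tenants to security domains in your threat model, per-tenant for competitors and coarser for low-stakes tenants:

```python
from collections import defaultdict


def next_batches(pending: list, max_batch_size: int) -> list[list]:
    """Admit requests into a batch only within one security domain, so no
    forward pass ever mixes trust classes and co-batching timing channels
    stop at the domain boundary."""
    by_domain = defaultdict(list)
    for req in pending:
        by_domain[req.trust_class].append(req)
    # One batch per domain per scheduling step. The unfilled batch slots
    # are the isolation tax: measurable, budgetable, auditable.
    return [reqs[:max_batch_size] for reqs in by_domain.values()]
```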

The deeper architectural point is that every shared resource at inference time is a candidate side channel. The auditor's question — "show me how customer A's queries cannot influence customer B's outputs" — is a question about the full chain: cache, batch, embedding index, fine-tune, log pipeline, error path. If any link in that chain is shared without a documented isolation boundary, the answer to the auditor's question is "we don't know," which in regulated industries is the same as "no."

The Discipline That Has to Land

The pattern across all four layers is the same: a feature that improved performance silently introduced cross-tenant state. The fix is not to remove the feature; the fix is to scope it.

A production-grade per-tenant inference isolation posture covers at least these surfaces:

  • Cache namespacing audited like row-level filters. Every cache layer — prompt cache, semantic cache, embedding cache — has a tenant key in the lookup path, and that key is in the threat model. The team can produce, on demand, the test that proves cross-tenant probes miss (sketched after this list).
  • Fine-tune training-data lineage. The default is no cross-tenant aggregation. Opt-in is contractual, not implicit. The training pipeline emits a manifest of which tenants' data influenced the weights. The manifest is durable and queryable.
  • Embedding indexes physically separated where the threat model demands it. Logical multi-tenancy is acceptable for low-stakes large-fanout patterns; physical separation is required wherever competing tenants share infrastructure or where the auditor will ask the cross-tenant question.
  • Inference batching scoped to security domains. Requests are batched within a trust class, not across one. The cost of this discipline is real and measurable; the cost of skipping it is a postmortem.
  • Contract-level commitments to tenants. The contract names what tenant data does and does not influence — training, caching, retrieval, telemetry — and the engineering reality matches the contract clause by clause. The mismatch between what sales promised and what the runtime actually does is the single most expensive class of incident in this category.
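
For the first item on that list, the audit artifact is small enough to show in full. A sketch of the probe-miss test, against a hypothetical cache interface:

```python
def test_cross_tenant_probe_always_misses(cache):
    """The test an auditor should see on demand: tenant B replaying
    tenant A's exact prefix must be a cold miss, every time."""
    prefix = "Summarize the attached contract for ACME Corp:"
    cache.prefill(tenant_id="tenant-a", prompt=prefix)  # A warms the cache
    assert cache.lookup(tenant_id="tenant-a", prompt=prefix).hit
    assert not cache.lookup(tenant_id="tenant-b", prompt=prefix).hit
```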

The realization that closes the loop: "shared inference is fine" was a posture that held while AI features were experimental and tenants tolerated some fuzziness. It ages badly the moment one customer is your competitor's customer too — at which point the question stops being academic and starts being a renewal conversation. The teams that handle this well are the ones that paid the isolation tax before they were forced to. The teams that handle it badly are the ones that discover, at audit time, that "we'll add tenant isolation later" is the AI-era equivalent of "we'll add encryption later" — a sentence that sounds reasonable until the day it doesn't.

The architectural realization is older than AI: every shared resource is a potential leak path, and the only durable answer is to make the sharing intentional, scoped, and auditable. The four new layers — prompt cache, fine-tune, embedding index, KV-cache reuse — are just the latest places where that lesson has to be relearned at full price.
