
Cross-Tenant Data Leakage in Shared LLM Infrastructure: The Isolation Failures Nobody Tests For

· 11 min read
Tian Pan
Software Engineer

Most multi-tenant LLM products have a security gap that their engineers haven't tested for. Not a theoretical gap — a practical one, with documented attack vectors and confirmed incidents. The gap is this: each layer of the modern AI stack introduces its own isolation primitive, and each one can fail silently in ways that let one customer's data reach another customer's context.

This isn't about prompt injection or jailbreaking. It's about the infrastructure itself — prompt caches, vector indexes, memory stores, and fine-tuning pipelines — and the organizational fiction of "isolation" that most teams ship without validating.

In April 2024, Wiz researchers demonstrated complete cross-tenant breaches on Hugging Face's AI-as-a-service platform. The attack chain ran through misconfigured Kubernetes environments and pickle deserialization in model files, ultimately giving the researchers access to private models and datasets across the entire customer base. They could reach other customers' data without bypassing authentication — just by exploiting the gaps between isolation layers that looked solid individually but were not composed correctly.

That incident is the visible tip. The subtler failures don't make headlines because nobody is looking for them.

The KV-Cache Timing Channel

When you deploy an LLM serving system like vLLM with automatic prefix caching enabled, the system stores key-value tensors from repeated prompt prefixes in GPU memory and reuses them on cache hits. This is a meaningful efficiency win — cache hits skip the expensive prefill computation and respond noticeably faster.

That measurable latency difference is also an attack vector.

Research presented at NDSS 2025 documented "PROMPTPEEK"-class attacks in which an adversary reconstructs other users' prompts by analyzing cache hit/miss timing patterns on shared serving infrastructure. The methodology requires no special access — only the ability to send requests and observe response latency. When the attacker's probe prompt matches another tenant's cached prefix, the hit is statistically distinguishable from a miss at p < 10⁻⁸. From a sequence of probes, an attacker can deduce what other tenants are querying.

The fix exists in vLLM as the cache_salt parameter, which creates separate cache namespaces per tenant by incorporating the salt value into the block hash. Only requests with matching salts can reuse cached blocks. But this protection is opt-in and application-enforced. The default configuration — the one most teams deploy — provides no cross-tenant cache isolation whatsoever.
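One way to apply the salt consistently is to derive it server-side from the tenant identity, so no request path can forget it or spoof another tenant's namespace. A minimal sketch — `SERVER_SECRET`, `build_chat_request`, and the model name are illustrative, not part of vLLM's API; only the `cache_salt` field in the request body is:

```python
import hashlib
import hmac

# Hypothetical deployment secret; deriving the salt with HMAC keeps one
# tenant from guessing (and probing) another tenant's cache namespace.
SERVER_SECRET = b"rotate-me-out-of-band"

def cache_salt_for(tenant_id: str) -> str:
    """Derive a stable, per-tenant salt for vLLM's prefix cache."""
    return hmac.new(SERVER_SECRET, tenant_id.encode(), hashlib.sha256).hexdigest()[:32]

def build_chat_request(tenant_id: str, messages: list[dict]) -> dict:
    """Build an OpenAI-compatible request body with a tenant-scoped cache_salt."""
    return {
        "model": "my-model",          # placeholder model name
        "messages": messages,
        "cache_salt": cache_salt_for(tenant_id),
    }
```

Because the salt is a pure function of the tenant ID, the same tenant keeps hitting its own warm cache while cross-tenant probes always miss.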

Anthropic's managed infrastructure switched from organization-level to workspace-level cache isolation in early 2026, recognizing that even internal teams within the same organization shouldn't share KV-cache blocks. If you're running your own serving stack, equivalent isolation requires explicit instrumentation. Most teams haven't added it.

The Namespace Illusion in Vector Databases

Vector databases are the layer most directly implicated in RAG-based leakage, and also the layer where the gap between "isolated" and "actually isolated" is widest.

Pinecone namespaces, Weaviate collections with multi-tenancy mode off, pgvector queries filtered by application-side WHERE clauses — all of these are organizational boundaries, not cryptographic ones. They work by convention: queries include a filter, and the database restricts the search space. What makes them fail is the same thing that makes SQL injection work — the boundary is enforced by application code, not by the storage system itself.

The specific failure modes differ by database:

Pinecone namespaces are correctly enforced at the index level when specified. The failure mode is omission: a developer writing a retrieval call forgets to pass the namespace parameter, and the query scans across all tenants. In a code review, this looks like a minor oversight. In production, it means every query without a namespace returns vectors from any tenant's data.
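The omission failure mode can be closed structurally by making the namespace impossible to leave out. A sketch of a wrapper around any Pinecone-style index object — `TenantScopedIndex` is a hypothetical helper, not part of the Pinecone SDK:

```python
class TenantScopedIndex:
    """Wrapper that makes the tenant namespace impossible to omit or override.

    `index` is any object exposing Pinecone-style query/upsert keyword APIs.
    """

    def __init__(self, index, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._index = index
        self._namespace = f"tenant-{tenant_id}"

    def query(self, **kwargs):
        if "namespace" in kwargs:
            raise ValueError("namespace is set by the wrapper, not the caller")
        return self._index.query(namespace=self._namespace, **kwargs)

    def upsert(self, vectors, **kwargs):
        if "namespace" in kwargs:
            raise ValueError("namespace is set by the wrapper, not the caller")
        return self._index.upsert(vectors=vectors, namespace=self._namespace, **kwargs)
```

Handing application code only the wrapped object, never the raw index, converts the "forgot the namespace parameter" bug from a silent cross-tenant scan into an impossible state.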

pgvector with row-level security is more robust because the database engine enforces the policy even when application code contains bugs — a forgotten WHERE clause is blocked, not silently permitted. But PostgreSQL has shipped RLS fixes of its own: CVE-2024-10976, for example, covered incomplete tracking of tables with row security that let a reused query see or change rows the policy should have blocked. Even database-layer isolation can fail at unexpected boundary points. RLS is not a guarantee; it's a strong default that reduces the blast radius of application bugs.
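The RLS pattern itself is compact. A sketch, with the DDL as strings and a helper that binds the session to one tenant — the table name `documents` and the GUC name `app.current_tenant` are illustrative conventions, not fixed ones:

```python
# Policy keyed on a per-session setting; FORCE makes it apply even to
# the table owner, which closes a common RLS misconfiguration.
RLS_SETUP = """
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.current_tenant'))
    WITH CHECK (tenant_id = current_setting('app.current_tenant'));
"""

def set_tenant(cursor, tenant_id: str) -> None:
    """Bind the session to one tenant before any query runs.

    `cursor` is a DB-API cursor (e.g. psycopg2); set_config with a bound
    parameter avoids interpolating the tenant ID into SQL text.
    """
    cursor.execute(
        "SELECT set_config('app.current_tenant', %s, false)", (tenant_id,)
    )
```

Once the policy is in place, every query on `documents` is filtered to the bound tenant regardless of what the application's WHERE clause says.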

Weaviate offers the strongest native multi-tenancy model among mainstream vector databases, with logically isolated data per tenant at the collection level. But it requires explicit multi-tenancy mode configuration — not the default — and the isolation guarantee depends on the tenant key being correctly set on every write and every query.

The testing gap is the common thread. Most teams verify that a tenant can retrieve their own data. Almost none verify that a tenant cannot retrieve another tenant's data. These are different tests.

A minimal cross-tenant isolation test looks like this: inject a distinctive document into Tenant A's index, then attempt retrieval using Tenant B's credentials — not just filtering but actually authenticating as Tenant B. If the document surfaces, the isolation is broken. Run this test in CI before every deployment that touches retrieval configuration.

Fine-Tuning as a Cross-Tenant Amplifier

Shared fine-tuning infrastructure introduces a contamination risk that most platform teams haven't considered: one tenant's training data can affect the base model that serves all tenants.

Research on training data poisoning has established that contaminating fewer than 0.01% of training samples is sufficient to implant behavioral backdoors that survive subsequent safety fine-tuning. Poisoning 1% of instruction-tuning data achieves 80% performance degradation on targeted task categories. The number of required poisoned samples remains roughly constant as training data scales — meaning larger training sets don't dilute the attack.

The multi-tenant threat model follows directly. If a platform runs fine-tuning jobs for multiple customers on shared infrastructure and produces a base model that all customers draw from, a single tenant uploading a poisoned dataset contaminates the shared base. Other tenants' models inherit the backdoor without knowing it exists. The poisoned behavior activates only when specific trigger patterns appear in prompts — patterns the attacker controls.

The practical defense is simple in principle and difficult in practice: never produce a shared base model from customer-specific fine-tuning jobs. Each fine-tuning run should either start from a stable, audited base model and produce a tenant-specific adapter, or run in completely isolated training infrastructure. The contamination risk disappears when there is no path from one tenant's training data to another tenant's serving weights.

For datasets that aren't directly uploaded by tenants but are assembled from mixed sources, use training data provenance tracking — record which data segments contributed to which model versions. When a contamination incident occurs, provenance logs tell you which model versions to revoke and which tenants are affected.
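A provenance log can be as small as two mappings: segments to model versions, and model versions to tenants. A minimal in-memory sketch — class and method names are illustrative; a production system would persist this in an append-only store:

```python
from collections import defaultdict

class ProvenanceLog:
    """Records which data segments fed which model versions, and which
    tenants run each version, so a contamination report maps directly
    to the checkpoints to revoke and the tenants to notify."""

    def __init__(self):
        self._segments_to_models = defaultdict(set)
        self._models_to_tenants = defaultdict(set)

    def record_training_run(self, model_version: str, segment_ids: list[str]):
        for seg in segment_ids:
            self._segments_to_models[seg].add(model_version)

    def record_deployment(self, model_version: str, tenant_id: str):
        self._models_to_tenants[model_version].add(tenant_id)

    def blast_radius(self, contaminated_segment: str):
        """Return (model versions to revoke, tenants affected)."""
        models = self._segments_to_models.get(contaminated_segment, set())
        tenants = set()
        for m in models:
            tenants |= self._models_to_tenants[m]
        return models, tenants
```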

Agent Memory Store Leakage

Long-running agents maintain state across sessions. That state lives somewhere — Redis for ephemeral session context, Postgres for durable long-term memory, vector databases for semantic retrieval. Each storage layer requires its own isolation implementation, and they fail in different ways.

Redis is the most common failure point because it's frequently deployed as a single shared instance with key-prefix conventions for tenant separation:

tenant:{tenant-id}:session:{session-id}

This is organizational isolation. It works exactly until application code fails to include the tenant prefix — a missing variable, a copy-paste error, a library call that constructs keys internally. The database has no notion of tenant boundaries; it will happily return any key to any client that asks for it. ACL rules on key patterns help but require near-perfect discipline to maintain. One overly permissive ACL entry exposes the entire keyspace.
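The discipline problem shrinks if key construction lives in exactly one function that validates its inputs, so a missing or malformed tenant ID becomes a loud error rather than a silent cross-tenant read. A sketch — the key schema matches the convention above; the validation rules are illustrative:

```python
import re

# Illustrative tenant-ID policy: lowercase slug, bounded length.
TENANT_ID = re.compile(r"^[a-z0-9-]{1,64}$")

def session_key(tenant_id: str, session_id: str) -> str:
    """Build the tenant-prefixed Redis key; refuse anything malformed.

    Centralizing this means no call site can forget the prefix, and no
    crafted session ID can smuggle a ':' to escape its key subspace.
    """
    if not TENANT_ID.match(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    if ":" in session_id:
        raise ValueError("session id must not contain ':'")
    return f"tenant:{tenant_id}:session:{session_id}"
```

The same function is the natural place to derive the matching ACL key pattern (`~tenant:<id>:*`) so the convention and the enforcement can't drift apart.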

PostgreSQL with row-level security provides stronger guarantees for durable memory because the isolation is database-enforced, not application-enforced. Even a buggy ORM that omits a WHERE clause will be blocked by the RLS policy. The tradeoff is that cross-tenant queries (which you never want in production) become impossible for the application role, even during debugging; inspecting data across tenants requires a separate role with BYPASSRLS.

The audit test for memory store leakage mirrors the vector database test: write a distinctive piece of data as Tenant A, then read from Tenant B's context and verify the data is not accessible. For agents specifically, this test should run across multiple conversational turns, because memory leakage often occurs in context compilation — when the agent assembles its working context from stored state — rather than in a single read operation.

What Actually Enforces a Boundary

The practical lesson from auditing multi-tenant AI infrastructure is that there are two categories of isolation primitives: those that require application discipline to enforce, and those that enforce themselves.

Namespace-level isolation — key prefixes in Redis, schema isolation in databases, namespace parameters in vector databases — requires every developer, every library, and every code path to consistently apply the tenant context. One omission creates a gap. These primitives are operationally cheap but organizationally fragile.

Policy-level isolation — PostgreSQL RLS, per-tenant ACLs with keyspace restrictions, Firecracker microVMs for code execution — enforces boundaries regardless of what application code does. PostgreSQL RLS blocks cross-tenant reads even when the application omits a WHERE clause. A Firecracker microVM cannot access another tenant's filesystem even if the agent tries. These primitives cost more to operate but convert a class of software bugs into non-incidents.

Hyperscalers resolved this question empirically: AWS built Firecracker to run untrusted customer code in Lambda, and Google built gVisor to sandbox multi-tenant workloads. Neither uses standard containers for workloads with serious cross-tenant isolation requirements. The industry's largest operators concluded that containers alone are insufficient for untrusted workloads — a conclusion that most SaaS AI platforms haven't yet absorbed.

For teams that can't justify the operational overhead of hardware-level isolation everywhere, the practical approach is to tier isolation by data sensitivity. Use policy-enforced boundaries (RLS, dedicated Redis databases) for any layer that processes customer-specific data. Reserve namespace-level isolation for layers where the data is less sensitive and the blast radius of a misconfiguration is bounded. Never use namespace-level isolation as the sole boundary for authentication state, encryption keys, or personally identifiable information.

The Audit Methodology

Finding cross-tenant contamination before a customer finds it requires intentional testing across every layer of the stack.

Prompt cache probing: Inject a distinctive, low-entropy prompt prefix as Tenant A. Measure response latency for an identical prompt from Tenant B's context. A statistically significant latency reduction indicates the cache is shared. Repeat after enabling cache salting per tenant and verify that the timing signal disappears.
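Deciding whether a latency difference is "statistically significant" doesn't need heavy machinery for an initial audit pass. A crude detector — thresholds and function name are illustrative, and a real audit would use a proper significance test over many trials:

```python
import statistics

def looks_cached(probe_latencies: list[float],
                 baseline_latencies: list[float],
                 min_gap_ratio: float = 0.3) -> bool:
    """Flag a shared prefix cache from latency samples (illustrative).

    probe_latencies: Tenant B's latencies for the prefix Tenant A warmed.
    baseline_latencies: latencies for a cold, never-cached prefix.
    Flags sharing when the probe's median is at least min_gap_ratio
    faster than the cold baseline and the two samples do not overlap.
    """
    probe_med = statistics.median(probe_latencies)
    base_med = statistics.median(baseline_latencies)
    if base_med <= 0:
        return False
    gap = (base_med - probe_med) / base_med
    overlap = max(probe_latencies) >= min(baseline_latencies)
    return gap >= min_gap_ratio and not overlap
```

Run the same probe after enabling per-tenant cache salting: the function should flip from True to False, which is the verification step the audit calls for.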

Vector database cross-namespace retrieval: Write a synthetic document with a unique identifier into Tenant A's index. Execute a semantic query from Tenant B that should match the document. Verify the document does not appear in results. Test both with correct namespace filtering and with deliberately omitted namespace parameters to confirm the database rejects the second case.

Memory store leakage detection: For each memory backend (Redis, Postgres, other), inject a distinguishable value as Tenant A and attempt retrieval using Tenant B's session credentials. For agent systems specifically, run this test with multi-turn conversations where the agent has had multiple opportunities to compile context from the store.

Fine-tuning contamination: If your platform runs tenant-specific fine-tuning jobs, maintain a strict inventory of which training jobs contributed to which model checkpoints. Audit whether any shared base model was produced from a training run that included customer data. If so, that base model's provenance is unclean and should be retrained from audited sources.

Cross-tenant behavioral testing: Create two tenants with intentionally different system prompts and few-shot examples. After several interactions for each, probe whether either tenant's behavioral conditioning surfaces in the other's session. This tests the full context pipeline rather than individual storage layers.

None of these tests are difficult to implement. They are uniformly absent from standard CI pipelines because the industry hasn't treated cross-tenant isolation as a testing requirement — only as an architectural claim.

The Practical Posture

Multi-tenant LLM infrastructure has a security testing gap that traditional multi-tenant systems closed long ago, and the only reason is that this infrastructure is newer. The attack surfaces — prompt caches, vector indexes, fine-tuning pipelines, agent memory — were designed for efficiency first and isolation second.

The teams that will avoid incidents are the ones that treat cross-tenant isolation the same way they treat SQL injection: as something that must be verified automatically, on every deployment, not trusted based on architectural intent. Architecture diagrams show what is supposed to happen. Automated tests show what actually does.

Start with the test that most directly maps to a customer incident: inject sensitive data as one tenant, retrieve as another, verify nothing leaks. Add it to CI. Then work through the layers — prompt cache, vector database, memory store — until the test suite covers the full data path. Each test you write converts a silent failure mode into a detectable one.

The Wiz findings on Hugging Face and the NDSS timing-channel research both point to the same conclusion: the isolation that the architecture diagram promises is not the isolation that production delivers. Closing that gap requires testing, not assumptions.
