
The Multi-Tenant LLM Problem: Noisy Neighbors, Isolation, and Fairness at Scale

· 12 min read
Tian Pan
Software Engineer

Your SaaS product launches with ten design-partner customers. Everything works beautifully. Then you onboard a hundred tenants, and one of them — a power user running 200K-token context windows on a complex research workflow — causes every other customer's latency to spike. Support tickets start arriving. You look at your dashboards and see nothing obviously wrong: your model is healthy, your API returns 200s, and your p50 latency looks fine. Your p95 has silently tripled.

This is the noisy neighbor problem, and it hits LLM infrastructure harder than almost any other shared system. Here's why it's harder to solve than it is in databases — and the patterns that actually work.

Why LLMs Break Multi-Tenancy Assumptions

Traditional multi-tenant infrastructure has decades of tooling: connection pools, row-level security, per-schema databases, VPCs. The core abstraction in all of it is that resource consumption per request is bounded and roughly predictable. A database query takes a few milliseconds. A web request consumes a few hundred microseconds of CPU. You can estimate capacity from request rates.

LLM requests break this assumption completely. A single inference call might consume 512 tokens or 200,000. It might return in 300ms or take four minutes for a complex multi-step chain. The KV cache for a single long-context session on a Llama-3 70B model can exceed 40GB — memory that must remain resident on the GPU for the duration of that session. A GPU that can serve 50 concurrent short-context users can be fully saturated by two users running extended research workflows.
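The 40GB figure is easy to sanity-check with a back-of-envelope calculation. The sketch below assumes an fp16 KV cache and Llama-3 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128); the numbers are illustrative, not a measurement:

```python
# Rough KV-cache sizing, assuming fp16 values and Llama-3 70B's
# published shape: 80 layers, 8 KV heads (GQA), head_dim 128.
layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2

# Each token stores one K and one V vector per layer per KV head.
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
print(per_token)                  # 327680 bytes ≈ 320 KB per token
print(per_token * 200_000 / 1e9)  # ≈ 65.5 GB for a 200K-token context
```

A single 200K-token session at roughly 320 KB per token lands well past 40GB, which is why two long-context users can saturate hardware that comfortably serves fifty short-context ones.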

This variability isn't a bug you can engineer around. It's fundamental to what makes LLMs useful. The same infrastructure that handles simple question-answering needs to handle document analysis and multi-turn agentic workflows. And these workloads land on the same hardware.

The failure mode is invisible until it isn't. One tenant's expensive session evicts another tenant's KV cache, forcing recomputation. Recomputation spikes time-to-first-token. Users see stalls. Your monitoring shows the GPU is busy — which is technically accurate, but completely unhelpful for diagnosing whose session is consuming whose resources.

Token-Aware Rate Limiting Is Not Optional

Most teams start with request-per-minute (RPM) limits. These are easy to implement and immediately block the most egregious abuse. But they don't solve the noisy neighbor problem, because requests aren't equal. A tenant sending 100 short classification requests has a completely different GPU footprint than a tenant sending 3 requests with 50,000-token contexts.

The correct primitive for LLM rate limiting is tokens-per-minute (TPM), measured across both input and output. Input tokens are known at request time; output tokens must be estimated or tracked post-hoc. The practical approach is to count input tokens before admission and enforce a token budget per tenant per time window, then charge actual output tokens against that budget as they're generated.
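That admission-then-settlement flow can be sketched in a few lines. This is a minimal single-process version with illustrative names; a production implementation would live in a distributed store like Redis:

```python
import time
from collections import defaultdict

class TokenBudget:
    """Per-tenant TPM tracker: reserve input tokens plus an output
    estimate at admission, then settle with actual output tokens."""

    def __init__(self, tpm_limit: int, window_s: int = 60):
        self.tpm_limit = tpm_limit
        self.window_s = window_s
        # tenant_id -> list of (timestamp, token_count) entries
        self.usage = defaultdict(list)

    def _used(self, tenant_id: str, now: float) -> int:
        cutoff = now - self.window_s
        # Drop entries that have aged out of the window.
        self.usage[tenant_id] = [
            (t, n) for t, n in self.usage[tenant_id] if t > cutoff
        ]
        return sum(n for _, n in self.usage[tenant_id])

    def try_admit(self, tenant_id: str, input_tokens: int,
                  est_output_tokens: int) -> bool:
        """Admit only if the reservation fits this window's budget."""
        now = time.monotonic()
        reserve = input_tokens + est_output_tokens
        if self._used(tenant_id, now) + reserve > self.tpm_limit:
            return False
        self.usage[tenant_id].append((now, reserve))
        return True

    def settle(self, tenant_id: str, est_output_tokens: int,
               actual_output_tokens: int) -> None:
        """Replace the output estimate with the actual count."""
        delta = actual_output_tokens - est_output_tokens
        if delta:
            self.usage[tenant_id].append((time.monotonic(), delta))
```

The key design point is that admission charges a conservative estimate up front, so a tenant cannot exceed their budget by in-flight requests whose output hasn't been counted yet.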

Three algorithms dominate production implementations:

Token bucket fills at a steady rate and allows bursting up to bucket capacity. This models actual provider behavior well — most LLM APIs refill at a fixed TPM rate and allow short-term bursts. Token bucket is the right default for most applications because it enables burst-tolerant workloads while enforcing average-rate fairness.

Sliding window evaluates requests against a rolling time interval. It prevents the boundary exploitation that fixed windows allow (sending maximum requests at the end of one window and the beginning of the next), but is more complex to implement in distributed systems.

Priority queuing doesn't rate-limit by dropping requests — it degrades gracefully by queuing lower-priority requests when the system is under load. Interactive user requests get high priority; background batch jobs get processed when capacity is available. This is the right answer for applications where you'd rather add latency than return errors.
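Of the three, token bucket is the simplest to get right. A minimal sketch, with illustrative parameter names:

```python
import time

class TokenBucket:
    """Refills at tpm/60 tokens per second; allows bursting up to
    `capacity` tokens, then enforces the average rate."""

    def __init__(self, tpm: int, capacity: int):
        self.rate = tpm / 60.0          # tokens added per second
        self.capacity = capacity
        self.tokens = float(capacity)   # start full: bursts allowed
        self.last = time.monotonic()

    def try_consume(self, n: int) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, clamped to bucket capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if n <= self.tokens:
            self.tokens -= n
            return True
        return False
```

A tenant with a 6,000 TPM limit and a 1,000-token bucket can burst a large request immediately after a quiet period, but sustained traffic is held to 100 tokens per second on average.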

The overhead of token-aware rate limiting is negligible — around 4ms — compared to model generation time. There's no argument for skipping it.

One frequently missed consideration: rate limits must be enforced before the LLM call is dispatched, not after. Checking token budgets post-hoc means a runaway tenant has already consumed resources you can't reclaim. The enforcement point belongs at your API gateway or inference proxy, not in application code.

The Three Layers of Tenant Isolation

Multi-tenant LLM infrastructure requires isolation at three separate layers: the inference layer, the retrieval layer, and the context layer. Most teams implement one or two and discover the third the hard way.

Inference Layer Isolation

For most applications, shared model instances with per-tenant rate limiting and budget tracking are sufficient. The risk isn't that tenants share the model weights — they always do — it's that one tenant's resource consumption degrades another's latency. Rate limiting handles this.

For regulated industries or high-security requirements, dedicated model instances per tenant provide hard isolation. This is expensive: each dedicated instance means one or more GPUs reserved exclusively for one tenant, whether or not they're actively using them. The economics only work for high-value enterprise customers with strong compliance requirements.

The middle ground is namespace-level isolation: shared GPU resources with scheduler-enforced fairness. Modern serving systems like SGLang and vLLM support continuous batching, which interleaves tokens from different requests across the same GPU resources. With proper admission control and per-tenant token budgets, this achieves reasonable fairness without dedicated hardware. SGLang's RadixAttention memory manager provides ~29% higher throughput than baseline vLLM for multi-turn workloads — the architecture choice of your serving layer affects multi-tenant efficiency significantly.

Retrieval Layer Isolation (RAG)

This is where most implementations have gaps. When tenants share a vector index, retrieval isolation requires more than adding a tenant_id filter. It requires enforcing that filter at query time, before results reach the generation layer.

The critical mistake is post-retrieval filtering: retrieving across the full index and then dropping results that belong to other tenants. This is a data leakage risk — chunks from one tenant's documents appear in another tenant's retrieval context before the filter runs. It's also inefficient: you're paying to retrieve and embed content you'll immediately throw away.

Three patterns handle RAG isolation at different scale points:

Separate collections per tenant provides the cleanest isolation. Each tenant has their own vector collection. No cross-tenant queries are possible by construction. The tradeoff is operational overhead: spinning up new collections on every tenant onboarding, managing index sizes separately, and paying fixed costs even for tenants with small document volumes.

Namespaces within a shared index (Pinecone's approach) or partitioned collections (Qdrant's tiered multitenancy) offer a middle ground. Tenants are isolated within the index, but share underlying storage and compute. This handles the common case efficiently but doesn't provide hard isolation guarantees.

Metadata filtering with tenant_id is the most common approach and the most dangerous when implemented incorrectly. The filter must be enforced at the query layer, never post-retrieval. Tie it to the JWT or session token — inject tenant_id into every retrieval call from your infrastructure layer, not from application code. Application code can be bypassed; infrastructure enforcement cannot.
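One way to structure that infrastructure-layer enforcement is a wrapper that owns the retrieval call. Everything here is a sketch: `AuthContext`, the client's `search` signature, and the filter shape are hypothetical stand-ins for whatever your vector store actually exposes:

```python
from dataclasses import dataclass

@dataclass
class AuthContext:
    # Populated from verified JWT claims by the auth middleware,
    # never from the client-supplied request body.
    tenant_id: str

def tenant_scoped_search(vector_client, auth: AuthContext,
                         query_vector, top_k: int = 5):
    """All retrieval goes through this wrapper. The tenant filter is
    injected here, at the query layer, so application code cannot
    omit or override it."""
    return vector_client.search(
        vector=query_vector,
        filter={"must": [{"key": "tenant_id",
                          "match": {"value": auth.tenant_id}}]},
        limit=top_k,
    )
```

If application code never holds a raw vector-store client (only this wrapper), a forgotten filter becomes structurally impossible rather than a code-review item.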

The security primitive that matters here is that authorization happens before data touches the generation context. Once a retrieval result enters the prompt, it can be exfiltrated through the model's output.

Context Layer Isolation

Context isolation is the least-discussed layer and the one with the most subtle failure modes. In multi-tenant systems that cache prompts or maintain session state, cross-tenant context leakage is a real risk.

KV cache sharing is the specific vulnerability to understand. Sharing KV cache entries across tenants offers real performance benefits — it eliminates redundant recomputation for shared prefixes like system prompts. But research presented at NDSS 2025 demonstrated that cross-tenant KV cache reuse creates timing side-channel attacks that allow reconstruction of other tenants' prompts. The attack surface is real, not theoretical.

The safe approach is cache salting: add a per-tenant cache key component to all cached entries. This prevents cross-tenant cache hits for privacy-sensitive content. For system prompts that are truly shared across all tenants — your base instructions, for example — cache sharing is safe and the performance benefit is worth keeping. For anything tenant-specific, strict isolation is required.

For session state in multi-turn agents, isolate at the storage layer. Per-tenant memory namespaces, not shared tables with tenant_id rows. The engineering cost is higher but the isolation guarantee is stronger and the access patterns are cleaner.

Fairness and Degradation Under Load

Rate limiting prevents the worst-case noisy neighbor scenarios, but it doesn't tell you what to do when your aggregate load exceeds capacity. You need explicit policies for how to degrade gracefully.

Priority lanes are the most practical approach. Classify requests by type: interactive user sessions get the highest priority, API integrations get medium priority, background batch jobs get lowest priority. When the system is under load, shed lowest-priority traffic first. Users tolerate latency on batch operations far better than they tolerate latency on interactive responses.
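A priority lane is little more than a heap keyed on lane number. A minimal sketch, with lane names chosen for illustration:

```python
import heapq
import itertools

# Lower number drains first. Interactive traffic beats API traffic,
# which beats background batch jobs.
INTERACTIVE, API, BATCH = 0, 1, 2

class PriorityLanes:
    def __init__(self):
        self._heap = []
        # Monotonic sequence number: FIFO ordering within a lane,
        # and a tie-breaker so the heap never compares payloads.
        self._seq = itertools.count()

    def enqueue(self, priority: int, request) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def dequeue(self):
        """Pop the highest-priority (then oldest) pending request."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Under load, the scheduler simply drains this queue: batch work waits as long as interactive requests keep arriving, which is exactly the degradation order users tolerate.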

Admission control is the complement to rate limiting. Rate limiting governs the steady state; admission control governs spikes. When a new request would push a tenant over their token budget for the current window, queue it or return a 429 immediately rather than degrading service for other tenants. A clean 429 with a Retry-After header is a better experience than a silent 10-second wait.
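The 429 path is worth making explicit. A sketch of the decision, framework-agnostic and with illustrative parameter names:

```python
import math

def admit_or_reject(used_tokens: int, request_tokens: int,
                    tpm_limit: int, window_remaining_s: float):
    """Return (status, headers): 200 to proceed, or an immediate 429
    with Retry-After when this request would blow the tenant's
    budget for the current window."""
    if used_tokens + request_tokens <= tpm_limit:
        return 200, {}
    # Tell the client exactly when the window resets, rather than
    # silently queuing them behind everyone else's work.
    return 429, {"Retry-After": str(math.ceil(window_remaining_s))}
```

Clients that honor Retry-After self-pace their traffic, which turns what would have been a latency spike for every tenant into backpressure on exactly one.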

Per-tenant circuit breakers prevent a single malfunctioning tenant from causing system-wide degradation. If one tenant's requests consistently exceed token limits or error repeatedly, circuit-break that tenant's traffic independently of others.
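A per-tenant breaker only needs per-tenant failure counts and open timestamps. A minimal sketch with illustrative thresholds, including a half-open probe after the cooldown:

```python
import time

class TenantCircuitBreaker:
    """Open a tenant's circuit after `threshold` consecutive failures;
    reject that tenant's traffic for `cooldown_s` seconds without
    touching any other tenant."""

    def __init__(self, threshold: int = 5, cooldown_s: float = 30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures: dict[str, int] = {}
        self.opened_at: dict[str, float] = {}

    def allow(self, tenant_id: str) -> bool:
        opened = self.opened_at.get(tenant_id)
        if opened is None:
            return True
        if time.monotonic() - opened >= self.cooldown_s:
            # Half-open: let a request through to probe recovery.
            del self.opened_at[tenant_id]
            self.failures[tenant_id] = 0
            return True
        return False

    def record(self, tenant_id: str, success: bool) -> None:
        if success:
            self.failures[tenant_id] = 0
            return
        self.failures[tenant_id] = self.failures.get(tenant_id, 0) + 1
        if self.failures[tenant_id] >= self.threshold:
            self.opened_at[tenant_id] = time.monotonic()
```

Because state is keyed by tenant, one tenant tripping their breaker has no effect on admission decisions for anyone else.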

Budget enforcement timing matters more than most teams realize. Enforce budgets at request admission, not post-generation. Checking after the model has run means you've already consumed the compute budget and can't recover it.

The Operational Difference from Database Multi-Tenancy

Teams that have operated multi-tenant databases often assume LLM multi-tenancy is similar. The mental model doesn't transfer.

Database queries are CPU-bound and CPU costs are relatively predictable per query type. You can estimate capacity from query rate and average execution time. LLM inference is GPU-bound with costs that vary by an order of magnitude based on context length. A 2,000-token request and a 200,000-token request look identical at the HTTP layer until generation starts.

Database connection pools handle concurrency through queuing and connection reuse. GPU inference uses continuous batching, where the serving engine dynamically schedules token generation across concurrent requests. Interference happens at a lower level — evicting one tenant's KV cache to accommodate another's request — and is harder to observe.

Database row-level security is deterministic: a query either returns rows it's authorized to see or it doesn't. Vector retrieval filtering is probabilistic and can fail silently if implemented incorrectly. A missing filter propagates a data leak through the generation layer without any error being returned.

Database schema changes are transactional and immediately visible to all queries. Embedding model updates are not backward-compatible — embedding vectors from model version N are not interchangeable with vectors from model version N+1. Updating your embedding model means reindexing your entire vector store, which must happen with zero downtime in a multi-tenant system where tenants are continuously adding documents.

None of these differences mean multi-tenant LLM infrastructure is impossible — it's in production at scale across many companies. But they mean you can't cargo-cult patterns from database operations. The primitives are different.

What to Build First

If you're standing up multi-tenant LLM infrastructure and need to prioritize, this is the order that pays off:

Start with TPM rate limiting at the gateway layer. This is the highest-leverage single change you can make. It prevents the most damaging noisy-neighbor scenarios and gives you cost attribution per tenant as a free side effect. Implement it before you onboard your tenth tenant.

Implement retrieval isolation with enforced query-time filtering. Get this right before you ingest significant tenant data. Post-retrieval filtering is a security vulnerability and a waste of compute. The correct architecture enforces isolation before retrieval results enter the generation context.

Add priority queuing when your system first hits capacity. Don't preemptively build sophisticated queuing infrastructure. Build it when you have real load data showing which workload types are competing with each other.

Audit your KV cache sharing policy explicitly. Don't accept the default behavior of your serving framework as a security decision. Decide explicitly which cache entries are safe to share across tenants and which require isolation, then enforce that policy.

Separate billing from enforcement. Many teams track per-tenant token consumption for billing and assume that's also their enforcement mechanism. It isn't. Billing records what happened; enforcement prevents it from happening. Build both independently.

Multi-tenant LLM infrastructure will become table stakes for any AI product with multiple customers. The teams that get the primitives right — rate limiting, retrieval isolation, context isolation, graceful degradation — will spend their time building product features. The teams that defer it will spend their time explaining to enterprise customers why one tenant's research workflow caused their latency to spike.


The noisy neighbor problem doesn't announce itself. It accumulates silently until your p95 latency becomes someone else's p50. Build the isolation before you need to explain why it wasn't there.
