The Multi-Tenant LLM Problem: Noisy Neighbors, Isolation, and Fairness at Scale
Your SaaS product launches with ten design partners. Everything works beautifully. Then you onboard a hundred tenants, and one of them, a power user running 200K-token context windows on a complex research workflow, causes every other customer's latency to spike. Support tickets start arriving. You look at your dashboards and see nothing obviously wrong: your model is healthy, your API returns 200s, and your p50 latency looks fine. Your p95 has silently tripled.
This is the noisy neighbor problem, and it hits LLM infrastructure harder than almost any other shared system. Here's why it's tougher to solve than it is in databases, and which patterns actually work.
Why LLMs Break Multi-Tenancy Assumptions
Traditional multi-tenant infrastructure has decades of tooling: connection pools, row-level security, per-schema databases, VPCs. The core assumption behind all of it is that resource consumption per request is bounded and roughly predictable. A database query takes a few milliseconds. A web request consumes a few hundred microseconds of CPU. You can estimate capacity from request rates.
LLM requests break this assumption completely. A single inference call might consume 512 tokens or 200,000. It might return in 300ms or take four minutes for a complex multi-step chain. The KV cache for a single long-context session on a Llama-3 70B model can exceed 40GB — memory that must remain resident on the GPU for the duration of that session. A GPU that can serve 50 concurrent short-context users can be fully saturated by two users running extended research workflows.
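For intuition on where that 40GB figure comes from, here's a back-of-the-envelope estimate, assuming Llama-3 70B's published shape (80 layers, 8 KV heads under grouped-query attention, 128-dimensional heads) and an fp16 cache:

```python
# Rough KV-cache sizing. Architecture constants are Llama-3 70B's published
# values; the 128K-token session length is an assumption for illustration.
layers, kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2     # fp16 = 2 bytes
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V planes
print(f"{per_token / 1024:.0f} KB per token")                   # ~320 KB
print(f"{per_token * 131_072 / 2**30:.0f} GiB at 128K tokens")  # ~40 GiB
```

At roughly 320KB per token, a tenant's GPU footprint is determined by context length, not request count.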
This variability isn't a bug you can engineer around. It's fundamental to what makes LLMs useful. The same infrastructure that handles simple question-answering needs to handle document analysis and multi-turn agentic workflows. And these workloads land on the same hardware.
The failure mode is invisible until it isn't. One tenant's expensive session evicts another tenant's KV cache, forcing recomputation. Recomputation spikes time-to-first-token. Users see stalls. Your monitoring shows the GPU is busy — which is technically accurate, but completely unhelpful for diagnosing whose session is consuming whose resources.
Token-Aware Rate Limiting Is Not Optional
Most teams start with request-per-minute (RPM) limits. These are easy to implement and immediately block the most egregious abuse. But they don't solve the noisy neighbor problem, because requests aren't equal. A tenant sending 100 short classification requests has a completely different GPU footprint than a tenant sending 3 requests with 50,000-token contexts.
The correct primitive for LLM rate limiting is tokens-per-minute (TPM), measured across both input and output. Input tokens are known at request time; output tokens must be estimated or tracked post-hoc. The practical approach is to count input tokens before admission and enforce a token budget per tenant per time window, then charge actual output tokens against that budget as they're generated.
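A minimal sketch of that bookkeeping, assuming a tiktoken-style tokenizer and a fixed 60-second window (the tenant IDs and budgets are made up):

```python
import time
import tiktoken  # assumes an OpenAI-style tokenizer; substitute your model's

enc = tiktoken.get_encoding("cl100k_base")

TPM_BUDGETS = {"tenant-a": 100_000, "tenant-b": 20_000}   # hypothetical limits
_usage: dict[str, tuple[float, int]] = {}                 # tenant -> (window_start, used)

def admit(tenant_id: str, prompt: str) -> bool:
    """Count input tokens and check the budget BEFORE dispatching the request."""
    now = time.monotonic()
    window_start, used = _usage.get(tenant_id, (now, 0))
    if now - window_start >= 60:                          # fixed window for brevity
        window_start, used = now, 0
    input_tokens = len(enc.encode(prompt))
    if used + input_tokens > TPM_BUDGETS.get(tenant_id, 0):
        return False                                      # reject or queue, don't dispatch
    _usage[tenant_id] = (window_start, used + input_tokens)
    return True

def charge_output(tenant_id: str, output_tokens: int) -> None:
    """Charge actual output tokens against the same window as they stream back."""
    window_start, used = _usage[tenant_id]
    _usage[tenant_id] = (window_start, used + output_tokens)
```

The fixed window is the simplest possible bookkeeping; the algorithms below refine how the budget refills and what happens when it runs out.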
Three algorithms dominate production implementations:
Token bucket fills at a steady rate and allows bursting up to bucket capacity. This models actual provider behavior well — most LLM APIs refill at a fixed TPM rate and allow short-term bursts. Token bucket is the right default for most applications because it enables burst-tolerant workloads while enforcing average-rate fairness.
Sliding window evaluates requests against a rolling time interval. It prevents the boundary exploitation that fixed windows allow (sending maximum requests at the end of one window and the beginning of the next), but is more complex to implement in distributed systems.
Priority queuing doesn't rate-limit by dropping requests — it degrades gracefully by queuing lower-priority requests when the system is under load. Interactive user requests get high priority; background batch jobs get processed when capacity is available. This is the right answer for applications where you'd rather add latency than return errors.
The overhead of token-aware rate limiting is negligible — around 4ms — compared to model generation time. There's no argument for skipping it.
One frequently missed consideration: rate limits must be enforced before the LLM call is dispatched, not after. Checking token budgets post-hoc means a runaway tenant has already consumed resources you can't reclaim. The enforcement point belongs at your API gateway or inference proxy, not in application code.
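Putting the token-bucket default together with that enforcement point, a per-tenant gate in the proxy might look roughly like this. The rates and burst capacities are illustrative, and in a distributed gateway the bucket state would live in something like Redis rather than process memory:

```python
import time
from dataclasses import dataclass

@dataclass
class TokenBucket:
    rate: float        # sustained tokens per second (TPM / 60)
    capacity: float    # burst allowance in tokens
    tokens: float = 0.0
    updated: float = 0.0

    def try_consume(self, n: int) -> bool:
        now = time.monotonic()
        if self.updated == 0.0:        # first use: start with a full bucket
            self.tokens, self.updated = self.capacity, now
        # Refill in proportion to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens < n:
            return False
        self.tokens -= n
        return True

_buckets: dict[str, TokenBucket] = {}

def gate(tenant_id: str, input_tokens: int) -> bool:
    """Runs in the gateway or inference proxy, before the model server sees the request."""
    bucket = _buckets.setdefault(
        tenant_id, TokenBucket(rate=100_000 / 60, capacity=20_000)  # hypothetical limits
    )
    return bucket.try_consume(input_tokens)
```

Actual output tokens can be charged against the same bucket as they stream, so a tenant that overruns its budget fails admission on its next request instead of degrading everyone else mid-generation.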
The Three Layers of Tenant Isolation
Multi-tenant LLM infrastructure requires isolation at three separate layers: the inference layer, the retrieval layer, and the context layer. Most teams implement one or two and discover the third the hard way.
Inference Layer Isolation
For most applications, shared model instances with per-tenant rate limiting and budget tracking are sufficient. The risk isn't that tenants share the model weights — they always do — it's that one tenant's resource consumption degrades another's latency. Rate limiting handles this.
For regulated industries or high-security requirements, dedicated model instances per tenant provide hard isolation. This is expensive: each dedicated instance means one or more GPUs reserved exclusively for one tenant, whether or not they're actively using them. The economics only work for high-value enterprise customers with strong compliance requirements.
The middle ground is namespace-level isolation: shared GPU resources with scheduler-enforced fairness. Modern serving systems like SGLang and vLLM support continuous batching, which interleaves tokens from different requests across the same GPU resources. With proper admission control and per-tenant token budgets, this achieves reasonable fairness without dedicated hardware. The serving layer you choose matters here: SGLang's RadixAttention memory manager, for example, delivers roughly 29% higher throughput than baseline vLLM on multi-turn workloads.
Retrieval Layer Isolation (RAG)
This is where most implementations have gaps. When tenants share a vector index, retrieval isolation requires more than adding a tenant_id filter. It requires enforcing that filter at query time, before results reach the generation layer.
The critical mistake is post-retrieval filtering: retrieving across the full index and then dropping results that belong to other tenants. This is a data leakage risk, because chunks from one tenant's documents land in another tenant's candidate set before the filter runs. It's also wasteful: you're paying to retrieve content you'll immediately throw away.
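As an illustration of pushing the filter into the query itself, here's a sketch using Qdrant's Python client; the collection name and payload field are assumptions, and most vector databases expose an equivalent query-time filter:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance

def retrieve_for_tenant(tenant_id: str, query_vector: list[float], k: int = 5):
    # The tenant condition is part of the ANN search itself, so other tenants'
    # chunks are never scored or returned. Post-retrieval filtering would instead
    # pull a cross-tenant candidate set and drop rows afterwards.
    return client.search(
        collection_name="documents",
        query_vector=query_vector,
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="tenant_id",
                    match=models.MatchValue(value=tenant_id),
                )
            ]
        ),
        limit=k,
    )
```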
Three patterns handle RAG isolation at different scale points:
Separate collections per tenant provides the cleanest isolation. Each tenant has their own vector collection. No cross-tenant queries are possible by construction. The tradeoff is operational overhead: spinning up new collections on every tenant onboarding, managing index sizes separately, and paying fixed costs even for tenants with small document volumes.
Namespaces within a shared index (Pinecone's approach) or partitioned collections (Qdrant's tiered multitenancy) offer a middle ground. Tenants are isolated within the index, but share underlying storage and compute. This handles the common case efficiently but doesn't provide hard isolation guarantees.

A fully shared collection with mandatory metadata filtering is the lowest-overhead option: every chunk carries a tenant_id, and every query carries a filter on it, enforced at query time as described above. It scales to the largest tenant counts, but the entire isolation guarantee rests on a filter that must never be omitted, so the filter belongs in a shared query layer that injects it automatically rather than in application code that has to remember it.
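To make the namespace pattern concrete, here's roughly what a tenant-scoped query looks like with Pinecone's Python SDK. The index name is hypothetical, and the redundant tenant_id metadata filter layers the shared-collection guard on top as defense in depth:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")            # assumed credentials
index = pc.Index("shared-rag-index")    # hypothetical shared index

def query_tenant(tenant_id: str, query_embedding: list[float], k: int = 5):
    # The namespace scopes the search by construction; the metadata filter is
    # a second, redundant guard inside the shared index.
    return index.query(
        vector=query_embedding,
        top_k=k,
        namespace=f"tenant-{tenant_id}",
        filter={"tenant_id": {"$eq": tenant_id}},
        include_metadata=True,
    )
```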
