The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features
The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing — the code is fine. The actual culprit is a batch summarization job a different team launched that evening. It shares the same provider API key, and its sustained load will keep the account's per-minute token budget exhausted for the next four hours. Nobody owns the shared key. Nobody owns the limit.
This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.
Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.
Why LLM rate limits hit differently
Provider rate limits are typically expressed in two dimensions at once: requests per minute and tokens per minute, with Anthropic splitting tokens further into input and output buckets. These limits are enforced as token buckets — they refill continuously, so transient bursts are partially absorbed, but sustained pressure exhausts capacity quickly. A 70B-class model call can consume thousands of tokens in a single request, and a single tool-use loop in an agent can consume tens of thousands across a few seconds.
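To make the refill dynamic concrete, here is a minimal token bucket in Python — an illustrative sketch of the algorithm, not any provider's actual accounting:

```python
import time

class TokenBucket:
    """Continuously refilling token bucket (illustrative sketch)."""

    def __init__(self, capacity_tpm: float):
        self.capacity = capacity_tpm          # most tokens the bucket can hold
        self.refill_rate = capacity_tpm / 60  # tokens restored per second
        self.tokens = capacity_tpm
        self.last = time.monotonic()

    def try_consume(self, n_tokens: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= n_tokens:
            self.tokens -= n_tokens
            return True
        return False  # caller should queue or back off, not hammer retries
```

The numbers fall out directly: a 1,000,000-TPM tier refills at roughly 16,667 tokens per second, so one 8K-token completion eats about half a second of refill — which is why short bursts are absorbed but sustained pressure is not.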
The mismatch with classical capacity planning is what makes this hard. In a typical web service, you scale by RPS and your per-request cost is roughly fixed. In an LLM service, two requests can differ in real cost by 100x depending on prompt size, output length, retries, and whether reasoning is enabled. A 1% increase in users can translate into a 30% increase in token consumption if those users happen to hit a feature with long contexts.
There is no native provider-side concept of "this feature owns 40% of the account's TPM and the rest of you split the remainder." OpenAI and Anthropic both expose account-level or workspace-level limits, not feature-level allocations. If you have one API key per provider and three features behind it, you have implicitly granted whichever feature is loudest the right to starve the other two.
The detection problem comes before the isolation problem
Most teams skip detection and jump to remediation. This produces architectures that might be isolating noisy neighbors, though nobody can prove it. The detection layer needs three signals that are surprisingly absent from default observability:
The first is per-feature TPM consumption, broken down by direction (input vs. output). Not request counts. Not latency. Not error rates. Token throughput by feature. The reason: a feature can be quiet in request volume and ruinous in token volume simultaneously, and the provider 429 will hit features doing high-RPS, low-TPM work first. If your dashboards track request counts only, the team that gets paged is never the team that caused the incident.
The second is headroom-to-limit ratio per minute. Absolute TPM numbers tell you what you used; the ratio against your provider tier tells you when the next spike will tip over. Once this ratio crosses ~70%, you are one bursty job away from a multi-feature outage. Most monitoring stacks I've seen alert on 429 responses, which is too late — the outage has already started.
The third is share-of-budget by feature, computed as a rolling window. This is the actual signal that catches a noisy neighbor: a feature whose share of total token consumption doubles in 10 minutes, regardless of whether the absolute number triggers a limit. Without this signal, you'll discover the noisy neighbor only when something downstream fails.
A practical detection threshold: page the platform team if any single feature exceeds 60% of total token throughput for more than two consecutive five-minute windows, unless that feature is on a dedicated allocation. The escape hatch matters because some features deserve to dominate; the alert matters because most features that dominate are doing so by accident.
Isolation pattern 1: per-feature token buckets
The simplest isolation pattern is to allocate each feature a fixed slice of the provider's token budget and enforce it locally — usually inside an LLM gateway or proxy. If your tier provides 1,000,000 TPM, you might give the chat assistant 600,000, the batch summarizer 200,000, and reserve 200,000 as a global pool for new features and bursts.
Two things make this pattern easier to talk about than to operate. The first is that allocations need to sum to slightly less than the provider limit, not exactly equal — provider rate limit accounting has measurement lag, and a 100% allocation will trigger 429s before your local accounting thinks it should. A useful default is to size local buckets to 85% of the provider limit and treat the remaining 15% as headroom for measurement skew and cross-region replication.
The second is that fixed allocations starve each other under uneven load. The chat assistant gets quiet at 3 AM while the batch job hits its ceiling and gets local-throttled — even though the provider has plenty of capacity. Dynamic reallocation systems address this by computing each feature's allowed share as a weighted fraction of currently-active features rather than as a static number. If only the batch job is running, it gets the full pool; if the chat assistant wakes up, the batch job's allocation shrinks proportionally over a few seconds.
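A minimal version of that weighted dynamic reallocation, with the 85% headroom factor from above baked in (the feature names and weights are hypothetical):

```python
PROVIDER_TPM = 1_000_000
LOCAL_FRACTION = 0.85  # leave 15% headroom for provider measurement skew

# Static weights express relative priority, not absolute TPM.
WEIGHTS = {"chat_assistant": 6, "batch_summarizer": 2, "experiments": 2}

def current_allocations(active_features: set[str]) -> dict[str, float]:
    """Split the local budget across whichever features are active right now."""
    budget = PROVIDER_TPM * LOCAL_FRACTION
    live = {f: w for f, w in WEIGHTS.items() if f in active_features}
    total_weight = sum(live.values())
    if total_weight == 0:
        return {}
    return {f: budget * w / total_weight for f, w in live.items()}
```

When only the batch job is active it receives the entire 850,000-TPM local budget; the moment the chat assistant wakes up, the same call returns a 6:2 split, and the batch job's bucket shrinks on the next refresh tick.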
Dynamic allocation is the right default for most teams. Static allocation is correct when you have hard contractual SLAs that need to be defensible — auditors prefer fixed numbers to fairness algorithms.
Isolation pattern 2: dedicated model pools for revenue-critical flows
When isolation matters more than efficiency, the answer is to stop sharing the provider account at all. Revenue-critical paths get their own provider workspace, their own API key, their own rate limits, and their own billing. The chat assistant that drives renewals does not share quota with internal tooling, period.
This sounds wasteful — you're paying for unused headroom on each pool — but the math usually works out in favor of separation. The hidden cost of shared infrastructure is the cost of incidents, and a single multi-hour outage of a paid product generally costs more than a year of unused capacity on a dedicated key.
The pattern extends naturally to model selection. Revenue-critical flows often want a specific model with predictable behavior; experimental features can route to whatever model is cheapest or fastest this quarter. Mixing those traffic profiles on the same key means a model-routing change in an experiment can shift load on the production model without anyone noticing until users start complaining.
Where I've seen this go wrong: teams set up dedicated keys, then route fallback traffic from the shared pool to the dedicated key when the shared pool is exhausted. The dedicated key is now a noisy neighbor of itself, and the isolation is theatrical. Fallbacks should go down in priority, not up.
Isolation pattern 3: priority queues with preemption
Rate limits hit a hard ceiling. Once you're at the ceiling, somebody has to wait. The decision of who waits is the priority queue.
The simplest priority scheme has two tiers: real-time user-facing requests and everything else. When the bucket is full, real-time requests skip the queue and batch jobs get backed up. This works well for most product surfaces because user-facing traffic is rarely the dominant token consumer; it's the batch jobs and the periodic backfills that drain the bucket.
A three-tier scheme adds preemption: high-priority requests can cancel in-flight low-priority requests if the queue is starving. This is dangerous to operate — you're spending tokens on requests that get killed mid-flight — but for systems with very bursty user-facing traffic against a backdrop of constant batch load, preemption is the only way to keep p99 latency from collapsing during spikes.
The trap is that priority becomes meaningless when everything is high-priority. Without active governance, every product manager will lobby for tier-one status, the queue collapses to a single tier, and you're back to first-come-first-served. The platform team needs the political authority to keep the priority assignments honest, which means the priority decision has to live somewhere with org-wide visibility — not buried in an env var on each service.
Isolation pattern 4: per-tenant cost ceilings
Tenancy in AI features is a two-layer concept. There's the internal tenancy of which feature owns which slice — the three patterns above. There's also the customer-facing tenancy of how individual users or accounts are bounded. The latter is what stops a single enterprise customer's runaway agent from burning your monthly margin.
Cost ceilings work best when they're enforced in-memory at the gateway with no database round-trip per request. The check has to be cheap or it becomes a latency tax on every request. A workable design holds a sliding-window estimate of per-tenant token consumption in a shared memory store (Redis is the common choice), updates it on each completion, and rejects requests when the projected cost over the next minute would exceed the tenant's allocation.
The interesting product design question isn't the architecture — it's what to do when the ceiling is reached. Hard rejection is honest but jarring. Quality degradation (route to a smaller, cheaper model when the tenant is over their soft budget) preserves the user experience at the cost of explainability. Most production systems land on a hybrid: degrade quality on soft-limit, hard-reject on hard-limit, and surface both states clearly to the customer's admin.
A subtle but expensive failure mode: tracking usage in the application code rather than the gateway. The application sees the request before the model is called and updates the counter optimistically. If the request fails, the counter is wrong, and over weeks of accumulated drift, tenants get throttled or over-served unpredictably. Counters should be updated based on actual completion responses, not estimates from request payloads.
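Putting the last two points together, here is a sketch of a per-tenant ceiling that makes the admission decision from an estimate but charges the counter only from the actual completion. This in-memory version stands in for the Redis-backed store a multi-replica gateway would need; the names and limits are illustrative:

```python
import time
from collections import defaultdict, deque

class TenantCeiling:
    """Sliding-window per-tenant ceiling with soft and hard limits
    (illustrative sketch; production would share the window via Redis)."""

    def __init__(self, soft_tpm: int, hard_tpm: int, window_s: int = 60):
        self.soft, self.hard, self.window = soft_tpm, hard_tpm, window_s
        self.usage = defaultdict(deque)  # tenant -> (timestamp, actual_tokens)

    def _window_total(self, tenant: str, now: float) -> int:
        q = self.usage[tenant]
        while q and q[0][0] < now - self.window:
            q.popleft()
        return sum(tokens for _, tokens in q)

    def admit(self, tenant: str, estimated_tokens: int, now=None) -> str:
        """Admission decision only; deliberately does NOT update the counter."""
        now = now or time.time()
        projected = self._window_total(tenant, now) + estimated_tokens
        if projected > self.hard:
            return "reject"      # hard ceiling: refuse the request outright
        if projected > self.soft:
            return "degrade"     # soft ceiling: route to a cheaper model
        return "full_quality"

    def settle(self, tenant: str, actual_tokens: int, now=None) -> None:
        """Charge the tenant from the completion response, not the estimate."""
        self.usage[tenant].append((now or time.time(), actual_tokens))
```

Because only `settle` writes, a failed request simply never gets charged, which is the fix for the optimistic-counter drift described above.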
The org-structure implication nobody wants
Every team I've watched go through this transition has resisted the same conclusion: shared LLM infrastructure eventually requires a platform team. Not because the technology is hard, but because the governance is hard. Allocation decisions have to be made centrally, priority assignments have to be defended against political pressure, cost ceilings have to be enforced uniformly, and detection has to be wired into a single dashboard that any feature team can read.
When this responsibility is left informal, what happens is predictable. The team that owns the original API key becomes the de facto platform team without the budget or mandate. They get paged for everyone else's incidents but can't enforce limits because they have no authority over other teams' roadmaps. They burn out, the system rots, and a year later the company hires an "AI infrastructure" team that has to redo all the same decisions from scratch.
The right move is to recognize early that an LLM gateway is platform-team-shaped infrastructure, the same way Kubernetes or a service mesh is. It needs an owner, an SLO, an oncall rotation, and a budget. The decision to introduce that team is usually made at exactly the wrong moment — after the incident, when the company is reactive — but it's still better to make it then than to keep paying for it in surprise outages.
What "good" looks like at steady state
A team that has solved this problem has a few visible properties. There is one place where every feature's token consumption is graphed against its allocated share. Every API key has a documented owner, an explicit list of features it serves, and a tier classification. New features cannot enter production until they've been allocated a slice of capacity by the platform team, and that allocation is reviewed quarterly against actual usage. Cost ceilings on enterprise tenants are enforced by default, with explicit overrides logged.
Most teams will not need all four isolation patterns at once. Start with detection — you cannot fix what you cannot see — then add per-feature buckets. Dedicated pools come next, when a single feature's revenue impact justifies its own pool. Priority queues are for the systems large enough that capacity contention is constant rather than occasional. Per-tenant ceilings are for systems with external customers whose usage you cannot bound by trust.
The work is not glamorous. No one ships a "we successfully avoided an outage at 2:47 AM" feature. But this is the kind of infrastructure that decides whether your AI products feel reliable to the people who pay for them, and it almost always pays for itself the first time it prevents an incident that would have made the news.
