The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features

12 min read
Tian Pan
Software Engineer

The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing — the code is fine. The actual culprit is a batch summarization job a different team launched that evening, sharing the same provider API key, which has eaten the account's per-minute token budget for the next four hours. Nobody owns the shared key. Nobody owns the limit.

This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.

Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.

Why LLM rate limits hit differently

Provider rate limits are typically expressed in two dimensions at once: requests per minute (RPM) and tokens per minute (TPM), with Anthropic splitting the token limit further into input and output buckets. Both limits are enforced as token buckets that refill continuously, so transient bursts are partially absorbed, but sustained pressure exhausts capacity quickly. A 70B-class model call can consume thousands of tokens in a single request, and a single tool-use loop in an agent can consume tens of thousands within a few seconds.
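To make the token-bucket behavior concrete, here is a minimal sketch (the capacity number is illustrative, not any provider's actual tier) showing how a burst of large completions drains a continuously refilling bucket in seconds, while the cheap calls arriving right behind it get refused:

```python
import time

class TokenBucket:
    """Continuously refilling bucket, the shape providers use for TPM limits."""

    def __init__(self, capacity_per_minute: float):
        self.capacity = capacity_per_minute
        self.level = capacity_per_minute                    # start full
        self.refill_per_second = capacity_per_minute / 60.0
        self.last_refill = time.monotonic()

    def try_consume(self, tokens: float) -> bool:
        now = time.monotonic()
        self.level = min(self.capacity,
                         self.level + (now - self.last_refill) * self.refill_per_second)
        self.last_refill = now
        if self.level >= tokens:
            self.level -= tokens
            return True
        return False

bucket = TokenBucket(capacity_per_minute=450_000)   # illustrative mid-tier TPM

# A batch job emitting 8K-token completions drains the bucket in one burst...
heavy_admitted = sum(bucket.try_consume(8_000) for _ in range(100))

# ...so the cheap 200-token classification calls arriving right after are
# mostly refused, despite barely registering on a request-count graph.
light_admitted = sum(bucket.try_consume(200) for _ in range(100))
print(heavy_admitted, light_admitted)   # roughly 56 heavy and only ~10 light calls admitted
```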

The mismatch with classical capacity planning is what makes this hard. In a typical web service, you scale by RPS and your per-request cost is roughly fixed. In an LLM service, two requests can differ in real cost by 100x depending on prompt size, output length, retries, and whether reasoning is enabled. A 1% increase in users can translate into a 30% increase in token consumption if those users happen to hit a feature with long contexts: if, say, the average request costs 1,000 tokens but the long-context feature costs 30,000, one extra user per hundred on that feature adds roughly 30% to total consumption.

There is no native provider-side concept of "this feature owns 40% of the account's TPM and the rest of you split the remainder." OpenAI and Anthropic both expose account-level or workspace-level limits, not feature-level allocations. If you have one API key per provider and three features behind it, you have implicitly granted whichever feature is loudest the right to starve the other two.

The detection problem comes before the isolation problem

Most teams skip detection and jump to remediation. This produces architectures that may or may not be isolating noisy neighbors, with nobody able to prove it either way. The detection layer needs three signals that are surprisingly absent from default observability:

The first is per-feature TPM consumption, broken down by direction (input vs. output). Not request counts. Not latency. Not error rates. Token throughput by feature. The reason: a feature can be quiet in request volume and ruinous in token volume simultaneously, and the provider 429 will hit features doing high-RPS, low-TPM work first. If your dashboards track request counts only, the team that gets paged is never the team that caused the incident.
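A minimal way to get this signal, assuming a Prometheus-style counter in the gateway (the metric name and the feature label are conventions invented here, not something the provider emits for you):

```python
from prometheus_client import Counter

# Token throughput by feature and direction: the signal that request-count
# dashboards miss.
llm_tokens = Counter(
    "llm_tokens_total",
    "Tokens consumed, by feature and direction",
    ["feature", "direction"],
)

def record_usage(feature: str, usage) -> None:
    # Anthropic responses carry usage.input_tokens / usage.output_tokens;
    # OpenAI uses prompt_tokens / completion_tokens, so adapt the attribute names.
    llm_tokens.labels(feature=feature, direction="input").inc(usage.input_tokens)
    llm_tokens.labels(feature=feature, direction="output").inc(usage.output_tokens)

# Called in the gateway after every provider response:
#   record_usage("chat-assistant", response.usage)
```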

The second is headroom-to-limit ratio per minute. Absolute TPM numbers tell you what you used; the ratio against your provider tier tells you when the next spike will tip over. Once this ratio crosses ~70%, you are one bursty job away from a multi-feature outage. Most monitoring stacks I've seen alert on 429 responses, which is too late — the outage has already started.
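A sketch of that check, with an illustrative tier limit and a print standing in for whatever paging hook you actually use:

```python
PROVIDER_TPM_LIMIT = 1_000_000    # your tier's limit; illustrative number
HEADROOM_ALERT_RATIO = 0.70       # the ~70% threshold from above

def check_headroom(tokens_used_last_minute: int) -> float:
    """Return the headroom-to-limit ratio and alert before the 429s start."""
    ratio = tokens_used_last_minute / PROVIDER_TPM_LIMIT
    if ratio >= HEADROOM_ALERT_RATIO:
        # Replace with your paging integration; printing keeps the sketch self-contained.
        print(f"ALERT: account at {ratio:.0%} of TPM, one bursty job from a multi-feature outage")
    return ratio
```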

The third is share-of-budget by feature, computed as a rolling window. This is the actual signal that catches a noisy neighbor: a feature whose share of total token consumption doubles in 10 minutes, regardless of whether the absolute number triggers a limit. Without this signal, you'll discover the noisy neighbor only when something downstream fails.

A practical detection threshold: page the platform team if any single feature exceeds 60% of total token throughput for more than two consecutive five-minute windows, unless that feature is on a dedicated allocation. The escape hatch matters because some features deserve to dominate; the alert matters because most features that dominate are doing so by accident.
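Here is one way to express that rule, reading it as a breach held across two consecutive five-minute windows (feature names and the data shape are illustrative):

```python
from collections import defaultdict

SHARE_THRESHOLD = 0.60            # page above 60% of total token throughput
CONSECUTIVE_WINDOWS = 2           # two consecutive five-minute windows
DEDICATED = {"batch-summarizer"}  # features on a dedicated allocation are exempt

def noisy_neighbors(windows: list[dict[str, int]]) -> set[str]:
    """windows: one dict per completed five-minute window, feature -> tokens consumed,
    oldest first, e.g. [{"chat-assistant": 120_000, "doc-search": 480_000}, ...]."""
    breaches = defaultdict(int)
    for window in windows[-CONSECUTIVE_WINDOWS:]:
        total = sum(window.values()) or 1
        for feature, tokens in window.items():
            if feature not in DEDICATED and tokens / total > SHARE_THRESHOLD:
                breaches[feature] += 1
    # Page only when the breach held for every one of the recent windows.
    return {f for f, hits in breaches.items() if hits >= CONSECUTIVE_WINDOWS}
```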

Isolation pattern 1: per-feature token buckets

The simplest isolation pattern is to allocate each feature a fixed slice of the provider's token budget and enforce it locally — usually inside an LLM gateway or proxy. If your tier provides 1,000,000 TPM, you might give the chat assistant 600,000, the batch summarizer 200,000, and reserve 200,000 as a global pool for new features and bursts.

Two things make this pattern easier to talk about than to operate. The first is that allocations need to sum to slightly less than the provider limit, not exactly equal — provider rate limit accounting has measurement lag, and a 100% allocation will trigger 429s before your local accounting thinks it should. A useful default is to size local buckets to 85% of the provider limit and treat the remaining 15% as headroom for measurement skew and cross-region replication.
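Putting the two together, here is a sketch of static per-feature buckets sized against 85% of the provider limit, reusing the TokenBucket class from earlier (the split and the spill-into-the-shared-pool behavior are illustrative choices, not the only way to spend the reserve):

```python
PROVIDER_TPM = 1_000_000
USABLE_TPM = int(PROVIDER_TPM * 0.85)      # keep 15% headroom for accounting skew

ALLOCATIONS = {                            # fractions of the usable budget
    "chat-assistant":   0.60,
    "batch-summarizer": 0.20,
    "shared-pool":      0.20,              # reserve for new features and bursts
}

buckets = {
    feature: TokenBucket(capacity_per_minute=share * USABLE_TPM)
    for feature, share in ALLOCATIONS.items()
}

def admit(feature: str, estimated_tokens: int) -> bool:
    # Enforced in the gateway before the provider call ever goes out.
    bucket = buckets.get(feature, buckets["shared-pool"])
    # One way to spend the reserve: spill into the shared pool when a
    # feature's own slice is empty.
    return (bucket.try_consume(estimated_tokens)
            or buckets["shared-pool"].try_consume(estimated_tokens))
```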

The second is that fixed allocations starve each other under uneven load. The chat assistant goes quiet at 3 AM while the batch job hits its ceiling and gets throttled locally, even though the provider has plenty of spare capacity. Dynamic reallocation systems address this by computing each feature's allowed share as a weighted fraction of currently-active features rather than as a static number. If only the batch job is running, it gets the full pool; if the chat assistant wakes up, the batch job's allocation shrinks proportionally over a few seconds.
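A sketch of the reallocation step, with made-up weights; the point is that shares are computed over the features active right now, not over everything that exists:

```python
WEIGHTS = {"chat-assistant": 3.0, "batch-summarizer": 1.0, "internal-tools": 1.0}

def current_allocations(active_features: set[str], usable_tpm: int) -> dict[str, int]:
    """Split the usable budget across whichever features are active right now.
    Recompute every few seconds and resize the local buckets to match."""
    active = {f: WEIGHTS[f] for f in active_features if f in WEIGHTS}
    total_weight = sum(active.values()) or 1.0
    return {f: int(usable_tpm * w / total_weight) for f, w in active.items()}

# The batch job alone overnight gets the whole pool:
current_allocations({"batch-summarizer"}, 850_000)
#   -> {"batch-summarizer": 850000}

# When the chat assistant wakes up, its higher weight reclaims most of it:
current_allocations({"batch-summarizer", "chat-assistant"}, 850_000)
#   -> {"chat-assistant": 637500, "batch-summarizer": 212500}
```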

Dynamic allocation is the right default for most teams. Static allocation is correct when you have hard contractual SLAs that need to be defensible — auditors prefer fixed numbers to fairness algorithms.

Isolation pattern 2: dedicated model pools for revenue-critical flows

When isolation matters more than efficiency, the answer is to stop sharing the provider account at all. Revenue-critical paths get their own provider workspace, their own API key, their own rate limits, and their own billing. The chat assistant that drives renewals does not share quota with internal tooling, period.
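In code, the isolation is just routing: each feature resolves to its own key, and the key is scoped to its own workspace, so its rate limits and its bill are its own. The sketch below assumes the Anthropic Python SDK; the environment variable and feature names are illustrative:

```python
import os
import anthropic

FEATURE_KEYS = {
    "chat-assistant":   os.environ["ANTHROPIC_KEY_CHAT_PROD"],   # dedicated workspace
    "batch-summarizer": os.environ["ANTHROPIC_KEY_INTERNAL"],    # shared internal workspace
    "internal-tools":   os.environ["ANTHROPIC_KEY_INTERNAL"],
}

def client_for(feature: str) -> anthropic.Anthropic:
    # A runaway internal job can now only exhaust its own workspace's limits;
    # the chat assistant that drives renewals never sees those 429s.
    return anthropic.Anthropic(api_key=FEATURE_KEYS[feature])
```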
