Skip to main content

14 posts tagged with "rate-limiting"

View all tags

The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing — the code is fine. The actual culprit is a batch summarization job a different team launched that evening, sharing the same provider API key, which has eaten the account's per-minute token budget for the next four hours. Nobody owns the shared key. Nobody owns the limit.

This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.

Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.

Backpressure Patterns for LLM Pipelines: Why Exponential Backoff Isn't Enough

· 10 min read
Tian Pan
Software Engineer

During peak usage, some LLM providers experience failure rates exceeding 20%. When your system hits that wall and responds by doubling its wait time and retrying, you are solving the wrong problem. Exponential backoff handles a single call's resilience. It does nothing for the system as a whole — nothing for wasted tokens, nothing for connection pool exhaustion, nothing for the 50 other requests queued behind the one that just got a 429.

The traffic patterns hitting LLM APIs have also changed fundamentally. Simple sub-100-token queries dropped from 80% to roughly 20% of traffic between 2023 and 2025, while requests over 500 tokens became the consistent majority. Agentic workflows chain 10–20 sequential calls in rapid bursts, generating traffic patterns that look indistinguishable from a DDoS attack under traditional request-per-minute rate limits. The infrastructure built for REST APIs with predictable payloads is not the infrastructure you need for LLM pipelines.