
Quota Starvation: When Your AI Features Eat Each Other's Rate Limits

11 min read
Tian Pan
Software Engineer

At 2 AM, a scheduled report-generation job spins up fifty parallel LLM requests against your shared API key. By the time the 9 AM product demo starts, every real-time chat completion is silently timing out. Your error dashboards are green. No 429s in the logs. The model is returning responses — just ten seconds late, on a feature with a two-second SLA.

This is quota starvation. It does not look like an outage. It looks like the AI is "slow today."

Every LLM provider imposes rate limits on API access: tokens per minute (TPM), requests per minute (RPM), or both. When a product has multiple AI-powered features — a chat assistant, a search relevance scorer, a report generator, an email classifier — and all of them share a single API key, they are competing for the same finite quota pool. The feature that wins is whichever one happens to issue requests first. There is no negotiation, no priority, no fairness. It is first-come, first-served by construction.

The engineering pain here is not the rate limit itself. It is that the competition is invisible. You built each feature believing it had access to the full quota. You never explicitly decided that the batch job should be allowed to starve the chat assistant. The priority was set implicitly, at two in the morning, by a cron job.

Why Shared Quota Is a Hidden Concurrency Problem

Rate limits at major providers are more nuanced than a single cap. Anthropic tracks input TPM, output TPM, and RPM independently — a combined token count can be misleading because a feature that generates long completions can exhaust output quota while input quota stays untouched. OpenAI limits scale with tier and cumulative spend. Gemini Flash allows 4 million TPM on paid tiers with no minimum spend.

What these providers share is the enforcement model: quota is a shared pool across all requests on a key, depleted in real time and refilled at a steady rate. The refill is continuous — roughly speaking, a 400K TPM limit refreshes at about 6,700 tokens per second. When a batch job maintains fifty in-flight requests and the completions consume tokens faster than the refill rate, the bucket stays empty. Interactive requests arrive, find no quota available, and wait.

This is the distributed systems problem of resource starvation applied to an API. The background workload is not doing anything wrong by its own design. It issued requests, consumed quota, and got results. The problem is the absence of any mechanism to reserve capacity for higher-priority workloads.

Three failure modes emerge from unconstrained shared quota:

Starvation — a high-throughput batch consumer holds quota continuously, and lower-throughput interactive requests can never acquire enough to proceed. The batch job does not have to be doing anything malicious; it simply has enough parallelism to keep the pool depleted.

Head-of-line blocking — when requests are served in strict FIFO order, a large prompt that consumes thousands of tokens delays every request queued behind it. A user expecting a two-second response waits while someone's document-summarization job drains the window.

Priority inversion — a low-priority job acquires quota that a high-priority job needs, and the high-priority job must wait. In shared-quota systems this happens silently: there is no mechanism to preempt or reorder. The inversion is invisible until you start debugging latency.

The Minimum Viable Fix: Separate API Keys

The fastest way to eliminate cross-feature quota starvation is to stop sharing keys. Provision a dedicated API key for interactive, user-blocking features and a separate key for background batch workloads. Each key gets its own quota envelope. A background report job consuming its entire batch key cannot touch the interactive key's TPM at all.

This is not always possible — provider tiers often tie quota ceilings to spend, and splitting keys can fragment your tier progression — but even partial separation is valuable. Grouping features into two buckets (real-time vs. asynchronous) already eliminates the worst failure mode: the 2 AM cron job starving the 9 AM demo.
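Where separation is possible, the routing can be a thin wrapper around the provider client. A minimal sketch, assuming the OpenAI Python SDK and two illustrative environment variable names for the two keys:

```python
import os
from openai import OpenAI

# One client per quota envelope; the env var names are illustrative.
_CLIENTS = {
    "interactive": OpenAI(api_key=os.environ["OPENAI_KEY_INTERACTIVE"]),
    "batch": OpenAI(api_key=os.environ["OPENAI_KEY_BATCH"]),
}

def client_for(workload: str) -> OpenAI:
    """Return the client bound to the key for this workload class.

    Unknown workloads fall back to the batch key so a new feature
    cannot silently spend the interactive quota.
    """
    return _CLIENTS.get(workload, _CLIENTS["batch"])

# The chat endpoint uses the interactive envelope; the nightly report
# job uses the batch envelope:
#   client_for("interactive").chat.completions.create(...)
#   client_for("batch").chat.completions.create(...)
```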

Separate keys also make observability trivial. You can instrument each key independently and see which workload is approaching its limit without complex attribution logic.

Priority Queuing When You Cannot Separate Keys

When key separation is not feasible — multiple teams sharing a consolidated enterprise account, providers that gate quota tier by total spend rather than per-key — you need application-level quota governance.

The token bucket algorithm is the most natural fit for LLM rate limiting. Your application maintains an internal token balance that replenishes at the provider's allowed TPM rate. Before issuing any request, the application estimates the token cost and deducts it from the bucket. Requests that would overdraw the bucket wait. This converts the provider's enforcement — which happens at the network boundary, after you have already made the API call — into local enforcement that can apply priority rules.
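A minimal single-process sketch of that local enforcement; the token-cost estimate is left to a hypothetical estimate_tokens helper (in practice, the provider's tokenizer or a character-count heuristic):

```python
import threading
import time

class TokenBucket:
    """Local token bucket that mirrors the provider's TPM limit."""

    def __init__(self, tpm_limit: int):
        self.capacity = tpm_limit            # max tokens the bucket can hold
        self.tokens = float(tpm_limit)       # current balance
        self.refill_per_sec = tpm_limit / 60.0
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now

    def acquire(self, estimated_tokens: int) -> None:
        """Block until the estimated cost can be deducted."""
        while True:
            with self.lock:
                self._refill()
                if self.tokens >= estimated_tokens:
                    self.tokens -= estimated_tokens
                    return
            time.sleep(0.05)  # wait for refill rather than overdrawing

# bucket = TokenBucket(tpm_limit=400_000)
# bucket.acquire(estimate_tokens(prompt) + max_output_tokens)  # hypothetical helper
# response = client.chat.completions.create(...)
```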

A priority queue sits in front of the token bucket. Incoming LLM requests are classified by priority before they enter the queue:

  • Real-time user interactions (chat completions, search ranking, recommendations)
  • Time-sensitive background tasks (webhooks, near-real-time enrichments)
  • Scheduled batch jobs (report generation, bulk classification)
  • Internal tooling and debug queries

When the token bucket is full, all tiers proceed freely. When the bucket drops below a threshold — say, 20% remaining — the queue cuts off the lower tiers first. Batch jobs pause. Near-real-time tasks throttle. Interactive requests continue to drain the remaining capacity.
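One way to express that gating, assuming the TokenBucket sketch above and Python's built-in queue.PriorityQueue, with lower numbers meaning higher priority:

```python
import itertools
import queue

# Tier numbers match the list above: lower number = higher priority.
TIER_INTERACTIVE, TIER_NEAR_REAL_TIME, TIER_BATCH, TIER_INTERNAL = 0, 1, 2, 3

class PriorityDispatcher:
    """Dequeue requests by tier, gating lower tiers when quota runs low."""

    def __init__(self, bucket, low_water_fraction=0.2):
        self.bucket = bucket                    # the TokenBucket above
        self.low_water = low_water_fraction
        self.q = queue.PriorityQueue()
        self._seq = itertools.count()           # FIFO tie-breaker within a tier

    def submit(self, tier, request):
        self.q.put((tier, next(self._seq), request))

    def next_request(self):
        """Return the next request allowed to run, or None if it must wait."""
        tier, seq, request = self.q.get()
        # A slightly stale read of the bucket is fine; it refills on acquire.
        quota_low = self.bucket.tokens < self.bucket.capacity * self.low_water
        if quota_low and tier > TIER_INTERACTIVE:
            self.q.put((tier, seq, request))    # requeue; lower tiers wait
            return None
        return request
```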

One pathology to guard against: a batch job that is paused indefinitely is as bad as one that is never prioritized. Implement priority promotion — if a lower-priority request has been waiting longer than a configured threshold, its priority rises. This prevents permanent starvation of batch workloads during sustained high load.
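A sketch of that promotion rule, assuming queue entries carry their enqueue timestamp (a small variant of the dispatcher above):

```python
import time

PROMOTION_AFTER_SECONDS = 120   # example threshold; tune per workload

def maybe_promote(entry):
    """Raise the priority of an entry that has waited past the threshold.

    `entry` is (tier, enqueued_at, request); tier 0 is already the highest.
    Returns a possibly promoted copy of the entry.
    """
    tier, enqueued_at, request = entry
    waited = time.monotonic() - enqueued_at
    if tier > 0 and waited > PROMOTION_AFTER_SECONDS:
        return (tier - 1, enqueued_at, request)
    return entry
```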

For multi-instance deployments, synchronize the token bucket state through Redis using a sliding window counter. Per-instance local buckets accumulate error when many instances run concurrently: each instance believes it has quota that collectively exceeds what the provider allows.
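A sketch of the shared window, assuming the redis-py client and a hypothetical key name; each instance records its token cost with a timestamp and sums the trailing minute before dispatching:

```python
import time
import uuid
import redis

r = redis.Redis()               # shared by all instances
WINDOW_KEY = "llm:tpm:window"   # hypothetical key name
WINDOW_SECONDS = 60

def record_usage(token_cost: int) -> None:
    """Record one request's token cost in the shared sliding window."""
    now = time.time()
    member = f"{uuid.uuid4()}:{token_cost}"     # unique member, cost embedded
    pipe = r.pipeline()
    pipe.zadd(WINDOW_KEY, {member: now})                        # score = timestamp
    pipe.zremrangebyscore(WINDOW_KEY, 0, now - WINDOW_SECONDS)  # drop expired entries
    pipe.expire(WINDOW_KEY, WINDOW_SECONDS * 2)
    pipe.execute()

def tokens_used_last_minute() -> int:
    """Sum token costs recorded across all instances in the last 60 seconds."""
    now = time.time()
    members = r.zrangebyscore(WINDOW_KEY, now - WINDOW_SECONDS, now)
    return sum(int(m.decode().rsplit(":", 1)[1]) for m in members)

# Before dispatching, every instance checks the shared window:
#   if tokens_used_last_minute() + estimated_cost > PROVIDER_TPM_LIMIT: wait.
```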

Making the Priority Decision Explicit

The governance question underneath quota starvation is: who decides which features get resources when there is not enough for everyone?

Without explicit governance, that decision is made by accident — by deployment timing, by which team's cron job runs first, by which engineer had the most aggressive retry configuration. Making the decision explicit means writing it down and enforcing it in code.

A practical starting point is a priority tier document owned jointly by engineering and product. Each LLM-powered feature is classified into a tier. The tier determines:

  • Maximum concurrency (how many in-flight requests a feature can hold at once)
  • Queue position when quota is constrained
  • Whether the feature has a degraded fallback that activates when it cannot get quota

The classification requires product input because it is fundamentally a business decision: when we have to choose between the chat assistant and the report generator, which matters more to the user? Engineers should not be making that call unilaterally.

Once the classification exists, enforce it automatically. The priority queue configuration is code, checked in, reviewed. When a new feature ships, it must declare its priority tier before it can make LLM calls. The default for uncategorized requests should be low priority — new features do not get to inherit the quota budget of established ones without review.
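A sketch of what that checked-in declaration might look like; the feature names and numbers are purely illustrative, and the lookup defaults to the lowest tier so an undeclared feature cannot jump the queue:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierPolicy:
    tier: int              # 0 = interactive ... 3 = internal tooling
    max_concurrency: int   # in-flight request ceiling for the feature
    has_fallback: bool     # whether a degraded path exists when quota is gone

# Illustrative entries; the real file is owned jointly by eng and product.
PRIORITY_TIERS = {
    "chat-assistant":   TierPolicy(tier=0, max_concurrency=40, has_fallback=True),
    "search-ranking":   TierPolicy(tier=0, max_concurrency=20, has_fallback=True),
    "report-generator": TierPolicy(tier=2, max_concurrency=5,  has_fallback=False),
}

DEFAULT_POLICY = TierPolicy(tier=3, max_concurrency=2, has_fallback=False)

def policy_for(feature: str) -> TierPolicy:
    """Uncategorized features get the lowest tier until they are reviewed."""
    return PRIORITY_TIERS.get(feature, DEFAULT_POLICY)
```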

Observability That Detects Starvation Before Users Do

Standard error rate monitoring will not catch quota starvation. The model is responding; requests are completing; 429s are rare or absent because your backoff logic handles retries transparently. What changes under starvation is latency.

Monitor P99 latency per feature type, not just aggregate P99. A batch job can be slow (its latency is degrading gracefully) while a chat feature is silently timing out (its P99 is spiking). Aggregate P99 might look stable while user-facing features are on fire.

Alert on the ratio of interactive to batch quota consumption. Under normal conditions, interactive features should consume a predictable fraction of quota. When batch consumption rises relative to interactive — especially at times of day when user traffic is expected — that is a leading indicator of starvation risk.

Queue depth is a lagging but clear indicator. A growing queue of pending LLM requests while request rate is stable means requests are not draining. That is head-of-line blocking or token exhaustion.

Track quota utilization as a first-class metric: current TPM as a percentage of the key's limit, sampled continuously. Alert at 80% sustained utilization for more than five minutes. At that level, any load spike will cause starvation. The alert gives you time to investigate before users notice.
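A sketch of that metric, assuming the Prometheus Python client; the 80%-for-five-minutes alert rule itself lives in the monitoring system, not in application code:

```python
from prometheus_client import Gauge

# Fraction of the key's TPM limit consumed over the trailing minute,
# labeled by key and by feature so starvation can be attributed.
QUOTA_UTILIZATION = Gauge(
    "llm_quota_utilization_ratio",
    "Trailing-minute TPM usage as a fraction of the key's limit",
    ["api_key_alias", "feature"],
)

def report_utilization(api_key_alias: str, feature: str,
                       tokens_last_minute: int, tpm_limit: int) -> None:
    QUOTA_UTILIZATION.labels(api_key_alias, feature).set(
        tokens_last_minute / tpm_limit
    )

# Example alert intent (expressed in the monitoring system, not here):
#   fire when llm_quota_utilization_ratio > 0.8 for 5 minutes on any key.
```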

One subtle gotcha: in multi-instance systems, quota utilization reported by a single instance reflects only that instance's local state. Aggregate the metric across instances — total requests per minute across all pods — to get a picture that matches what the provider is actually enforcing.

Graceful Degradation When Quota Is Exhausted

For high-priority features, having a fallback path is worth the engineering investment. When quota is unavailable and the user cannot wait (a minimal fallback-chain sketch follows this list):

  • Serve cached results if the request is semantically close to a recent one (semantic caching can absorb a significant fraction of chat queries during peak load)
  • Return a partial result — a summary rather than the full generation, or a retrieval-only response without LLM processing
  • Switch to a smaller, cheaper model on a secondary key with its own quota envelope
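A minimal sketch of the chain, with all interfaces (request, quota, cache, small_model) illustrative rather than real APIs:

```python
def answer_with_fallback(request, quota, cache, small_model):
    """Try the full model, then progressively cheaper paths.

    `request`, `quota`, `cache`, and `small_model` are illustrative
    interfaces; the point is the ordering of the fallbacks.
    """
    # Normal path: quota is available for the primary model.
    if quota.try_acquire(request.estimated_tokens):
        return request.run_primary()

    # 1. Semantically similar cached answer from a recent request.
    cached = cache.semantic_lookup(request.prompt)
    if cached is not None:
        return cached

    # 2. Partial result: return retrieval output without LLM processing.
    if request.retrieval_results:
        return {"partial": True, "results": request.retrieval_results}

    # 3. Smaller, cheaper model on a secondary key with its own quota.
    return small_model.complete(request.prompt)
```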

Circuit breakers prevent a feature from hammering a depleted quota pool. When a feature's LLM calls fail consistently — whether from quota exhaustion or provider issues — a circuit breaker opens and routes subsequent calls immediately to the fallback path. This fails fast in milliseconds rather than waiting for a timeout. It also prevents retry storms: uncircuited retries under quota pressure add load at the worst possible moment.
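A minimal circuit breaker sketch: after a configured number of consecutive failures the breaker opens, calls route straight to the fallback for a cooldown period, and a single trial call decides whether it closes again.

```python
import time

class CircuitBreaker:
    """Open after repeated failures; fail fast to the fallback while open."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None   # monotonic timestamp when the breaker opened

    def call(self, primary, fallback):
        """Run `primary` unless the breaker is open; otherwise use `fallback`."""
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                return fallback()               # still cooling down: fail fast
            self.opened_at = None               # cooldown over: allow a trial call

        try:
            result = primary()
            self.failures = 0                   # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
```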

The Organizational Pattern That Enables This

Technical quota governance depends on one organizational pattern: a single team owns the API keys and the quota, and every feature team requests quota allocation from them rather than provisioning their own.

Without centralized ownership, the common failure mode is each team provisioning its own key, optimizing its own latency without visibility into aggregate spend, and discovering quota competition only when two teams' keys are consolidated by a finance team trying to hit a spend tier.

A platform team that owns the LLM gateway has visibility into every feature's consumption pattern. They can see when a new feature ships and immediately consumes an outsized fraction of quota. They can enforce the priority tier classification as a prerequisite for production access. They can redistribute quota allocation when business priorities change.

The platform team also owns the showback infrastructure: monthly reports showing each product team's LLM spend by feature. Showback alone — before any chargeback mechanism — changes team behavior. Teams discover they are running debug queries against the expensive model in production. They find batch jobs with no concurrency limits. They learn their retry configurations are hammering the API during backoff.

The cost signal creates accountability that rate limit enforcement cannot. Teams that see their quota consumption on a dashboard think about optimization in ways that teams who only see error rates do not.

What to Build First

If you have multiple AI features sharing a single API key and you have not thought about quota allocation, the priority order is:

  1. Separate the API key for batch workloads from interactive features. This is a one-day change with immediate protection against the worst failure mode.
  2. Add per-feature quota utilization logging. Before you can govern allocation, you need to see who is consuming what.
  3. Implement exponential backoff with jitter on all retry logic (see the sketch after this list). Synchronized retries under quota pressure cause retry storms that worsen the starvation — the jitter breaks the synchronization.
  4. Build the priority queue and token bucket once you have visibility into the consumption pattern. Design the configuration as code so priority tiers are explicit and reviewable.
  5. Establish a platform team or quota governance process before the number of features grows large enough that the priority tier document becomes a negotiation among equals.
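A minimal sketch of step 3, using the full-jitter variant in which each retry sleeps a random duration up to the exponential cap:

```python
import random
import time

def call_with_backoff(make_request, max_attempts=5, base_delay=1.0):
    """Retry with full jitter: sleep a random amount up to the exponential cap.

    `make_request` is any callable that raises on a retryable failure
    (a 429 or a timeout); the names here are illustrative.
    """
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            cap = base_delay * (2 ** attempt)
            time.sleep(random.uniform(0, cap))  # full jitter breaks synchronization
```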

The failure mode quota starvation creates is particularly insidious because it rewards the features that happen to run first, which are usually the cheapest and least user-facing ones. The most important user-facing features — the ones with the tightest latency SLAs — are often the ones that lose the competition. Getting governance right before scale is the only way to prevent the 2 AM cron job from owning your product demo.
