LLM Rate Limits Are a Distributed Systems Problem
Your AI product has two surfaces: a user-facing chat feature and a background report generation job. Both call the same LLM API under the same key. One afternoon, a support ticket arrives: "Chat responses are getting cut off halfway." No alerts fired. No 429s in the logs. The API was returning HTTP 200 the entire time.
What happened: the report generation job gradually consumed most of your shared token quota. Chat requests started completing, but only up to your max_tokens limit — semantically truncated, syntactically valid, silently wrong. Your standard monitoring never noticed because there was nothing to notice at the HTTP layer.
This is not an edge case. It is what happens when engineers treat LLM rate limits as a simple throttle problem instead of recognizing the class of distributed systems failure they actually are.
Rate Limits Behave Like Distributed Locks
The mental model most teams carry is: "we hit the rate limit, requests get a 429, we back off and retry." That model is accurate for a single-tenant, single-workload scenario. As soon as you have multiple workloads competing for shared quota, the failure modes change completely.
LLM rate limits impose a shared capacity constraint across all callers using the same key. That shared constraint is functionally equivalent to a distributed lock on a finite resource pool. The same failure patterns that plague distributed lock designs appear here:
Starvation occurs when one workload continuously holds or consumes quota, preventing other workloads from making progress. A batch job running 50 parallel requests against a 100 RPM limit leaves at most 50 slots — and because the job launches a new request the instant each one completes, it reclaims freed capacity before anyone else can. A user-facing chat request that arrives in that window waits indefinitely for a slot that is never free when it checks.
Head-of-line blocking occurs when a queue is ordered naively. If your request queue is first-in-first-out across all workload types, a large batch job at the front of the queue blocks all the small, latency-sensitive interactive requests behind it. The interactive requests aren't processing; they're waiting for the batch job to drain.
Priority inversion is the subtler failure. It occurs when a low-priority workload holds a resource that a high-priority workload needs. In the LLM context: the report generation job (low priority, background, no user watching) is using quota that the interactive chat (high priority, user actively waiting) needs to proceed. The batch job has "priority" over the resource through timing, not through any explicit policy.
What makes this particularly dangerous is that all three failure modes can manifest without any 429 errors. You only see 429s when your entire shared pool is exhausted. Starvation, head-of-line blocking, and priority inversion can degrade your high-priority flows while your low-priority flows succeed just fine — and your error rate dashboards stay clean.
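These dynamics are easy to reproduce in a few lines. The toy simulation below contrasts strict FIFO dispatch with chat-first dispatch for the head-of-line scenario above: one long batch request queued ahead of five short chat requests. All request counts and service times are invented for illustration.

```python
# Toy model of head-of-line blocking: one 60-unit batch request queued
# ahead of five 0.5-unit chat requests, served one at a time.
# All numbers are invented for illustration.

requests = [(0, "batch", 60.0)] + [(i, "chat", 0.5) for i in range(1, 6)]

def fifo_wait(reqs):
    """Completion time of the LAST chat request under strict FIFO."""
    clock = last_chat = 0.0
    for _, kind, service_time in reqs:
        clock += service_time
        if kind == "chat":
            last_chat = clock
    return last_chat

def priority_wait(reqs):
    """Same workload, but chat requests always dispatch before batch."""
    chat_first = sorted(reqs, key=lambda r: 0 if r[1] == "chat" else 1)
    return fifo_wait(chat_first)

print(fifo_wait(requests))      # 62.5 -- chat drains only after the batch job
print(priority_wait(requests))  # 2.5  -- chat drains first
```

Sixty-plus time units versus two and a half, for an identical workload; the only change is queue discipline.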
Queuing Theory Predicts When You'll Break
The mathematical framework for understanding these dynamics is queuing theory. A simplified model — the M/M/1 queue, where arrivals and service times are both exponentially distributed with a single server — gives a clean picture of the utilization/latency relationship.
The key variable is utilization: ρ = λ/μ, where λ is your request arrival rate and μ is your service rate (how fast the API processes requests). The stability requirement is ρ < 1. What the math shows is that latency doesn't degrade linearly as utilization approaches 1 — it explodes nonlinearly:
- At 50% utilization, queuing adds minimal latency overhead.
- At 70%, latency starts climbing perceptibly.
- At 85%, you're in severe degradation territory.
- At 95%, the system is effectively unusable for latency-sensitive workloads.
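The shape of that curve falls directly out of the M/M/1 formula. Mean time in the system is W = 1/(μ − λ), which rewritten in terms of utilization is (1/μ)/(1 − ρ): latency relative to the bare service time scales as 1/(1 − ρ). A few lines make the explosion concrete — this is textbook queuing theory, not anything provider-specific:

```python
# M/M/1 mean time in system: W = 1/(mu - lambda) = (1/mu) / (1 - rho),
# so total latency relative to bare service time scales as 1/(1 - rho).

def latency_multiplier(rho: float) -> float:
    """Total time in system divided by service time, for an M/M/1 queue."""
    if not 0 <= rho < 1:
        raise ValueError("M/M/1 is only stable for 0 <= rho < 1")
    return 1.0 / (1.0 - rho)

for rho in (0.50, 0.70, 0.85, 0.95):
    print(f"rho={rho:.2f}: {latency_multiplier(rho):.1f}x service time")
```

At 50% utilization each request spends 2x its service time in the system; at 95% it spends 20x. Halving the distance to full utilization roughly doubles latency, which is why the last 10% of "free" capacity is so expensive to use.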
The industry-standard target for LLM infrastructure is 60–70% utilization. That conservatism isn't waste — it's the buffer that absorbs traffic spikes, retry storms, and failover events without cascading into exponential latency growth.
Little's Law reinforces this. The fundamental relationship is L = λW: the average number of in-flight requests equals the throughput multiplied by the average response time. The practical implication is that latency and concurrency are coupled. If a provider slowdown doubles your average response time, your number of concurrent in-flight requests doubles, even though your arrival rate is unchanged. You can hit concurrent-request limits as a consequence of latency degradation, not as a cause — a cascade that's unintuitive until you've seen it.
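As a back-of-the-envelope check (the rates below are invented):

```python
# Little's Law, L = lambda * W: in-flight requests = throughput x latency.
# All numbers below are illustrative.

arrival_rate = 10.0    # requests per second, unchanged throughout
normal_latency = 2.0   # seconds per request, healthy provider
slow_latency = 4.0     # a provider slowdown doubles response time

normal_inflight = arrival_rate * normal_latency
slow_inflight = arrival_rate * slow_latency
print(normal_inflight)  # 20.0 concurrent requests
print(slow_inflight)    # 40.0 -- doubled, with zero change in arrival rate
```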
LLM requests complicate the standard M/M/1 analysis because service time is not memoryless. A 20-token completion takes vastly less time than a 2000-token generation. Your quota consumption is variable in both tokens and latency, making the system closer to an M/G/1 queue (general service distribution). The practical takeaway: model your LLM calls with empirical percentile distributions, not averages, and size your headroom against P95 or P99 service times.
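One way to do that with nothing but the standard library — the sample latencies are made up, and note how a single long generation drags the tail far from the mean:

```python
import statistics

# Headroom sizing from empirical percentiles, not the mean. The sample
# below is ten invented per-request latencies in seconds; the single
# 9-second entry stands in for the long generations that dominate tails.
latencies = [0.4, 0.5, 0.6, 0.7, 0.8, 1.1, 1.3, 1.8, 2.5, 9.0]

mean = statistics.fmean(latencies)
# quantiles(n=100) returns the 1st..99th percentile cut points.
cuts = statistics.quantiles(latencies, n=100)
p95, p99 = cuts[94], cuts[98]
print(f"mean={mean:.2f}s  p95={p95:.2f}s  p99={p99:.2f}s")
```

Sizing against the mean here would budget for under two seconds per call; the P95 is several times that.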
The Silent Degradation You're Not Monitoring
The failure mode most teams miss is not the 429 — it's the 200 with a truncated response.
When token quota pressure builds, individual requests don't fail outright. They succeed, but with the max_tokens constraint binding before the model reaches a natural stopping point. The response is syntactically valid. It passes JSON schema validation. It returns HTTP 200. But the content is truncated at an arbitrary point mid-sentence, mid-list, mid-code-block.
Standard monitoring doesn't catch this because there's nothing anomalous to measure at the HTTP layer. You need different signals:
- Response completion rate: Track what percentage of responses are hitting the max_tokens boundary (by checking finish_reason == "length" in the response metadata). A spike in this metric means quota pressure is forcing early cutoffs.
- Token count distribution per workload: If your chat requests are normally 200–400 output tokens but you suddenly see a cluster at exactly your max_tokens ceiling, that's quota pressure, not user behavior change.
- P99 latency per workload tier, tracked separately: Don't average interactive and batch latencies together. Aggregated P99 can look fine while your interactive P99 has doubled, because the batch jobs are fast and numerous.
- Semantic validation probes: For critical paths, periodically validate that responses are semantically complete, not just syntactically valid.
The last point is expensive to implement systematically, which is why the first three metrics are the practical baseline.
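The first metric is a few lines to compute. The sketch below assumes responses carry an OpenAI-style finish_reason field, as mentioned above; adapt the field name to your provider:

```python
# Fraction of responses whose finish_reason is "length", i.e. responses
# cut off at the max_tokens ceiling. The dicts below are minimal stand-ins
# for real chat-completion payloads.

def truncation_rate(responses: list[dict]) -> float:
    if not responses:
        return 0.0
    truncated = sum(1 for r in responses if r.get("finish_reason") == "length")
    return truncated / len(responses)

sample = [
    {"finish_reason": "stop"},
    {"finish_reason": "stop"},
    {"finish_reason": "length"},   # hit the max_tokens ceiling
    {"finish_reason": "length"},
]
print(truncation_rate(sample))  # 0.5 -- alert-worthy if your baseline is near 0
```

Alert on the rate per workload tier, not globally, for the same reason as the latency metric: a batch-heavy aggregate can hide an interactive-tier spike.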
Fair Scheduling Requires Explicit Architecture
The fix for starvation and priority inversion is not monitoring — it's quota partitioning combined with explicit priority scheduling. Reactive detection of degradation is too slow for interactive workloads; by the time you notice the truncation rate climbing, your users have already noticed.
Quota partitioning means carving your total TPM/RPM budget into independent reservations per workload tier, not letting all workloads draw from a shared pool:
Total budget: 1,000,000 TPM
├── P0 — user-facing interactive: 400,000 TPM (guaranteed)
├── P1 — async product features: 300,000 TPM (guaranteed)
└── P2 — batch jobs: 300,000 TPM (opportunistic, can't consume P0/P1)
Critically: P2 should be opportunistic — it can use idle capacity from P0 and P1, but it can never consume their reserved allocations. This prevents batch jobs from starving interactive requests even when you're running near capacity.
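One way to encode those semantics — this is a sketch, not a production limiter. It uses a deliberately simple lagged-borrowing rule of my own choosing: P2 may borrow only the capacity the guaranteed tiers left idle in the previous accounting window, which keeps the bookkeeping trivial at the cost of a brief window of possible overcommit:

```python
# Sketch of the partition above. Guaranteed tiers (P0, P1) draw only from
# their own reservations; P2 may additionally borrow capacity they left
# idle in the PREVIOUS window. That lag is a simplification: it can
# briefly overcommit, but P2 can never block a guaranteed reservation.

class QuotaPartition:
    def __init__(self, reserved):
        self.reserved = reserved                  # tier -> TPM reservation
        self.used = {t: 0 for t in reserved}
        self.borrowable = 0                       # idle capacity lent to P2

    def try_consume(self, tier, tokens):
        headroom = self.reserved[tier] - self.used[tier]
        if tier == "P2":
            headroom += self.borrowable
        if tokens > headroom:
            return False
        self.used[tier] += tokens
        return True

    def next_window(self):
        # Roll the minute over: unused P0/P1 capacity becomes lendable.
        self.borrowable = sum(self.reserved[t] - self.used[t]
                              for t in self.reserved if t != "P2")
        self.used = {t: 0 for t in self.reserved}

q = QuotaPartition({"P0": 400_000, "P1": 300_000, "P2": 300_000})
print(q.try_consume("P2", 500_000))  # False: P2 alone is capped at 300k
print(q.try_consume("P2", 300_000))  # True: within its own reservation
q.next_window()                      # P0/P1 sat idle, so 700k is lendable
print(q.try_consume("P2", 900_000))  # True: 300k own + up to 700k borrowed
print(q.try_consume("P0", 400_000))  # True: P0's guarantee is intact
```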
Priority queuing handles the scheduling within your own request queue before requests reach the provider. When you have a backlog of requests across tiers, a first-in-first-out queue guarantees that a large batch job at the front will delay all interactive requests behind it. You need a multi-level queue that drains higher-priority tiers first.
Weighted fair queuing (WFQ) is the classical algorithm for this. Within a priority tier, WFQ distributes capacity proportionally by weight — preventing a single high-weight tenant or feature from starving others at the same priority level. OpenAI's Priority Processing feature applies a version of this at the provider level, but the critical limitation is that priority and standard tiers share the same quota. Priority access doesn't add capacity; it redistributes existing capacity. You still need quota partitioning to prevent starvation across workload tiers.
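A minimal version of the core WFQ idea — a per-flow virtual clock, dispatch ordered by the smallest virtual finish time — fits in a few lines. This sketch omits the global virtual clock a full WFQ implementation tracks, and the tenant names and weights are invented:

```python
import heapq
from itertools import count

# Minimal weighted fair queuing: each flow's virtual clock advances by
# cost/weight per request, and dispatch follows the smallest virtual
# finish time. Flow names and weights are invented.

class WFQ:
    def __init__(self, weights):
        self.weights = weights
        self.vtime = {flow: 0.0 for flow in weights}
        self.heap = []
        self.seq = count()  # stable tie-break for equal finish times

    def enqueue(self, flow, cost, item):
        # A heavier weight slows the flow's virtual clock, earning it a
        # proportionally larger share of dispatches.
        self.vtime[flow] += cost / self.weights[flow]
        heapq.heappush(self.heap, (self.vtime[flow], next(self.seq), item))

    def dequeue(self):
        return heapq.heappop(self.heap)[2]

wfq = WFQ({"tenant_a": 3, "tenant_b": 1})
for i in range(4):
    wfq.enqueue("tenant_a", 1.0, f"a{i}")
    wfq.enqueue("tenant_b", 1.0, f"b{i}")
print([wfq.dequeue() for _ in range(8)])
```

With equal per-request cost, tenant_a lands roughly three of every four early dispatches, but tenant_b is never starved: its requests interleave rather than wait for tenant_a to drain.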
