
Provider Rate Limits Are a Capacity Plan You Never Wrote

9 min read
Tian Pan
Software Engineer

The first time your application hits a 429 from a model provider, something important happens, and almost nobody notices it. Not the error itself — the line of code that runs next. Maybe your HTTP client retries with exponential backoff. Maybe it falls back to a smaller model. Maybe it queues the request, or drops it, or surfaces a spinner that never resolves. Whatever it does, that behavior is now your capacity policy. It decides which users get served and which get degraded when demand exceeds supply.

And you almost certainly didn't write it. It was authored by whoever wrote the SDK wrapper, the retry decorator, or the three-line try/except someone copied from a tutorial. The most consequential decision in your system under load — what to do when you can't do everything — is being made by code nobody reviewed as a capacity decision.

This post is an argument for treating that code as what it actually is: a load-shedding policy and a capacity plan. Not an error handler. The 429 is not the problem. The problem is that you have outsourced the design of your system's behavior under contention to a library default.

A 429 Is a Capacity Signal, Not an Error

Most teams file rate-limit errors next to timeouts and 500s: transient faults, retry and move on. That framing is the original mistake. A timeout means something broke. A 429 means nothing broke — the system is telling you, precisely and on purpose, that you are asking for more than you are allotted. It is not a failure. It is a capacity boundary, communicated.

Providers expose that boundary on two axes that move independently. RPM (requests per minute) caps how many calls you make. TPM (tokens per minute) caps how much compute you consume. A workload of many tiny prompts hits the RPM ceiling first; a workload of few enormous prompts hits TPM first. Treating "rate limit" as one number is the first sign you haven't modeled your own capacity — the provider already did the modeling and handed you the result, and the 429 is the bill.
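A rough way to see which ceiling your workload reaches first is to convert the token budget into requests per minute and compare. The sketch below assumes you know your average tokens per request; the function name and the limits in the usage lines are made up for illustration, and how a given provider counts tokens against TPM is something to check in its documentation.

```python
def binding_limit(rpm_limit: int, tpm_limit: int,
                  avg_tokens_per_request: float) -> tuple[str, float]:
    """Return which limit binds first and the sustainable requests per minute.

    Hypothetical helper: all three inputs are assumptions you supply.
    """
    if avg_tokens_per_request <= 0:
        raise ValueError("avg_tokens_per_request must be positive")
    # How many requests per minute the token budget can pay for.
    tpm_bound_rpm = tpm_limit / avg_tokens_per_request
    if tpm_bound_rpm < rpm_limit:
        return "TPM", tpm_bound_rpm
    return "RPM", float(rpm_limit)


# Illustrative limits only: 500 RPM and 200,000 TPM.
many_tiny_prompts = binding_limit(500, 200_000, avg_tokens_per_request=100)
few_huge_prompts = binding_limit(500, 200_000, avg_tokens_per_request=8_000)
```

With these made-up numbers the tiny-prompt workload is capped by RPM at 500 requests per minute, while the huge-prompt workload is capped by TPM at 25.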

Once you see the 429 as a signal rather than a fault, the question changes. It is no longer "how do I make this error go away" but "demand just exceeded supply — who do I serve, and who do I make wait?" That is a capacity-planning question. Every system that calls a model API answers it. Most answer it by accident.

Your Retry Wrapper Is a Load Shedder in Disguise

Load shedding is the discipline of deliberately dropping or degrading work when a system is over capacity, so that the work you keep stays healthy. Infrastructure teams have practiced it for decades: when you can't serve everything, you choose, on purpose, what to serve.

Your retry wrapper does load shedding too. It just does it without choosing. Consider what a generic backoff-and-retry actually decides on your behalf:

  • A retried request occupies a worker, a connection, and memory while it waits. Under sustained 429s, your retry queue is now competing with fresh requests for the same finite resources.
  • A request that retries five times with exponential backoff can take 30 or more seconds to fail. The user behind it has been silently enrolled in your worst-case latency tier.
  • A fallback to a smaller model is a quality cut. You just decided this user gets a worse answer — and you decided it uniformly, for everyone, regardless of whether they were mid-checkout or running a background summarization.

None of those decisions are wrong in isolation. They are wrong because nobody decided them. The wrapper treats a paying customer's interactive request and a nightly batch job as identical, because it has no concept that they differ. It sheds load — it just sheds it blind.
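The 30-second figure in the list above is just the sum of the sleeps. A minimal sketch, assuming a one-second base delay that doubles on every attempt (your wrapper's defaults may differ, and the function name is hypothetical):

```python
def worst_case_wait(attempts: int, base: float = 1.0,
                    factor: float = 2.0, cap: float = float("inf")) -> float:
    """Seconds spent sleeping if every one of `attempts` retries fails.

    Ignores the duration of the requests themselves, so real latency is higher.
    """
    return sum(min(cap, base * factor ** i) for i in range(attempts))


five_retries = worst_case_wait(5)             # 1 + 2 + 4 + 8 + 16
five_retries_capped = worst_case_wait(5, cap=8.0)  # 1 + 2 + 4 + 8 + 8
```

Five retries cost 31 seconds of pure waiting, and capping each sleep at 8 seconds only brings that down to 23. Either number is a latency tier somebody should have chosen on purpose.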

The Thundering Herd You Ship by Default

The default retry behavior has a second failure mode that is worse than blindness: it actively prolongs the outage it was meant to survive.

When demand spikes and a wave of requests gets 429'd, every client that uses the same fixed retry interval — or even the same un-jittered exponential schedule — retries at the same moment. The provider, briefly over capacity, now receives a synchronized burst precisely calibrated to keep it over capacity. The clients back off in lockstep and retry in lockstep again. This is the thundering herd, and in a fleet of identical agent instances it is not an edge case. It is the expected behavior, because identical code produces identical timing.

The fix is well known and still routinely skipped: jitter. Full jitter, which sleeps a random duration between zero and your computed backoff cap, desynchronizes the fleet so retries spread across the recovery window instead of stacking on it. Published comparisons of backoff strategies, such as the AWS Architecture Blog's "Exponential Backoff and Jitter", show jittered clients finishing the same work with substantially fewer total calls than clients on an un-jittered schedule.

Two more rules belong with it. Honor the Retry-After header when the provider sends one; it is the provider telling you when to try again, and guessing a backoff instead is choosing to be wrong. And cap your attempts at three to five, because the gap between five retries and infinite retries is the gap between shedding a request and feeding the herd forever.
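Put together, those three rules fit in a few lines. This is a sketch under stated assumptions, not any SDK's actual retry logic: `RateLimited` is a hypothetical stand-in for whatever exception your client raises on a 429, and `retry_after` is assumed to be already parsed into seconds (the real header can also carry an HTTP date).

```python
import random
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical stand-in for a provider SDK's 429 exception."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the provider sent Retry-After


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  retry_after: Optional[float] = None,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to sleep before the next try (attempt is 0-based)."""
    if retry_after is not None:
        return retry_after  # the provider's hint beats our guess
    ceiling = min(cap, base * 2 ** attempt)
    return rng() * ceiling  # full jitter: uniform in [0, ceiling)


def call_with_retries(send: Callable[[], object], max_attempts: int = 4,
                      sleep: Callable[[float], None] = time.sleep):
    """Call send(); on RateLimited, back off and retry up to max_attempts total."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: shed the request instead of feeding the herd
            sleep(backoff_delay(attempt, retry_after=err.retry_after))
```

The `rng` and `sleep` parameters exist only so the timing can be tested without waiting; in production you would leave them at their defaults.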

Notice that these are not really retry tactics. They are capacity tactics. Jitter is how you avoid manufacturing your own demand spike. The attempt cap is how you decide a request is not worth its share of a constrained resource. You are doing capacity planning. You are just doing it three lines at a time.

Priority Belongs in the Request, Not the Wrapper

A real capacity plan starts by admitting that not all requests are equal, and then making that inequality explicit in the system.

The pattern is to tag every request with a priority when it is created, where the context to judge importance actually exists. A useful default hierarchy:

  • P0 — interactive, revenue-critical. A user is waiting, and the request is on the path to a conversion or a support resolution.
  • P1 — interactive, non-critical. A user is waiting, but a delay or a degraded answer is tolerable.
  • P2 — deferred. Triggered by a user but not awaited: an async enrichment, a notification draft.
  • P3 — background. Batch jobs, pre-computation, evals. No user, no deadline within the minute.

With that tag in hand, the behavior at a 429 stops being uniform. P3 work yields first — it should arguably never compete with interactive traffic for a constrained quota at all, and is a natural fit for a provider's discounted batch tier. P2 work queues. P1 work can take the smaller-model fallback, because a slightly worse answer beats a spinner. P0 work gets the retries, the preserved quality, and the front of the queue. When you are at 95% of your TPM ceiling, you are not out of capacity for the checkout flow — you are out of capacity for the nightly summarizer. Only a system that knows the difference can act on it.
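The hierarchy above can be written down as data, which is most of the point: the policy becomes something a person can read and review. A minimal sketch, with hypothetical names for the tiers and actions:

```python
from enum import Enum, IntEnum


class Priority(IntEnum):
    P0 = 0  # interactive, revenue-critical
    P1 = 1  # interactive, non-critical
    P2 = 2  # deferred: triggered by a user, not awaited
    P3 = 3  # background: batch jobs, pre-computation, evals


class Action(Enum):
    RETRY = "retry at full quality, front of the queue"
    FALLBACK_MODEL = "answer now with the smaller model"
    QUEUE = "enqueue and run when budget frees up"
    YIELD = "drop from the live quota; reschedule or send to a batch tier"


# The capacity policy, in one reviewable place.
ON_RATE_LIMIT = {
    Priority.P0: Action.RETRY,
    Priority.P1: Action.FALLBACK_MODEL,
    Priority.P2: Action.QUEUE,
    Priority.P3: Action.YIELD,
}


def on_rate_limit(priority: Priority) -> Action:
    """Look up what a request of this priority does when it meets a 429."""
    return ON_RATE_LIMIT[priority]
```

The table itself is trivial. What matters is that changing who waits is now a one-line diff with an owner, not a side effect of a retry decorator.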

This is also where probabilistic shedding earns its place. Rather than a hard cutoff, you shed an increasing fraction of low-priority traffic as load climbs toward the ceiling: the shedding rate scales with (current_load − threshold) / (capacity − threshold), clamped between 0 and 1, so nothing is shed below the threshold and all low-priority traffic is shed at capacity. The result is a system that bends smoothly under pressure instead of snapping.
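As a sketch, the shedding rate is a clamped linear ramp and the shed decision is a coin flip against it. The function names are hypothetical, and the units of load are whatever you measure quota in (tokens per minute, for example), as long as all three arguments agree.

```python
import random
from typing import Callable


def shed_probability(current_load: float, threshold: float, capacity: float) -> float:
    """Fraction of low-priority traffic to shed at the current load.

    0 at or below the threshold, 1 at or above capacity, linear in between.
    """
    if capacity <= threshold:
        raise ValueError("capacity must be greater than threshold")
    fraction = (current_load - threshold) / (capacity - threshold)
    return max(0.0, min(1.0, fraction))


def should_shed(current_load: float, threshold: float, capacity: float,
                rng: Callable[[], float] = random.random) -> bool:
    """Randomly decide whether to shed one low-priority request."""
    return rng() < shed_probability(current_load, threshold, capacity)
```

With a threshold of 80 and a capacity of 100, a load of 90 sheds about half of low-priority requests, and anything at or past 100 sheds all of them.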

Treat Quota Like a Resource With a Scheduler

The deepest reframing is this: your provider quota is a fixed, shared, contended resource. You already have a name for the component that allocates a fixed, shared, contended resource among competing claimants. It is a scheduler. The thing you almost certainly have instead is an exception handler.

An exception handler is reactive and local. It sees one failed request and decides what to do with that one request, blind to every other request in flight. A scheduler is proactive and global. It knows your minute-by-minute RPM and TPM budget, it knows what is queued and at what priority, and it admits work against the budget before the work is ever sent — so the provider's 429 becomes a rare backstop instead of your primary control surface.

Concretely, a scheduler-shaped client does a few things an exception handler cannot:

  • Estimates token cost before sending, so it can reason about TPM headroom instead of discovering the ceiling by crashing into it.
  • Maintains priority queues, so when the budget is tight the admission order is intentional rather than first-come.
  • Tracks the budget centrally, so ten agent instances share one view of remaining quota instead of each independently rediscovering the limit.
  • Exposes its decisions, so "we shed 12% of P3 traffic for nine minutes" is a dashboard line, not a mystery.

You do not need to build a distributed scheduler from scratch on day one. A shared token bucket in Redis and a priority field on the request queue covers most of the distance. The point is not the sophistication of the mechanism. The point is that someone — a person, reasoning about which requests matter — authored the policy, instead of inheriting it from a library default.
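To make the shape concrete, here is a single-process sketch of that idea: a token bucket for the TPM budget and a priority queue in front of it. Everything here is an assumption for illustration. The class names are hypothetical, the bucket lives in memory where a real fleet would keep it in a shared store such as Redis, and `est_tokens` is whatever estimate you can make before sending.

```python
import heapq
import itertools


class TokenBudget:
    """A tokens-per-minute budget that refills continuously (a token bucket)."""

    def __init__(self, tokens_per_minute: float, now: float = 0.0):
        self.tokens_per_minute = tokens_per_minute
        self.available = tokens_per_minute
        self.last_refill = now

    def try_spend(self, cost: float, now: float) -> bool:
        # Refill in proportion to elapsed seconds, never above the per-minute cap.
        elapsed = now - self.last_refill
        self.available = min(self.tokens_per_minute,
                             self.available + elapsed * self.tokens_per_minute / 60.0)
        self.last_refill = now
        if cost <= self.available:
            self.available -= cost
            return True
        return False


class Scheduler:
    """Admits queued requests against the budget in priority order."""

    def __init__(self, budget: TokenBudget):
        self.budget = budget
        self._queue = []                 # heap of (priority, seq, est_tokens, request_id)
        self._seq = itertools.count()    # tie-breaker: FIFO within a priority

    def submit(self, priority: int, est_tokens: int, request_id: str) -> None:
        heapq.heappush(self._queue, (priority, next(self._seq), est_tokens, request_id))

    def admit(self, now: float) -> list:
        """Return the request ids that fit the budget right now, best priority first."""
        admitted = []
        while self._queue:
            _, _, est_tokens, request_id = self._queue[0]
            if not self.budget.try_spend(est_tokens, now):
                break  # strict priority: nothing jumps ahead of the waiting head
            heapq.heappop(self._queue)
            admitted.append(request_id)
        return admitted


scheduler = Scheduler(TokenBudget(tokens_per_minute=1000))
scheduler.submit(3, 600, "nightly-summary")   # lower number = higher priority
scheduler.submit(0, 500, "checkout")
scheduler.submit(1, 400, "chat")
first_wave = scheduler.admit(now=0.0)
second_wave = scheduler.admit(now=30.0)
```

In this run the checkout and chat requests are admitted immediately and the nightly summary waits for the bucket to refill. Strict priority is a deliberate simplification: it lets a large high-priority request block smaller ones behind it, which is a trade-off worth deciding explicitly.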

Write the Plan Before the Ceiling Writes It for You

Every system that calls a model API has a capacity plan. The only question is whether anyone wrote it on purpose. If you cannot answer "what happens to a P0 request when we are at our TPM ceiling, and how is that different from what happens to a batch job" without reading your retry decorator's source code, the answer is no — and the plan you have is whatever the library author happened to choose.

Find the place in your stack where a 429 is handled. Read it as a capacity policy, because that is what it is. Ask whether it distinguishes a customer from a cron job, whether it jitters, whether it honors Retry-After, whether it can tell you after the fact what it shed and why. If it cannot, you have not been spared the work of capacity planning. You have only handed it to whoever typed except RateLimitError: sleep(1) — and they were not thinking about your users.
