
The Rate Limit That Became a Product Decision

· 10 min read
Tian Pan
Software Engineer

A rate limit used to be an infrastructure detail. You hit a 429, you retried with backoff, you queued the overflow, and nobody outside the on-call channel ever knew it happened. The user saw a response that was a few hundred milliseconds slower than usual. That was the whole story.

That story no longer holds for agentic features. When an agent hits a provider's tokens-per-minute ceiling halfway through a multi-step plan, the failure does not stay inside the infrastructure. It surfaces as a half-finished answer, a tool loop that stalls before the last call, or a user watching a spinner that will never resolve. The quota stopped being a backend capacity number and became a constraint that product has to design around — the same way product designs around a checkout flow or an empty state.

The shift is easy to miss because the vocabulary stayed the same. We still say "rate limit," we still say "retry," we still say "queue." But the thing those words describe has changed shape. A rate limit on a REST API typically counts requests, and each request costs about the same. A rate limit on an LLM provider governs a shared, fast-draining token budget that every concurrent user, every sub-agent, and every background retry is spending at the same time. Treating it as an infra detail is how teams end up discovering their capacity model in an incident review instead of a design doc.

A Quota Is a Shared Budget, Not a Per-User Allowance

The first mental model to discard is that each user gets their own rate limit. They do not. Depending on the provider, limits are enforced at some combination of organization, API key, region, and model class, and your application's entire user base draws from that single pool. Anthropic's API, for example, enforces three separate dimensions per model class: requests per minute, input tokens per minute, and output tokens per minute. Hitting any one of them returns a 429. Azure's quota is assigned per region, per model, per deployment type, and every deployment that shares that quota is competing for the same tokens-per-minute number.

This matters because agentic workloads are unusually good at draining a shared budget fast. A single agent request is not one API call. It is a loop: a planning call, several tool calls, a retrieval step that stuffs documents into context, maybe a sub-agent that runs its own loop. A 200,000-token context window costs as much, in TPM terms, as fifty 4,000-token prompts. Ten users running deep agent sessions can consume more quota than ten thousand users sending short chat messages.
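
To make that arithmetic concrete, here is a rough sketch that adds up what one agent session draws from the shared pool compared with one short chat turn. The `Step` shape and every token count are invented for illustration, and summing input and output tokens into a single number is a simplification, since some providers meter them separately.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One model call inside an agent loop (hypothetical shape)."""
    input_tokens: int
    output_tokens: int


def session_tokens(steps):
    """Total tokens a session draws from the shared budget."""
    return sum(s.input_tokens + s.output_tokens for s in steps)


# Illustrative numbers only: a planning call, three tool calls that each
# resend a growing context, and one retrieval-heavy final call.
agent_session = [
    Step(4_000, 500),
    Step(6_000, 300),
    Step(9_000, 300),
    Step(12_000, 300),
    Step(60_000, 1_500),
]
short_chat = [Step(800, 200)]

agent_cost = session_tokens(agent_session)
chat_cost = session_tokens(short_chat)
ratio = agent_cost // chat_cost  # how many chat turns one session equals
```

With these made-up sizes, one agent session spends roughly as much as ninety short chat turns. Your own ratio will differ, but it is the number to measure before arguing about user counts.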

So the capacity question is not "how many users do we have." It is "how many concurrent agent loops can run before the shared budget is exhausted, and what happens to the loop that gets unlucky." That second clause is the product question, and it is the one that usually goes unanswered until launch day.

"Just Retry With Backoff" Is Latency the User Feels

Exponential backoff with jitter is still the correct baseline retry strategy. Pure exponential backoff creates a thundering herd — every throttled client retries on the same schedule and slams the limit again — so jitter to spread them out is non-negotiable. None of that is wrong.
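
As a baseline, a minimal sketch of the "full jitter" variant looks like the following. The `base` and `cap` values are assumptions for illustration, not recommendations from any provider.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Return a full-jitter delay in seconds for a zero-based attempt.

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    drawn uniformly from [0, ceiling] so throttled clients spread out
    instead of retrying on the same schedule.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Example: delays for the first five attempts of one client.
delays = [backoff_delay(n) for n in range(5)]
```

Drawing from the whole range, rather than adding a small random offset to a fixed schedule, is what breaks up the herd.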

What is wrong is treating retry as invisible. On a traditional API, a retried request adds tens of milliseconds. On an LLM call inside an agent loop, a retry can add seconds, and the agent may need to retry several calls in the same plan. The provider's retry-after header (or retry-after-ms, where the provider sends one) tells you how long to wait, and that wait is often measured in whole seconds when the org-wide budget is genuinely saturated. Multiply that by a multi-step plan, say five throttled calls at three seconds each, and the "invisible" retry becomes a fifteen-second pause with no output on screen.
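
A small sketch of reading the wait hint and adding it up across a plan shows how quickly it compounds. It assumes the response headers arrive as a plain dict, that `retry-after-ms` carries milliseconds, and that `retry-after` carries seconds; the HTTP-date form of `retry-after` is not handled, and header support varies by provider.

```python
def wait_seconds(headers, default=1.0):
    """Best-effort wait time in seconds from rate-limit response headers."""
    lowered = {k.lower(): v for k, v in headers.items()}
    ms = lowered.get("retry-after-ms")
    if ms is not None:
        try:
            return max(0.0, float(ms) / 1000.0)
        except ValueError:
            pass  # fall through to the seconds header
    secs = lowered.get("retry-after")
    if secs is not None:
        try:
            return max(0.0, float(secs))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    return default


# Hypothetical plan: five calls are throttled, each told to wait 3 seconds.
throttled_responses = [{"Retry-After": "3"}] * 5
total_pause = sum(wait_seconds(h) for h in throttled_responses)
```

Here `total_pause` comes to fifteen seconds of silence, none of which appears in any single call's latency.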

A user staring at a spinner does not know the difference between "the model is thinking" and "we are waiting out a rate limit." Both look identical. But one is a feature working and the other is a feature failing, and if your UX renders them the same way, you have shipped a failure mode that looks exactly like normal operation. The longer the agent runs, the more the user has invested in waiting, and the worse the eventual timeout feels.

The honest version of retry treats the wait as a state the user can see. Tell them the system is busy. Show progress on the steps that did complete. Give them the choice to wait or to take the partial result. Retry that hides behind a spinner is not graceful — it is a guess that the limit will clear before the user gives up, and that guess is a product decision being made by silence.
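
One way to make the wait a visible state is to have the agent runner report what it is doing instead of just sleeping. The sketch below is an assumption-heavy illustration: `Throttled`, `run_plan`, the `notify` callback, and the state names are all hypothetical, and a real runner would also hand the choice back to the user rather than applying a fixed `max_wait`.

```python
import time


class Throttled(Exception):
    """Hypothetical error raised when the provider returns a 429."""

    def __init__(self, retry_in):
        super().__init__(f"throttled, retry in {retry_in}s")
        self.retry_in = retry_in


def run_plan(steps, notify, max_wait=10.0, sleep=time.sleep):
    """Run (name, callable) steps, reporting state through `notify`.

    Returns (completed_steps, finished). When the total throttle wait
    would exceed `max_wait`, stop and return the partial work instead
    of spinning.
    """
    completed = []
    waited = 0.0
    for name, call in steps:
        while True:
            notify("working", name)
            try:
                completed.append((name, call()))
                notify("done", name)
                break
            except Throttled as exc:
                if waited + exc.retry_in > max_wait:
                    notify("partial", name)  # offer what we have so far
                    return completed, False
                notify("busy", name)  # UI can say "system is busy"
                sleep(exc.retry_in)
                waited += exc.retry_in
    return completed, True
```

The point of the `busy` and `partial` events is that the UI can render them differently from `working`, so "throttled" stops looking like "thinking".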

Degraded Mode Is a Feature, Not an Error Path

The most important reframing is this: when you cannot serve the full-quality response, the fallback is not an error. It is a product state, and it deserves the same design attention as the happy path.

Graceful degradation in AI systems usually means a layered fallback. A common ordering is full model → smaller model → cached or heuristic result → static response. Each rung sacrifices something — accuracy, freshness, personalization — to keep the feature responsive. The point is not that the degraded answer is as good. The point is that the system behaves predictably and communicates clearly instead of failing mysteriously.

Concretely, a few fallback rungs worth designing in advance:

  • Route to a smaller, cheaper model. A model in the same family with lower TPM pressure can often finish the task at reduced quality. The user gets an answer; you note that it was a fast-mode response.
  • Cut the number of steps. If a full agent plan needs eight tool calls, a degraded plan might do three and stop, returning what it found with an explicit "I stopped early" note.
  • Serve a cached or partial result. If the agent completed four of six steps before the budget ran out, the four steps are real work. Showing them beats throwing them away.
  • Fall back to a static or heuristic answer. For some queries a templated response or a simple rule is genuinely better than an infinite spinner.
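
The rungs above can be sketched as an ordered list of handlers, tried in turn, with the result labeled by the rung that produced it. Everything here is hypothetical: the `Unavailable` error, the handler names, and the response shape are illustrations, not any provider's API.

```python
class Unavailable(Exception):
    """Hypothetical signal that a rung cannot serve right now."""


def answer_with_fallback(query, rungs, static_text):
    """Try each (label, handler) rung in order.

    The response always records which rung answered, so the UI can tell
    the user when they received a degraded result.
    """
    for index, (label, handler) in enumerate(rungs):
        try:
            text = handler(query)
        except Unavailable:
            continue  # this rung is out of budget, drop to the next one
        return {"text": text, "mode": label, "degraded": index > 0}
    # Last rung: a static answer that cannot fail.
    return {"text": static_text, "mode": "static", "degraded": True}


def full_model(query):
    raise Unavailable("tokens-per-minute budget exhausted")


def small_model(query):
    return f"fast-mode answer to: {query}"


result = answer_with_fallback(
    "summarize the report",
    [("full", full_model), ("small", small_model)],
    static_text="We are busy right now. Please try again in a minute.",
)
```

Carrying `mode` and `degraded` in the result is the design choice that matters: it keeps the honesty requirement from depending on someone remembering to add it later.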

What separates a degraded mode that builds trust from one that erodes it is honesty. A user who is told "I used a faster model because demand is high — here is what I found, and you can ask me to retry in full" understands the trade and stays in control. A user who silently receives a worse answer and never learns why concludes the product is unreliable. The degraded state needs copy, needs a visual treatment, and needs a way back to full quality. That is design work, and it has to happen before the incident, not during it.

Capacity Planning Looks Like a Connection Pool

Once you accept that provider quota is a finite shared resource, the right analogy is the database connection pool, not the API rate limit. A connection pool has a fixed size. When it is exhausted, new work either waits in a bounded queue or is rejected fast. You size the pool against expected concurrency, you monitor saturation, and you alert before it hits the ceiling — not after.
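
A minimal sketch of the pool analogy, assuming a fixed one-minute window (providers may meter differently, for example with a continuously refilling bucket) and a caller-supplied token estimate per request. `TokenBudget` and `admit` are hypothetical names.

```python
import time
from collections import deque


class TokenBudget:
    """Tokens-per-minute budget tracked like a fixed-size pool."""

    def __init__(self, tokens_per_minute, clock=time.monotonic):
        self.limit = tokens_per_minute
        self.clock = clock
        self.window_start = clock()
        self.used = 0

    def try_acquire(self, tokens):
        """Reserve tokens if the current window has room; never blocks."""
        now = self.clock()
        if now - self.window_start >= 60.0:
            self.window_start = now  # new window, budget resets
            self.used = 0
        if self.used + tokens > self.limit:
            return False
        self.used += tokens
        return True


def admit(budget, tokens, waiting, max_waiting):
    """Run now, wait in a bounded queue, or reject fast."""
    if budget.try_acquire(tokens):
        return "run"
    if len(waiting) < max_waiting:
        waiting.append(tokens)
        return "queued"
    return "rejected"


waiting = deque()
budget = TokenBudget(tokens_per_minute=10_000)
first = admit(budget, 6_000, waiting, max_waiting=1)
second = admit(budget, 6_000, waiting, max_waiting=1)
third = admit(budget, 6_000, waiting, max_waiting=1)
```

With a 10,000-token budget, the first request runs, the second waits, and the third is rejected immediately rather than joining an unbounded line.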

Treat tokens-per-minute the same way. Estimate the TPM cost of a typical agent session, multiply by realistic peak concurrency, and compare that to your provider tier's actual limit. If the math says peak demand exceeds quota, you have three honest options, and all three are decisions someone has to own:

  1. Buy more capacity. Move up a provider tier, or reserve throughput. Vertex AI's Provisioned Throughput and Azure's Provisioned Throughput Units let you reserve capacity for stable latency, purchased as a commitment. This trades money for predictability.
  2. Shed load deliberately. Add a bounded internal queue with a fast rejection path so that when the budget is gone, users get an immediate, honest "try again in a minute" instead of a slow timeout.
  3. Degrade deliberately. Use the fallback rungs above so peak demand is served at lower quality rather than not at all.
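
The sizing math behind that choice is short enough to write down. This sketch assumes you have measured how many tokens per minute one active session spends; the figures are invented, and `capacity_check` is a hypothetical helper.

```python
def capacity_check(session_tpm, peak_concurrency, limit_tpm):
    """Compare estimated peak demand against the provider's TPM limit."""
    demand = session_tpm * peak_concurrency
    return {
        "demand_tpm": demand,
        "max_concurrent_sessions": limit_tpm // session_tpm,
        "over_quota": demand > limit_tpm,
        "shortfall_tpm": max(0, demand - limit_tpm),
    }


# Illustrative numbers: each active session spends 40,000 tokens per
# minute, peak is 25 concurrent sessions, and the tier allows 800,000 TPM.
report = capacity_check(session_tpm=40_000, peak_concurrency=25, limit_tpm=800_000)
```

In this example peak demand is 1,000,000 TPM against an 800,000 limit, so only 20 sessions fit. The remaining five are the ones that need one of the three options above.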

The wrong fourth option is to do nothing and let the provider's 429 be your capacity plan. That is not a plan. It is the absence of one, and it converts a predictable load problem into an unpredictable user-facing one.

Monitoring should match this framing. Track TPM and RPM utilization as a percentage of your actual limit, not as raw counts. Alert at 70 or 80 percent, the same threshold you would use for a connection pool, so the team has runway to act. Watch throttling events and time-to-first-token, because rising latency is the early signal that the budget is tightening.
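
As a sketch, the utilization check itself is a few lines. The two-level split between a warning and a page is an assumption layered on the 70 and 80 percent figures above, and `utilization_status` is a hypothetical helper, not part of any monitoring product.

```python
def utilization_status(used, limit, warn_at=0.70, page_at=0.80):
    """Return (ratio, level) for a usage counter against its limit."""
    ratio = used / limit
    if ratio >= page_at:
        return ratio, "page"
    if ratio >= warn_at:
        return ratio, "warn"
    return ratio, "ok"


# Example readings against an 800,000 TPM limit.
readings = [utilization_status(used, 800_000) for used in (400_000, 600_000, 720_000)]
```

Expressing the reading as a ratio is what makes the alert survive a tier change: when the limit moves, the thresholds do not.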

Negotiate Limits and Priority Tiers Before You Launch

The last piece is the one that is least technical and most often skipped: talk to the provider before launch, not during the incident.

Provider rate limits are not fixed constants. They scale with usage tier and cumulative spend, and they can frequently be raised on request — but the request takes time to process, and "we are launching Tuesday" is a weak position from which to ask. Knowing your tier, your current limits, and the lead time for an increase is launch readiness, the same as load testing.

If your application has different classes of traffic — a paying enterprise tier and a free tier, an interactive request and a background batch job — decide how they share the budget before the budget is contended. When the org-wide quota is saturated, something has to yield. If you have not decided what yields, the answer is whatever request happened to arrive last, and that is rarely the request you would have chosen to protect. Background jobs should yield to interactive ones. Free traffic should degrade before paid traffic. A multi-provider gateway can route around a single provider's 429 by failing over to the same model on another host. All of these are policies, and a policy decided in advance is a product decision; a policy decided by a 429 in production is an accident.
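
Written down, such a policy can be as small as a table. The traffic classes, thresholds, and actions below are assumptions chosen to mirror the ordering described above; the real numbers are a product decision.

```python
# Hypothetical policy table: each traffic class starts yielding once
# shared-budget utilization reaches its threshold.
YIELD_POLICY = {
    "background": (0.60, "defer"),    # batch jobs yield first
    "free": (0.80, "degrade"),        # free traffic degrades next
    "paid": (0.95, "degrade"),        # paid traffic is protected longest
}


def decide(traffic_class, utilization):
    """Return what a request of this class should do at this utilization."""
    threshold, action = YIELD_POLICY[traffic_class]
    return action if utilization >= threshold else "serve_full"


# At 85 percent utilization: background defers, free degrades, paid is served.
at_85 = {cls: decide(cls, 0.85) for cls in YIELD_POLICY}
```

The value of the table is less the code than the review it forces: someone has to sign off on who yields first.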

The Limit Was Always a Product Boundary

The deeper change is not that rate limits got stricter. It is that the agentic loop made them visible. A traditional request either succeeded or retried fast enough that nobody noticed. An agent's multi-step plan exposes the budget in the middle of the work, where the user is watching, where partial progress is real, and where the difference between "thinking" and "throttled" is invisible unless you choose to show it.

That visibility is why the quota is now a product surface. It changes what you build: a degraded mode with its own copy and design, a capacity model sized like a connection pool, a load-shedding policy that decides who waits, and a conversation with your provider that happens before the launch instead of after the outage. Teams that treat the rate limit as an infrastructure detail will keep discovering it in incident reviews. Teams that treat it as a constraint to design around will ship a feature that fails the way they chose, when the budget runs out — not the way the spinner decided for them.

Decide what your feature does when the budget is gone. If you do not, the 429 will decide for you, and it will not be kind to your users.
