The Agent That Retried Its Way Past Your Rate Limit

May 23, 2026 · 10 min read

Software Engineer

Your gateway enforces a clean 100 requests per second per tenant. The dashboard shows every tenant comfortably under that ceiling. The bill from your model provider says you blew through the spend cap anyway. Nobody on the rollout call has a clean story for why.

The answer is that the rate limiter and the bill are measuring different things. The limiter sees one "user request" when a customer clicks a button. The provider sees a planner call, three tool-result reflections, a format-correction retry triggered by a stricter JSON schema, and a final synthesis — each with its own internal retry budget that fires when a transient 429 or 500 comes back. A single click can fan out into thirty model calls. The limiter counts one. The bucket leaks at thirty times the rate it was sized for.
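
To make the thirty concrete, here is a back-of-the-envelope sketch. The step counts follow the list above; the retry cap of five attempts per call is an assumption chosen for illustration, not a figure from any particular SDK.

```python
# Hypothetical per-click call plan, following the steps named in the text.
LOGICAL_CALLS = {
    "planner": 1,
    "tool_result_reflection": 3,
    "format_correction_retry": 1,
    "final_synthesis": 1,
}

# Assumed retry policy: each logical call may be attempted up to 5 times
# when the provider returns a transient 429 or 500.
MAX_ATTEMPTS_PER_CALL = 5


def worst_case_provider_calls(plan: dict[str, int], max_attempts: int) -> int:
    """Upper bound on provider calls produced by one user click."""
    return sum(plan.values()) * max_attempts


logical = sum(LOGICAL_CALLS.values())
worst = worst_case_provider_calls(LOGICAL_CALLS, MAX_ATTEMPTS_PER_CALL)
print(f"gateway counts 1 request; provider sees {logical} to {worst} calls")
```

Six logical calls with five attempts each is where a worst case of thirty comes from; the gateway still records one.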

Rate-limiting an agentic system at the HTTP boundary is like enforcing speed limits at the highway entrance while the cars inside multiply. Until the limiter understands the loop, the loop will route around it.

The Accounting Gap Between User-Facing And Provider-Facing Calls

Classical web rate limiting has a tidy invariant: one HTTP request from the user equals roughly one unit of backend work. The token bucket is sized against that invariant, the dashboards are built around it, and the on-call runbook trusts it.

Agentic workflows break the invariant on purpose. The orchestrator's whole job is to fan a single user intent out into many model calls — planning, retrieval, tool use, reflection, format correction, synthesis. Recent analyses put the typical amplification factor at 10 to 100 model calls per user-facing request, with autonomous coding agents and deep-research loops sitting at the high end of that range. At fifty reasoning steps, the per-request cost multiplier can exceed thirty times the equivalent single-shot completion.

The gap shows up in two places. On the cost side, your FinOps dashboard is summing dollars across millions of provider calls while your product dashboard is summing requests across thousands of user clicks. The two graphs diverge silently for weeks. On the reliability side, your gateway's rate-limit budget is sized against the user-click count, not the model-call count. When traffic spikes thirty percent on the click graph, a thirty-fold fan-out turns every extra click into thirty extra model calls, so the added provider load is nine times the entire baseline click volume. The provider's rate limiter, which is the one that actually decides whether your tenants get service, fires far before yours does.
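
A rough sketch of the reliability side, under stated assumptions: the 100-per-second gateway limit is the one from the opening example, while the constant fan-out of 30 and the provider ceiling of 2,000 calls per second are hypothetical numbers picked to show the ordering, not real quotas.

```python
GATEWAY_LIMIT_CLICKS_PER_S = 100    # from the opening example
FAN_OUT = 30                        # assumed constant model calls per click
PROVIDER_LIMIT_CALLS_PER_S = 2_000  # hypothetical provider-side ceiling


def first_limit_hit(clicks_per_s: int) -> str:
    """Report which limiter trips first at a given click rate."""
    model_calls_per_s = clicks_per_s * FAN_OUT
    if model_calls_per_s > PROVIDER_LIMIT_CALLS_PER_S:
        return "provider"
    if clicks_per_s > GATEWAY_LIMIT_CLICKS_PER_S:
        return "gateway"
    return "none"


print(first_limit_hit(60))  # 1,800 model calls/s: under both limits
print(first_limit_hit(78))  # 60 clicks/s plus 30%: 2,340 model calls/s
```

At 78 clicks per second the gateway dashboard still shows 22 percent of headroom, and the provider is already returning 429s.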

The fix is not to add another token bucket at the HTTP layer. The fix is to teach every layer to count the unit of work that actually consumes the resource. For LLM-backed systems, that unit is tokens, not requests. Token-based rate limiting reflects what your provider is billing you for and what your provider's own rate limiter will throttle on. Request-based limiting reflects a world where one HTTP call meant roughly one unit of backend work, and that world ended the moment you put a loop behind the endpoint.
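
One way to picture the change of unit is an ordinary token bucket whose contents are LLM tokens instead of HTTP requests. This is a minimal single-process sketch; the class name, the capacity, and the refill rate are illustrative assumptions, and a real deployment would need shared state across gateway instances.

```python
import time


class TokenDenominatedBucket:
    """Token bucket whose unit is LLM tokens rather than HTTP requests."""

    def __init__(self, capacity: int, refill_per_s: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._clock = clock
        self._level = float(capacity)
        self._last = clock()

    def try_consume(self, llm_tokens: int) -> bool:
        """Debit llm_tokens if available; return False without debiting otherwise."""
        now = self._clock()
        elapsed = now - self._last
        self._level = min(self.capacity, self._level + elapsed * self.refill_per_s)
        self._last = now
        if llm_tokens > self._level:
            return False
        self._level -= llm_tokens
        return True


# Hypothetical sizing: 10,000-token burst, 1,000 tokens per second sustained.
bucket = TokenDenominatedBucket(capacity=10_000, refill_per_s=1_000)
print(bucket.try_consume(8_000))  # a large prompt drains most of the burst
print(bucket.try_consume(4_000))  # refused: not enough tokens left yet
```

The same click now costs the bucket whatever its loop actually consumes, so a thirty-call request drains thirty calls' worth of capacity.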

Loop-Aware Budgeting As The Primary Control

The single most useful pattern is a per-user-request token budget that travels with the request through every layer of the orchestrator. The user clicks. The orchestrator allocates a budget, say 200,000 input tokens and 40,000 output tokens. Every model call in the loop debits the budget before issuing the call. When the remaining budget cannot cover the next call, the orchestrator fails loudly with a structured error the product can interpret: "this request needed more reasoning than was allowed; here's the partial result."
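
A minimal sketch of such a budget object follows. `RequestBudget`, `BudgetExhausted`, and the error fields are hypothetical names invented for this example; the default sizes are the ones quoted above.

```python
from dataclasses import dataclass


class BudgetExhausted(Exception):
    """Raised before a model call that the request budget cannot cover."""

    def __init__(self, step: str, partial_result=None):
        super().__init__(f"budget exhausted at step {step!r}")
        self.step = step
        self.partial_result = partial_result

    def to_error(self) -> dict:
        # Structured payload the product layer can render for the user.
        return {
            "code": "budget_exhausted",
            "step": self.step,
            "message": "this request needed more reasoning than was allowed",
            "partial_result": self.partial_result,
        }


@dataclass
class RequestBudget:
    input_tokens: int = 200_000
    output_tokens: int = 40_000
    spent_input: int = 0
    spent_output: int = 0

    def debit(self, step: str, input_tokens: int, output_tokens: int,
              partial_result=None) -> None:
        """Debit before issuing the call; raise if the call is unaffordable."""
        over_input = self.spent_input + input_tokens > self.input_tokens
        over_output = self.spent_output + output_tokens > self.output_tokens
        if over_input or over_output:
            raise BudgetExhausted(step, partial_result)
        self.spent_input += input_tokens
        self.spent_output += output_tokens


budget = RequestBudget()
budget.debit("planner", input_tokens=12_000, output_tokens=1_500)
print(budget.spent_input, budget.spent_output)
```

The orchestrator would create one `RequestBudget` per user click and pass it down the call chain, so every layer debits the same object.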

A few things make this work in practice.

The budget has to be a property of the request, not a global counter. Global counters become contention points and tell you nothing about which user's request is misbehaving. A per-request budget gives you a clean unit of accountability: every error log carries the budget that was set, the budget that was spent, and the step that exhausted it.
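
As a sketch of what that accountability could look like in a log line, the helper below emits one JSON record per exhausted request. The field names and the `req-123` identifier are assumptions made for the example, not an established schema.

```python
import json


def budget_log_record(request_id: str, budget_set: int, budget_spent: int,
                      exhausted_at_step: str) -> str:
    """Serialize the three facts an on-call engineer needs, plus the request id."""
    return json.dumps(
        {
            "event": "budget_exhausted",
            "request_id": request_id,
            "budget_set": budget_set,
            "budget_spent": budget_spent,
            "exhausted_at_step": exhausted_at_step,
        },
        sort_keys=True,
    )


print(budget_log_record("req-123", 240_000, 238_400, "final_synthesis"))
```

Because the record is keyed by request rather than by a global counter, grouping by route or tenant shows which requests are misbehaving.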

The budget has to be enforced before the model call, not after. Post-hoc accounting will tell you that you overspent; pre-call enforcement is what prevents the overspend. The orchestrator should refuse to issue the call when the remaining budget is smaller than the projected cost of the next step.
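
A sketch of the pre-call check, assuming the projected cost is the estimated prompt size plus the output cap requested for the call. The four-characters-per-token estimate is a crude stand-in; real code would use the provider's own tokenizer. `call_model` is a hypothetical callable supplied by the caller.

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic (assumption): roughly 4 characters per token.
    return max(1, len(text) // 4)


def can_afford(prompt: str, max_output_tokens: int,
               remaining_input: int, remaining_output: int) -> bool:
    """True only if the projected cost of the next step fits the remainder."""
    return (estimate_tokens(prompt) <= remaining_input
            and max_output_tokens <= remaining_output)


def guarded_call(call_model, prompt: str, max_output_tokens: int,
                 remaining_input: int, remaining_output: int):
    """Issue the call only when affordable; return None when refused."""
    if not can_afford(prompt, max_output_tokens, remaining_input, remaining_output):
        return None  # refused before any provider traffic is generated
    return call_model(prompt, max_output_tokens)


def fake_model(prompt: str, max_output_tokens: int) -> str:
    # Stand-in for a provider client, used only in this example.
    return "ok"


print(guarded_call(fake_model, "x" * 400, 50, remaining_input=1_000, remaining_output=100))
print(guarded_call(fake_model, "x" * 8_000, 50, remaining_input=1_000, remaining_output=100))
```

The second call is refused because its estimated 2,000 prompt tokens exceed the 1,000 that remain, and the provider never sees it.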

The budget has to include speculative spend, not just consumed spend. If the planner has decided to issue three parallel retrieval calls, the budget for all three has to be reserved before the first one fires. Otherwise the loop can commit you to spend it cannot afford to complete.
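
Reservation can be sketched as a third bucket next to "spent" and "available". The class below is a single-threaded illustration with hypothetical names; a concurrent orchestrator would need a lock around these methods.

```python
class ReservingBudget:
    """Budget that tracks reserved (speculative) spend alongside consumed spend."""

    def __init__(self, total: int):
        self.total = total
        self.spent = 0
        self.reserved = 0

    @property
    def available(self) -> int:
        return self.total - self.spent - self.reserved

    def reserve_all(self, projected_costs: list[int]) -> bool:
        """Reserve every planned call up front, or reserve nothing at all."""
        needed = sum(projected_costs)
        if needed > self.available:
            return False
        self.reserved += needed
        return True

    def settle(self, projected: int, actual: int) -> None:
        """Release one reservation and record what the call really cost."""
        self.reserved -= projected
        self.spent += actual


budget = ReservingBudget(total=10_000)
print(budget.reserve_all([3_000, 3_000, 3_000]))  # three parallel retrievals
print(budget.available)                           # what is left for anything else
budget.settle(projected=3_000, actual=2_500)      # first retrieval came in cheaper
print(budget.available)
```

The all-or-nothing reservation is the design choice that matters: the planner learns it cannot afford the fan-out before the first of the three calls fires.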

The budget has to be visible to the model. Agents that know how much budget remains can choose to summarize earlier, skip optional tool calls, or return a partial answer with a flag. Agents that don't know burn through the cap and then fail at the last step. The cheapest place to abort a runaway loop is in the planner, not in the limiter.
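
One simple way to expose the budget is a short notice rendered into the planner's prompt on every step. The wording and the 50 percent and 20 percent thresholds below are assumptions for illustration; they would need tuning per route.

```python
def budget_notice(remaining_tokens: int, total_tokens: int) -> str:
    """Render remaining budget as a line the planner prompt can include."""
    fraction = remaining_tokens / total_tokens
    if fraction < 0.2:
        advice = "Return a partial answer now and flag it as partial."
    elif fraction < 0.5:
        advice = "Skip optional tool calls and summarize early."
    else:
        advice = "Proceed normally."
    return f"[budget] {remaining_tokens} of {total_tokens} tokens remain. {advice}"


print(budget_notice(90_000, 100_000))
print(budget_notice(40_000, 100_000))
print(budget_notice(10_000, 100_000))
```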

A useful default ratio: set the per-request budget at two to three times the median observed spend for that route, and treat the p99 as a signal to look at, not a budget to accommodate. Routes whose p99 sits above the budget will fail loudly; that is the point.
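
The sizing rule can be sketched with the standard library. The sample spend figures are invented for the example, and the 2.5 multiplier is simply the midpoint of the two-to-three range above.

```python
import statistics


def route_budget(observed_spend: list[int], multiplier: float = 2.5) -> int:
    """Per-request budget: a multiple of the median observed spend for the route."""
    return int(statistics.median(observed_spend) * multiplier)


def p99(observed_spend: list[int]) -> float:
    """99th percentile, i.e. the last of the 99 cut points for n=100."""
    return statistics.quantiles(observed_spend, n=100)[98]


# Invented sample: most requests spend about 1,000 tokens, with a long tail.
spend = [1_000] * 98 + [1_200, 50_000]
budget = route_budget(spend)
print(budget)
print(p99(spend) > budget)  # a p99 above the budget is a signal to investigate
```

Here the tail request would fail loudly against the budget instead of quietly setting it.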


About Tian Pan
