
Rate Limit Hierarchy Collapse: When Your Agent Loop DoSes Itself

12 min read
Tian Pan
Software Engineer

The bug report says the service is slow. The dashboard says the service is healthy. Token-per-minute usage is at 62% of the tier cap, well inside the green band. Then you open the traces and see the shape: one user request spawned a planner step, which emitted eleven parallel tool calls, four of which were search fan-outs that each triggered sub-agents, which each called three tools in parallel — and that single "request" is now pounding your own token bucket from forty-seven different workers at the same time. The other ninety-nine users of your product are stuck behind it, getting 429s they never earned. Your agent is DoSing itself, and the rate limiter is doing exactly what you told it to.

This is rate limit hierarchy collapse. You bought a perimeter defense designed for HTTP APIs where one request equals one unit of work, then wired it in front of a system where one request means a tree of unknown depth and unbounded branching factor. The single-bucket model doesn't just fail to protect — it fails invisibly, because your aggregate numbers never breach anything. The damage happens in the tails, in correlated bursts, and in the heads-down users who happen to be adjacent in time to a heavy one.

Rate limits were invented to stop misbehaving clients from overwhelming shared infrastructure. That premise assumed clients were mostly dumb loops: a polling job, a webhook retry, a scraper. The answer was a token bucket per API key, sized to the tier the customer paid for. One process, one key, one bucket. You could reason about it in your head.

Agent systems break the premise. A single user-facing request is no longer a client; it's a planner that decides, at runtime, how many concurrent sub-requests to emit. Parallel tool calling — the feature everyone turned on because it halves wall-clock latency — means the planner's output can balloon into dozens of in-flight calls before the first response arrives. Sub-agents compound this: each branch of the tree is itself a planner, with its own fan-out. The effective fan-out of a single "request", counted across the whole tree, can easily reach two or three orders of magnitude more calls than the one the user actually sent.

Against this, a flat per-API-key bucket is worse than useless. It tells you nothing about which user triggered the burst. It gives a single runaway agent the power to consume the entire quota, starving every other tenant. And it turns the model's own recovery behavior — retry, re-plan, fall back — into an amplifier.

The four levels the flat bucket conflates

If your rate limiter has one layer, it's enforcing one invariant. Agent systems have at least four invariants, and collapsing them into a single bucket means violating three of them every time you enforce the fourth.

Tenant level is what your flat bucket actually protects: the aggregate spend of a customer organization against provider-level caps. This is the coarsest signal. It answers "is the whole account over its tier?" but says nothing about distribution within the account.

User level is the fairness guarantee your product makes to individual end users within a tenant. Without it, one user's runaway agent silently steals capacity from every other user in the same org — the classic noisy-neighbor pattern, now with AI-specific blast radius because a single bad prompt can emit fifty tool calls. This is the layer that determines whether your product feels reliable.

Request-tree level is the budget assigned to a single end-user request and all its descendants. Without it, one over-planned request consumes the user's per-user quota and then their tenant's quota, even when the rest of their work would have fit comfortably. This is the layer that stops one bad prompt from nuking a user's afternoon.

Tool level is the per-downstream protection: your vector store can handle 300 requests per second, your internal search cluster can handle 50, the third-party weather API can handle 10. These are hard backend limits that have nothing to do with model tokens. Without it, a planner that decides to call the slowest tool fifty times in parallel gets to find that out by breaking the tool, not by being told no.

A mature agent rate limiter enforces all four simultaneously, with the request subject to the minimum of whatever budgets apply. Flattening any of them to save implementation effort just relocates the failure.
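
To make "the minimum of whatever budgets apply" concrete, here's a minimal sketch of that check in Python. The class names and the fixed four-layer dictionary are illustrative, not any particular library's API; the point is that a call is admitted only when every layer says yes, and a refusal comes back labeled with the layer that said no.

```python
import time

class TokenBucket:
    """A simple token bucket: `capacity` tokens, refilled at `refill_rate` tokens per second."""
    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now

    def can_consume(self, amount: float) -> bool:
        self._refill()
        return self.tokens >= amount

    def consume(self, amount: float) -> None:
        self.tokens -= amount


class HierarchicalLimiter:
    """One bucket per layer; a call is admitted only if every applicable layer admits it."""
    def __init__(self, tenant: TokenBucket, user: TokenBucket,
                 request_tree: TokenBucket, tool: TokenBucket):
        self.layers = {"tenant": tenant, "user": user,
                       "request_tree": request_tree, "tool": tool}

    def admit(self, estimated_tokens: float):
        # Check every layer before debiting any, so a refusal at one layer
        # doesn't leak tokens out of the others.
        for name, bucket in self.layers.items():
            if not bucket.can_consume(estimated_tokens):
                return False, name  # the refusing layer, for metrics and triage
        for bucket in self.layers.values():
            bucket.consume(estimated_tokens)
        return True, None
```

The refusal arrives already labeled with the layer that produced it; that label is exactly what the observability section below leans on.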

Why retries make it worse, not better

The natural response to a 429 is to retry with exponential backoff and jitter. That's the right pattern for stateless HTTP clients. For agents, it is a correlated-failure amplifier, and it's worth being specific about why.

First, the unit of work is wrong. Exponential backoff counts requests; the provider counts tokens. A 200-token request and a 12,000-token request both count as "one retry," but the second one consumes sixty times the budget when it succeeds. Your retry schedule is measuring the wrong axis, so it can't give you a useful guarantee about when you'll stop hitting the wall.

Second, the fan-out is synchronized. An agent that emits eleven parallel tool calls in one planning step will often get eleven 429s back together, because the underlying bucket was empty when all eleven arrived. Every naive retry library will then back them off by the same base — and unless the jitter is generously sized, they'll resume together and reproduce the burst. Each "retry storm" is structurally identical to the one that caused the 429 in the first place.

Third, each retry is not free. A retry in an agent loop consumes context window, wall-clock latency, user patience, and — crucially — tokens on the planner that re-inspects the failure. A 2% tool error rate, retried three times through an agent loop with eight steps, can balloon into a 15–20% end-to-end failure rate once you account for context bloat from inserted error messages, timeouts from the cumulative latency, and the planner getting confused by the longer trace. The retry budget is a scarce resource; treat it like one.

Fourth, the planner often "retries" implicitly by re-planning. If your agent's response to a failed tool call is "think again and try a different approach," you now have two forms of retry happening at once: the HTTP client's, and the model's. Neither knows about the other. Neither has a shared budget. A single bad prompt can trigger both, creating the exact pathology retry libraries were designed to prevent.

The practical move is to treat the retry budget as a first-class resource at the same level as the rate limit budget. Assign each user request a bounded number of total retries across all tool calls and model invocations in its tree. Deduct aggressively. When the budget is gone, fail loudly rather than silently burning the user's quota on self-inflicted recovery.
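
Here's a minimal sketch of what a first-class retry budget can look like. The default of six and the names are illustrative, and `call` stands in for whatever your orchestrator uses to invoke a tool; the important property is that one budget object travels with the request tree, and every form of retry, including re-plans, is charged against it.

```python
from dataclasses import dataclass

@dataclass
class RetryBudget:
    """One budget per user request, shared by every tool call and model
    invocation in its tree. When it is exhausted, fail the request loudly
    instead of letting recovery attempts burn the user's quota."""
    remaining: int = 6  # illustrative default: total retries across the whole tree

    def try_spend(self, cost: int = 1) -> bool:
        # Re-plans count too: charge the model's "think again" loop against the
        # same budget as the HTTP client's retries, so neither runs unbounded.
        if self.remaining < cost:
            return False
        self.remaining -= cost
        return True


def call_tool_with_retry(call, budget: RetryBudget):
    while True:
        result = call()      # `call` is a stand-in for one tool invocation
        if result.ok:        # `ok` is an assumed success flag on your result type
            return result
        if not budget.try_spend():
            raise RuntimeError("retry budget exhausted for this request tree")
```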

Cooperative backpressure the planner can actually consume

Here's the part almost nobody implements. Your planning model can see the system prompt, the tools, and the conversation. It cannot see your rate limiter. So when it emits eleven parallel tool calls and you answer three with 429s, the planner doesn't understand that the system is saying "slow down." It understands that three tools are broken. It will either retry them — burning more budget — or try to route around them, which usually means picking a different tool and calling that one eleven times.

The signal you want the planner to consume is not a 429 on individual calls; it's a statement about the shape of the work it's allowed to do next. Think of it as a budget-aware system message injected into the next turn:

  • "You have 4,200 tokens remaining in this request's budget."
  • "Parallel tool calls are currently capped at 3 due to backend pressure on the search tool."
  • "The weather tool is in cooldown; prefer the cached snapshot for the next 30 seconds."

Models, especially recent ones, respond remarkably well to this kind of guidance when it is concrete. The cost is small: one extra section in the system prompt, updated per turn. The benefit is that the planner stops emitting fan-outs you were going to refuse anyway, which eliminates the 429 storm at its source instead of at its perimeter.

The inverse — silent refusal — teaches the planner nothing. It retries, re-plans, or hallucinates success. All three make the dashboard worse.

One subtle trap: don't dump raw bucket state into the prompt. "You have 8,441.7 tokens in bucket A, 311 in bucket B, refill rate 1000/s, …" will confuse the model more than it helps. Translate numbers into actions the model can take: how many parallel calls, which tools to prefer, whether to batch. The model is good at following constraints stated in the vocabulary of its own action space.
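
Here's a sketch of that translation step. The input fields are made-up stand-ins for whatever your limiter exposes; what matters is that the output speaks the planner's own action space: parallel-call caps, cooldowns, tool preferences, not bucket arithmetic.

```python
def budget_guidance(remaining_tokens: int,
                    max_parallel_calls: int,
                    tools_in_cooldown: dict[str, int]) -> str:
    """Turn limiter state into constraints stated in the planner's action space,
    injected as an extra system-prompt section on the next turn."""
    lines = [f"You have {remaining_tokens} tokens remaining in this request's budget."]
    if max_parallel_calls < 8:  # illustrative threshold: only mention the cap when it actually binds
        lines.append(f"Emit at most {max_parallel_calls} parallel tool calls this turn.")
    for tool, seconds in tools_in_cooldown.items():
        lines.append(f"The {tool} tool is in cooldown for the next {seconds} seconds; "
                     f"prefer cached results or a different tool.")
    return "\n".join(lines)


# Example: appended to the system prompt before the planner's next turn.
print(budget_guidance(4200, 3, {"weather": 30}))
```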

Budget accounting that survives concurrency

A rate limit that only checks after the LLM call has returned has already lost. By the time you know the cost, the budget is spent; you're doing bookkeeping, not enforcement. Worse, if your agent emits parallel calls, each of them races to consume the bucket, and whichever of them happens to check first wins — completely ignoring the others in flight.

The discipline is pre-commit accounting: before dispatching any call, reserve the worst-case token cost against all applicable buckets (tenant, user, request-tree, tool). If the reservation fails, the call doesn't go out. When the call returns, reconcile the reservation against the actual cost — credit back the unused portion, or debit the overrun. Treat the reservation as a lock, not an advisory number.
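
Here's a minimal sketch of that reserve-then-reconcile discipline against one bucket, with a lock because parallel branches race for it. The class and method names are illustrative, not a specific library's API.

```python
import threading

class ReservingBucket:
    """Pre-commit accounting: reserve worst-case cost before dispatch,
    reconcile to the actual cost after the call returns."""
    def __init__(self, capacity: int):
        self.available = capacity
        self._lock = threading.Lock()

    def reserve(self, worst_case: int) -> bool:
        with self._lock:
            if self.available < worst_case:
                return False          # refuse before the call ever goes out
            self.available -= worst_case
            return True

    def reconcile(self, worst_case: int, actual: int) -> None:
        with self._lock:
            # Credit back the unused part of the reservation, or debit an overrun.
            self.available += worst_case - actual


# Usage sketch for one call:
#   if not bucket.reserve(worst_case): refuse and tell the planner
#   actual = dispatch the call, measure real token cost
#   bucket.reconcile(worst_case, actual)
```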

For parallel fan-out, reserve the sum of worst-case costs of all branches before dispatching any of them. This is the step that breaks naive implementations: they reserve per-call, so the first branch succeeds, the second succeeds, and the sixth one breaches the bucket partway through — at which point you've already committed to five calls you can't finish. If you can't afford the whole fan-out, refuse the fan-out and tell the planner to narrow its plan.

Worst-case estimation is awkward because output tokens are hard to predict before the call. The workable approximation: use the model's declared max_tokens as the upper bound, reconcile on completion. You'll over-reserve, which means you'll sometimes refuse requests that would have fit. That's the trade; it's the price of not letting a runaway completion drain the bucket with nothing left for anyone else.
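
Putting the last two rules together, here's a sketch that bounds each branch by its declared max_tokens and reserves the whole fan-out atomically against the bucket from the previous sketch. The estimator is deliberately crude; it assumes you can count prompt tokens up front, which most tokenizers let you do cheaply.

```python
def worst_case_cost(prompt_tokens: int, max_tokens: int) -> int:
    """Upper bound for one call: what we already know (the prompt) plus the
    declared completion ceiling. Over-reserves by design; reconciled later."""
    return prompt_tokens + max_tokens


def reserve_fanout(bucket: "ReservingBucket", branches: list[dict]) -> bool:
    """All-or-nothing: reserve the sum of worst cases before dispatching any branch.
    If the whole fan-out doesn't fit, refuse it and ask the planner to narrow the plan."""
    total = sum(worst_case_cost(b["prompt_tokens"], b["max_tokens"]) for b in branches)
    return bucket.reserve(total)
```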

The observability that keeps the hierarchy honest

You cannot operate a four-level rate limit architecture with a single "429 count" metric. The whole point of the hierarchy is to distinguish which layer refused the request, because each layer means something different for triage.

  • A tenant-level refusal means the customer is over their plan. Sales or autoscaling conversation.
  • A user-level refusal inside a healthy tenant means one user is hot; product or UX conversation.
  • A request-tree refusal means one prompt fanned out too aggressively; prompt engineering or planner-caps conversation.
  • A tool-level refusal means a specific backend is the bottleneck; infra conversation.

Emit metrics and traces labeled with the refusing layer, plus the tenant, user, request ID, and tool. Aggregate dashboards are meaningless here; the interesting patterns only show up when you can filter on one dimension and see the distribution within it. If 92% of your refusals come from a single layer, that's the one to invest in. If they're evenly split across all four, your tiers are miscalibrated relative to each other.
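
A sketch of what that labeling can look like as a structured log event; the field names are illustrative and map onto whatever metrics pipeline you already run.

```python
import json
import logging

logger = logging.getLogger("rate_limit")

def record_refusal(layer: str, tenant_id: str, user_id: str,
                   request_id: str, tool: str | None = None) -> None:
    """One event per refusal, labeled with the layer that refused it. The layer
    is the field that decides who gets paged: sales, product, prompt engineering, or infra."""
    logger.warning(json.dumps({
        "event": "rate_limit_refusal",
        "layer": layer,            # tenant | user | request_tree | tool
        "tenant_id": tenant_id,
        "user_id": user_id,
        "request_id": request_id,
        "tool": tool,
    }))
```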

Then wire a canary: a low-traffic synthetic agent that runs the common workflows at fixed intervals, tagged as a distinct synthetic user. If it starts getting 429s while real aggregate traffic is stable, the layers beneath the flat dashboard have shifted — usually because a new user's prompt style is fanning out in an unexpected way and crowding the user-level buckets that your synthetic user shares a tenant with.

The failure mode is quiet until it isn't

The reason this problem is hard to catch early is that none of the individual signals look alarming. Tenant token-per-minute sits at 60–70% of cap. Individual 429 rates are "within SLA." The agent's success rate is high because failures are quietly absorbed by retries. What degrades is the thing dashboards don't show: a handful of users experience wildly inconsistent latencies and occasional complete stalls, because their calls are losing the race for a shared bucket to someone else's bad prompt.

By the time it shows up in aggregate metrics, it's a plateau — your agent platform simply stops scaling past a certain concurrency, and nobody can quite explain why. The architecture hasn't failed; the architecture was never there.

The fix is structural, not patchwork. Separate buckets per layer. Budgets reserved per request-tree. Retry budgets that cap the damage one prompt can do. Cooperative signals the planner can act on before it emits work you'll have to refuse. None of these are optional once your product is doing real agent work for real users, because the pathology they prevent isn't a corner case — it's the default behavior of any system that flattens four dimensions into one.

Rate limits in agent systems are not a cudgel against abuse. They're the flow-control fabric that keeps one user's overconfident planning step from eating the rest of the platform alive. Build them accordingly, or the flatness will find you.
