Skip to main content

The Rate Limit You Set for Humans an Agent Saturates in Three Seconds

· 10 min read
Tian Pan
Software Engineer

The rate limit was never a fairness primitive. It was a sales-engineering quote that grew up — a number a solutions engineer typed into a docs page during onboarding three years ago, copied into a tier definition, and never revisited because no one ever hit it. The limit said "100 requests per minute" and it meant "more than any sane integration will ever need," because every integration on the platform was a backend service driven by a human at a keyboard, and humans do not type a hundred times a minute.

Then a paying tenant pointed an agent at the endpoint. The agent did not type. It did not pause to read responses. It did not have a UI to render between requests. It executed a planning loop that called the API once per reasoning step, and one reasoning step took the model about thirty milliseconds of wall time to formulate. The agent hit the per-minute ceiling in three seconds, the per-hour ceiling in three minutes, and the daily quota before the on-call engineer's coffee had cooled. The support escalation landed before the throttle dashboard had updated.

The incident retro will talk about the agent. The agent is not the interesting part. The interesting part is that the rate limit was never doing the job people thought it was. It was a soft-guidance line dressed as a contractual ceiling, and the integration shape that exposed the gap was always going to come — agents are just the integration shape that came first.

The Limit Was Calibrated for a Caller That No Longer Exists

Walk back through how the number got picked. Someone — probably the first product manager on the API — looked at the analytics from the dashboard product and noticed the chattiest customer was making about twenty calls a minute, set the limit at five times that with a safety margin, and shipped. The number worked. It worked because every caller on the platform was a piece of glue code wrapped around a human action: a form submission, a button click, a sync triggered by a webhook fired by another human action upstream.

Human-driven traffic has a shape. The shape has a name in the queueing literature — bounded interarrival times, low burstiness, request rates that scale with seat count rather than with compute. The shape is also why the noisy-neighbor model on the platform worked: the worst neighbor was the customer with the most seats, and the customer with the most seats was paying the most, so the ceiling roughly aligned with the contract. Fairness emerged as a side effect of human-paced traffic, not as a property the system enforced.

Agents do not have that shape. An agent's request rate is a function of the planning loop's iteration time and the tool-call fan-out per step, neither of which has anything to do with seat count. A single agent acting on behalf of a single user can issue more requests per second than the entire human user base of the same tenant, and that ratio gets worse every time the model gets faster. The fairness side effect disappears. The number that worked for ten years stops working in the first afternoon of agent traffic.

"Rate Limit" Is a Category That Just Split

The thing that used to be called the rate limit is now two things, and the platform has to decide which one it is shipping. The first is the throughput contract — how much of the upstream resource a tenant is entitled to consume, expressed in whatever unit actually maps to cost. The second is the abuse ceiling — the place where the platform refuses to keep serving regardless of contract, because something pathological is happening and the cheapest response is to cut the connection.

These were the same number when callers were human, because the throughput a human could generate was always well below the abuse ceiling. They are not the same number when callers are agents. A perfectly well-behaved agent can saturate the old combined limit doing exactly the work the tenant is paying for, which means treating that saturation as abuse is wrong. A misconfigured agent can saturate ten times that limit doing nothing useful, which means raising the combined number for everyone to accommodate the well-behaved case is also wrong.

The split forces a separate conversation about each ceiling. The throughput contract becomes a per-tenant, per-workload-class budget that the sales team can quote against and the finance team can model. The abuse ceiling becomes a platform-wide circuit breaker keyed on behavioral signal — request-shape entropy, retry rate after errors, ratio of successful to failed tool calls — that fires only when the signal says the caller is in a degenerate state regardless of contract.

Token Buckets Keyed on Something Other Than the User

The mechanical fix is to move the bucket. The bucket keyed on user identity worked when one user generated one stream of requests. The bucket keyed on user identity now sits empty while twenty parallel agents under that user's session each generate a stream the bucket cannot see, or — worse — sits unfairly drained because the user happened to spawn an agent and the human session loses its allotment to a background process.

The bucket has to be keyed on a tuple. The tuple has to include the tenant, because the tenant is the billing unit. It has to include the workload class — interactive human, scheduled job, autonomous agent — because those classes have different acceptable burst shapes. It often has to include the model or downstream resource, because saturating a frontier-model budget is a different kind of failure than saturating a cached-response budget. The tuple is the rate-limiting primitive; the per-user limit is a degenerate special case of it.

Once the tuple exists, the rest follows. The free tier gets a smaller per-workload-class bucket than the enterprise tier. When a tenant upgrades, the metadata on the tenant identity changes and the bucket sizes update without a code change. A runaway agent throttles only its own workload-class bucket; the human session under the same tenant is unaffected. A multi-tenant abuse event fires the platform-wide circuit breaker without forcing the per-tenant buckets to be conservatively sized for the worst neighbor.

Burst Credits Were Implicit; Now They Have to Be Real

Human callers got their bursts for free, because human bursts were small and irregular. A user clicking a "sync all" button generated a fifty-request burst, the system absorbed it without noticing, and the per-minute average never told the truth about the second-by-second shape. The bucket worked because the burst was small enough to ride out the smoothing.

Agent bursts are not small. An agent told to enrich a thousand records will issue a thousand requests as fast as the network can carry them, then go quiet for the rest of the day. The per-minute average across the day looks fine; the bucket fires every time. The honest answer is that the agent's traffic is bursty by design and the platform either accommodates that by exposing burst credit as a first-class quota — preallocated capacity an agent can claim before the work starts — or it pushes agents into discovering the ceiling through 429 responses, which is what just happened.

Burst credits also force the platform to surface a piece of information that it has always had but never published: how close a tenant is to its ceiling. The human caller never needed to know, because the human caller was never near it. The agent caller has to know, because the agent caller is approaching the ceiling within the first second of any nontrivial job, and the right back-off behavior depends on whether the ceiling is twenty percent away or two percent away. The headers that expose that — the remaining, the reset, the burst-pool depth — stop being a courtesy and start being part of the integration contract.

The Contract Surface Has to Name the Agent Caller

Look at the public docs page that names the rate limit. It almost certainly has one ceiling per tier and a footnote about contacting sales for higher limits. The page reads like a quote sheet, which is what it always was. The page does not say "this limit assumes human-paced calling; agent-paced calling has a different limit." The page does not say "if you point an autonomous agent at this endpoint, the relevant SKU is over here." The page does not name agent traffic as a category, because when the page was written, agent traffic did not exist.

The fix is to make the page tell the truth about what the platform now serves. The agent-caller SLA goes next to the human-caller SLA, named explicitly, with its own bucket sizing and its own burst behavior and its own pricing dimension. The integration documentation tells builders which SKU their integration belongs to and why. The customer-success motion that used to be "what's your seat count" gets a new question about expected programmatic call rate, and the answer determines which tier metadata the tenant identity carries.

This is what the contract surface looks like when "rate limit" stops being one thing. The split is not optional; it is already real in the production traffic. The only choice is whether the docs page admits it or whether the tenant discovers it by sending a Friday-evening burst the on-call engineer has to explain at the Monday retro. The retro will say the cause was an agent. The cause was a contract that named one caller and now serves two.

The Architecture Realization

A rate limit looks like an implementation detail until the population of callers shifts. Then it becomes the place where every assumption about who is calling, why, and how often gets exposed at once. The fairness model was a side effect; the contract was a quote; the bucket was a guess; the headers were a courtesy. None of those held when the caller stopped being a human.

The teams that will keep paying the same incident cost are the ones that treat the next saturated-quota incident as a single agent's bug. The teams that will stop paying it are the ones that look at the limit the agent saturated and see, in retrospect, that the limit was never a primitive — it was a frozen snapshot of a caller population that has now moved. The work is to name the caller classes that exist, give each one its own ceiling derived from its own workload shape, and ship the contract surface that makes the split explicit before the next agent finds it.

References:Let's stay in touch and Follow me for more thoughts and updates