17 posts tagged with "rate-limiting"

The Backpressure Signal Your Inference Provider Refuses to Send

· 9 min read
Tian Pan
Software Engineer

Your retry logic backs off on 429. Your queue depth alarm fires when latency rises. Between those two signals there is a region of provider load where the right action is "slow down by twenty percent" — and the only thing the provider will tell you is the binary throttle that arrives too late. The single most useful signal for an agent fleet to coordinate on is the one no inference API actually exposes.

A 429 is a tombstone, not a warning. By the time you receive one, the provider has already decided your traffic is excessive, you have already wasted a request's worth of token accounting, and — if you are sharing a tenant with other consumers — they have probably gotten one too. The interesting failure mode is not the 429 itself; it is the seconds before it, when every client in the world is flying blind between "everything is fine" and "you are cut off."

The Free Trial That Burned Your Quarterly Inference Budget in Eleven Hours

· 11 min read
Tian Pan
Software Engineer

Your trial offered "100 generations per day." Your pricing team modeled an interested user kicking the tires for a week. The first trialist who points an agent at the endpoint runs through the daily quota in seventy seconds, the weekly quota in nineteen minutes, and the quarterly inference budget by lunch the next day. Nobody alerted, because the only alert wired up was the one that fires when a trial converts.

The trial limits were not wrong when they were written. They were calibrated for a usage distribution that no longer describes the modal user. Somewhere between the pricing review six months ago and the signup that arrived this morning, the population shifted from humans clicking buttons to programs that don't get tired. The numbers on the dashboard stopped meaning what they meant when you set them.

The Rate Limit You Set for Humans an Agent Saturates in Three Seconds

· 10 min read
Tian Pan
Software Engineer

The rate limit was never a fairness primitive. It was a sales-engineering quote that grew up — a number a solutions engineer typed into a docs page during onboarding three years ago, copied into a tier definition, and never revisited because no one ever hit it. The limit said "100 requests per minute" and it meant "more than any sane integration will ever need," because every integration on the platform was a backend service driven by a human at a keyboard, and humans do not type a hundred times a minute.

Then a paying tenant pointed an agent at the endpoint. The agent did not type. It did not pause to read responses. It did not have a UI to render between requests. It executed a planning loop that called the API once per reasoning step, and one reasoning step took the model about thirty milliseconds of wall time to formulate. The agent hit the per-minute ceiling in three seconds, the per-hour ceiling in three minutes, and the daily quota before the on-call engineer's coffee had cooled. The support escalation landed before the throttle dashboard had updated.

The Agent That Retried Its Way Past Your Rate Limit

· 10 min read
Tian Pan
Software Engineer

Your gateway enforces a clean 100 requests per second per tenant. The dashboard shows every tenant comfortably under that ceiling. The bill from your model provider says you blew through the spend cap anyway. Nobody on the rollout call has a clean story for why.

The answer is that the rate limiter and the bill are measuring different things. The limiter sees one "user request" when a customer clicks a button. The provider sees a planner call, three tool-result reflections, a format-correction retry triggered by a stricter JSON schema, and a final synthesis — each with its own internal retry budget that fires when a transient 429 or 500 comes back. A single click can fan out into thirty model calls. The limiter counts one. The bucket leaks at thirty times the rate it was sized for.

Rate-limiting an agentic system at the HTTP boundary is enforcing speed limits at the highway entrance while the cars inside multiply. Until the limiter understands the loop, the loop will route around it.
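One way to make the limiter "understand the loop" is to charge every model call, retries included, against a budget attached to the originating user request. The sketch below is illustrative only: `RequestBudget`, `BudgetExceeded`, and `run_agent_step` are hypothetical names, and the cap of thirty calls simply mirrors the fan-out figure above.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when one user request tries to spend more model calls than allowed."""


@dataclass
class RequestBudget:
    # Hypothetical cap on model calls (first tries plus retries) per user request.
    max_model_calls: int
    used: int = 0

    def charge(self, calls: int = 1) -> None:
        # Count every provider call, not just the inbound HTTP request.
        if self.used + calls > self.max_model_calls:
            raise BudgetExceeded(
                f"budget of {self.max_model_calls} model calls exhausted"
            )
        self.used += calls


def run_agent_step(budget: RequestBudget, attempts: int) -> None:
    # Each attempt, whether a first try or a retry, draws from the same budget.
    for _ in range(attempts):
        budget.charge()


if __name__ == "__main__":
    budget = RequestBudget(max_model_calls=30)
    run_agent_step(budget, attempts=30)   # planner, reflections, retries, synthesis
    try:
        run_agent_step(budget, attempts=1)
    except BudgetExceeded as exc:
        print("stopped:", exc)
```

The point of the design is that the unit being limited is the thing the provider bills for, so the gateway's count and the invoice can no longer diverge by the fan-out factor.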

Provider Rate Limits Are a Capacity Plan You Never Wrote

· 9 min read
Tian Pan
Software Engineer

The first time your application hits a 429 from a model provider, something important happens, and almost nobody notices it. Not the error itself — the line of code that runs next. Maybe your HTTP client retries with exponential backoff. Maybe it falls back to a smaller model. Maybe it queues the request, or drops it, or surfaces a spinner that never resolves. Whatever it does, that behavior is now your capacity policy. It decides which users get served and which get degraded when demand exceeds supply.

And you almost certainly didn't write it. It was authored by whoever wrote the SDK wrapper, the retry decorator, or the three-line try/except someone copied from a tutorial. The most consequential decision in your system under load — what to do when you can't do everything — is being made by code nobody reviewed as a capacity decision.

This post is an argument for treating that code as what it actually is: a load-shedding policy and a capacity plan. Not an error handler. The 429 is not the problem. The problem is that you have outsourced the design of your system's behavior under contention to a library default.
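As a rough illustration of what writing that policy down could look like, the sketch below makes the reaction to a 429 an explicit, reviewable table instead of a library default. The traffic classes (`interactive`, `batch`, `background`) and the actions assigned to them are assumptions chosen for the example, not recommendations from any particular SDK.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FALLBACK_MODEL = "fallback_model"
    QUEUE = "queue"
    DROP = "drop"


# Hypothetical policy table: what each traffic class does on successive 429s.
# The last entry in each list repeats for any later attempt.
THROTTLE_POLICY = {
    "interactive": [Action.RETRY, Action.FALLBACK_MODEL],
    "batch": [Action.QUEUE],
    "background": [Action.DROP],
}


def on_throttle(traffic_class: str, attempt: int) -> Action:
    """Return the action for the given class on its Nth throttled attempt (0-based)."""
    # Unknown classes are shed rather than silently retried.
    steps = THROTTLE_POLICY.get(traffic_class, [Action.DROP])
    return steps[min(attempt, len(steps) - 1)]


if __name__ == "__main__":
    print(on_throttle("interactive", 0))
    print(on_throttle("interactive", 3))
    print(on_throttle("batch", 0))
```

Whatever the table contains, the value is that it can be reviewed as a capacity decision: who gets served, who gets degraded, and who gets dropped.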

The Rate Limit That Became a Product Decision

· 10 min read
Tian Pan
Software Engineer

A rate limit used to be an infrastructure detail. You hit a 429, you retried with backoff, you queued the overflow, and nobody outside the on-call channel ever knew it happened. The user saw a response that was a few hundred milliseconds slower than usual. That was the whole story.

That story no longer holds for agentic features. When an agent hits a provider's tokens-per-minute ceiling halfway through a multi-step plan, the failure does not stay inside the infrastructure. It surfaces as a half-finished answer, a tool loop that stalls before the last call, or a user watching a spinner that will never resolve. The quota stopped being a backend capacity number and became a constraint that product has to design around — the same way product designs around a checkout flow or an empty state.

Quota Starvation: When Your AI Features Eat Each Other's Rate Limits

· 11 min read
Tian Pan
Software Engineer

At 2 AM, a scheduled report-generation job spins up fifty parallel LLM requests against your shared API key. By the time the 9 AM product demo starts, every real-time chat completion is silently timing out. Your error dashboards are green. No 429s in the logs. The model is returning responses — just ten seconds late, on a feature with a two-second SLA.

This is quota starvation. It does not look like an outage. It looks like the AI is "slow today."

Rate Limits Are a Design Constraint, Not an Error Code

· 10 min read
Tian Pan
Software Engineer

A team I know built a financial assistant with an agentic loop. Week one, API spend was $127. Week eleven, it was $47,000: same system, same feature, no intentional change in scope. The agent hit a rate limit, the retry logic dutifully retried, the loop had no circuit breaker, and the costs compounded in silence until someone noticed the billing alert they had set too high.

This isn't a story about a bug. It's a story about architecture. The team's mental model treated rate limits as an error to handle reactively. The system they built reflected that model exactly. The $47,000 week was the system working as designed.
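For readers unfamiliar with the missing piece, here is a minimal sketch of a circuit breaker around a retrying loop. It is a simplified illustration, not the team's code: the class name, the threshold, and the cooldown are all assumptions, and the clock is injectable only so the behavior can be demonstrated deterministically.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling the provider after repeated failures, then probes after a cooldown."""

    def __init__(
        self,
        failure_threshold: int,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: let a probe through, but keep the failure count
            # one below the threshold so a single failed probe re-opens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return True
        return False

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=3, cooldown_seconds=60.0)
    for _ in range(3):
        breaker.record_failure()       # e.g. three throttled calls in a row
    print(breaker.allow())             # the loop should stop calling here
```

A retry loop that checks `allow()` before each call turns "retry until it works" into "retry until the breaker says stop", which bounds the spend a stuck loop can accumulate.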

Conversation-Aware Rate Limiting: Why Per-Request Throttling Breaks Multi-Turn AI

· 10 min read
Tian Pan
Software Engineer

Your AI feature works in testing. Single-turn Q&A, perfect. Run it in production with a real user sitting in a 10-turn debugging session and it fails — not because the model broke, but because your rate limiter was designed for a completely different world.

The standard API rate limit is a blunt instrument built for stateless REST calls. Each request is treated as an independent, roughly equal unit of consumption. That model works fine for CRUD endpoints where every call is indeed comparable. It falls apart for multi-turn conversations, where each successive turn gets more expensive, a single user interaction can trigger dozens of internal model calls, and a mid-session cutoff is far more damaging than a failed single-shot query ever was.
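To see why successive turns get more expensive, consider a small worked example. It assumes, purely for illustration, that the full conversation history is resent as input on every turn and that each turn adds a fixed number of tokens; real sessions vary, and caching or summarization would change the numbers.

```python
def input_tokens_per_turn(tokens_added_per_turn: int, turns: int) -> list[int]:
    """Input tokens sent on each turn when the whole history is resent every time."""
    history = 0
    costs = []
    for _ in range(turns):
        history += tokens_added_per_turn   # this turn's new content joins the history
        costs.append(history)              # the full history is the input for this turn
    return costs


if __name__ == "__main__":
    costs = input_tokens_per_turn(tokens_added_per_turn=500, turns=10)
    print(costs[0], costs[-1], sum(costs))
```

Under those assumptions the tenth turn sends ten times the input of the first, and the session as a whole consumes 27,500 input tokens rather than the 5,000 a per-request model would predict for ten "equal" calls. A limiter that counts requests sees ten units either way.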

Agent Traffic Is Not Human Traffic: Designing APIs for Two Species of Caller

· 11 min read
Tian Pan
Software Engineer

The API you shipped two years ago was designed for a single species of caller: a person, behind a browser or a mobile client, clicking once and waiting for a response. That assumption is now wrong on roughly half of every interesting endpoint. The other half of the traffic is agents — your own, your customers', third-party integrations using your endpoints as tools — and they have different physics. They burst. They retry forever. They parallelize. They parse error strings literally. They act on behalf of a human who will not be available to clarify intent when something breaks.

Most of the production weirdness landing in postmortems this year traces back to one architectural mistake: treating both species as the same caller class. Rate limits sized for human pacing get blown apart by an agent's parallel fanout. Error messages designed to be human-readable get parsed wrong by an agent that retries forever on a 400. Idempotency assumptions that humans satisfy by default get violated when an agent retries the same payload from a recovered checkpoint. Auth logs lose the ability to distinguish "the user did this" from "the user's agent did this on the user's behalf."

The fix is not a smarter WAF or a bigger rate-limit bucket. It is a deliberate API design that names two caller classes, treats their traffic as different shapes, and records the delegation chain so accountability survives the indirection.

Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding

· 12 min read
Tian Pan
Software Engineer

A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.

Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.

This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.

Rate Limit Hierarchy Collapse: When Your Agent Loop DoSes Itself

· 12 min read
Tian Pan
Software Engineer

The bug report says the service is slow. The dashboard says the service is healthy. Token-per-minute usage is at 62% of the tier cap, well inside the green band. Then you open the traces and see the shape: one user request spawned a planner step, which emitted eleven parallel tool calls, four of which were search fan-outs that each triggered sub-agents, which each called three tools in parallel — and that single "request" is now pounding your own token bucket from forty-seven different workers at the same time. The other ninety-nine users of your product are stuck behind it, getting 429s they never earned. Your agent is DoSing itself, and the rate limiter is doing exactly what you told it to.

This is rate limit hierarchy collapse. You bought a perimeter defense designed for HTTP APIs where one request equals one unit of work, then wired it in front of a system where one request means a tree of unknown depth and unbounded branching factor. The single-bucket model doesn't just fail to protect — it fails invisibly, because your aggregate numbers never breach anything. The damage happens in the tails, in correlated bursts, and in the heads-down users who happen to be adjacent in time to a heavy one.
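One possible shape for a limiter that is aware of the request tree is sketched below. It is deliberately simplified: there is no time-based refill, the names (`TreeAwareLimiter`, `root_request_id`) are hypothetical, and the capacities are arbitrary. The only idea it demonstrates is that every worker in a tree draws from a cap keyed by the root request, in addition to the shared bucket.

```python
from collections import defaultdict


class TreeAwareLimiter:
    """Shared token pool with an additional cap per originating user request."""

    def __init__(self, global_capacity: int, per_tree_capacity: int) -> None:
        self.global_remaining = global_capacity
        self.per_tree_capacity = per_tree_capacity
        # Tokens already consumed by each request tree, keyed by its root request id.
        self.tree_used: defaultdict[str, int] = defaultdict(int)

    def try_acquire(self, root_request_id: str, tokens: int) -> bool:
        # Every sub-agent and tool call passes the id of the request that spawned it.
        if self.tree_used[root_request_id] + tokens > self.per_tree_capacity:
            return False   # this tree has used its share; others are unaffected
        if tokens > self.global_remaining:
            return False   # the shared bucket itself is empty
        self.tree_used[root_request_id] += tokens
        self.global_remaining -= tokens
        return True


if __name__ == "__main__":
    limiter = TreeAwareLimiter(global_capacity=1000, per_tree_capacity=300)
    heavy = [limiter.try_acquire("req-heavy", 100) for _ in range(4)]
    light = limiter.try_acquire("req-light", 100)
    print(heavy, light)
```

With a single bucket, the heavy tree could have drained all 1,000 tokens before the light request arrived. With the per-tree cap, the fourth acquisition by the heavy tree is refused while the adjacent user is still served.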