
Rate Limits Are a Design Constraint, Not an Error Code

· 10 min read
Tian Pan
Software Engineer

A team I know built a financial assistant with an agentic loop. Week one, API spend was $127. Week eleven, it was $47,000: same system, same feature, no intentional change in scope. The agent hit a rate limit, the retry logic dutifully retried, the loop had no circuit breaker, and the costs compounded in silence until someone noticed the billing alert they had set too high.

This isn't a story about a bug. It's a story about architecture. The team's mental model treated rate limits as an error to handle reactively. The system they built reflected that model exactly. The $47,000 week was the system working as designed.

The difference between "my system handles a rate limit event" and "my system is designed to function under sustained quota pressure" is not semantic. It's the difference between adding a try/except around your API call and deciding, at design time, what your system does when quota is the binding constraint — because at production scale, it often is.

The Multi-Dimensional Nature of Quota

Most engineers think of rate limits as a single dial: requests per minute. In practice, every major LLM provider enforces limits across at least three separate dimensions simultaneously: requests per minute (RPM), tokens per minute (TPM), and tokens per day (TPD). Some providers — Anthropic notably — separate input tokens per minute (ITPM) from output tokens per minute (OTPM), which matters because a cache hit on a 100K-token context costs you nothing against ITPM but would otherwise consume the majority of your per-minute budget.

The practical consequence is that optimizing one dimension can push you into a different limit than you expected. A system that batches requests efficiently to stay under RPM limits might burst into TPM limits during high-traffic periods. A system carefully shaped to stay under TPM might still exhaust its TPD allowance during sustained load. You need to track all three simultaneously, not just the one that bit you last time.
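
To make that concrete, here is a minimal sketch of a tracker that projects each request against all three windows before dispatch. The limit values, the single-process design, and the in-memory deque are simplifying assumptions, not any provider's real numbers; a production tracker would also reconcile against the provider's rate-limit response headers rather than trusting local counts.

```python
import time
from collections import deque

class QuotaTracker:
    """Projects each request against RPM, TPM, and TPD windows at once.

    Limit values here are illustrative placeholders only.
    """

    def __init__(self, rpm_limit=1_000, tpm_limit=100_000, tpd_limit=5_000_000):
        self.limits = {"rpm": rpm_limit, "tpm": tpm_limit, "tpd": tpd_limit}
        self.events = deque()  # (timestamp, tokens) for each dispatched request

    def _usage(self, now):
        # Drop events older than the longest window we track (one day).
        while self.events and now - self.events[0][0] > 86_400:
            self.events.popleft()
        minute = [(t, tok) for t, tok in self.events if now - t <= 60]
        return {
            "rpm": len(minute),
            "tpm": sum(tok for _, tok in minute),
            "tpd": sum(tok for _, tok in self.events),
        }

    def blocking_dimension(self, est_tokens):
        """Return the first dimension this request would violate, or None."""
        usage = self._usage(time.time())
        projected = {
            "rpm": usage["rpm"] + 1,
            "tpm": usage["tpm"] + est_tokens,
            "tpd": usage["tpd"] + est_tokens,
        }
        for dim, value in projected.items():
            if value > self.limits[dim]:
                return dim
        return None

    def record(self, tokens):
        self.events.append((time.time(), tokens))
```

Before dispatch, ask blocking_dimension(est_tokens); a non-None answer tells you which wall you were about to hit, which is exactly the observability the single-dial mental model throws away.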

Rate limits also vary dramatically by tier. Anthropic's Tier 1 (after a $5 deposit) gives 50 RPM and 12,500 ITPM, nearly two orders of magnitude below Tier 4's 4,000 RPM and 2,000,000 ITPM. OpenAI's Tier 5, which requires $500,000 in cumulative spend, allows 40 million TPM on its flagship models. The gap between what a startup has access to and what a scaled company has access to is enormous, and architectural choices that seem fine in development can collapse in production when the tier ceiling drops to what your organization has actually earned.

This matters for capacity planning. You cannot design your queue depth, your backpressure thresholds, or your retry budgets without knowing your actual quota ceiling — and that ceiling will change as your spend tier changes. Quota is not a fixed parameter; it's a moving constraint that your architecture needs to track as a first-class variable.

Calculating Sustainable Throughput Before You Need It

The question "how much load can my system sustain?" needs an answer in the planning session, not during a traffic spike. The math is not complicated, but it requires doing it before you need it.

For token-based limits, the calculation starts with your average token cost per request. If you're averaging 2,500 input tokens and 800 output tokens per request, each call consumes 3,300 tokens. Against a 100,000 TPM limit, your theoretical maximum is about 30 requests per minute, roughly 0.5 RPS. Against an RPM limit of 1,000, the token limit is the binding constraint by a wide margin; only if your RPM limit fell below roughly 30 would the request limit bind instead.

The sustainable throughput is the minimum across all limit dimensions given your request profile. For workloads with variable request sizes — different users send very different prompts — you need a distribution, not an average. The tail matters: if your p99 request is 10,000 tokens, a single large request can consume 10% of a 100,000 TPM budget in one shot, suddenly making your effective sustainable throughput much lower than the average-case calculation suggested.

A useful heuristic: budget for 70-80% of your stated limit as your sustained operating ceiling. Leave the remaining headroom for traffic spikes, for requests that run longer than expected, and for the measurement imprecision that accumulates when you're tracking multi-second rolling windows. Systems that operate at 100% of their stated quota have no room for anything to go slightly wrong.
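
Put together, the capacity math fits in a few lines. This sketch computes the ceiling implied by each limit dimension, takes the minimum, and applies the 70-80% headroom heuristic; all limit values are illustrative, not any specific tier's. Running it with the p99 token count instead of the mean gives the conservative figure the tail argument above calls for.

```python
def sustainable_rpm(tokens_per_request: float,
                    rpm_limit: float,
                    tpm_limit: float,
                    tpd_limit: float,
                    headroom: float = 0.75) -> float:
    """Max sustainable requests/minute across all limit dimensions.

    headroom=0.75 implements the 70-80% operating-ceiling heuristic.
    """
    ceilings = [
        rpm_limit,                               # request-count ceiling
        tpm_limit / tokens_per_request,          # per-minute token ceiling
        tpd_limit / tokens_per_request / 1_440,  # daily token ceiling, per minute
    ]
    return headroom * min(ceilings)

# Mean case vs. tail case, using the numbers from the text above:
print(sustainable_rpm(3_300, rpm_limit=1_000, tpm_limit=100_000,
                      tpd_limit=200_000_000))   # ~22.7 sustained RPM
print(sustainable_rpm(10_000, rpm_limit=1_000, tpm_limit=100_000,
                      tpd_limit=200_000_000))   # 7.5 RPM if every request were p99-sized
```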

Priority Queues: Deciding What Gets Quota First

Once you have a throughput ceiling, you need a policy for what work runs when you're near it.

The naive approach is FIFO: first request, first served. This is almost always wrong for heterogeneous workloads. When a user is waiting for an interactive response and a background batch job is running, they should not share a queue. The batch job's willingness to wait is a resource, and you should use it.

A workable production model uses at least three tiers. Interactive requests — user-facing chat, real-time completions — get highest priority and should be served immediately with latency as the primary SLO. Background requests — content generation, code review, document summarization — can tolerate seconds to minutes of queueing and should be scheduled after interactive work is served. Maintenance work — eval runs, index updates, monitoring probes — should only run when the queue is empty or quota is abundant.

The goal is not just fairness; it's load shedding. Background and maintenance work should be the first thing your system stops doing when quota pressure rises. If you're approaching your TPM ceiling, you don't want your eval suite consuming tokens that a waiting user needs. The priority queue is the mechanism that makes this decision explicitly rather than accidentally.

Deadline hints add a useful complement. If a client-side request has a 5-second timeout, and the request has been queued for 4.8 seconds, the correct behavior is to drop it from the queue rather than dispatch it — the output will be discarded by the time it arrives. This is easy to implement (include a deadline timestamp in each queue entry and check it at dispatch time) and prevents your inference capacity from being consumed by work that produces no value.
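
Here is one way the three tiers and the deadline check can fit together, as a sketch built on the standard library's heapq. The tier names and the shedding method mirror the policy described above; everything else is an assumption about how your dispatcher is shaped.

```python
import heapq
import itertools
import time
from dataclasses import dataclass, field
from typing import Any, Optional

INTERACTIVE, BACKGROUND, MAINTENANCE = 0, 1, 2  # lower value = more urgent

@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int  # tie-breaker: FIFO within a tier
    deadline: Optional[float] = field(compare=False, default=None)
    payload: Any = field(compare=False, default=None)

class PriorityDispatcher:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, payload, priority=BACKGROUND, timeout_s=None):
        deadline = time.time() + timeout_s if timeout_s else None
        heapq.heappush(self._heap,
                       QueuedRequest(priority, next(self._seq), deadline, payload))

    def next_request(self):
        """Pop the most urgent request whose deadline hasn't passed.

        Expired entries are dropped at dispatch time: their output would
        be discarded by the time it arrived anyway.
        """
        while self._heap:
            req = heapq.heappop(self._heap)
            if req.deadline is not None and time.time() >= req.deadline:
                continue  # client already timed out; don't spend quota on it
            return req
        return None

    def shed(self, min_priority=BACKGROUND):
        """Under quota pressure, drop every request at or below this urgency tier."""
        self._heap = [r for r in self._heap if r.priority < min_priority]
        heapq.heapify(self._heap)
```

Calling shed(BACKGROUND) as you approach the TPM ceiling is the explicit load-shedding decision the previous paragraph argues for: maintenance and background work disappear from the queue before a waiting user ever feels it.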

The Burst-Smooth Tradeoff and Why It's a Design Decision

When you have more work than your sustained quota allows, you have two options: burst up to your provider's hard ceiling and accept unpredictable rejections when you hit it, or smooth the load by queuing work and accepting higher but predictable end-to-end latency.

Bursting into your provider's limit ceiling gives you maximum throughput for short periods but creates unpredictable latency as requests stack up and start being rejected or delayed. Smoothing, using a token bucket or leaky bucket algorithm to drain your queue at a steady rate, trades throughput for stability. Batching compounds the benefit: grouping, say, 32 requests and dispatching them as a unit rather than individually can reduce per-token cost by roughly 85% with only a 20% latency increase for non-interactive workloads. For batch processing pipelines that don't have hard latency SLOs, this is almost always the right choice.
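
The smoothing side is small enough to show whole. In this token-bucket sketch the refill rate and burst capacity are assumptions; the idea is that the bucket refills at your sustained token budget and a dispatch proceeds only when it holds enough tokens for the request's estimated size.

```python
import time

class TokenBucket:
    """Smooths dispatch to a steady token rate instead of bursting to the ceiling.

    rate_per_s: sustained tokens/second (e.g. 75% of your TPM divided by 60).
    capacity:   the largest burst, in tokens, the bucket will ever allow.
    """

    def __init__(self, rate_per_s: float, capacity: float):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def acquire(self, est_tokens: float) -> float:
        """Block until est_tokens are available; return seconds waited."""
        waited = 0.0
        while True:
            self._refill()
            if self.tokens >= est_tokens:
                self.tokens -= est_tokens
                return waited
            sleep_s = (est_tokens - self.tokens) / self.rate
            time.sleep(sleep_s)
            waited += sleep_s

# 100,000 TPM at 75% headroom -> 1,250 tokens/second sustained
bucket = TokenBucket(rate_per_s=100_000 * 0.75 / 60, capacity=10_000)
```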

The mistake is treating this as a single global decision. Interactive and batch workloads have different tradeoff profiles and should be managed separately. A well-designed system runs interactive work at low concurrency with aggressive latency targets and runs batch work in a smoothed queue with throughput as the optimization target. These are two different control loops, not one.

Adding jitter to retry logic belongs in this same category: it's not optional. When 500 clients simultaneously encounter a rate limit and retry with identical exponential backoff intervals, they all retry at the same moment, recreating the exact traffic spike that caused the rate limit. Fixed-interval retries without jitter amplify traffic spikes by 60-80% rather than absorbing them. Add randomization — full jitter or decorrelated jitter — to every retry implementation, without exception.
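
Full jitter is a few lines. The sketch below assumes a generic call and a placeholder RateLimitError standing in for whatever your SDK raises on a 429.

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for your SDK's 429 exception."""

def retry_with_full_jitter(call, max_attempts=6, base_s=0.5, cap_s=30.0):
    """Exponential backoff with full jitter.

    Sleeps a uniform random amount in [0, min(cap, base * 2**attempt)],
    so clients that hit the limit together spread out instead of
    retrying in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retry budget; surface the error
            ceiling = min(cap_s, base_s * 2 ** attempt)
            time.sleep(random.uniform(0.0, ceiling))  # never a fixed interval
```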

Circuit Breakers and Multi-Provider Failover

Rate limits are not purely a quota management problem. They're also a reliability signal. A system that's receiving sustained 429s is not just over quota; it may be in a partially degraded state where some requests succeed and some fail unpredictably. Standard circuit breaker logic — counting errors, tripping to an open state, probing with a single request before closing — applies directly here.

LLM APIs need circuit breakers tuned for two failure modes that standard implementations miss. The first is sustained latency elevation: a request that takes 90 seconds to complete has held a connection and a concurrency slot for 90 seconds, even if it ultimately succeeds. Circuit breakers should trip on p95 latency exceeding your SLO, not just on error rate. The second is provider capacity depletion, where the API returns 200 but without the rate limit headers that normally signal remaining quota, a subtle failure mode that looks like success and drains your budget.
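
A sketch of a breaker that trips on either signal; the thresholds, window size, and cooldown below are assumptions to tune against your own SLO, not recommended values.

```python
import time
from collections import deque

class LatencyAwareBreaker:
    """Trips on error rate OR on p95 latency exceeding the SLO."""

    def __init__(self, error_threshold=0.5, p95_slo_s=10.0,
                 window=50, cooldown_s=30.0):
        self.error_threshold = error_threshold
        self.p95_slo_s = p95_slo_s
        self.cooldown_s = cooldown_s
        self.results = deque(maxlen=window)  # (ok, latency_s) samples
        self.opened_at = None                # None means closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a single probe through.
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record(self, ok: bool, latency_s: float):
        self.results.append((ok, latency_s))
        if self.opened_at is not None:
            # Probe result: success closes the breaker, failure restarts cooldown.
            self.opened_at = None if ok else time.monotonic()
            return
        if len(self.results) < self.results.maxlen:
            return  # not enough samples to judge yet
        errors = sum(1 for ok_, _ in self.results if not ok_)
        latencies = sorted(lat for _, lat in self.results)
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        if errors / len(self.results) >= self.error_threshold or p95 > self.p95_slo_s:
            self.opened_at = time.monotonic()  # trip open
```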

Single-provider architectures have a structural reliability problem that rate limit handling alone cannot solve. Provider API uptime across major LLM providers has been trending down year over year, with multi-hour outages occurring at every major provider in the past 18 months. Multi-provider routing — maintaining fallback endpoints for equivalent models across providers — transforms a single-provider outage from a complete service failure into a brief routing decision. The operational cost of maintaining two provider integrations is much lower than the cost of explaining complete service unavailability during a 4-hour provider outage.
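
The routing layer can be equally small. This sketch assumes interchangeable provider clients behind a common call signature (hypothetical, not any real SDK) and reuses the per-provider breaker from the sketch above; the point is that failover is a routing decision, not an architecture change.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    call: Callable[[str], str]      # hypothetical: prompt in, completion out
    breaker: "LatencyAwareBreaker"  # per-provider breaker from the sketch above

def route(prompt: str, providers: list) -> str:
    """Try providers in preference order, skipping any whose breaker is open."""
    last_error = None
    for p in providers:
        if not p.breaker.allow():
            continue  # this provider is tripped; fail over without waiting
        start = time.monotonic()
        try:
            result = p.call(prompt)
            p.breaker.record(ok=True, latency_s=time.monotonic() - start)
            return result
        except Exception as exc:
            p.breaker.record(ok=False, latency_s=time.monotonic() - start)
            last_error = exc
    raise RuntimeError("all providers unavailable") from last_error
```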

Designing for Quota, Not Against It

The reframe that changes architectural decisions is this: quota is not an obstacle your system fights; it's a constraint your system is designed around.

This means quota limits belong in your system design documents alongside latency SLOs and throughput requirements, not in your error handling code. It means your capacity planning process calculates maximum sustainable throughput before you go to production, not after you hit your first 429. It means your priority queues and backpressure mechanisms are in place before you have users, not added reactively when a spike reveals that your FIFO queue doesn't differentiate interactive from batch work.

The teams that do this well treat quota as similar to database connection pool size or cache memory: a resource with a ceiling that their system is explicitly designed to operate within. They know their ceiling, they monitor their consumption continuously, and they have defined behavior for every point in the consumption curve from 0% to 100%.

The teams that do it poorly discover their real bottleneck during a traffic spike, have no load-shedding behavior to fall back on, and get the $47,000 week. Both outcomes are predictable from the architectural choices made months earlier. The difference is whether rate limits were treated as a design constraint or an error code.
