The Rate-Limit Headers Your Provider Returned That Disagreed With The Actual Throttle
The response header said you had 480,000 tokens per minute of headroom. The 429 arrived after you spent 240,000. Your scheduler had been autoscaling against a number the runtime was never going to honor, and the burndown chart on the wall was reading the documentation while the throttler was enforcing something else entirely.
This is one of those failures that takes a long time to even notice, because every component along the path is doing exactly what it advertised. The provider returns a header. Your client parses it. Your scheduler reads it. Your dashboard plots it. None of these layers is broken. What is broken is the assumption that the header is a contract.
It is not a contract. It is a hint, surfaced from an internal accounting system that was never designed to be a load-bearing input for downstream control loops, and the gap between what it claims and what the throttle enforces is wide enough to drive an outage through.
Why the header and the throttler disagree
The header you parse — x-ratelimit-remaining-tokens, anthropic-ratelimit-tokens-remaining, or your provider's equivalent — is the answer to a different question than the one your scheduler is asking. Your scheduler wants to know: "how many tokens can I safely send in the next second?" The header answers: "as of the moment we serialized this response, here is the remaining balance on a counter we keep for purposes we have not fully documented."
Those are not the same question, and the implementations diverge in at least four directions.
The first divergence is the time window. Documentation says "tokens per minute," and most engineers assume that means a sliding 60-second window. In practice, providers enforce over sub-minute intervals — sometimes 1-second buckets, sometimes 10-second buckets, sometimes a token-bucket algorithm with a burst capacity that has nothing to do with the per-minute advertised number. Azure's documentation acknowledges this directly: even when your per-minute average is within limits, a burst inside a sub-minute window will trigger 429s. The header tells you about the minute. The throttler is counting seconds.
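To make the window mismatch concrete, here is a toy simulation. It is a sketch under stated assumptions, not any provider's real algorithm: the 10-second fixed bucket, the proportional per-bucket slice, and the names SubMinuteBucket and simulate are all hypothetical. Both traffic patterns send the same 300,000 tokens in a minute against a 480,000 TPM limit.

```python
from dataclasses import dataclass, field


@dataclass
class SubMinuteBucket:
    """Toy throttle: a per-minute limit enforced in fixed sub-minute buckets."""
    tokens_per_minute: int
    bucket_seconds: int = 10  # assumed enforcement interval, not a documented value
    used: dict = field(default_factory=dict)

    @property
    def per_bucket(self) -> int:
        # Each bucket gets a proportional slice of the per-minute limit.
        return self.tokens_per_minute * self.bucket_seconds // 60

    def admit(self, t_seconds: float, tokens: int) -> bool:
        """Return True if the request fits in the bucket containing t_seconds."""
        bucket = int(t_seconds // self.bucket_seconds)
        if self.used.get(bucket, 0) + tokens > self.per_bucket:
            return False  # this is where the 429 would come from
        self.used[bucket] = self.used.get(bucket, 0) + tokens
        return True


def simulate(arrivals, throttle):
    """Return (tokens admitted, requests rejected) for (time, tokens) arrivals."""
    admitted = rejected = 0
    for t_seconds, tokens in arrivals:
        if throttle.admit(t_seconds, tokens):
            admitted += tokens
        else:
            rejected += 1
    return admitted, rejected


# 30 requests of 10,000 tokens each: 300,000 tokens, under the 480,000 TPM limit.
smooth = [(i * 2.0, 10_000) for i in range(30)]  # one request every 2 seconds
bursty = [(i * 0.1, 10_000) for i in range(30)]  # all 30 inside the first 3 seconds

smooth_result = simulate(smooth, SubMinuteBucket(480_000))
bursty_result = simulate(bursty, SubMinuteBucket(480_000))
print(smooth_result, bursty_result)
```

Under these assumptions the smooth pattern is admitted in full, while the bursty pattern exhausts its 80,000-token bucket after eight requests and the rest are rejected, even though the per-minute total is identical.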
The second is estimation versus reconciliation. Output token costs are unknowable until the model generates them. Providers handle this by reserving estimated output capacity upfront, typically using a worst-case max_tokens value, and reconciling against actual usage when the response completes. Between the reservation and the reconciliation, the in-flight budget is debited at the estimate, not the truth. If your application sets max_tokens: 4096 but the model usually returns 300, the output side of your quota accounting is overstated by more than 13× (4096 / 300 ≈ 13.7) for the duration of every call. The header is honest about the estimate. Your dashboard treats it as a forecast.
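The arithmetic is worth running once with your own numbers. The sketch below assumes the reservation is simply input tokens plus max_tokens, which may not match how a given provider estimates. The 50 concurrent calls and 500 input tokens per call are illustrative values, not figures from any provider.

```python
def in_flight_debit(calls: int, input_tokens: int, max_tokens: int) -> int:
    """Tokens debited while calls are in flight, assuming worst-case output reservation."""
    return calls * (input_tokens + max_tokens)


def reconciled_usage(calls: int, input_tokens: int, actual_output_tokens: int) -> int:
    """Tokens actually consumed once the same calls complete and are reconciled."""
    return calls * (input_tokens + actual_output_tokens)


# Output side only: the reservation versus what the model usually returns.
output_overstatement = 4096 / 300

# Illustrative load: 50 concurrent calls, 500 input tokens each (assumed values).
debit = in_flight_debit(calls=50, input_tokens=500, max_tokens=4096)
actual = reconciled_usage(calls=50, input_tokens=500, actual_output_tokens=300)

print(f"output overstatement: {output_overstatement:.2f}x")
print(f"debited in flight: {debit}, reconciled: {actual}")
```

With these assumed values the in-flight debit is 229,800 tokens against 40,000 tokens of reconciled usage. The gap closes when the calls complete, but a scheduler that reads the header mid-flight sees only the inflated figure.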
The third is what counts. Most providers exclude cached input tokens from rate limit calculations, but the same providers may include them in the response's reported input_tokens and may charge for them at a discounted rate. A gateway sitting between you and the provider — Portkey, LiteLLM, Kong, an internal one — may or may not replicate this exclusion logic correctly. A known bug in LiteLLM through several 2025 releases counted cached tokens toward TPM limits even though the upstream provider would not have, which meant the gateway was 429ing requests the provider would have happily served. Your header was reporting the gateway's view; your retry logic was acting on the provider's view; the two were never reconciled.
The fourth is multi-deployment counter scope. If your traffic is spread across multiple gateway nodes, multiple regions, or multiple model deployments that share a quota, the per-response header reflects only what the local counter has seen. Azure API Management's token-limit policy explicitly notes this: counters are local to each gateway and do not aggregate across the instance. Your aggregate consumption can exceed the limit while every individual node's header reports plenty of headroom, because no node knows the other nodes' books.
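A small model of the scope problem, assuming three gateway nodes that each count against the full shared limit and never exchange totals. GatewayNode and the round-robin spread are hypothetical, chosen only to show how the books diverge.

```python
class GatewayNode:
    """One gateway node keeping its own counter against the full shared limit."""

    def __init__(self, shared_limit: int) -> None:
        self.shared_limit = shared_limit
        self.seen = 0  # tokens this node has forwarded; it knows nothing of its peers

    def forward(self, tokens: int) -> int:
        """Record a request and return the 'remaining' value this node would report."""
        self.seen += tokens
        return max(self.shared_limit - self.seen, 0)


SHARED_LIMIT = 480_000
nodes = [GatewayNode(SHARED_LIMIT) for _ in range(3)]

# 60 requests of 10,000 tokens, spread round-robin across the three nodes.
last_reported = {}
for i in range(60):
    node_index = i % len(nodes)
    last_reported[node_index] = nodes[node_index].forward(10_000)

aggregate_used = sum(node.seen for node in nodes)
print(last_reported)
print(aggregate_used, aggregate_used > SHARED_LIMIT)
```

Every node reports 280,000 tokens remaining while the aggregate has already reached 600,000 against a 480,000 limit. Any header derived from a single node's counter will look healthy right up to the 429.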
The autoscaler that scaled into the wall
The most common version of this incident is an autoscaler that takes x-ratelimit-remaining seriously as a capacity signal. The logic seems sensible: if the provider says we have headroom, dispatch more in-flight requests; if remaining drops below a threshold, back off. Engineers implement this in good faith, treating the header the way they would treat any other backpressure mechanism.
But the autoscaler is making a decision over the next several seconds based on a counter that the provider may reset in milliseconds, scope to a deployment your traffic does not exclusively own, or update asynchronously after a request that has not yet been billed. The result is a control loop in which the input variable and the constraint variable are running on different clocks against different units in different scopes.
When the system is under steady load, this is invisible. The bias is small per call, but it is structural and one-directional: the header consistently advertises more headroom than the throttle will honor, because the header is a snapshot and the throttle is a window. Under burst conditions — the kind that happen during product launches, traffic migrations, or any event your scheduler is supposed to handle smoothly — the bias compounds. The scheduler scales up, hits the actual throttle wall, sees 429s, treats them as transient, retries, and now the retry traffic is competing with the original traffic against a budget that the system still thinks is two-thirds full because the header has not been updated yet.
This is the failure mode where the dashboard shows green while the user-facing latency falls off a cliff, and the on-call engineer is reading the same green dashboard the autoscaler read three minutes ago, and concluding that something must be wrong with the network because the rate limit math obviously checks out.
The metrics that disagree with each other
Once you start logging both the headers and the actual 429 events, a second mismatch becomes visible: the metrics in the provider's console disagree with the metrics in your logs.
Token usage in the provider's monitoring view is generally a record of successfully processed and billed requests. Rate-limit enforcement, however, applies to every request at receipt — including ones that were rejected, never billed, or counted against an estimate that was later reconciled downward. So you can see 429s in production while the provider's TPM graph shows you well under the published limit. The metric is reporting on a different population than the throttler is acting on.
This is more than an inconvenience for capacity planning. It means the conversation between you and your provider's support team starts from incompatible numbers. Your traces show 429s at 240k TPM; the provider's dashboard shows you peaked at 180k TPM. Both are "correct" against their respective definitions. Neither helps you understand what to change. The header you originally trusted is now one of three numbers nobody can reconcile, and the post-mortem is going to spend more time arguing about which graph is canonical than about the actual control loop that failed.
What the patterns that survive look like
You cannot fix the headers — the provider owns them. What you can do is stop treating them as a single source of truth and build the control loop on signals that are operationally honest about their own limitations.
Treat the header as a coarse hint, not a budget. Read it for trend detection — "remaining is dropping faster than I expected" — and not as a permission slip for the next request. The actual budget your scheduler should plan against is the conservative one you derive from sustained observed throughput, not the optimistic one the header advertises.
Use the provider-returned token counts as the source of truth, propagated back into your accounting. When a response comes back with actual input_tokens and output_tokens, write those into the same counter your throttler is consulting. Never run the throttler off your local pre-call estimate when a reconciled actual is available. The estimate exists to admit the current request; once the response lands, the reconciled actual is what every later admission decision should be based on.
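One way to wire this up is a ledger that holds a reservation per in-flight request and swaps it for the actual counts on completion. This is a minimal sketch: TokenLedger and its methods are hypothetical names, the budget is a single fixed window with no reset logic, and the token counts are assumed to come from the usage fields of the provider's response.

```python
class TokenLedger:
    """Admission accounting: reserve at the estimate, settle at the reported actual."""

    def __init__(self, budget: int) -> None:
        self.budget = budget
        self.committed = 0  # reconciled tokens from completed requests
        self.reserved = {}  # request_id -> estimated tokens still in flight

    def available(self) -> int:
        return self.budget - self.committed - sum(self.reserved.values())

    def try_admit(self, request_id: str, estimated_tokens: int) -> bool:
        """Admit the request only if its estimate fits in the remaining budget."""
        if estimated_tokens > self.available():
            return False
        self.reserved[request_id] = estimated_tokens
        return True

    def reconcile(self, request_id: str, input_tokens: int, output_tokens: int) -> None:
        """Replace the reservation with the counts the provider actually reported."""
        self.reserved.pop(request_id, None)
        self.committed += input_tokens + output_tokens


ledger = TokenLedger(budget=10_000)
estimate = 500 + 4096  # input tokens plus max_tokens (illustrative values)

first = ledger.try_admit("a", estimate)
second = ledger.try_admit("b", estimate)
third = ledger.try_admit("c", estimate)  # rejected: only 808 tokens look available

# Request "a" completes and the provider reports 500 input and 300 output tokens.
ledger.reconcile("a", input_tokens=500, output_tokens=300)
retry = ledger.try_admit("c", estimate)  # now fits: 4,604 tokens available
print(first, second, third, retry)
```

The third request is declined only because two estimates are outstanding. Once one of them is reconciled down to its real 800 tokens, the same request is admitted.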
Reserve output capacity at the worst case your application would actually return, not at max_tokens. If your prompts almost always return under 800 tokens, do not reserve against 4096. Over-reserving costs you twice: once in the artificial pressure on your own throttler, and again in the requests you decline because the local accountant thinks you are out of room.
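A realistic worst case can be derived from your own response logs. The sketch below uses a nearest-rank 99th percentile plus a 10% margin, capped at max_tokens. The percentile, the margin, and the function names are assumptions to tune, not recommendations from any provider.

```python
import math


def nearest_rank_percentile(values, pct: int) -> int:
    """Nearest-rank percentile of a non-empty list of integers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def output_reservation(observed_output_tokens, max_tokens: int,
                       pct: int = 99, margin_pct: int = 10) -> int:
    """Reserve at an observed percentile plus a margin, never above max_tokens."""
    if not observed_output_tokens:
        return max_tokens  # no history yet: fall back to the worst case
    base = nearest_rank_percentile(observed_output_tokens, pct)
    padded = base + math.ceil(base * margin_pct / 100)
    return min(max_tokens, padded)


# Illustrative history: most responses near 300 tokens, a few longer ones.
observed = [300] * 95 + [700] * 4 + [800]
reservation = output_reservation(observed, max_tokens=4096)
print(reservation)
```

With this sample history the reservation comes out at 770 tokens instead of 4096. The occasional response that exceeds it is then corrected at reconciliation rather than paid for on every call.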
Run a divergence audit on a schedule. Once a day, sample your logs for 429 events and the headers that preceded them. If the average headroom advertised before a 429 is greater than a few percent of the limit, you have evidence that the header is not predictive and your control logic should stop treating it as if it were.
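The audit itself can be a few lines over your response logs. This sketch assumes each log record carries the HTTP status and the remaining-tokens header value if one was present. ResponseRecord, the 5% threshold, and the choice to look at the last non-429 header before each 429 are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class ResponseRecord:
    status: int
    remaining_tokens: Optional[int]  # header value on this response, if any


def headroom_before_429s(records, limit: int):
    """For each 429, the fraction of the limit advertised by the last header before it."""
    fractions = []
    last_remaining = None
    for record in records:
        if record.status == 429:
            if last_remaining is not None:
                fractions.append(last_remaining / limit)
        elif record.remaining_tokens is not None:
            last_remaining = record.remaining_tokens
    return fractions


def header_is_predictive(records, limit: int, threshold: float = 0.05) -> bool:
    """True if, on average, the header showed little headroom just before a 429."""
    fractions = headroom_before_429s(records, limit)
    return not fractions or mean(fractions) <= threshold


log = [
    ResponseRecord(200, 400_000),
    ResponseRecord(200, 300_000),
    ResponseRecord(429, None),  # throttled with 300,000 advertised
    ResponseRecord(200, 20_000),
    ResponseRecord(429, None),  # throttled with 20,000 advertised
]
fractions = headroom_before_429s(log, limit=480_000)
predictive = header_is_predictive(log, limit=480_000)
print(fractions, predictive)
```

In this sample the header advertised about a third of the limit on average just before a throttle, so the audit flags it as not predictive.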
Separate the "did I get throttled" signal from the "should I slow down" signal. A 429 is ground truth about throttling. A falling remaining value is a hint. A 200 OK with a healthy remaining value is the absence of evidence, not evidence of absence. Wire your backoff to the 429s, not to the header, and your scheduler will start tracking reality instead of documentation.
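One common way to drive backoff from 429s alone is additive-increase, multiplicative-decrease (AIMD) on the concurrency limit. AIMD is a swapped-in technique here, not something the providers prescribe. The class name, starting limit, bounds, and halving factor are illustrative assumptions.

```python
class AimdLimiter:
    """Concurrency limit driven only by observed outcomes, never by the header."""

    def __init__(self, start: int = 8, floor: int = 1, ceiling: int = 64) -> None:
        self.limit = start
        self.floor = floor
        self.ceiling = ceiling

    def on_response(self, status: int) -> int:
        """Update and return the allowed number of in-flight requests."""
        if status == 429:
            # Ground truth about throttling: halve the limit, respecting the floor.
            self.limit = max(self.floor, self.limit // 2)
        elif 200 <= status < 300:
            # A success is weak evidence of headroom: probe upward one step at a time.
            self.limit = min(self.ceiling, self.limit + 1)
        # Other statuses (for example 5xx) leave the limit unchanged.
        return self.limit


limiter = AimdLimiter(start=8)
history = [limiter.on_response(status) for status in (200, 200, 429, 429, 429, 429, 200)]
print(history)
```

The header can still feed a dashboard or a trend alert. It just never gets to raise the limit on its own.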
If you run a gateway, audit what it does with cached tokens and reservations. A gateway is the easiest place for the header-to-throttle disagreement to grow worse, because the gateway is trying to enforce a TPM policy on traffic whose actual TPM cost is owned by the provider. The gateway's view of "TPM" is a guess; the provider's is the truth; if the two diverge silently, your traffic shaping is fictional.
The deeper problem
The header-versus-throttle gap is one instance of a larger pattern in AI infrastructure: provider-side signals were designed to be informational and your application is treating them as load-bearing. The same shape shows up with token-count estimates, cache-hit advisories, model-version aliases, and "soft" limits in dashboards. Each one is a number that is honest about what it represents and dishonest about how much weight it can carry.
The fix is not to demand that providers make their numbers more contractual — they will not, and they are right not to, because the alternative is exposing internal accounting that genuinely does drift. The fix is to design control loops that treat provider-side numbers as the noisy, lagged, scope-limited signals they actually are, and to put your reliability budget against signals you own end-to-end: your own observed throughput, your own measured success rate, your own reconciled token counts.
The burndown chart that read the documentation while the runtime enforced something else was not a bug in the chart. It was a category error in what the chart was charting. Once you stop asking the header to be a contract, you can stop being surprised when the throttler enforces one anyway.
- https://platform.claude.com/docs/en/api/rate-limits
- https://developers.openai.com/api/docs/guides/rate-limits
- https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
- https://learn.microsoft.com/en-us/azure/api-management/llm-token-limit-policy
- https://techbytes.app/posts/adaptive-rate-limiting-variable-cost-ai-inference-apis/
- https://community.openai.com/t/openai-response-x-ratelimit-header-values-1-and-0/1366625
- https://community.openai.com/t/x-ratelimit-headers-missing/935514
- https://clemenssiebler.com/posts/understanding-azure-openai-x-ratelimit-remaining-tokens-x-ratelimit-remaining-requests-headers/
- https://github.com/BerriAI/litellm/issues/18728
- https://portkey.ai/blog/rate-limiting-for-llm-applications/
- https://orq.ai/blog/api-rate-limit
- https://support.anthropic.com/en/articles/8243635-our-approach-to-api-rate-limits
