LLM API Resilience in Production: Rate Limits, Failover, and the Hidden Costs of Naive Retry Logic
In mid-2025, a team building a multi-agent financial assistant discovered their API spend had climbed from $127/week to $47,000/week. An agent loop — Agent A asked Agent B for clarification, Agent B asked Agent A back, and so on — had been running recursively for eleven days. No circuit breaker caught it. No spend alert fired in time. The retry logic dutifully kept retrying each timeout, compounding the runaway cost at every step.
This is not a story about model quality. It is a story about distributed systems engineering — specifically, about the parts of it that most LLM application developers skip because they assume the provider handles it.
They do not.
LLM API providers operate at roughly 99.0–99.5% uptime. That sounds fine until you convert it: 99% uptime means about 3.7 days of downtime per year. By contrast, big-three cloud providers average 99.97% uptime — about 2.5 hours per year. That is well over an order of magnitude more downtime exposure, and it is not improving quickly. API uptime across the LLM industry fell from 99.66% to 99.46% between Q1 2024 and Q1 2025 — roughly 60% more downtime year-over-year, as demand growth outpaced infrastructure scaling.
If you are calling an LLM API from a production system and your resilience strategy is "retry on error," you have already made several expensive assumptions that will eventually be wrong.
Why Retry Logic Is Where Good Intentions Die
The single most common resilience mistake in LLM applications is not the absence of retries — it is retries without jitter, without layering discipline, and without a budget.
When a rate limit or timeout occurs, a naively implemented retry loop fires again immediately. This hammers the same already-overloaded endpoint, exhausts the retry budget within milliseconds, and produces no recovery window. Worse, in systems with multiple service layers, the amplification compounds. Three attempts at each layer of a five-service call chain (one original call plus two retries) produce 3^5 = 243 backend calls for each original user request. This is the canonical retry storm: the original problem was minor; the retry behavior made it fatal. About 40% of cascading failures in distributed systems trace back to retry logic.
The fix is not "remove retries." Retries are essential. The fix is three-part:
Use full jitter, not none. Pure exponential backoff without jitter synchronizes all clients to retry at the same moment, recreating the thundering herd on every attempt. Full jitter spreads retries: sleep = random_between(0, min(cap, base * 2^attempt)). Starting values that work: attempt 1 waits up to 1s, attempt 2 up to 2–3s, attempt 3 up to 4–6s, cap at 32–60s with a maximum of 3–5 attempts.
Retry only at one layer. If your application calls a service that calls another service, retries at every hop multiply. Pick one layer — usually the outermost application layer — and make it the only place retries happen. Internal layers should propagate failures cleanly.
Implement a retry budget. Set a global constraint: total retries should not exceed 10% of total requests at any given time. If your retry rate exceeds the budget, fail fast. This prevents one degraded endpoint from pulling down everything else.
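The 10% budget can be enforced with a sliding-window counter consulted before every retry. A minimal sketch — the RetryBudget name and window size are illustrative, not from any particular library:

```python
import time
from collections import deque

class RetryBudget:
    """Allows a retry only while retries stay under `ratio` of total
    requests seen in the last `window_seconds`."""

    def __init__(self, ratio=0.1, window_seconds=60.0):
        self.ratio = ratio
        self.window = window_seconds
        self.events = deque()  # (timestamp, is_retry)

    def _prune(self, now):
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def record_request(self):
        self.events.append((time.monotonic(), False))

    def try_acquire_retry(self):
        """Return True and record the retry if the budget allows it."""
        now = time.monotonic()
        self._prune(now)
        requests = sum(1 for _, is_retry in self.events if not is_retry)
        retries = sum(1 for _, is_retry in self.events if is_retry)
        if requests == 0 or (retries + 1) / requests > self.ratio:
            return False  # budget exhausted: fail fast instead of retrying
        self.events.append((now, True))
        return True
```

When `try_acquire_retry` returns False, the request should fail immediately and surface the error, which is exactly the behavior that keeps one degraded endpoint from dragging the rest of the system down.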
One more thing: never retry 4xx errors blindly. A 400 or 403 encodes a problem with the request itself and will fail identically on every attempt. The only 4xx worth retrying is 429 (rate limit), and even then, read the Retry-After header before choosing a wait duration. If the provider tells you the exact reset time, use it rather than guessing.
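Put together — full jitter, a retry cap, honoring Retry-After on 429s, and never retrying other 4xxs — the backoff loop looks roughly like this. ProviderError and the injected sleep function are stand-ins for whatever client you actually use:

```python
import random
import time

class ProviderError(Exception):
    """Stand-in for an HTTP error from the provider client."""
    def __init__(self, status, retry_after=None):
        self.status = status
        self.retry_after = retry_after  # seconds, from the Retry-After header

def full_jitter_delay(attempt, base=1.0, cap=60.0):
    """AWS-style full jitter: sleep uniformly in [0, min(cap, base * 2^attempt))."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(call, max_attempts=4, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return call()
        except ProviderError as err:
            retryable = err.status == 429 or err.status >= 500
            if not retryable or attempt == max_attempts - 1:
                raise  # a 400/403 will never succeed; or budget is exhausted
            if err.status == 429 and err.retry_after is not None:
                sleep(err.retry_after)  # provider told us the exact reset time
            else:
                sleep(full_jitter_delay(attempt))
```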
TPM vs. RPM: You Are Probably Only Handling One
LLM rate limits operate on two independent axes simultaneously, and most teams only think about one of them.
RPM (requests per minute) limits the number of API calls. It protects infrastructure from request floods. TPM (tokens per minute) limits compute consumption. It protects GPU capacity from workloads with long prompts or extensive agent chains. You can stay within RPM while blowing past TPM, and vice versa. Both will produce a 429, but the underlying cause and the right response differ.
For agents and RAG pipelines, TPM is almost always the binding constraint. A pipeline that retrieves 20 documents and stuffs them into a 15,000-token prompt burns TPM at roughly 15x the rate of a short-form query, even at the same request count.
Production-grade token management requires:
- Pre-request estimation using a tokenizer (tiktoken for OpenAI, provider-specific equivalents elsewhere) to reject or queue requests before they blow the budget.
- Always setting max_tokens to cap output. Without this, a model that decides to write an unusually thorough response can silently exhaust your TPM budget on a single request.
- Dual rate limiting at the application layer, not just at the provider edge. Enforce both RPM and TPM limits in your own code, with a queue that smooths burst traffic using Redis or Kafka rather than shedding it.
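Dual rate limiting reduces to a pair of token buckets checked before each request: one counting requests, one counting estimated tokens. A minimal in-process sketch — the limits are placeholders, and in production the token estimate would come from tiktoken or the provider's tokenizer, with rejected requests queued rather than dropped:

```python
import time

class TokenBucket:
    """Refills `capacity` units per `period` seconds; take() succeeds only
    if enough units are available right now."""

    def __init__(self, capacity, period=60.0, clock=time.monotonic):
        self.capacity = capacity
        self.period = period
        self.clock = clock
        self.available = float(capacity)
        self.last = clock()

    def take(self, amount=1):
        now = self.clock()
        self.available = min(
            self.capacity,
            self.available + (now - self.last) * self.capacity / self.period,
        )
        self.last = now
        if self.available < amount:
            return False
        self.available -= amount
        return True

class DualRateLimiter:
    """Admits a request only if BOTH the RPM and TPM buckets have room."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.requests = TokenBucket(rpm, clock=clock)
        self.tokens = TokenBucket(tpm, clock=clock)

    def admit(self, estimated_tokens):
        if not self.requests.take(1):
            return False
        if not self.tokens.take(estimated_tokens):
            self.requests.available += 1  # refund the request slot we took
            return False
        return True
```

The point of the sketch is the shape, not the numbers: a long-prompt request can be rejected on TPM even when RPM has plenty of headroom, which is exactly the case most teams fail to handle.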
Azure deployments add another dimension: per-instance limits and shared regional caps are independent. A deployment with five Azure instances each configured for 450K TPM on GPT-4o may still hit a region-wide limit that caps all instances combined at 300K TPM. This is not documented prominently and is typically discovered under load.
Circuit Breakers: The Mechanism That Separates Graceful Degradation from Self-Inflicted Collapse
A circuit breaker sits between your application and the LLM provider. In normal operation (closed state), all requests pass through. When the failure rate exceeds a threshold over a rolling window — say, more than 20% of requests fail over the last 60 seconds — the circuit trips open. In the open state, requests fail immediately without touching the provider, giving the provider time to recover. After a cooldown period, the circuit enters half-open state and allows a small fraction of test traffic through to probe whether recovery has occurred.
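The closed/open/half-open state machine is small enough to sketch directly. The thresholds below are the illustrative values from the text, not the defaults of any library:

```python
import time

class CircuitBreaker:
    """Trips open when the failure rate over a rolling window exceeds a
    threshold; after a cooldown, admits a probe request (half-open)."""

    def __init__(self, failure_rate=0.2, window=60.0, cooldown=30.0,
                 min_requests=10, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.window = window
        self.cooldown = cooldown
        self.min_requests = min_requests
        self.clock = clock
        self.state = "closed"
        self.opened_at = None
        self.events = []  # (timestamp, success)

    def allow(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown:
                self.state = "half_open"  # let a probe through
                return True
            return False  # fail fast, don't touch the provider
        return True

    def record(self, success):
        now = self.clock()
        self.events.append((now, success))
        self.events = [(t, ok) for t, ok in self.events
                       if t >= now - self.window]
        if self.state == "half_open":
            # The probe result decides: recover fully or trip again.
            if success:
                self.state = "closed"
                self.events.clear()
            else:
                self.state = "open"
                self.opened_at = now
            return
        failures = sum(1 for _, ok in self.events if not ok)
        if (len(self.events) >= self.min_requests
                and failures / len(self.events) > self.failure_rate):
            self.state = "open"
            self.opened_at = now
```

Wrap the provider call in `allow()` / `record()`, and route to fallback logic whenever `allow()` returns False.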
The concrete production impact is significant. For an application making 100 requests per minute during a five-minute outage:
- Without a circuit breaker: 500–1,000 requests hang for 30 seconds each waiting for timeouts. Users experience degraded responses throughout the outage.
- With a circuit breaker: after roughly 10–15 failed requests trip the threshold, the remaining ~485 requests fail fast in under 10ms. Fallback logic engages immediately. Users see a 200ms response from the secondary provider rather than a 30-second timeout.
Mean time to detection drops from 30 minutes to 2 minutes with circuit breaker telemetry, because the circuit state is an explicit signal that something is wrong.
For LLM applications, standard HTTP circuit breaker triggers — error rate, consecutive failures, latency P95 — are necessary but not sufficient. Add:
- Cost per request exceeding a threshold: the $47,000/week runaway agent mentioned at the top would have been caught by a circuit breaker configured to open when cost per conversation exceeds $X.
- Conversation turn count: break circuits at 20+ turns in an agentic conversation. Legitimate reasoning chains rarely need more; runaway loops almost always need more.
- Output quality score falling below threshold: requires a lightweight LLM-as-judge running on outputs before they reach the user.
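These LLM-specific triggers can be layered on top of the standard HTTP checks as a simple predicate evaluated per conversation. The thresholds below are illustrative, and the quality score is assumed to come from a sampled LLM-as-judge:

```python
def should_trip(turns, conversation_cost_usd, quality_score,
                max_turns=20, max_cost_usd=5.0, min_quality=0.6):
    """Return the reason this conversation's circuit should open,
    or None if it is within budget. Thresholds are hypothetical."""
    if turns >= max_turns:
        return "runaway_conversation"   # likely an agent loop
    if conversation_cost_usd >= max_cost_usd:
        return "cost_budget_exceeded"   # catches runaway spend per conversation
    if quality_score is not None and quality_score < min_quality:
        return "quality_degradation"    # from the sampled LLM-as-judge
    return None
```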
Multi-Provider Failover Is No Longer Optional
By mid-2025, 40% of production LLM teams had multi-provider routing in place, up from 23% just ten months earlier. The main forcing function was a series of notable provider outages — including multi-hour incidents at both major foundation model providers — that left single-provider applications completely dark while multi-provider applications failed over in seconds.
The failure modes are predictable: a rate-limit storm on one provider, a 10-hour inference infrastructure outage at another, silent quality degradation that HTTP success rates cannot detect. None of these affect providers uniformly at the same time. Routing across providers converts single-provider outages into brief blips.
There are two failover architectures worth knowing:
Sequential failover: primary → secondary → tertiary. Simple to implement. The cost is ~1–3 seconds of additional latency per hop, which is often acceptable for non-interactive workloads.
Parallel hedging: fire requests to primary and secondary simultaneously; use whichever responds first; cancel the other. Eliminates the latency penalty of sequential failover but roughly doubles token cost. Reserve this for interactive use cases where first-token latency is the primary SLO.
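Parallel hedging is a few lines with asyncio: race the providers, take the first success, cancel the rest. The provider callables here are placeholders for your actual async client calls:

```python
import asyncio

async def hedged_call(primary, secondary):
    """Fire both providers, return the first successful result, cancel
    the loser. Roughly doubles token spend; reserve for latency SLOs."""
    tasks = {asyncio.create_task(primary()), asyncio.create_task(secondary())}
    try:
        while tasks:
            done, tasks = await asyncio.wait(
                tasks, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                if task.exception() is None:
                    return task.result()  # winner; losers cancelled in finally
        raise RuntimeError("all hedged providers failed")
    finally:
        for task in tasks:
            task.cancel()
```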
The engineering challenges of multi-provider routing are underappreciated:
- Every provider has different error formats, rate-limit headers, and response schemas. A library like LiteLLM normalizes these, but at ~2,000 RPS LiteLLM's memory usage climbs past 8 GB. Higher-throughput environments need purpose-built gateways (Portkey, Bifrost in Go, or Bedrock for AWS-native stacks).
- Fallback models may produce structurally different outputs. Falling back from one model to another during an outage can break downstream JSON parsers if the models format responses differently.
- Cost can spike dramatically during failover. If your primary provider is the cheapest option and failover routes to a more expensive provider, a 10-hour outage during peak traffic can generate significant unexpected cost.
Silent Degradation Is the Failure Mode You Are Not Monitoring For
In August 2025, an LLM provider published a postmortem documenting three simultaneous bugs that had been degrading response quality for weeks. None of them were hard errors. HTTP success rates looked normal throughout. The failures:
- A load balancing change caused requests to be routed to servers configured for a different context window size. At peak, 16% of requests on one model were affected.
- A TPU configuration error caused high probability weight to be assigned to rare tokens, producing responses in the wrong language intermittently.
- A compiler arithmetic mismatch caused the highest-probability token to "sometimes disappear from consideration entirely" — producing technically plausible but factually wrong outputs.
Standard uptime monitoring caught none of these. They were detected via user complaints and manual investigation.
This is the category of failure that is hardest to defend against and most consequential for applications where output correctness matters. The monitoring requirements are different from what most teams have in place:
- Output schema validation: if your application expects structured JSON, validate the schema on every response. Schema failures are a leading indicator of model regression.
- LLM-as-judge on a sample: run a small percentage of responses through a lightweight quality assessment. A drop in quality scores before a drop in HTTP success rates is a valuable early warning signal.
- Embedding drift on outputs: track the semantic distribution of responses over time. Sudden drift in output embeddings — even when outputs are syntactically valid — indicates something changed upstream.
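Of the three, schema validation is the cheapest to stand up: it needs nothing heavier than a few type checks per response. A stdlib-only sketch, with a hypothetical expected schema:

```python
import json

# Hypothetical schema for this application: key name -> required type.
EXPECTED_SCHEMA = {"answer": str, "confidence": float, "sources": list}

def validate_response(raw):
    """Return (parsed, errors). A non-empty `errors` list should feed a
    model-regression metric, not just a log line."""
    errors = []
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return None, ["top-level value is not an object"]
    for key, expected_type in EXPECTED_SCHEMA.items():
        if key not in parsed:
            errors.append(f"missing key: {key}")
        elif not isinstance(parsed[key], expected_type):
            errors.append(f"wrong type for {key}")
    return parsed, errors
```

A rising schema-failure rate across otherwise successful HTTP responses is precisely the signal that would have surfaced the load-balancing and token-weighting bugs above.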
None of these are expensive to implement. All of them would have caught the August 2025 bugs faster than waiting for user complaints.
Putting It Together: A Minimum Viable Resilience Stack
The full picture: an LLM call in a production system should pass through a request queue that enforces dual TPM/RPM limits, then through a circuit breaker with error-rate and cost-threshold triggers, then to a gateway that handles exponential backoff with full jitter and can route to a secondary provider on 429s, 5xxs, or latency threshold breaches. Outputs should be schema-validated before returning to the application, with a percentage sample sent to a quality monitor.
This is not exotic infrastructure. It is the same distributed systems engineering that makes HTTP microservices reliable — applied to a new category of external dependency that happens to be slower, more expensive per call, and more likely to silently degrade than most services engineers have worked with before.
The teams that have built this are the ones whose applications kept serving users during every major provider outage in 2025 and 2026. The teams that have not built it are the ones writing incident postmortems about why their application was down for ten hours when the provider was down for ten hours.
The provider outage is not optional. The circuit breaker is.
Sources
- https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
- https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/
- https://portkey.ai/blog/rate-limiting-for-llm-applications/
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://www.sitepoint.com/claude-api-circuit-breaker-pattern/
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues
- https://medium.com/google-cloud/building-bulletproof-llm-applications-a-guide-to-applying-sre-best-practices-1564b72fd22e
- https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm/
- https://cookbook.openai.com/examples/how_to_handle_rate_limits
- https://www.runtime.news/as-ai-adoption-surges-ai-uptime-remains-a-big-problem/
