The Tail-Tolerant Retry Policy Your LLM Gateway Doesn't Have
Pull up your gateway's retry config. Three attempts. Exponential backoff with jitter. Retry on 5xx and timeout. Maximum delay capped at a few seconds. It looks reasonable, and someone copied it from a microservices runbook two years ago. It is also the single largest reason your P99 is twice your P50, your token bill spikes during provider incidents, and a meaningful slice of your users see a thirty-second spinner before silently bouncing.
A retry policy designed for 50ms RPCs does not survive contact with an 8-second LLM call. The shape of the failure is different, the cost of every attempt is different, and the user-perceived clock is different. The default is not safe, it is just familiar. Most teams discover this the same way: a postmortem where the gateway logs a successful response and the customer screenshot shows a frozen UI.
The microservices retry idiom assumes three things, all of which are wrong for LLM traffic. It assumes that retries are cheap, because the original call was cheap. It assumes that timeout is rare, because well-tuned services rarely time out. It assumes that the slow path and the failure path are the same path, because for a 50ms call they almost are. None of those hold once you are calling a model that streams tokens for ten seconds, where every retry replays a 100k-token prompt, and where a timeout is the modal failure rather than the exception.
This post walks through the retry policy an LLM gateway actually needs. The good news: most of the building blocks are old. The bad news: nobody assembled them for this workload before, and the defaults inherited from your service mesh are quietly burning both money and trust.
The Microservices Retry Defaults Don't Survive Contact With LLMs
Start with the math. A typical LLM call has a P50 around 8 seconds and a P95 closer to 18. Apply the standard "retry on timeout, three attempts, exponential backoff" pattern with a 10-second per-attempt timeout, and a single slow request now spans up to 30 seconds before fallback even fires. LiteLLM users have documented this exact scenario, where the fallback path everyone configured "for resilience" silently adds 30+ seconds of latency to the failing tail.
Worse, the retries do not converge. When a provider is overloaded and returning 529 errors, every other team's retry loop is firing simultaneously into the same already-saturated endpoint. The original incident was a 30-second blip; the retry storm extends it to ten minutes. Roughly 40% of cascading failures in distributed systems trace back to retry logic, and LLM workloads sit at the worst end of that distribution because every retry replays a multi-thousand-token request that costs the same as the original.
Then there is the cost dimension that microservices never had to think about. A failed REST call costs you a TCP handshake. A failed LLM call that aborted halfway through generation costs you the full input token bill, often the partial output token bill, and a fresh full-price retry on top. There are public bug reports of gateway restarts triggering full re-sends of 100k-token contexts and producing surprise four-figure overnight charges. The retry budget is no longer "how much CPU am I willing to burn" — it is a line item on next month's invoice.
Timeout Is the Modal Failure, Not the Exception
The single biggest shift from microservice retry to LLM retry is that timeout stops being an edge case. For a 50ms service, timing out at 1 second is a clear "something is broken" signal. For an LLM, timing out at 30 seconds is "the model is taking longer than usual to think," which happens routinely under load, during long-context reasoning, or when the provider is mid-deploy.
This means timeout retries need their own policy, separate from error retries. Lumping them together produces two failure modes:
The first is over-retrying genuine slow generations. A long reasoning chain that would have completed at 25 seconds gets killed at 10, retried, killed again at 20 cumulative, retried, killed again at 30 cumulative — three attempts deep, three full token bills, and the user has gotten nothing. This is what users mean when they say "the AI feature got worse" after a release where the only change was a tightened timeout.
The second is under-retrying transient errors. A 529 overload from Anthropic's infrastructure resolves in 10–30 seconds for the median case. If your retry budget is shared with timeouts and was already exhausted by a slow generation that completed normally, the actual recoverable error has nowhere to go and surfaces as a user-visible failure.
The fix is to classify before you retry. Build a small taxonomy: model-overloaded errors (529, 503), tool-call validation failures, schema-validation failures, network errors, and pure timeouts. Each gets its own budget, its own backoff curve, and its own ceiling. Pure timeouts especially need a separate budget because they are normal, expected, and high-volume — treat them like cache misses, not like exceptions.
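A minimal sketch of that classification, assuming the gateway inspects the HTTP status, a timeout flag, and a validation-error tag on each failed attempt; the class names and budget numbers below are illustrative placeholders, not recommendations.

```python
from dataclasses import dataclass
from enum import Enum, auto


class FailureClass(Enum):
    OVERLOADED = auto()         # 529 / 503 from the provider
    TOOL_CALL_INVALID = auto()  # model produced an unparseable tool call
    SCHEMA_INVALID = auto()     # response failed output-schema validation
    NETWORK = auto()            # connection reset, DNS, TLS
    TIMEOUT = auto()            # per-attempt deadline exceeded


@dataclass
class RetryBudget:
    max_attempts: int       # ceiling for this failure class
    base_backoff_s: float   # starting backoff, doubled per attempt


# Illustrative numbers only. Timeouts are high-volume and normal, so keep
# their per-request retry cheap and lean on hedging for slowness instead;
# overload errors back off hard rather than piling onto a saturated endpoint.
BUDGETS = {
    FailureClass.OVERLOADED:        RetryBudget(max_attempts=2, base_backoff_s=2.0),
    FailureClass.TOOL_CALL_INVALID: RetryBudget(max_attempts=1, base_backoff_s=0.0),
    FailureClass.SCHEMA_INVALID:    RetryBudget(max_attempts=1, base_backoff_s=0.0),
    FailureClass.NETWORK:           RetryBudget(max_attempts=3, base_backoff_s=0.5),
    FailureClass.TIMEOUT:           RetryBudget(max_attempts=1, base_backoff_s=0.0),
}


def classify(status: int | None, timed_out: bool, validation_error: str | None) -> FailureClass:
    """Map a failed attempt onto a failure class before deciding whether to retry."""
    if timed_out:
        return FailureClass.TIMEOUT
    if status in (529, 503):
        return FailureClass.OVERLOADED
    if validation_error == "tool_call":
        return FailureClass.TOOL_CALL_INVALID
    if validation_error == "schema":
        return FailureClass.SCHEMA_INVALID
    return FailureClass.NETWORK
```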
Hedged Requests Are Better Than Retries For The Slow Tail
Google's "The Tail at Scale" paper made this point in 2013 and it has aged into a load-bearing pattern for any system with high tail variance: instead of retrying after a request fails, fire a duplicate request after the request takes too long, and take whichever returns first. The difference matters. Retry-after-failure waits for a confirmed problem, then makes you wait again for the second attempt. Hedged requests treat slowness itself as the signal and run the second attempt in parallel, so the user-visible latency is min(primary, hedge) instead of primary + hedge.
For LLM traffic the math is brutal in hedging's favor. A team using adaptive hedging — where the hedge fires when the primary exceeds its measured P90 — has documented dropping P99 from 64ms to 17ms with about 9% extra load on a non-LLM workload. For LLMs the absolute numbers are different (your P90 baseline is in seconds, not milliseconds) but the relative win is similar: the slowest 5–10% of requests stop being slow, and you trade some tokens for the difference.
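A sketch of adaptive hedging in asyncio, assuming a hypothetical `call_model` coroutine for the provider call and a plain list of recent latencies standing in for a real P90 tracker; error handling and retry classification are assumed to live elsewhere, this only handles the slow tail.

```python
import asyncio
import statistics
import time


async def hedged_call(call_model, prompt: str, latencies: list[float]):
    """Fire a duplicate request when the primary exceeds the observed P90,
    then return whichever attempt finishes first."""
    # Rough P90 of recent completions; fall back to a fixed delay (placeholder)
    # until there is enough history.
    hedge_after = (
        statistics.quantiles(latencies, n=10)[-1] if len(latencies) >= 10 else 8.0
    )

    start = time.monotonic()
    primary = asyncio.create_task(call_model(prompt))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)

    if done:  # primary finished before the hedge trigger; errors propagate to the caller
        latencies.append(time.monotonic() - start)
        return primary.result()

    hedge = asyncio.create_task(call_model(prompt))
    done, pending = await asyncio.wait({primary, hedge}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # cancel the loser so it stops generating (and billing) tokens
    latencies.append(time.monotonic() - start)
    return done.pop().result()
```

In practice the hedge should also consult the rate-limiting budget described below before it fires.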
There is one trap specific to LLM hedging that microservice hedging libraries do not handle. Inference servers commonly send a 200 OK response header before the model has produced a single token, then stream the body afterwards. A naive hedger that measures latency on header receipt thinks the request is fast and never hedges, even when time-to-first-token is ten seconds and the user is staring at a spinner. Measure latency on first body byte, not first header byte. That alone fixes a class of "the dashboard says we are fast but users disagree" mysteries.
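A sketch of measuring the latency that matters, assuming an httpx-style streaming client; the only point is that the clock stops at the first body chunk, not at the response headers.

```python
import time

import httpx


async def time_to_first_token(client: httpx.AsyncClient, url: str, payload: dict) -> float:
    """Return seconds until the first body byte arrives, not until headers arrive."""
    start = time.monotonic()
    async with client.stream("POST", url, json=payload) as response:
        # Headers are available here, but a fast 200 says nothing about when
        # the model will start producing tokens.
        async for _chunk in response.aiter_bytes():
            # First body byte: this is the latency a hedger should measure against.
            return time.monotonic() - start
    return time.monotonic() - start  # empty body: fall back to total duration
```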
The other discipline is hedge rate-limiting. A token bucket capped at 5–10% of base traffic prevents the worst-case scenario, which is a real provider outage where every primary call is slow, every hedge fires, and your gateway suddenly doubles the load on the already-degraded provider — turning a 30-second blip into a five-minute outage. The token bucket drains within seconds when something is genuinely wrong, hedging stops, and the system fails normally rather than catastrophically.
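A minimal hedge-budget token bucket, with the refill expressed as a fraction of the observed primary request rate; the 5% fraction and burst capacity are illustrative.

```python
import time


class HedgeBudget:
    """Token bucket that allows hedges up to a fixed fraction of base traffic."""

    def __init__(self, hedge_fraction: float = 0.05, capacity: float = 20.0):
        self.hedge_fraction = hedge_fraction  # e.g. 5% of primary requests may be hedged
        self.capacity = capacity              # burst allowance
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.primary_rate = 0.0               # primaries per second, updated by the caller

    def record_primary_rate(self, observed_rate: float) -> None:
        self.primary_rate = observed_rate

    def try_hedge(self) -> bool:
        """Spend a token if one is available; otherwise skip the hedge."""
        now = time.monotonic()
        refill = (now - self.last_refill) * self.primary_rate * self.hedge_fraction
        self.tokens = min(self.capacity, self.tokens + refill)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # during a real outage the bucket drains and hedging stops
```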
Failover Routing Beats Retry-Same-Path When The Path Is The Problem
Retrying the same provider against the same regional endpoint after a timeout is, more often than not, a worse choice than routing the second attempt elsewhere. If the primary timed out because that region is degraded, retry-same-path adds latency without changing the outcome. If the primary timed out because the model itself is slow on this prompt, retry-same-path produces the same slow generation a second time.
The pattern that consistently outperforms is hedged failover: when the primary exceeds its P95, fire the second attempt against a different region or a different provider entirely, and take whichever returns first. The cost looks higher on paper because you may pay for two LLM calls, but the marginal cost of the hedge is dwarfed by the savings on the long tail and the customer impact of the incident you avoided.
Two implementation notes that the LiteLLM-style "try provider A, then B, then C in sequence" pattern gets wrong:
First, sequential failover accumulates latency. Three providers × 10-second timeout = 30 seconds before the user sees a result, even when provider B would have answered in 2 seconds. Parallel hedged failover takes the fastest answer available, so the user sees a 2-second response from provider B regardless of where provider A was in its slow path.
Second, sequential failover often retries with the same prompt that broke the primary, which means the same prompt-injection or context-length issue triggers the same failure on the secondary. A failover that classifies the error and adapts the request — for example, switching to a longer-context model when the failure was a context-window overflow — beats one that just changes the destination URL.
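A sketch that combines both notes, assuming a hypothetical `call` coroutine that takes a request dict; the error-class strings and fallback model names are placeholders.

```python
import asyncio


def adapt_for_failover(error_class: str, request: dict) -> dict:
    """Change the request, not just the destination, based on why the primary failed.

    The error-class strings and model names here are illustrative placeholders.
    """
    adapted = dict(request)
    if error_class == "context_window_overflow":
        adapted["model"] = "long-context-fallback"   # the same prompt needs a bigger window
    else:
        adapted["provider"] = "secondary-region"     # the prompt is fine, the path is the problem
    return adapted


async def hedged_failover(call, request: dict, p95_s: float = 12.0):
    """Race the primary against a different path once the primary exceeds its P95,
    instead of waiting out sequential fallbacks one timeout at a time."""
    primary = asyncio.create_task(call(request))
    done, _ = await asyncio.wait({primary}, timeout=p95_s)
    if done:
        return primary.result()  # primary answered within its P95; no failover needed

    secondary = asyncio.create_task(call(adapt_for_failover("timeout", request)))
    done, pending = await asyncio.wait(
        {primary, secondary}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the loser stops generating (and billing) immediately
    return done.pop().result()
```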
The Retries-Enabled View Is The Only Latency Number That Matters
A subtle measurement-discipline failure haunts almost every LLM gateway dashboard. The team measures and reports P50, P95, and P99 of model latency — meaning the latency of the underlying provider call, before retries, hedges, or failover. They optimize this number. They publish it in the latency SLA. The customer-facing latency, however, is the latency after the retry policy has done its work. And those two numbers diverge dramatically once retries are in play.
Imagine a P99 model latency of 18 seconds with a 10-second timeout, a 1-second exponential backoff, and three retry attempts. The model-latency-P99 the dashboard shows is 18 seconds. The customer-experienced P99 — the worst case where the timeout fired three times — is closer to 33 seconds, because retries add their own latency on top of the call they replaced. If the retry policy itself is the variable being tuned, a dashboard that hides the retry effect is hiding the very signal the team needs.
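Spelling out that worst case with the numbers above:

```python
per_attempt_timeout_s = 10
backoff_between_attempts_s = [1, 2]  # 1-second exponential backoff: 1s, then 2s

# Customer-experienced worst case: three timed-out attempts plus the backoff waits.
worst_case_s = 3 * per_attempt_timeout_s + sum(backoff_between_attempts_s)  # 10+1+10+2+10 = 33
```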
Track latency at the user-visible boundary, with retries enabled, including the time spent in backoff. Then stack a second graph for model latency without retries, so you can see the gap. The gap is the cost of your retry policy, paid by users in waiting time and by the company in tokens. When the gap is large and stable, the retry policy is doing real work. When the gap is large and growing, the retry policy is masking a degradation in the underlying service that you should be alerting on.
Idempotency, Budget Caps, and the Failure Modes You Should Plan For
A few engineering details separate a tail-tolerant retry policy from one that explodes during incidents.
Idempotency keys make speculative hedging safe when the LLM call has side effects — sending a message, charging a card, writing to a database via a tool call. Without an idempotency key, a hedge can produce the side effect twice. With one, the second writer detects the duplicate and no-ops. This matters specifically for agent workflows where the model call is wrapped around state mutation.
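A sketch of the dedupe on the tool-execution side, with a plain dict standing in for a shared store and a hypothetical `execute_side_effect` callable; a real implementation needs an atomic set-if-absent in something like Redis or a database table.

```python
import uuid

# Stand-in for a shared store keyed by idempotency key.
_completed: dict[str, object] = {}


def run_tool_call(idempotency_key: str, execute_side_effect, *args):
    """Execute a side-effecting tool call at most once per idempotency key.

    The primary and its hedge carry the same key, so whichever lands second
    sees the recorded result and no-ops instead of charging the card twice.
    """
    if idempotency_key in _completed:
        return _completed[idempotency_key]
    result = execute_side_effect(*args)
    _completed[idempotency_key] = result
    return result


# The caller mints one key per logical user action, not per attempt.
key = str(uuid.uuid4())
```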
Budget caps prevent the runaway storm. Set a global cap — total tokens per minute spent on retries, total retry-attempt count per minute across all callers — that is small relative to base traffic, perhaps 5–15%. When the cap is exceeded, fail-fast on new retries rather than queueing them. A single failing provider region should not be allowed to consume your entire LLM budget; the cap forces the system to give up on the bad path and let the user see an error rather than a 90-second wait.
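A sketch of a global cap over a one-minute window; the attempt and token limits are illustrative, and the key behavior is returning False (fail fast) instead of queueing.

```python
import time


class GlobalRetryCap:
    """Fail-fast cap on retry volume across all callers, per one-minute window."""

    def __init__(self, max_retry_attempts_per_min: int = 300,
                 max_retry_tokens_per_min: int = 500_000):
        self.max_attempts = max_retry_attempts_per_min
        self.max_tokens = max_retry_tokens_per_min
        self.window_start = time.monotonic()
        self.attempts = 0
        self.tokens = 0

    def allow_retry(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start, self.attempts, self.tokens = now, 0, 0
        over_attempts = self.attempts + 1 > self.max_attempts
        over_tokens = self.tokens + estimated_tokens > self.max_tokens
        if over_attempts or over_tokens:
            return False  # fail fast: surface the error instead of queueing another retry
        self.attempts += 1
        self.tokens += estimated_tokens
        return True
```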
Tab-cancellation handling is the unglamorous one most teams forget. The browser tab gave up at 15 seconds. The gateway is on its second retry at 22 seconds. The provider returns successfully at 28 seconds. The gateway logs a 200, the cost dashboard shows a successful billable request, and the user sees nothing because they closed the tab thirteen seconds ago. The fix is bidirectional cancellation: when the upstream connection closes, propagate a cancel to the in-flight retry, and stop spending tokens on a response no one will read.
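A sketch of that propagation in asyncio terms, assuming the HTTP framework exposes a `client_disconnected` event for the upstream connection.

```python
import asyncio


async def handle_request(call_with_retries, prompt: str, client_disconnected: asyncio.Event):
    """Cancel the in-flight provider call (and any pending retry) the moment the
    upstream connection goes away, so no tokens are spent on a response nobody reads."""
    work = asyncio.create_task(call_with_retries(prompt))
    disconnect = asyncio.create_task(client_disconnected.wait())

    done, pending = await asyncio.wait({work, disconnect}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()

    if work in done:
        return work.result()
    return None  # the tab closed first; the gateway records a cancellation, not a 200
```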
The Retry Policy Is Part Of Your Latency SLA
The team running microservice defaults on an LLM gateway is making an architectural choice without knowing it. They are choosing to ship a P99 they did not measure, a token bill they did not budget for, and a customer experience that diverges from the dashboard by tens of seconds during the exact incidents where the dashboard matters most.
Treating the retry policy as a load-bearing piece of the latency SLA is the move. That means measuring it with retries on, sizing the budget against the cost line on the invoice, separating timeout from error, hedging the slow tail rather than waiting for it to fail, failing over in parallel rather than in sequence, and capping the storm so a provider blip stays a blip. None of these are exotic ideas. They are just the wrong defaults to inherit from a service mesh, applied to a workload the service mesh was never designed for.
The good news is that this is one of the highest-leverage areas in the stack. A two-week project to rewrite the gateway's retry layer can move P99 by 40–60% and cut incident-driven cost spikes by an order of magnitude. The bad news is that nobody on the org chart owns it, because retry policy lives in the gateway, the gateway is owned by platform, the latency number is owned by the AI team, and the cost number is owned by FinOps. That is the meta-pattern: the highest-leverage problems are the ones with no DRI, and the team that names one wins.
References
- https://github.com/bhope/hedge
- https://docs.llmgateway.io/features/routing
- https://redis.io/blog/p99-latency/
- https://www.truefoundry.com/blog/observability-in-ai-gateway
- https://dev.to/silentwatcher_95/supercharge-your-nodejs-application-with-hedge-fetch-eliminating-tail-latency-with-speculative-37d5
- https://dev.to/onurcinar/beating-tail-latency-a-guide-to-request-hedging-in-go-microservices-p81
- https://markaicode.com/fix-llm-api-timeout-errors-production/
- https://www.vellum.ai/blog/what-to-do-when-an-llm-request-fails
- https://tokenmix.ai/blog/anthropic-overloaded-error-why-workarounds-2026
- https://dev.to/debmckinney/your-litellm-failover-might-be-adding-30-seconds-of-latency-heres-why-1lm
- https://medium.com/@spacholski99/circuit-breaker-for-llm-with-retry-and-backoff-anthropic-api-example-typescript-1f99a0a0cf87
- https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/
- https://stack.convex.dev/rate-limiting
- https://particula.tech/blog/fix-slow-llm-latency-production-apps
- https://portkey.ai/blog/rate-limiting-for-llm-applications/
