The Brownout Pattern: When Your LLM Provider Is Slow but Not Down
The pager that wakes you at 3 a.m. for an outage is the easy one. The provider returned 503 for forty minutes, your fallback kicked in, your runbook fired, your post-mortem writes itself. The pager that does not wake you — the one that lets your support queue fill up over six hours while every dashboard stays green — is the brownout. The provider's API still answers. The status page still says "operational." Your p99 latency has quietly drifted from 2.1 seconds to 14 seconds, your error rate from 0.1% to 4%, and the only people who noticed are the users who already left.
Provider availability is not binary. The fallback story most teams write — "if provider is down, switch to backup" — is a state machine with two states for a continuous variable, and it does not fire when the provider is sad rather than dead. Building for brownouts is a different design problem than building for outages, and almost every production agent harness I have seen ships without solving it.
The brownout pattern matters more in 2026 than it did even a year ago. Anthropic's Claude API slipped to roughly 98% uptime over a recent ninety-day window — nearly two days of cumulative downtime per quarter, which is not the right number for a substrate carrying production workflows. OpenAI logged 294 incidents since the start of 2025, most of them partial: a region timing out while another spiked with 5xx errors, latency drifting upward across a subset of endpoints, structured-output corruption on one model variant, response coherence dropping silently because peak load got routed to a quantized model serving 200 OK with degraded quality. None of these tripped a binary outage detector, and most of them ate hours of user-perceived quality before anyone diagnosed them.
The Failure Mode That Doesn't Trip Your Fallback
The textbook fallback chain looks clean on paper: primary model, then a cheaper sibling at the same provider, then a different provider, then a self-hosted last resort. The chain assumes a binary signal — the primary either works or it doesn't. Production reality is messier. The primary works, mostly. Some calls return in 1.8 seconds; some return in 22 seconds; the failure rate sits at 3.5% instead of the 0.1% you wrote your SLO against. Your circuit breaker is configured to trip on consecutive failures, and consecutive failures never come. You run half-broken indefinitely.
The problem is that the primary signal most teams alert on — error rate — is the wrong axis when the provider is brownout-degraded. Error rate stays near baseline; latency triples; your users abandon the feature; your error-rate alert never fires. Conversely, latency-only alerts fire constantly during legitimately slow long-context requests and get muted within a week of being introduced. Neither metric in isolation is enough.
The detector that actually works is a joint signal: latency above a threshold and error rate above a threshold over a sliding window, scored together rather than independently. A p99 above three times baseline is a useful secondary trigger because it captures the asymmetric cases — provider is dropping a long tail of requests rather than failing visibly — while staying quiet during normal variance. The window length matters too: sixty seconds catches real brownouts without firing on a single bad burst. Most teams write their detector against a fixed-count threshold (10 failures in a row) because it's easier to implement, then discover six months later that brownouts never trigger it.
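A minimal sketch of such a detector in Python, assuming illustrative numbers throughout (a 2.5-second baseline p99, a sixty-second window, a 2% error threshold); nothing here is tuned for a real workload:

```python
import time
from collections import deque
from dataclasses import dataclass

@dataclass
class Sample:
    ts: float
    latency_s: float
    ok: bool

class BrownoutDetector:
    """Joint latency-and-error detector over a sliding window.

    All baselines and thresholds are illustrative assumptions.
    """
    def __init__(self, baseline_p99_s=2.5, window_s=60.0,
                 err_threshold=0.02, latency_factor=2.0,
                 tail_factor=3.0, min_samples=20):
        self.baseline_p99_s = baseline_p99_s
        self.window_s = window_s
        self.err_threshold = err_threshold
        self.latency_factor = latency_factor
        self.tail_factor = tail_factor
        self.min_samples = min_samples
        self.samples = deque()

    def record(self, latency_s, ok):
        now = time.monotonic()
        self.samples.append(Sample(now, latency_s, ok))
        # Evict samples that have aged out of the sliding window.
        while self.samples and now - self.samples[0].ts > self.window_s:
            self.samples.popleft()

    def is_brownout(self):
        if len(self.samples) < self.min_samples:
            return False  # too little evidence to score either way
        lats = sorted(s.latency_s for s in self.samples)
        p99 = lats[int(0.99 * (len(lats) - 1))]
        err = sum(1 for s in self.samples if not s.ok) / len(self.samples)
        # Joint signal: both axes elevated together over the window.
        joint = (p99 > self.latency_factor * self.baseline_p99_s
                 and err > self.err_threshold)
        # Secondary trigger: p99 far past baseline catches the case where
        # the provider drops a long tail of requests without visible errors.
        tail = p99 > self.tail_factor * self.baseline_p99_s
        return joint or tail
```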
The Graduated Fallback as a State Machine
Once you accept that provider health is continuous, the fallback policy needs more than two named states. The minimum useful set is five, and they should have explicit names so engineers and runbooks can refer to them by state rather than by the symptoms that triggered them.
The primary state is preferred provider — your default model on your default provider. The first degradation step is secondary provider, where the same model class runs at a different vendor, paying a quality and cost delta to recover latency. The second step is cheaper sibling on primary, where you stay with the original vendor but switch to a smaller, faster model — useful when the brownout is concentrated on a specific model variant rather than the whole API. The third step is cached or canned response, where you serve the most likely correct answer from a precomputed cache for high-volume queries, accepting that quality is now bounded by what was cached. The final step is feature off with apology, where you return a structured "unavailable, try again later" response that does not hallucinate, does not retry, and does not lock up a user-facing UI.
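Giving the states explicit names starts with making them a first-class type. A minimal sketch; the identifier names follow the list above, and the definition order doubles as severity order for the policy that comes next:

```python
from enum import Enum

class ProviderState(Enum):
    # Ordered best to worst; the fallback policy walks this ladder.
    PREFERRED = "preferred_provider"
    SECONDARY = "secondary_provider"
    CHEAPER_SIBLING = "cheaper_sibling_on_primary"
    CACHED = "cached_or_canned_response"
    FEATURE_OFF = "feature_off_with_apology"
```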
The state machine moves between these on signals, not hard-coded thresholds. The transitions should be asymmetric — easy to degrade, slow to recover — because flapping between states is worse than sitting in the wrong one. A common mistake is to recover the moment latency dips for a single window; you want sustained recovery (five to ten consecutive healthy windows) before returning to the higher state. This is the same logic Netflix uses for prioritized load shedding: degrade fast, recover with hysteresis.
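Continuing the sketch, here is a transition policy with that asymmetry built in. The recovery count of eight consecutive healthy windows is an illustrative pick from the five-to-ten range above:

```python
# Severity ladder: walk down to degrade, up to recover.
LADDER = list(ProviderState)

class FallbackPolicy:
    RECOVERY_WINDOWS = 8  # healthy windows required before stepping up

    def __init__(self):
        self.state = ProviderState.PREFERRED
        self.healthy_streak = 0

    def on_window(self, brownout: bool) -> ProviderState:
        idx = LADDER.index(self.state)
        if brownout:
            # Degrade fast: one unhealthy window steps down immediately.
            self.healthy_streak = 0
            if idx < len(LADDER) - 1:
                self.state = LADDER[idx + 1]
        else:
            # Recover with hysteresis: a sustained healthy run per step up,
            # so a single good window during a brownout does not flap.
            self.healthy_streak += 1
            if self.healthy_streak >= self.RECOVERY_WINDOWS and idx > 0:
                self.state = LADDER[idx - 1]
                self.healthy_streak = 0
        return self.state
```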
The Cost Frame Is Backwards
The argument against running graduated fallbacks is usually phrased as a cost argument: "we don't want to pay for the secondary provider." This frame is wrong, and it survives mainly because the costs that show up on dashboards are inference dollars while the costs that don't show up are user trust and support load.
Run the numbers honestly. A brownout that lasts ninety minutes at 4% error rate and 6× latency on a feature serving ten thousand requests per hour costs you roughly six hundred frustrated users at the upper bound and a flood of support tickets that take a day to drain. The marginal inference cost of routing a fraction of those calls through a secondary provider for ninety minutes — even at a 30% premium per call — is two or three orders of magnitude smaller than the user cost. The right comparison is not "secondary inference cost" against "primary inference cost"; it is "secondary inference cost" against "feature degraded for ninety minutes plus engineering time to debug plus reputation drag." The fallback path almost always wins, and teams that don't run the comparison end up writing better post-mortems instead.
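A back-of-envelope version of that comparison, with every input an illustrative assumption to be replaced by your own numbers:

```python
# All figures are assumptions for illustration, not benchmarks.
req_per_hour = 10_000
brownout_hours = 1.5
error_rate = 0.04

failed_requests = req_per_hour * brownout_hours * error_rate      # ~600

cost_per_call = 0.002      # assumed primary inference cost, USD
secondary_premium = 0.30   # 30% more per call at the fallback vendor
fallback_fraction = 0.5    # share of traffic rerouted during the brownout

extra_inference = (req_per_hour * brownout_hours * fallback_fraction
                   * cost_per_call * secondary_premium)           # ~$4.50

cost_per_ticket = 15.0     # assumed support labor per ticket, USD
ticket_rate = 0.25         # assumed fraction of failures that file one
user_cost = failed_requests * ticket_rate * cost_per_ticket      # ~$2,250

print(f"extra inference: ${extra_inference:.2f}, user cost: ${user_cost:.2f}")
```

Under these assumptions the rerouting premium is roughly five hundred times smaller than the support cost alone, before counting engineering time or reputation drag.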
The accounting trap that hides this: inference cost is on a single line item from a single vendor and finance tracks it monthly; the rest of your agent's COGS — vector DB load, retrieval-side embedding inference, telemetry storage, support labor — is distributed across teams and budgets. The cost-of-not-having-a-fallback shows up in those other lines, where nobody is watching closely enough to attribute it back to provider health.
The User-Facing Honesty Pattern
When the system is in a degraded state, the user needs to know — but not in a way that leaks vendor names, model versions, or implementation details. The pattern that works is acknowledging quality reduction without explaining its provenance. "We're using a faster but less detailed model right now to keep things responsive" is the kind of message that builds trust. "Anthropic returned 429s so we switched to GPT-4o" is the kind that destroys it, both because users do not care about your vendor mix and because it tells your competitors and adversaries exactly which knobs you are tweaking.
A degradation banner should be granular per feature surface, not site-wide. If the agent is in cheaper sibling state for chat but the rest of the product is healthy, only chat needs the banner. Site-wide degradation messages train users to ignore them. The banner should also include an estimated recovery time when one is available — even a rough "expected back within 15 minutes" is dramatically better than the status-page tradition of "we are investigating," which is interpreted as "we have no idea and you should not expect news soon."
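One way to wire this up is a per-feature mapping from the named state to user-facing copy. A sketch building on the ProviderState enum above; the wording, the judgment of which states warrant a banner at all, and the recovery-estimate plumbing are all assumptions:

```python
BANNERS = {
    ProviderState.PREFERRED: None,   # healthy: no banner
    ProviderState.SECONDARY: None,   # same model class: a judgment call
    ProviderState.CHEAPER_SIBLING: (
        "We're using a faster but less detailed model right now "
        "to keep things responsive."),
    ProviderState.CACHED: (
        "Responses may be less tailored than usual while we restore "
        "full service."),
    ProviderState.FEATURE_OFF: (
        "This feature is temporarily unavailable. Please try again later."),
}

def banner_for(state: ProviderState, eta_minutes=None):
    # Per-feature: call this with the state of the one degraded surface,
    # not a site-wide flag.
    msg = BANNERS[state]
    if msg and eta_minutes is not None:
        msg += f" Expected back within {eta_minutes} minutes."
    return msg
```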
The other side of the coin is what you say to engineers. Internal observability needs to expose the named state explicitly. If a customer reports a quality issue, the support engineer should be able to look up what state the feature was in when the request happened — preferred, secondary, cheaper sibling, cached, off — without reconstructing it from latency graphs. Most teams do not capture this state in their request traces, and as a result, brownout-era support tickets get routed back to model-quality investigations and the actual cause goes unidentified.
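Capturing the state is one attribute on the request span. A sketch assuming an OpenTelemetry setup; the span name, attribute key, and the `dispatch` router are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.harness")

def call_llm(policy: FallbackPolicy, prompt: str, dispatch):
    # `dispatch` is a hypothetical router from (state, prompt) to the
    # provider call that state implies.
    with tracer.start_as_current_span("llm.request") as span:
        # The named state lands in the trace, so a support engineer can
        # answer "what state was this request served in?" months later.
        span.set_attribute("fallback.state", policy.state.value)
        return dispatch(policy.state, prompt)
```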
Testing Brownouts Before They Happen
The single largest reason brownout handling is broken in production is that it is never tested. Outage testing is well-established — chaos engineering, kill the API, watch what happens. Brownout testing is rare because the failure mode is not a clean kill; it is a continuous distortion that fault-injection libraries don't ship with by default.
The eval suite for brownout handling needs synthetic brownouts as first-class fixtures: latency-injected calls (return the right answer but after 18 seconds), partial-error responses (4% of calls return a 5xx), slow-streaming chunks (response arrives but at 5 tokens per second), and silent quality degradation (return a smaller-model response while reporting it came from the primary). Each of these exercises a different part of your detector and your state machine. Running them together — a 12% error rate plus 3× latency plus 30% of responses returning lower coherence — surfaces interactions between detection axes that single-fault tests miss.
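These fixtures compose naturally as a wrapper around the provider call. A sketch, assuming an async call that returns a string; the knob names and defaults are illustrative:

```python
import asyncio
import random

class BrownoutInjector:
    """Wraps an async provider call with the four fixture knobs."""

    def __init__(self, call, extra_latency_s=0.0, error_rate=0.0,
                 tokens_per_s=None, degrade_fn=None):
        self.call = call
        self.extra_latency_s = extra_latency_s  # latency injection
        self.error_rate = error_rate            # partial 5xx errors
        self.tokens_per_s = tokens_per_s        # slow-streaming chunks
        self.degrade_fn = degrade_fn            # silent quality degradation

    async def __call__(self, prompt: str) -> str:
        if random.random() < self.error_rate:
            raise RuntimeError("injected 5xx")
        await asyncio.sleep(self.extra_latency_s)
        response = await self.call(prompt)
        if self.degrade_fn is not None:
            response = self.degrade_fn(response)
        if self.tokens_per_s:
            # Pace delivery to simulate a slow stream; a real harness
            # would yield chunks rather than sleep and return.
            await asyncio.sleep(len(response.split()) / self.tokens_per_s)
        return response
```

A compound brownout is then several knobs set at once on the same wrapper, which is exactly what exposes the cross-axis interactions.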
The test that catches the largest class of bugs is the one that injects a brownout for forty minutes during a synthetic load test and verifies that the harness correctly transitions through preferred → secondary → cheaper sibling, that the user sees the right banner at each step, that no requests hang past the timeout budget, and that the system recovers to preferred within a defined window after the brownout clears. Most teams have never run this test because they don't have the harness to inject the brownout, and so they discover their fallback policy is wrong during the actual brownout, with paying users on the line.
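A compressed version of that drill, written against the detector and policy sketches above with one-minute windows standing in for wall-clock time. The banner and timeout-budget checks from the full drill are omitted; the assertions are illustrative, not exhaustive:

```python
def test_forty_minute_brownout_drill():
    policy = FallbackPolicy()

    # Forty unhealthy windows: the policy should walk down the ladder.
    visited = [policy.on_window(brownout=True) for _ in range(40)]
    assert ProviderState.SECONDARY in visited
    assert ProviderState.CHEAPER_SIBLING in visited

    # After the brownout clears, recovery must complete within a bounded
    # number of healthy windows (hysteresis makes ascent slower than descent).
    for _ in range(5 * FallbackPolicy.RECOVERY_WINDOWS):
        state = policy.on_window(brownout=False)
    assert state == ProviderState.PREFERRED
```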
The complementary discipline is shadow-running brownout detection in production. When real provider health dips below threshold, log the state transition and the actions the harness would have taken — without yet acting on them — and review the log weekly. This finds the false positives in your detector before they cause unnecessary failovers, and it exposes brownouts your team didn't know were happening. Anthropic's quietly missed reliability target is not visible to most of its API consumers because nobody on the consumer side is running this kind of shadow detector. Run yours.
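Shadow mode is a thin layer that reuses the detector and policy sketches above but only logs. A sketch:

```python
import logging

log = logging.getLogger("brownout.shadow")

class ShadowPolicy:
    """Runs detection on live traffic but only logs would-be transitions."""

    def __init__(self, detector: BrownoutDetector, policy: FallbackPolicy):
        self.detector = detector
        self.policy = policy

    def on_window(self):
        brownout = self.detector.is_brownout()
        before = self.policy.state
        after = self.policy.on_window(brownout)
        if after != before:
            # Reviewed weekly: false positives here are failovers avoided.
            log.warning("shadow transition %s -> %s (brownout=%s)",
                        before.value, after.value, brownout)
```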
Provider Availability Is a Continuous Variable
The architectural realization underneath all of this is that the failure model your reliability story was written against — provider up, provider down — does not match the failure model your provider actually exhibits. Provider availability is a continuous variable, and any system that treats it as binary will work for the easy case and fail silently for the common case. The brownout pattern is not a refinement on top of outage handling; it is a different design problem with different signals, different state, different cost frames, and different tests.
The teams that ship reliable agent products in 2026 are the ones that have internalized this. They run joint latency-and-error detectors, named multi-state fallback policies, asymmetric transition logic, per-feature user-facing honesty, and synthetic brownout fixtures in their eval suite. They do not have nicer providers than anyone else. They have a better model of what provider failure actually looks like, and they design the harness around the reality rather than the pager-friendly fiction. The fallback story written for "provider down" is the wrong story for "provider sad," and rewriting it is the work most teams have not yet done.
Sources
- https://www.requesty.ai/blog/implementing-zero-downtime-llm-architecture-beyond-basic-fallbacks
- https://www.requesty.ai/blog/handling-llm-platform-outages-what-to-do-when-openai-anthropic-deepseek-or-others-go-down
- https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/
- https://gitplumbers.com/blog/the-circuit-breaker-that-saved-our-llm-fallbacks-guardrails-and-observability-th/
- https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94
- https://netflixtechblog.medium.com/performance-under-load-3e6fa9a60581
- https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
- https://www.pymnts.com/artificial-intelligence-2/2026/anthropic-outage-shows-digital-reliability-cracking-under-ais-weight/
- https://explore.n1n.ai/blog/circuit-breakers-llm-api-sre-reliability-patterns-2026-02-15
