
The Provider Failover That Multiplied Your Incident Surface

Tian Pan
Software Engineer

The first time your provider failover fires in production, you discover what you actually built. The gateway flips the traffic over in seconds; that part works. Then a different kind of incident starts: malformed JSON in 12% of responses, refusals on prompts that never saw a refusal before, latencies that blow through your downstream timeouts, customer-facing outputs that read like a different product. The primary comes back ninety minutes later. The "successful" failover leaves a forty-eight-hour incident review behind it.

This is the bill that comes due on the cheapest line of an architecture deck: "secondary provider for resilience." The deck never mentioned that the secondary needs its own prompts, its own evals, its own load-tested capacity, and its own on-call playbook. The deck just said you would not be down. The deck was right about that and wrong about everything else.

What teams ship as "multi-provider redundancy" is, in most cases, single-provider production with a second provider stapled on as a binary switch. The prompts were tuned against the primary's idiosyncrasies for six months: the formatting it prefers, the JSON keys it consistently emits, the instructions it interprets the way you meant them. The eval suite was run against the primary. The latency budgets, the token estimates, the cost projections — all calibrated to the primary's behavior. Then somebody added a fallback URL to a config file and called the system redundant.

The Coupling You Didn't Cost

A prompt is a contract negotiated with one specific model. The contract is invisible because it lives inside the prompt itself — phrasings the model has learned to follow, examples that anchor on outputs the model tends to produce, defensive instructions that pre-empt failure modes specific to that model's training data. Six months of "the agent kept saying X, so we added 'do not say X' to the system prompt" produces a prompt that is precisely shaped around one model's behavior.

Swap the model and the shaping no longer fits. The new model has different default failure modes; the old defensive instructions are now token tax against problems it never had, and the failure modes the new model does have are not pre-empted. Claude tends to over-explain code edits when asked to make small changes — entire-file output instead of diffs, even with instructions to the contrary. GPT-class models route through a hidden tier system where "think hard about this" can flip you into a slower, pricier reasoning model without warning. Gemini defaults to verbose multi-paragraph prose unless a standalone "Be concise." line forces it back. None of those quirks are bugs; they are personalities, and your prompts are tuned to one of them.

The cost is silent because it never shows up on the primary path. The eval suite passes. The customer feedback stays steady. The fallback path looks fine in the smoke test you wrote against it on a Tuesday afternoon a year ago. It is fine — for prompts the secondary model can interpret cleanly. The prompts that matter, the ones tuned through a hundred small fixes against the primary's specific failure shape, are quietly broken on the secondary in ways that only show up under real traffic.

What "Failover Worked" Actually Means

When the primary degrades and the gateway routes around it, three things happen in sequence. The technical failover succeeds: the secondary returns 200s, latency stays inside the SLO, no errors propagate to the client. The semantic failover fails: the outputs are subtly different — different formatting, different refusal patterns, different reasoning depth, different tool-call shapes for the agent layer that consumes them. The business outcome diverges from both: support tickets climb, completion rates drop, downstream automations break because they expected the primary's output shape.

The team watches dashboards that report "failover engaged, traffic flowing" and concludes the system held. The dashboard is correct about the layer it measures. The layer it does not measure is the only one the user feels.
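One way to put the unmeasured layer on the dashboard is to count transport success and output validity separately for each provider. The sketch below is a minimal illustration, not a gateway feature: the `FailoverHealth` class, its `record` method, and the `required_keys` contract are all hypothetical names, and "semantic success" is reduced here to "parses as a JSON object with the keys downstream code needs."

```python
import json
from collections import Counter


class FailoverHealth:
    """Tracks transport success and output validity separately, per provider."""

    def __init__(self) -> None:
        self.requests = Counter()
        self.transport_ok = Counter()
        self.semantic_ok = Counter()

    def record(
        self, provider: str, status_code: int, body: str, required_keys: set[str]
    ) -> None:
        self.requests[provider] += 1
        if status_code != 200:
            return
        self.transport_ok[provider] += 1
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            return  # a 200 with malformed JSON counts as a semantic failure
        if isinstance(payload, dict) and required_keys.issubset(payload):
            self.semantic_ok[provider] += 1

    def rates(self, provider: str) -> tuple[float, float]:
        """Return (transport success rate, semantic success rate)."""
        n = self.requests[provider]
        if n == 0:
            return (0.0, 0.0)
        return (self.transport_ok[provider] / n, self.semantic_ok[provider] / n)
```

A provider that returns 200 on every request but malformed JSON on a quarter of them would report a transport rate of 1.0 and a semantic rate of 0.75, which is the gap a "failover engaged, traffic flowing" panel hides.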

This is the failure mode that makes multi-provider strategy worse than single-provider strategy for teams that do not invest in both halves. A single-provider team takes the outage as an outage: a clean error budget burn, a clear post-mortem, a known cost. A multi-provider team that has not paid the prompt-portability tax takes the outage as a degraded run that nobody can fully characterize because the prompts produce different outputs under stress and the eval suite never tested for it.

The 2025 outages made this concrete. Providers logged dozens of incidents per month across the major labs: 47 incidents across major AI systems last December alone. Teams with no failover were down. Teams with naive failover were up, technically, while quietly shipping different outputs to customers for the duration. The second group's incidents took longer to detect, longer to diagnose, and longer to write up than the first group's.

The Eval Suite That Only Passes on Tuesdays

Almost every team's eval suite was built against the primary provider. It probably runs on a CI workflow with the primary's API key in the environment. The pass/fail thresholds were tuned to the primary's output distribution. Some cases are written in ways that implicitly assume the primary's behavior: "the response must include the phrase 'I cannot' for refusal cases" or "the JSON output must use key X" or "the answer must be under 200 tokens." Those assertions were calibrated against one model; the secondary model would fail them not because it is wrong but because it phrases refusals differently or names keys differently or runs longer by default.

The fix is provider-parity evals: the entire suite runs against every provider in the failover chain, the gate is parity not absolute score, and any prompt that fails parity gets reworked or annotated as primary-only. This is more work than it sounds. The team that wrote evals for the primary did not write them in a provider-agnostic style; they wrote them in the style of "what we know the primary does." Converting them requires going through each case and asking whether the assertion is about the task or about one model's way of doing the task. Most of them turn out to be about the model.
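The difference between an assertion about the task and an assertion about one model can be sketched in a few lines. Everything here is illustrative: the function names and the `REFUSAL_MARKERS` list are hypothetical, and substring matching stands in for whatever grader a real suite would use.

```python
import json

# Hypothetical marker list; a real suite would more likely use a classifier
# or rubric-based grader than substring matching.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm unable", "i am unable", "i won't")


def model_specific_refusal(response: str) -> bool:
    """Assertion about one model's phrasing: passes only on 'I cannot'."""
    return "I cannot" in response


def task_level_refusal(response: str) -> bool:
    """Assertion about the task: did the model decline, in any listed phrasing?"""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def task_level_json(response: str, required_keys: set[str]) -> bool:
    """Checks only the keys downstream code depends on, ignoring extras."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and required_keys.issubset(payload)
```

A response such as "I'm unable to help with that." fails `model_specific_refusal` and passes `task_level_refusal`, even though both models declined. The first check was measuring phrasing, not behavior.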

The teams that get this right run the eval suite as a matrix: every prompt × every provider, on every change to either the prompt or the eval cases. The matrix is large enough that running it is non-trivial, which is the point — the size of the matrix is the actual cost of multi-provider strategy, and pretending it is small is the mistake that produces brittle failover.
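As a rough sketch of that matrix, assume each provider can be wrapped as a function from prompt to response text and each eval case carries a task-level check. The names `run_matrix` and `parity_failures` and the stub providers are hypothetical; the point is only that the gate compares providers to each other rather than scoring the primary alone.

```python
from itertools import product
from typing import Callable

# Hypothetical types: a provider maps a prompt to response text,
# and a check is a task-level assertion on that text.
Provider = Callable[[str], str]
Check = Callable[[str], bool]


def run_matrix(
    cases: dict[str, tuple[str, Check]],
    providers: dict[str, Provider],
) -> dict[tuple[str, str], bool]:
    """Run every case against every provider; return pass/fail per cell."""
    results = {}
    for (case_id, (prompt, check)), (name, call) in product(
        cases.items(), providers.items()
    ):
        results[(case_id, name)] = check(call(prompt))
    return results


def parity_failures(
    results: dict[tuple[str, str], bool], primary: str
) -> list[tuple[str, str]]:
    """Cells where the primary passes but another provider does not."""
    return sorted(
        (case_id, name)
        for (case_id, name), passed in results.items()
        if name != primary and not passed and results[(case_id, primary)]
    )


# Stub providers standing in for real API clients.
providers = {
    "primary": lambda prompt: '{"answer": "42"}',
    "secondary": lambda prompt: '{"result": "42"}',
}
cases = {
    "answer_key": ("What is 6 x 7? Reply as JSON.", lambda r: '"answer"' in r),
}
results = run_matrix(cases, providers)
print(parity_failures(results, primary="primary"))
```

With these stubs the only parity failure is the `answer_key` case on the secondary, which renamed the key. The number of cells grows as cases times providers, and every cell is a real API call in practice, which is where the cost described above comes from.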

Continuous Failover Is the Only Failover That Works

A failover path that runs once a year, when the primary degrades, has never been tested against the traffic distribution that matters. The smoke tests were synthetic. The eval cases were curated. The real prompts — the ones users actually send, in the long-tail distribution that took six months to learn against the primary — have never hit the secondary at production volume. The first time they do is during an incident, with the on-call team trying to read dashboards and rebuild context simultaneously.

The pattern that closes this gap is shadow traffic: a small percentage of real production requests is mirrored to the secondary, asynchronously, with the outputs logged and compared. The primary still serves the user; the secondary is exercised continuously against production distributions. When the secondary's outputs drift from the primary's in ways that matter, the comparison logs flag it before an outage forces the question.
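A minimal sketch of the shadow pattern with `asyncio` follows. It assumes both providers are exposed as async functions from prompt to text; `ShadowRouter`, `sample_rate`, and the `compare` callback are hypothetical names, and an in-memory list stands in for a real comparison log.

```python
import asyncio
import random
from typing import Awaitable, Callable

Provider = Callable[[str], Awaitable[str]]


class ShadowRouter:
    """Serves from the primary and mirrors a sample of requests to the secondary."""

    def __init__(
        self,
        primary: Provider,
        secondary: Provider,
        compare: Callable[[str, str], bool],
        sample_rate: float = 0.05,
    ) -> None:
        self.primary = primary
        self.secondary = secondary
        self.compare = compare
        self.sample_rate = sample_rate
        self.mismatches: list[tuple[str, str, str]] = []
        self._tasks: set[asyncio.Task] = set()

    async def handle(self, prompt: str) -> str:
        response = await self.primary(prompt)
        if random.random() < self.sample_rate:
            # Fire and forget: the user never waits on the secondary.
            task = asyncio.create_task(self._shadow(prompt, response))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)
        return response

    async def _shadow(self, prompt: str, primary_response: str) -> None:
        try:
            shadow_response = await self.secondary(prompt)
        except Exception as exc:  # a failing secondary is itself a finding
            self.mismatches.append((prompt, primary_response, f"error: {exc!r}"))
            return
        if not self.compare(primary_response, shadow_response):
            self.mismatches.append((prompt, primary_response, shadow_response))

    async def drain(self) -> None:
        """Wait for in-flight shadow calls (useful in tests and on shutdown)."""
        await asyncio.gather(*self._tasks)
```

The `compare` callback is where the real design work sits. Exact string equality will flag nearly everything, so a practical comparison would more plausibly reuse the task-level checks from the eval suite.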

A weaker version of the same idea is the scheduled failover drill: a pre-announced window during which the gateway routes a meaningful fraction of traffic to the secondary, and the team watches for the regressions that shadow comparison would have surfaced continuously. Drills are easier to set up but harder to maintain because they require coordination; teams skip them under deadline pressure and discover the gap only when the unscheduled drill, the real outage, arrives.

The teams that ship robust failover treat the secondary not as a binary switch but as a continuously warm path. The on-call playbook for switching providers is a short document because the switching has already been happening. The cost dashboards include the secondary's spend because the secondary is always spending. The eval pipeline gates merges on parity because parity is the contract, not "score on the primary."

Multi-Provider Is a Prompt-Engineering Commitment, Doubled

The framing that produces brittle failover is treating multi-provider as an infrastructure decision: pick a gateway, configure the chain, ship. The framing that produces robust failover is treating it as a prompt-engineering commitment, doubled: every prompt you tune for the primary is a prompt you also tune for the secondary, every eval you write is an eval that runs across both, every regression you fix is a regression you fix in both prompt sets.

The cost is real, and it is the cost the architecture deck did not name. A team that runs production on two providers is doing roughly twice the prompt-engineering work, not the same work with a redundant runtime. Teams that do not pay this cost ship optionality — the ability to route around outages, technically — without shipping reliability — the property of producing equivalent outputs under failure. Optionality without parity is a second incident in a trench coat.

The decision worth making explicitly is which of these you want. Some workloads do not need parity: a low-stakes summarization endpoint can tolerate visibly different outputs during a failover because the downside is small and the alternative is being down. Some workloads need strict parity: a structured-output API that downstream systems parse cannot tolerate a key rename or a format drift. Most teams have both kinds of workloads and route them undifferentiated, which is how the high-parity workloads get the low-parity treatment by default.
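Making that decision explicit can be as simple as tagging each workload with its parity requirement and letting the failover policy read the tag. The sketch below is one hypothetical shape for it: `Parity`, `Workload`, the `parity_verified` flag, and the `"failover"` / `"fail_closed"` actions are invented for illustration and are not taken from any particular gateway.

```python
from dataclasses import dataclass
from enum import Enum


class Parity(Enum):
    STRICT = "strict"      # downstream systems parse the output
    TOLERANT = "tolerant"  # visibly different output beats being down


@dataclass(frozen=True)
class Workload:
    name: str
    parity: Parity
    # Hypothetical flag: the latest parity eval run passed on the secondary.
    parity_verified: bool = False


def on_primary_outage(workload: Workload) -> str:
    """Decide what happens to a workload's traffic when the primary is down."""
    if workload.parity is Parity.TOLERANT:
        return "failover"
    if workload.parity_verified:
        return "failover"
    # Strict workload with unverified parity: an honest error is safer
    # than output that downstream parsers may silently mishandle.
    return "fail_closed"


summaries = Workload("summarize", Parity.TOLERANT)
invoices = Workload("invoice_extract", Parity.STRICT, parity_verified=False)
print(on_primary_outage(summaries), on_primary_outage(invoices))
```

Under these assumptions the summarization endpoint fails over and the unverified structured-output endpoint fails closed, so the high-parity workload no longer inherits the low-parity treatment by default.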

The Architecture Question You Have to Answer First

Before adding a secondary provider, the question to answer is not "which secondary" but "what would parity mean for our outputs, and what would we need to invest to maintain it." If the answer is "we cannot afford that," the right move is often single-provider with a clean degradation path — explicit error messages, queued retries, a status page that tells users what is happening — rather than multi-provider with hidden output drift.

If the answer is "we will pay that cost," the work is structured: a provider-agnostic prompt format that both providers can run against (which usually means rewriting prompts to be more explicit and less reliant on model-specific cues), a parity eval suite that gates merges on both providers, continuous shadow traffic to the secondary to keep it exercised, and on-call documentation that treats provider switches as routine rather than emergency.

The deck that says "secondary provider for resilience" is the deck that has not yet had the meeting where someone asks what resilience means. Resilience is not the absence of downtime; it is the presence of equivalent behavior under failure. The team that ships the first without the second has bought a second incident channel and called it a feature.

Multi-provider strategy is one of the highest-leverage architectural decisions in a serious AI system. It is also one of the most under-priced. The price is not in dollars; it is in the parallel work the team commits to doing forever. Failover that works is a continuous practice, not a config flag. The teams that learn this from the first real outage are lucky. The teams that learn it from the post-mortem of the second one — the one the failover caused — are the rest of us.
