The Provider Failover That Multiplied Your Incident Surface
The first time your provider failover fires in production, you discover what you actually built. The gateway flips traffic over in seconds, and that part works. Then a different kind of incident starts: malformed JSON in 12% of responses, refusals on prompts that had never been refused before, latencies that blow through your downstream timeouts, customer-facing outputs that read like a different product. The primary comes back ninety minutes later. The "successful" failover leaves a forty-eight-hour incident review behind it.
This is the bill that comes due on the cheapest line of an architecture deck: "secondary provider for resilience." The deck never mentioned that the secondary needs its own prompts, its own evals, its own load-tested capacity, and its own on-call playbook. The deck just said you would not be down. The deck was right about that and wrong about everything else.
