The AI Gateway Is the SPOF Nobody Named
The pitch sounded responsible. "Let's not hardcode OpenAI everywhere — we'll put a thin abstraction in front, then we can swap providers if we need to." Two years later, that thin abstraction is a service with its own deploy pipeline, its own SRE on-call, an eval gate that blocks bad prompts, a semantic cache that saves seven figures a year, a retry policy with provider-specific backoffs, an observability schema every dashboard depends on, and a key vault holding the credentials for six model vendors. Every AI feature in the company terminates there.
It is also, almost by accident, the single point of failure with the worst blast radius in the stack. When the primary LLM provider goes down — and in 2025 outage trackers logged 294 incidents for OpenAI since January, with Anthropic accounting for 184.5 hours of total customer impact in December alone — the gateway routes around it and most users never notice. When the gateway itself dies, every AI feature in every product simultaneously stops, the failover that was supposed to fire never gets a chance, and the postmortem opens with "the abstraction layer we built to insulate us from provider outages was the outage."
The uncomfortable part is that nobody decided this. The gateway accumulated responsibility the way a kitchen junk drawer accumulates rubber bands. Each addition was the right local call — caching saves money, governance keeps legal off your back, eval gating prevents a bad prompt from shipping — but the cumulative effect is a single process whose blast radius dwarfs any one provider's, treated with a fraction of the SRE rigor that protects, say, the payments service.
What the gateway actually holds
Walk through what lives inside a mature LLM gateway and the blast radius becomes obvious. The unified API contract is the visible part — one OpenAI-compatible endpoint that translates to fifty providers behind the scenes. But that's the thinnest layer.
Underneath sits the retry and failover policy: which errors trigger a retry, which trigger a failover to a secondary provider, what the exponential backoff looks like for each error class, and which streaming errors can be transparently retried versus surfaced to the client. A 503 from the primary should hop to a backup; a 400 should not, because retrying a malformed request anywhere just burns latency. That logic, when correct, is invisible. When wrong, it amplifies upstream failures into customer-facing errors.
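To make that concrete, here is a minimal sketch of that kind of error-classification policy. The status-code sets, attempt counts, and backoff cap are illustrative assumptions, not any particular gateway's defaults, and `send` stands in for whatever function actually issues the provider call.

```python
import random
import time

# Illustrative policy: transient errors retry in place with backoff, caller errors
# surface immediately, and everything else falls through to the next provider.
RETRYABLE = {429, 500, 502, 503, 504}       # transient: retry the same provider
NON_RETRYABLE = {400, 401, 403, 404, 422}   # malformed/unauthorized: retrying anywhere just burns latency

def call_with_policy(providers, send, max_attempts=3):
    """Try providers in order; `send(provider)` returns (status_code, body)."""
    last_error = None
    for provider in providers:
        for attempt in range(max_attempts):
            status, body = send(provider)
            if status < 400:
                return body
            if status in NON_RETRYABLE:
                raise RuntimeError(f"non-retryable {status} from {provider}")
            last_error = (provider, status)
            if status in RETRYABLE and attempt < max_attempts - 1:
                time.sleep(min(2 ** attempt + random.random(), 10))  # jittered exponential backoff
            else:
                break  # retries exhausted: fail over to the next provider
    raise RuntimeError(f"all providers failed; last error {last_error}")
```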
Then there is the prompt cache — both the exact-match cache that turns a repeated prompt into a free lookup and the semantic cache that recognizes "what is AI" and "explain artificial intelligence" as the same question. For a high-traffic feature, the cost difference between a cache hit and a miss is roughly ten-to-one, so this layer is not optional once you cross any meaningful scale. Routing around it means re-paying that cost in full on every request a gateway outage dumps directly onto the providers.
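A minimal sketch of that two-tier lookup, assuming an injected `embed` function (any model that maps text to a vector) and an illustrative similarity threshold; a production semantic cache would use a vector index rather than the linear scan shown here.

```python
import hashlib
import numpy as np

class PromptCache:
    """Two-tier cache sketch: exact-match by hash, then semantic match by embedding."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed                  # text -> 1-D numpy vector
        self.threshold = threshold          # cosine similarity above this counts as a hit
        self.exact = {}                     # sha256(prompt) -> response
        self.vectors, self.responses = [], []

    def get(self, prompt):
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self.exact:               # tier 1: repeated prompt is a free lookup
            return self.exact[key]
        if self.vectors:                    # tier 2: nearest neighbour over embeddings
            q = self.embed(prompt)
            sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                    for v in self.vectors]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                return self.responses[best]
        return None                         # miss: caller pays for a real provider call

    def put(self, prompt, response):
        self.exact[hashlib.sha256(prompt.encode()).hexdigest()] = response
        self.vectors.append(self.embed(prompt))
        self.responses.append(response)
```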
The eval gate is the layer most engineers underestimate. New prompt versions, new tool definitions, new model versions all flow through a CI-style check that runs them against a corpus of cases and blocks deployment if regression thresholds trip. That gate is in the gateway because the gateway is the only place that sees every request and response — which makes it the natural enforcement point and the only easy place to compute the rollback signals.
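A stripped-down version of that gate might look like the following. The corpus, the scoring function, and the 2% regression threshold are placeholders for whatever the team actually measures; the point is only the shape — run both versions against the same cases, compare aggregate scores, block the deploy on regression.

```python
# Hypothetical eval gate: compare a candidate prompt/model version against the
# current baseline on a fixed corpus and block the deploy if quality regresses.

def eval_gate(corpus, run_baseline, run_candidate, score, max_regression=0.02):
    """corpus: list of test cases; run_*: case -> output; score: (case, output) -> float."""
    baseline = [score(case, run_baseline(case)) for case in corpus]
    candidate = [score(case, run_candidate(case)) for case in corpus]
    base_avg = sum(baseline) / len(baseline)
    cand_avg = sum(candidate) / len(candidate)
    regression = base_avg - cand_avg
    if regression > max_regression:
        raise SystemExit(
            f"eval gate BLOCKED: candidate {cand_avg:.3f} vs baseline {base_avg:.3f} "
            f"(regression {regression:.3f} exceeds allowed {max_regression})"
        )
    print(f"eval gate passed: candidate {cand_avg:.3f} vs baseline {base_avg:.3f}")
```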
The observability schema is the layer that decides what every dashboard looks like, what gets logged, what gets indexed for search, what cost attribution looks like by team and feature. If the gateway emits the wrong shape, every dashboard downstream is wrong; if the gateway stops emitting, every dashboard goes dark in a coordinated way that looks identical to "AI is broken everywhere" because, from the dashboard's perspective, it is.
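The schema itself can be as simple as one record per request. The field names below are assumptions for illustration, not a standard; the important property is that every downstream dashboard reads this exact shape, which is why changing it (or going silent) propagates everywhere at once.

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class GatewayEvent:
    """Illustrative per-request record emitted by the gateway; fields are assumptions."""
    request_id: str
    team: str                 # cost attribution by team
    feature: str              # cost attribution by feature
    provider: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    latency_ms: float
    cache_hit: bool
    status: int
    ts: float = 0.0

    def emit(self):
        self.ts = self.ts or time.time()
        # In practice this ships to the logging/metrics pipeline; printed here for brevity.
        print(json.dumps(asdict(self)))
```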
Add governance — PII scrubbing, prompt-injection guards, rate limiting by team — plus the credential vault that holds keys for every model vendor, and you have a service that is simultaneously the policy plane, the data plane, the cost plane, the eval plane, and the observability plane for every AI feature in the company.
Why a gateway outage is worse than a provider outage
Provider outages are graceful in a way most teams do not appreciate until they live through a gateway outage. When OpenAI has a degraded hour, the gateway's failover logic shifts traffic to Anthropic or to a self-hosted fallback. Latency degrades, costs blip up, a small fraction of edge cases fail, but the system mostly works. The blast radius is bounded by which routes were configured to fail over.
When the gateway itself is down, none of that machinery runs. Every AI feature loses its failover, its cache, its eval gate, its observability, and its key access simultaneously. The dashboards that would tell you what's wrong are downstream of the same outage. The retry policy that would route around the problem lives in the thing that crashed. The keys that would let an emergency hotfix bypass the gateway and call the provider directly are locked in the vault inside the gateway.
The brutal version of this scenario plays out when the gateway is healthy enough to accept connections but unhealthy enough to corrupt requests — a deploy that ships a bad observability schema, a configuration push that swaps two provider names, a memory leak that surfaces only under cache-miss load. The gateway answers 200 OK, downstream systems trust it, and bad behavior propagates everywhere before any alarm fires. A provider outage is a clean 503; a gateway brownout is silent contamination.
There is a load pattern that makes this worse. When the gateway recovers from an outage, every retrying client and every queued request hits it at once. The cache is cold, so what would normally be 80% cache hits is now 0%. Every miss goes to the providers — who are now seeing an aggregated thundering herd from every team in the company, all without the rate-limit smoothing the gateway normally applies. The recovery costs more than the outage.
Patterns that match the blast radius
The fix is not "don't build a gateway." Once you have more than a handful of AI features, the gateway pays for itself in cache hits, governance, and the ability to swap providers without touching application code. The fix is treating the gateway like the load-bearing infrastructure it is, with the operational discipline that implies.
Multi-region deployment with independent control planes. The gateway is a stateful service — it holds cache, key material, deployment state — and a single-region deployment inherits the failure mode of that region's network and control plane. Multi-region means two things: replicas of the gateway in multiple regions with health-checked routing in front, and independence of the control planes so a bad config push to region A cannot poison region B in the same minute. The cache locality problem makes this harder than it sounds, but the alternative is a regional cloud incident becoming a company-wide AI outage.
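In client terms, independence can be as simple as knowing more than one regional gateway endpoint and health-checking before use. The hostnames and `/healthz` path below are placeholders; in practice this selection usually lives in a load balancer or DNS failover rather than application code.

```python
import urllib.request

# Hypothetical client-side selection of a healthy regional gateway endpoint.
REGIONS = [
    "https://gateway.us-east-1.internal.example.com",
    "https://gateway.eu-west-1.internal.example.com",
]

def healthy_endpoint(timeout=1.0):
    """Return the first region whose health check answers 200, or fail loudly."""
    for base in REGIONS:
        try:
            with urllib.request.urlopen(f"{base}/healthz", timeout=timeout) as resp:
                if resp.status == 200:
                    return base
        except OSError:
            continue  # region unreachable: try the next one
    raise RuntimeError("no healthy gateway region")
```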
A break-glass bypass path. Every team that depends on the gateway needs a documented, periodically tested code path that calls the provider directly with credentials checked into a secret store the gateway does not own. This is not the happy path. It bypasses the cache, the eval gate, the observability schema, the governance layer — all the things the gateway is for. But when the gateway is wedged at 3am, the alternative is every AI feature staying down until the gateway team wakes up. The bypass should be ugly enough that nobody uses it casually and accessible enough that on-call can flip it in a single command. Practicing it once a quarter is the only way to know it still works.
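A sketch of what that ugly path can look like, assuming an OpenAI-style chat completions endpoint and a key exported from a secret store the gateway does not control. The environment variable names and the gating flag are illustrative; the deliberate absence of caching, eval gating, and governance is the point.

```python
import json
import os
import urllib.request

def break_glass_completion(prompt, model="gpt-4o-mini"):
    """Call the provider directly, skipping the gateway entirely. Emergency use only."""
    if os.environ.get("AI_GATEWAY_BYPASS") != "1":
        raise RuntimeError("bypass not enabled; go through the gateway")
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        }).encode(),
        headers={
            # Key comes from a secret store the gateway does not own.
            "Authorization": f"Bearer {os.environ['BREAK_GLASS_OPENAI_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    # No cache, no eval gate, no governance, no cost attribution: ugly on purpose.
    return body["choices"][0]["message"]["content"]
```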
Contract tests against the gateway boundary. The gateway has an API contract with every downstream service. That contract drifts — fields get renamed, error shapes change, a streaming protocol gets adjusted — and downstream services don't notice until production. Contract tests that pin the contract on both sides catch this. The same discipline applies to provider boundaries: each provider adapter inside the gateway should have a contract test that fails loudly when the provider changes its response shape, rather than letting the change leak into customer-facing outputs.
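In practice this can be a handful of pytest checks that pin the fields downstream code actually reads. The fixture below returns a canned response so the sketch is self-contained; in a real suite it would call the gateway (or replay a recorded response), and the pinned field names would be whatever the team's contract actually says.

```python
import pytest

# Canned stand-in for a gateway response; replace with a real or recorded call.
@pytest.fixture
def response():
    return {
        "id": "req-123",
        "model": "default",
        "content": "pong",
        "finish_reason": "stop",
        "usage": {"prompt_tokens": 1, "completion_tokens": 1},
    }

EXPECTED_FIELDS = {"id", "model", "content", "usage", "finish_reason"}

def test_response_has_pinned_fields(response):
    missing = EXPECTED_FIELDS - response.keys()
    assert not missing, f"gateway contract drift: missing fields {missing}"

def test_usage_counts_are_integers(response):
    assert isinstance(response["usage"]["prompt_tokens"], int)
    assert isinstance(response["usage"]["completion_tokens"], int)
```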
Cache warm-up and graceful recovery. Recovery from a gateway outage should not be "turn it back on." It should be "turn it back on at 10% traffic, let the cache warm, then ramp." Otherwise the recovery hammers the upstream providers with the cold-cache thundering herd. This is standard service-recovery discipline applied to a service that most teams don't yet recognize as one.
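A sketch of that ramp, with an illustrative schedule: admit 10% of traffic immediately after recovery and step up to full load over fifteen minutes, shedding the rest. The schedule and the shed behavior are policy choices, not fixed numbers.

```python
import random
import time

# (seconds since recovery, fraction of traffic admitted) -- illustrative schedule
RAMP_SCHEDULE = [(0, 0.10), (300, 0.25), (600, 0.50), (900, 1.00)]

def admit(recovered_at):
    """Admit a growing fraction of requests as the cache warms back up."""
    elapsed = time.time() - recovered_at
    fraction = next(f for t, f in reversed(RAMP_SCHEDULE) if elapsed >= t)
    return random.random() < fraction

def handle(request, recovered_at, serve, shed):
    if admit(recovered_at):
        return serve(request)   # each served request warms the cache
    return shed(request)        # queue, degrade, or ask the client to retry later
```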
Observability the gateway doesn't own. If every dashboard about AI behavior is fed by the gateway's own observability schema, the gateway holds a monopoly on visibility into its own incidents. At least one parallel observability path — sidecar metrics, a separate trace exporter, async event streams to a different sink — needs to exist so that when the gateway is sick, somebody can still see what's happening.
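One cheap version of that parallel path: each client or sidecar fires a tiny health record over UDP at a sink the gateway does not own. The host, port, and record fields below are placeholders; the design constraint is that this side channel can never block or break the request path.

```python
import json
import socket
import time

# Placeholder sink, independent of the gateway's own metrics pipeline.
SINK = ("metrics-sink.internal.example.com", 8125)

def emit_health(feature, ok, latency_ms):
    """Fire-and-forget health record so request/error rates survive a gateway outage."""
    record = {
        "feature": feature,
        "ok": ok,
        "latency_ms": round(latency_ms, 1),
        "ts": time.time(),
    }
    # UDP, no response expected: the side channel cannot slow the request down.
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(
        json.dumps(record).encode(), SINK
    )
```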
Why teams underestimate this
The honest answer is that the gateway accumulates blast radius in increments small enough that no one ever has to file a design doc titled "Let's centralize every AI failure mode." Caching is shipped because the cost spike from the holiday launch is real. The eval gate is shipped because a bad prompt got to production and somebody got paged. The observability schema is shipped because finance wants cost attribution by team. Each shipment is the right call. The cumulative weight is not visible until the day the gateway is down and somebody asks why the company can't ship around it.
A second reason is that the people who build gateways are usually platform engineers thinking about abstraction, while the people who run them in production are SREs thinking about uptime. The handoff between those two mental models is where the SRE discipline goes missing — the gateway gets the deploy pipeline of a stateless service even though it holds cache and credentials and routing state, gets the monitoring of a "thin proxy" even though it implements policy, and gets the on-call rotation of a tier-two service even though its blast radius rivals tier zero.
A third reason is that the failure mode is rare enough to underweight. Provider outages happen every week; gateway outages happen every quarter or two. The frequency lulls teams into believing the gateway is reliable, when the truth is that its failures are infrequent and catastrophic — exactly the profile that earns extra SRE attention, not less.
The architectural admission
The gateway is no longer the abstraction layer the design doc described. It is a piece of load-bearing infrastructure whose blast radius is the entire AI surface area of the product. The teams that survive this realization are the ones who name it explicitly, give it a service-tier classification commensurate with its blast radius, and apply the same multi-region, break-glass, and contract-testing discipline they apply to payments and auth.
The teams that don't will keep running it as if it were the thin proxy in the original design doc. They will keep adding responsibilities — guardrails, prompt management, agent orchestration, MCP server registration — until the gateway is the implicit operating system of the AI stack. And eventually, on a slow Tuesday afternoon, it will go down for forty minutes, and every AI feature in every product will go with it, and the postmortem will conclude that the abstraction was, all along, the system.
The mature move is to admit that's already true and operate accordingly.
