The Provider Failover That Swapped Your Safety Policy Mid-Conversation

June 2, 2026 · 11 min read

Software Engineer

A user is twelve turns into a careful conversation with your assistant about prescribing patterns for a controlled substance. The model has been measured: asking clarifying questions, citing guidance, declining to extrapolate beyond the literature. On turn thirteen, the user asks a follow-up that should land the same way the prior twelve did. Instead, they get a flat refusal: "I can't help with that." The conversation is over. They write to support, furious: they were not asking anything different, the assistant had been helping them, so what changed?

Your logs explain what changed. Halfway through turn thirteen, your primary provider returned a 503 in the middle of the stream. Your gateway, doing exactly what it was configured to do, failed over to the secondary provider for the remainder of the request. The secondary provider's refusal threshold for that class of query is calibrated more conservatively than the primary's. The user did not ask anything different — they asked the same question to a different model under the same brand, and the new model said no.

The team that built the failover priced it as an availability decision. They measured the gateway against uptime targets and showed the postmortem graph proving the SLO held. What they did not price was that they had just shipped a safety policy variance the trust-and-safety team never reviewed, never approved, and cannot detect from any dashboard they own. The Cartesian product of the providers in the failover pool and the routing decisions that choose among them is the real policy surface, and nobody owns it.

Why "interchangeable LLM endpoints" is a load-bearing fiction

The failover layer is built on a premise that, at the API level, is mostly true: providers expose roughly compatible chat-completions interfaces, accept similar system prompts, return similar streaming token sequences. You can swap them with adapter code and a router. From an availability standpoint they are interchangeable in exactly the way two AZs of the same database are interchangeable.

At the safety surface they are not interchangeable at all. Published refusal-rate benchmarks make the variance quantitative. On OR-Bench and similar over-refusal probes, Claude-family models refuse meaningfully more often than GPT-family models on the same prompts: one widely cited result puts Claude-3.5-Sonnet at a 73% refusal rate on controlled-substance queries, against a safe-completion rate in the high 90s for GPT on the same set. Gemini lands somewhere different again, with a profile closer to "get the job done" on agentic loops and a stricter posture on social-bias questions. There is no axis on which the providers' refusal surfaces line up.

This is not a bug in any of the providers. They each tuned for a defensible operating point that reflects their own threat model, their own customer mix, and their own assumptions about who else is moderating. The bug is in your gateway, which is treating those operating points as if they were properties of "frontier LLM" rather than properties of the specific contract each vendor signed. The moment your failover pool has more than one entry, the user-facing policy is the union of every entry's policy at every routing decision, and you own that union whether you modeled it or not.

It is worse than just refusal rates. The same kind of variance shows up in tone, in willingness to speculate, in how the model handles uncertainty, in how it phrases caveats, in how it interprets the same system prompt. A conversation that the user perceives as a coherent assistant is, under the hood, a relay between several different personalities whose lines you spliced together. Most days nobody notices. The day the splice falls on a sensitive turn, you have a support ticket whose root cause is not visible from any single trace.

Mid-stream is the worst place to swap

The failover paths in production gateways are not symmetric. The non-streaming case is comparatively well-behaved: a 503 comes back as a clean error, the router retries against the next provider in the pool, the user sees a slight latency penalty and an unchanged answer. The streaming case is where the seams show.

A streaming request that fails halfway through the response has already emitted some tokens to the client. The gateway now has to decide whether to discard those tokens and restart from the original prompt, or to continue the response from where the first provider stopped. Both options are bad. Discarding emitted tokens means re-rendering the UI from scratch and confusing the user. Continuing the response means constructing a fallback request that includes an assistant prefill of the partial output so the secondary provider can pick up the sentence.

That second path is where a documented class of incidents lives. LiteLLM's own issue tracker carries multiple reports of mid-stream fallback paths constructing requests with assistant-prefill blocks that the fallback target rejects, of streaming fallback configs being read by a different code path than the non-streaming one, of MidStreamFallbackError surfacing as Available Model Group Fallbacks=None even when the config plainly listed fallbacks. Each of these is a code-level bug that gets fixed; the architectural point is that mid-stream failover is intrinsically a place where two models complete the same utterance with their own respective alignment, and the result is a hybrid whose policy is the sum of both.

The pathological case is the one where the first provider has emitted a careful, qualified opening — "I want to be careful here, let me explain what we do know" — and the second provider, picking up from that prefill, decides on its own posture and emits a refusal in the same sentence. The user reads "I want to be careful here, let me explain what we do know — actually, I can't help with that request." This is not a hypothetical seam. It is what the seam looks like when both providers do exactly what they were trained to do.
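One way to keep two models from co-authoring a single sentence is to take the "discard and restart" branch deliberately. The sketch below is a minimal illustration, not any gateway's actual implementation: `ProviderError`, `stream_with_restart`, and the `("reset", "")` event are hypothetical names, and each provider is assumed to be a callable that takes the message list and yields text chunks.

```python
from typing import Callable, Iterator, List, Tuple

class ProviderError(Exception):
    """Hypothetical error a provider stream raises on upstream failure (e.g. a 503)."""

# Assumption: a provider is a callable that takes messages and yields text chunks.
Provider = Callable[[List[dict]], Iterator[str]]
Event = Tuple[str, str]  # ("token", text) or ("reset", "")

def stream_with_restart(messages: List[dict], providers: List[Provider]) -> Iterator[Event]:
    """Stream from the first provider that completes the whole response.

    On a mid-stream failure the partial output is never forwarded to the next
    provider as an assistant prefill. The client is told to discard what it
    rendered, and the next provider restarts from the original messages.
    """
    last_error = None
    for provider in providers:
        emitted = False
        try:
            for chunk in provider(messages):
                emitted = True
                yield ("token", chunk)
            return  # one provider authored the entire utterance
        except ProviderError as err:
            last_error = err
            if emitted:
                yield ("reset", "")  # client should clear the partial text
    raise RuntimeError("all providers failed") from last_error

def render(events: Iterator[Event]) -> str:
    """Toy client: apply token and reset events to produce the final text."""
    text = ""
    for kind, payload in events:
        text = "" if kind == "reset" else text + payload
    return text
```

This trades a visible re-render for the guarantee that the final utterance has exactly one author. It does not remove the policy variance between providers; it only keeps the variance from landing inside one sentence.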

What "conversation affinity" actually buys you

The first-order fix is the one most production gateways now support under names like sticky routing, session affinity, or session pinning: once a conversation lands on a provider, pin it there for the rest of the session. BytePlus's API gateway documents an explicit LLM session-affinity routing primitive. Kgateway exposes session-persistence headers. TrueFoundry offers thirty-minute provider affinity windows. LiteLLM merged routing-affinity logic for its Responses API. The vLLM semantic-router project has an open issue tracking "Conversational Routing Momentum", a low-pass filter on routing decisions to stop the gateway from switching models on complexity spikes.

Affinity does two things at once. It bounds the policy surface to one provider per conversation, which means the user's perception of "this assistant has a posture" stops being violated mid-thread. And it preserves whatever prefix-cache state the provider had built up for that session, which is a non-trivial latency win on long contexts. Both reasons are good; they are not the same reason, and the team that wired affinity for the latency win usually did not document the policy invariant they accidentally restored.

But affinity is not a full answer. It pins within a conversation; it does nothing for the first request of a new conversation, which still hits whichever provider the router favors that day. It pins on a best-effort basis; if the pinned provider is unavailable for the next turn, the gateway has to fail over anyway, and the policy variance you delayed by twelve turns now lands on turn thirteen. The user's perception of a coherent posture is preserved across most sessions and shattered across the ones that survive a real outage — which is to say, exactly the sessions where the user already had a reason to be frustrated.
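The pin-then-break behavior described above can be sketched in a few lines. This is an illustrative toy, not the API of any gateway named earlier: `StickyRouter`, `is_healthy`, and `pin_breaks` are hypothetical names, and health is assumed to be a simple per-provider boolean check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class StickyRouter:
    providers: List[str]                 # ordered by preference
    is_healthy: Callable[[str], bool]    # hypothetical health check
    pins: Dict[str, str] = field(default_factory=dict)
    # (conversation_id, old_provider, new_provider) for every broken pin
    pin_breaks: List[Tuple[str, str, str]] = field(default_factory=list)

    def route(self, conversation_id: str) -> str:
        """Return the pinned provider if it is healthy, else re-pin and record it."""
        pinned = self.pins.get(conversation_id)
        if pinned is not None and self.is_healthy(pinned):
            return pinned
        for name in self.providers:
            if self.is_healthy(name):
                if pinned is not None and name != pinned:
                    # Best-effort affinity just failed: this is the turn where
                    # the user may see a different posture.
                    self.pin_breaks.append((conversation_id, pinned, name))
                self.pins[conversation_id] = name
                return name
        raise RuntimeError("no healthy provider in pool")
```

Recording every broken pin is the design choice worth keeping even in a real router: it turns "the policy changed on turn thirteen" from something reconstructed out of a support ticket into an event that can be counted and joined against refusal logs.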

Affinity also does not help the agentic case, where a single user-facing request fans out to dozens of tool-using sub-conversations across many model calls, any subset of which can route to a different provider. The sub-agents are not sessions in the affinity sense. They are a swarm of small conversations whose policy variance compounds across the fan-out.
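A back-of-the-envelope calculation shows how quickly the variance compounds. The sketch assumes, purely for illustration, that each sub-call independently routes away from the intended provider with the same probability; real routing decisions are correlated (outages cluster), so treat the number as intuition rather than a measurement. The function name and the 2% figure are hypothetical.

```python
def p_any_off_pin(p_switch: float, sub_calls: int) -> float:
    """Probability that at least one of `sub_calls` independent model calls
    lands on a provider other than the intended one: 1 - (1 - p)^k."""
    if not 0.0 <= p_switch <= 1.0:
        raise ValueError("p_switch must be a probability")
    return 1.0 - (1.0 - p_switch) ** sub_calls

# Illustrative only: a 2% per-call switch rate across a 40-call fan-out
# gives roughly a 55% chance that some sub-call ran under a different policy.
mixed = p_any_off_pin(0.02, 40)
```

Under these assumptions, a per-call switch rate that looks negligible on an availability dashboard becomes the common case once a single request fans out to dozens of calls.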

The unified policy layer above the providers

The architectural answer is the one the safety-tooling vendors have been pricing toward: a unified policy layer that runs in your gateway, above every provider in the pool, enforcing the user-facing contract independent of which model executed the turn. The shape of this layer is converging across the industry. Bifrost, the LiteLLM-adjacent gateway, integrates AWS Bedrock Guardrails, Azure Content Safety, and Patronus AI as routable guardrail steps. Research groups have published unified moderation frameworks (Legilimens, StreamGuard, AEGIS) whose explicit motivation is that "identical prompts may execute through the OpenAI API but be filtered or rejected through Google's GenAI interface," and that the application owner needs a single policy surface that every provider's outputs flow through.

The discipline this layer enforces is the one the providers themselves cannot give you, because none of them can speak for the others: a single refusal policy, a single tone calibration, a single content boundary, applied at the gateway boundary so that whichever provider executes the turn, the user-facing behavior is your behavior. The provider becomes a generator; the policy becomes yours. The Cartesian product collapses back to one row.
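In code, the shape of that layer is small even though the classifiers behind it are not. The following is a sketch under stated assumptions: `PolicyGateway` and its three callables (`allows_prompt`, `allows_output`, `is_refusal`) are hypothetical stand-ins for whatever guardrail or moderation service the gateway actually calls, and each provider is assumed to be a plain function from prompt to text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class PolicyGateway:
    allows_prompt: Callable[[str], bool]   # your policy on inputs (hypothetical)
    allows_output: Callable[[str], bool]   # your policy on outputs (hypothetical)
    is_refusal: Callable[[str], bool]      # detects a provider-side refusal (hypothetical)
    refusal_text: str = "Sorry, this assistant can't help with that request."
    # Providers that refused a prompt our own policy allows.
    mismatches: List[str] = field(default_factory=list)

    def complete(self, prompt: str,
                 generators: List[Tuple[str, Callable[[str], str]]]) -> str:
        """Apply one policy regardless of which provider generated the text."""
        if not self.allows_prompt(prompt):
            return self.refusal_text           # refusal in our voice, by our rule
        for name, generate in generators:
            draft = generate(prompt)
            if self.is_refusal(draft):
                self.mismatches.append(name)   # provider is stricter than our contract
                continue
            if self.allows_output(draft):
                return draft
        return self.refusal_text               # no provider produced an acceptable answer
```

The `mismatches` list is the part most gateways lack: it records every time a provider's own threshold, rather than the gateway's policy, decided what the user saw.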

This is the same architectural move that mature platforms made for authentication a decade ago. Identity used to be whichever IdP a given service happened to integrate; the user's experience varied by which backend handled their session. Mature platforms pulled identity up into the gateway so that whichever backend ran, the user's identity was the gateway's identity, normalized. Safety calibration is now in the same evolutionary phase. The providers each have an identity model; the gateway has to assert the user's.

Two operational primitives make this work in practice. The first is a safety-regression eval that runs across every provider in the failover pool, with a fixed set of probes for refusal, tone, and content-boundary, before any new provider is added to the pool. If the new provider's calibration falls outside a tolerance band, it is not pool-eligible. The eval is the entry condition; adding a provider for an availability win is a safety review, not an SRE ticket. The second is per-provider observability on refusal rate, broken out by query class and session-affinity state, with alerting on rate-of-change. The day a provider's underlying model updates and its refusal rate shifts five points, you find out from the dashboard, not from a support escalation.
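Both primitives reduce to comparing per-class refusal rates. The sketch below is a minimal illustration, assuming a probe is a dict with `query_class` and `prompt` keys, that `generate` and `is_refusal` are supplied by the caller, and that the 0.05 tolerance mirrors the "five points" above; all names are hypothetical.

```python
from typing import Callable, Dict, List

Probe = Dict[str, str]  # assumed shape: {"query_class": ..., "prompt": ...}

def refusal_rates(probes: List[Probe],
                  generate: Callable[[str], str],
                  is_refusal: Callable[[str], bool]) -> Dict[str, float]:
    """Run every probe against one provider; return refusal rate per query class."""
    totals: Dict[str, int] = {}
    refused: Dict[str, int] = {}
    for probe in probes:
        cls = probe["query_class"]
        totals[cls] = totals.get(cls, 0) + 1
        if is_refusal(generate(probe["prompt"])):
            refused[cls] = refused.get(cls, 0) + 1
    return {cls: refused.get(cls, 0) / n for cls, n in totals.items()}

def pool_eligible(candidate: Dict[str, float],
                  baseline: Dict[str, float],
                  tolerance: float = 0.05) -> bool:
    """Entry condition: every baseline class must be measured for the
    candidate and fall within the tolerance band."""
    return all(
        cls in candidate and abs(candidate[cls] - rate) <= tolerance
        for cls, rate in baseline.items()
    )

def drifted_classes(previous: Dict[str, float],
                    current: Dict[str, float],
                    max_delta: float = 0.05) -> List[str]:
    """Alerting: query classes whose refusal rate moved more than max_delta
    between two measurement windows for the same provider."""
    return sorted(
        cls for cls, rate in current.items()
        if cls in previous and abs(rate - previous[cls]) > max_delta
    )
```

The same `refusal_rates` output feeds both checks: compared against the pool baseline it gates entry, and compared against the provider's own previous window it detects a silent model update.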

The architectural realization

The team that designed the failover layer as an uptime decision was reasoning from the right premise: providers do go down, users do want responses, and a gateway that absorbs single-provider outages is genuinely valuable. The premise they did not examine is that an LLM is not a stateless function. It is a tuned posture with a refusal surface and a tone calibration, and those are properties of the specific contract each provider signed. Multi-provider routing turns those into a Cartesian product, and the team that did not model the product owns its full perimeter anyway.

The path forward is the one the platform layer has been quietly converging on for the last eighteen months: pin sessions, regress-test the pool, and lift the policy contract into the gateway so it stops being a per-provider property. The providers will keep tuning their own thresholds; the user will not see those tunings as edges of your product. The failover decision will stop being a policy decision by accident, and start being what it was always supposed to be — a generator swap behind a stable contract.

The team that priced multi-provider routing as a reliability investment did not realize they were also making a trust-and-safety investment. Either they pay for the unified policy layer now, or the support tickets keep arriving every time the wrong turn lands on the wrong model — and the postmortem keeps blaming the user for noticing.

About Tian Pan
