Skip to main content

Your Fallback Path Is the Only Untested Code in Production

· 9 min read
Tian Pan
Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.

Code That Only Runs During Disasters Decays Like Code That Is Never Run

Software does not rot because it changes. It rots because everything around it changes while it stands still. Your primary prompt gets edited weekly — someone tightens the system instruction, adds a few-shot example, swaps the model version. Your output schema gains a field. Your downstream parser learns to expect that field.

The fallback path watches all of this happen and participates in none of it. Its prompt is a fossil of what the primary prompt looked like six months ago. Its output schema was last validated against a parser two refactors out of date. The cheaper model it routes to was deprecated by the provider in a release note nobody on your team reads.

None of this surfaces, because the fallback does not run. CI does not catch it, because CI exercises the paths your tests cover, and nobody writes tests for the branch they assume will never fire. Code review does not catch it, because the diff that broke the fallback was a change to the primary path — the fallback was collateral damage, silent and offscreen.

This is the same failure pattern that has haunted disaster recovery for decades. The most common backup failure is not a backup that silently stops working. It is a backup whose restore process has never been run, so the team discovers the gaps for the first time during a real incident. Surveys consistently find only about half of organizations test their DR plans yearly. An untested backup, as the saying goes, is a hypothesis, not a safety net. Your fallback path is exactly that hypothesis — and worse, it has a worse property than no fallback at all: it manufactures confidence. The architecture diagram says you are covered, so nobody load-bears the assumption until production stops cooperating.

The Outage Will Come — The Question Is Whether the Fallback Is Ready

This is not a hypothetical risk you can defer. In 2025, every major LLM provider had at least one significant disruption. Public incident trackers logged hundreds of outages across the big APIs. One provider's status data showed nearly 300 incidents in a year; another recorded over a hundred incidents in a single 90-day window, dozens of them major. One provider suffered a multi-hour outage when routing nodes hit memory limits, failed readiness checks, and cascaded into a capacity shortfall that served no one.

If you depend on a model API, your fallback path will be invoked. The only open question is whether it works when it is.

And "works" is a higher bar than teams assume. The fallback firing is not the same as the fallback succeeding. Consider what actually has to hold true at the moment a degraded path engages:

  • The cheaper or backup model still exists and accepts the request.
  • The fallback prompt still produces output that matches the schema your parser expects.
  • The cached response you serve is not stale to the point of being wrong.
  • The circuit breaker actually trips, and trips on the right signal, within seconds rather than minutes.
  • The fallback's latency and token cost do not themselves cascade into a second failure.

Every one of those is a live assumption. Every one of them drifts. And the first time you find out which ones broke is the incident itself — unless you deliberately exercise the path before then.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates