Fallback Path Atrophy: Your Graceful Degradation Stopped Working Three Months Ago
The fallback path you wrote nine months ago — the one that catches model timeouts, swaps to a cheaper provider, returns a templated message when both are down — has not actually run in production for the last twelve weeks. It was exercised once during the original launch, the integration tests still pass against it, and the runbook still references it. None of that means it works. A refactor in week six changed the shape of the upstream context object. A library bump in week nine quietly moved a config key. The code still compiles. The tests still pass because they were written against the same stale fixtures the code was. The next time your primary path 504s, your "graceful degradation" will throw a NullPointerException into a user's face, and the postmortem will note — for the third time this year — that the fallback was never re-tested after the upstream contract changed.
This is the quiet failure mode of resilience engineering in AI systems. The fallback path is the part of your application that exists specifically to be ignored. Production traffic flows around it for ninety-nine days out of a hundred. CI never exercises it because no test was ever wired to. The team that owns it forgets it exists between incidents. Then on day one hundred, when the primary model provider has a regional outage and you finally need it, the path bit-rots in front of a paying customer.
