The Vanishing Blame Problem in AI Incident Post-Mortems
When a deterministic system breaks, you find the bug. The stack trace points to a line. The diff shows the change. The fix is obvious in retrospect. An AI system does not work that way.
When an LLM-powered feature starts returning worse outputs, you are not looking for a bug. You are looking at a probability distribution that shifted, somewhere, across a stack of components that each introduce their own variance. Was it the model? A silent provider update on a Tuesday? The retrieval index that wasn't refreshed after the schema change? The system prompt someone edited to fix a different problem? The eval that stopped catching regressions three sprints ago?
The post-mortem becomes a blame auction. Everyone bids "the model changed" because it is an unfalsifiable claim that costs nothing to make.
