Model Rollback Velocity: The Seven-Hour Gap Between 'This Upgrade Is Wrong' and 'Old Model Fully Restored'
The playbook for a bad code deploy is a sub-minute revert. The playbook for a bad config push is a sub-second flag flip. The playbook for a bad model upgrade is whatever the on-call invents at 09:14, and on a typical day it takes seven hours to finish. During those seven hours the regression keeps compounding — wrong answers ship to customers, support tickets pile up, and the dashboard shows a slow gradient rather than a clean cliff back to green.
The reason the gap is seven hours is not that the team is slow. It is that "rollback" for a model upgrade is not the same primitive as "rollback" for code. It is closer to a database schema migration: partial, hysteretic, and not reversible by pressing the button you wish existed. The team that wrote its incident playbook around a button does not have the controls the actual rollback requires.
This post is about what those controls look like, why they have to be paid for in advance, and what you find out about your platform the first time you try to roll back a model under load.
