The Kill Switch With a Latency Budget Your Incident Never Met
The runbook said "disable the agent." The on-call followed it. Forty-three minutes later, when the kill switch finally propagated through the config service, the agent had already filed 1,200 incorrect tickets, called the billing API 8,000 times, and sent emails to customers who hadn't signed up for any of it. The runbook was correct. The runbook was also useless, because nobody had ever measured how long "disable the agent" actually takes when an agent is producing damage by the second.
Most AI features ship with a kill switch the same way most buildings ship with a fire extinguisher: someone signed off that it exists, nobody timed how long it takes to reach. The compliance review asks "is there a kill switch?" and the answer is yes. The incident asks "how fast does it stop the bleeding?" and the answer is whatever the underlying plumbing happens to take — a number nobody on the team has ever measured against the rate at which the feature is doing harm.
The mismatch is the whole problem. A feature whose containment time is longer than its blast time has shipped containment theater.
