The Kill Switch With a Latency Budget Your Incident Never Met

Tian Pan
Software Engineer

The runbook said "disable the agent." The on-call followed it. Forty-three minutes later, when the kill switch finally propagated through the config service, the agent had already filed 1,200 incorrect tickets, called the billing API 8,000 times, and sent emails to customers who hadn't signed up for any of it. The runbook was correct. The runbook was also useless, because nobody had ever measured how long "disable the agent" actually takes when an agent is producing damage by the second.

Most AI features ship with a kill switch the same way most buildings ship with a fire extinguisher: someone signed off that it exists, nobody timed how long it takes to reach. The compliance review asks "is there a kill switch?" and the answer is yes. The incident asks "how fast does it stop the bleeding?" and the answer is whatever the underlying plumbing happens to take — a number nobody on the team has ever measured against the rate at which the feature is doing harm.

The mismatch is the whole problem. A feature whose containment time is longer than its blast time has shipped containment theater.

Every AI Feature Has an Implicit RTO

In traditional incident response, recovery time objective (RTO) is a property you declare per service. For AI features, RTO is implied by something subtler: the rate at which a misbehaving model accumulates damage.

If your agent calls a paid third-party API at 50 requests per second and each call costs five cents, then every second it keeps running in the wrong state costs $2.50. If your agent writes to a customer-visible queue, every second adds entries you'll have to reconcile manually. If your agent sends emails, every second is reputational damage you cannot recall.

Multiply that rate by your containment latency — the wall-clock time between "we decided to stop" and "it actually stopped" — and you have the incident's minimum cost floor. Not the cost of the bug. Not the cost of the regression. The cost of the kill switch's plumbing, charged to the customer every time something goes wrong.
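The multiplication is small enough to write down. A minimal sketch, using the 50 requests per second and five cents per call from the example above; the three containment latencies and the function name are illustrative assumptions, not measurements:

```python
def containment_cost_floor(damage_per_second: float, containment_latency_s: float) -> float:
    """Minimum incident cost: damage rate multiplied by time-to-stop."""
    return damage_per_second * containment_latency_s


# Figures from the paid-API example in the text.
requests_per_second = 50
cost_per_call_usd = 0.05
damage_rate_usd = requests_per_second * cost_per_call_usd  # dollars per second

# Hypothetical containment latencies, chosen only to show the scaling.
for latency_s in (5, 60, 180):
    floor = containment_cost_floor(damage_rate_usd, latency_s)
    print(f"{latency_s:>4}s containment: ${floor:,.2f} floor")
```

The floor scales linearly with containment latency, which is why the plumbing matters as much as the bug.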

This is the number nobody measures. Teams measure model accuracy, eval scores, latency, token spend. The latency between the on-call's keypress and the agent's silence sits in the gap between shipping a feature and containing it, and it falls through the cracks of every roadmap.

How a Kill Switch Accumulates Three Minutes Without Anyone Noticing

The kill switch on paper is a config flag. The kill switch in practice is a request that has to traverse several systems before any node serving traffic believes the flag changed.

A typical path looks like this:

  • The on-call edits a config in the feature-flag dashboard. Network round trip to the control plane: ~200ms.
  • The dashboard writes to a global KV store. Replication to regional read replicas: 1–3 seconds for sub-second-replication providers, much longer for ones that rely on periodic sync.
  • The KV store fronts a CDN. CDN cache TTL: typically 30–60 seconds, sometimes higher.
  • The SDK on each application pod polls the CDN. Default polling interval: 60 seconds for most providers, configurable but rarely tuned.
  • The application code reads the flag value through the SDK and decides whether to call the model. The flag is checked once per request, but the SDK only re-fetches on its polling cadence.
  • A web client (if there is one) reads the flag through its own SDK with its own polling interval. Add another 30–60 seconds.

The worst-case cumulative latency is the sum of every hop's worst case, because each cache or polling layer can hold a stale value for its full interval before the next layer even sees the change. A team that picked a managed feature-flag SDK with a 60-second polling default has effectively shipped a kill switch with a one-minute floor. A team whose CDN sits in front of the flag service has stacked a CDN TTL on top of that floor. A team whose client refetches flags every two minutes has stacked one more layer.
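Adding up the hops makes the stacking visible. The sketch below uses the upper bounds quoted in the path above; the `Hop` type, the hop names, and the zero best-case values for caches and pollers (a change that lands just before a refresh) are assumptions for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str
    best_s: float   # luckiest case: change arrives just before a refresh
    worst_s: float  # unluckiest case: change arrives just after one


# Worst-case values taken from the typical path described in the text.
PATH = [
    Hop("dashboard to control plane", 0.2, 0.2),
    Hop("KV replication", 1.0, 3.0),
    Hop("CDN cache TTL", 0.0, 60.0),
    Hop("server SDK poll", 0.0, 60.0),
    Hop("web client SDK poll", 0.0, 60.0),
]


def propagation_bounds(path):
    """Return (best, worst) end-to-end propagation time in seconds."""
    return sum(h.best_s for h in path), sum(h.worst_s for h in path)


best, worst = propagation_bounds(PATH)
print(f"best case {best:.1f}s, worst case {worst:.1f}s")
```

With these inputs the worst case lands at roughly three minutes, and no single hop looks unreasonable on its own.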

None of this is anyone's fault. Each component made a sensible default decision for a use case that wasn't "stop the AI agent from doing damage right now." The defaults compose into a containment time the original architecture never considered.

The Blast-Time vs Containment-Time Inversion

The structural problem can be stated in one line: a feature whose containment time exceeds its blast time has no kill switch.

If the agent can complete its worst irreversible action in five seconds and the kill switch takes three minutes to propagate, then by the time the switch flips, the agent has had time to do its worst 36 times over (180 seconds divided by 5). The kill switch was theater. The damage was real.

This is a different framing than "we need to make the kill switch faster." Faster is good, but the discipline that matters is matching containment time to blast time per feature. Some features have blast times measured in milliseconds — anything that writes to an external system without an undo. Some have blast times measured in hours — anything purely advisory whose damage compounds slowly. The kill switch for each one needs a propagation budget shorter than the blast time, or the team has accepted the inversion as a design choice without saying so.

A feature flag is not a primitive. A propagation budget is. The flag is the surface; the propagation budget is the contract the surface has to meet.

Tiered Switches, Because One Size of Latency Does Not Fit Every Incident

The way out of the inversion is not "make every flag instant." That trades off cost, reliability, and complexity for a budget most features don't need. The way out is to design multiple switches at different points in the stack, each with a measured activation latency, and to know which one to reach for in which incident.

A working tiered design looks something like this:

  • Tier 1: in-process circuit breaker. A flag the application reads from local memory at every model call, refreshed via push (server-sent events, websocket) rather than poll. Activation latency: milliseconds to a few seconds. Use case: fast-burn incidents where the agent is producing irreversible damage per call.
  • Tier 2: fleet-wide config toggle. A managed feature-flag service with default polling, the kill switch most teams ship. Activation latency: tens of seconds to a few minutes, depending on the polling interval and any cache TTLs in the path. Use case: slower-burn incidents where you want the change to propagate across all environments and survive process restarts.
  • Tier 3: deployment-level disable. A code change that removes the agent's call path entirely, deployed through CI. Activation latency: minutes to tens of minutes. Use case: long-tail incidents where you have time to do it properly and want the rollback to outlive any flag state.

The discipline is to know your blast time per feature and to wire the cheapest tier whose activation latency is shorter than it. A nightly-batch agent doesn't need Tier 1. A live-customer-facing agent that writes to external systems probably needs Tier 1 and Tier 2.
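That selection rule can be written as a small lookup. This is a sketch under assumptions: the tier names, the activation latencies (loosely based on the ranges above and the ~8 minute deploy figure used later in the runbook example), and the relative cost ranks are all hypothetical placeholders for your own measured values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    activation_latency_s: float  # measured worst case, not an estimate
    relative_cost: int           # hypothetical rank: lower is cheaper to build and run


# Illustrative numbers only; replace with values from your own drills.
TIERS = [
    Tier("tier3-deploy-disable", 480.0, 1),
    Tier("tier2-fleet-flag", 60.0, 2),
    Tier("tier1-in-process-breaker", 2.0, 3),
]


def cheapest_sufficient_tier(blast_time_s: float, tiers=TIERS) -> Optional[Tier]:
    """Cheapest tier that activates faster than the feature's blast time.

    Returns None when no tier is fast enough, which is the inversion:
    containment time exceeds blast time.
    """
    candidates = [t for t in tiers if t.activation_latency_s < blast_time_s]
    return min(candidates, key=lambda t: t.relative_cost) if candidates else None


for blast_s in (3600.0, 120.0, 5.0, 0.5):
    tier = cheapest_sufficient_tier(blast_s)
    print(blast_s, "->", tier.name if tier else "no tier fast enough")
```

A `None` result is the useful one: it says the feature needs a faster switch or a slower blast rate before it ships.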

Tier 1 deserves the most thought because most teams skip it. The in-process circuit breaker is also where you put automatic triggers — action counts above a threshold, error rates above a band, cost per session above a budget — that flip the switch without waiting for a human to notice. The human's job is to verify the breaker did its job; the breaker's job is to act before the human can react.
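A minimal sketch of what such a breaker might look like, assuming a single process and two of the automatic triggers named above (action count per window and session cost). The class name, thresholds, and reason strings are hypothetical, and an error-rate trigger and the push channel that would call `trip()` are left out for brevity:

```python
import threading
import time
from collections import deque


class AgentBreaker:
    """In-process kill switch, read from local memory before every model call."""

    def __init__(self, max_actions, window_s, max_session_cost_usd, clock=time.monotonic):
        self._lock = threading.Lock()
        self._reason = None          # None means the breaker is closed (agent may act)
        self._actions = deque()      # timestamps of recent actions
        self._max_actions = max_actions
        self._window_s = window_s
        self._max_cost = max_session_cost_usd
        self._cost = 0.0
        self._clock = clock

    def trip(self, reason):
        """Manual or pushed activation, e.g. from an SSE or websocket handler."""
        with self._lock:
            if self._reason is None:
                self._reason = reason

    def allow(self):
        """Cheap local check; call before every model or tool invocation."""
        with self._lock:
            return self._reason is None

    def reason(self):
        with self._lock:
            return self._reason

    def record_action(self, cost_usd=0.0):
        """Record one action and trip automatically if a threshold is breached."""
        with self._lock:
            now = self._clock()
            self._actions.append(now)
            # Drop actions that have fallen out of the sliding window.
            while self._actions and now - self._actions[0] > self._window_s:
                self._actions.popleft()
            self._cost += cost_usd
            if self._reason is None:
                if len(self._actions) > self._max_actions:
                    self._reason = "action count above threshold"
                elif self._cost > self._max_cost:
                    self._reason = "session cost above budget"
```

The agent loop would call `allow()` before each action and `record_action()` after it, so the switch can flip on the very next call rather than on the next poll.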

Activation Latency Belongs in the Runbook as a Number

The cheapest, highest-leverage improvement you can make to a kill-switch design isn't a new switch. It's writing the number next to the existing one.

Most runbooks say things like "disable the agent via the feature flag dashboard." The next line should say "expected propagation: 45 seconds. If the agent is still producing requests after 60 seconds, escalate to Tier 3 (PR + deploy, ~8 minutes)."

The number does several things at once. It tells the on-call when to stop waiting and start escalating. It tells the incident commander what the irreducible cost of this incident is: if the agent does $1/second of damage and the kill switch takes 60 seconds, every incident has a $60 floor no matter how quickly anyone acts. It tells the postmortem reviewer what to measure: did the switch hit its budget? It tells the next team designing a similar feature what to budget against.

The number also has to be measured, not estimated. The activation drill is the only honest source of it. Pick a low-traffic window, flip the switch in a staging or canary environment, and time the wall-clock interval from keypress to last request served. Do it quarterly, because the underlying systems drift: the SDK gets upgraded, the CDN config changes, the polling interval gets tuned by someone optimizing cost. A drill that was current six months ago is folklore now.
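The timing part of a drill can be a very small harness. In this sketch, `flip_switch` and `agent_still_active` are hypothetical callables you would supply (for example, an API call to the flag service and a check against the agent's request log); the result is only as precise as the polling interval.

```python
import time


def time_activation(flip_switch, agent_still_active, timeout_s=600.0, poll_s=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Flip the switch and return seconds until the agent goes quiet.

    Returns None if the agent is still active after timeout_s, which is
    itself a drill result worth logging.
    """
    start = clock()
    flip_switch()
    while agent_still_active():
        if clock() - start > timeout_s:
            return None
        sleep(poll_s)
    return clock() - start
```

Injecting `clock` and `sleep` is a design choice that lets the harness itself be tested without waiting in real time.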

The Containment-Theater Org Pattern

Most organizations don't ship containment theater because someone designed it; they ship it because the kill switch and the feature it protects are owned by different teams.

The AI team builds the agent. The platform team owns the feature-flag service. The deployment team owns CI/CD. The on-call is whoever drew the short straw this week, often someone who has never personally activated the switch. Each team's incentive is to do their part — the agent works, the flag system is up, the deploy pipeline is green. None of them is incentivized to measure the end-to-end latency of the switch against the agent's blast rate, because that measurement is nobody's KPI and nobody's runbook.

This is why kill-switch latency tends to regress silently. Each team makes locally reasonable changes — the platform team adds a CDN to reduce flag service load, the AI team adds a retry layer to handle transient failures, the deploy team raises the canary period for safety — and each change adds latency the end-to-end test never re-runs.

The fix is structural: assign the kill-switch latency budget to a single owner, usually the team that owns the AI feature, and give them the authority to require changes from any team in the path. The platform team's polling interval is now negotiable. The CDN TTL is now negotiable. The CI canary period is now negotiable. The negotiation is against a budget the AI team derives from the feature's blast rate, and any change that violates the budget needs a sign-off.

The negotiation also produces a useful artifact: a map of every system the kill switch traverses, with measured latencies for each hop. That map is the single most useful thing an on-call can have when a kill switch fails to activate, because it tells them which hop to interrogate first.
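One way to make that map usable under pressure is to keep it as data and sort it. The sketch assumes each hop has been given its own slice of the overall budget, which is an illustrative convention rather than something the map requires; the hop names and numbers are made up.

```python
def hops_to_interrogate(hop_map):
    """Rank hops by how far their measured latency exceeds their budget.

    hop_map maps hop name -> (budget_s, measured_s). Hops within budget
    are omitted; the worst offender comes first.
    """
    over = [
        (name, measured_s - budget_s)
        for name, (budget_s, measured_s) in hop_map.items()
        if measured_s > budget_s
    ]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Hypothetical map from a drill: (budget, measured) in seconds.
drill_map = {
    "flag dashboard": (1.0, 0.2),
    "kv replication": (3.0, 2.5),
    "cdn ttl": (10.0, 60.0),
    "sdk poll": (10.0, 30.0),
}

for name, overshoot_s in hops_to_interrogate(drill_map):
    print(f"{name}: {overshoot_s:.1f}s over budget")
```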

The Drill You Don't Want to Run

The most uncomfortable test in any team's calendar is the live kill-switch drill: flip the switch in production during a low-traffic window, confirm the agent stops, then flip it back. Most teams don't run it because the risk of the drill itself causing an incident feels like enough reason to avoid it.

The risk is real. The cost of not running it is also real, and it accrues in a way the team doesn't notice until the day they need the switch and it doesn't work the way they remembered.

A defensible drill cadence is quarterly: run the drill, time the activation, log the number, compare against the previous quarter. Failures show up as drift — a kill switch that took 12 seconds last quarter and takes 47 seconds this quarter is telling you something about the platform changes that happened in between, and the time to investigate that drift is now, not during an incident.
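The quarter-over-quarter comparison is simple to automate once the numbers are logged. A sketch, using the 12 second and 47 second figures from the example above; the 30 second budget, the 25% growth tolerance, and the names are assumptions:

```python
from typing import NamedTuple


class Drill(NamedTuple):
    label: str
    activation_s: float


def check_drift(previous: Drill, latest: Drill, budget_s: float, max_growth: float = 1.25):
    """Return a list of findings; an empty list means the drill passed."""
    findings = []
    if latest.activation_s > budget_s:
        findings.append(
            f"{latest.label}: {latest.activation_s:.0f}s exceeds budget of {budget_s:.0f}s"
        )
    if latest.activation_s > previous.activation_s * max_growth:
        findings.append(
            f"{latest.label}: drifted from {previous.activation_s:.0f}s "
            f"in {previous.label} to {latest.activation_s:.0f}s"
        )
    return findings


for finding in check_drift(Drill("Q1", 12.0), Drill("Q2", 47.0), budget_s=30.0):
    print(finding)
```

Checking growth separately from the budget catches a switch that is still inside its budget but trending the wrong way.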

What the Compliance-Review Kill Switch Is Not

A kill switch that exists to pass a compliance review is a kill switch that has been designed to be checked, not used. The checklist asks if it exists, and as long as the box has something in it, the box is checked.

The kill switch that contains a real incident is something else: a piece of infrastructure with a measured activation latency, a designed-against budget, a drill that proves the latency is real, and a tier hierarchy that matches different blast profiles. It's a system, not a checkbox. It's owned by the team whose feature it protects, not the team whose service hosts it. It has a runbook with numbers in it, not aspirations.

The two kinds of kill switches look identical in a compliance document. They look very different at 3am on the worst day of your quarter. The team that confused one for the other won't find out which one they shipped until the day they need it.

The architectural takeaway is simple, even though the implementation isn't: every AI feature has an implicit RTO defined by how fast it does damage. If you didn't measure your kill switch's activation latency against that RTO, you didn't ship a kill switch — you shipped the idea of one. The incident that finds out the difference is the incident the kill switch was supposed to prevent.
