Skip to main content

The Kill Switch Nobody Wired Because the Feature Never Failed

· 10 min read
Tian Pan
Software Engineer

The launch flag worked perfectly. You shipped the AI summarizer behind it, ramped 1% to 10% to 50% to 100% over two weeks, watched the dashboards, saw nothing on fire, and at the end of the quarter the platform team's flag-hygiene bot opened a PR to delete the now-redundant gate. You approved it. The PR merged with the rest of the expired-flag cleanup, and the codebase got 200 lines lighter. Six weeks later at 2am, the provider rolls a fresh model snapshot, your summarizer starts confidently fabricating clauses into legal documents, and your on-call engineer discovers there is no fast lever to turn it off — only a deploy.

The flag did its job. The flag was the wrong artifact to keep. A launch flag answers "should this new code path be reachable?" and once everyone agrees yes, deleting it is the correct hygiene move. A kill switch answers "is the upstream model behaving today?" — and that question never expires, because the upstream model never stops changing. Cleaning them up together is the same category error as treating a smoke detector like a construction permit: the permit gets archived once the building is up, but the detector stays wired forever because the thing it watches for can still happen.

This is the failure mode hiding inside every shipped, working, "done" AI feature: it was launched as if it were a deterministic feature, gated behind the same flag pipeline you use for a new checkout button, and now it shares the checkout button's lifecycle — including the part where its disable lever evaporates as a matter of routine engineering tidiness.

A Working AI Feature Is Not a Permanent One

The mental model most teams import from non-AI software is that a feature, once stable in production, stays stable until you change it. That model holds for code you control. It does not hold for code where the most consequential dependency — the model — sits behind an API you do not own and updates on a cadence you do not see.

Industry tracking of provider stability has found that the majority of prompt-and-model combinations regress when the provider updates the underlying weights, and a large share of those regressions are not small drifts but multi-percent accuracy drops. Major providers cycle their models on roughly a 12-to-18-month deprecation timeline, with smaller snapshot updates happening far more often. From your application's perspective, none of this is observable as a deploy event. Your code did not change. Your prompt did not change. Your tests, if you ran them, would still pass on the cases you wrote two months ago. The model under the prompt is a different artifact than the one you launched with, and the failure surface of that artifact is whatever the provider decided it should be this week.

The dangerous corollary is that "we've been running this for three months with no incidents" is not evidence the feature is durable. It is evidence that the underlying model has not yet drifted in a direction that hits your specific surface. Treating that absence as confirmation is the same logic as treating a calm year as proof you do not need flood insurance.

The Launch Flag and the Kill Switch Are Different Artifacts

Most feature-flag systems do not distinguish between flags by purpose, which is precisely why they get cleaned up together. A flag that gates a UI experiment, a flag that ramps a backend rollout, and a flag that disables a runtime hazard all look identical in the dashboard. They all have a name, a percentage rollout, and an audit log. The hygiene bot that opens "delete expired flag" PRs cannot tell them apart, and neither, often, can the engineer reviewing the PR.

A useful discipline is to treat these as two separate categories of configuration with different lifecycles and ownership:

  • Launch flags are temporary scaffolding. They exist to support a rollout, accumulate technical debt as soon as the rollout is complete, and should be removed once the feature reaches 100% with no rollback. Their lifecycle is measured in weeks.
  • Kill switches are permanent runtime configuration. They exist to give an operator a sub-minute lever over a feature whose external dependencies cannot be trusted to stay stable. Their lifecycle is the lifecycle of the feature itself. Removing one is a deliberate decision, not a cleanup pass.

The naming matters here. If the only artifact you ship is enable_ai_summarizer_rollout, the bot will eventually decide that name has outlived its purpose. If you ship two artifacts — enable_ai_summarizer_rollout for the launch and ai_summarizer_disable (or ai_summarizer_mode, with values like live, fallback, off) for the durable lever — the bot can clean up the first one and leave the second alone. Some teams encode this in the flag system itself with a "permanent" tag that exempts the flag from expiration sweeps; some teams put kill switches in a different config store entirely, away from the feature-flag UI, on the theory that mixing them invites the same accident.

The point is not which mechanism you pick. The point is that the kill switch needs to be a named, durable, separately-owned thing — not a temporary side effect of the rollout flag.

A Kill Switch Without a Decided Fallback Is Half a Lever

A binary on-off kill switch is better than nothing, but for most AI features it is not the right granularity. "The summarizer is broken — turn it off" leaves the on-call engineer in an awkward spot: the surface that used to summarize now shows nothing, or a generic error, or the unsummarized raw text, depending on what the code happened to do in the absence of a model response. The blast radius of "off" was never designed for, because the team launched the feature assuming on was permanent.

A kill switch carries more weight when its operator does not have to invent a fallback at 2am. Decide the fallback chain at design time, and encode each step as a position the kill switch can be moved to. A common shape:

  • Live — normal model and prompt.
  • Degrade — route to a smaller, cheaper, more conservative model that has a longer track record of stability. Lower ceiling, lower floor.
  • Abstain — keep the feature visible but widen the conditions under which it declines to answer. The summarizer returns "summary unavailable for this document" instead of risking a confident hallucination.
  • Static — replace the dynamic output with a pre-written deterministic response, an empty state, or the unprocessed underlying content (the raw document text, the original email, etc.).
  • Off — remove the surface entirely. Hide the summarizer card, skip the AI step in the pipeline.

The choice of which level to ship matters less than the fact that all of them are pre-decided. The on-call engineer's job during an incident is to pick the right level for the symptom — confident hallucinations probably warrant Abstain or Static; provider 5xx storms probably warrant Degrade — not to design the fallback while users are being hit. The fallback is part of the feature. A surface that does not have one was not actually finished.

This also forces a design conversation that gets quietly skipped during the original launch. "What does this product look like when the model is wrong?" is a hard question, and the easy answer — "the model is good enough now, we'll figure that out later" — is what creates the 2am scenario in the first place. Wiring the kill switch to a real set of fallback positions makes the team answer the question while they have time to think.

The Runbook Has to Name the Lever, Not the Dashboard

Most incident runbooks for AI features are written from the perspective of detection. They tell the on-call engineer which dashboard to open, which metric to watch, which Slack channel will already be discussing the issue. They are good at confirming a problem exists. They are usually bad at telling the operator how to stop it.

A runbook entry for an AI feature should answer three questions in the first paragraph:

  1. What is the lever? Name it exactly — the flag key, the config path, the admin URL, the CLI command. Not "the kill switch in LaunchDarkly," but the specific identifier. An engineer who has never touched this feature should be able to find and operate it without grepping the codebase.
  2. What are the positions? List the fallback levels and what user-facing behavior each produces. The operator should not have to guess what "abstain" does to the UI.
  3. Who is allowed to pull it? Either give on-call broad authority and trust them, or pre-authorize specific positions (degrade is fine without escalation; off requires a director). The implicit policy is the worst-case one: "I wasn't sure if I was allowed to, so I waited for someone to wake up."

The dashboards are still in the runbook, but they live below the levers. Detection is supporting evidence for an action; the action is the headline. Runbooks written this way also surface their own gaps: if you cannot write down what the lever is, the lever does not exist yet, and that is the actionable finding from the runbook exercise.

A useful adjacent practice is to test the kill switch the way you test backups. Pick a quiet hour, flip the switch in a staging environment that mirrors production traffic shape, watch the fallback render, flip it back. Once a quarter is enough. The thing you want to discover is whether the lever still works — whether the config key was renamed in a refactor, whether the fallback code path still compiles, whether the static response still makes sense given the current UI. A kill switch that has not been pulled in a year is a kill switch you do not actually have.

Treating the Provider as a Live Dependency Changes the Discipline

Underneath the specific mechanics of flags and runbooks is a posture shift the team has to adopt: the model provider is not a vendor relationship in the SaaS sense, where uptime and behavior are governed by a contract you can sue over. It is a live, mutating dependency that you have integrated into the request path, and the operational discipline that fits it is closer to the discipline of consuming a third-party data feed that can change its schema without warning. You would not consume a financial market feed without circuit breakers; the model deserves the same posture.

This reframes a lot of decisions that look like over-engineering otherwise. A second provider on standby is not paranoia, it is a fallback position. An abstention threshold that errs toward declining to answer is not a regression in capability, it is a calibration knob that an on-call engineer can tune downward when the model starts misbehaving. Logging the model version and snapshot identifier on every request is not bloat, it is the only way you will be able to correlate a behavioral incident with the upstream change that caused it. The kill switch is the most visible piece of this posture, but the posture is the bigger commitment.

The forward-looking version of this discipline is that every AI feature ships with three things, not one: the feature itself, the metric that tells you it is misbehaving, and the lever that lets you do something about it within seconds. If the third thing was not part of the launch checklist, the feature is not really launched — it is just running. The next time the provider quietly rolls a snapshot, those two states will feel very different at 2am.

References:Let's stay in touch and Follow me for more thoughts and updates