Rolling back an LLM upgrade isn't a button press — it's a partial, hysteretic operation closer to a database migration. Here's the control plane your incident playbook needs before the next bad model rolls out.
Routing 60% of LLM traffic to a cheaper model bends the cost graph — and silently splits your AI feature into two products. The aggregate accuracy metric averages over the segment that gets hurt, two failure modes show up as one bug report, and customers experience two assistants with no release notes.
Your English eval suite cost $40K. The seven-locale international launch will not cost $280K — the real curve is closer to N×L^1.3 because cross-locale comparison is a meta-eval that doesn't decompose.
Shared on-call rotations break the moment one of those services is an LLM-backed feature. Here is the literacy prerequisite, dashboard hygiene, and shadow-period playbook that keeps the AI team out of bed at 2am.
Shipping one on-device model to every user means you're either burning battery on flagships or shipping a degraded product on the long tail. The discipline that fixes it looks more like a CDN than a model registry.
Tools that return unbounded lists turn agents into the SELECT * antipattern of the function-calling era. Pagination is a load-shedding primitive — make it a convention in your tool catalog, not a per-tool decision.
Vector stores ship without the migration tooling Postgres has had for two decades — no ALTER TABLE, no online schema change, no per-row version. The discipline that makes embedding upgrades survivable starts with a single column most teams forget to add.
Prompt caching's discount is real until one tenant's launch evicts everyone else's prefixes. The shared inference cache is a tenant-coupling surface, and the bill lands weeks after the incident.
A four-word edit to a system prompt can break parsers, judges, and chained agents that pinned the old wording. Prompts are APIs with silent consumers — and the discipline that keeps them stable looks a lot like REST endpoint deprecation.
Behavioral evals catch what your model says; they cannot catch what your prompt is. A prompt linter — fast, deterministic, structural — closes the gap that ships eval-green and surfaces as a 11pm production incident.
When a system prompt grows past 2K tokens, position bias makes a moved instruction as load-bearing as a rewritten one — and line-based diffs hide it. How three teams silently overwrite each other's intent, and the section-ownership and eval discipline that catches it.
The 'representative customer' you pasted into your few-shot prompt six months ago is still in production — re-identifiable, re-shipped, and invisible to DLP.