
The First 100 Tickets After You Launch an AI Feature

Tian Pan · Software Engineer · 12 min read

The bug count after an AI launch is not a quality problem. It is a discovery sequence — a sequence so predictable that you can sketch it on a whiteboard before the launch announcement goes out, week by week, ticket by ticket, and be embarrassingly close to right by the time the dashboards catch up. Every team that ships an AI feature runs this sequence. The only choice is whether you run it with a runbook or with a series of unscheduled all-hands.

I have watched enough launches now to believe the sequence is not really about engineering quality. It is about an information gap. Pre-launch, the team has a synthetic traffic mix, a curated eval set, a happy-path demo, and a board deck. Post-launch, real users arrive with intents the synthetic traffic never modeled, a marketing team that runs campaigns engineering hears about secondhand, a model provider that ships changes the team did not authorize, and a privacy reviewer who was on vacation when the feature shipped. The sequence below is the friction that happens when those two worlds collide.

If you are about to launch, the value of mapping the sequence in advance is not that it makes the tickets stop. It is that the tickets stop being surprises, which is the difference between an on-call rotation that sleeps and one that does not.

Weeks 1–2: The Cost Spike and the Slack Screenshot

Week one is almost always cost. Not because the model is broken, but because someone in marketing scheduled a campaign that nobody told engineering about, and your synthetic-traffic cost model is now staring at a 4× overage by Wednesday. The pattern is consistent enough that you can write the postmortem in advance: most cost spikes trace to one of six causes — a runaway agent loop, a prompt regression that ballooned token counts, an unintended model upgrade, a retry storm, a noisy tenant, or a leaked API key. None of those are new failure modes. What is new is the price tag on each one.

The fix that actually works in week one is not "add a budget alert." Budget alerts fire monthly, after the bill exists. The fix is per-cohort burn-rate alarms — token cost projected forward like an AWS cap-burn alert, scoped per intent class, per tenant, per prompt version. If your cost dashboard shows different numbers than your finance dashboard, the conversation about why the launch is over budget will fall apart before it starts. Reconcile those two views before you ship, not after.
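As a sketch of what that looks like in practice, the core is a sliding window projected forward. Everything here is illustrative, not a real API: the `record_usage` hook, the cohort keys, and the budget table are all assumptions you would adapt to your own stack.

```python
# A minimal sketch of a per-cohort burn-rate alarm. All names here
# (record_usage, BUDGETS, the cohort key shape) are hypothetical.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600          # project forward from the last hour of spend
HOURS_PER_MONTH = 24 * 30

# Hypothetical monthly budgets per (tenant, intent_class, prompt_version).
BUDGETS = {("acme", "summarize", "v3"): 500.0}   # dollars

_events = defaultdict(deque)   # cohort -> deque of (timestamp, cost_usd)

def record_usage(tenant: str, intent: str, prompt_version: str,
                 cost_usd: float) -> None:
    """Record one request's token cost and alarm if the projected
    monthly spend for this cohort exceeds its budget."""
    key = (tenant, intent, prompt_version)
    now = time.time()
    window = _events[key]
    window.append((now, cost_usd))
    # Drop events that have fallen out of the sliding window.
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()
    hourly_burn = sum(cost for _, cost in window)
    projected_monthly = hourly_burn * HOURS_PER_MONTH
    budget = BUDGETS.get(key)
    if budget is not None and projected_monthly > budget:
        # In production this would page, not print.
        print(f"BURN ALARM {key}: projected ${projected_monthly:,.0f}/mo "
              f"against a ${budget:,.0f} budget")
```

The point of keying on the full cohort tuple is that a 4× overage concentrated in one tenant or one prompt version pages a different owner than a uniform 4× overage across the board.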

Week two is the screenshot. A customer takes a screenshot of the AI saying something weird, posts it in a public Slack, and an hour later it is in a P1 retro. The content of the screenshot is rarely the lesson. The lesson is that nobody on the team knew the prompt the model was answering, the version of the model that produced the response, the retrieval context that fed it, or the user's prior turns. Without that, you cannot reproduce, cannot triage, cannot tell whether the issue is one user or a thousand. Logging that captures the full prompt-context-version triple is unglamorous infrastructure that pays for itself the first week. Skip it and you will spend week three rebuilding it under pressure.
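A minimal sketch of that record, with illustrative field names, is just one structured log line per request:

```python
# A minimal sketch of the per-request record that makes screenshots
# reproducible. Field names are illustrative; adapt to your logger.
import json
import time
import uuid
from dataclasses import asdict, dataclass, field

@dataclass
class AIRequestLog:
    user_id: str
    prompt: str                    # the exact rendered prompt, not the template
    prompt_version: str            # which template version produced it
    model: str                     # pinned model identifier
    retrieval_context: list[str]   # documents fed into the context window
    prior_turns: list[dict]        # the conversation before this turn
    response: str = ""
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: float = field(default_factory=time.time)

def emit(log: AIRequestLog) -> None:
    # One JSON line per request; ship it to whatever log pipeline you use.
    print(json.dumps(asdict(log)))
```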

Weeks 3–4: Eval Drift and the Privacy Review

By week three, your eval set starts disagreeing with your users. The benchmark numbers look fine. The user reports do not. This is not a bug in either signal — it is the gap between offline evaluation, which assumes fixed prompts, fixed labels, and a fixed notion of correctness, and the production system, which violates all three the moment users arrive. Even when benchmark accuracy stays stable, real-world performance can drift because the benchmark no longer reflects how the system is being used. Practitioners describe the late-stage version of this bluntly: an accuracy drop is a symptom, not a leading indicator, because by the time the metric moves, drift has already escaped the semantic layer.

The way out is not "run more evals." It is to treat evals as a continuous system rather than a release gate. Sample real production traffic into an eval queue weekly, label a slice, compare against your synthetic eval, and treat divergence between the two as its own metric. When the curated eval and the production-sampled eval start to disagree, your eval is rotting and the production-sampled set is the truth. Plan for the eval set to need a rewrite by month three. Plan for it explicitly, not as cleanup.
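A minimal sketch of that divergence metric, assuming you already have a grading function and a weekly production sample (`score_fn`, `curated_set`, and `production_sample` are stand-ins for your own eval harness):

```python
# A minimal sketch of treating curated-vs-production eval divergence
# as its own metric. The threshold is an assumed placeholder.
from statistics import mean

DIVERGENCE_THRESHOLD = 0.10   # alert when the two evals disagree by >10 points

def eval_divergence(score_fn, curated_set, production_sample) -> float:
    """Run the same grader over both sets and return the score gap."""
    curated_score = mean(score_fn(example) for example in curated_set)
    production_score = mean(score_fn(example) for example in production_sample)
    return curated_score - production_score

def weekly_drift_check(score_fn, curated_set, production_sample) -> None:
    gap = eval_divergence(score_fn, curated_set, production_sample)
    if abs(gap) > DIVERGENCE_THRESHOLD:
        # The production-sampled set is the truth; the curated set is rotting.
        print(f"EVAL DRIFT: curated and production evals diverge by {gap:+.2f}")
```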

Week four is the privacy review. A tenant's prompt context shows up in another tenant's support reply, or an internal field marked private surfaces in a model output, or an autocomplete suggestion includes a snippet that no anonymous user should have seen. The cause is almost never "the model leaked it." It is shared infrastructure — a cache keyed on something other than the tenant ID, a memory layer with looser scoping than the API, a vector store whose access controls do not match the application's. This is the AI-era equivalent of IDOR vulnerabilities, and it does not show up in pre-launch testing because pre-launch testing rarely involves two tenants exercising the same code path concurrently. Real cross-session leak incidents in production AI systems usually trace to misconfigured caches, shared memory, or improperly scoped session context — exactly the boundaries that pre-launch tests do not exercise.

A cross-session privacy audit before launch — running synthetic Tenant A and Tenant B traffic through every shared component and asserting strict isolation at each layer — is the cheapest test you can write and the one most teams skip.
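A minimal sketch of that audit, assuming `write_as` and `read_as` hooks into whichever shared component you are exercising (cache, memory layer, vector store); the assertion is the whole point:

```python
# A minimal sketch of the two-tenant isolation audit. write_as and
# read_as are assumed adapters around the shared component under test.
def audit_isolation(write_as, read_as, shared_keys: list[str]) -> None:
    """Write a secret as tenant A, then assert tenant B can never read it
    back through any shared component."""
    secret = "TENANT_A_PRIVATE_MARKER"
    for key in shared_keys:
        write_as("tenant-a", key, secret)     # seed tenant A's data
        leaked = read_as("tenant-b", key)     # attempt the read as tenant B
        assert secret not in (leaked or ""), (
            f"ISOLATION FAILURE: tenant-b read tenant-a data via {key!r}"
        )
```

Run it against every shared component, not just the obvious ones; the cache keyed on something other than tenant ID is exactly the component nobody thinks to include.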

Weeks 5–6: The Latency Tail and the Silent Refusal Shift

Week five is when the heavyweight retrieval path you optimized for the demo becomes the user's modal flow. The path was supposed to be the fallback. Real users discovered it is the only thing that answers their actual questions, so the long-tail latency you were tracking as P99 is now P50. The existing monitoring assumed the previous distribution. The dashboards say everything is healthy. The users say the product is slow.

The architectural lesson is that your latency budget is not a single number. It is a per-intent budget, because users do not run the average case — they run their case, repeatedly, and your tail latency is somebody's median. Tag every request with the retrieval and reasoning path it actually took, then track P50/P95/P99 per path. When a fallback path's traffic share crosses ten percent, treat it as a primary path and re-budget.
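A minimal sketch of that bookkeeping follows. The percentile math is deliberately simplified; a real system would use histograms in a metrics backend rather than raw lists in memory.

```python
# A minimal sketch of per-path latency budgets with a traffic-share flag.
from collections import defaultdict

_latencies = defaultdict(list)   # path -> observed latencies (seconds)
_counts = defaultdict(int)

def record(path: str, latency_s: float) -> None:
    _latencies[path].append(latency_s)
    _counts[path] += 1

def percentile(values: list[float], p: float) -> float:
    ordered = sorted(values)
    return ordered[min(int(p * len(ordered)), len(ordered) - 1)]

def report() -> None:
    total = sum(_counts.values())
    for path, values in _latencies.items():
        share = _counts[path] / total
        p50, p95, p99 = (percentile(values, p) for p in (0.50, 0.95, 0.99))
        flag = "  <- re-budget as primary" if share > 0.10 else ""
        print(f"{path}: share={share:.0%} "
              f"p50={p50:.2f}s p95={p95:.2f}s p99={p99:.2f}s{flag}")
```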

Week six is the refusal shift. The model provider ships what they describe as a minor update. The benchmark numbers all look stable. Internal monitoring is green. But your refusal rate on a specific category of legitimate user request quietly doubles overnight, and customer support starts pinging you about users who say the AI "got worse." Anthropic's own April 2026 disclosure showed exactly this pattern — quality degradation concentrated in long-context coding, multi-step reasoning, and iterative problem-solving, not in the headline benchmarks. By the time the model provider acknowledges the incident, you have already shipped the workaround.

The defense is unglamorous: pin the model version, run a regression suite against the new version before adopting, and treat any provider-initiated change as a deployment that needs the same scrutiny as your own. "Auto-upgrade to latest" is a setting that looks responsible at the dashboard level and behaves like an unreviewed dependency bump in practice.
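A minimal sketch of that gate, with an assumed `run_regression_suite` standing in for your own harness and placeholder thresholds:

```python
# A minimal sketch of treating a provider model bump as a deployment.
# The model identifier and thresholds are illustrative placeholders.
PINNED_MODEL = "provider-model-2026-01-15"   # explicit version, never "latest"
REFUSAL_RATE_CEILING = 0.05

def adopt_if_safe(candidate_model: str, run_regression_suite) -> str:
    """Run the full regression suite against the candidate before
    repointing production. Returns the model production should use."""
    results = run_regression_suite(candidate_model)
    if results["accuracy"] < results["baseline_accuracy"]:
        return PINNED_MODEL        # regression on quality: stay pinned
    if results["refusal_rate"] > REFUSAL_RATE_CEILING:
        return PINNED_MODEL        # silent refusal shift: stay pinned
    return candidate_model         # passed the gate; repin deliberately
```

Note that the refusal-rate check is the one the headline benchmarks will not catch for you, which is the entire lesson of week six.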

Weeks 7–8: The Runbook and the Feature Flag That Does Not Exist

Week seven, the on-call rotation realizes the runbooks they inherited do not cover any of this. The runbooks tell them what to do when a service returns 500s, not what to do when it returns 200s with the wrong words inside. The shape of an AI incident is different — the signal is qualitative, the rollback target is sometimes a prompt and sometimes a model and sometimes a piece of retrieval logic, and the kill switch lives at a different layer than the deploy.

The runbook gap is fillable, but it has to be filled with AI-specific shape. A useful incident response runbook for an AI feature includes detection (acknowledge the alert, scope the affected endpoints, regions, and tenants), activation (pick the least-broad kill switch that likely mitigates — record actor, incident ID, reason, expected expiry), and verification (synthetic transactions and real-user metrics; if error rate falls to baseline, mark stabilized; if not in 5–10 minutes, escalate to broader rollback). The kill switch needs to be a real product surface, not a config file the on-call has to learn to edit at 2am.
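A minimal sketch of the activation record, with illustrative names; the discipline of recording actor, incident, reason, and expiry is the point, not the data structure:

```python
# A minimal sketch of a kill-switch activation record matching the
# activation step above. Persistence and the flag flip are stubbed out.
import time
from dataclasses import dataclass

@dataclass
class KillSwitchActivation:
    scope: str           # the least-broad scope that likely mitigates
    actor: str           # who pressed it
    incident_id: str
    reason: str
    expires_at: float    # expected expiry; revisit before it lapses

def activate(scope: str, actor: str, incident_id: str, reason: str,
             ttl_minutes: int = 60) -> KillSwitchActivation:
    record = KillSwitchActivation(
        scope=scope, actor=actor, incident_id=incident_id, reason=reason,
        expires_at=time.time() + ttl_minutes * 60,
    )
    # Persist the record and flip the flag in your flag service here.
    return record
```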

Week eight is the feature flag that does not exist. The PM asks for the ability to disable the feature for a specific cohort while keeping it on for everyone else, or to roll back the prompt without redeploying code, or to A/B test two versions against each other. If rolling back a prompt change takes more than fifteen minutes, the system is not production-ready. Mature teams report rollbacks under sixty seconds using environment pointers or feature flags that decouple the prompt version from the deployment. The flag has to be designed in, not retrofitted under pressure during an incident.
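A minimal sketch of that decoupling, using an in-process pointer as a stand-in for a real flag service:

```python
# A minimal sketch of an environment pointer that decouples the prompt
# version from the code deploy. PROMPTS and the pointer dict are
# illustrative; a real system would back this with a flag service.
PROMPTS = {
    "v2": "You are a support assistant. Answer concisely.",
    "v3": "You are a support assistant. Cite the source document.",
}

_pointer = {"production": "v3"}   # mutable pointer, not baked into the deploy

def active_prompt(env: str = "production") -> str:
    return PROMPTS[_pointer[env]]

def rollback(env: str, version: str) -> None:
    """Repoint the environment at an older prompt. No redeploy required,
    so the rollback takes seconds instead of minutes."""
    assert version in PROMPTS, f"unknown prompt version {version!r}"
    _pointer[env] = version
```

The same pointer mechanism gives you cohort-scoped disables and prompt A/B tests almost for free, which is why it is worth building before the PM asks.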

What the Launch Playbook Should Pre-Stage

The pattern across all eight weeks is that the answers exist; they are just usually built reactively. The launch playbook that pre-stages them looks like this.

A seeded triage taxonomy. Before launch, write down the categories you expect to see — hallucination, refusal-too-strict, refusal-too-loose, latency, cost, privacy, prompt-injection, retrieval-miss — and pre-route each to an owner. The categories will be wrong. That is fine. The point is that the first ticket after launch lands somewhere instead of nowhere.
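A minimal sketch, with placeholder owners:

```python
# A minimal sketch of the seeded taxonomy: categories pre-routed to
# owners before launch. All owner names are placeholders. Being wrong
# in a routable way beats being unrouted.
TRIAGE_ROUTES = {
    "hallucination":      "ml-quality@",
    "refusal-too-strict": "ml-quality@",
    "refusal-too-loose":  "trust-safety@",
    "latency":            "platform-oncall@",
    "cost":               "platform-oncall@",
    "privacy":            "security@",
    "prompt-injection":   "security@",
    "retrieval-miss":     "search-infra@",
}

def route(category: str) -> str:
    # Unknown categories land with a default owner instead of nowhere.
    return TRIAGE_ROUTES.get(category, "ai-feature-triage@")
```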

Eval drift alarms wired before launch, not after. Sample production traffic into an eval queue from day one. Compute the divergence between curated and production-sampled eval as its own metric. Alert on the divergence, not just on absolute scores.

A kill switch that is a real product surface. Not a config file. Not an environment variable that requires a deploy. A button that the on-call can press, that scopes by tenant or cohort or feature, and that logs who pressed it and why. RBAC the emergency-operator role to a small set of people. Rehearse it before launch.

A rollback strategy for prompts, not just code. Track code version, model version, and prompt version independently. Know what "rolling back" means at each layer. Pin the model version explicitly. Treat prompt changes as deployments with the same review and rollback discipline as code changes.

Per-cohort cost burn-rate alarms. Project spend forward. Tag by tenant, intent class, and prompt version. Reconcile the engineering view with finance's view before launch so the conversation in week one is about response, not about whose numbers are right.

A cross-session privacy audit. Two synthetic tenants, every shared component, isolation asserted at each layer. Run it before launch and re-run it whenever the architecture changes.

The Org Pattern Nobody Documents

The hardest part of the sequence is not technical. It is that the team that ships the AI feature is rarely the team that operates it on day sixty. The launch team has the model in their head, the prompt in their head, the retrieval logic in their head. The operations team inherits a codebase, a dashboard, and a Slack channel.

The handoff document that closes this gap usually does not exist, and writing it after the fact is harder than writing it before. The minimum viable handoff is short: what the feature does, which prompts and models and retrieval paths it uses, what the kill switches are and how to invoke them, what the known failure modes are with their triage owners, what the cost shape looks like by intent class, and where the eval queue lives. Write that document the week before launch. Hand it to the operating team. Sit with them through the first incident.

There is a leadership realization that lands somewhere around month three: AI feature operations are not a quality assurance problem and not an SRE problem in the classical sense. They are a discovery sequence that every team will run. The PagerDuty 2026 State of AI-First Operations report puts unplanned-disruption costs above a million dollars an hour at some organizations, and the qualitative shape of those disruptions consistently maps to the eight-week pattern above. The teams that suffer least are not the teams that prevent the sequence — that is not on the table. They are the teams that pre-staged the answers.

Run the sequence with a runbook. The bills, the screenshots, the eval drift, the privacy reviews, the latency surprises, the silent provider updates, the runbook gaps, and the missing flags will all show up anyway. The only variable is whether your team finds out from a dashboard or from a customer.
