The First 100 Tickets After You Launch an AI Feature
The bug count after an AI launch is not a quality problem. It is a discovery sequence — a sequence so predictable that you can sketch it on a whiteboard before the launch announcement goes out, week by week, ticket by ticket, and be embarrassingly close to right by the time the dashboards catch up. Every team that ships an AI feature runs this sequence. The only choice is whether you run it with a runbook or with a series of unscheduled all-hands.
I have watched enough launches now to believe the sequence is not really about engineering quality. It is about an information gap. Pre-launch, the team has a synthetic traffic mix, a curated eval set, a happy-path demo, and a board deck. Post-launch, reality arrives: users with intents the synthetic traffic never modeled, a marketing team whose campaigns engineering hears about secondhand, a model provider that ships changes the team did not authorize, and a privacy reviewer who was on vacation when the feature shipped. The sequence below is the friction that happens when those two worlds collide.
If you are about to launch, the value of mapping the sequence in advance is not that it makes the tickets stop. It is that the tickets stop being surprises, which is the difference between an on-call rotation that sleeps and one that does not.
Weeks 1–2: The Cost Spike and the Slack Screenshot
Week one is almost always cost. Not because the model is broken, but because someone in marketing scheduled a campaign that nobody told engineering about, and your synthetic-traffic cost model is now staring at a 4× overage by Wednesday. The pattern is consistent enough that you can write the postmortem in advance: most cost spikes trace to one of six causes — a runaway agent loop, a prompt regression that ballooned token counts, an unintended model upgrade, a retry storm, a noisy tenant, or a leaked API key. None of those are new failure modes. What is new is the price tag on each one.
The fix that actually works in week one is not "add a budget alert." A budget alert evaluated against the monthly total fires after most of the bill already exists. The fix is per-cohort burn-rate alarms: token cost projected forward over the billing period, the way a forecast-based cloud budget alert works, and scoped per intent class, per tenant, and per prompt version. If your cost dashboard shows different numbers than your finance dashboard, the conversation about why the launch is over budget will fall apart before it starts. Reconcile those two views before you ship, not after.
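A per-cohort burn-rate alarm can be sketched in a few lines. This is a minimal illustration, not a production design: the `Cohort` fields, the 720-hour billing period, and the linear projection are all assumptions made for the example, and a real system would likely read spend from its own metering pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Cohort:
    """Hypothetical cohort key: one alarm per (intent, tenant, prompt version)."""
    intent: str
    tenant: str
    prompt_version: str


def projected_cost(spend_so_far: float, hours_elapsed: float,
                   hours_in_period: float = 720.0) -> float:
    """Linearly project spend to the end of the billing period (an assumption)."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    return spend_so_far * hours_in_period / hours_elapsed


def burn_alerts(events: Iterable[Tuple[Cohort, float]],
                budgets: Dict[Cohort, float],
                hours_elapsed: float,
                hours_in_period: float = 720.0) -> List[Tuple[Cohort, float, float]]:
    """Return (cohort, projected, budget) for cohorts on track to overspend."""
    spend: Dict[Cohort, float] = defaultdict(float)
    for cohort, cost in events:
        spend[cohort] += cost

    alerts = []
    for cohort, total in spend.items():
        budget = budgets.get(cohort)
        if budget is None:
            continue  # no budget configured for this cohort
        projected = projected_cost(total, hours_elapsed, hours_in_period)
        if projected > budget:
            alerts.append((cohort, projected, budget))
    return alerts
```

The point of the projection is that the alarm fires on day one of a 4× overage rather than at month end, and the cohort key tells you which intent, tenant, or prompt version is responsible.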
Week two is the screenshot. A customer takes a screenshot of the AI saying something weird, posts it in a public Slack, and an hour later it is in a P1 retro. The content of the screenshot is rarely the lesson. The lesson is that nobody on the team knew the prompt the model was answering, the version of the model that produced the response, the retrieval context that fed it, or the user's prior turns. Without those four, you cannot reproduce, cannot triage, and cannot tell whether the issue affects one user or a thousand. Logging that captures the full prompt, model version, retrieval context, and conversation history for every response is unglamorous infrastructure that pays for itself the first week. Skip it and you will spend week three rebuilding it under pressure.
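As a rough sketch of what such a log record might contain, the example below serializes one response with everything needed to reproduce it. The `InferenceRecord` fields and the `to_log_line` helper are hypothetical names chosen for illustration, and a real deployment would also need redaction and retention rules for the logged text.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class InferenceRecord:
    """Hypothetical per-response record: enough to replay the request later."""
    request_id: str
    model_version: str           # exact provider model identifier
    prompt_version: str          # version of the prompt template
    rendered_prompt: str         # the prompt as actually sent
    retrieved_doc_ids: List[str] # retrieval context that fed the prompt
    prior_turns: List[str]       # earlier turns in the conversation
    response: str


def to_log_line(record: InferenceRecord) -> str:
    """Serialize the record as one JSON line, with a hash for quick grouping."""
    payload = asdict(record)
    payload["prompt_sha256"] = hashlib.sha256(
        record.rendered_prompt.encode("utf-8")
    ).hexdigest()
    return json.dumps(payload, sort_keys=True)
```

Grouping by `prompt_sha256` and `model_version` is one way to answer the "one user or a thousand" question quickly when a screenshot lands.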
Weeks 3–4: Eval Drift and the Privacy Review
By week three, your eval set starts disagreeing with your users. The benchmark numbers look fine. The user reports do not. This is not a bug in either signal. It is the gap between offline evaluation, which assumes fixed prompts, fixed labels, and a fixed notion of correctness, and the production system, which violates all three the moment users arrive. Even when benchmark accuracy stays stable, real-world performance can drift because the benchmark no longer reflects how the system is being used. Practitioners describe an accuracy drop as a lagging symptom rather than a leading indicator: by the time the metric moves, the drift in what users ask and mean has already been under way for some time.
The way out is not "run more evals." It is to treat evals as a continuous system rather than a release gate. Sample real production traffic into an eval queue weekly, label a slice, compare the results against your curated eval, and treat divergence between the two as its own metric. When the curated eval and the production-sampled eval start to disagree, the curated eval is rotting and the production-sampled set is the truth. Plan for the eval set to need a rewrite by month three. Plan for it explicitly, not as cleanup.
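One simple way to track that divergence is the absolute gap in pass rate between the two sets. The sketch below assumes each eval result has already been reduced to a pass/fail boolean, and the 0.10 alert threshold is an arbitrary placeholder, not a recommendation.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results")
    return sum(results) / len(results)


def eval_divergence(curated: List[bool], production_sampled: List[bool]) -> float:
    """Absolute gap between curated and production-sampled pass rates."""
    return abs(pass_rate(curated) - pass_rate(production_sampled))


def eval_is_rotting(curated: List[bool], production_sampled: List[bool],
                    threshold: float = 0.10) -> bool:
    """Flag the curated set once the gap exceeds the (assumed) threshold."""
    return eval_divergence(curated, production_sampled) > threshold
```

Charting this number weekly turns "the eval feels stale" into a trend line you can put a date on.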
Week four is the privacy review. A tenant's prompt context shows up in another tenant's support reply, or an internal field marked private surfaces in a model output, or an autocomplete suggestion includes a snippet that no anonymous user should have seen. The cause is almost never "the model leaked it." It is shared infrastructure: a cache keyed on something other than the tenant ID, a memory layer with looser scoping than the API, or a vector store whose access controls do not match the application's. This is the AI-era equivalent of an insecure direct object reference (IDOR) vulnerability, and it does not show up in pre-launch testing because pre-launch testing rarely involves two tenants exercising the same code path concurrently. Reported cross-session leak incidents in production AI systems usually trace to these same boundaries: misconfigured caches, shared memory, or improperly scoped session context.
A cross-session privacy audit before launch — running synthetic Tenant A and Tenant B traffic through every shared component and asserting strict isolation at each layer — is the cheapest test you can write and the one most teams skip.
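The shape of such an isolation check can be shown with a toy cache. Both cache classes and the `leaks_across_tenants` probe below are hypothetical stand-ins; in a real audit the same write-as-A, read-as-B probe would run against each actual shared component (response cache, memory layer, vector store).

```python
from typing import Dict, Optional, Tuple


class TenantScopedCache:
    """Cache keyed on (tenant_id, key), so entries never cross tenants."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def put(self, tenant_id: str, key: str, value: str) -> None:
        self._store[(tenant_id, key)] = value

    def get(self, tenant_id: str, key: str) -> Optional[str]:
        return self._store.get((tenant_id, key))


class LeakyCache:
    """Deliberately broken: ignores tenant_id, like a cache keyed on prompt only."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def put(self, tenant_id: str, key: str, value: str) -> None:
        self._store[key] = value

    def get(self, tenant_id: str, key: str) -> Optional[str]:
        return self._store.get(key)


def leaks_across_tenants(cache) -> bool:
    """Write as Tenant A, read the same key as Tenant B, report any hit."""
    cache.put("tenant-a", "summary:latest", "tenant-a-private-data")
    return cache.get("tenant-b", "summary:latest") is not None
```

Running the probe against `LeakyCache` reports a leak, while `TenantScopedCache` passes, which is the assertion a pre-launch audit would make at each layer.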
Weeks 5–6: The Latency Tail and the Silent Refusal Shift
Week five is when the heavyweight retrieval path you optimized for the demo becomes the user's modal flow. The path was supposed to be the fallback. Real users discovered it is the only thing that answers their actual questions, so the long-tail latency you were tracking as P99 is now P50. The existing monitoring assumed the previous distribution. The dashboards say everything is healthy. The users say the product is slow.
The architectural lesson is that your latency budget is not a single number. It is a per-intent budget, because users do not run the average case — they run their case, repeatedly, and your tail latency is somebody's median. Tag every request with the retrieval and reasoning path it actually took, then track P50/P95/P99 per path. When a fallback path's traffic share crosses ten percent, treat it as a primary path and re-budget.
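A per-path latency report along these lines might look like the sketch below. The path labels, the nearest-rank percentile method, and the input shape of `(path, latency_ms)` pairs are assumptions for illustration; the ten percent promotion threshold comes from the rule above.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def percentile(sorted_values: List[float], pct: int) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def path_report(requests: List[Tuple[str, float]],
                promote_share: float = 0.10) -> Dict[str, dict]:
    """Per-path traffic share and P50/P95/P99, flagging paths to re-budget."""
    by_path: Dict[str, List[float]] = defaultdict(list)
    for path, latency_ms in requests:
        by_path[path].append(latency_ms)

    total = len(requests)
    report = {}
    for path, latencies in by_path.items():
        latencies.sort()
        share = len(latencies) / total
        report[path] = {
            "share": share,
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
            "p99": percentile(latencies, 99),
            # A fallback above the threshold is a primary path in practice.
            "treat_as_primary": share > promote_share,
        }
    return report
```

The aggregate P50 across all paths can stay flat while one path's share quietly grows, which is why the share is reported next to the percentiles.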
Week six is the refusal shift. The model provider ships what they describe as a minor update. The benchmark numbers all look stable. Internal monitoring is green. But your refusal rate on a specific category of legitimate user request quietly doubles overnight, and customer support starts pinging you about users who say the AI "got worse." Anthropic's own April 2026 disclosure showed exactly this pattern — quality degradation concentrated in long-context coding, multi-step reasoning, and iterative problem-solving, not in the headline benchmarks. By the time the model provider acknowledges the incident, you have already shipped the workaround.
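A per-category refusal monitor is one way to catch this kind of shift before support does. The sketch below assumes each response has already been classified as refusal or not (how to classify is out of scope here), and the doubling ratio, minimum sample size, and rate floor are illustrative defaults rather than tuned values.

```python
from typing import Dict, List, Tuple


def refusal_rate(outcomes: List[bool]) -> float:
    """Fraction of responses flagged as refusals (True means refused)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def refusal_shifts(baseline: Dict[str, List[bool]],
                   current: Dict[str, List[bool]],
                   ratio_threshold: float = 2.0,
                   min_samples: int = 50,
                   rate_floor: float = 0.01) -> Dict[str, Tuple[float, float]]:
    """Return {category: (baseline_rate, current_rate)} for categories whose
    refusal rate rose by at least ratio_threshold between the two windows."""
    flagged = {}
    for category, cur in current.items():
        base = baseline.get(category, [])
        if len(base) < min_samples or len(cur) < min_samples:
            continue  # too little data to compare
        base_rate = refusal_rate(base)
        cur_rate = refusal_rate(cur)
        # The floor avoids flagging noise when the baseline rate is near zero.
        if cur_rate >= ratio_threshold * max(base_rate, rate_floor):
            flagged[category] = (base_rate, cur_rate)
    return flagged
```

Comparing per intent category matters because a doubling in one category can disappear entirely inside a stable overall refusal rate.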
- https://mlopsworld.com/post/the-real-ai-risk-shows-up-after-launch/
- https://www.techtarget.com/searchenterpriseai/feature/AI-deployments-gone-wrong-The-fallout-and-lessons-learned
- https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://runcycles.io/troubleshoot/llm-cost-spike-debugging
- https://www.giskard.ai/knowledge/cross-session-leak-when-your-ai-assistant-becomes-a-data-breach
- https://witness.ai/blog/llm-system-prompt-leakage/
- https://www.pluralsight.com/resources/blog/ai-and-data/how-cut-llm-costs-with-metering
- https://featbit.co/ai-rollback-strategy
- https://www.pagerduty.com/newsroom/2026-state-of-ai-first-operations/
- https://beefed.ai/en/kill-switches-incident-response-feature-flags
- https://insightfinder.com/blog/hidden-cost-llm-drift-detection/
- https://www.isaca.org/resources/news-and-trends/isaca-now-blog/2025/avoiding-ai-pitfalls-in-2026-lessons-learned-from-top-2025-incidents
