
The Customer-Facing AI Postmortem When Nothing Crashed

12 min read
Tian Pan
Software Engineer

Your status page is green. Your error rate is zero. Your uptime dashboard reads 100% for the seventh consecutive month. And yet at 9:14 AM on a Tuesday, your account team is forwarding you a message from a Fortune 500 customer that says, "Our team noticed the assistant has been worse this week. Can you tell us what changed?" Twelve more like it land before lunch. None of them will be answered by the existing incident-comms playbook, because that playbook was built for outages, and nothing has crashed.

This is the customer-facing AI postmortem problem, and it is the single most consistent gap I see across teams shipping LLM features into enterprise contracts. The reliability surface has shifted from "is it up" to "is it as good as it was last week," and almost none of the comms infrastructure has caught up. Status pages don't have a tile for it. Severity rubrics don't grade it. Support macros default to "we identified an issue and resolved it," which reads as either dismissive or alarming depending on the customer's mood that day.

The mistake most teams make is treating these incidents as edge cases. They are not edge cases. Practitioners report detection lags of 14–18 days between when a quality regression starts and when the first user complaint reaches support, which means a customer-facing soft-regression incident is happening to your product roughly every time you ship a prompt change, judge update, or model version bump — you just usually don't hear about it. The ones you do hear about are the ones a customer's internal eval, or, worse, their CTO's gut, has decided to escalate. By the time it reaches you, the customer has already lost some baseline trust in the channel. The postmortem is no longer optional.

The Shape of a Quality-Regression Incident

The first problem with writing about these incidents is that engineers and customers do not share a vocabulary for them. An outage is easy: the API returned 503, the dashboard went red, the apology writes itself. A soft regression is harder. Every API call returned a 200 status code. Latency was nominal. The user-facing feature did not break in a way any monitor would catch. What changed was that the assistant started giving worse answers, and "worse" lives in a measurement layer most customers cannot see and most internal teams have not learned to summarize.

This produces four canonical root causes that every AI team needs to be able to name out loud, because they are the four answers to "what changed":

Model migration. The provider deprecated the version you were pinned to, or you proactively moved to a newer one, and the new model has different output characteristics on your prompts. Provider deprecation timelines are real — major foundation-model providers now give 60-day notices on model retirement, which means every team running LLM features in production has a quarterly migration calendar event whether they have planned for it or not. Migrations are the most common root cause and the easiest to misdiagnose, because the metrics that drop are not always the metrics you were watching.

Prompt change. Somebody edited the system prompt, the rubric, the few-shot examples, or the tool descriptions, and the planner downstream of those changes started behaving differently. Often this was a deliberate change for a different optimization target — say, reducing token cost or fixing a refusal — and the regression is on a dimension nobody was tracking when the change shipped.

Judge drift. Your LLM-as-judge — the offline grader you use to score eval runs and trigger regression alarms — has itself drifted, usually because the judge's own backbone model was silently updated by the provider or because somebody tweaked the rubric prompt. The judge now scores the same outputs differently, which means either you missed a real regression (the judge got more permissive) or you are now chasing a phantom one (the judge got stricter). Either way, your trust calibration on every score the judge produced this quarter is in question. One way to catch this is sketched after the list.

Input distribution shift. Your customers' usage changed. Maybe a new product launch on their side started routing a different mix of queries into your assistant. Maybe seasonality. Maybe a competitor in their workflow stopped supporting some feature and pushed traffic to you. The model has not changed and the prompts have not changed, but the inputs no longer match the inputs your evals were measured against, so the in-production quality score is now meaningfully different from the one you reported at deployment time. A simple drift check is also sketched after the list.

These four are the entire vocabulary you owe a customer. Every soft-regression incident is one of them, or some named combination. If your postmortem cannot place the root cause into one of those buckets in language a non-AI procurement team can act on, the postmortem will read as either evasive or technical-jargon-flavored noise.
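For judge drift in particular, the cheapest detector is a frozen calibration set: a small pool of outputs with trusted human grades that you re-score with the current judge on a schedule. Below is a minimal sketch of that check; `run_judge`, `CalibrationItem`, and the 0.10 alarm threshold are placeholder names and assumed values, not any particular eval framework's API.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class CalibrationItem:
    output: str          # a frozen model output, never regenerated
    human_score: float   # trusted human grade on a 0.0-1.0 scale

def judge_agreement(items, run_judge, tolerance=0.1):
    """Fraction of calibration items where the judge lands within
    `tolerance` of the human grade."""
    return mean(abs(run_judge(item.output) - item.human_score) <= tolerance
                for item in items)

def check_judge_drift(items, run_judge, baseline_agreement, alarm_drop=0.10):
    """Compare today's agreement against the agreement recorded when the
    judge was last calibrated. A large drop means the judge itself moved,
    and every eval score it produced since then needs re-grading before
    you trust it."""
    current = judge_agreement(items, run_judge)
    return current, (baseline_agreement - current) >= alarm_drop
```

The point of freezing the outputs is that nothing else can move: if the scores on an unchanged set change, the judge did.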
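Input distribution shift is similarly cheap to watch for, assuming you already tag incoming queries with some lightweight category label (intent, product area, language). The sketch below compares this week's category mix against the mix your eval set was sampled from; the category names and the 0.15 alert threshold are illustrative assumptions.

```python
from collections import Counter

def category_shares(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

def traffic_shift(baseline_labels, current_labels):
    """Total variation distance between the two category mixes:
    0.0 means identical traffic, 1.0 means completely disjoint."""
    base, cur = category_shares(baseline_labels), category_shares(current_labels)
    cats = set(base) | set(cur)
    return 0.5 * sum(abs(base.get(c, 0.0) - cur.get(c, 0.0)) for c in cats)

# Toy example: the eval set was 70/30 search vs. drafting queries, this
# week's traffic is 40/60, and the (assumed) 0.15 alert threshold trips.
baseline = ["search"] * 70 + ["drafting"] * 30
this_week = ["search"] * 40 + ["drafting"] * 60
assert traffic_shift(baseline, this_week) > 0.15
```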

What a Customer-Facing Severity Scale Has to Cover

The next problem is that traditional severity scales — SEV0 through SEV3, blast radius measured in users affected and minutes of downtime — do not grade quality regressions in any usable way. A regression that takes a per-task eval score from 88% to 82% registers as zero minutes of downtime, yet it is clearly not harmless; nor is it an outage. It needs its own axis.

The teams I have seen do this well build a parallel severity scale that maps three dimensions (a sketch of the mapping follows the list):

Magnitude. How big is the metric move? "Eval-on-traffic dropped 4 points" is the kind of phrase that needs to be a named incident class. So is "judge agreement with humans dropped from 0.81 to 0.68 on the calibration set." The scale should have explicit thresholds: a 1-point eval drop is a warning, a 4-point drop is a SEV2 equivalent, an 8-point drop is the AI-quality version of a SEV0.

Blast radius. Which customer segments saw it? AI quality regressions are rarely uniform across the customer base. A model migration might be neutral on enterprise users with structured query patterns and catastrophic on long-tail use cases the eval set under-sampled. The severity scale should explicitly carry "% of traffic affected" and "% of named customers affected" as separate numbers, because the second number is the one a customer reads in their postmortem to decide whether to escalate inside their org.

Dimension. Which quality axis moved? A drop in factuality has very different downstream consequences than a drop in formatting compliance, and both have very different consequences than a drop in tone or refusal-rate. The severity scale needs to name the dimensions, and the postmortem needs to say which one moved. "Quality went down" is a non-answer.
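Here is a minimal sketch of how such a scale might be encoded, using the magnitude thresholds above. The QSEV names, the 25% named-customer escalation rule, and the field layout are illustrative assumptions rather than an established standard.

```python
from dataclasses import dataclass

@dataclass
class QualityRegression:
    eval_drop_points: float      # magnitude: how far the eval-on-traffic score fell
    traffic_affected_pct: float  # blast radius: share of all requests
    named_customers_pct: float   # blast radius: share of named accounts
    dimension: str               # which axis moved: "factuality", "formatting", "tone", ...

def quality_severity(r: QualityRegression) -> str:
    """Grade magnitude first, let blast radius escalate, and always carry
    the dimension, because "quality went down" is a non-answer."""
    if r.eval_drop_points >= 8:
        sev = "QSEV0"
    elif r.eval_drop_points >= 4:
        sev = "QSEV2"
    elif r.eval_drop_points >= 1:
        sev = "WARNING"
    else:
        return "no incident"
    # Illustrative escalation rule: a mid-size drop concentrated on named
    # customers is the one that gets escalated inside their org.
    if sev == "QSEV2" and r.named_customers_pct >= 25:
        sev = "QSEV1"
    return (f"{sev}: {r.dimension} regression on "
            f"{r.traffic_affected_pct:.0f}% of traffic, "
            f"{r.named_customers_pct:.0f}% of named customers")
```

The useful property of a scheme like this is that the severity string already contains the three things the customer-facing postmortem has to say: how big, who saw it, and which quality axis moved.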
