The AI Incident Severity Taxonomy: When Is a Hallucination a Sev-0?
A legal team's AI-powered research assistant fabricated three case citations and slipped them into a court filing. The citations looked plausible — real courts, real-sounding case names, coherent holdings. Nobody caught them before the brief was submitted. The incident cost the firm an emergency hearing, a public apology, and a bar inquiry.
Was that a sev-0? A sev-2? The answer depends on which framework you use — and traditional severity models will give you the wrong answer almost every time.
Software incident severity classification was built for deterministic systems. A service is either responding or it isn't. A database query either succeeds or throws an error. The failure modes are binary, the blame is traceable to a commit, and the fix is a rollback or a patch. AI systems break all three of those assumptions simultaneously, and organizations that apply traditional severity frameworks to LLM failures end up either panicking over noise or dismissing structural failures as one-off quirks.
Why Traditional Severity Breaks for AI Systems
Standard severity levels — usually sev-0 through sev-4 or sev-5 — are designed around observable, persistent failure states. A sev-0 is total outage. A sev-1 is critical degradation affecting most users. The model assumes that severity correlates directly with scope and duration, and that both are measurable.
LLM systems violate this model in three ways.
Non-determinism defeats persistent failure states. Run the same input through the same model at temperature 0.7 and you can get a different output on every call. An error that reproduces 30% of the time looks like a sev-3 in spot checks but functions like a sev-1 at scale. Traditional incident response asks: "Is it still happening?" For probabilistic systems, that question doesn't have a stable answer.
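The gap between spot-check visibility and production impact is easy to quantify. A minimal sketch, assuming the failure reproduces independently with probability p on each call (a simplification of real LLM behavior) and an illustrative request volume:

```python
# Sketch: why an intermittent failure looks minor in spot checks but
# severe at scale. Assumes independent per-call failures with probability
# p_fail; the 100k daily request volume is an illustrative assumption.

def p_detect(p_fail: float, n_samples: int) -> float:
    """Probability a spot check of n_samples sees the failure at least once."""
    return 1 - (1 - p_fail) ** n_samples

def expected_failures(p_fail: float, daily_requests: int) -> float:
    """Expected number of affected requests per day at production volume."""
    return p_fail * daily_requests

# A 30% failure: a 3-sample spot check misses it about a third of the time...
print(round(p_detect(0.30, 3), 2))        # ~0.66: one in three spot checks sees nothing
# ...while production takes 30,000 hits per day at 100k requests.
print(expected_failures(0.30, 100_000))   # 30000.0
```

The same arithmetic explains why "we couldn't reproduce it" is not evidence of a low failure rate.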
Semantic errors pass validation. When a service throws a 500, your monitoring catches it. When an LLM produces a confidently wrong answer formatted exactly as expected, every downstream check passes. The failure propagates as valid data. In multi-agent systems, research shows independent agent pipelines amplify these errors 17x compared to single-agent baselines, while centralized coordination constrains amplification to around 4x. A hallucination in step 2 of a 10-step pipeline doesn't just affect step 2 — it poisons everything downstream.
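The poisoning dynamic can be illustrated with a toy independence model — this is not the cited study's methodology, just a sketch of how small per-step error rates compound when any upstream error taints everything downstream:

```python
# Toy model (an assumption, not the cited research): each pipeline step
# independently introduces an undetected semantic error with probability
# p_step_error, and any upstream error poisons all later steps.

def p_pipeline_clean(p_step_error: float, n_steps: int) -> float:
    """Probability the final output is untainted after n_steps."""
    return (1 - p_step_error) ** n_steps

# A 2% per-step error rate looks tolerable in isolation...
print(round(1 - p_pipeline_clean(0.02, 1), 3))   # 0.02 for a single step
# ...but across a 10-step pipeline, nearly 1 in 5 runs is poisoned.
print(round(1 - p_pipeline_clean(0.02, 10), 3))  # 0.183
```

Real multi-agent systems are worse than this model, because agents actively build on each other's tainted outputs rather than failing independently.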
Causation is distributed. Traditional postmortems trace failures to a commit. AI incidents have no single cause. The model changed. The prompt template was updated. The training data distribution shifted. The inference temperature was bumped. The retrieval corpus went stale. Any of these, in any combination, can produce the same observed degradation, and blaming "the model" is as useful as blaming "the network."
A Multidimensional Incident Classification Framework
The fix is to stop treating severity as a single axis and start treating it as a matrix of four dimensions. Each dimension is independently measurable and independently actionable.
Dimension 1: Scope — Per-Session vs. Per-Cohort
The first question to answer in any AI incident is whether the failure is affecting individual sessions or correlated groups.
A per-session failure looks random. User A got a hallucinated answer; User B asking the same question got a correct one. This is the baseline behavior of any probabilistic system — you're measuring the tail of the distribution, not a structural fault.
A per-cohort failure is different in kind. All users who invoke the system with documents longer than 8,000 tokens get truncated answers. All users on iOS 17.4 through your mobile SDK hit a context marshaling bug. All queries containing date ranges return subtly wrong results. Cohort failures indicate a structural issue: something about this class of inputs, this user segment, or this execution path is systematically broken.
The incident response implication: per-session failures warrant monitoring and statistical baselining. Per-cohort failures warrant immediate investigation and potential feature gating. A hallucination affecting 0.3% of sessions might be normal noise; a hallucination affecting 100% of sessions with a specific prompt pattern is a sev-1 regardless of raw user count.
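A cohort check can be as simple as bucketing failure logs by input features and comparing rates against the overall baseline. A minimal sketch — the bucket names and the 3x escalation threshold are illustrative assumptions, not a standard:

```python
# Sketch: distinguishing per-session noise from a per-cohort failure by
# comparing failure rates across input buckets. Cohort keys and the 3x
# threshold are illustrative choices.

from collections import defaultdict

def cohort_failure_rates(events):
    """events: iterable of (cohort_key, failed: bool) pairs."""
    counts = defaultdict(lambda: [0, 0])  # cohort -> [failures, total]
    for cohort, failed in events:
        counts[cohort][0] += int(failed)
        counts[cohort][1] += 1
    return {c: f / t for c, (f, t) in counts.items()}

def flag_cohorts(rates, baseline, factor=3.0):
    """Flag cohorts whose failure rate exceeds factor x the overall baseline."""
    return sorted(c for c, r in rates.items() if r > factor * baseline)

events = [("short_doc", False)] * 197 + [("short_doc", True)] * 3 \
       + [("long_doc", False)] * 10 + [("long_doc", True)] * 40
rates = cohort_failure_rates(events)
baseline = sum(f for _, f in events) / len(events)
print(flag_cohorts(rates, baseline))  # ['long_doc']
```

The overall failure rate here (17.2%) hides the real story: short documents fail 1.5% of the time, long documents 80% of the time. That's a structural failure wearing a statistical disguise.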
Dimension 2: Failure Type — Factual vs. Stylistic Drift
Not all output degradation is equal. AI incidents split cleanly into two categories with very different urgency profiles.
Factual degradation means the model is producing claims that are wrong. Fabricated citations, incorrect statistics, hallucinated product features, wrong dates. These failures are measurable against ground truth, they create direct harm, and they're what most people mean when they say "the AI is broken."
Stylistic drift means the output changed in character but not in correctness. Responses became more verbose. Tone shifted from professional to casual. Formatting conventions changed. Answer length doubled. These changes are real — users notice them, eval scores shift, A/B tests detect them — but they rarely constitute an incident in isolation.
The practical implication: factual degradation escalates immediately. Stylistic drift goes into the monitoring backlog. The failure you need to be careful about is factual degradation that's been miscategorized as stylistic drift because it passed automated format checks. A system that validates JSON structure but not semantic correctness will miss this every time.
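The miscategorization risk is concrete: the structural check and the semantic check are different tests, and most pipelines only run the first. A sketch, using one of the fabricated citations from the legal incident genre and a stand-in `known_cases` set in place of a real citation-verification source:

```python
# Sketch: a format check that passes while the answer is factually wrong.
# The schema check is where most pipelines stop; the ground-truth lookup
# is what actually catches factual degradation. known_cases is a stand-in
# for a real citation authority.

import json

model_output = '{"case": "Varghese v. China Southern Airlines", "year": 2019, "holding": "..."}'

# Structural validation passes: it parses, the expected keys are present.
parsed = json.loads(model_output)
assert {"case", "year", "holding"} <= parsed.keys()

# Semantic validation fails: the citation does not exist in the authority DB.
known_cases = {"Smith v. Jones", "Doe v. Acme Corp"}
is_factual = parsed["case"] in known_cases
print(is_factual)  # False: valid JSON, fabricated citation
```

Every automated gate between the model and the user passed. Only the gate nobody built — semantic verification against ground truth — would have caught it.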
Dimension 3: Visibility — User-Facing vs. Internal
An embedding drift that reduces retrieval precision from 0.94 to 0.88 is a real degradation. Whether it's an incident depends on whether it surfaces to users.
Internal-only failures — degraded retrieval scores, slower chain-of-thought traces, lower reranker confidence — matter for the engineering team's situational awareness. They're leading indicators of future user-facing failures. But they're not incidents in the traditional sense: no users are harmed, no trust is lost, and the system still functions.
User-facing failures change the calculus entirely. A chatbot hallucinating a refund policy that the company is then legally pressured to honor is a user-facing incident with real financial consequences. The AI is "working" — it returned a response, it parsed correctly, it hit latency SLOs — but the output caused harm.
The classification question isn't just "did the system return an error?" It's "did the system return output that caused harm, confusion, or erosion of trust for a real user?" If yes, it's an incident. If no, it might be important monitoring data, but it's not a pager event.
Dimension 4: Damage Profile — Reversible vs. Compounding
This is the dimension most teams underweight, and it's the one that determines whether a 2% regression is background noise or a sev-1.
Reversible failures stop when you fix them. Roll back the prompt, revert the model version, gate the feature — and the damage stops. The Air Canada chatbot hallucinating a bereavement-fare discount that the airline was forced to honor is a reversible incident: embarrassing, financially limited, but bounded.
Compounding failures accumulate. Consider three scenarios where compound damage changes the classification:
- An AI coding assistant starts producing subtly broken code patterns. Developers accept suggestions without testing. The patterns propagate across a codebase for three weeks before the issue surfaces. The fix is easy; the remediation of 18 months of technical debt is not.
- A data enrichment pipeline uses an LLM to classify customer records. The classifier develops a systematic bias. Those records feed a recommendation engine, which trains on them. The poisoning reaches the next model training run before anyone detects the original issue.
- A financial agent begins making subtly incorrect preprocessing decisions. Each individual decision is within normal variance. Across a portfolio, the errors compound.
Reversible failures with broad per-cohort scope: sev-1. Compounding failures even with narrow per-session scope: sev-1. A factual failure that's user-visible and irreversible: sev-0 or sev-1 depending on scope.
The Threshold Math Problem
Teams often ask for a simple rule: "When does a 2% regression become a sev-2?"
There isn't a universal answer, because the right threshold is a function of context:
- In a content generation tool, a 2% factual error rate might be acceptable baseline noise, clearly communicated to users through trust and uncertainty UI patterns.
- In a clinical decision support tool, a 0.1% factual error rate on drug interaction recommendations is a sev-0. Someone might die.
- In a legal research tool, a 5% citation fabrication rate is career-ending for the attorneys relying on it.
The right threshold math involves two calculations:
Statistical significance vs. business significance. A 2% regression is statistically significant if your cohort is large enough. But statistical significance doesn't determine severity — business impact does. The question isn't whether the regression is real; it's whether the magnitude of that regression, across that scope, with that damage profile, constitutes a threshold violation in your specific deployment context.
Confidence interval width for stochastic systems. Because LLM outputs are non-deterministic, your confidence intervals are wider than they would be for deterministic software. A regression that looks like 2% in a 50-sample spot check might be 8% or -1% with proper statistical power. Sequential testing methods — which detect significant changes before requiring full sample sizes — are more useful than batch A/B testing for AI incident detection. The question isn't "did we see a regression?" but "are we confident enough that this regression is real to declare an incident?"
The AI Postmortem: Questions That Don't Have Commits
Traditional postmortems work backwards from a commit or config change. AI postmortems work forward from symptoms, because the causal chain is distributed across factors that often weren't intentionally changed.
The questions that matter are different:
What changed? Not just code. Enumerate: model version, prompt templates, few-shot examples, retrieval corpus, inference parameters (temperature, top-p, max-tokens), orchestration logic, input distribution. Any of these could be the cause. All of them need to be versioned and logged if you want to be able to answer this question after the fact.
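One way to make this question answerable is to log a fingerprint of the full inference configuration with every request. A sketch — the field names are illustrative, and a real system would carry more of them:

```python
# Sketch: capturing everything that could have "changed" alongside each
# request, so the postmortem enumeration above is answerable from logs.
# Field names are illustrative assumptions.

import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class InferenceSnapshot:
    model_version: str
    prompt_template_hash: str    # hash, not the full template, to keep logs small
    retrieval_corpus_version: str
    temperature: float
    top_p: float
    max_tokens: int

    def fingerprint(self) -> str:
        """Stable ID: two requests with the same fingerprint ran the same stack."""
        blob = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()[:12]

snap = InferenceSnapshot("model-2024-06", "a1b2c3", "corpus-v42", 0.7, 0.95, 1024)
print(snap.fingerprint())  # log this with every response
```

When a cohort failure surfaces, grouping affected requests by fingerprint immediately separates "something changed under us" from "same stack, shifted inputs."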
Which cohort was affected? Characterize the affected population as specifically as possible. What input features, user segments, or execution paths correlate with the failure? This isn't just for root cause analysis — it tells you whether you need to halt the feature entirely or gate it by cohort while you investigate.
Is the damage still accumulating? For compounding failures, stopping the root cause doesn't undo the damage already done. You need a parallel answer to "what caused this" and "what has already been affected and needs remediation."
What would reproduce this? Don't close a postmortem without an eval case. The answer to "we can't reliably reproduce it because it's non-deterministic" is to run a thousand examples against the suspected input class until you understand the failure rate distribution. If you can't characterize the failure statistically, you can't know whether your fix worked.
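Statistical characterization looks less like a repro script and more like a sampling loop. A minimal sketch, with the model call stubbed by a fixed failure rate so the example is self-contained — in practice `run_case` would invoke your actual model plus an eval check:

```python
# Sketch: characterizing a non-deterministic failure statistically instead
# of trying to reproduce it once. run_case is a stub standing in for a
# real model call against the suspected input class.

import random

def run_case(rng: random.Random) -> bool:
    """Stub: returns True when the output passes the eval check."""
    return rng.random() > 0.30   # stand-in for a ~30% failure mode

def failure_rate(n: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(not run_case(rng) for _ in range(n)) / n

# A single run tells you almost nothing; a thousand runs gives an estimate
# stable enough to verify a fix against.
print(round(failure_rate(1000), 2))
```

The exit criterion for the postmortem is that this loop, run before and after the fix, shows a statistically significant drop — not that one lucky rerun came back clean.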
What's the counterfactual? Would the damage have been lower if the system had abstained rather than answered? Many AI incidents would have been lower-severity if the system had expressed uncertainty instead of producing a confident wrong answer. The postmortem should evaluate not just what went wrong but whether the system's confidence calibration is contributing to impact.
Putting It Together: A Classification Checklist
When a potential AI incident surfaces, work through this checklist before assigning severity:
- Scope: Is this per-session (isolated noise) or per-cohort (correlated failure pattern)?
- Failure type: Is this factual degradation (wrong claims) or stylistic drift (changed presentation)?
- Visibility: Is this user-facing (users harmed or deceived) or internal-only (monitoring signal)?
- Damage profile: Will stopping the root cause stop the damage (reversible) or has accumulated damage already propagated (compounding)?
A failure that scores "per-cohort + factual + user-facing + compounding" is a sev-0 or sev-1. A failure that scores "per-session + stylistic + internal-only + reversible" is a monitoring note, not a pager event. The middle cases — factual but internal-only, user-facing but reversible, per-cohort but stylistic — require judgment about business context.
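The checklist can be encoded as a decision rule. A sketch that follows the severity mappings stated above — the labels for the middle cases are illustrative defaults, since the article's point is that those require business-context judgment:

```python
# Sketch: the four-dimension checklist as a decision rule. The clear-cut
# mappings follow the rules above; "judgment-call" marks the middle cases
# the text says depend on business context.

def classify(scope: str, failure: str, visibility: str, damage: str) -> str:
    """scope: 'session'|'cohort'; failure: 'factual'|'stylistic';
    visibility: 'user'|'internal'; damage: 'reversible'|'compounding'."""
    if failure == "factual" and visibility == "user" and damage == "compounding":
        return "sev-0"
    if damage == "compounding" or (scope == "cohort" and failure == "factual"):
        return "sev-1"
    if scope == "session" and failure == "stylistic" and visibility == "internal":
        return "monitor"    # monitoring note, not a pager event
    return "judgment-call"  # middle cases: depends on deployment context

# Per-cohort + factual + user-facing + compounding:
print(classify("cohort", "factual", "user", "compounding"))  # sev-0
```

The value of encoding it isn't automation for its own sake — it's that the on-call engineer at 3 a.m. answers four factual questions instead of debating a severity number.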
The legal citation incident that opened this piece? Per-cohort (specific prompt patterns involving document retrieval), factual (fabricated case names), user-facing (submitted to a court), compounding (legal consequences that can't be unwound). That's a sev-0. The severity classification doesn't care that the system was "working" — it returned a response, it didn't throw an error, it hit latency SLOs. The output caused irreversible real-world harm. That's what severity is actually measuring.
Looking Forward
The practical takeaway isn't to adopt this framework wholesale as a checklist. It's to stop asking "is the AI broken?" — that question has no stable answer for probabilistic systems — and start asking four smaller questions whose answers combine into a severity classification that actually drives the right response.
Organizations deploying AI features at scale need severity frameworks that distinguish between the tail of the distribution and a structural failure. They need postmortem processes that can survive the absence of a single culprit commit. And they need threshold math that's grounded in business impact rather than generic statistical conventions.
The cost of getting this wrong runs in both directions. Over-react to probabilistic noise and you'll trigger endless sev-1 incidents that burn out your on-call rotation and slow feature development. Under-react to structural failures and you'll discover the compounding damage six weeks after it started — when the fix is easy but the remediation is not.
- https://www.nature.com/articles/s41598-025-15416-8
- https://arxiv.org/html/2509.18970v1
- https://arxiv.org/pdf/2508.01781
- https://www.coalitionforsecureai.org/defending-ai-systems-a-new-framework-for-incident-response-in-the-age-of-intelligent-technology/
- https://www.mdpi.com/2624-800X/6/1/20
- https://www.microsoft.com/en-us/security/blog/2026/04/15/incident-response-for-ai-same-fire-different-fuel/
- https://galileo.ai/blog/multi-agent-ai-failures-prevention
- https://learnagentic.substack.com/p/what-is-error-cascading-in-multi
- https://www.evidentlyai.com/blog/llm-regression-testing-tutorial
- https://response.pagerduty.com/before/severity_levels/
- https://www.confident-ai.com/blog/llm-testing-in-2024-top-methods-and-strategies
- https://www.getmaxim.ai/articles/diagnosing-and-measuring-ai-agent-failures-a-complete-guide/
