
The AI Incident Severity Taxonomy: When Is a Hallucination a Sev-0?

11 min read
Tian Pan
Software Engineer

A legal team's AI-powered research assistant fabricated three case citations and slipped them into a court filing. The citations looked plausible — real courts, real-sounding case names, coherent holdings. Nobody caught them before the brief was submitted. The incident cost the firm an emergency hearing, a public apology, and a bar inquiry.

Was that a sev-0? A sev-2? The answer depends on which framework you use — and traditional severity models will give you the wrong answer almost every time.

Software incident severity classification was built for deterministic systems. A service is either responding or it isn't. A database query either succeeds or throws an error. The failure modes are binary, the blame is traceable to a commit, and the fix is a rollback or a patch. AI systems break all three of those assumptions simultaneously, and organizations that apply traditional severity frameworks to LLM failures end up either panicking over noise or dismissing structural failures as one-off quirks.

Why Traditional Severity Breaks for AI Systems

Standard severity levels — usually sev-0 through sev-4 or sev-5 — are designed around observable, persistent failure states. A sev-0 is total outage. A sev-1 is critical degradation affecting most users. The model assumes that severity correlates directly with scope and duration, and that both are measurable.

LLM systems violate this model in three ways.

Non-determinism defeats persistent failure states. Run the same input through the same model at temperature 0.7 and you can get a different output on every call. An error that reproduces 30% of the time looks like a sev-3 in spot checks but functions like a sev-1 at scale. Traditional incident response asks: "Is it still happening?" For probabilistic systems, that question doesn't have a stable answer.
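One way to restore a stable answer is to make it statistical: replay the probe and report a rate with a confidence interval. A minimal sketch in Python, assuming a hypothetical `generate` call and a `fails` predicate that you supply (neither is a real API):

```python
import math

def estimate_failure_rate(generate, probe, fails, n=50):
    """Replay one probe input n times and estimate its failure rate.

    `generate` and `fails` are hypothetical stand-ins for your model
    call and your own failure check; neither is a real API.
    """
    failures = sum(fails(generate(probe)) for _ in range(n))
    p = failures / n
    z = 1.96  # ~95% confidence
    # Wilson score interval: better behaved than p +/- z*stderr at small n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, (max(0.0, center - margin), min(1.0, center + margin))
```

The 30%-reproduction error above stops being a spot-check anecdote and becomes a measured rate you can alert on.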

Semantic errors pass validation. When a service throws a 500, your monitoring catches it. When an LLM produces a confidently wrong answer formatted exactly as expected, every downstream check passes. The failure propagates as valid data. In multi-agent systems, research shows independent agent pipelines amplify these errors 17x compared to single-agent baselines, while centralized coordination constrains amplification to around 4x. A hallucination in step 2 of a 10-step pipeline doesn't just affect step 2 — it poisons everything downstream.
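The downstream-poisoning math is worth making explicit. Under the simplifying assumption that steps fail independently, a per-step error rate that looks negligible becomes a pipeline-level problem quickly:

```python
def pipeline_corruption_rate(per_step_error: float, steps: int) -> float:
    """Chance that at least one step emits a semantically wrong but
    schema-valid output, assuming steps fail independently (a
    simplification; real agent errors correlate)."""
    return 1 - (1 - per_step_error) ** steps

print(round(pipeline_corruption_rate(0.02, 1), 3))   # 0.02: benign in isolation
print(round(pipeline_corruption_rate(0.02, 10), 3))  # 0.183: ~1 in 5 runs poisoned
```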

Causation is distributed. Traditional postmortems trace failures to a commit. AI incidents have no single cause. The model changed. The prompt template was updated. The training data distribution shifted. The inference temperature was bumped. The retrieval corpus went stale. Any of these, in any combination, can produce the same observed degradation, and blaming "the model" is as useful as blaming "the network."
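One practical countermeasure is to stamp every response with a fingerprint of the full inference configuration, so a postmortem can diff the fingerprints of good and bad outputs instead of arguing about "the model." A sketch, with illustrative field names rather than any standard schema:

```python
from dataclasses import asdict, dataclass
import hashlib
import json

@dataclass(frozen=True)
class InferenceFingerprint:
    """Every knob that can independently shift output quality.
    Field names are illustrative, not a standard schema."""
    model_version: str
    prompt_template_hash: str
    temperature: float
    retrieval_corpus_snapshot: str

    def digest(self) -> str:
        # Stable short hash to attach to every response, so incident
        # review can correlate degradations with configuration changes.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]
```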

A Multidimensional Incident Classification Framework

The fix is to stop treating severity as a single axis and start treating it as a matrix of four dimensions. Each dimension is independently measurable and independently actionable.
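The four dimensions are unpacked one at a time below. As a preview, the matrix itself can be as plain as a record with one field per axis; the names mirror this post's terminology, not any established standard:

```python
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    PER_SESSION = "per-session"
    PER_COHORT = "per-cohort"

class FailureType(Enum):
    FACTUAL = "factual"
    STYLISTIC = "stylistic"

class Visibility(Enum):
    USER_FACING = "user-facing"
    INTERNAL = "internal"

class DamageProfile(Enum):
    REVERSIBLE = "reversible"
    COMPOUNDING = "compounding"

@dataclass
class AIIncident:
    """One incident, classified on all four axes at intake."""
    summary: str
    scope: Scope
    failure_type: FailureType
    visibility: Visibility
    damage: DamageProfile
```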

Dimension 1: Scope — Per-Session vs. Per-Cohort

The first question to answer in any AI incident is whether the failure is affecting individual sessions or correlated groups.

A per-session failure looks random. User A got a hallucinated answer; User B asking the same question got a correct one. This is the baseline behavior of any probabilistic system — you're measuring the tail of the distribution, not a structural fault.

A per-cohort failure is different in kind. All users who invoke the system with documents longer than 8,000 tokens get truncated answers. All users on iOS 17.4 through your mobile SDK hit a context marshaling bug. All queries containing date ranges return subtly wrong results. Cohort failures indicate a structural issue: something about this class of inputs, this user segment, or this execution path is systematically broken.

The incident response implication: per-session failures warrant monitoring and statistical baselining. Per-cohort failures warrant immediate investigation and potential feature gating. A hallucination affecting 0.3% of sessions might be normal noise; a hallucination affecting 100% of sessions with a specific prompt pattern is a sev-1 regardless of raw user count.
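Telling the two apart is a group-by, not a judgment call. A rough sketch, assuming failure events arrive as a list of dicts with candidate attributes attached (an illustrative schema, not a real telemetry format):

```python
from collections import defaultdict

def find_failing_cohorts(events, attrs, baseline, min_n=30, factor=5.0):
    """Flag attribute values whose failure rate sits far above baseline.

    `events` is a list of dicts such as
    {"os": "ios-17.4", "doc_len": ">8k-tokens", "failed": True};
    the schema is illustrative, not a real telemetry format.
    """
    flagged = []
    for attr in attrs:
        counts = defaultdict(lambda: [0, 0])  # value -> [failures, total]
        for e in events:
            c = counts[e.get(attr)]
            c[0] += bool(e["failed"])
            c[1] += 1
        for value, (failures, total) in counts.items():
            if total >= min_n and failures / total >= factor * baseline:
                flagged.append((attr, value, failures / total, total))
    return flagged  # empty list: treat the failures as per-session noise
```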

Dimension 2: Failure Type — Factual vs. Stylistic Drift

Not all output degradation is equal. AI incidents split cleanly into two categories with very different urgency profiles.

Factual degradation means the model is producing claims that are wrong. Fabricated citations, incorrect statistics, hallucinated product features, wrong dates. These failures are measurable against ground truth, they create direct harm, and they're what most people mean when they say "the AI is broken."

Stylistic drift means the output changed in character but not in correctness. Responses became more verbose. Tone shifted from professional to casual. Formatting conventions changed. Answer length doubled. These changes are real — users notice them, eval scores shift, A/B tests detect them — but they rarely constitute an incident in isolation.

The practical implication: factual degradation escalates immediately. Stylistic drift goes into the monitoring backlog. The failure you need to be careful about is factual degradation that's been miscategorized as stylistic drift because it passed automated format checks. A system that validates JSON structure but not semantic correctness will miss this every time.
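The gap between the two checks is easy to see in code. A toy example, with a hardcoded `KNOWN_CASES` set standing in for real citation verification:

```python
import json

KNOWN_CASES = {"Real v. Precedent, 549 U.S. 101"}  # stand-in ground-truth index

def format_ok(raw: str) -> bool:
    """The check most pipelines stop at: well-shaped JSON."""
    try:
        return isinstance(json.loads(raw).get("citations"), list)
    except (json.JSONDecodeError, AttributeError):
        return False

def facts_ok(raw: str) -> bool:
    """The check that catches the legal-brief failure: every cited case
    must exist in a ground-truth index. A real system would query a
    citation database, not a hardcoded set."""
    return all(c in KNOWN_CASES for c in json.loads(raw)["citations"])

raw = '{"citations": ["Plausible v. Fabricated, 123 F.4th 456"]}'
print(format_ok(raw))  # True: structure validates, automated checks pass
print(facts_ok(raw))   # False: factual degradation caught
```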

Dimension 3: Visibility — User-Facing vs. Internal

An embedding drift that reduces retrieval precision from 0.94 to 0.88 is a real degradation. Whether it's an incident depends on whether it surfaces to users.

Internal-only failures — degraded retrieval scores, slower chain-of-thought traces, lower reranker confidence — matter for the engineering team's situational awareness. They're leading indicators of future user-facing failures. But they're not incidents in the traditional sense: no users are harmed, no trust is lost, and the system still functions.

User-facing failures change the calculus entirely. A chatbot hallucinating a refund policy that the company is then legally pressured to honor is a user-facing incident with real financial consequences. The AI is "working" — it returned a response, it parsed correctly, it hit latency SLOs — but the output caused harm.

The classification question isn't just "did the system return an error?" It's "did the system return output that caused harm, confusion, or erosion of trust for a real user?" If yes, it's an incident. If no, it might be important monitoring data, but it's not a pager event.
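That question translates directly into alert routing. A sketch; the destination names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    user_facing: bool
    harm_observed: bool

def route(signal: Signal) -> str:
    """User-facing harm pages a human; everything else is monitoring
    data. Destination names are illustrative."""
    if signal.user_facing and signal.harm_observed:
        return "page-oncall"   # an incident
    if signal.user_facing:
        return "triage-queue"  # visible, not yet harmful
    return "dashboard"         # leading indicator: watch it

# Retrieval precision drifting 0.94 -> 0.88: real, but internal.
print(route(Signal("retrieval-precision-drop", False, False)))  # dashboard
# Chatbot invents a refund policy a user relies on: incident.
print(route(Signal("hallucinated-refund-policy", True, True)))  # page-oncall
```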

Dimension 4: Damage Profile — Reversible vs. Compounding

This is the dimension most teams underweight, and it's the one that determines whether a 2% regression is background noise or a sev-1.

Reversible failures stop when you fix them. Roll back the prompt, revert the model version, gate the feature — and the damage stops. The Air Canada chatbot hallucinating a bereavement-fare policy that the airline was then forced to honor is a reversible incident: embarrassing, financially limited, but bounded.

Compounding failures keep doing damage after the trigger is removed. The fabricated citations from the opening example are compounding: the brief was filed, the court acted on it, and a bar inquiry followed. Rolling back the model does not un-file the brief. Hallucinated output that gets cached, indexed, quoted, or fed into downstream systems keeps propagating long after the fix ships, and every hour before detection widens the blast radius. That damage profile, more than raw user count or reproduction rate, is what pushes an AI incident to the top of the severity scale.
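Put together, the four dimensions compose into a triage rule. The sketch below follows this post's rules of thumb; the exact mapping is a starting point to tune, not a prescription:

```python
def triage(scope: str, failure_type: str, visibility: str, damage: str) -> str:
    """Compose the four dimensions into a severity. The mapping encodes
    this post's rules of thumb and should be tuned, not copied."""
    if visibility == "internal":
        return "monitor"  # leading indicator, not a pager event
    if damage == "compounding":
        return "sev-0"    # harm keeps accruing after the fix ships
    if failure_type == "factual" and scope == "per-cohort":
        return "sev-1"    # structural and wrong, regardless of user count
    if failure_type == "factual":
        return "sev-2"    # escalate and bound the blast radius
    return "sev-3"        # stylistic drift: baseline it and watch

# The opening incident: user-facing, factual, compounding.
print(triage("per-session", "factual", "user-facing", "compounding"))  # sev-0
```

Which is the answer to the question in the title: a hallucination is a sev-0 when it is user-facing and compounding, regardless of how rarely it reproduces.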