
Debugging AI at 3am: Incident Response for LLM-Powered Systems

10 min read
Tian Pan
Software Engineer

You're on-call. It's 3am. Your alert fires: customer satisfaction on the AI chat feature dropped 18% in the last hour. You open the logs and see... nothing. Every request returned HTTP 200. Latency is normal. No errors anywhere.

This is the AI incident experience. Traditional on-call muscle memory — grep for stack traces, find the exception, deploy the fix — doesn't work here. The system isn't broken. It's doing exactly what it was designed to do. The outputs are just wrong.

A bad generation has no stack trace. It has a probability distribution. And if you haven't retrained your incident response instincts for probabilistic systems, you will spend a miserable hour staring at healthy metrics while your users get bad answers.

Here's how to debug it.

Why AI Incidents Are Structurally Different

A traditional software incident has a root cause you can point to: a nil pointer dereference, a misconfigured firewall rule, a database query missing an index. Something specific happened at a specific line.

AI incidents often have contributing factors, not root causes. The model isn't broken — it's producing outputs that follow from a set of conditions: the prompt you gave it, the context you passed, the examples you included, the sampling parameters you configured, the version of the underlying model the provider is running today. Any of those conditions can shift and cause degraded output without triggering a single error.

Research tracking LLM API stability found that 58.8% of prompt-and-model combinations regress when providers update their underlying models — and 70% of those regressions represent more than a 5% accuracy drop. The model "changed" without anyone touching your deployment.

This creates three diagnostic challenges that traditional incident response doesn't have:

Non-reproducibility by design. Temperature > 0 means the same input produces different outputs. You can't reliably reproduce the exact bad response a user got. You can only characterize the distribution of responses.

Silent success signals. Your infrastructure monitoring sees 200s all the way down. The only signal something is wrong comes from downstream quality metrics — user feedback, task success rates, output scoring pipelines — which you may or may not have set up.

Ambiguous blame assignment. When the model produces a wrong answer, was it a model failure? A prompt defect? A retrieval failure that gave the model bad context? A parameter misconfiguration? Each cause requires a different fix.
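
The silent-success gap is the one you can close yourself: score each response and emit the score as a metric, so degraded output trips an alert even while HTTP stays green. Here's a minimal sketch with placeholder heuristics — a real system would use task-specific checks or an LLM judge, and the names below are illustrative, not from any particular framework:

```python
# Hypothetical quality signal: score each response and emit the score as a
# metric, so output degradation is visible even when every request is a 200.
# The heuristics and the metric sink below are placeholders.

def score_response(response: str) -> float:
    """Crude scorer: 0.0 = clearly bad, 1.0 = plausibly fine."""
    if not response.strip():
        return 0.0                                  # empty output
    if "as an AI language model" in response:
        return 0.2                                  # canned refusal
    if len(response) < 20:
        return 0.5                                  # suspiciously short
    return 1.0

def emit_metric(name: str, value: float) -> None:
    print(f"{name}={value:.2f}")                    # stand-in for StatsD etc.

def handle_chat(user_input: str, generate) -> str:
    response = generate(user_input)
    emit_metric("ai_chat.response_quality", score_response(response))
    return response
```

It doesn't matter that the scorer is crude. What matters is that a quality time series exists at all, because it's the only place an AI incident will show up.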

The Triage Decision Tree

When an AI incident fires, work through this sequence before concluding anything:

Step 1: Is this sampling variance or a systematic problem?

Take a representative failure input and run it through the model 10 times. If roughly half the runs produce good outputs and half produce bad ones, you're in variance territory. If every run produces a bad output, you have a systematic failure.
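
A minimal sketch of that probe, assuming hypothetical `generate` and `is_acceptable` helpers for your model call and output quality check:

```python
# Hypothetical triage probe: distinguish sampling variance from a
# systematic failure by re-running one failing input n times.

def variance_probe(generate, is_acceptable, failing_input: str, n: int = 10):
    outcomes = [is_acceptable(failing_input, generate(failing_input))
                for _ in range(n)]
    good = sum(outcomes)
    if good == 0:
        verdict = "systematic failure: every run is bad"
    elif good == n:
        verdict = "not reproducible on this input: widen the sample"
    else:
        verdict = f"sampling variance: {good}/{n} runs acceptable"
    return good, n - good, verdict
```

Run it with your production sampling parameters. Dropping temperature to make the test "stable" would hide exactly the variance you're trying to measure.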

This single step saves enormous time. Variance issues are addressed differently than systematic ones — and many false alarms at 3am are actually variance events that happened to cluster in the last monitoring window.

Step 2: Did anything change in the last deployment cycle?

Check in order:

  • Prompt version deployed (even a wording change counts)
  • Model version or provider endpoint changed
  • Retrieval pipeline modified (if using RAG)
  • Upstream data sources updated
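
A cheap way to make the first checks on that list instant is to log a fingerprint of every generation-relevant setting with each request, so "did anything change?" becomes a diff instead of an archaeology dig. A sketch, with all field names illustrative:

```python
# Hypothetical generation fingerprint: hash every condition that can shift
# output quality, and log it alongside each request. Field names are
# illustrative, not from any particular framework.
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class GenerationConfig:
    prompt_version: str        # bump on any wording change
    model: str                 # provider model identifier
    endpoint: str
    temperature: float
    retrieval_index: str       # empty string if not using RAG

    def fingerprint(self) -> str:
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]

yesterday = GenerationConfig("v41", "gpt-4o-2024-08-06",
                             "api.openai.com", 0.7, "idx-2024-09-01")
today = GenerationConfig("v42", "gpt-4o-2024-08-06",
                         "api.openai.com", 0.7, "idx-2024-09-01")
# Mismatched fingerprints point straight at the field that changed.
print(yesterday.fingerprint(), today.fingerprint())
```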

Systematic failures almost always have a recent change correlated with them. The change might not be yours — providers update their base models without announcement — but something changed.
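
The provider-side case is the one a fingerprint can't catch: the model identifier stays the same while the weights behind it change. One hedge is a pinned canary eval that re-runs a small fixed test set on a schedule. A sketch, with illustrative test cases and thresholds:

```python
# Hypothetical canary eval: a small pinned test set re-run on a schedule,
# so a silent provider-side model update shows up as a metric drop instead
# of a 3am mystery. Cases and thresholds here are illustrative.

PINNED_CASES = [
    {"input": "What is our refund window for damaged items?",
     "must_contain": "30 days"},
    {"input": "Do we ship to Canada?", "must_contain": "yes"},
]

def run_canary(generate, baseline: float, tolerance: float = 0.05) -> float:
    passed = sum(
        case["must_contain"].lower() in generate(case["input"]).lower()
        for case in PINNED_CASES
    )
    accuracy = passed / len(PINNED_CASES)
    if accuracy < baseline - tolerance:
        print(f"ALERT: canary accuracy {accuracy:.0%}, baseline {baseline:.0%}")
    return accuracy
```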

Step 3: Is this a prompt problem or a model problem?

This distinction matters because they have completely different remediation paths. A prompt problem you can fix tonight. A model capability gap requires a different strategy.

Signs it's a prompt problem:

  • Failures cluster around a specific input pattern or topic domain
  • The model can produce the correct answer when you rephrase the input manually
  • In-context examples are outdated, contradictory, or no longer representative
  • The instruction is ambiguous, or the task description is buried late in a long context

Signs it's a model problem:

  • Failures occur uniformly across all input patterns
  • Manual rephrasing doesn't help
  • The model was never accurate at this task, even in early testing
  • A provider model update correlates with the regression onset
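
The rephrasing signal on both lists is the cheapest to automate. Here's a sketch that replays each failing input in rephrased form, assuming hypothetical `generate`, `rephrase`, and `is_acceptable` helpers; the 70%/20% cutoffs are illustrative, not calibrated:

```python
# Hypothetical prompt-vs-model discriminator: if rephrasing rescues most
# failures, the capability exists and the prompt is the likely defect.
# Helper names and cutoff thresholds are illustrative.

def classify_failure(generate, rephrase, is_acceptable, failing_inputs):
    rescued = sum(
        is_acceptable(original, generate(rephrase(original)))
        for original in failing_inputs
    )
    rate = rescued / len(failing_inputs)
    if rate >= 0.7:
        return f"likely prompt problem ({rescued}/{len(failing_inputs)} rescued)"
    if rate <= 0.2:
        return f"likely model problem (only {rescued} rescued by rephrasing)"
    return "mixed signal: segment failures by input pattern and re-test"
```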

Step 4: For RAG systems, rule out retrieval failure first

Retrieval failures are especially insidious because the model receives bad context, generates a confident response based on that context, and everything looks fine to infrastructure monitoring. Seven documented failure modes exist in retrieval pipelines, but the most common at runtime are:

  • Retrieved chunks that are technically relevant but missing the specific fact needed
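
Whichever mode you suspect, the first move is the same: pull the chunks that were actually retrieved for a failing query and read them before blaming the model. A sketch, assuming a hypothetical `retriever.search` returning objects with `text` and `score` fields:

```python
# Hypothetical retrieval audit: check whether the context handed to the
# model contained the fact it needed. The `retriever.search` call and the
# chunk fields are illustrative, not a specific library's API.

def audit_retrieval(retriever, failing_query: str, needed_fact: str, k: int = 5):
    chunks = retriever.search(failing_query, top_k=k)
    fact_retrieved = False
    for i, chunk in enumerate(chunks):
        hit = needed_fact.lower() in chunk.text.lower()
        fact_retrieved = fact_retrieved or hit
        print(f"[{i}] {'HIT ' if hit else 'miss'} "
              f"score={chunk.score:.3f} {chunk.text[:80]!r}")
    if fact_retrieved:
        print("-> fact reached the model; look at prompt assembly or generation")
    else:
        print("-> retrieval failure: the needed fact never reached the model")
```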