
LLM Self-Debugging: When the Explanation Is the Signal vs. When It's the Lie

8 min read
Tian Pan
Software Engineer

When your LLM agent fails, the most tempting thing in the world is to ask it why. It will answer fluently, specifically, and with what feels like self-awareness. It might say: "I misunderstood the user's intent and retrieved documents about X when I should have targeted Y." That sounds exactly like a root cause. You write it down, open the prompt editor, and spend forty minutes chasing the wrong problem.

This is the central trap of LLM self-debugging. The model's explanation and the model's actual failure mechanism are two different things. Sometimes they overlap. Often they don't. Knowing which situation you're in before you act on the explanation is the discipline that separates fast debugging from expensive detours.

The Faithfulness Gap Is Real and Large

Recent causal analysis work has put numbers on something practitioners feel intuitively: LLMs generate explanations that sound faithful to their reasoning process but often aren't causally connected to how the output was produced.

The finding is that "in-distribution faithfulness" (whether the explanation is consistent with the model's typical behavior) runs significantly higher than "strong faithfulness" (whether the explanation reflects what actually drove the specific output). On fact-checking tasks, in-distribution consistency reaches 74%, but strong faithfulness drops to 27%. That 47-point gap means the explanation looks right roughly three-quarters of the time, but reflects what actually drove the answer only about a quarter of the time.

The implication is not that LLM explanations are useless. It's that they have a specific failure mode: they reflect the model's training-time behavioral patterns rather than the decision mechanism that fired in this particular instance. The model explains what it would typically do when it produces output like this, not what it actually did to produce this specific output.

Three Cases Where the Explanation Is Genuine Signal

Despite the faithfulness gap, asking an LLM to explain a failure is genuinely useful in specific conditions. These are worth distinguishing precisely.

Trace summarization. When you have a multi-step agent execution log with dozens of tool calls and intermediate outputs, the LLM is doing something closer to reading comprehension than introspection. It's summarizing observed evidence, not explaining its internal reasoning. This is the mode where the explanation tracks reality well — the model is working from concrete artifacts, not inferring backwards from its output. Feed it the full trace and ask what went wrong, and you're likely to get a useful answer because the answer can be verified against the trace itself.

Error correlation across tool calls. An LLM can often identify structural patterns in a failure sequence that a human would take longer to spot: "The search tool returned an empty result set on step 3, and from that point on every retrieval assumed the user's query had no relevant documents." This is pattern recognition in a structured timeline, not causal introspection. The model is pointing at the data; you can verify the observation directly.

Intent disambiguation. When the failure looks like the agent misunderstood the goal, asking the model to reconstruct what it understood the task to be is often accurate. This is because understanding the task is a high-salience, early step that leaves clear traces in the context. The model is recalling something it explicitly computed, not rationalizing.

All three of these share a property: the explanation can be independently verified against artifacts that exist. The danger case is when it can't.
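
To make the verifiable mode concrete, here is a minimal sketch of trace summarization as a debugging prompt. The `call_llm` helper and the trace format are placeholders for whatever client and agent framework you actually use; the important part is the constraint that every claim must cite a step in the trace, which keeps the answer checkable against the artifacts.

```python
# A minimal sketch of trace summarization as a debugging prompt.
# `call_llm` is a placeholder for your model client; the trace schema
# below is illustrative, not any real framework's format.
import json

def call_llm(prompt: str) -> str:
    """Placeholder: send a prompt to your model and return the text reply."""
    raise NotImplementedError

def summarize_failure(trace: list[dict]) -> str:
    """Ask the model what went wrong, constrained to claims it can point
    to in the trace itself (the mode where explanations track reality)."""
    prompt = (
        "Below is the full execution trace of a failed agent run.\n"
        "Identify what went wrong. Every claim you make must cite a\n"
        "specific step number from the trace; do not speculate about\n"
        "steps that are not shown.\n\n"
        + json.dumps(trace, indent=2)
    )
    return call_llm(prompt)

# Example trace: each entry is one tool call with its observed output.
trace = [
    {"step": 1, "tool": "search", "args": {"query": "Q3 revenue"}, "output": "[]"},
    {"step": 2, "tool": "answer", "output": "No relevant documents were found."},
]
# print(summarize_failure(trace))
```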

The Confabulation Mode

The dangerous alternative is what neuroscientists call confabulation — generating coherent, confident explanations that aren't grounded in the actual causal sequence. Unlike hallucination (producing false facts), confabulation involves the model creating a narrative that fits the output without tracking how the output was produced.

A study on LLM reasoning traces found that when models make errors in multi-step reasoning, their explanations of those errors often identify the wrong step or mischaracterize the error type. The explanations are internally consistent and sound authoritative. They're also wrong about the mechanism.

This happens most reliably in one specific scenario: when you ask the model to explain something that required genuine computation — arithmetic, index lookups, logical inference across multiple premises — and the computation failed silently. The model knows the answer it produced. It generates an explanation that justifies that answer. If the underlying computation was wrong, the explanation justifies the wrong answer coherently, because the explanation process is separate from the computation process.

There's also a named vulnerability that matters for debugging RAG systems specifically. Research on "anchored confabulation" shows that partially confirming a multi-hop reasoning chain doesn't gradually increase accuracy — it can increase confident wrong completions. If you retrieve one correct intermediate fact in a three-hop chain, the model becomes more likely to confabulate the remaining hops with false confidence than if you had retrieved nothing at all. Partial evidence is worse than no evidence in some configurations.

The Perturbation Check: Your Primary Validation Tool

The most reliable way to distinguish signal from confabulation is the perturbation check. Deliberately corrupt a specific part of the model's context or reasoning chain and observe whether the explanation changes.

If the model's explanation identified "retrieved document 3 was irrelevant and caused the hallucination," remove document 3 from the context and re-run. Did the failure pattern change? If yes, the explanation had causal validity. If the same failure appears with different rationalizations, the explanation was decorative — the model is generating plausible post-hoc narratives, not tracking causation.

The same principle applies to chain-of-thought: corrupt an intermediate reasoning step that the model identified as critical and observe whether the final output changes. If the output is stable across corruptions of the "critical" step, that step wasn't actually causal. You've identified a rationalization pattern, not a bug.
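
Here is a minimal sketch of the document-removal variant. The `run_agent` and `classify_failure` helpers are placeholders for your own agent invocation and failure labeling; the shape of the check is what matters: re-run with the suspected cause removed and compare failure signatures.

```python
# A minimal sketch of the perturbation check. `run_agent` and
# `classify_failure` are placeholders for your own agent call and
# failure-labeling logic.
from copy import deepcopy

def run_agent(query: str, context: list[str]) -> str:
    raise NotImplementedError  # your agent invocation goes here

def classify_failure(output: str) -> str:
    raise NotImplementedError  # e.g. "hallucinated_entity", "refusal", "ok"

def perturbation_check(query: str, context: list[str], suspect_idx: int) -> bool:
    """Return True if removing the suspect document changes the failure
    pattern, i.e. the model's explanation had causal validity."""
    baseline = classify_failure(run_agent(query, context))

    perturbed_context = deepcopy(context)
    del perturbed_context[suspect_idx]  # corrupt the suspected cause
    perturbed = classify_failure(run_agent(query, perturbed_context))

    # Same failure with the "cause" removed means the explanation was decorative.
    return perturbed != baseline
```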

This check is more expensive than accepting the first explanation, but it's substantially cheaper than acting on a wrong diagnosis for an hour.

For-and-Against Prompting for Root Cause Analysis

When you need the model to identify root causes in a complex failure, a technique from formal verification debugging significantly improves signal quality: explicitly require the model to generate arguments both for and against each candidate root cause.

Standard prompting optimizes for coherence, which means the model commits to one explanation and elaborates it. When you add the instruction to also argue against each hypothesis, the model is forced to surface counter-evidence it would otherwise suppress. The result is a suspicion score over multiple candidates rather than a single confident explanation.

The concrete implementation: for each component or step in the failing pipeline, ask the model to:

  1. State evidence pointing toward this component as the root cause
  2. State evidence pointing away from it
  3. Assign a suspicion score from 0 to 1
  4. Classify whether this is a root cause or a downstream symptom

This takes longer, but it produces more reliable diagnoses than single-pass explanations, especially when the true root cause is counterintuitive.
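
A minimal sketch of that recipe as a prompt loop, again with `call_llm` as a placeholder client and the JSON response format as an assumption for illustration:

```python
# A minimal sketch of for-and-against prompting for root cause analysis.
# `call_llm` is a placeholder client; the component names and the JSON
# shape requested from the model are assumptions, not a real API.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError

def score_candidates(failure_description: str, components: list[str]) -> list[dict]:
    """For each candidate component, force arguments both for and against
    it being the root cause, then collect suspicion scores."""
    results = []
    for component in components:
        prompt = (
            f"A pipeline failed with the following symptom:\n{failure_description}\n\n"
            f"Candidate root cause: {component}\n"
            "1. List evidence pointing TOWARD this component as the root cause.\n"
            "2. List evidence pointing AWAY from it.\n"
            "3. Give a suspicion score between 0 and 1.\n"
            "4. Say whether this is a root cause or a downstream symptom.\n"
            'Reply as JSON: {"for": [...], "against": [...], '
            '"suspicion": 0.0, "verdict": "root_cause" or "symptom"}'
        )
        verdict = json.loads(call_llm(prompt))
        verdict["component"] = component
        results.append(verdict)
    # Highest suspicion first; the ranking across candidates, not any
    # single score, is the useful output.
    return sorted(results, key=lambda r: r["suspicion"], reverse=True)
```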

Boundary Conditions That Predict Which Mode You're In

Before you act on any LLM explanation of a failure, run through this checklist:

The explanation is probably reliable when:

  • The failure happened in a step with external artifacts (retrieved documents, API responses, tool outputs) that can be inspected directly
  • The model is summarizing observed behavior rather than explaining internal computations
  • You can trace the claim in the explanation to a specific piece of evidence in the context
  • The explanation points to something discrete and testable ("this document was empty") rather than abstract ("there was ambiguity in the goal")

The explanation is probably confabulation when:

  • The failure happened in a pure computation step — arithmetic, counting, logical inference across many premises
  • The explanation is highly specific about mechanism but you can't find the supporting evidence in the context
  • The model agrees with whatever alternative explanation you propose
  • Running the same query a few times produces different explanations with similar confidence

The last signal is particularly diagnostic. If the explanations vary but the confidence stays high, the model is sampling from a space of plausible rationalizations rather than reporting a stable causal observation.
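
A minimal sketch of that variance check, with `call_llm` as a placeholder and a deliberately crude word-overlap similarity; anything more sophisticated works the same way:

```python
# A minimal sketch of the variance check from the last bullet: sample the
# same explanation request several times and see whether the answers agree.
# `call_llm` is a placeholder; word overlap is a crude similarity measure
# meant only to flag obvious divergence.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

def explanation_stability(failure_prompt: str, n: int = 5) -> float:
    """Return average pairwise word overlap across n sampled explanations.
    Low overlap paired with confident wording is a confabulation signal."""
    explanations = [call_llm(failure_prompt) for _ in range(n)]
    word_sets = [set(e.lower().split()) for e in explanations]
    overlaps = []
    for i in range(n):
        for j in range(i + 1, n):
            union = word_sets[i] | word_sets[j]
            if union:
                overlaps.append(len(word_sets[i] & word_sets[j]) / len(union))
    return sum(overlaps) / len(overlaps) if overlaps else 1.0
```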

The Debugging Discipline

The operational conclusion is not to stop asking LLMs to explain their failures — it's to treat those explanations as hypotheses rather than diagnoses.

When the explanation comes back, your next move should be to design a test for it, not to implement a fix for it. This is a minor shift in workflow that eliminates most of the wasted debugging time. The explanation tells you where to look; the perturbation check tells you whether you're looking in the right place.

The most expensive debugging mistakes come from treating a fluent, confident explanation as established fact. LLMs are extremely good at generating the explanation you expect, which means confirmation bias runs high in this loop. If you suspect the retrieval step failed, the model will explain how the retrieval step failed, coherently and in detail, regardless of whether that's actually what happened.

Build the check into the workflow: explanation, then perturbation, then fix. The validation step costs a few extra queries. Skipping it costs a lot more.
