
The Hollow Explanation Problem: When Your Model's Reasoning Is Decoration, Not Evidence

11 min read
Tian Pan
Software Engineer

A loan-review tool flags an application. The reviewer clicks "explain" and gets four neat bullet points: income volatility over the last six months, credit utilization above 70%, a recent address change, two thin-file dependents. The rationale reads like something a careful underwriter would write. The reviewer approves the override and moves on.

The uncomfortable part: the model never used those signals to make the decision. They appeared in the explanation because they were the kind of factors that would justify a flag — not because the flag came from them. The actual computation was a narrow latent-feature pattern that the model can't articulate, plus a few correlations the explanation never mentions. The bullets are post-hoc rationalization, written to be credible rather than to be true.

This is the hollow explanation problem, and it is not the same as hallucination. Every individual claim in that explanation may be factually correct. The user's question — why did you decide that? — is the one being answered falsely.

Post-hoc rationalization is the failure mode hiding in plain sight

Hallucination has dominated the LLM-trust conversation for two years, and most engineering teams have at least some defense against it: retrieval grounding, citation requirements, output validators. The hollow-explanation failure passes through all of those filters. Each cited factor in the underwriting example may be real (the applicant did change addresses recently), the explanation may be internally consistent, and the answer it justifies may even be correct. None of that makes the explanation an explanation.

A 2025 line of work makes the distinction explicit by separating chain-of-thought into two roles. CoT-as-computation is the case where the model's intermediate tokens are part of solving the task — remove them and the answer changes. CoT-as-rationalization is the case where the answer is essentially decided by the underlying weights and the visible chain is generated to justify it after the fact. Production reasoning models slide between these modes invisibly, and there is no API field that tells you which mode you got.

The empirical numbers from frontier-model evaluations are sobering. When researchers planted hints that biased a model's answer and then asked it to explain its reasoning, even the strongest reasoning models verbalized the hint in fewer than 20% of cases where the hint had clearly influenced the output. Earlier non-thinking models showed implicit post-hoc rationalization rates as high as 13% on simple coherence checks like asking "is X bigger than Y?" and "is Y bigger than X?" back-to-back — the model would systematically answer the same way to both and produce a plausible-sounding argument for each. The argument was not the cause; the bias was.

The deployment consequence: if your user-facing surface displays a generated rationale next to a model decision, you are showing the user something whose relationship to the actual computation is not what either of you assumes.

Why faithful explanations are harder than they look

The intuitive fix — "make the model explain its reasoning more carefully" — fights the architecture. A transformer's forward pass is a thick parallel computation across attention heads and MLP features; the output token sequence is a serialization of that computation, but it is not a transcript of it. Asking the model to explain itself produces another forward pass over the question and the answer, which generates plausible-looking reasoning conditioned on both. There is no privileged channel for "what actually happened in the previous forward pass," because the model that's explaining has no special access to its own internals beyond what was already on its tape.

This is why reinforcement-learning approaches to faithfulness plateau quickly. Anthropic's evaluations found that RL training initially boosted CoT faithfulness by 63% on MMLU and 41% on GPQA in relative terms, then plateaued at absolute faithfulness rates of 28% and 20% respectively, well below useful thresholds. The training signal can encourage the model to mention certain factors but cannot force those mentions to causally track its computation. The model learns to verbalize honestly when the prompt makes verbalizing honestly the locally-rewarded behavior, and learns to skip the verbalization the rest of the time.

Mechanistic interpretability research from the same period reinforces the point from the other direction. Attribution-graph analyses of model internals show that the latent computation behind a given answer often involves features that have no clean linguistic label, while the verbal explanation invokes labels the model has seen co-occur with similar answers in training. The verbalization is plausible for an outside observer, not faithful for an inside one.

The honest framing for product teams: the explanation surface is not a window into the model. It is another generation, with the same failure modes as the first one, conditioned on the previous output.

The CoT-as-computation safe harbor (and its narrow scope)

There is one mode where chain-of-thought explanations carry real evidential weight: when the task genuinely requires multi-step computation that would not fit in a single forward pass. Multi-digit arithmetic, multi-hop retrieval composition, certain planning problems. Here the visible reasoning is load-bearing — the model literally needs the intermediate steps to reach the answer. Edit a step and the conclusion shifts. Remove the steps entirely and accuracy collapses. In this mode the chain is the computation, and reading it tells you something real.

This safe harbor matters because it tempts teams to over-generalize. A team sees that CoT helps on math benchmarks, observes the model produces fluent reasoning on its product surface, and concludes that the reasoning is similarly load-bearing for their use case. But most production tasks live in the other regime: classification, summarization, retrieval-grounded answers, tool-call selection. For these, the answer is mostly determined by the prompt-conditioned forward pass, and the CoT (if any) is decorative — added because the team wanted users to see "reasoning" or because some eval rewarded longer outputs.

The decision rule worth internalizing: ask whether ablating the intermediate text would change the answer. If yes, the chain is doing computational work and bears some evidential weight. If no, it is decoration, and surfacing it as "reasoning" misleads users about what they are seeing.
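
One way to operationalize that rule is an ablation probe: run each input once with room for intermediate reasoning and once with the reasoning suppressed, and count how often the final answer moves. A minimal sketch, assuming a generic complete(prompt) callable that wraps whatever completion API you use; the prompts and the answer-extraction logic are placeholders, not a standard.

```python
from typing import Callable

# Minimal chain-of-thought ablation probe. `complete` is a placeholder for
# whatever completion call your stack provides; prompts are illustrative.

def final_answer(completion: str) -> str:
    """Pull the final answer from a completion; here, simply the last non-empty line."""
    lines = [line.strip() for line in completion.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def ablation_rate(questions: list[str], complete: Callable[[str], str]) -> float:
    """Fraction of questions where suppressing intermediate reasoning changes the answer.

    A high rate suggests the chain is doing computational work (CoT-as-computation);
    a rate near zero suggests the visible reasoning is decorative.
    """
    changed = 0
    for question in questions:
        with_cot = complete(
            f"{question}\nThink step by step, then put the final answer on the last line."
        )
        without_cot = complete(
            f"{question}\nAnswer with only the final answer on a single line. Do not explain."
        )
        changed += int(final_answer(with_cot) != final_answer(without_cot))
    return changed / len(questions) if questions else 0.0
```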

A related caveat: even when CoT is computationally load-bearing, it can still be wrong about its own internal states. A model that genuinely needs three steps to solve a problem may take those steps and then describe them slightly inaccurately in the visible tokens — for example, computing one number internally, writing a different number in the trace, and then producing the correct final answer anyway. The fact that the chain helps the model reach the right answer does not mean the chain accurately reports what the model did to get there.

Design patterns that preserve trust without faking explainability

The teams that have wrestled with this in production have converged on a small set of patterns that respect what the model can and cannot honestly tell its users.

Surface uncertainty before reasoning, not as a postscript. Most products show a confident answer first and surface uncertainty (if at all) as a hedge tucked at the bottom. Inverting that order changes how users read the rest. Calibrated abstention — "I don't have enough signal to answer this confidently" — when displayed as a first-class output rather than a fallback, lets the model decline gracefully on the queries where any explanation would be confabulated. A 2025 survey of abstention methods catalogs how few production systems actually expose this surface, even when their underlying models support it.
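
What this can look like concretely: make abstention a field in the output contract and render it ahead of any answer. A minimal sketch, assuming a JSON output schema of my own devising; the field names and instruction text are illustrative, not an existing API.

```python
import json

# Illustrative abstention-first output contract; the schema and instruction
# text below are assumptions, not a standard API.
OUTPUT_CONTRACT = (
    "Respond with a JSON object with exactly these fields: "
    '"can_answer" (true or false), '
    '"signal_note" (one sentence on how much signal you actually have), '
    '"answer" (the answer, or an empty string if can_answer is false). '
    "Set can_answer to false when you do not have enough signal to answer reliably."
)

def render(raw_completion: str) -> str:
    """Render uncertainty before the answer, and treat abstention as a normal outcome."""
    payload = json.loads(raw_completion)
    if not payload.get("can_answer", False):
        return f"Not enough signal to answer this confidently. {payload.get('signal_note', '')}".strip()
    # Lead with the calibration note, then the answer -- not the reverse.
    return f"{payload['signal_note']}\n\n{payload['answer']}"
```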

Attribute to evidence, not to internal reasoning. When the answer comes from retrieved documents, cite the documents and let the user trace the claim. This shifts the explanation burden from "what did the model think?" (unanswerable honestly) to "what evidence did the system use?" (verifiable). Provenance-grounded answers are not perfect — the model may still ignore the cited evidence and pattern-match on training priors — but at least the user can spot-check the relationship between cited source and stated claim.

Show the artifacts, not the rationalization. If the model used tools, show the tool calls and their outputs. If it issued a database query, show the query. If it consulted a policy document, show the snippet. These artifacts are causal — the model's answer was shaped by them in a way that is auditable. A list of tool calls is a more honest explanation of "why" than a generated paragraph of reasoning, because you can verify the tool calls happened and inspect what they returned.
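
Putting the last two patterns together, the explanation payload the UI receives can be built entirely from things that verifiably happened. A minimal sketch; the class and field names are illustrative assumptions, not an existing schema.

```python
from dataclasses import dataclass, field

# Illustrative explanation payload assembled from auditable artifacts rather
# than generated narrative. Names are assumptions, not an existing schema.

@dataclass
class ToolCallRecord:
    tool: str        # e.g. "sql_query" or "policy_lookup"
    arguments: dict  # what the model actually sent
    output: str      # what came back; this verifiably shaped the answer

@dataclass
class CitedSnippet:
    source_id: str   # the document the snippet came from
    text: str        # exact text the user can check the claim against

@dataclass
class ExplanationPayload:
    answer: str
    tool_calls: list[ToolCallRecord] = field(default_factory=list)
    evidence: list[CitedSnippet] = field(default_factory=list)
    # Deliberately no free-text "reasoning" field: anything surfaced as "why"
    # should be something the system can verify actually happened.
```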

"I don't know why" as a first-class output. This is the pattern teams resist hardest, because it feels like an admission of weakness. But the model that says "this matched a pattern I can't articulate cleanly — here's the answer, here are some adjacent factors that might be related, treat the explanation as exploratory" is being honest in a way that the model producing four crisp bullets is not. Users tolerate epistemic humility better than they tolerate rationalization that doesn't survive a stress test.

Pin the explanation surface to what the system can verify. A rule of thumb: don't expose a "why" surface for a class of decisions where you can't audit the explanation against the computation. For low-stakes recommendations, generated explanations are fine — the cost of a hollow rationalization is low. For decisions that influence user behavior in ways the user might later challenge (lending, hiring, medical triage, content moderation), the explanation surface needs to reference auditable inputs (tool calls, retrieved documents, structured features) rather than generated narrative.
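
That rule of thumb is small enough to encode directly. A hedged sketch of one way to gate the surface; the decision classes and mode names are assumptions, not a prescribed taxonomy.

```python
# Illustrative gate for the "why" surface; decision classes and mode names
# are assumptions, not a prescribed taxonomy.
HIGH_STAKES = {"lending", "hiring", "medical_triage", "content_moderation"}

def explanation_mode(decision_class: str, has_auditable_artifacts: bool) -> str:
    """Decide what the explanation surface for a decision class is allowed to show."""
    if decision_class in HIGH_STAKES:
        # High stakes: only auditable inputs; no generated narrative at all
        # if there is nothing to audit the explanation against.
        return "artifacts_only" if has_auditable_artifacts else "no_why_surface"
    # Low stakes: a generated rationale is acceptable, labeled as exploratory.
    return "generated_rationale"
```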

The trust math: why hollow explanations erode trust faster than admitted limits

The product cost of getting this wrong is not abstract. When users discover — and they do discover, especially when the stakes matter — that a model's explanation didn't actually drive its decision, the trust loss is steeper than the trust loss from a model that admitted uncertainty up front. The reason is calibration. A user who is told "I'm not sure why this matched, but here it is" forms a calibrated expectation: this system is useful but provisional, and I should verify before acting on its output. A user who is told "I flagged this because of A, B, and C," then later finds that A, B, and C had nothing to do with the flag, updates not just on this answer but on every prior explanation the system gave them. Trust falls off a cliff instead of sliding down a gentle slope.

There is a downstream effect that compounds the damage: rationalizations are persuasive in exactly the wrong direction. They are coherent enough that users build mental models of how the system works based on them. When the rationalizations turn out to be hollow, those mental models were also wrong, and the user has to discard months of accumulated heuristics. The cost is not "this answer was wrong" — it is "I no longer know what this system is doing." That is the failure mode that ends adoption.

A team that ships an explanation surface should ask, at minimum: what would happen if a regulator or a thoughtful user audited the relationship between our explanations and our actual computation? If the answer is "we'd be embarrassed," the surface is doing more damage than good, even when individual users feel confident reading it.

What to do with this on Monday

Three concrete moves for a team currently shipping generated explanations:

  1. Audit one decision class. Pick a single product surface where the model produces both an answer and a reason. Run a sample of those answers through a counterfactual probe — does the same input with one cited factor altered produce a different answer? If not, the explanation is decorative and you should know the rate. (A minimal sketch of this probe follows the list.)

  2. Replace generated reasoning with surfaced artifacts where possible. Tool-call traces, retrieval snippets, structured-feature attributions. These cost more product real estate than a paragraph of generated text, but they give the user something causal to verify.

  3. Add an "I'm uncertain" path that is genuinely accessible. Most production prompts route the model toward producing an answer. Inverting one branch — measuring how often the model would abstain if abstention were equally rewarded — gives you a calibration baseline you don't currently have.
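
For step 1, the probe can be as small as a loop over logged decisions. A minimal sketch; perturb_factor and decide are placeholders for your own feature-perturbation and re-scoring logic, and the metric is illustrative.

```python
from typing import Callable, Iterable, Tuple

# Illustrative counterfactual probe for step 1. `perturb_factor` and `decide`
# are placeholders for your own perturbation and re-scoring logic.

def counterfactual_flip_rate(
    cases: Iterable[Tuple[dict, str]],
    perturb_factor: Callable[[dict, str], dict],
    decide: Callable[[dict], str],
) -> float:
    """Fraction of cases where altering a cited factor actually changes the decision.

    Each case pairs a logged model input with one factor the explanation cited.
    A low flip rate means the cited factors are largely decorative.
    """
    flipped = 0
    total = 0
    for model_input, cited_factor in cases:
        original = decide(model_input)
        altered = decide(perturb_factor(model_input, cited_factor))
        flipped += int(original != altered)
        total += 1
    return flipped / total if total else 0.0
```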

The architectural realization underneath all of this: the model's explanation is a separate generation, not a window into the first one. Treat it that way. Ship explanation surfaces that the computation can stand behind, and reserve the word "reasoning" for the cases where the chain is doing actual work. Anything else is decoration, and decoration that pretends to be evidence erodes user trust faster than admitting where the limits are.
