Why Your RAG Citations Are Lying: Post-Hoc Rationalization in Source Attribution
Show a user an AI answer with a link at the end of each sentence, and the needle on their trust meter swings halfway across the dial before they have read a single cited passage. That is the whole marketing pitch of enterprise RAG: "grounded," "sourced," "verifiable." It is also the most-shipped, least-tested claim in AI engineering. Recent benchmarks find that between 50% and 90% of LLM responses are not fully supported — and sometimes contradicted — by the sources they cite. On adversarial evaluation sets, up to 57% of citations from state-of-the-art models are unfaithful: the model never actually used the document it is pointing at. The citation was attached after the fact, to rationalize an answer the model had already decided to give.
This is not a retrieval bug. You can have perfect retrieval and still get lying citations, because the failure is architectural. The generator writes prose first and stitches links on second. The links look like evidence. They are decoration.
The industry has been so focused on whether cited documents are relevant that it has skipped past a more uncomfortable question: does the cited span actually entail the claim it is attached to? The answer, at production scale, is frequently no. And the more polished your UI makes the citations look — footnote superscripts, hoverable previews, colored highlights — the more decisively users stop checking.
Correctness Is Not Faithfulness
The research community is finally drawing a line between two things that enterprise RAG products treat as one: citation correctness and citation faithfulness.
- Correctness asks: does the cited document support the statement? You can measure this with a natural language inference (NLI) model asking "does passage P entail claim C?"
- Faithfulness asks: did the model actually derive the claim from the cited document, or did it generate the claim from parametric memory and then hunt for a passage that looks compatible?
A post-rationalized citation is indistinguishable from a faithful one at the output level. It may even be technically correct — the passage really does support the claim — but the model ignored the passage when generating the answer. That makes the whole chain of trust a lie of omission. The user (or the downstream agent) assumes the retrieved evidence drove the answer. It did not. The model's pre-training drove the answer, and retrieval was theater.
This matters because the failure mode is silent. If your generator confidently asserts something plausible, attaches a real-looking citation, and the cited span does loosely relate to the topic, no amount of "check the sources" UI will catch it. Humans skim. Agents treat a passing link as confirmation. The hallucination is laundered through the citation step.
How Architecture Bakes In the Lie
Look at how most RAG pipelines are wired and the post-rationalization becomes predictable, almost inevitable.
The dominant pattern is generate-then-retrieve-then-cite or retrieve-then-generate-then-cite. In both, retrieval runs, generation runs, and a third step — often a separate prompt, sometimes a separate model — assigns citations to the already-written text. By the time the citation step runs, the generator has no mechanical connection to any specific passage. It chose tokens based on the blended distribution of (prompt instructions) × (parametric memory) × (loosely attended retrieved context). The citer then does the only thing it can: similarity-match each sentence of the output to the nearest chunk of retrieved context. "Nearest" is not "causal."
That architectural seam is where faithfulness dies. Recent work comparing generation-time citation (G-Cite) against post-hoc citation (P-Cite) finds the tradeoff baked into the design: P-Cite achieves higher citation coverage (it can find some passage that matches almost any claim) but lower semantic precision, while G-Cite commits to evidence during decoding and is stingier about what it will cite at all. On the FEVER fact-verification task, G-Cite hit 94% correctness with 27% coverage; P-Cite balanced at 75%/75%. Coverage is what marketing wants — every sentence gets a footnote. Precision is what users need.
The other architectural culprit is chunk-level retrieval paired with sentence-level citation. Your retriever returns a 512-token chunk. Your generator writes a sentence. The citer pins the chunk ID to the sentence. The chunk contains twelve claims; only one of them (maybe) supports the written sentence. The user sees "[3]" and clicks; they land on a paragraph containing the keyword; their brain files the claim as "grounded." Nobody verified that the specific sentence they read is entailed by any specific span of the cited chunk. This is why sub-sentence and span-level citation research has exploded recently — coarse-grained pointers are, functionally, misinformation.
The Citation-Faithfulness Eval
The fix on the measurement side is to stop treating "is there a citation?" as a pass/fail check and start treating citation-level entailment as a first-class eval.
A minimal pipeline:
- Decompose the answer into atomic claims (one verifiable assertion per claim).
- For each claim, extract the cited span (not just the cited document).
- Run an NLI model: does the cited span entail the claim? Label each pair as SUPPORTS, CONTRADICTS, or NEUTRAL.
- Roll up to citation precision (fraction of citations that actually entail) and citation recall (fraction of claims that have at least one entailing citation).
This is the skeleton of benchmarks like ALCE, which uses an NLI model to automatically check whether cited passages entail the generated text. It is cheap to run on a canary set and brutal in what it reveals. Teams who install this eval for the first time routinely discover citation precision in the 40-60% range on their "production-ready" RAG system — meaning roughly half the footnoted claims are backed by a passage that does not actually support them.
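A minimal sketch of that pipeline, assuming the answer has already been decomposed into atomic claims and each claim carries the exact span its citation points at. The CitedClaim shape, the roberta-large-mnli checkpoint, and the scoring helpers are illustrative choices, not a prescribed stack:

```python
from dataclasses import dataclass

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumption: an off-the-shelf MNLI cross-encoder; swap in whichever NLI model
# you trust, and check its id2label mapping before trusting the order below.
MODEL = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

@dataclass
class CitedClaim:
    claim: str        # one atomic, verifiable assertion from the answer
    cited_span: str   # the exact span the citation points at, "" if uncited

def entailment_label(premise: str, hypothesis: str) -> str:
    """SUPPORTS / CONTRADICTS / NEUTRAL for (cited span, claim)."""
    enc = tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**enc).logits.softmax(-1).squeeze(0)
    # roberta-large-mnli label order: 0 contradiction, 1 neutral, 2 entailment
    return ["CONTRADICTS", "NEUTRAL", "SUPPORTS"][int(probs.argmax())]

def citation_scores(claims: list[CitedClaim]) -> tuple[float, float]:
    """(citation precision, citation recall) over one answer's atomic claims."""
    cited = [c for c in claims if c.cited_span]
    entailed = [c for c in cited
                if entailment_label(c.cited_span, c.claim) == "SUPPORTS"]
    precision = len(entailed) / len(cited) if cited else 0.0
    recall = len(entailed) / len(claims) if claims else 0.0
    return precision, recall
```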
Two caveats worth calling out:
- NLI models have their own error profile. They struggle with partial support, numerical claims, and long spans that drift from the claim. Treat the eval as a signal, not an oracle. When possible, layer a stronger LLM-as-judge over borderline cases — and spot-check with humans on a small sample, because the judge shares failure modes with the generator.
- "The cited document contains the answer" is not enough. If your eval only checks document-level relevance, you are measuring retrieval quality, not citation faithfulness. The span matters. The claim-to-span alignment matters.
There is also a separate eval for faithfulness in the strict sense — whether the model actually used the retrieved documents. The most rigorous probes swap in modified or counterfactual documents and see whether the answer changes. If it does not, the model is running on parametric memory and your citations are performative regardless of whether they happen to be correct.
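A sketch of that probe, with generate and perturb standing in for your own stack's generation call and a counterfactual editor (a human edit of the key fact, or an LLM rewrite):

```python
def context_sensitivity(question: str, passages: list[str],
                        generate, perturb) -> float:
    """Fraction of retrieved passages whose perturbation changes the answer.
    Near 0.0 means the model is answering from parametric memory, and the
    citations are performative regardless of their NLI scores."""
    baseline = generate(question, passages)
    changed = 0
    for i, passage in enumerate(passages):
        swapped = passages[:i] + [perturb(passage)] + passages[i + 1:]
        if generate(question, swapped) != baseline:  # crude; see note below
            changed += 1
    return changed / len(passages) if passages else 0.0
```

Exact string comparison is the bluntest possible answer-change check; in practice, compare the two answers with the same NLI or judge model you use for the citation eval.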
Architectural Fixes That Actually Help
Measuring the problem is not the same as solving it. The architectural fixes fall into a short list, ranked roughly by how much they disturb your existing stack.
Generate-with-citations, not cite-after-generation. Force the model to emit the citation inside the same decoding pass as the claim it supports — ideally as an interleaved pattern like claim → citation → claim → citation. This ties the citation to the hidden state that produced the claim, instead of asking a downstream step to guess which document the generator was "thinking of." Methods like ReClaim and its successors do this with structured output grammars and a penalty when the model emits a claim without a paired citation.
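A sketch of what the output contract can look like, using an invented tag format rather than ReClaim's grammar; the enforcement lives in whatever structured-output or grammar mechanism your serving stack provides, and the parser below is the last line of defense:

```python
import re

# An invented interleaved format: the decoder emits <claim>...</claim>
# immediately followed by a <cite .../> naming the passage and the token span
# it quotes. A validator rejects unpaired claims outright.
PAIR = re.compile(
    r'<claim>(?P<claim>.+?)</claim>\s*'
    r'<cite passage="(?P<passage_id>[\w-]+)" span="(?P<start>\d+)-(?P<end>\d+)"/>',
    re.DOTALL,
)

def parse_interleaved(output: str) -> list[dict]:
    pairs = [m.groupdict() for m in PAIR.finditer(output)]
    # Anything still carrying a <claim> tag after matched pairs are removed
    # is a claim the model asserted without committing to evidence.
    if "<claim>" in PAIR.sub("", output):
        raise ValueError("claim emitted without a paired citation")
    return pairs
```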
Constrained decoding from retrieved passages. The strongest version: at the citation site, the decoder can only emit tokens that exist as a contiguous span in the retrieved corpus. A prefix tree over the passage tokens makes this enforceable at inference time. You end up with quoted evidence whose provenance is mechanical, not aspirational. The cost is that the generator can no longer paraphrase freely at the quote site, but that constraint is the point — paraphrase is where faithfulness leaks.
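A minimal sketch of the prefix-tree mechanism, assuming passages are tokenized with the generator's own tokenizer and that your decoding loop exposes an allowed-next-tokens hook (Hugging Face's generate accepts a prefix_allowed_tokens_fn callable, for example). This illustrates the idea rather than providing a drop-in implementation:

```python
class SpanTrie:
    """Trie over token IDs; every path from the root is a legal quoted span."""
    def __init__(self) -> None:
        self.children: dict[int, "SpanTrie"] = {}

    def insert(self, token_ids: list[int]) -> None:
        node = self
        for t in token_ids:
            node = node.children.setdefault(t, SpanTrie())

def build_span_trie(passage_token_ids: list[list[int]],
                    max_span: int = 64) -> SpanTrie:
    root = SpanTrie()
    for ids in passage_token_ids:
        for start in range(len(ids)):          # every suffix can start a quote
            root.insert(ids[start:start + max_span])
    return root

def allowed_next(trie: SpanTrie, quote_so_far: list[int]) -> list[int]:
    """Token IDs that keep the quote inside some contiguous passage span."""
    node = trie
    for t in quote_so_far:
        node = node.children.get(t)
        if node is None:
            return []   # quote has left the corpus; caller should close the quote
    return list(node.children)
```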
Span-level and sub-sentence citations. Stop pinning footnotes to entire chunks. Store passages with token-level offsets. Emit citations that point to a span of 3-30 tokens inside a specific passage. Display the highlighted span in the UI, not the whole chunk. Users actually verify span-level citations because the cognitive load is low; they never verify chunk-level citations because scanning a 512-token paragraph for relevance is work nobody does.
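A sketch of the storage shape this implies, with hypothetical field names; the point is that the highlighted string is derived mechanically from stored offsets rather than re-searched inside the chunk at render time:

```python
from dataclasses import dataclass

@dataclass
class StoredPassage:
    passage_id: str
    text: str
    token_offsets: list[tuple[int, int]]   # (char_start, char_end) per token

@dataclass
class SpanCitation:
    passage_id: str
    token_start: int   # inclusive token index into the stored passage
    token_end: int     # exclusive; spans of roughly 3-30 tokens verify at a glance

def cited_text(passage: StoredPassage, cite: SpanCitation) -> str:
    """The exact highlighted string to render in the UI."""
    char_start = passage.token_offsets[cite.token_start][0]
    char_end = passage.token_offsets[cite.token_end - 1][1]
    return passage.text[char_start:char_end]
```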
Separate the citation model from the generator. A smaller, specialized model trained only to do span-level NLI matches claims to evidence more reliably than a generator citing as a reflex mid-answer. Make citation assignment a dedicated, auditable step rather than an afterthought in the system prompt. This looks like P-Cite, but unlike naive P-Cite it enforces entailment rather than similarity.
Refuse to cite when entailment is weak. This is the cultural fix. Most systems are tuned to cite always, because missing citations look bad in demos. Invert the policy: a missing citation is a signal of honest uncertainty. Adding a "no supporting passage found" branch to the UI — rendered as visible, not hidden — preserves trust better than fabricating a footnote.
Why This Keeps Shipping Broken
If the research is this clear and the fixes this well understood, the obvious question is why enterprise RAG keeps shipping with unfaithful citations. Three reasons keep surfacing.
The first is evaluation asymmetry. Teams measure retrieval recall, answer correctness, and latency. They rarely measure citation-span entailment, because it requires a second eval harness and an NLI model most teams never deploy. What gets measured gets fixed.
The second is demo incentives. Footnoted answers look authoritative. In a procurement demo, the difference between "here is my answer [1] [2] [3]" and "here is my answer — I could not ground this claim in the provided corpus" is the difference between a signed contract and a follow-up meeting. Until buyers start asking for citation-faithfulness numbers, vendors have no reason to surface them.
The third is the asymmetric cost of failure. A missed citation costs a demo; a false citation costs a user weeks later, privately, in a context the vendor never sees. The feedback loop for citation faithfulness is long, diffuse, and rarely reaches engineering. So the bug persists.
What to Do on Monday
If you run a RAG system in production and you have never measured citation-span entailment, assume your citation precision is between 40% and 70%. Do these three things in order:
- Build a small canary eval — 100-200 queries, human-verified answers, human-verified supporting spans. Use an NLI model (or an LLM-as-judge with a strict entailment rubric) to score citation precision and recall. Run it on every prompt change and every model upgrade.
- Audit your pipeline for the architectural seam between generation and citation. If citations are assigned in a separate step from generation, you are almost certainly post-rationalizing. Move citation into the decoding loop, even if you have to switch to interleaved output format.
- Push granularity down to the span. Store retrieval results with offsets. Render citations as highlighted spans, not chunk IDs. Let users verify at a glance; they will, and your own review meetings will too.
"We cite our sources" is the assertion enterprise AI makes most often and tests least. Flipping that ratio — from performance of trust to evidence of it — is the work that separates RAG products that hold up under scrutiny from the ones whose users quietly stop trusting them, usually months before they admit it to the vendor.
Sources
- https://arxiv.org/abs/2412.18004
- https://www.alphaxiv.org/overview/2412.18004v1
- https://dl.acm.org/doi/10.1145/3731120.3744592
- https://arxiv.org/html/2509.21557
- https://www.nature.com/articles/s41467-025-58551-6
- https://arxiv.org/abs/2305.14627
- https://arxiv.org/html/2407.01796
- https://arxiv.org/abs/2509.20859
- https://arxiv.org/abs/2507.04480
- https://arxiv.org/html/2510.17853v1
- https://www.whyaitech.com/notes/systems-note-002.html
- https://aclanthology.org/2023.findings-emnlp.307.pdf
