The Citation Index Your Chunker Shifted by One When It Started Prefixing Line Numbers
The chunker started prepending [line N] to every chunk. The eval went green. Every citation the model produced after that day pointed to the paragraph one position before the actual evidence, on every document, in the regulated industry the product serves. The team did not find out from the eval. The team found out from an auditor who looked at the cited sentence, read it, and pointed out that it contradicted the claim it was supposed to support.
This is the kind of regression that survives a code review, a manual QA pass on three sample documents, and a feature-flag rollout. None of those checks were wrong in isolation. They were all asking the same question — does a citation appear where one is expected — and none of them were asking the question the auditor asked, which is whether the citation points at the sentence the claim came from. The gap between those two questions is where the off-by-one lived for as long as it lived.
What makes this failure mode worth a separate write-up is not the bug itself. Off-by-one errors are old news. The interesting part is that the failure was produced by two systems that continued to agree on the structure of an integer while silently disagreeing about what the integer meant.
The chunker and the citation parser were never on the same call
A document chunker emits chunks. A citation extractor consumes the model's references to those chunks and resolves them back to spans in the original source. In most production RAG architectures, these two components are owned by different teams, deployed on different cadences, and tested by different eval suites. They communicate through a single integer per citation — the paragraph index, the line range, the chunk position.
That integer is a coordinate. Coordinates require a coordinate system. The chunker writes integers in one coordinate system; the parser reads them in another; the contract between them is the implicit agreement that both sides are counting the same thing from the same origin.
When the chunker added a [line N] prefix to every chunk so the model could cite line ranges instead of paragraph numbers, the prefix consumed the first line of the chunk. The chunker's emitted indices were unchanged at the storage layer. The model, reading the prefixed chunk, started numbering from the prefix. The citation parser, parsing what the model emitted, still mapped that number through the pre-prefix paragraph index. Every paragraph index shifted by one in the model's frame and by zero in the parser's frame, and the difference came out as a citation to the paragraph immediately before the actual evidence.
No code path threw. No regex failed to match. No chunk count changed. The two systems remained structurally compatible — same data type, same range, same response shape — while their semantic agreement quietly dissolved.
"Citation present" is not a citation metric
The eval suite scored citations as a boolean: did the model produce a citation, and did the citation resolve to a chunk in the corpus? On both axes, the new chunker passed cleanly. Every response had a citation. Every citation resolved. If anything, the score on the eval dashboard went up, because the prefix gave the model a clearer signal to cite from in the first place.
The metric that would have caught this is not "citation present" but "citation correct" — defined as a semantic match between the cited span and the claim it supports. Citation correctness is a substantially more expensive metric to compute. It needs a notion of which atomic claims live in the answer, which span each claim was meant to come from, and a comparator that says yes-or-no on the alignment. Most teams do not maintain this. The teams that do typically only maintain it on a small golden set, not on a sample large enough to detect distributional shifts inside a single sub-corpus.
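The difference between the two metrics can be sketched in a few lines. This is a minimal illustration, not the team's eval code: `CitedClaim`, `citation_present`, `citation_correct`, and `token_overlap_supports` are hypothetical names, and the token-overlap comparator is a deliberately crude stand-in for whatever entailment or human judgment a real suite would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class CitedClaim:
    claim: str                   # one atomic claim taken from the answer
    cited_index: Optional[int]   # chunk index the model cited, or None


def citation_present(item: CitedClaim, corpus: Sequence[str]) -> bool:
    """Smoke test: a citation exists and resolves to some chunk."""
    return item.cited_index is not None and 0 <= item.cited_index < len(corpus)


def citation_correct(
    item: CitedClaim,
    corpus: Sequence[str],
    supports: Callable[[str, str], bool],
) -> bool:
    """The cited span must also support the claim, as judged by `supports`."""
    if not citation_present(item, corpus):
        return False
    return supports(corpus[item.cited_index], item.claim)


def token_overlap_supports(span: str, claim: str, threshold: float = 0.6) -> bool:
    """Crude stand-in comparator: share of claim tokens that appear in the span."""
    claim_tokens = set(claim.lower().split())
    span_tokens = set(span.lower().split())
    if not claim_tokens:
        return False
    return len(claim_tokens & span_tokens) / len(claim_tokens) >= threshold


corpus = [
    "Refunds are not available after 30 days.",
    "Refunds are issued within 14 days of a request.",
]
# Same claim, cited one position early versus cited at the evidence.
shifted = CitedClaim("refunds are issued within 14 days", cited_index=0)
aligned = CitedClaim("refunds are issued within 14 days", cited_index=1)

for item in (shifted, aligned):
    print(citation_present(item, corpus),
          citation_correct(item, corpus, token_overlap_supports))
```

Both citations pass the presence check; only the aligned one passes the correctness check. The comparator is the expensive part in practice, which is why it is passed in as a parameter here rather than fixed.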
The cheaper proxies all degrade in the same direction. Citation accuracy in production RAG is commonly put at around 65–70% without explicit attribution training, but that's an aggregate; it doesn't tell you whether the remaining 30–35% are wrong in the same way, on the same documents, or after the same deploy. An off-by-one is a structured wrongness, and structured wrongness is the kind of failure that aggregate metrics smooth over.
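One way to make structured wrongness visible, assuming a golden set where the expected index of each citation is known, is to look at the distribution of offsets rather than the error rate. The sketch below is illustrative; `offset_histogram`, `dominant_nonzero_offset`, and the `min_share` threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def offset_histogram(pairs: Iterable[Tuple[int, int]]) -> Counter:
    """Count (cited - expected) over (cited_index, expected_index) pairs."""
    return Counter(cited - expected for cited, expected in pairs)


def dominant_nonzero_offset(hist: Counter, min_share: float = 0.75) -> Optional[int]:
    """Return the offset that explains most wrong citations, if one dominates."""
    wrong = {offset: n for offset, n in hist.items() if offset != 0}
    total_wrong = sum(wrong.values())
    if total_wrong == 0:
        return None
    offset, count = max(wrong.items(), key=lambda kv: kv[1])
    return offset if count / total_wrong >= min_share else None


# Hypothetical golden-set results: (cited_index, expected_index).
pairs = [(4, 5), (0, 1), (9, 10), (6, 7), (2, 2), (7, 3)]
hist = offset_histogram(pairs)
print(dict(hist))
print(dominant_nonzero_offset(hist))
```

Random errors scatter across many offsets. When most of the wrong citations share a single offset such as -1, the cause is more likely a frame mismatch than a model weakness, and the aggregate accuracy number alone would not have shown that.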
The lesson is not that "citation present" is a bad metric. It's a fine smoke test. The lesson is that it is only a smoke test, and smoke tests do not protect against semantic regressions in the things they do not measure. Treating citation correctness as a first-class metric (instrumented continuously, alerted on slope rather than on absolute level, computed against documents whose answers are known) is the only way the off-by-one becomes visible before the auditor sees it.
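Alerting on slope can be as simple as fitting a line through the most recent daily correctness scores. The following is a sketch under assumptions: the function names, the window contents, and the `max_drop_per_day` threshold are all hypothetical, and a real monitor would likely live in whatever alerting system the team already runs.

```python
from typing import Sequence


def least_squares_slope(values: Sequence[float]) -> float:
    """Slope of the best-fit line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def should_alert(daily_correctness: Sequence[float], max_drop_per_day: float = 0.02) -> bool:
    """Alert when correctness is falling faster than the allowed rate."""
    return least_squares_slope(daily_correctness) < -max_drop_per_day


steady = [0.91, 0.90, 0.92, 0.91, 0.90]      # noisy but flat
after_deploy = [0.91, 0.90, 0.55, 0.52, 0.50]  # step down mid-window

print(should_alert(steady), should_alert(after_deploy))
```

A slope check fires on the change itself, so it does not depend on anyone having picked the right absolute threshold in advance.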
Two systems that agreed on the type and disagreed on the meaning
The deeper failure here is one of typing. The chunker emitted an integer for a paragraph index. The parser accepted an integer for a paragraph index. The compiler, the linter, and the type-checker all signed off. Nothing in the type system said "these two integers must be in the same coordinate system."
This is the same class of bug as passing meters into a function that expects feet. The function will return a number. The number will be wrong. No tool you have will tell you, because both sides agree on the dimensionality of the quantity and disagree only on its interpretation.
The pattern that closes this is a content-coordinate type. Instead of paragraph_index: int, name the indexing scheme:
- ChunkPositionInPrefixedFrame
- ChunkPositionInOriginalFrame
- LineNumberInPrefixedFrame
- SentenceIndexInChunk
Each is a distinct type. Conversion between them is explicit. The chunker emits citations in one frame; the parser converts to another frame explicitly before resolving against the source. The shift becomes a function the team has to write, name, and review. Any change to the chunker's framing — a prefix, a header, a structural insertion — forces a corresponding change to the converter, because the type signature of the converter changes.
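A minimal Python sketch of the idea, using two of the type names above. Everything else is an assumption made for illustration: the `PrefixedToOriginal` converter, its `shift` field, and the `resolve` function are hypothetical, and the sign of the shift in the usage lines simply follows the incident described here (the unconverted index landed one paragraph early). A static type checker would flag a mixed-up frame at review time; the `isinstance` checks are there for code paths that run without one.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ChunkPositionInOriginalFrame:
    value: int


@dataclass(frozen=True)
class ChunkPositionInPrefixedFrame:
    value: int


@dataclass(frozen=True)
class PrefixedToOriginal:
    """The one named, reviewable place where the frame shift lives."""
    shift: int  # hypothetical: derived from the chunk layout in a real system

    def convert(self, pos: ChunkPositionInPrefixedFrame) -> ChunkPositionInOriginalFrame:
        if not isinstance(pos, ChunkPositionInPrefixedFrame):
            raise TypeError("convert() expects a position in the prefixed frame")
        return ChunkPositionInOriginalFrame(pos.value + self.shift)


def resolve(paragraphs: Sequence[str], pos: ChunkPositionInOriginalFrame) -> str:
    """Resolve a citation against the source; only the original frame is accepted."""
    if not isinstance(pos, ChunkPositionInOriginalFrame):
        raise TypeError("resolve() only accepts positions in the original frame")
    return paragraphs[pos.value]


paragraphs = ["intro", "contradicting sentence", "actual evidence"]
cited = ChunkPositionInPrefixedFrame(1)   # what the model emitted
converter = PrefixedToOriginal(shift=1)   # illustrative value only

print(resolve(paragraphs, converter.convert(cited)))
```

Passing `cited` straight to `resolve` raises a `TypeError` instead of silently returning the wrong paragraph, which is the whole point: the mismatch becomes a loud failure at the boundary rather than a quiet one in front of an auditor.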
This is not exotic engineering. It is the same kind of phantom-types trick that finance code uses for currency and that physics code uses for units. The reason RAG code rarely uses it is that the data flowing through is "just text," and "just text" feels like it should not need a type system. The off-by-one is the bill for that assumption.
Chunk-format changes are coordinate-system changes
The next pattern is the one that would have caught the regression on its own deploy: a regression test specifically exercising the citation parser against documents containing known answers, run on every change to the chunk format. Not the eval suite. A targeted contract test against the chunk-to-citation boundary.
The reason this is its own test is that chunk-format changes look small from the chunker's perspective. Adding a prefix is one line of code. The chunker still emits the same count of chunks, with the same approximate token budget, against the same documents. The diff is local. The blast radius is global.
A regression test at the chunk-to-citation boundary takes a document with a known answer, runs it through the full pipeline, and checks that the citation resolves to the span that contains the answer. Not "a citation appeared." Not "the citation parsed cleanly." That the resolved span contains the text. This is two or three documents and a fixture file. It is the cheapest possible investment that would have caught the regression on the feature-flag PR.
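The shape of such a test is roughly the following. It is a sketch, not a drop-in: the fixture text is invented, and `healthy_pipeline` and `shifted_pipeline` are stubs standing in for the real chunker, model, and citation parser, which would be called in their place.

```python
from typing import Callable, Sequence

# Hypothetical fixture: a tiny document whose answer location is known.
FIXTURE_PARAGRAPHS = [
    "This policy covers all retail accounts.",
    "Refunds are not available after 30 days.",
    "Refunds are issued within 14 days of a request.",
]
KNOWN_ANSWER_TEXT = "within 14 days"

# Stand-in for chunker + model + parser: document in, resolved paragraph index out.
Pipeline = Callable[[Sequence[str]], int]


def resolved_span_contains_answer(
    paragraphs: Sequence[str], pipeline: Pipeline, answer_text: str
) -> bool:
    """True only if the span the citation resolves to contains the answer text."""
    index = pipeline(paragraphs)
    if not 0 <= index < len(paragraphs):
        return False
    return answer_text in paragraphs[index]


def healthy_pipeline(paragraphs: Sequence[str]) -> int:
    return 2  # stub: cites the paragraph that holds the answer


def shifted_pipeline(paragraphs: Sequence[str]) -> int:
    return 1  # stub: cites one paragraph early, as in the regression


def check_citation_contract(pipeline: Pipeline) -> None:
    assert resolved_span_contains_answer(
        FIXTURE_PARAGRAPHS, pipeline, KNOWN_ANSWER_TEXT
    ), "citation resolved, but not to the span containing the answer"


check_citation_contract(healthy_pipeline)  # passes silently
```

Calling `check_citation_contract(shifted_pipeline)` raises an `AssertionError`, even though the shifted citation is present, parses, and resolves to a real paragraph.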
If the team also runs the same test under the old chunker as a baseline diff, the change in citation accuracy on the new chunker is visible as a delta. A 100% citation-target accuracy on the old path and a 0% citation-target accuracy on the new path is not a subtle signal. The reason the feature flag rolled is that nobody was reading the right number.
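The baseline diff can be sketched as one accuracy function run against both paths. The fixtures and the two `*_chunker_path` stubs below are hypothetical; they use the known evidence index only to simulate a correct path and a path that cites one position early.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Fixture:
    paragraphs: Sequence[str]
    answer_text: str
    evidence_index: int  # known answer location; used only by the stubs below


def citation_target_accuracy(
    fixtures: Sequence[Fixture], pipeline: Callable[[Fixture], int]
) -> float:
    """Fraction of fixtures whose resolved span contains the known answer."""
    if not fixtures:
        raise ValueError("need at least one fixture")
    hits = 0
    for fx in fixtures:
        index = pipeline(fx)
        if 0 <= index < len(fx.paragraphs) and fx.answer_text in fx.paragraphs[index]:
            hits += 1
    return hits / len(fixtures)


def old_chunker_path(fx: Fixture) -> int:
    return fx.evidence_index        # stub: frames agree


def new_chunker_path(fx: Fixture) -> int:
    return fx.evidence_index - 1    # stub: cites one position early


fixtures = [
    Fixture(["Scope.", "No refunds after 30 days.", "Refunds issued within 14 days."],
            "within 14 days", 2),
    Fixture(["Definitions.", "Records are kept for 7 years."], "7 years", 1),
]

old = citation_target_accuracy(fixtures, old_chunker_path)
new = citation_target_accuracy(fixtures, new_chunker_path)
print(f"old={old:.0%} new={new:.0%} delta={new - old:+.0%}")
```

Reporting the delta next to the two raw numbers is the design choice that matters: it puts the number nobody was reading directly on the pull request.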
The audit catches what the eval underweights
The eval has its own selection bias. It tends to be built from queries the team thought of, on documents the team chose, against answers the team already knew. The auditor, by contrast, was running an adversarial workflow: take a citation, follow it, read the sentence, ask whether it supports the claim. That workflow does not appear in any standard eval harness, because it is the workflow of someone trying to falsify the model's claim rather than someone trying to verify the model's behavior.
The patterns that bring the auditor's workflow inside the eval loop are not new ideas. They are:
- Atomic-claim decomposition. Break the model's answer into the smallest standalone claims. Score each claim against its citation independently. This catches the case where the citation supports part of the answer and contradicts another part.
- Span-content verification. After resolving the citation to a span, run an entailment check between the span text and the claim text. A semantic mismatch is a citation failure even if the index resolved cleanly.
- Adversarial sampling. Sample queries from auditor-style workflows: high-severity tickets, regulated-industry edge cases, churn interviews where a customer named the specific kind of question their team kept getting wrong. Without this, the eval set drifts toward the cohort least likely to leave.
- Per-feature-flag comparison. When a chunker change rolls behind a flag, the eval should run both branches against the same query set and report a side-by-side delta. Promotion to production requires the delta to be flat or positive on every metric, not just the aggregate.
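The promotion rule in the last item, flat or positive on every metric rather than on the aggregate, can be expressed as a small gate. This is a sketch with assumed names and made-up scores: `metric_regressions`, `can_promote`, and the metric keys are hypothetical, and every metric is assumed to be higher-is-better.

```python
from typing import Dict, Mapping


def metric_regressions(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
    tolerance: float = 0.0,
) -> Dict[str, float]:
    """Return the per-metric deltas that fell by more than `tolerance`."""
    if set(baseline) != set(candidate):
        raise ValueError("both branches must report the same metrics")
    deltas = {name: candidate[name] - baseline[name] for name in baseline}
    return {name: delta for name, delta in deltas.items() if delta < -tolerance}


def can_promote(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
    tolerance: float = 0.0,
) -> bool:
    """Promote only if no individual metric regressed."""
    return not metric_regressions(baseline, candidate, tolerance)


# Illustrative scores for the flag-off and flag-on branches on the same query set.
flag_off = {"citation_present": 0.97, "citation_correct": 0.93}
flag_on = {"citation_present": 0.99, "citation_correct": 0.02}

print(metric_regressions(flag_off, flag_on))
print(can_promote(flag_off, flag_on))
```

In this made-up run the presence metric improves while correctness collapses, so the gate blocks promotion and names the metric responsible.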
These four patterns together would have surfaced the off-by-one before the auditor did. None of them is expensive in absolute terms. The reason they aren't standard is that each one is owned by a different team — atomic claims by the eval team, span verification by the retrieval team, sampling by product analytics, feature-flag comparison by infra. Nobody owns the citation-correctness contract end-to-end.
What "correctness was a coincidence" means
The citation chain in a RAG system is a sequence of references. A model emits a reference. A parser resolves the reference to a chunk. A chunk maps back to a span in a source document. The user reads the span. Each link has its own validity check. None of those checks proves the chain.
The chunker shipped a change that broke one link in a way that left every other link still valid. The model still emitted a reference. The parser still resolved it. The span still rendered. The user still read it. The audit was the first check in months that actually walked the chain end-to-end against the meaning of the claim, and the chain failed at the first link in 100% of cases on the documents the new chunker had produced.
The phrase that ought to scare a team here is the one the postmortem will write: the citations were correct for as long as nothing changed. That's not the same as "the citations were correct." That's a coincidence. Two independent systems happened to agree on what the integer 1 referred to. As soon as one of them shifted its frame, the agreement evaporated, and the only check that would have noticed was the one nobody was running.
The systems thinking that closes the gap is not heroic. Name your coordinate systems. Test your contracts at the boundary, not at the endpoints. Score what the user reads, not what the model emits. Run the auditor's workflow inside the eval loop before the auditor runs it inside your customer's quarterly review.
A citation that visibly contradicts its claim is the cheapest way for this kind of failure to surface. The expensive way ends with a customer in a regulated industry walking into a meeting with a regulator and a model's output in front of them. The cost of that meeting is the budget you should be willing to spend on the second test, the one that checks what the citation actually points at.
