The Chunk Boundary That Bisected the Sentence Your Answer Depended On
Your RAG pipeline chunks documents into 512-token spans with 50-token overlap. It is a clean industry default. Somewhere in your corpus there is a sentence — "Refunds are processed within five business days unless the order originated from the EU region, in which case the regulatory window is fourteen days" — that landed across a chunk boundary. Chunk N contains the first half. Chunk N+1 contains the second.
A user asks "how long do EU refunds take." Retrieval scores chunk N highest because the query embedding aligns with "EU region" in the first fragment. Chunk N+1, which contains the only actual answer, ranks too low to be retrieved alongside. The agent answers "five business days" with a confident citation to chunk N. The customer is in Frankfurt. The answer is wrong. The pipeline behaved exactly as designed.
This is the failure mode that does not show up in your chunk-quality eval. The chunks are well-formed. The corpus is well-formed. The embedding model is well-formed. The boundaries between chunks — the lines you drew through your own documents — are where the answer lives.
The Chunker Was Built for the Context Window, Not for Meaning
The default chunker shipped with every popular RAG framework was designed around a single constraint: the embedding model's context window. Five hundred and twelve tokens is what the encoder can hold. Recursive token splitting with overlap is what fits cleanly under that ceiling while remaining cheap to compute at index time.
Nothing about that design considered discourse. A token splitter does not know that "unless the order originated from the EU region" is a subordinate clause whose meaning depends on the main clause it modifies. It does not know that a numbered exception in a contract loses its function when separated from the rule it qualifies. It does not know that a table row without its header is a row of numbers nobody can interpret. The splitter knows tokens and newlines. The discourse structure of the document is invisible to it.
The 50-token overlap was added to catch boundary bleed, but overlap is calibrated to typical sentence length. Long compound sentences — the ones that carry exceptions, qualifiers, conditions, and carve-outs — routinely exceed the overlap window. And the load-bearing sentences in legal, regulatory, financial, and policy documents are almost always the long ones. The overlap was designed to protect the median sentence. It does not protect the sentence that matters.
A January 2026 systematic analysis of overlap using SPLADE retrieval found that overlap, on aggregate, provided no measurable recall benefit and only increased indexing cost. The aggregate hides the cases where overlap was load-bearing and the cases where overlap was theatrical. It does not distinguish between them, and your eval suite probably does not either.
Why the Chunk Quality Eval Misses This Class of Failure
When chunk-quality evals run, they usually grade two things: was the right chunk retrieved, and does the chunk contain a sentence that answers the question. Both of those questions answer yes in the bisection case. Chunk N was retrieved. Chunk N contains a sentence about EU refunds. The eval is satisfied. The user is not.
The actual failure is that retrieval returned the wrong half of a sentence and the eval has no concept of "the answer spans a boundary." Top-k recall is measured at the chunk level. The gold answer is anchored to a chunk. The pipeline returns that chunk. The grader records a hit. Nobody asked whether the chunk's text, on its own, was sufficient — or whether the model would need to read the chunk above or below to actually answer.
This is the chunking analog of a known RAG pathology: context sufficiency. Recent work has shown that more than half of complex retrieval queries lack sufficient context for correct generation even when retrieval "succeeds" by top-k metrics. Boundary bisection is a specific generator of insufficient context. The retrieval call returns the chunk that contains the question's keywords. The chunk does not contain the question's answer. The generator confabulates the rest.
A coverage-aware eval has to test this directly. Take your real corpus. Identify the sentences that span chunk boundaries. Construct queries that target precisely those sentences. Measure retrieval recall conditional on the gold answer spanning a boundary, separately from your overall recall number. The two numbers will diverge. The gap is the failure mode you have been shipping.
Approaches That Actually Close the Gap
- https://jina.ai/news/late-chunking-in-long-context-embedding-models/
- https://arxiv.org/abs/2409.04701
- https://www.anthropic.com/news/contextual-retrieval
- https://medium.com/kx-systems/late-chunking-vs-contextual-retrieval-the-math-behind-rags-context-problem-d5a26b9bbd38
- https://arxiv.org/abs/2506.06313
- https://www.firecrawl.dev/blog/best-chunking-strategies-rag
- https://towardsdatascience.com/your-chunks-failed-your-rag-in-production/
- https://optyxstack.com/rag-reliability/rag-chunking-strategy-chunk-size-overlap-document-structure-recall
- https://www.glukhov.org/rag/retrieval/chunking-strategies-in-rag/
- https://unstructured.io/blog/contextual-chunking-in-unstructured-platform-boost-your-rag-retrieval-accuracy
