
Why the Chunking Problem Isn't Solved: How Naive RAG Pipelines Hallucinate on Long Documents

9 min read
Tian Pan
Software Engineer

Most RAG tutorials treat chunking as a footnote: split your documents into 512-token chunks, embed them, store them in a vector database, and move on to the interesting parts. This works well enough on toy examples — Wikipedia articles, clean markdown docs, short PDFs. It falls apart in production.

A recent study deploying RAG for clinical decision support found that the fixed-size baseline achieved 13% fully accurate responses across 30 clinical questions. An adaptive chunking approach on the same corpus: 50% fully accurate (p=0.001). The documents were the same. The LLM was the same. Only the chunking changed. That gap is not a tuning problem or a prompt engineering problem. It is a structural failure in how most teams split documents.

The uncomfortable truth is that chunking is load-bearing infrastructure and nearly every framework's default is wrong for non-trivial documents.

The Semantic Chunking Paradox

The intuitive fix to naive fixed-size splitting is semantic chunking: embed every sentence, detect similarity drops, and place chunk boundaries at topic transitions. It sounds like a clear upgrade — chunks that respect semantic boundaries should retrieve better and generate better answers.
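The mechanism fits in a few lines. This is a toy sketch: the `embed` function below is a stand-in bag-of-words counter where a real pipeline would call a sentence-embedding model, and the similarity threshold is an illustrative value, not a recommendation.

```python
import math
import re
from collections import Counter

def embed(sentence):
    # Placeholder embedding: bag-of-words counts. A real pipeline would
    # call a sentence-embedding model here instead.
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunk(text, threshold=0.2):
    """Place chunk boundaries wherever consecutive-sentence similarity drops."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    vecs = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:  # similarity drop => topic boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Note that nothing in this loop enforces a minimum chunk size, which is exactly how the over-fragmentation described below creeps in.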

The FloTorch 2026 benchmark tested this on 50 academic papers (905K tokens total). Semantic chunking achieved 91.9% retrieval recall — top of the leaderboard. End-to-end answer accuracy: 54%. Recursive splitting at 512 tokens, the structurally dumber approach, achieved 69% end-to-end accuracy.

The explanation is that semantic chunking over-fragments. Similarity-based boundary detection tends to produce very small chunks — the FloTorch study found an average of 43 tokens per chunk. A 43-token chunk retrieves with surgical precision: it finds the exact sentence you were looking for. But 43 tokens rarely contain enough context for the LLM to generate a correct answer. You retrieved the right sentence from the wrong paragraph, stripped of the surrounding argument that made it meaningful.

High retrieval recall with low generation accuracy is the semantic chunking paradox. It exposes the core measurement problem in RAG evaluation: teams optimize for retrieval metrics (NDCG, recall@k) and ship systems that perform poorly on the metric that actually matters — did the user get a correct answer?

Fixed-Size Chunking's Real Failure Modes

Fixed-size chunking fails differently. It doesn't over-fragment; it splits arbitrarily. Three concrete failure modes that appear regularly in production:

Anaphoric reference breakage. A legal document defines "the Company" in section 1.2. The rest of the document uses "it," "the firm," and "the entity" interchangeably. A 512-token chunk starting in section 7 contains "the entity shall not..." but no referent. The embedding matches "entity" to a user's query about company obligations. The LLM either fabricates which company is meant or produces a vacuous answer.

Definition/rule separation. Tax code documents, API specifications, and legal contracts define terms once and reference them repeatedly. Fixed-size chunking separates the definition from its application contexts. A query that invokes a defined term retrieves application chunks without the definition — the LLM has to guess or hallucinate the definition.

Boundary contamination. Edge chunks at split boundaries inadvertently combine the final sentences of one section with the opening sentences of the next. Both sections are topically coherent individually; the boundary chunk is internally contradictory. When retrieved, it injects confusion directly into the context window.

The LlamaIndex chunk-size evaluation on a financial 10-K found faithfulness (no hallucinations) peaked at 1024-token chunks. Smaller 128-token chunks had measurably higher hallucination rates — not because the model was worse, but because retrieved fragments were missing context required to answer correctly.

Where Every Chunking Strategy Breaks Down

Tables, code, and structured data are where all current chunking strategies fail most severely — and where most production documents live.

Tables. Standard text splitters treat tabular content as prose. A financial table split across two chunks means the header row is in chunk N and data rows are in chunk N+1. Neither chunk is independently meaningful. NVIDIA's 2024 benchmark showed page-level chunking performed best on FinanceBench precisely because PDF pages tend to preserve table integrity — they kept tables atomic not by design but by accident.

Code blocks. Recursive splitters break functions mid-body, separate docstrings from function signatures, or split class definitions from method implementations. The retrieved chunk is syntactically invalid or semantically misleading. An agent asked to explain a function gets back the loop body with no signature and no return type.

Multi-column PDFs and scanned documents. PDF text extraction is lossy for complex layouts. Reading order gets scrambled in multi-column documents. Tables become garbled prose. Chunking on already-broken extracted text amplifies the error.

The correct mitigation requires treating chunking as content-type-aware: AST-based splitting for code (respecting function and class boundaries), specialized table extraction (treating each table as an atomic unit regardless of size), and page-level or section-level chunking for PDFs with mixed layouts. Most frameworks provide none of this by default.
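For the code case, here is a minimal sketch of AST-based splitting for Python source using the standard `ast` module. Real splitters also handle nested definitions, oversized bodies, module-level statements, and other languages; this only covers top-level functions and classes.

```python
import ast

def chunk_python_source(source):
    """Split Python source at top-level function/class boundaries so each
    chunk is a syntactically complete, independently parseable unit."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            # Include decorators that sit above the def/class line.
            if node.decorator_list:
                start = node.decorator_list[0].lineno - 1
            chunks.append("\n".join(lines[start:node.end_lineno]))
    return chunks
```

Because each chunk is a complete definition, a retrieved chunk always carries its signature, docstring, and return statements together, avoiding the mid-body splits described above.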

What Late Chunking Actually Fixes

Late chunking (2024) inverts the standard pipeline order. Instead of chunking first then embedding, it runs the entire document through the transformer's attention layers, then performs mean pooling over token spans corresponding to chunk boundaries. Each chunk embedding is conditioned on all surrounding context.
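The pooling step itself is simple once the contextualized token embeddings exist. This sketch assumes you have already run one forward pass of a long-context embedding model over the whole document and hold its per-token output as a numpy array; the span offsets are hypothetical chunk boundaries in token space.

```python
import numpy as np

def late_chunk_embeddings(token_embeddings, spans):
    """Mean-pool contextualized token embeddings over chunk spans.

    token_embeddings: array of shape (n_tokens, dim), produced by ONE
    forward pass over the entire document -- the key inversion: embed
    first, pool per chunk afterwards, so every chunk vector is
    conditioned on the full surrounding context.
    spans: list of (start, end) token offsets defining chunk boundaries.
    """
    return [token_embeddings[start:end].mean(axis=0) for start, end in spans]
```

The memory cost mentioned below follows directly from this design: the full `(n_tokens, dim)` activation tensor must exist before any pooling happens.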

The benchmark results on BeIR datasets are real: NFCorpus (average document length 1,590 characters) showed a 28% relative improvement in retrieval accuracy (23.46% → 29.98% nDCG@10). The correlation with document length is the key finding — gains scale with document length. Short documents see no benefit. Long documents approaching the model's context window see the largest gains.

What late chunking does not fix: it does not solve generation failures from under-sized chunks, it does not help tables or code, and it is model-dependent. A 2025 study found that on the same NFCorpus benchmark, late chunking with one embedding model (Stella-V5) and early chunking produced nearly identical results, while BGE-M3 with late chunking was significantly worse than BGE-M3 with standard chunking. Late chunking is a useful technique for long-document text retrieval with compatible models, not a general solution.

There is also a cost: late chunking requires running full documents through the embedding model's attention mechanism, which means memory usage scales with document length rather than chunk length. On a corpus of long documents, ingestion time and memory pressure increase substantially.

Document Hierarchy Addresses a Different Problem

Hierarchical indexing — indexing at multiple granularities simultaneously (document → section → paragraph → sentence) — solves a different class of failures from single-level chunking.

The failure mode it targets: analytical queries require synthesizing information across sections, but precise factoid queries want the smallest possible fragment. A single fixed chunk size cannot serve both. A query like "What are the interest rate assumptions across all five years of projections?" needs section-level chunks that carry sufficient context to answer with full coverage. A query like "What is the interest rate in 2023?" needs paragraph-level chunks for precision.

Hierarchical indexing routes queries to the appropriate granularity level based on query type, and uses parent-document retrieval: find the matching small chunk, then return the parent section to the LLM for generation. This preserves retrieval precision while providing generation context.
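A minimal in-memory sketch of the child-to-parent mapping. All names here are illustrative, and the word-overlap scorer stands in for embedding similarity; a real system stores the child chunks in a vector database with a parent pointer in metadata.

```python
def build_parent_index(sections, chunk_size=50):
    """Index small child chunks for retrieval precision, each carrying a
    pointer back to the parent section that will be sent to the LLM."""
    child_chunks = []  # list of (chunk_text, parent_idx)
    for parent_idx, section in enumerate(sections):
        words = section.split()
        for i in range(0, len(words), chunk_size):
            child_chunks.append((" ".join(words[i:i + chunk_size]), parent_idx))
    return child_chunks

def retrieve_parent(query, child_chunks, sections):
    # Toy scorer: word overlap. A real system scores child chunks by
    # embedding similarity, then expands the best hit to its parent.
    q = set(query.lower().split())
    best = max(child_chunks, key=lambda c: len(q & set(c[0].lower().split())))
    return sections[best[1]]  # generation context = the whole parent section
```

The retrieval step matches against small chunks, but the LLM only ever sees the parent section, which is what preserves precision and context simultaneously.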

The operational overhead is real: you maintain multiple index levels, increase storage proportionally, and need a routing layer to determine which level to query. For documents with strong hierarchical structure (legal filings, technical specifications, medical guidelines), the accuracy gains justify the cost. For flat conversational data, it is unnecessary complexity.

The Evaluation Gap That Hides the Problem

The reason chunking failures persist in production is that most teams measure the wrong thing during development.

Retrieval metrics (recall@k, NDCG, MRR) measure whether the correct document chunks are in the top-k results. They do not measure whether those chunks contain sufficient context for correct generation. The Chroma Research evaluation across 472 queries and 5 corpora found chunking strategy choice alone produced up to 9% recall variance — but their best-recall strategy (LLM-directed chunking via GPT-4o) achieved only 3.9% token-level IoU precision. High recall at catastrophically low precision.

The correct evaluation stack:

  • Faithfulness (RAGAS terminology): are the claims in the answer grounded in the retrieved context? This catches hallucinations introduced by chunking failures directly.
  • Context precision: are the retrieved chunks actually relevant to the query, or just topically adjacent?
  • Token-level IoU: do retrieved chunks overlap with the ground-truth answer spans at the token level, not just the document level?
  • End-to-end accuracy: does the final answer match the reference answer, not just the retrieved chunks?

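Token-level IoU in particular is cheap to compute. A sketch, assuming the retrieved chunk and the ground-truth answer are each given as (start, end) token offsets into the same document; the multi-chunk generalization unions the token sets first.

```python
def token_iou(retrieved_span, answer_span):
    """Token-level intersection-over-union between a retrieved chunk and
    the ground-truth answer span, both as (start, end) token offsets."""
    (r0, r1), (a0, a1) = retrieved_span, answer_span
    inter = max(0, min(r1, a1) - max(r0, a0))
    union = (r1 - r0) + (a1 - a0) - inter
    return inter / union if union else 0.0
```

A 512-token chunk that fully contains a 10-token answer scores only about 0.02 on this metric, which is how a strategy can show high recall while wasting most of the context window on irrelevant tokens.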
Running only retrieval metrics against a development corpus is how teams ship systems with 90%+ benchmark recall and 50% real-world answer accuracy. The gap closes when evaluation includes generation metrics on representative document types from the actual production corpus.

What Actually Works in Production

There is no universally optimal chunk size. The research literature agrees on this: 512 tokens works for earnings documents, 1024 tokens works better for financial QA, page-level works for mixed-layout PDFs. The right answer depends on document structure, query distribution, and embedding model.

The defensible defaults that hold across most production workloads:

  • Recursive splitting at 512 tokens with 10-25% overlap wins not by being smart but by being structurally safe. It beats semantic chunking on end-to-end accuracy in the FloTorch benchmark, costs nothing at ingestion, and produces predictable index sizes.
  • Treat tables and code as atomic units. Never split mid-structure. Detect content type at ingestion and route to specialized parsers.
  • Use parent-document retrieval for sections. Retrieve at small chunk granularity for precision; expand to the parent section before generation for context.
  • Evaluate with generation metrics, not just retrieval metrics. Faithfulness and answer accuracy on a held-out dev set representative of your actual document corpus. Run it before you choose a chunking strategy, not after you've already indexed 500K documents.
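The first default above reduces to a sliding token window. In this sketch, `tokens` is any pre-tokenized list, and the 64-token overlap works out to 12.5% of a 512-token chunk, inside the 10-25% band; both numbers are starting points to tune, not prescriptions.

```python
def split_with_overlap(tokens, chunk_size=512, overlap=64):
    """Fixed-size token windows with overlap: the 'structurally safe'
    default. The tail of each chunk repeats at the head of the next,
    so sentences at a boundary appear intact in at least one chunk."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

This is deliberately dumb: it knows nothing about sentences or sections, but its failure modes are bounded and predictable, which is precisely why it holds up in the benchmarks cited above.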

The 2025 conclusion from the "Reconstructing Context" paper is accurate and worth repeating: neither late chunking nor contextual retrieval definitively solves context preservation in RAG systems. Both involve meaningful trade-offs between computational cost and retrieval quality.

Chunking is not a solved problem packaged inside your framework's TextSplitter. It is an active design decision that determines whether your RAG system gives accurate answers or confident-sounding wrong ones. The choice deserves the same engineering rigor as schema design or API contract definition — not a default parameter left in place because the tutorial didn't mention changing it.
