Document Extraction Is Your RAG System's Hidden Ceiling
A compliance contractor builds a RAG system to answer questions against a 400-page policy document. The system passes internal QA. It retrieves correctly on single-topic queries. Then it goes live and starts returning confident, well-structured, wrong answers on anything involving exception clauses.
The debugging loop looks familiar: swap the embedding model, tune similarity thresholds, experiment with chunk sizes, add a reranker. Weeks pass. The improvement is marginal. The real problem is that a key exception clause was split across two chunks at a paragraph boundary — not because of chunking strategy, but because the PDF extractor silently broke the paragraph in two when it misread the layout. Neither chunk, in isolation, is retrievable or interpretable. The system cannot hallucinate its way to a correct answer because the correct information never entered the index cleanly.
This is the extraction ceiling: the point beyond which no downstream optimization can compensate for corrupted or missing input data.
What PDFs Actually Are
PDFs are a display format, not a data format. The file stores character placement instructions — coordinates on a page — rather than semantic structure. A two-column academic paper is not two ordered streams of text; it is a collection of character positions that happen to visually form columns when rendered. The PDF format has no concept of "this text comes before that text" in a semantic sense.
This creates a fundamental mismatch with what RAG pipelines need: logically ordered, structure-preserving plain text. Extraction tools must reconstruct semantic order from spatial coordinates, and they fail in predictable patterns.
Column merging. Standard parsers read across the full page width before moving down a line. On a two-column paper, this interleaves left- and right-column text line by line. The output looks like English sentences — they are just the wrong sentences in the wrong sequence. PyMuPDF in its default mode parses "by the storage order of characters instead of their reading order," which produces chaotic results on complex layouts.
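A minimal sketch of the mitigation, using PyMuPDF: `get_text("words")` returns each word with its bounding-box coordinates, which is exactly the spatial data the format actually stores. The two-column split at the page midline is an assumption for illustration only; production layouts need real layout analysis, and `paper.pdf` is a placeholder path.

```python
# Sketch: column-aware reading order from PyMuPDF word coordinates.
# Assumes a simple two-column layout split at the page midline.
import fitz  # PyMuPDF

def extract_two_column_page(page: fitz.Page) -> str:
    midline = page.rect.width / 2  # assumption: columns split at the midline
    words = page.get_text("words")  # (x0, y0, x1, y1, word, block, line, word_no)
    left = [w for w in words if w[0] < midline]
    right = [w for w in words if w[0] >= midline]
    # Read each column top-to-bottom (rounding y groups words on a baseline),
    # then emit the left column before the right column.
    ordered = (sorted(left, key=lambda w: (round(w[1]), w[0]))
               + sorted(right, key=lambda w: (round(w[1]), w[0])))
    return " ".join(w[4] for w in ordered)

doc = fitz.open("paper.pdf")
page = doc[0]
print(page.get_text())                # default: storage order, may interleave columns
print(extract_two_column_page(page))  # column-aware reading order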
Table flattening. Converting a table to sequential text destroys row-column relationships. A financial table with merged cells, multi-level headers, and subgroup rows becomes a stream of numbers with no positional context. In a benchmark evaluating complex table extraction, LlamaParse misplaced column values from a 48-cell sustainability report table — capturing the right data but attributing it to the wrong dimensions, rendering the table uninterpretable. Unstructured on the same document suffered column-shift errors that made entire regional breakdowns unusable.
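One defensive pattern is to extract tables as structured rows and serialize them explicitly (here as Markdown) instead of letting them flatten into the text stream. This sketch uses pdfplumber's `extract_tables()`; `report.pdf` is a placeholder, and merged cells or multi-level headers will still need manual review.

```python
# Sketch: pull tables out as structured rows instead of flattened text.
# pdfplumber infers cell boundaries from ruling lines and word alignment.
import pdfplumber

def tables_as_markdown(pdf_path: str) -> list[str]:
    rendered = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                rows = [[(cell or "").strip() for cell in row] for row in table]
                if not rows:
                    continue
                header, body = rows[0], rows[1:]
                lines = ["| " + " | ".join(header) + " |",
                         "| " + " | ".join("---" for _ in header) + " |"]
                lines += ["| " + " | ".join(row) + " |" for row in body]
                rendered.append("\n".join(lines))
    return rendered

for md in tables_as_markdown("report.pdf"):  # placeholder path
    print(md)
```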
Section hierarchy collapse. Without heading detection, a 50-page document is flattened to undifferentiated body text. Heading styles — larger font, bold, specific positioning — are visual hints that PDF stores as raw character attributes. Tools that do not perform layout analysis cannot distinguish an H2 from a paragraph that happens to begin with a large word.
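A rough heuristic for recovering hierarchy is to treat font-size outliers as heading candidates. The sketch below assumes headings are set noticeably larger than the body median; the 1.2 ratio is an illustrative threshold, and documents that signal headings by bold weight or position alone need additional rules.

```python
# Sketch: recover heading candidates from font-size statistics with PyMuPDF.
import statistics
import fitz  # PyMuPDF

def heading_candidates(page: fitz.Page, ratio: float = 1.2) -> list[tuple[float, str]]:
    spans = [
        span
        for block in page.get_text("dict")["blocks"] if block.get("lines")
        for line in block["lines"]
        for span in line["spans"] if span["text"].strip()
    ]
    if not spans:
        return []
    body_size = statistics.median(span["size"] for span in spans)
    # Anything rendered at >= ratio * body size is treated as a heading.
    return [(span["size"], span["text"].strip())
            for span in spans if span["size"] >= ratio * body_size]

doc = fitz.open("manual.pdf")  # placeholder path
for size, text in heading_candidates(doc[0]):
    print(f"{size:5.1f}pt  {text}")
```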
Scanned PDF corruption. Searchable PDFs layer OCR text over scanned images. Lighting conditions, rotation, skew, and handwritten annotations all degrade OCR quality before extraction begins. The text records may be split at arbitrary locations. Characters stored as curves rather than text glyphs are often missed entirely.
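Before trusting embedded text, it is worth checking whether a usable text layer exists at all and routing bare scans to OCR. A sketch, assuming PyMuPDF for rendering and pytesseract as the OCR engine; the 20-character threshold is an arbitrary illustration to tune per corpus.

```python
# Sketch: fall back to OCR when a page has no usable embedded text layer.
import io
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def page_text_with_ocr_fallback(page: fitz.Page, min_chars: int = 20) -> str:
    text = page.get_text().strip()
    if len(text) >= min_chars:
        return text  # embedded text layer looks usable
    # Render the page to an image at 300 dpi and OCR it.
    pix = page.get_pixmap(dpi=300)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(img)
```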
The dangerous property of all four failure modes: none of them raise exceptions. The extraction pipeline returns text, embeddings are computed, similarity search retrieves chunks, and the system produces wrong answers confidently.
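Silent failure is cheap to convert into a loud one. A few per-page sanity checks, with thresholds that are illustrative assumptions to tune per corpus, will flag most of the failure modes above before anything reaches the index:

```python
# Sketch: cheap per-page sanity checks that surface extraction damage
# before indexing. All thresholds are illustrative assumptions.
import re

def extraction_warnings(page_text: str) -> list[str]:
    warnings = []
    if len(page_text.split()) < 30:
        warnings.append("suspiciously little text (scanned or vector-only page?)")
    alpha = sum(ch.isalpha() for ch in page_text)
    if page_text and alpha / len(page_text) < 0.5:
        warnings.append("low alphabetic ratio (table flattened into numbers?)")
    if re.search(r"(\b\w{1,2}\b\s+){8,}", page_text):
        warnings.append("run of fragmented tokens (broken text records?)")
    return warnings
```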
Why You Keep Blaming the Embedding Model
When a RAG system underperforms, the typical debugging path jumps to the retrieval layer: the embedding model, the similarity threshold, the chunk overlap ratio, the reranker. These are all downstream of extraction. If the relevant information was never cleanly captured, no retrieval improvement can surface it.
The empirical evidence on this is striking. Across a benchmark of 800+ documents, parsers achieve 74% text accuracy but only 35% structure preservation — with a correlation of just 0.174 between these two metrics. High character-level accuracy tells you almost nothing about whether structure survived. An extraction output can score well on BLEU and edit similarity while destroying every table, merging all columns, and removing all heading context.
This gap explains a common frustration: switching embedding models yields marginal gains. The problem is not the representation; it is what is being represented. In a controlled study of 302 RAG questions, improved PDF parsing (deep learning-based rather than rule-based) outperformed the baseline on 47% of questions, tied on 38%, and lost on only 15%. The improvement on nearly half the questions was attributable entirely to better extraction, with no changes to retrieval or generation.
A second compound effect is embedding drift. When you change your parser or chunking logic — even subtly — documents get re-processed but embeddings from the original chunks may persist in the vector store. The same document produces structurally different chunks under different extraction runs, but the old vectors remain indexed. Developers observe degraded retrieval recall over time, attribute it to model drift, and try a new embedding model. The real cause is a version mismatch between what was extracted and what is currently in the index.
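One guard is to stamp every vector with the parser version and a content hash at ingestion time, then treat any mismatch as stale. A sketch using a generic metadata dict; the version string and helper names are illustrative, not any particular vector store's API.

```python
# Sketch: version-stamp chunks so stale embeddings can be detected and
# purged after any extraction or chunking change.
import hashlib

PARSER_VERSION = "pymupdf-1.24+column-fix-v2"  # bump on any extraction change

def chunk_metadata(doc_id: str, chunk_text: str) -> dict:
    return {
        "doc_id": doc_id,
        "parser_version": PARSER_VERSION,
        "content_hash": hashlib.sha256(chunk_text.encode()).hexdigest(),
    }

# At maintenance time, anything indexed under an older parser_version
# is a candidate for re-extraction and re-embedding:
def is_stale(metadata: dict) -> bool:
    return metadata.get("parser_version") != PARSER_VERSION
```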
The Diagnostic Framework: Isolating Extraction From Retrieval
The correct evaluation approach tests each layer in isolation. Do not run end-to-end evals until you have established what each layer contributes independently.
Layer 1: Extraction audit. Before you touch embeddings, build a golden extraction set. Sample 50–100 pages representative of your hard cases: tables, multi-column layouts, scanned pages, footnotes, nested lists. Manually annotate the correct text and structure. Run your parser against the same pages and measure character-level F1, n-gram overlap (BLEU-4), and structural element recovery rate — specifically, were all tables found, and did headings survive?
This step is skipped in almost every RAG project. Engineers jump straight to indexing and assume extraction worked because no exceptions were thrown. The golden set catches the silent failures.
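A sketch of the audit metrics, assuming golden text and annotated structural elements per page: character F1 here is computed from difflib matching blocks, BLEU-4 via NLTK's `sentence_bleu` (whose default weights are 4-gram), and structure recovery as the fraction of annotated elements that survive into the parser output.

```python
# Sketch: score a parser's output against a manually annotated golden page.
from difflib import SequenceMatcher
from nltk.translate.bleu_score import sentence_bleu

def char_f1(gold: str, extracted: str) -> float:
    if not gold or not extracted:
        return 0.0
    matched = sum(b.size for b in
                  SequenceMatcher(None, gold, extracted).get_matching_blocks())
    p, r = matched / len(extracted), matched / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def bleu4(gold: str, extracted: str) -> float:
    return sentence_bleu([gold.split()], extracted.split())  # default is BLEU-4

def structure_recovery(gold_elements: list[str], extracted_text: str) -> float:
    # gold_elements: annotated headings / table captions that must survive
    found = sum(el in extracted_text for el in gold_elements)
    return found / len(gold_elements) if gold_elements else 1.0
```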
Layer 2: Retrieval evaluation. With a known-good context set, create a golden Q&A set where you know which specific chunks should be retrieved for each question. Measure context recall (are relevant chunks in the top-k?) and context precision (are irrelevant chunks crowding them out?). This distinguishes retrieval algorithm failures from extraction failures.
The diagnostic signal: if context recall is low, audit your golden extraction set first. Only if extraction looks clean should you investigate embedding model choice, chunk size, or retrieval parameters.
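The two retrieval metrics reduce to a few lines once each question is annotated with the chunk IDs it should retrieve. A sketch, with hypothetical ID lists as inputs:

```python
# Sketch: context recall / precision at k against a golden Q&A set.
# Each question maps to the set of chunk IDs that should be retrieved.
def context_recall_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids) if gold_ids else 1.0

def context_precision_at_k(retrieved_ids: list[str], gold_ids: set[str], k: int) -> float:
    return sum(cid in gold_ids for cid in retrieved_ids[:k]) / k if k else 0.0

# Diagnostic rule from the framework above: if recall is low across the
# golden set, audit extraction (Layer 1) before touching retrieval knobs.
```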
- https://www.applied-ai.com/briefings/pdf-parsing-benchmark/
- https://arxiv.org/html/2410.09871v1
- https://arxiv.org/html/2401.12599v1
- https://arxiv.org/html/2401.05856v1
- https://unstract.com/blog/pdf-hell-and-practical-rag-applications/
- https://towardsdatascience.com/your-chunks-failed-your-rag-in-production/
- https://unstructured.io/blog/unstructured-leads-in-document-parsing-quality-benchmarks-tell-the-full-story
- https://procycons.com/en/blogs/pdf-data-extraction-benchmark/
- https://codecut.ai/docling-vs-marker-vs-llamaparse/
- https://developer.nvidia.com/blog/approaches-to-pdf-data-extraction-for-information-retrieval/
- https://dev.to/aws-builders/rag-is-a-data-engineering-problem-disguised-as-ai-39b2
- https://medium.com/@dataenthusiast.io/rag-in-production-the-data-pipeline-nobody-talks-about-059106ded910
