
Corpus Curation at Scale: Why Your RAG Quality Ceiling Is Your Document Quality Floor

· 10 min read
Tian Pan
Software Engineer

There's a belief embedded in most RAG architectures that goes something like this: if retrieval returns the right chunks, the LLM will produce correct answers. Teams invest heavily in embedding model selection, hybrid retrieval strategies, and reranking pipelines. Then, three months after deploying to production, answer quality quietly degrades — not because the model changed, not because query patterns shifted dramatically, but because the underlying corpus rotted.

Enterprise RAG implementations fail at a roughly 40% rate, and the failure mode that practitioners underestimate most isn't hallucination or poor retrieval recall. It's document quality. One analysis found that a single implementation improved search accuracy from 62% to 89% by introducing document quality scoring — with no changes to the embedding model or retrieval algorithm. The corpus was the variable. The corpus was always the variable.

The Garbage-In Problem Is More Subtle Than It Sounds

"Garbage in, garbage out" implies that bad documents produce obviously wrong answers. In practice, the failure mode is more insidious. Bad documents produce confidently wrong answers that pass basic retrieval metrics.

Consider how embedding similarity works: a document from 18 months ago that covered the same topic as a current user query will score nearly identically to a current document in cosine distance. The vector space has no concept of time, accuracy, or authority. A deprecated API reference and its replacement live at essentially the same distance from the query "how do I authenticate with this service." The retriever has no basis to prefer one over the other.

The same applies to authorship. A well-written internal document and a stale copy of that document with two conflicting policy edits will produce similar embeddings. Retrieval surfaces both. The LLM attempts synthesis and produces a compromised answer that's wrong in ways that are difficult to attribute.

This is the more precise version of the garbage-in principle: low-quality documents don't produce garbage outputs proportionally; they introduce systematic, low-variance errors that look like correct behavior until someone cross-references the answer against ground truth.

Scoring Documents for Retrieval Suitability

Before a document enters your index, it should be evaluated on multiple dimensions. Not all of these dimensions are equally tractable, but skipping them entirely is how teams end up with poisoned corpora.

Information density is the ratio of semantically distinct claims per unit of text. A 2,000-word procedural runbook with step-by-step instructions scores high; a 500-word FAQ where 400 words are introductory boilerplate scores low. Sparse documents waste context window space and dilute the signal-to-noise ratio of retrieved chunks. You can approximate this with compression ratio — high information density text compresses less aggressively than repetitive boilerplate.
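The compression-ratio approximation can be sketched in a few lines. This is an illustrative heuristic, not a calibrated metric: zlib is used as the compressor, and any threshold you set on the ratio is a per-corpus tuning decision.

```python
import zlib

def density_score(text: str) -> float:
    """Approximate information density via compression ratio.

    Repetitive boilerplate compresses aggressively and scores low;
    dense, varied text compresses poorly and scores high.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    compressed = zlib.compress(raw, level=9)
    return len(compressed) / len(raw)

# A 2,000-character page of repeated boilerplate scores far lower
# than a short, information-dense instruction.
boilerplate = "Welcome to our FAQ. We hope you find it helpful. " * 40
dense = "Rotate the API key, then update the deploy secret stored in the vault."
```

Note that very short texts score above 1.0 because of compression header overhead; the score is meaningful for ranking documents against each other, not as an absolute quantity.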

Chunking hostility describes how well a document's structure survives the chunking process. PDFs are the canonical example: PyPDF parses documents by the storage order of characters rather than reading order, which produces chaotic output for any multi-column or table-heavy document. When a table is chunked mid-row, the resulting chunk contains a fragment with no interpretable meaning. Scanned PDFs with OCR errors compound this — OCR error rates of 20% or higher remain common, and when those errors enter a chunking pipeline, they don't distribute uniformly; they cluster at boundaries, destroying the most retrieval-relevant text.
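One rough way to estimate chunking hostility before ingest is to simulate naive fixed-size chunking and measure how often chunk boundaries split a sentence. This is an assumption-laden proxy (real pipelines chunk on tokens and use overlap), but it separates well-structured prose from table dumps and OCR soup cheaply.

```python
def chunk_hostility(text: str, chunk_size: int = 500) -> float:
    """Fraction of fixed-size chunk boundaries that fall mid-sentence.

    Documents whose structure survives naive chunking score near 0;
    table-heavy or run-on extracted text scores near 1. Illustrative
    heuristic only: production chunkers split on tokens, not chars.
    """
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if len(chunks) <= 1:
        return 0.0
    # The final chunk has no boundary after it, so exclude it.
    bad = sum(
        1 for c in chunks[:-1]
        if not c.rstrip().endswith((".", "!", "?", ":"))
    )
    return bad / (len(chunks) - 1)
```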

Authorship ambiguity matters because the LLM cannot weight sources it doesn't know exist. If your corpus contains a mix of authoritative primary sources and lightly edited secondary summaries of those sources, the retriever will sometimes return the summary in preference to the original. The summary may contain subtle reframings or errors. Tracking provenance at ingest time — original source URL, last verified date, author or team — lets you apply authority weighting at retrieval time.

Cross-document contradiction is the hardest problem to score at ingest, because it requires comparing a document against others already in the index. A practical heuristic: when re-indexing an updated document, run NLI-based contradiction detection against the top-5 semantically similar existing documents. Flag pairs whose contradiction probability exceeds a threshold. This catches the most dangerous case — an update that contradicts the document it replaced, where both versions remain in the index.
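The re-index check reduces to a small loop. The sketch below assumes two placeholders you would swap for real components: `index.nearest(text, k)` for your vector store's similarity search, and `nli_score(premise, hypothesis)` for an NLI model's contradiction probability. The toy scorer included here exists only to make the sketch runnable.

```python
def flag_contradictions(new_doc, index, nli_score, top_k=5, threshold=0.5):
    """On re-index, compare new_doc against its top-k nearest neighbours
    and return pairs whose contradiction score crosses the threshold.
    Flagged pairs go to a human review queue, not automatic removal.
    """
    flagged = []
    for neighbor in index.nearest(new_doc, top_k):
        score = nli_score(neighbor, new_doc)  # P(contradiction), per your NLI model
        if score >= threshold:
            flagged.append((neighbor, score))
    return flagged

class ListIndex:
    """Stand-in for a vector store; no real ANN search here."""
    def __init__(self, docs):
        self.docs = docs
    def nearest(self, text, k):
        return self.docs[:k]

def toy_nli(premise, hypothesis):
    """Placeholder scorer: flags a bare negation mismatch. A real NLI
    model (e.g. a cross-encoder) replaces this entirely."""
    return 1.0 if ("not " in premise) != ("not " in hypothesis) else 0.0
```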

Deduplication That Preserves Semantic Coverage

The naive deduplication instinct is to remove near-duplicates. This is wrong for corpora where the same information exists across multiple document types. A product specification, the support documentation derived from it, and a customer-facing FAQ may all say nearly the same thing, but they serve different query intents. Removing all but one collapses coverage for queries that match specific phrasings in the deduplicated documents.

The correct framing is coverage-preserving deduplication: remove documents that add zero marginal coverage relative to existing indexed content, while preserving documents that cover the same topic with distinct vocabulary, framing, or level of detail.

At scale, this requires approximate methods. MinHash LSH is the standard approach for near-duplicate detection at tens of millions of documents — it estimates Jaccard similarity without computing pairwise distances, reducing the quadratic cost of exact comparison. The practical threshold sits around 0.85 Jaccard similarity for strict deduplication; below that, the documents typically differ enough to justify both.
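To make the mechanics concrete, here is a minimal from-scratch sketch of MinHash with banded LSH over character shingles. Production systems use a library (datasketch is a common choice) with tuned permutation counts and band/row splits matched to the target threshold; the parameters below are illustrative, not recommendations.

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles; word shingles work equally well."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(shingle_set, num_perm=64):
    """One min-hash value per seeded hash function."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(), "big")
            for s in shingle_set))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ≈ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_buckets(sig, bands=32):
    """Band the signature; docs sharing any band bucket are candidates.
    Band/row choice controls the effective similarity threshold."""
    rows = len(sig) // bands
    return [tuple([b] + sig[b * rows:(b + 1) * rows]) for b in range(bands)]

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumped over the lazy dog"))
c = minhash(shingles("completely unrelated text about invoices"))
```

Near-duplicates land in at least one shared bucket and get an exact (or embedding-based) comparison; everything else is never compared pairwise, which is where the quadratic cost disappears.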

For semantic deduplication (paraphrases and rewrites rather than near-copies), dense embedding comparison is more accurate but computationally expensive. A common production pattern runs MinHash LSH as a first-pass filter, then applies embedding-based comparison only within MinHash-identified candidate pairs. At corpus sizes above 10 million documents, naive pairwise embedding comparison becomes infeasible — the MinHash pass is what makes the embedding pass tractable.

One important nuance: deduplication decisions made at ingest time persist until the next full reindex. If you remove a document that appeared redundant at ingest and later the document it was redundant with gets deleted, you've lost coverage permanently. Log deduplication decisions with references to the documents that caused each removal, so they can be reinstated when the primary document is removed.
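The reinstatement logic is trivial if the log exists, and impossible if it doesn't. A minimal sketch of the bookkeeping, assuming string document ids and in-memory state (a real system would persist this alongside the index):

```python
class DedupLog:
    """Record which kept document justified each removal, so removals
    can be reversed when the kept document is later deleted."""

    def __init__(self):
        self.removed_because = {}  # removed_id -> kept_id

    def record(self, removed_id, kept_id):
        self.removed_because[removed_id] = kept_id

    def reinstate_on_delete(self, deleted_id):
        """Return ids of documents whose removal was justified by
        deleted_id; these should be re-ingested into the index."""
        to_restore = [r for r, k in self.removed_because.items() if k == deleted_id]
        for r in to_restore:
            del self.removed_because[r]
        return to_restore
```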

Quality-Weighted Indexing

Most vector stores support metadata filtering, but teams rarely extend this to retrieval-time quality weighting. The pattern that consistently outperforms pure similarity ranking is combining semantic similarity score with a quality weight at retrieval time.

A simple version: final_score = (semantic_similarity × 0.7) + (freshness_score × 0.3). The freshness score is 1.0 for documents updated within their freshness class window, decaying linearly to 0 as they age beyond it. This doesn't require changing your embedding model or retrieval architecture — it's a post-retrieval reranking step applied before passing chunks to the LLM.

Document freshness classes need to be assigned at ingest based on content type. API references decay in weeks; architectural design docs decay in months; foundational concepts may remain valid for years. Applying a single freshness window uniformly either discards still-valid old content or retains dangerously stale new content. The classification should be as coarse as needed to be maintainable: three to five classes covering most document types is sufficient.
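The freshness-weighted formula above, combined with per-class windows, fits in a short reranking helper. The class names, window lengths, and the 2× grace period for the linear decay are all assumptions to tune against your own corpus:

```python
from datetime import datetime, timedelta

# Illustrative freshness classes; windows are assumptions, not prescriptions.
FRESHNESS_WINDOWS = {
    "api_reference": timedelta(weeks=4),
    "design_doc": timedelta(days=180),
    "concepts": timedelta(days=730),
}

def freshness_score(last_updated, doc_class, now=None, grace=2.0):
    """1.0 inside the class window, decaying linearly to 0.0 at
    grace × window (e.g. a 4-week class hits zero at 8 weeks)."""
    now = now or datetime.now()
    window = FRESHNESS_WINDOWS[doc_class]
    age = now - last_updated
    if age <= window:
        return 1.0
    overshoot = (age - window) / (window * (grace - 1.0))
    return max(0.0, 1.0 - overshoot)

def final_score(similarity, freshness, w_sim=0.7, w_fresh=0.3):
    """Post-retrieval rerank: blend semantic similarity with freshness."""
    return similarity * w_sim + freshness * w_fresh
```

Applied as a rerank over the top-k retrieved chunks, this leaves the embedding model and index untouched; only the ordering handed to the LLM changes.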

Quality weighting can also incorporate author authority, source canonicality (is this the primary source or a derived summary?), and retrieval history (documents consistently retrieved but ignored by the LLM in the final answer may warrant lower weight). The last signal is only available post-deployment but is highly predictive of retrieval waste.

The Corpus Hygiene Workflow Teams Don't Build Until Things Break

Most teams discover they need a corpus hygiene workflow when retrieval quality degrades on questions they've been monitoring and they cannot identify any model or infrastructure change that explains it. By then, the corpus has weeks or months of accumulated staleness, duplication, and quiet contradiction.

The operational model that prevents this treats corpus health as a reliability concern, not a cleanup task. Concretely, it consists of:

  • Ingest-time quality gates: Minimum information density threshold, chunking suitability check, required metadata fields. Documents that fail quality gates enter a review queue rather than the index.
  • Staleness audits on a per-class schedule: Documents in fast-decay classes (release notes, changelogs, support articles) get a re-verification trigger every two to four weeks. Documents in slow-decay classes get it every six months. A document that cannot be re-verified gets tagged as unverified in its metadata, and that tag is used to apply a freshness penalty at retrieval time.
  • Drift detection on held-out eval sets: Run a fixed set of representative queries monthly against the live corpus. Track answer quality scores over time. A drop that doesn't correlate with model changes is a corpus signal.
  • Contradiction monitoring on updates: When any document is updated or added, compare it against semantically similar existing documents for contradiction. Flag but don't automatically remove — the flagged pairs go to a human review queue.
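The first of those gates is the simplest to stand up. A sketch of an ingest-time quality gate, assuming documents arrive as dicts with a precomputed density score; the required metadata fields and the density threshold are illustrative choices, not a standard:

```python
REQUIRED_METADATA = {"source_url", "last_verified", "owner", "freshness_class"}
MIN_DENSITY = 0.3  # compression-ratio floor; an assumption to tune per corpus

def quality_gate(doc):
    """Route a document to the index or to a human review queue.

    `doc` carries 'text', a 'metadata' dict, and a precomputed
    'density' score. Returns (destination, list_of_problems).
    """
    problems = []
    missing = REQUIRED_METADATA - doc["metadata"].keys()
    if missing:
        problems.append(f"missing metadata: {sorted(missing)}")
    if doc["density"] < MIN_DENSITY:
        problems.append(f"density {doc['density']:.2f} below {MIN_DENSITY}")
    return ("index", []) if not problems else ("review_queue", problems)
```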

The ownership question is as important as the technical architecture. Data freshness rot is not primarily a technical failure — it's an ownership failure. Without an explicit owner for each document class, freshness monitoring becomes nobody's job, and hygiene degrades to a periodic emergency cleanup cycle rather than a continuous steady state.

What Metrics Actually Tell You

The metrics teams typically track — retrieval Precision@K, NDCG, answer relevance — are necessary but insufficient for diagnosing corpus quality problems. They tell you whether the retriever is returning relevant content, but they don't tell you whether the relevant content is accurate.

Useful corpus health metrics that complement standard retrieval metrics:

  • Staleness ratio: Percentage of indexed documents exceeding their freshness class threshold. Alert threshold at 15%, critical at 30%.
  • Contradiction pair count: Number of document pairs flagged for contradictory content in the last 30 days.
  • Quality gate rejection rate: Percentage of attempted ingest documents failing quality checks. A spike in rejections is an upstream signal (a source started generating lower-quality content) not just an ingest failure.
  • Per-query answer drift: Track a fixed eval set longitudinally. A 5-point drop in answer quality on time-sensitive questions without model changes indicates corpus staleness.
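The staleness ratio in particular is cheap to compute from the metadata you already attached at ingest. A minimal sketch, using the 15% alert and 30% critical thresholds from the list above:

```python
from datetime import datetime, timedelta

def staleness_ratio(docs, now):
    """Fraction of docs past their freshness-class window, plus an
    alert level. Each doc carries 'last_updated' (datetime) and
    'window' (timedelta) resolved from its freshness class."""
    if not docs:
        return 0.0, "ok"
    stale = sum(1 for d in docs if now - d["last_updated"] > d["window"])
    ratio = stale / len(docs)
    level = "critical" if ratio >= 0.30 else "alert" if ratio >= 0.15 else "ok"
    return ratio, level
```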

One data point worth internalizing: a corpus managing 1,000 documents can maintain sub-hour staleness with minimal operational overhead, but the same architecture at 100,000 documents without explicit freshness management operates with 12-hour staleness by default. The corpus hygiene debt is superlinear in corpus size.

The Floor Is Always the Documents

RAG improvements tend to follow a predictable roadmap: better embedding models, hybrid retrieval, cross-encoder reranking, and increasingly sophisticated query transformations. Each step produces measurable gains. But each step operates on the same underlying constraint: if the corpus contains stale, contradictory, or structurally unusable documents, every retrieval improvement surfaces those documents more efficiently.

The retrieval system's job is to find the best available document. It cannot make a bad document good. The quality ceiling of a RAG system is set at ingest time, by the documents you accept into the index and the metadata you attach to them. Building the operational workflow to maintain that ceiling — quality gates, freshness tracking, contradiction monitoring — is not glamorous infrastructure work. But it's the work that determines whether retrieval improvements translate into accuracy improvements, or whether they just surface the same bad documents faster.
