
Corpus Curation at Scale: Why Your RAG Quality Ceiling Is Your Document Quality Floor

· 10 min read
Tian Pan
Software Engineer

There's a belief embedded in most RAG architectures that goes something like this: if retrieval returns the right chunks, the LLM will produce correct answers. Teams invest heavily in embedding model selection, hybrid retrieval strategies, and reranking pipelines. Then, three months after deploying to production, answer quality quietly degrades — not because the model changed, not because query patterns shifted dramatically, but because the underlying corpus rotted.

Enterprise RAG implementations fail at a roughly 40% rate, and the failure mode that practitioners underestimate most isn't hallucination or poor retrieval recall. It's document quality. One analysis found that a single implementation improved search accuracy from 62% to 89% by introducing document quality scoring — with no changes to the embedding model or retrieval algorithm. The corpus was the variable. The corpus was always the variable.

The Garbage-In Problem Is More Subtle Than It Sounds

"Garbage in, garbage out" implies that bad documents produce obviously wrong answers. In practice, the failure mode is more insidious. Bad documents produce confidently wrong answers that pass basic retrieval metrics.

Consider how embedding similarity works: a document from 18 months ago that covered the same topic as a current user query will score nearly identically to a current document in cosine distance. The vector space has no concept of time, accuracy, or authority. A deprecated API reference and its replacement live at essentially the same distance from the query "how do I authenticate with this service." The retriever has no basis to prefer one over the other.

The same applies to authorship. A well-written internal document and a stale copy of that document with two conflicting policy edits will produce similar embeddings. Retrieval surfaces both. The LLM attempts synthesis and produces a compromised answer that's wrong in ways that are difficult to attribute.

This is the more precise version of the garbage-in principle: low-quality documents don't produce garbage outputs proportionally; they introduce systematic, low-variance errors that look like correct behavior until someone cross-references the answer against ground truth.

Scoring Documents for Retrieval Suitability

Before a document enters your index, it should be evaluated on multiple dimensions. Not all of these dimensions are equally tractable, but skipping them entirely is how teams end up with poisoned corpora.

Information density is the ratio of semantically distinct claims per unit of text. A 2,000-word procedural runbook with step-by-step instructions scores high; a 500-word FAQ where 400 words are introductory boilerplate scores low. Sparse documents waste context window space and dilute the signal-to-noise ratio of retrieved chunks. You can approximate this with compression ratio — high information density text compresses less aggressively than repetitive boilerplate.
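The compression-ratio approximation can be sketched in a few lines. This is an illustrative heuristic, not a calibrated metric: zlib is used as the compressor, and any threshold you set on the ratio is a per-corpus tuning decision.

```python
import zlib

def density_score(text: str) -> float:
    """Approximate information density via compression ratio.

    Repetitive boilerplate compresses aggressively and scores low;
    dense, varied text compresses poorly and scores high.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    compressed = zlib.compress(raw, level=9)
    return len(compressed) / len(raw)

# A 2,000-character page of repeated boilerplate scores far lower
# than a short, information-dense instruction.
boilerplate = "Welcome to our FAQ. We hope you find it helpful. " * 40
dense = "Rotate the API key, then update the deploy secret stored in the vault."
```

Note that very short texts score above 1.0 because of compression header overhead; the score is meaningful for ranking documents against each other, not as an absolute quantity.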

Chunking hostility describes how well a document's structure survives the chunking process. PDFs are the canonical example: PyPDF parses documents by the storage order of characters rather than reading order, which produces chaotic output for any multi-column or table-heavy document. When a table is chunked mid-row, the resulting chunk contains a fragment with no interpretable meaning. Scanned PDFs with OCR errors compound this — OCR error rates of 20% or higher remain common, and when those errors enter a chunking pipeline, they don't distribute uniformly; they cluster at boundaries, destroying the most retrieval-relevant text.
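One rough way to estimate chunking hostility before ingest is to simulate naive fixed-size chunking and measure how often chunk boundaries split a sentence. This is an assumption-laden proxy (real pipelines chunk on tokens and use overlap), but it separates well-structured prose from table dumps and OCR soup cheaply.

```python
def chunk_hostility(text: str, chunk_size: int = 500) -> float:
    """Fraction of fixed-size chunk boundaries that fall mid-sentence.

    Documents whose structure survives naive chunking score near 0;
    table-heavy or run-on extracted text scores near 1. Illustrative
    heuristic only: production chunkers split on tokens, not chars.
    """
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if len(chunks) <= 1:
        return 0.0
    # The final chunk has no boundary after it, so exclude it.
    bad = sum(
        1 for c in chunks[:-1]
        if not c.rstrip().endswith((".", "!", "?", ":"))
    )
    return bad / (len(chunks) - 1)
```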

Authorship ambiguity matters because the LLM cannot weight sources it doesn't know exist. If your corpus contains a mix of authoritative primary sources and lightly edited secondary summaries of those sources, the retriever will sometimes return the summary in preference to the original. The summary may contain subtle reframings or errors. Tracking provenance at ingest time — original source URL, last verified date, author or team — lets you apply authority weighting at retrieval time.

Cross-document contradiction is the hardest problem to score at ingest, because it requires comparing a document against others already in the index. A practical heuristic: when re-indexing an updated document, run NLI-based contradiction detection against the top-5 semantically similar existing documents. Flag pairs whose contradiction probability exceeds a threshold. This catches the most dangerous case — an update that contradicts the document it replaced, where both versions remain in the index.
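The re-index check reduces to a small loop. The sketch below assumes two placeholders you would swap for real components: `index.nearest(text, k)` for your vector store's similarity search, and `nli_score(premise, hypothesis)` for an NLI model's contradiction probability. The toy scorer included here exists only to make the sketch runnable.

```python
def flag_contradictions(new_doc, index, nli_score, top_k=5, threshold=0.5):
    """On re-index, compare new_doc against its top-k nearest neighbours
    and return pairs whose contradiction score crosses the threshold.
    Flagged pairs go to a human review queue, not automatic removal.
    """
    flagged = []
    for neighbor in index.nearest(new_doc, top_k):
        score = nli_score(neighbor, new_doc)  # P(contradiction), per your NLI model
        if score >= threshold:
            flagged.append((neighbor, score))
    return flagged

class ListIndex:
    """Stand-in for a vector store; no real ANN search here."""
    def __init__(self, docs):
        self.docs = docs
    def nearest(self, text, k):
        return self.docs[:k]

def toy_nli(premise, hypothesis):
    """Placeholder scorer: flags a bare negation mismatch. A real NLI
    model (e.g. a cross-encoder) replaces this entirely."""
    return 1.0 if ("not " in premise) != ("not " in hypothesis) else 0.0
```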

Deduplication That Preserves Semantic Coverage

The naive deduplication instinct is to remove near-duplicates. This is wrong for corpora where the same information exists across multiple document types. A product specification, the support documentation derived from it, and a customer-facing FAQ may all say nearly the same thing, but they serve different query intents. Removing all but one collapses coverage for queries that match specific phrasings in the deduplicated documents.

The correct framing is coverage-preserving deduplication: remove documents that add zero marginal coverage relative to existing indexed content, while preserving documents that cover the same topic with distinct vocabulary, framing, or level of detail.

At scale, this requires approximate methods. MinHash LSH is the standard approach for near-duplicate detection at tens of millions of documents — it estimates Jaccard similarity without computing pairwise distances, reducing the quadratic cost of exact comparison. The practical threshold sits around 0.85 Jaccard similarity for strict deduplication; below that, the documents typically differ enough to justify both.
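To make the mechanics concrete, here is a minimal from-scratch sketch of MinHash with banded LSH over character shingles. Production systems use a library (datasketch is a common choice) with tuned permutation counts and band/row splits matched to the target threshold; the parameters below are illustrative, not recommendations.

```python
import hashlib

def shingles(text, k=5):
    """Character k-shingles; word shingles work equally well."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(shingle_set, num_perm=64):
    """One min-hash value per seeded hash function."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "big")).digest(), "big")
            for s in shingle_set))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots ≈ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_buckets(sig, bands=32):
    """Band the signature; docs sharing any band bucket are candidates.
    Band/row choice controls the effective similarity threshold."""
    rows = len(sig) // bands
    return [tuple([b] + sig[b * rows:(b + 1) * rows]) for b in range(bands)]

a = minhash(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash(shingles("the quick brown fox jumped over the lazy dog"))
c = minhash(shingles("completely unrelated text about invoices"))
```

Near-duplicates land in at least one shared bucket and get an exact (or embedding-based) comparison; everything else is never compared pairwise, which is where the quadratic cost disappears.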

For semantic deduplication (paraphrases and rewrites rather than near-copies), dense embedding comparison is more accurate but computationally expensive. A common production pattern runs MinHash LSH as a first-pass filter, then applies embedding-based comparison only within MinHash-identified candidate pairs. At corpus sizes above 10 million documents, naive pairwise embedding comparison becomes infeasible — the MinHash pass is what makes the embedding pass tractable.

One important nuance: deduplication decisions made at ingest time persist until the next full reindex. If you remove a document that appeared redundant at ingest and later the document it was redundant with gets deleted, you've lost coverage permanently. Log deduplication decisions with references to the documents that caused each removal, so they can be reinstated when the primary document is removed.
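The reinstatement logic is trivial if the log exists, and impossible if it doesn't. A minimal sketch of the bookkeeping, assuming string document ids and in-memory state (a real system would persist this alongside the index):

```python
class DedupLog:
    """Record which kept document justified each removal, so removals
    can be reversed when the kept document is later deleted."""

    def __init__(self):
        self.removed_because = {}  # removed_id -> kept_id

    def record(self, removed_id, kept_id):
        self.removed_because[removed_id] = kept_id

    def reinstate_on_delete(self, deleted_id):
        """Return ids of documents whose removal was justified by
        deleted_id; these should be re-ingested into the index."""
        to_restore = [r for r, k in self.removed_because.items() if k == deleted_id]
        for r in to_restore:
            del self.removed_because[r]
        return to_restore
```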

Quality-Weighted Indexing

Most vector stores support metadata filtering, but teams rarely extend this to retrieval-time quality weighting. The pattern that consistently outperforms pure similarity ranking is combining semantic similarity score with a quality weight at retrieval time.

A simple version: final_score = (semantic_similarity × 0.7) + (freshness_score × 0.3). The freshness score is 1.0 for documents updated within their freshness class window, decaying linearly to 0 as they age beyond it. This doesn't require changing your embedding model or retrieval architecture — it's a post-retrieval reranking step applied before passing chunks to the LLM.

Document freshness classes need to be assigned at ingest based on content type. API references decay in weeks; architectural design docs decay in months; foundational concepts may remain valid for years. Applying a single freshness window uniformly either discards still-valid old content or retains dangerously stale new content. The classification should be as coarse as needed to be maintainable: three to five classes covering most document types is sufficient.
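The freshness-weighted formula above, combined with per-class windows, fits in a short reranking helper. The class names, window lengths, and the 2× grace period for the linear decay are all assumptions to tune against your own corpus:

```python
from datetime import datetime, timedelta

# Illustrative freshness classes; windows are assumptions, not prescriptions.
FRESHNESS_WINDOWS = {
    "api_reference": timedelta(weeks=4),
    "design_doc": timedelta(days=180),
    "concepts": timedelta(days=730),
}

def freshness_score(last_updated, doc_class, now=None, grace=2.0):
    """1.0 inside the class window, decaying linearly to 0.0 at
    grace × window (e.g. a 4-week class hits zero at 8 weeks)."""
    now = now or datetime.now()
    window = FRESHNESS_WINDOWS[doc_class]
    age = now - last_updated
    if age <= window:
        return 1.0
    overshoot = (age - window) / (window * (grace - 1.0))
    return max(0.0, 1.0 - overshoot)

def final_score(similarity, freshness, w_sim=0.7, w_fresh=0.3):
    """Post-retrieval rerank: blend semantic similarity with freshness."""
    return similarity * w_sim + freshness * w_fresh
```

Applied as a rerank over the top-k retrieved chunks, this leaves the embedding model and index untouched; only the ordering handed to the LLM changes.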

Quality weighting can also incorporate author authority, source canonicality (is this the primary source or a derived summary?), and retrieval history (documents consistently retrieved but ignored by the LLM in the final answer may warrant lower weight). The last signal is only available post-deployment but is highly predictive of retrieval waste.

The Corpus Hygiene Workflow Teams Don't Build Until Things Break

Most teams discover they need a corpus hygiene workflow when retrieval quality degrades on questions they've been monitoring and they cannot identify any model or infrastructure change that explains it. By then, the corpus has weeks or months of accumulated staleness, duplication, and quiet contradiction.

The operational model that prevents this treats corpus health as a reliability concern, not a cleanup task. Concretely, it consists of:

  • Ingest-time quality gates: Minimum information density threshold, chunking suitability check, required metadata fields. Documents that fail quality gates enter a review queue rather than the index.
  • Staleness audits on a per-class schedule: Documents in fast-decay classes (release notes, changelogs, support articles) get a re-verification trigger every two to four weeks. Documents in slow-decay classes get it every six months. A document that cannot be re-verified gets tagged as unverified in its metadata, and that tag is used to apply a freshness penalty at retrieval time.
  • Drift detection on held-out eval sets: Run a fixed set of representative queries monthly against the live corpus. Track answer quality scores over time. A drop that doesn't correlate with model changes is a corpus signal.
  • Contradiction monitoring on updates: When any document is updated or added, compare it against semantically similar existing documents for contradiction. Flag but don't automatically remove — the flagged pairs go to a human review queue.
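The first of those gates is the simplest to stand up. A sketch of an ingest-time quality gate, assuming documents arrive as dicts with a precomputed density score; the required metadata fields and the density threshold are illustrative choices, not a standard:

```python
REQUIRED_METADATA = {"source_url", "last_verified", "owner", "freshness_class"}
MIN_DENSITY = 0.3  # compression-ratio floor; an assumption to tune per corpus

def quality_gate(doc):
    """Route a document to the index or to a human review queue.

    `doc` carries 'text', a 'metadata' dict, and a precomputed
    'density' score. Returns (destination, list_of_problems).
    """
    problems = []
    missing = REQUIRED_METADATA - doc["metadata"].keys()
    if missing:
        problems.append(f"missing metadata: {sorted(missing)}")
    if doc["density"] < MIN_DENSITY:
        problems.append(f"density {doc['density']:.2f} below {MIN_DENSITY}")
    return ("index", []) if not problems else ("review_queue", problems)
```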

The ownership question is as important as the technical architecture. Data freshness rot is not primarily a technical failure — it's an ownership failure. Without an explicit owner for each document class, freshness monitoring becomes nobody's job, and hygiene degrades to a periodic emergency cleanup cycle rather than a continuous steady state.

What Metrics Actually Tell You

The metrics teams typically track — retrieval Precision@K, NDCG, answer relevance — are necessary but insufficient for diagnosing corpus quality problems. They tell you whether the retriever is returning relevant content, but they don't tell you whether the relevant content is accurate.

Useful corpus health metrics that complement standard retrieval metrics:

  • Staleness ratio: Percentage of indexed documents exceeding their freshness class threshold. Alert threshold at 15%, critical at 30%.
  • Contradiction pair count: Number of document pairs flagged for contradictory content in the last 30 days.
  • Quality gate rejection rate: Percentage of attempted ingest documents failing quality checks. A spike in rejections is an upstream signal (a source started generating lower-quality content) not just an ingest failure.
  • Per-query answer drift: Track a fixed eval set longitudinally. A 5-point drop in answer quality on time-sensitive questions without model changes indicates corpus staleness.
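The staleness ratio in particular is cheap to compute from the metadata you already attached at ingest. A minimal sketch, using the 15% alert and 30% critical thresholds from the list above:

```python
from datetime import datetime, timedelta

def staleness_ratio(docs, now):
    """Fraction of docs past their freshness-class window, plus an
    alert level. Each doc carries 'last_updated' (datetime) and
    'window' (timedelta) resolved from its freshness class."""
    if not docs:
        return 0.0, "ok"
    stale = sum(1 for d in docs if now - d["last_updated"] > d["window"])
    ratio = stale / len(docs)
    level = "critical" if ratio >= 0.30 else "alert" if ratio >= 0.15 else "ok"
    return ratio, level
```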

One data point worth internalizing: a corpus managing 1,000 documents can maintain sub-hour staleness with minimal operational overhead, but the same architecture at 100,000 documents without explicit freshness management operates with 12-hour staleness by default. The corpus hygiene debt is superlinear in corpus size.

The Floor Is Always the Documents

RAG improvements tend to follow a predictable roadmap: better embedding models, hybrid retrieval, cross-encoder reranking, and increasingly sophisticated query transformations. Each step produces measurable gains. But each step operates on the same underlying constraint: if the corpus contains stale, contradictory, or structurally unusable documents, every retrieval improvement surfaces those documents more efficiently.

The retrieval system's job is to find the best available document. It cannot make a bad document good. The quality ceiling of a RAG system is set at ingest time, by the documents you accept into the index and the metadata you attach to them. Building the operational workflow to maintain that ceiling — quality gates, freshness tracking, contradiction monitoring — is not glamorous infrastructure work. But it's the work that determines whether retrieval improvements translate into accuracy improvements, or whether they just surface the same bad documents faster.
