
Corpus Architecture for RAG: The Indexing Decisions That Determine Quality Before Retrieval Starts

· 12 min read
Tian Pan
Software Engineer

When a RAG system returns the wrong answer, the post-mortem almost always focuses on the same suspects: the retrieval query, the similarity threshold, the reranker, the prompt. Teams spend days tuning these components while the actual cause sits untouched in the indexing pipeline. The failure happened weeks ago when someone decided on a chunk size.

Most RAG quality problems are architectural, not operational. They stem from decisions made at index time that silently shape what the LLM will ever be allowed to see. By the time a user complains, the retrieval system is doing exactly what it was designed to do — it's just that the design was wrong.

This is a guide to the index-time decisions that matter most, what the 2024–2026 benchmarks actually show, and the patterns that consistently distinguish high-quality RAG systems from ones that look fine in demos and fall apart in production.

The Indexing Trap: How Symptoms Migrate Downstream

The naive mental model of RAG treats retrieval as a query-time problem. You have a vector store, you send a query, you get chunks back. If the chunks are wrong, you tune the query. If the model hallucinates, you check the prompts.

The actual failure model is different. Index architecture creates a ceiling that no amount of query-time tuning can break through. If your chunks split a key passage mid-sentence, the embedding for each half is semantically weakened. If you never injected section headers as metadata, the reranker has no breadcrumbs to follow. If you chose the same chunk size for both narrative prose and financial tables, you've made both representations worse.

These failures compound quietly. A corrupted extraction from a PDF creates noisy chunks. Noisy chunks produce misleading embeddings. Misleading embeddings return irrelevant context. Irrelevant context causes hallucinations. The blame lands on "the model" — but the model is just reasoning over what you gave it.

The practical implication: instrument your indexing pipeline as aggressively as your query pipeline. Log chunk quality distributions. Test on representative queries from day one. Your eval ground truth should be grounded in real user queries, not synthetic ones built from documents you already indexed cleanly.
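A minimal sketch of what "log chunk quality distributions" can mean in practice. The function and field names here are illustrative, and token counts are approximated by whitespace-split words; in a real pipeline you would substitute your actual tokenizer.

```python
def chunk_quality_report(chunks):
    """Summarize a chunk set so indexing regressions show up in logs.

    `chunks` is a list of strings. Token counts are approximated by
    whitespace-split words (an assumption -- swap in your tokenizer).
    """
    lengths = sorted(len(c.split()) for c in chunks)
    n = len(lengths)

    def pct(p):
        return lengths[min(n - 1, int(p * n))]

    return {
        "count": n,
        "p50_tokens": pct(0.50),
        "p95_tokens": pct(0.95),
        # Very short chunks often signal extraction damage (stray headers,
        # page numbers, OCR noise) rather than real content.
        "under_20_tokens": sum(1 for length in lengths if length < 20),
    }

report = chunk_quality_report(["alpha beta gamma", "one two", "a " * 30])
```

Emitting a report like this on every index build gives you a baseline to diff against: a sudden spike in sub-20-token chunks usually means an upstream extraction change, not a retrieval problem.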

Chunk Size Is a Query-Type Decision

The most common index-time mistake is treating chunk size as a single parameter to tune globally. There is no universally optimal chunk size. The right size depends on the query distribution your system faces.

Factual, lookup-style queries perform best with small chunks (128–256 tokens). The smaller unit produces a more focused, specific embedding that matches tightly phrased questions. When a user asks "what is the capital gains tax rate for assets held under one year," you want a chunk that contains precisely that sentence, not one that also covers estate planning, gift taxes, and step-up basis.

Analytical queries — synthesis, comparison, multi-hop reasoning — need more context. A 512–1024 token chunk that spans a full section of a document gives the model enough connective tissue to reason across related ideas. Fragmenting that into 128-token pieces produces embeddings that no longer capture the conceptual relationships between them.

NVIDIA's 2024 benchmarks showed page-level chunking achieving the highest accuracy (64.8%) with the lowest variance. But this finding is narrow: it applies to well-structured paginated documents where a page genuinely represents a coherent semantic unit. Applied to long-form narrative text or FAQ documents, page-level chunking is arbitrary rather than principled.

The practical baseline: start with recursive character splitting at 400–512 tokens. A 2025 Chroma Research study found this approach delivers 85–90% recall without the computational overhead of semantic chunking. Adjust based on your actual query distribution after you have real traffic to measure.
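A sketch of the recursive splitting idea, assuming a rough 4-characters-per-token conversion (so ~500 tokens becomes a `max_chars` budget of ~2000): try the coarsest separator first and fall back to finer ones only when a piece is still too large. Libraries like LangChain ship a production version of this; the function below just makes the mechanism concrete.

```python
def recursive_split(text, max_chars=2000, separators=("\n\n", "\n", ". ", " ")):
    """Recursive character splitting: prefer paragraph breaks, then lines,
    then sentences, then words, so chunk boundaries track document structure."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, buf = [], ""
            for part in parts:
                candidate = buf + sep + part if buf else part
                if len(candidate) <= max_chars:
                    buf = candidate          # keep packing under the budget
                else:
                    if buf:
                        chunks.append(buf)
                    buf = part               # may still be oversized; see below
            if buf:
                chunks.append(buf)
            # Recurse on any piece one separator level couldn't shrink enough.
            return [c for chunk in chunks
                      for c in recursive_split(chunk, max_chars, separators)]
    # No separator present at all: hard-cut as a last resort.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

chunks = recursive_split("para one.\n\npara two is a bit longer.", max_chars=20)
```

The key property is that boundaries degrade gracefully: a chunk only ever breaks mid-sentence when no paragraph, line, or sentence boundary is available within the budget.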

The Overlap Paradox

The standard advice in RAG tutorials is to add overlap between chunks. The reasoning is intuitive: if a key passage straddles a chunk boundary, overlap ensures neither half is isolated.

A January 2026 systematic analysis using SPLADE retrieval over the Natural Questions dataset found that overlap provided no measurable benefit while adding indexing cost. The intuition fails for a specific reason: overlapping chunks are near-duplicates of each other, and when both are retrieved, they consume context window without providing additional evidence. You get redundancy, not coverage.

This doesn't mean overlap is always wrong. For OCR-processed text, scraped web pages, and other sources where whitespace and structure are unreliable, a 10–15% overlap can recover boundaries that naive splitting would otherwise destroy. For clean, well-structured documents, it often adds noise.

The practical rule: default to no overlap, measure retrieval quality on boundary-crossing queries, then add overlap only if you can observe improvement. Ten percent is the appropriate ceiling for most scenarios.
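The rule above is easy to encode as a chunker whose overlap defaults to zero. A minimal sketch over a pre-tokenized input (function name and toy integer "tokens" are illustrative):

```python
def window_chunks(tokens, size=400, overlap=0):
    """Fixed-size token windows. Overlap defaults to 0; raise it to ~10%
    only after measuring a gain on boundary-crossing queries."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap            # overlap shrinks the stride, not the window
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = list(range(10))
no_overlap = window_chunks(tokens, size=4)
with_overlap = window_chunks(tokens, size=4, overlap=1)
```

Note what overlap actually costs: with a 10% overlap the stride drops by 10%, so the index grows by roughly that fraction and near-duplicate neighbors appear at every boundary — which is exactly the redundancy the 2026 analysis measured.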

Parent-Child Hierarchies and the Small-to-Big Pattern

One of the most durable patterns in production RAG systems is the parent-child retrieval hierarchy. Documents are split into small child chunks (100–300 tokens) for retrieval and larger parent chunks (500–2000 tokens) for synthesis. The retrieval system fetches children because their compact embeddings encode more precise semantics. The generation step receives the parent because broader context produces better answers.

Why this works is worth understanding precisely. When you embed a 1500-token chunk that spans several sub-topics, the resulting vector is a weighted average across all of those topics. The embedding is vague. When you embed a 200-token sentence-window chunk, the vector captures a much more focused semantic unit — and retrieval similarity scores improve correspondingly.

At synthesis time, the narrow chunk is often insufficient. The model needs surrounding context to resolve pronouns, follow logical chains, and understand relationships between claims. The parent delivers that context.

This pattern is especially effective for technical documentation, research papers, and any corpus where documents have clear hierarchical structure (sections within chapters, subsections within sections). The hierarchy in the index mirrors the hierarchy in the source material, and retrieval benefits from that alignment.
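The mechanics of small-to-big can be sketched in a few lines. This toy version scores children by word overlap instead of embedding similarity (an assumption made purely to keep the example self-contained); the structural point — retrieve small, deduplicate by parent, synthesize big — is the same either way.

```python
def build_index(parents, child_size=3):
    """Split each parent into small word-window children, each keeping a
    back-pointer to the parent it came from."""
    index = []
    for pid, parent in enumerate(parents):
        words = parent.split()
        for i in range(0, len(words), child_size):
            index.append({"parent_id": pid,
                          "text": " ".join(words[i:i + child_size])})
    return index

def retrieve_parents(query, index, parents, k=2):
    """Score the small children (toy word-overlap stands in for embedding
    similarity), then return deduplicated parent chunks for synthesis."""
    q = set(query.lower().split())
    scored = sorted(index,
                    key=lambda c: -len(q & set(c["text"].lower().split())))
    seen, out = set(), []
    for child in scored[: k * 2]:
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            out.append(parents[child["parent_id"]])
    return out[:k]

parents = [
    "capital gains tax rates vary by holding period",
    "estate planning covers wills and trusts",
]
index = build_index(parents)
top = retrieve_parents("capital gains tax", index, parents, k=1)
```

The dedup-by-parent step matters: several sibling children often match the same query, and without it the synthesis context fills up with copies of one parent.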

Metadata Enrichment: The Decision Teams Consistently Skip

The single most underrated index-time decision is what metadata to attach to each chunk before it enters the vector store.

A content-only vector store enables one-dimensional retrieval: semantic similarity. A metadata-rich store enables multi-dimensional filtered search. You can retrieve by semantic similarity and source document, and publication date, and section type, and document classification. Each filter dimension you add changes "find similar" into "find relevant."

Snowflake's 2024 finance RAG benchmark showed metadata-enriched retrieval achieving 82.5% precision and an NDCG score of 0.813, substantially outperforming content-only baselines. The metadata that drove the gap included company name, filing date, form type, and section headers — all fields that would require the model to reason from content if not injected as structured fields.
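A sketch of what filtered search looks like structurally: exact-match metadata filters prune the candidate set first, and similarity ranks only the survivors. The field names (`form_type`, `company`) echo the filing-metadata example above but are illustrative, as is the in-memory list standing in for a vector store.

```python
import math

def filtered_search(query_vec, store, filters, top_k=3):
    """Metadata-filtered retrieval: apply exact-match filters, then rank
    the surviving chunks by cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a)) or 1.0
        nb = math.sqrt(sum(x * x for x in b)) or 1.0
        return dot / (na * nb)

    candidates = [c for c in store
                  if all(c["meta"].get(key) == val
                         for key, val in filters.items())]
    return sorted(candidates,
                  key=lambda c: -cosine(query_vec, c["vec"]))[:top_k]

store = [
    {"text": "a", "vec": [1.0, 0.0], "meta": {"form_type": "10-K", "company": "ACME"}},
    {"text": "b", "vec": [0.0, 1.0], "meta": {"form_type": "10-K", "company": "ACME"}},
    {"text": "c", "vec": [1.0, 0.0], "meta": {"form_type": "10-Q", "company": "ACME"}},
]
hits = filtered_search([1.0, 0.0], store, {"form_type": "10-K"})
```

Production vector stores (Qdrant, Weaviate, pgvector, and others) push these filters into the index itself, but the contract is the same: chunk "c" never competes here, no matter how similar its vector is, because its metadata rules it out before similarity is consulted.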
