
Corpus Architecture for RAG: The Indexing Decisions That Determine Quality Before Retrieval Starts

· 12 min read
Tian Pan
Software Engineer

When a RAG system returns the wrong answer, the post-mortem almost always focuses on the same suspects: the retrieval query, the similarity threshold, the reranker, the prompt. Teams spend days tuning these components while the actual cause sits untouched in the indexing pipeline. The failure happened weeks ago when someone decided on a chunk size.

Most RAG quality problems are architectural, not operational. They stem from decisions made at index time that silently shape what the LLM will ever be allowed to see. By the time a user complains, the retrieval system is doing exactly what it was designed to do — it's just that the design was wrong.

This is a guide to the index-time decisions that matter most, what the 2024–2026 benchmarks actually show, and the patterns that consistently distinguish high-quality RAG systems from ones that look fine in demos and fall apart in production.

The Indexing Trap: How Symptoms Migrate Downstream

The naive mental model of RAG treats retrieval as a query-time problem. You have a vector store, you send a query, you get chunks back. If the chunks are wrong, you tune the query. If the model hallucinates, you check the prompts.

The actual failure model is different. Index architecture creates a ceiling that no amount of query-time tuning can break through. If your chunks split a key passage down the middle, the embedding for each half is semantically weakened. If you never injected section headers as metadata, the reranker has no breadcrumbs to follow. If you chose the same chunk size for both narrative prose and financial tables, you've made both representations worse.

These failures compound quietly. A corrupted extraction from a PDF creates noisy chunks. Noisy chunks produce misleading embeddings. Misleading embeddings return irrelevant context. Irrelevant context causes hallucinations. Every blame lands on "the model" — but the model is just reasoning over what you gave it.

The practical implication: instrument your indexing pipeline as aggressively as your query pipeline. Log chunk quality distributions. Test on representative queries from day one. Your eval ground truth should be grounded in real user queries, not synthetic ones built from documents you already indexed cleanly.

Chunk Size Is a Query-Type Decision

The most common index-time mistake is treating chunk size as a single parameter to tune globally. There is no universally optimal chunk size. The right size depends on the query distribution your system faces.

Factual, lookup-style queries perform best with small chunks (128–256 tokens). The smaller unit produces a more focused, specific embedding that matches tightly phrased questions. When a user asks "what is the capital gains tax rate for assets held under one year," you want a chunk that contains precisely that sentence, not one that also covers estate planning, gift taxes, and step-up basis.

Analytical queries — synthesis, comparison, multi-hop reasoning — need more context. A 512–1024 token chunk that spans a full section of a document gives the model enough connective tissue to reason across related ideas. Fragmenting that into 128-token pieces produces embeddings with weaker representations of the conceptual relationship.

NVIDIA's 2024 benchmarks showed page-level chunking achieving the highest accuracy (64.8%) with the lowest variance. But this finding is narrow: it applies to well-structured paginated documents where a page genuinely represents a coherent semantic unit. Applied to long-form narrative text or FAQ documents, page-level chunking is arbitrary rather than principled.

The practical baseline: start with recursive character splitting at 400–512 tokens. A 2025 Chroma Research study found this approach delivers 85–90% recall without the computational overhead of semantic chunking. Adjust based on your actual query distribution after you have real traffic to measure.
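As a concrete sketch of that baseline, here is a minimal recursive splitter. Word count stands in for token count, and the function and parameter names are illustrative; a real pipeline would count tokens with the embedding model's own tokenizer:

```python
def recursive_split(text, max_tokens=512, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator present, merging pieces up to
    max_tokens; recurse with finer separators on any oversized piece.
    Token count is approximated by word count in this sketch."""
    def n_tokens(s):
        return len(s.split())  # crude proxy; swap in the model's tokenizer

    if n_tokens(text) <= max_tokens:
        return [text] if text.strip() else []

    for i, sep in enumerate(separators):
        if sep in text:
            merged, buf = [], ""
            for part in text.split(sep):
                candidate = buf + sep + part if buf else part
                if n_tokens(candidate) <= max_tokens:
                    buf = candidate
                else:
                    if buf:
                        merged.append(buf)
                    buf = part
            if buf:
                merged.append(buf)
            out = []
            for piece in merged:
                if n_tokens(piece) > max_tokens:
                    # no usable separator at this level; go one level finer
                    out.extend(recursive_split(piece, max_tokens, separators[i + 1:]))
                else:
                    out.append(piece)
            return out

    # no separator matched at all: hard-cut on whitespace words
    words = text.split()
    return [" ".join(words[j:j + max_tokens]) for j in range(0, len(words), max_tokens)]
```

The separator order encodes the structural prior: prefer paragraph breaks, then line breaks, then sentence ends, then words, which is why recursive splitting tends to respect document structure without any semantic analysis.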

The Overlap Paradox

The standard advice from RAG tutorials says to add overlap between chunks. The reasoning is intuitive: if a key passage straddles a chunk boundary, overlap ensures neither half is isolated.

A January 2026 systematic analysis using SPLADE retrieval over the Natural Questions dataset found that overlap provided no measurable benefit while adding indexing cost. The intuition fails for a specific reason: overlapping chunks are near-duplicates of each other, and when both are retrieved, they consume context window without providing additional evidence. You get redundancy, not coverage.

This doesn't mean overlap is always wrong. For OCR-processed text, scraped web pages, and other sources where whitespace and structure are unreliable, a 10–15% overlap can recover boundaries that naive splitting would otherwise destroy. For clean, well-structured documents, it often adds noise.

The practical rule: default to no overlap, measure retrieval quality on boundary-crossing queries, then add overlap only if you can observe improvement. Ten to fifteen percent is the appropriate ceiling for most scenarios.
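For the noisy-source case, overlap is a one-parameter change in a sliding-window chunker. A minimal sketch over a pre-tokenized document (names and defaults are illustrative):

```python
def window_chunks(tokens, size=512, overlap_frac=0.0):
    """Fixed-size sliding window over a token list. Default is no overlap;
    raise overlap_frac to ~0.10-0.15 only for OCR output or scraped pages
    where structural boundaries are unreliable."""
    overlap = int(size * overlap_frac)
    step = size - overlap  # overlap_frac < 1.0 keeps the step positive
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

Because overlapping windows are near-duplicates by construction, any retrieval eval on an overlapped index should deduplicate hits before scoring, or the redundancy problem described above will inflate apparent recall.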

Parent-Child Hierarchies and the Small-to-Big Pattern

One of the most durable patterns in production RAG systems is the parent-child retrieval hierarchy. Documents are split into small child chunks (100–300 tokens) for retrieval and larger parent chunks (500–2000 tokens) for synthesis. The retrieval system fetches children because their compact embeddings encode more precise semantics. The generation step receives the parent because broader context produces better answers.

Why this works is worth understanding precisely. When you embed a 1500-token chunk that spans several sub-topics, the resulting vector is a weighted average across all of those topics. The embedding is vague. When you embed a 200-token sentence-window chunk, the vector captures a much more focused semantic unit — and retrieval similarity scores improve correspondingly.

At synthesis time, the narrow chunk is often insufficient. The model needs surrounding context to resolve pronouns, follow logical chains, and understand relationships between claims. The parent delivers that context.

This pattern is especially effective for technical documentation, research papers, and any corpus where documents have clear hierarchical structure (sections within chapters, subsections within sections). The hierarchy in the index mirrors the hierarchy in the source material, and retrieval benefits from that alignment.
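The pattern reduces to two maps: child chunks carry a pointer to their parent, retrieval ranks children, and the generator receives parents. A minimal sketch, where `score_fn` is a stand-in for vector similarity and all names are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class ParentChildIndex:
    """Small-to-big retrieval: search small child chunks, return the
    larger parent chunk for synthesis."""
    parents: dict = field(default_factory=dict)   # parent_id -> full text
    children: list = field(default_factory=list)  # (parent_id, child_text)

    def add(self, parent_id, parent_text, child_size=40):
        self.parents[parent_id] = parent_text
        words = parent_text.split()
        for i in range(0, len(words), child_size):
            self.children.append((parent_id, " ".join(words[i:i + child_size])))

    def retrieve(self, query, score_fn, k=1):
        ranked = sorted(self.children, key=lambda c: score_fn(query, c[1]),
                        reverse=True)
        seen, hits = set(), []
        for pid, _ in ranked:
            if pid not in seen:  # one parent may win via several children
                seen.add(pid)
                hits.append(self.parents[pid])  # hand the PARENT to the LLM
            if len(hits) == k:
                break
        return hits
```

In production the children would live in a vector store and `score_fn` would be cosine similarity over their embeddings; the structure of the lookup is unchanged.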

Metadata Enrichment: The Decision Teams Consistently Skip

The single most underrated index-time decision is what metadata to attach to each chunk before it enters the vector store.

A content-only vector store enables one-dimensional retrieval: semantic similarity. A metadata-rich store enables multi-dimensional filtered search. You can retrieve by semantic similarity and source document, and publication date, and section type, and document classification. Each filter dimension you add changes "find similar" into "find relevant."

Snowflake's 2024 finance RAG benchmark showed metadata-enriched retrieval achieving 82.5% precision and an NDCG score of 0.813, substantially outperforming content-only baselines. The metadata that drove the gap included company name, filing date, form type, and section headers — all fields that would require the model to reason from content if not injected as structured fields.

The metadata fields that consistently matter:

  • Section headers and hierarchy. A chunk that says "the interest rate is 5%" is ambiguous. A chunk tagged with section: "Federal Funds Rate as of Q4 2025" is not. Header path acts as a breadcrumb trail that disambiguates both retrieval and generation.
  • Document identifiers and source URLs. Enables deduplication, provenance tracking, and confidence scoring based on source authority.
  • Chunk sequence index. Allows retrieval systems to fetch preceding and following chunks when a match is found — essential for reconstructing narrative coherence.
  • Document-level summary or abstract. Prepending a brief document description to each chunk (e.g., "This section is from a 2025 SEC filing discussing revenue projections for the semiconductor segment") resolves pronoun references and implicit context that would otherwise force the model to guess.
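These fields fit naturally into the chunk record itself. A sketch of what each chunk might carry into the store (field names are illustrative), with the header path and document summary prepended to the text that gets embedded:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    text: str
    doc_id: str                        # document identifier / source URL
    seq: int                           # chunk sequence index within the doc
    section_path: str                  # e.g. "Rates > Federal Funds Rate"
    doc_summary: Optional[str] = None  # document-level context, if generated

    def embedding_input(self) -> str:
        """Prepend the disambiguating context so both the embedding and the
        generator see it, not just the raw chunk text."""
        prefix = " | ".join(p for p in (self.doc_summary, self.section_path) if p)
        return f"{prefix}\n{self.text}" if prefix else self.text
```

The `seq` field is what makes neighbor expansion possible at query time, and `doc_id` is what makes deduplication and provenance possible; neither can be reconstructed later if it isn't captured at index time.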

A 2024 MDKeyChunker study found that LLM-generated multi-level metadata — including thematic clusters, associated entities, and generated Q&A pairs per chunk — improved downstream retrieval quality significantly. The additional cost at index time paid dividends at query time.

Structure-Aware Chunking: Tables and Code Are Not Prose

The single largest failure in corpus architecture for mixed-content documents is applying narrative chunking rules to non-narrative content.

Tables in a vector store require atomic treatment. A table split across two chunks is worse than useless — each chunk contains partial columns with no interpretable meaning. The embedding for a half-table bears no resemblance to any query a user would write. At minimum, tables should be treated as semantic units and chunked with their headers preserved. Better still: generate a natural-language description of the table's content and store both the description and the structured markdown together as a single chunk.

Code blocks have a parallel problem. Code semantics depend on indentation, variable scope, and structural completeness. A function split at an arbitrary character count produces an embedding that represents half a function — which matches nothing. Structure-aware splitting at function or class boundaries produces chunks whose embeddings are interpretable and retrievable. Several common text splitters also corrupt code by stripping whitespace or introducing unwanted newlines, which destroys the embedding even when the chunk boundaries were chosen correctly.

The practical pattern for heterogeneous documents: identify content type at extraction time, then route to type-specific chunking. Narrative prose uses recursive character splitting. Tables are extracted as atomic units with generated descriptions. Code uses AST-aware splitting at logical boundaries.
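A sketch of that routing step, with a deliberately crude regex classifier standing in for the structural metadata a real extractor would provide:

```python
import re

def classify_block(block: str) -> str:
    """Crude content-type detection for routing to type-specific chunkers.
    Real pipelines should use the extractor's structural metadata rather
    than regex heuristics like these."""
    if re.search(r"^\s*\|.+\|\s*$", block, re.MULTILINE):
        return "table"  # markdown pipe table
    if re.search(r"^\s*(def |class |function |#include|import )", block,
                 re.MULTILINE):
        return "code"
    return "prose"

def route_chunking(block: str, chunkers: dict) -> list:
    """chunkers maps content type -> chunking function, e.g. an atomic
    pass-through for tables, an AST-aware splitter for code, and a
    recursive character splitter for prose."""
    return chunkers[classify_block(block)](block)
```

The important property is the dispatch, not the classifier: tables route to a function that never splits them, code routes to a boundary-aware splitter, and only prose reaches the recursive splitter.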

Late Chunking: Embedding Context You Don't Have at Chunk Time

A 2024 development from Jina AI addresses a fundamental limitation of conventional chunking: when you chunk first and embed later, each chunk loses access to surrounding context at embedding time.

Late chunking reverses the order. The full document is passed through a long-context embedding model first, producing token-level embeddings that capture the document's full context. Chunks are then formed by mean-pooling across token ranges within those contextually-rich embeddings.

The result is chunk embeddings that encode each passage's meaning as it appears in context, not in isolation. A pronoun like "it" in a chunk has an embedding that reflects what "it" refers to in the surrounding text. A claim in a subsection has an embedding that reflects the broader argument it supports.

Empirical comparisons in 2025 showed late chunking producing stronger retrieval performance across diverse tasks. The trade-off is that it requires a long-context embedding model and imposes the latency of full-document encoding. For high-value document corpora where retrieval precision matters more than index throughput, it's the right approach. For high-volume commodity content, conventional chunking with good metadata enrichment is often sufficient.
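Once the long-context model has produced per-token embeddings for the whole document, the pooling step itself is simple. A sketch with NumPy, where the token embedding matrix would come from the embedding model and `boundaries` are the chunk spans in token indices:

```python
import numpy as np

def late_chunk(token_embeddings: np.ndarray, boundaries: list) -> list:
    """Late chunking: the full document was already encoded into per-token
    embeddings (shape [n_tokens, dim]) by a long-context model; each chunk
    vector is the mean over its token span, so it reflects full-document
    context rather than the chunk in isolation."""
    return [token_embeddings[start:end].mean(axis=0)
            for start, end in boundaries]
```

Contrast with the conventional order, where each chunk's text is embedded separately and tokens near the chunk edges never see their surrounding context.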

Diagnosing Index Failures Before Users Do

The most productive shift in RAG development posture is treating the indexing pipeline as a first-class system with its own evaluation discipline — not as a preprocessing step that runs once and is forgotten.

Configuration drift is the most common silent failure mode. The embedding model used at query time must be identical to the one used at index time. Dimension count, tokenization, normalization behavior — all must match. Model updates that happen transparently in hosted embedding APIs can misalign your store without warning. Version-pin your embedding model and make mismatches an explicit error, not a silent degradation.
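Version pinning can be as simple as storing an embedding config alongside the index and refusing to serve queries when it drifts. A sketch (field and model names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingConfig:
    """Pin everything that must match between index time and query time."""
    model_name: str
    model_version: str
    dim: int
    normalized: bool

def assert_compatible(index_cfg: EmbeddingConfig, query_cfg: EmbeddingConfig):
    """Fail loudly on drift instead of silently degrading retrieval."""
    if index_cfg != query_cfg:
        raise RuntimeError(
            f"Embedding config mismatch: index={index_cfg} query={query_cfg}"
        )
```

The check belongs on the query path's startup, not in a runbook: a mismatch should be an exception that pages someone, not a slow decline in retrieval quality.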

Stale indices are another class of invisible failure. When source documents change and the vector store does not, users encounter retrieval that returns outdated, contradicted, or deleted content. The hallucination this produces is attributed to the model, not to the stale index. Change-data-capture pipelines that propagate document updates into the index are not optional for live knowledge bases.

Quality checks at ingestion time should catch extraction failures before they propagate. PDF parsing, OCR, and HTML extraction all fail silently in ways that produce malformed chunks. A basic structural validity check (expected character count, encoding validity, absence of extraction artifacts) run at ingestion time catches failures when they're cheap to fix rather than after they've contaminated hundreds of downstream queries.
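A sketch of such a gate, with thresholds that are illustrative and should be tuned per corpus:

```python
def check_extraction(text: str, min_chars=200, max_artifact_ratio=0.05):
    """Cheap structural checks on extracted text before chunking.
    Returns a list of problem codes; empty means the document passes."""
    problems = []
    if len(text) < min_chars:
        problems.append("too_short")  # likely a failed or truncated parse
    artifacts = sum(text.count(ch) for ch in ("\x00", "\ufffd"))
    if artifacts / max(len(text), 1) > max_artifact_ratio:
        problems.append("encoding_artifacts")
    letters = sum(c.isalpha() for c in text)
    if letters / max(len(text), 1) < 0.5:
        problems.append("low_text_ratio")  # layout debris or OCR failure
    return problems
```

Documents that fail the gate go to a quarantine queue for inspection rather than into the index, which is exactly the "cheap to fix" moment the paragraph above describes.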

The evaluation methodology that closes the loop: build a retrieval eval suite from real user queries — not from documents you already know are indexed cleanly. For each query, measure whether the ground-truth passage appears in the top-K results. Low recall on this suite is an index architecture problem, not a query problem. Treat retrieval quality as a deployable metric with its own regression tests, version history, and rollback criteria.
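The core metric is easy to compute once the eval set exists. A sketch, where `retrieve` is your retrieval function and each eval item pairs a real user query with its ground-truth chunk id:

```python
def recall_at_k(eval_set, retrieve, k=5):
    """eval_set: list of (query, ground_truth_chunk_id) pairs.
    retrieve(query, k) returns a ranked list of chunk ids.
    Low recall here is an index architecture problem, not a query problem."""
    hits = sum(1 for query, gt_id in eval_set if gt_id in retrieve(query, k))
    return hits / len(eval_set)
```

Run it in CI against a pinned eval set on every index rebuild, and treat a drop below your threshold the same way you would treat a failing unit test.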

The Baseline That Beats Sophistication

The 2026 Vectara benchmark tested chunking strategies across realistic document corpora. Recursive 512-token splitting achieved 69% accuracy. Semantic chunking — which sounds more principled — landed at 54%, because the semantic algorithm over-fragmented documents into 43-token chunks that were too narrow to be useful.

This is a consistent pattern. Teams are drawn to more sophisticated approaches because "semantic" sounds better than "recursive." The empirical result is that recursive splitting, properly parameterized, outperforms semantic splitting in the majority of real-world scenarios because semantic algorithms optimize for boundary elegance rather than retrieval utility.

The practical baseline for any new RAG system:

  • Recursive character splitting at 400–512 tokens
  • 0% overlap as a default (add 10–15% only if boundary-crossing queries degrade)
  • Metadata enrichment at minimum including section headers and document context
  • Type-specific handling for tables and code blocks
  • Evaluation on real queries before declaring the index complete

This is not the most interesting set of choices. It is, consistently, the set of choices that produces working systems. Optimize toward sophistication only after measuring that the baseline is insufficient for your specific corpus and query distribution.

The teams that build the best RAG systems are not the ones that apply the most advanced techniques. They're the ones that instrument their indexing pipeline aggressively, evaluate on real queries from day one, and treat index architecture as a domain with its own failure modes — not as a solved problem that runs before the interesting work begins.
