
Your RAG Chunker Is a Database Schema Nobody Code-Reviewed

11 min read
Tian Pan
Software Engineer

The first time a retrieval quality regression lands in your on-call channel, the debugging path almost always leads somewhere surprising. Not the embedding model. Not the reranker. Not the prompt. The culprit is a one-line change to the chunker — a tokenizer swap, a boundary rule tweak, a stride adjustment — that someone merged into a preprocessing notebook three sprints ago. The fix touched zero lines of production code. It rebuilt the index overnight. And now accuracy is down four points across every tenant.

The chunker is a database schema. Every field you extract, every boundary you draw, every stride you pick defines the shape of the rows that land in your vector index. Change any of them and you have altered the schema of an index that other parts of your system — retrieval logic, reranker features, evaluation harnesses, downstream prompts — depend on as if it were stable. But because the chunker usually lives in a notebook or a small Python module that nobody labels as "infrastructure," these changes ship with the rigor of a config tweak and the blast radius of an ALTER TABLE.

This article is about taking that framing seriously. If the chunker is a schema, then every chunker change is a migration. Every re-embed is an index rebuild. Every boundary rule is a contract. And the org structure that ships chunker changes through notebook PRs while routing retrieval evals through a separate team is a governance gap, not a workflow preference. Treating chunking as preprocessing is how good teams quietly lose a month of retrieval quality and never trace it back to the commit.

Chunking Decisions Are Schema Decisions

Walk through the parameters a chunking pipeline actually commits to, and the schema framing starts to feel obvious rather than clever.

Chunk size sets the row width. Overlap determines how much redundant data lives across rows. Boundary rules — paragraph, sentence, Markdown heading, semantic break, fixed-token — decide the shape of each row's content. Stride governs whether rows are dense or sparse across the source. Metadata fields (section titles, page numbers, doc IDs, timestamps, author, sensitivity) define indexed columns that your filters and reranker will rely on. Preprocessing transforms (whitespace normalization, table flattening, footnote stripping) define the canonical text that everything downstream sees.
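One way to make that commitment visible is to treat the whole parameter set as a single versioned config object rather than as constants scattered across a notebook. The following is a minimal Python sketch; the field names, defaults, and hashing scheme are illustrative assumptions, not this article's actual pipeline:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChunkingSchema:
    """Every parameter here shapes the rows that land in the vector index."""
    chunk_size_tokens: int = 512                 # row width
    overlap_tokens: int = 64                     # redundancy across adjacent rows
    boundary_rule: str = "markdown_heading"      # paragraph | sentence | heading | semantic | fixed_token
    stride_tokens: int = 448                     # row density across the source
    metadata_fields: tuple = ("doc_id", "section_title", "page", "timestamp")
    preprocessing: tuple = ("normalize_whitespace", "flatten_tables", "strip_footnotes")
    tokenizer: str = "cl100k_base"               # swapping this is also a schema change

    def version(self) -> str:
        """Deterministic fingerprint: if this changes, you owe the corpus a migration."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode()).hexdigest()[:12]
```

Anything that alters the fingerprint is, by construction, a schema change rather than a tweak.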

None of these are cosmetic. Chroma's 2026 chunking evaluation showed that overlap beyond roughly 20% produces steep precision losses while barely improving recall, and Vectara's benchmark found recursive 512-token splitting hitting 69% accuracy while naively applied semantic chunking fell to 54% because it produced 43-token fragments too short for the generator to reason over. The chunker does not just decide what gets indexed; it decides what retrieval can possibly recover and what the model has room to think with. A 43-token fragment is not a smaller chunk of the same schema. It is a different schema.

The schema framing also explains the most frustrating RAG incident class: a chunker change ships, retrieval recall looks fine in the offline eval, and yet user-reported accuracy tanks. The chunks retrieved cleanly. The chunks did not answer anything. The schema changed underneath a downstream consumer (the generator) that was tuned for a different row width.

Every Chunker Change Is a Migration

Once you accept that the chunker defines the schema, the operational implications follow with brutal clarity.

You cannot change the chunker without re-embedding the corpus. The old vectors were produced from old chunks; the new chunks need new vectors; and because the boundary change means old and new rows no longer correspond one-to-one, you cannot diff them by ID. You need a full rebuild, a dual-index window, a quality audit against held-out queries, and a cutover plan. The dominant pattern in production — documented by Google Cloud, LangChain, and nearly every vector-database vendor — is blue/green indexing: stand up the new index in the shadow of the old, dual-write during ingestion, run queries against both and compare, then route a percentage of traffic to the new index and ramp up as confidence builds.
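A minimal sketch of the comparison step, assuming a generic vector store with `upsert` and `query` methods and an embedding callable; every name here is a placeholder, not a specific vendor's API:

```python
# Blue/green comparison sketch. Queries are assumed to carry a `.text` and a
# set of known-relevant `.relevant_doc_ids`; hits carry a `.doc_id`.
from statistics import mean

def build_shadow_index(docs, chunker, embed, shadow_index):
    """Populate the green index from the new chunker while blue keeps serving."""
    for doc in docs:
        for chunk in chunker(doc):
            shadow_index.upsert(id=chunk.id, vector=embed(chunk.text),
                                metadata=chunk.metadata)

def compare_indexes(held_out_queries, blue, green, embed, k=5):
    """Run identical held-out queries against both indexes; a positive delta
    means the new schema retrieves more known-relevant documents."""
    deltas = []
    for q in held_out_queries:
        v = embed(q.text)
        blue_hits = {h.doc_id for h in blue.query(v, top_k=k)}
        green_hits = {h.doc_id for h in green.query(v, top_k=k)}
        deltas.append(len(green_hits & q.relevant_doc_ids)
                      - len(blue_hits & q.relevant_doc_ids))
    return mean(deltas)

# Cut over only when the delta is non-negative overall *and* per query type,
# then ramp live traffic to green gradually before retiring blue.
```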

This is not optional rigor. This is the only way to tell whether the new chunker is better, worse, or shaped differently enough that "better on average" hides "catastrophically worse on your top-5 query types." A silent re-embed overnight gives you none of that signal. You discover the regression when a customer complains, three days later, and you have already deleted the old index to save storage.

Costs and timing matter too. Re-embedding a multi-million-document corpus is not free — compute, storage for the shadow index, dual-read query cost during comparison, and human time to audit results. Teams that treat chunker changes as preprocessing consistently under-budget these costs, which leads to the worst failure mode: a chunker migration that gets half-deployed, leaves the old and new indexes coexisting in an unintended way, and corrupts retrieval for weeks before someone notices that two documents about the same policy return different chunks depending on the shard.
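One cheap guard against that half-deployed state is to stamp every chunk with the chunker's schema version at ingest time and audit the live index for mixtures before trusting it. A hypothetical sketch, reusing the config fingerprint from earlier and assuming the store exposes some way to iterate chunk metadata:

```python
# `index.iter_metadata()` stands in for however your store exposes metadata.
from collections import Counter

EXPECTED_SCHEMA_VERSION = ChunkingSchema().version()  # from the config sketch above

def stamp(chunk_metadata: dict) -> dict:
    """Attach the schema version during ingestion."""
    chunk_metadata["chunker_schema_version"] = EXPECTED_SCHEMA_VERSION
    return chunk_metadata

def audit_schema_versions(index):
    """Fail loudly if chunks from two different chunker schemas coexist."""
    versions = Counter(md.get("chunker_schema_version", "unknown")
                       for md in index.iter_metadata())
    if set(versions) != {EXPECTED_SCHEMA_VERSION}:
        raise RuntimeError(f"Mixed chunker schemas in live index: {versions}")
    return versions
```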

The Observability Gap: Chunk Bugs Surface Three Hops Away

The hardest part of treating the chunker like a schema is that chunker bugs rarely present as chunker bugs.

Consider a common failure mode: the chunk boundary falls between a general rule and its exception. "Refunds are processed within seven business days." is in one chunk. "Refunds for enterprise accounts require CFO approval and may take up to thirty days." is in the next. Retrieval for the query "how long do enterprise refunds take" surfaces the first chunk, the generator confidently answers "seven business days," and a customer gets the wrong answer. Where does this show up? In LLM evaluation. The trace will read: retrieval recall looks OK, generation looks OK, ground truth mismatch — probably needs better prompting.

No. It needs the exception to live in the same chunk as the rule. The bug is a chunker bug, but the signal is a generation-quality signal, three indirections away from the cause. This is the observability gap that makes chunking so operationally hostile. You instrument retrieval recall, answer faithfulness, citation correctness, and latency — and none of those metrics ping when a boundary rule produces semantically broken rows. Teams that invest in RAG observability stacks (Langfuse, DeepEval, Braintrust-style traces) still rarely have a "chunk boundary sanity" metric.

The practical fix is to add chunker-level tests that the rest of your RAG pipeline does not know it needs:

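A sketch of what those tests can look like, pytest-style; `chunk_document`, `count_tokens`, `overlap_text`, and the fixtures (`sample_corpus`, `policy_doc`) are placeholders for your own pipeline, and chunks are assumed to expose `.text`, `.id`, and `.metadata`:

```python
MIN_CHUNK_TOKENS = 100     # a 43-token fragment is a schema bug, not a small row
MAX_OVERLAP_RATIO = 0.20   # past this, precision drops faster than recall improves

def test_no_fragment_chunks(sample_corpus):
    for doc in sample_corpus:
        for chunk in chunk_document(doc):
            assert count_tokens(chunk.text) >= MIN_CHUNK_TOKENS, chunk.id

def test_overlap_within_budget(sample_corpus):
    for doc in sample_corpus:
        chunks = chunk_document(doc)
        for a, b in zip(chunks, chunks[1:]):
            shared = count_tokens(overlap_text(a.text, b.text))
            assert shared / count_tokens(a.text) <= MAX_OVERLAP_RATIO

def test_rule_and_exception_stay_together(policy_doc):
    # Pin the refunds failure mode: the enterprise exception must never be
    # split away from the general rule it modifies.
    chunks = chunk_document(policy_doc)
    with_exception = [c for c in chunks if "require CFO approval" in c.text]
    assert with_exception, "exception text missing from every chunk"
    assert any("seven business days" in c.text for c in with_exception)

def test_metadata_columns_present(sample_corpus):
    for doc in sample_corpus:
        for chunk in chunk_document(doc):
            assert chunk.metadata.get("doc_id") and chunk.metadata.get("section_title")
```

Tests like these run against the chunker directly, which is the whole point: they catch semantically broken rows before the breakage has to travel three hops and surface as a generation-quality mystery.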