The RAG Dedup Step That Broke Silently and Flooded Your Top-K With Near-Duplicates

June 3, 2026 · 10 min read

Software Engineer

A retrieval-augmented generation pipeline can degrade for weeks without a single metric noticing. The relevance scores look fine. The retrieval latency is unchanged. The eval slice that touches the broken topic moves a quarter of a point in the wrong direction, and your weekly review chalks it up to noise. Then someone reads the actual context window the model received for a customer ticket and sees the same paragraph three times (once in title case, once in lowercase, once with the punctuation stripped), and you understand that your top-five has effectively been a top-three for a month.

This is the class of failure where the system is doing exactly what it was told to do. The retriever is returning the most similar vectors to the query. Each of those vectors is genuinely about the right topic. The index has no idea that three of them came from the same paragraph indexed three ways, because the ingestion-time dedup pass that was supposed to catch that case is silently skipping it.

The Refactor That Dropped a Lowercase Pass

The version of the story I have seen most often goes like this. The ingestion pipeline runs a normalize_and_hash step before the embedding call. It strips whitespace, lowercases the text, removes a handful of zero-width characters, hashes the result, and skips any chunk whose hash matches one already indexed. The hash table is the cheap, correct mechanism that keeps near-duplicates out of the vector store.
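A minimal sketch of such a step, under stated assumptions: the source names only normalize_and_hash, so the filter_new_chunks helper, the SHA-256 choice, the exact zero-width character set, and the decision to collapse whitespace runs (rather than only trim) are all illustrative, not the pipeline's actual code.

```python
import hashlib
import re

# Zero-width characters to strip. The exact set is an assumption for illustration.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")


def normalize_and_hash(text: str) -> str:
    """Return a hex digest of the normalized chunk text."""
    cleaned = ZERO_WIDTH.sub("", text)
    cleaned = " ".join(cleaned.split())  # trim and collapse whitespace runs
    cleaned = cleaned.lower()            # the step the refactor later drops
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def filter_new_chunks(chunks, seen_hashes):
    """Yield only chunks whose normalized hash is not already in seen_hashes.

    filter_new_chunks is a hypothetical helper; seen_hashes stands in for
    the hash table of already-indexed chunks.
    """
    for chunk in chunks:
        digest = normalize_and_hash(chunk)
        if digest in seen_hashes:
            continue  # already indexed under some casing or spacing variant
        seen_hashes.add(digest)
        yield chunk
```

With the lowercase step in place, "Reset Your Password" and "reset your password" produce the same digest, so only the first one reaches the embedding call.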

Six months in, a platform engineer cleans up the normalization module. They rename an import, move a helper to a different file, and update three call sites. The lowercase step gets dropped from one of those call sites by mistake — it was a method on a builder object, the builder is now constructed differently, and the test that would have caught the regression was a coverage-only test that asserted "the function returns a string." The function returns a string. The string is just not lowercased anymore.
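To make the gap between the two kinds of test concrete, here is a hedged sketch. Both normalizers are hypothetical stand-ins (the real module is not shown in this post), and the example pairs are invented. The point is only that an invariant check on equivalent inputs fails where a type check passes.

```python
def normalize(text: str) -> str:
    """Stand-in for the intended normalization (hypothetical)."""
    return " ".join(text.split()).lower()


def normalize_after_refactor(text: str) -> str:
    """Stand-in for the regressed call site: the lowercase step is gone."""
    return " ".join(text.split())


# Pairs that must normalize to the same string. Chosen for illustration.
EQUIVALENT_PAIRS = [
    ("Reset Your Password", "reset your password"),
    ("  Billing   FAQ ", "billing faq"),
]


def returns_string(fn) -> bool:
    """The coverage-only check: true for both versions."""
    return isinstance(fn("Reset Your Password"), str)


def violated_pairs(fn):
    """Return the pairs that fn fails to map to one normalized form."""
    return [(a, b) for a, b in EQUIVALENT_PAIRS if fn(a) != fn(b)]


if __name__ == "__main__":
    for fn in (normalize, normalize_after_refactor):
        print(fn.__name__, returns_string(fn), violated_pairs(fn))
```

A test that asserts violated_pairs(...) == [] for every call site's normalizer would have failed on the day of the refactor, while the returns-a-string assertion keeps passing.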

The pipeline keeps running. Documents keep ingesting. The hash table fills up with two entries for every paragraph that exists in mixed case anywhere in the corpus: one for the title-cased version, one for the lowercased version. For those paragraphs, the vector store ends up with twice the vectors it should have, and nobody notices because the per-document chunk count is within a reasonable range of last quarter's number.

The vectors are not identical. Their embeddings differ by a small amount because the tokenizer is case-sensitive in places the embedding model was trained to treat as semantically equivalent. So at retrieval time, when a query comes in that lexically matches the underlying paragraph, cosine similarity ranks both copies near the top of the list, and ranks them next to each other because their embeddings are almost the same. Two of your top-five slots are now spent on the same content.
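One way to see the wasted slots is to count how many hits in a result list are not near-copies of a higher-ranked hit. The sketch below uses toy three-dimensional vectors, and the 0.98 threshold is an assumption that would need tuning against a real embedding model. Both function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def distinct_slots(vectors, threshold=0.98):
    """Count hits that are not near-copies of a higher-ranked hit.

    vectors are the embeddings of the top-k results in rank order.
    threshold is an assumed cutoff, not a recommended value.
    """
    kept = []
    for vec in vectors:
        if all(cosine(vec, other) < threshold for other in kept):
            kept.append(vec)
    return len(kept)


# Toy top-five: hits 1 and 2 are near-copies, and so are hits 3 and 4.
top_k = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],
    [0.0, 0.0, 1.0],
]
print(distinct_slots(top_k))  # 3
```

Logged per query, the ratio of distinct slots to k is a number that moves when near-duplicates enter the index, even though each individual relevance score stays high.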

Why Relevance Scores Reward Redundancy

The core architectural failure here is that the retrieval scoring function and the dedup function are operating on different definitions of "the same thing."

The retriever ranks by similarity to the query. Each of the three duplicate chunks is genuinely similar to the query. From the retriever's perspective, returning three near-identical hits is the correct behavior — they are all relevant, and ranking them next to each other is what cosine similarity does when their vectors are almost equal. The relevance score is not broken. It is doing what it was designed to do.

The dedup step was supposed to enforce a different invariant: that the index does not contain redundant content. That invariant lives outside the retrieval scoring function. When the dedup step silently regresses, no retrieval metric flags it, because every retrieval metric is computed under the assumption that the index has already been deduplicated. The metrics are measuring the right thing for the wrong index.
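Because the invariant lives outside the scoring function, it can be checked directly against the index contents. The following is a sketch only: reference_key is a hypothetical, independently maintained normalizer, and the chunk_id to text mapping is an assumed shape for whatever the vector store can export.

```python
from collections import defaultdict


def reference_key(text: str) -> str:
    """Independent normalizer used only by the audit (hypothetical)."""
    return " ".join(text.split()).lower()


def duplicate_groups(chunks):
    """Group chunk ids whose stored text collapses to the same key.

    chunks maps chunk_id -> stored text. Only groups with more than
    one member are returned.
    """
    groups = defaultdict(list)
    for chunk_id, text in chunks.items():
        groups[reference_key(text)].append(chunk_id)
    return [ids for ids in groups.values() if len(ids) > 1]


def redundant_fraction(chunks) -> float:
    """Fraction of indexed chunks that are extra copies of another chunk."""
    if not chunks:
        return 0.0
    extra = sum(len(ids) - 1 for ids in duplicate_groups(chunks))
    return extra / len(chunks)
```

Run on a schedule against the stored chunk text, a check like this measures the index itself rather than the retrieval results, so it does not inherit the assumption that deduplication already happened.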

This is why "monitor your retrieval scores" does not catch the problem. The scores are fine. The problem is that the answer-quality eval, which sits two stages downstream of retrieval, is the first place where the duplication actually changes anything observable. By the time the signal reaches the eval, it has been smeared across every query that happens to touch a duplicated topic. Production teams have reported that roughly 80 percent of accuracy issues in RAG can be traced back to data-quality problems, duplication among them, and this is a failure mode you cannot easily attribute to a single retrieval call.

The Model's Over-Weighting Problem

Once near-duplicates flood the context, the model has its own way of compounding the failure. Language models read the context window as evidence. When three chunks say roughly the same thing in slightly different ways, the model treats that as three independent confirmations of the same claim, and gives that claim more weight in its answer.

This is a classical confound that gets dressed up in modern clothes. The model is not double-counting because it is bad at probability. It is double-counting because the context window does not contain a "these are the same source" annotation. The chunks arrived from the retriever as independent items. The model has no reason to treat them otherwise.

The result is a subtle and consistent bias. Queries that hit duplicated content get answers that are systematically more confident in whatever the duplicated passage said. If the duplicated passage is correct, your eval might trend slightly up on those queries. If it is wrong or partial, your eval trends down. In either case, the bias is invisible until you read individual contexts and notice the same sentence three times.

About Tian Pan
