Skip to main content

One post tagged with "datasets"

View all tags

The Near-Duplicate Filter That Took Your Only Hard Example With It

· 10 min read
Tian Pan
Software Engineer

Your dedup step reported a corpus shrink of 28% and the training run finished six hours faster. The eval numbers came in flat-to-slightly-better. Nobody opened the diff of what got removed. Three weeks later support starts paging about a class of refund-reversal tickets the model used to handle and now flatly mishandles. There are eleven training rows that touched that exact pattern. Nine of them are gone — collapsed into a single representative that kept the shortest, cleanest phrasing and dropped the messy hostile-tone variants where the model had actually learned to de-escalate. Your dedup pipeline did that, and your evals did not catch it, because by the time the eval set was built, those examples were already gone from the train set the eval was sampled from.

This is the failure mode that bothers me about deduplication as a pipeline step: it presents itself as hygiene and it is actually distribution editing. Removing exact duplicates of boilerplate is hygiene. Removing near-duplicates by a similarity threshold is a sampling decision dressed up as one. The threshold picks which slices of your training distribution survive, and the slices most likely to lose are the ones where you have the fewest examples to begin with — which are also, almost by definition, the ones you were keeping for coverage rather than count.