Skip to main content

One post tagged with "ann"

View all tags

Popularity Bias in Vector Retrieval: Why the Same Five Chunks Dominate Every Query

· 10 min read
Tian Pan
Software Engineer

Pull a week of retrieval logs from any mature RAG system and sort chunks by how often they were returned. The shape is almost always the same: a small cluster of chunks appears in thousands of queries while the vast majority of your corpus shows up a handful of times or never at all. The system isn't broken. It's doing exactly what its index was built to do — and that is the problem.

This is popularity bias in vector retrieval, and it gets worse as your corpus grows. A few chunks become gravity wells that win retrieval across queries that have little to do with each other, while your long tail quietly disappears below the top-k cutoff. Your RAG system starts feeling "generic" — users ask specific questions and get answers that sound like they were written for someone else. By the time product complains, the distribution has already been lopsided for weeks.