
2 posts tagged with "ann"


Your Vector Store Has Hot Keys: Why ANN Indexes Lie About Production Cost

Tian Pan, Software Engineer · 10 min read

The vector index your team picked was benchmarked on a workload that doesn't exist in production. Every public ANN benchmark — VIBE, ann-benchmarks, the comparison table on the database vendor's landing page — runs queries sampled uniformly from the corpus, so every neighbor lookup costs roughly the same and every shard sees roughly equal load. Real retrieval traffic does not look like that. It looks Zipfian: a small fraction of queries (today's news, the trending product, the recurring support intent, the few hundred questions a customer support team gets all day) hits a small fraction of embeddings a hundred times more often than the median. The benchmark says HNSW recall is 0.97 at 50ms p99. Production says one shard is melting and the rest are bored.
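To see the gap concretely, here's a minimal sketch of the sampling difference, not tied to any particular benchmark harness. The corpus size, query count, and the Zipf exponent of 1.1 are illustrative assumptions; the point is how much of the lookup load the top 1% of items absorbs under each sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
n_corpus, n_queries = 100_000, 200_000  # illustrative sizes

# Uniform query sampling: what ann-benchmarks-style harnesses do.
uniform_ids = rng.integers(0, n_corpus, size=n_queries)

# Zipfian query sampling: closer to what production retrieval traffic looks like.
ranks = np.arange(1, n_corpus + 1, dtype=float)
probs = ranks ** -1.1          # assumed skew exponent
probs /= probs.sum()
zipf_ids = rng.choice(n_corpus, size=n_queries, p=probs)

def top_1pct_share(ids: np.ndarray) -> float:
    """Fraction of all lookups that land on the 1% most-hit items."""
    counts = np.bincount(ids, minlength=n_corpus)
    counts.sort()
    return counts[-(n_corpus // 100):].sum() / len(ids)

print(f"uniform: top 1% of items get {top_1pct_share(uniform_ids):.1%} of lookups")
print(f"zipfian: top 1% of items get {top_1pct_share(zipf_ids):.1%} of lookups")
```

Under uniform sampling the top 1% gets roughly 1% of the traffic, which is exactly why every shard looks equally busy on the benchmark and nothing like that in production.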

The mismatch is not a tuning problem. It's that vector retrieval inherits the access-skew profile of every other database workload, and the indexes the field has standardized on were not designed with that profile in mind. The cache layer your KV store gets for free — the OS page cache warming up the rows you read most often, the LRU on a hot key — does not exist for ANN, because the graph is walked in graph order, not access order. The hot embeddings stay cold in memory because the search algorithm's traversal pattern looks random to the page cache, and your "popular" cluster lives on a single shard whose CPU runs hot while the rest of the fleet idles.
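Here's a back-of-the-envelope sketch of that hot-shard effect, assuming the popular cluster's embeddings end up collocated on one shard, as range- or cluster-based partitioning tends to do. The shard count and skew exponent are illustrative assumptions, not measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_corpus, n_shards, n_queries = 100_000, 8, 500_000  # illustrative sizes

# Zipfian popularity over corpus items, most popular first.
ranks = np.arange(1, n_corpus + 1, dtype=float)
probs = ranks ** -1.1          # assumed skew exponent
probs /= probs.sum()

# Cluster-style placement: contiguous ranges of similar (and, here, popular)
# items land on the same shard.
shard_of = np.repeat(np.arange(n_shards), n_corpus // n_shards)

hits = rng.choice(n_corpus, size=n_queries, p=probs)
load = np.bincount(shard_of[hits], minlength=n_shards) / n_queries
for shard, share in enumerate(load):
    print(f"shard {shard}: {share:.1%} of query load")
```

With uniform placement of a skewed workload, the shard holding the head of the distribution takes a multiple of its fair share while the others idle, which is the "one shard is melting" symptom in miniature.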

Popularity Bias in Vector Retrieval: Why the Same Five Chunks Dominate Every Query

Tian Pan, Software Engineer · 10 min read

Pull a week of retrieval logs from any mature RAG system and sort chunks by how often they were returned. The shape is almost always the same: a small cluster of chunks appears in thousands of queries while the vast majority of your corpus shows up a handful of times or never at all. The system isn't broken. It's doing exactly what its index was built to do — and that is the problem.
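That log pull is a few lines of Python. This is a sketch under assumptions: retrieval logs as JSON lines with a `retrieved_chunk_ids` field per query, and a placeholder file path; swap in whatever your logging pipeline actually emits.

```python
import json
from collections import Counter

counts = Counter()
with open("retrieval_logs.jsonl") as f:               # placeholder path
    for line in f:
        event = json.loads(line)
        counts.update(event["retrieved_chunk_ids"])   # placeholder field name

ranked = counts.most_common()
total = sum(counts.values())
head = max(1, len(ranked) // 100)                      # top 1% of distinct chunks
head_share = sum(c for _, c in ranked[:head]) / total

print(f"{len(ranked)} distinct chunks ever retrieved")
print(f"top 1% of chunks account for {head_share:.1%} of all retrievals")
for chunk_id, c in ranked[:5]:
    print(f"  {chunk_id}: returned {c} times")
```

If the top handful of chunk IDs account for a double-digit share of all retrievals, you're looking at the distribution this post is about.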

This is popularity bias in vector retrieval, and it gets worse as your corpus grows. A few chunks become gravity wells that win retrieval across queries that have little to do with each other, while your long tail quietly disappears below the top-k cutoff. Your RAG system starts feeling "generic" — users ask specific questions and get answers that sound like they were written for someone else. By the time product complains, the distribution has already been lopsided for weeks.