Your Vector Store Has Hot Keys: Why ANN Indexes Lie About Production Cost
The vector index your team picked was benchmarked on a workload that doesn't exist in production. Every public ANN benchmark — VIBE, ann-benchmarks, the comparison table on the database vendor's landing page — runs queries sampled uniformly from the corpus, so every neighbor lookup costs roughly the same and every shard sees roughly equal load. Real retrieval traffic does not look like that. It looks Zipfian: a small fraction of queries (today's news, the trending product, the recurring support intent, the few hundred questions a customer support team gets all day) hits a small fraction of embeddings a hundred times more often than the median. The benchmark says HNSW recall is 0.97 at 50ms p99. Production says one shard is melting and the rest are bored.
The mismatch is not a tuning problem. Vector retrieval inherits the access-skew profile of every other database workload, and the indexes the field has standardized on were not designed with that profile in mind. The caching your KV store gets for free (the OS page cache warming up the rows you read most often, the LRU on a hot key) does far less for ANN, because the graph is walked in graph order, not access order. The hot embeddings rarely stay resident in cache because the search algorithm's traversal pattern looks random to the page cache, and your "popular" cluster lives on a single shard whose CPU runs hot while the rest of the fleet idles.
This post is about the gap between how ANN indexes are evaluated and how they get used, and the operational discipline that has to land in production before that gap turns into a postmortem.
The Benchmark Is Uniform; Your Traffic Is Not
When researchers compare HNSW, IVF, ScaNN, DiskANN, or the latest graph variant, they sample queries uniformly from the dataset (or from a held-out test split with the same distribution). Recall, latency, and throughput are reported as means over that sample. The VIBE paper makes a careful effort to use modern embeddings and out-of-distribution query sets, but even that work is comparing index quality, not operational behavior under skewed traffic.
Production retrieval has at least three sources of skew that the benchmark erases:
- Temporal skew. A news app's queries today are about today's news. A code-search tool spikes on whatever framework just shipped a major release. A support bot sees 70% of its volume against the same 200 intent embeddings.
- Tenant skew. In multi-tenant SaaS, a few large customers contribute disproportionate traffic, and their corpus subsets concentrate hits on a small slice of the index.
- Graph topology skew. Inside a graph index like HNSW, a small set of high-degree "hub" nodes gets traversed on almost every query. This is intrinsic to the data structure, not to your workload, and it shapes which parts of the index dominate cache pressure.
The first two are properties of your traffic. The third is a property of the algorithm. All three interact with the cache hierarchy, and none of them shows up in the benchmark.
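To make the shard-level effect of traffic skew concrete, here is a minimal simulation sketch. It rests on several assumptions that are not measurements: popularity follows a Zipf law with exponent 1, the index is range-partitioned by ID, and popular items sit next to each other in ID space (a stand-in for a "popular" cluster landing on one shard). The function names and constants are illustrative only.

```python
import random
from collections import Counter

def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights: the item at rank r gets weight 1 / r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]

def shard_load(n_items, n_shards, n_queries, weights=None, seed=0):
    """Count how many queries land on each shard.

    Assumption: items are range-partitioned, so item i lives on shard
    i * n_shards // n_items and neighbors in ID space share a shard.
    weights=None means queries target items uniformly.
    """
    rng = random.Random(seed)
    targets = rng.choices(range(n_items), weights=weights, k=n_queries)
    load = Counter(t * n_shards // n_items for t in targets)
    return [load.get(shard, 0) for shard in range(n_shards)]

def imbalance(load):
    """Ratio of the busiest shard's load to the mean shard load."""
    return max(load) / (sum(load) / len(load))

N_ITEMS, N_SHARDS, N_QUERIES = 10_000, 8, 50_000

uniform = shard_load(N_ITEMS, N_SHARDS, N_QUERIES)
skewed = shard_load(N_ITEMS, N_SHARDS, N_QUERIES, weights=zipf_weights(N_ITEMS))

print("uniform imbalance:", round(imbalance(uniform), 2))
print("zipfian imbalance:", round(imbalance(skewed), 2))
```

Under uniform sampling the busiest shard sits close to the mean. Under the Zipfian assumption the shard holding the popular IDs carries several times the mean load, which is the "one shard melting" picture. How severe the effect is in a real system depends on how popular embeddings actually map to shards.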
Cache Behavior Is the Hidden Cost Model
Flat indexes scan vectors sequentially, so the CPU prefetcher works and cache misses are predictable. Graph indexes — HNSW and the family of disk-augmented variants like DiskANN — walk pointers in an order that depends on the query, and that order looks random to the prefetcher. Distance computations during graph traversal incur frequent cache misses, and modern systems like VSAG have started introducing software prefetching specifically because the access pattern is hostile to hardware caches.
Now layer skewed traffic on top. The hot embeddings, the ones every popular query touches, should live in L2/L3 or at least in the OS page cache. They don't, because the graph orders nodes by proximity in embedding space, not by access frequency, and the sections of the graph that hold popular vectors are interleaved with neighbors that are rarely visited. The result: under heavy skew, your "in-memory" index is paying RAM-fetch cost on every hop, and your "disk-augmented" index is paying SSD-fetch cost on the same hot vectors over and over.
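A rough way to see the cost of that interleaving is to count how many storage pages the hot set spans. The sketch below assumes 128-dimensional float32 vectors (512 bytes each) in 4 KiB pages, and assumes hot vectors are scattered uniformly at random through the storage order. Both are illustrative assumptions; a real index layout follows insertion order or graph structure rather than a random shuffle.

```python
import random

def pages_touched(positions, vectors_per_page):
    """Number of distinct fixed-size pages that cover the given storage slots."""
    return len({pos // vectors_per_page for pos in positions})

N_VECTORS = 100_000
VECTORS_PER_PAGE = 8   # assumption: 4096-byte page / 512-byte vector
N_HOT = 1_000          # assumption: 1% of vectors are "hot"

rng = random.Random(0)

# Assumed current layout: hot vectors scattered among cold ones.
hot_scattered = rng.sample(range(N_VECTORS), N_HOT)

# Counterfactual layout: the same number of hot vectors packed contiguously.
hot_packed = range(N_HOT)

scattered_pages = pages_touched(hot_scattered, VECTORS_PER_PAGE)
packed_pages = pages_touched(hot_packed, VECTORS_PER_PAGE)

print("pages holding hot vectors, scattered:", scattered_pages)
print("pages holding hot vectors, packed:   ", packed_pages)
```

In the scattered layout nearly every hot vector sits on its own page, so the cache has to hold several times more pages than the hot set's size implies, and each hot lookup drags cold neighbors into cache along with it.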
This is why teams running DiskANN at scale care so much about caching the entry-point neighborhood: the entry points are the densest hubs, and pinning them in cache is the difference between a healthy p99 and a queue. It's also why tiered architectures have started winning. Milvus 2.6 ships an explicit hot/cold separation; the production pattern around S3 Vectors is to keep cold embeddings in object storage and promote hot ones to OpenSearch HNSW. In skewed workloads, less than 10% of the data tends to take more than 80% of query traffic, and treating the hot and cold tiers identically means paying for memory you can't use efficiently.
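If you already have per-embedding access counts, sizing the hot tier is a small calculation. The following is a minimal sketch, not how Milvus or any other system decides promotion: `hot_set` is a hypothetical helper that returns the smallest prefix of embeddings, ordered by access count, covering a target share of traffic.

```python
from collections import Counter

def hot_set(access_counts, coverage=0.8):
    """Return the most-accessed IDs that together cover `coverage` of traffic.

    access_counts: Counter mapping embedding ID -> number of accesses.
    IDs are taken in descending order of access count until the running
    total reaches the requested share of all accesses.
    """
    total = sum(access_counts.values())
    if total == 0:
        return []
    hot, covered = [], 0
    for item_id, count in access_counts.most_common():
        if covered >= coverage * total:
            break
        hot.append(item_id)
        covered += count
    return hot

# Toy counts, purely illustrative.
counts = Counter({"doc-a": 70, "doc-b": 15, "doc-c": 10, "doc-d": 5})
hot = hot_set(counts, coverage=0.8)
print("hot tier:", hot, "=", len(hot) / len(counts), "of the corpus")
```

The ratio of hot-tier size to corpus size is the number to check against the 10%/80% pattern before committing to a tiered layout. If the hot set turns out to be most of the corpus, tiering buys little.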
The Per-Embedding Access Histogram You Don't Have
Every operational story about hot keys starts with a dashboard the team wishes they'd built earlier. For a relational database it's the slow-query log plus an access-frequency view by row or partition. For a KV store it's a hot-shard metric. For a vector store, the equivalent is a per-embedding (or per-cluster) access-frequency histogram, and almost no team builds it until something falls over.
The reason it's missing is that the index abstraction hides it. You query for a vector and get back a list of nearest neighbors. The system doesn't naturally tell you which embeddings are getting returned 100x more often than the median. You have to instrument the search call to log neighbor IDs (or hash buckets, if PII is a concern), aggregate over a rolling window, and visualize the distribution. Once you do, three things become legible:
- The hot tail. A small set of embeddings dominates the result distribution. Sometimes they're high-quality canonical answers, sometimes they're a content bug: a duplicate that's slightly closer to a popular query than the intended canonical.
- The shard imbalance. If the index is sharded by ID range or by clustering centroid, the hot embeddings often cluster on one or two shards. Capacity-planning by mean traffic underprovisions those shards.
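The instrumentation described above can be sketched in a few lines. This is a hedged, in-process illustration rather than a production design: `AccessHistogram` and its methods are hypothetical names, the window is held in memory, and the IDs passed to `record` are whatever your search call returns (or hashed buckets of them). A real deployment would more likely ship these counts to a metrics pipeline.

```python
import statistics
import time
from collections import Counter, deque

class AccessHistogram:
    """Rolling-window count of how often each neighbor ID is returned."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self._events = deque()   # (timestamp, ids) pairs, oldest first
        self._counts = Counter()

    def record(self, neighbor_ids, now=None):
        """Log the IDs returned by one search call."""
        now = time.monotonic() if now is None else now
        ids = tuple(neighbor_ids)
        self._events.append((now, ids))
        self._counts.update(ids)
        self._evict(now)

    def _evict(self, now):
        """Drop events older than the window and undo their counts."""
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            _, old_ids = self._events.popleft()
            for item_id in old_ids:
                self._counts[item_id] -= 1
                if self._counts[item_id] == 0:
                    del self._counts[item_id]

    def top(self, k=10):
        """The k most frequently returned IDs in the current window."""
        return self._counts.most_common(k)

    def hot_to_median_ratio(self):
        """How much more often the hottest ID is returned than the median ID."""
        if not self._counts:
            return 0.0
        values = list(self._counts.values())
        return max(values) / statistics.median(values)

# Usage with explicit timestamps so the window behavior is visible.
hist = AccessHistogram(window_seconds=10)
hist.record([1, 2, 3], now=0)
hist.record([1, 4], now=5)
hist.record([1], now=12)   # the event at t=0 falls out of the window here
print(hist.top(3), hist.hot_to_median_ratio())
```

Calling `record` from a thin wrapper around the search client keeps the index code untouched. Grouping the same counts by shard ID instead of embedding ID gives the shard-imbalance view.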
