Your Vector Store Has Hot Keys: Why ANN Indexes Lie About Production Cost

10 min read
Tian Pan
Software Engineer

The vector index your team picked was benchmarked on a workload that doesn't exist in production. Every public ANN benchmark — VIBE, ann-benchmarks, the comparison table on the database vendor's landing page — runs queries sampled uniformly from the corpus, so every neighbor lookup costs roughly the same and every shard sees roughly equal load. Real retrieval traffic does not look like that. It looks Zipfian: a small fraction of queries (today's news, the trending product, the recurring support intent, the few hundred questions a customer support team gets all day) hits a small fraction of embeddings a hundred times more often than the median. The benchmark says HNSW recall is 0.97 at 50ms p99. Production says one shard is melting and the rest are bored.

The mismatch is not a tuning problem. It's that vector retrieval inherits the access-skew profile of every other database workload, and the indexes the field has standardized on were not designed with that profile in mind. The cache layer your KV store gets for free — the OS page cache warming up the rows you read most often, the LRU on a hot key — does not exist for ANN, because the graph is walked in graph order, not access order. The hot embeddings stay cold in memory because the search algorithm's traversal pattern looks random to the page cache, and your "popular" cluster lives on a single shard whose CPU runs hot while the rest of the fleet idles.

This post is about the gap between how ANN indexes are evaluated and how they get used, and the operational discipline that has to land in production before that gap turns into a postmortem.

The Benchmark Is Uniform; Your Traffic Is Not

When researchers compare HNSW, IVF, ScaNN, DiskANN, or the latest graph variant, they sample queries uniformly from the dataset (or from a held-out test split with the same distribution). Recall, latency, and throughput are reported as means over that sample. The VIBE paper makes a careful effort to use modern embeddings and out-of-distribution query sets, but even that work is comparing index quality, not operational behavior under skewed traffic.

Production retrieval has at least three sources of skew that the benchmark erases:

  • Temporal skew. A news app's queries today are about today's news. A code-search tool spikes on whatever framework just shipped a major release. A support bot sees 70% of its volume against the same 200 intent embeddings.
  • Tenant skew. In multi-tenant SaaS, a few large customers contribute disproportionate traffic, and their corpus subsets concentrate hits on a small slice of the index.
  • Graph topology skew. Inside a graph index like HNSW, a small set of high-degree "hub" nodes get traversed on almost every query. This is intrinsic to the data structure, not to your workload — and it shapes which parts of the index dominate cache pressure.

The first two are properties of your traffic. The third is a property of the algorithm. Both interact with the cache hierarchy, and neither shows up in the benchmark.
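
One cheap way to see the gap is to replay a benchmark's own query set under a Zipfian popularity law instead of uniform sampling and watch per-shard load and cache hit rates diverge. A minimal sketch in Python, assuming the benchmark queries are already in memory; the exponent `s=1.1` is a placeholder you should fit from your own access histogram:

```python
import numpy as np

rng = np.random.default_rng(0)

def zipfian_query_stream(queries, n_requests, s=1.1):
    """Replay a benchmark query set under Zipf(s) popularity instead of the
    uniform sampling public ANN benchmarks use. s=1.1 is a placeholder;
    real traffic exponents vary by product."""
    n = len(queries)
    ranks = np.arange(1, n + 1)
    weights = ranks ** -s          # rank-frequency law: hit rate ~ 1/rank^s
    weights /= weights.sum()
    picks = rng.choice(n, size=n_requests, p=weights)
    return [queries[i] for i in picks]
```

Run the same recall/latency harness over this stream and over the uniform stream; the delta in per-shard QPS and p99 is the skew cost the leaderboard never measured.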

Cache Behavior Is the Hidden Cost Model

Flat indexes scan vectors sequentially, so the CPU prefetcher works and cache misses are predictable. Graph indexes — HNSW and the family of disk-augmented variants like DiskANN — walk pointers in an order that depends on the query, and that order looks random to the prefetcher. Distance computations during graph traversal incur frequent cache misses, and modern systems like VSAG have started introducing software prefetching specifically because the access pattern is hostile to hardware caches.

Now layer skewed traffic on top. The hot embeddings — the ones every popular query touches — should live in L2/L3 or at least in the OS page cache. They don't, because the graph orders nodes by distance proximity, not access frequency, and the sections of the graph that hold popular vectors are interleaved with neighbors that are rarely visited. The result: under heavy skew, your "in-memory" index is paying RAM-fetch cost on every hop, and your "disk-augmented" index is paying SSD-fetch cost on the same hot vectors over and over.

This is why teams running DiskANN at scale care so much about caching the entry-point neighborhood: the entry points are the densest hubs, and pinning them in cache is the difference between a healthy p99 and a queue. It's also why tiered architectures have started winning. Milvus 2.6 ships an explicit hot/cold separation; the production pattern around S3 Vectors is to keep cold embeddings in object storage and promote hot ones to OpenSearch HNSW. Less than 10% of data tends to take 80%+ of query traffic, and treating those tiers identically is paying for memory you can't use efficiently.

The Per-Embedding Access Histogram You Don't Have

Every operational story about hot keys starts with a dashboard the team wishes they'd built earlier. For a relational database it's the slow-query log plus an access-frequency view by row or partition. For a KV store it's a hot-shard metric. For a vector store, the equivalent is a per-embedding (or per-cluster) access-frequency histogram, and almost no team builds it until something falls over.

The reason it's missing is that the index abstraction hides it. You query for a vector and get back a list of nearest neighbors. The system doesn't naturally tell you which embeddings are getting returned 100x more often than the median. You have to instrument the search call to log neighbor IDs (or hash buckets, if PII is a concern), aggregate over a rolling window, and visualize the distribution; a sketch of that plumbing follows the list. Once you do, three things become legible:

  • The hot tail. A small set of embeddings dominate the result distribution. Sometimes they're high-quality canonical answers, sometimes they're a content bug — a duplicate that's slightly closer to a popular query than the intended canonical.
  • The shard imbalance. If the index is sharded by ID range or by clustering centroid, the hot embeddings often cluster on one or two shards. Capacity-planning by mean traffic underprovisions those shards.
  • The cache-fit signal. The size of the hot tail tells you whether a per-query-result cache, a tiered hot index, or a full memory upgrade is the right intervention.
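
A minimal sketch of that instrumentation in Python, assuming a client whose search call returns `(id, distance)` pairs; `index.search`, the window length, and the metrics sink are all placeholders for whatever you actually run:

```python
import hashlib
import time
from collections import Counter

WINDOW_SECONDS = 3600
access_counts = Counter()              # hashed neighbor ID -> hits this window
window_start = time.monotonic()

def hash_id(neighbor_id: str) -> str:
    """Hash IDs so the histogram never stores raw keys (PII-safe buckets)."""
    return hashlib.sha256(neighbor_id.encode()).hexdigest()[:16]

def instrumented_search(index, query_vector, k=10):
    global window_start
    results = index.search(query_vector, k)      # placeholder client call
    for neighbor_id, _distance in results:
        access_counts[hash_id(neighbor_id)] += 1
    # Roll the window: snapshot and reset once per hour.
    if time.monotonic() - window_start > WINDOW_SECONDS:
        emit_histogram(access_counts)            # ship to your metrics pipeline
        access_counts.clear()
        window_start = time.monotonic()
    return results

def emit_histogram(counts):
    total = sum(counts.values())
    if total == 0:
        return
    top = counts.most_common(100)
    hot_share = sum(c for _, c in top) / total
    print(f"top-100 embeddings carry {hot_share:.0%} of {total:,} result hits")
```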

A team without this view is operating a vector store the way nobody operates a database: blind to which keys carry the load.

Semantic Caching Is the ANN Equivalent of a Page Cache

Semantic caching — caching responses keyed on the query embedding rather than the literal query string — is the cache layer ANN doesn't have built in. The economics are striking when the workload is skewed: production teams report that 40–70% of queries are semantic duplicates of earlier queries (different words, same intent), and well-tuned systems hit 45–65% cache hit rates inside the first week, climbing to 60–80% as the cache builds coverage. RAGCache, the academic system in this space, cuts time-to-first-token (TTFT) by 1.2–4× on RAG workloads by caching retrieved-knowledge KV state and using a prefix-aware GDSF replacement policy that takes order, frequency, size, and recency into account.

The right way to think about semantic caching is that it's the missing OS page cache for embeddings. The interesting design choices:

  • Key on the embedding's neighborhood, not the literal query. If you key on the exact query string you miss almost everything. If you key on the embedding and accept any cached response within a similarity threshold, you absorb most of the duplicate traffic (see the sketch after this list).
  • Set the threshold by domain. Customer support intents tolerate aggressive thresholds because the canonical answer is the same. Medical or legal retrieval tolerates almost none.
  • Per-tenant caches. Skew is sharper inside a single tenant than across the whole dataset; a global cache mostly trains on the largest tenant.
  • Treat it as an index, not a sidecar. A cache that doesn't track recall against the underlying index drifts. RAGCache-style approaches measure hit-rate against ground truth retrieval, not against itself.
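
A minimal sketch of the threshold-keyed cache, assuming unit-normalized embeddings and a brute-force scan over cached keys (fine for a few thousand entries; a real deployment would put a small ANN index in front and a smarter eviction policy behind):

```python
import numpy as np

class SemanticCache:
    def __init__(self, threshold=0.95, max_entries=10_000):
        self.threshold = threshold       # cosine-similarity floor for a hit
        self.max_entries = max_entries
        self.keys = []                   # unit-normalized query embeddings
        self.values = []                 # cached responses

    @staticmethod
    def _unit(v):
        v = np.asarray(v, dtype=np.float32)
        return v / (np.linalg.norm(v) + 1e-12)

    def get(self, query_embedding):
        if not self.keys:
            return None
        q = self._unit(query_embedding)
        sims = np.stack(self.keys) @ q   # cosine similarity; keys are unit-norm
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, query_embedding, response):
        if len(self.keys) >= self.max_entries:
            self.keys.pop(0); self.values.pop(0)   # naive FIFO eviction
        self.keys.append(self._unit(query_embedding))
        self.values.append(response)
```

Per-tenant deployment is then one instance per tenant, each with its own threshold, which is how you stop the largest tenant from training the cache for everyone else.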

Semantic caching does not replace the index. It absorbs the popular-query class so the index gets to spend its time on the long tail, which is the workload it was actually benchmarked on.

Plan Capacity for the Saturated Shard, Not the Mean

The default capacity-planning move for a vector index is to size memory to the working set and CPU to mean QPS. Both numbers lie under skew. The working set has to fit the hot tail with headroom for graph traversal overhead, and "mean QPS" averages a quiet shard with a saturated one. The right model has three numbers:

  • Hot-tail working set. The bytes occupied by the embeddings hit by the top decile of queries, plus the graph structure that links them, plus a margin for the high-degree hubs that get touched even on cold-tail queries.
  • p99 of the popular query class. Slice your latency histogram by query class (popular, mid-frequency, long-tail) before reporting p99; a sketch of this slicing follows the list. The composite number hides which class is suffering, and the popular class is where users notice — those are the queries the same people make every day.
  • Per-shard saturation, not fleet average. Hot keys mean one or two shards hit CPU or memory saturation while fleet-average utilization looks fine. Provision against the saturated shard.
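
Two of those numbers are easy to compute once the histogram and latency logs exist. A back-of-envelope sketch, assuming float32 vectors and a degree-bounded graph; the class boundaries and hub margin are assumptions to calibrate against your own histogram:

```python
import numpy as np

def query_class(hits_per_hour):
    """Placeholder class boundaries; set yours from the access histogram."""
    if hits_per_hour >= 100:
        return "popular"
    return "mid-frequency" if hits_per_hour >= 1 else "long-tail"

def p99_by_class(latencies_ms, hit_rates):
    """latencies_ms[i] and hit_rates[i] describe the same logged query."""
    buckets = {}
    for lat, rate in zip(latencies_ms, hit_rates):
        buckets.setdefault(query_class(rate), []).append(lat)
    return {cls: float(np.percentile(v, 99)) for cls, v in buckets.items()}

def hot_tail_bytes(n_hot, dim, graph_degree=32, hub_margin=1.25):
    """Working set = vectors + neighbor lists, padded for high-degree hubs."""
    vectors = n_hot * dim * 4            # float32
    links = n_hot * graph_degree * 8     # 8-byte neighbor IDs
    return int((vectors + links) * hub_margin)

# e.g. 5M hot vectors at 768 dims, degree 32: ~21 GB, not the ~15 GB the
# vectors alone would suggest.
```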

A team that plans against the mean is buying capacity that doesn't go where the load is. A team that plans against p99 of the popular class — and re-checks shard balance after every index rebuild — keeps the latency curve flat where users actually live.

Designing for the Workload You Have

The architectural shift is straightforward to state and uncomfortable to execute: stop operating your vector index like a uniform numerical workload, and start operating it like the skewed, cache-sensitive database it actually is. Concretely:

  • Build the per-embedding access histogram before you tune the index. Tuning ef_construction or nlist before you know which 5% of vectors carry 80% of traffic is optimizing the wrong parameter.
  • Adopt tiered storage when the hot/cold ratio is sharp. A hot HNSW tier in memory plus a cold tier in object storage or DiskANN is cheaper and faster than uniformly provisioning the whole index for the hot subset. A promotion sketch driven by the access histogram follows this list.
  • Layer semantic caching above the index. It's the cheapest absorber of duplicate traffic and the only one the index itself doesn't get for free.
  • Plan capacity by query class, not by mean. p99 of the popular class is the latency that ships in a customer's perception of your product.
  • Re-balance after index rebuilds. Centroid drift after a bulk re-embed can move the hot keys to different shards, and the planning above silently goes stale.
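
The promotion decision falls straight out of the rolling-window histogram. A sketch, assuming the hashed `access_counts` built earlier and a target traffic share you tune; actual promotion and demotion would go through your store's tiering API, which this stub only plans for:

```python
def plan_hot_tier(access_counts, hot_traffic_share=0.8):
    """Smallest set of embeddings covering `hot_traffic_share` of result
    hits: promote these to the in-memory tier, demote everything else."""
    total = sum(access_counts.values())
    if total == 0:
        return set()
    hot, covered = set(), 0
    for emb_id, hits in sorted(access_counts.items(),
                               key=lambda kv: kv[1], reverse=True):
        if covered / total >= hot_traffic_share:
            break
        hot.add(emb_id)
        covered += hits
    return hot
```

Re-running the plan on the same cadence as the histogram window keeps the tiers honest as the hot set drifts.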

The deeper realization is that ANN benchmarks measure a property of the algorithm; production cost is a property of the workload. The team that picks an index from a leaderboard is making an algorithmic choice that has to be paid for with operational discipline — the histograms, the tiering, the capacity model — that the benchmark didn't measure. That discipline is the difference between a retrieval system whose cost model matches its traffic and one that quietly burns RAM, SSD, and CPU on a graph walk that was never going to win against a hot-key workload.
