
Vector DB Sharding: Why HNSW Breaks at Partition Boundaries and What to Do About It

· 9 min read
Tian Pan
Software Engineer

Most vector database tutorials show you how to insert a million embeddings and run a query. What they don't show you is what happens six months later, when your corpus has grown past what a single node can hold, and you're trying to shard the HNSW index your entire retrieval pipeline depends on. The answer, which vendors leave out of the marketing copy, is that HNSW graphs resist partitioning in ways that cause silent recall degradation — and the operational patterns needed to recover that quality add real complexity.

This post covers the technical reasons HNSW sharding breaks down, what recall loss looks like in practice, and the operational patterns teams use to maintain retrieval accuracy when they've outgrown a single node.

Why HNSW Doesn't Partition Cleanly

HNSW (Hierarchical Navigable Small World) is the index structure underlying most production vector databases: Qdrant, Weaviate, and Milvus default to it, and pgvector offers it as an index type. The algorithm builds a multi-layer graph where each vector connects to a small set of neighbors, with top layers providing long-range traversal and lower layers enabling precise local search.
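To make the structure concrete, here's a minimal single-node sketch using the hnswlib library; parameter values like M=16 and ef_construction=200 are illustrative, not tuning advice.

```python
import numpy as np
import hnswlib

dim, n = 128, 10_000
data = np.random.rand(n, dim).astype(np.float32)

# Build a single-node HNSW index. M bounds the per-vector edge count,
# ef_construction controls how widely the build searches for neighbors.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)
index.add_items(data, np.arange(n))

# ef controls query-time search breadth (the recall/latency tradeoff).
index.set_ef(64)
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)  # (5, 10): top-10 neighbor ids for 5 queries
```

Everything that follows in this post is about what happens when this graph no longer fits on one machine.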

The critical property of HNSW is that neighbor relationships are defined by position in high-dimensional space, not by any key or partition scheme you control. Two vectors that are nearest neighbors are likely stored on different nodes once you've distributed your corpus. When you shard, you're forced to remove the edges that cross shard boundaries — and those edges are exactly the ones the graph relies on for accurate search.

The consequence shows up in benchmarks from graph partitioning research: over 80% of the search steps in a distributed HNSW query become remote procedure calls across shard boundaries. The graph doesn't "nearly work" when partitioned; it fundamentally changes from a coherent global structure into a collection of isolated local graphs that each think they're the whole picture.

There's a subtler failure mode too: unreachable points. When an HNSW graph is partitioned under real-time update conditions, some vectors lose all their in-graph connections and become effectively invisible to queries. Research on this problem found that in-degree defects — vectors with poor connectivity after partitioning — account for 84.6% of total recall loss in distributed scenarios.

What Recall Loss Actually Looks Like

The intuitive mental model of recall degradation is gradual and proportional: more shards, slightly worse results. The actual pattern is less predictable.

Under carefully controlled conditions — enforcing balanced partitions by reassigning vectors to nearest clusters — recall drops by 0.2% to 3.6% depending on the partitioning strategy. That range sounds small until you're running a recommendation system at scale and every 1% recall drop is a measurable reduction in engagement.

What practitioners actually encounter in production is more discontinuous. Performance remains acceptable until your corpus crosses a threshold where HNSW's working set no longer fits in RAM, then drops sharply. Systems like PipeANN and SPANN couldn't operate at all below a 30% memory-to-index ratio. DiskANN saw steep throughput drops below 20% memory availability. The cliff isn't predicted by the benchmark numbers vendors publish, which are measured on hardware configured to keep the full index in memory.

Cross-shard neighbors are the root cause. When a query vector's true nearest neighbors are split across shards, each shard's local HNSW search returns its best local results — which may have no overlap with the globally correct answer. The query coordinator merges results that are each locally optimal but collectively wrong.
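Before reaching for fixes, it helps to be able to measure the gap directly. The sketch below is one way to do that with NumPy, assuming you can sample a query set and run it through both an exact brute-force search and your sharded pipeline; the function names are illustrative.

```python
import numpy as np

def exact_topk(corpus: np.ndarray, queries: np.ndarray, k: int) -> np.ndarray:
    """Brute-force ground truth: indices of the k nearest corpus vectors per query (L2)."""
    # (num_queries, num_corpus) distance matrix; fine for evaluation-sized samples.
    d = np.linalg.norm(queries[:, None, :] - corpus[None, :, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

def recall_at_k(ground_truth: np.ndarray, retrieved: np.ndarray) -> float:
    """Fraction of true top-k neighbors that the sharded pipeline actually returned."""
    hits = [len(set(gt) & set(r)) / len(gt) for gt, r in zip(ground_truth, retrieved)]
    return float(np.mean(hits))

# Usage: `retrieved` comes from your sharded query path (per-shard search + merge);
# compare it against exact ground truth on a held-out query sample.
```

Tracking this number over time, as shards grow and drift, is how the silent part of "silent recall degradation" gets caught.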

The Three Operational Patterns That Compensate

Teams that have scaled past the single-node threshold have converged on three patterns that partially recover the recall quality that partitioning destroys.

Shard-aware query routing avoids the brute-force scatter-gather approach (send query to all shards, merge results) by using coarse-grained index structures to route queries only to relevant shards. The dominant implementation uses centroid-based routing: maintain a small index of cluster centroids, compare incoming queries to that index first, then search only the shards that contain nearby centroids. This can reduce fan-out from hundreds of shards to single digits for well-clustered corpora. The tradeoff is that queries near cluster boundaries still require multi-shard access, and maintaining the centroid index adds a coordination layer that needs its own operational care.
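A minimal sketch of that routing step, assuming each shard keeps a small set of k-means centroids computed at partition time and exposes its own search endpoint — both assumptions, not any particular vendor's API:

```python
import numpy as np

def route_query(query: np.ndarray,
                shard_centroids: dict[str, np.ndarray],
                top_shards: int = 2) -> list[str]:
    """Pick the shards whose centroids lie closest to the query.

    shard_centroids maps shard_id -> (num_centroids, dim) array of that shard's
    cluster centroids (e.g. produced by k-means when the corpus was partitioned).
    """
    best = {}
    for shard_id, centroids in shard_centroids.items():
        dists = np.linalg.norm(centroids - query, axis=1)
        best[shard_id] = dists.min()  # the shard's closest centroid decides its rank
    # Fan out only to the few shards most likely to hold the true neighbors;
    # queries near cluster boundaries will still need more than one.
    return sorted(best, key=best.get)[:top_shards]
```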

Cross-shard re-ranking addresses the inconsistent ranking problem from per-shard HNSW graphs. Each shard returns its local top-K candidates with distance scores. A coordinator collects all shard results, merges the candidate lists, and re-ranks globally. The implementation detail that matters: fan-out dominates latency, so the slowest shard determines query latency regardless of how fast the others respond, and a single overloaded shard's tail matters more than the average latency across shards. Re-ranking itself is cheap; it's the wait for the last shard that kills P99.
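A sketch of the coordinator's merge step, assuming each shard query returns (id, distance) pairs in a shared metric where smaller is better, and that shard calls are issued concurrently so the slowest one sets the wall-clock cost:

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_shards(query, shards, k: int):
    """Fan out to every routed shard in parallel, then merge into a global top-k.

    `shards` is a list of callables shard(query, k) -> list[(vector_id, distance)];
    any real client (gRPC stub, HTTP client, ...) can be wrapped to fit this shape.
    """
    with ThreadPoolExecutor(max_workers=len(shards) or 1) as pool:
        futures = [pool.submit(shard, query, k) for shard in shards]
        # Query latency is governed by the slowest of these futures (the P99 problem).
        per_shard = [f.result() for f in futures]

    # Global re-rank: pool all local candidates and keep the k smallest distances.
    # Distances must be comparable across shards (same metric, no per-shard scaling).
    merged = [item for result in per_shard for item in result]
    return heapq.nsmallest(k, merged, key=lambda pair: pair[1])
```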

Partition-size monitoring is the unglamorous pattern that prevents the other two from becoming necessary more often than they should be. Industry experience has converged on 10–30 million vectors per shard as a practical working range, with 4–8 shards as the upper bound before inter-worker communication overhead overtakes parallelization gains. What's operationally important: shard count cannot be modified after collection creation in most systems, so the partition strategy you choose on day one becomes load-bearing. Monitoring how fast each shard grows — and triggering re-clustering before any shard reaches the memory cliff — is what keeps the other two patterns from doing more work than they need to.
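In practice this can be as simple as a periodic check over per-shard vector counts. The thresholds below are hypothetical, picked from the 10–30 million working range above; the counts themselves would come from whatever stats endpoint your database exposes.

```python
# Hypothetical monitoring check: flag shards approaching the working-range ceiling
# so re-clustering can be planned before any shard hits the memory cliff.
SOFT_LIMIT = 25_000_000   # warn here (inside the 10-30M per-shard working range)
HARD_LIMIT = 30_000_000   # re-cluster before crossing this

def shards_needing_attention(shard_sizes: dict[str, int]) -> list[tuple[str, int, str]]:
    """shard_sizes maps shard_id -> current vector count (from your DB's stats API)."""
    flagged = []
    for shard_id, count in shard_sizes.items():
        if count >= HARD_LIMIT:
            flagged.append((shard_id, count, "re-cluster now"))
        elif count >= SOFT_LIMIT:
            flagged.append((shard_id, count, "plan re-clustering"))
    return flagged
```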
