The Vector Index That Was Sharded by Ingestion Date
There is a specific kind of recall lie that hides inside time-partitioned vector indexes, and the people who built the offline eval are usually the last to find it. The dashboard says recall@10 is 0.94. The retriever is shipping the right snippet 94% of the time. The product team is shipping more retrieval-grounded features on the back of that number. And then the support tickets arrive: "the assistant cited a guide that does not match the answer," "the assistant linked to last week's version of the policy," "the assistant could not find a document I uploaded two months ago." None of those tickets contradict the 0.94. They are evidence that the 0.94 is measuring the wrong thing.
The mechanism is simple and easy to miss. The vector index is sharded by ingestion date because that is the easiest way to keep write throughput high, retire old data, and keep the hot working set in fast memory. The offline test set is generated nightly from production logs, which means the queries are drawn from the same recent window that the freshest shard happens to hold. Recall is measured against ground truth that lives one or two shards deep. The retriever performs beautifully on those queries because, in production, those queries are the ones the routing layer keeps inside the same shard.
Then a real user asks something that requires crossing shards. The router fans out, the top-k is merged across partitions, and the partitions return their own local top-k before the merge. The merge is correct in the textbook sense — the highest-scoring vectors from each shard get combined — but the highest-scoring vector inside the freshest shard is almost always a better match than the second-best vector inside a stale shard, because the freshest shard contains the recently rewritten version of the document the user is actually looking for. The "right" snippet, the one that would have answered the question two months ago, sits unfindable inside a partition the merge step deprioritized.
Why the offline number stays green
Vector search does not partition cleanly the way relational data does. A nearest-neighbor query is global by definition — the right answer can live anywhere in the embedding space — so any sharding scheme is a bet about which neighbors a typical query will need. Centroid-based sharding makes that bet on geometry: vectors that are close in embedding space land on the same shard, and routing sends each query to a small number of probable clusters. The bet is a good one because the topology of the query distribution and the topology of the data distribution tend to overlap. Ingestion-date sharding makes a completely different bet. It assumes the time a document was written predicts the time someone will ask about it. That assumption is true for news, partially true for product catalogs, and false for documentation, policy, and reference content — exactly the corpora most RAG systems are built on.
The offline eval cannot detect this because the eval set is sampled with the same temporal bias the architecture imposes. If your ingestion-date sharding is doing its job, the most recent shard is also the shard that holds the most recently embedded version of the document that answers today's query. Your log-sampling pipeline scrapes queries from yesterday's traffic, computes ground-truth nearest neighbors with a full-corpus scan, and then evaluates the production retriever, which happens to fan out across all shards anyway during offline mode because there is no latency pressure. The benchmark turns into a measure of how often the answer lives in the freshest slice — and because of the temporal bias of the sample, the answer almost always does. Recall@10 looks like a property of the retriever. It is really a property of the data your eval happens to ask about.
The cross-shard merge that loses information
The pathology lives in how distributed top-k is computed. Each shard independently produces its k highest-scoring vectors, the router collects (number of shards × k) candidates, and the merge step picks the global top-k by score. This is the only practical way to do it at scale — you cannot afford to ship every vector back to a central node for re-ranking — but it is mathematically the same as ANN already: an approximation, with a different failure mode.
The standard ANN failure is that the wrong neighbor wins inside a single shard. The cross-shard failure is more insidious: the right neighbor never makes it out of its shard because the local top-k cut it. Imagine a query whose true nearest neighbor lives in a six-month-old shard with low query density. The cosine similarity of the right document is 0.81. The freshest shard, with the recently updated version of a different document, returns a snippet at 0.83. The right answer is dominated locally because the stale shard contains many near-duplicates of the original — and "near-duplicate" in embedding space is enough to keep the right one out of the local top-k. The merge step never sees the 0.81 candidate. The dashboard, which evaluates against a query distribution drawn from the same recent shard, never sees the failure.
This is why "increase k" feels like the obvious fix and almost never works. Raising the per-shard k raises latency linearly and cost quadratically; the recall improvement is sublinear because the stale shard's local distribution is still dominated by the same near-duplicate cluster. The latency budget you bought back by sharding is the same budget you have to spend to compensate.
Diagnostics that actually detect this
The signal you want is not aggregate recall@10. It is recall conditioned on which shard the true neighbor lives in. To get that number, you need an eval set whose ground-truth neighbors are deliberately distributed across the shard population in proportions that match production — not in proportions that match your log sampler.
A workable construction:
- Build a query set by stratifying on document age, not on traffic volume. If 30% of your queries in production need an answer older than 90 days, your eval set should be at least 30% from that bucket.
- Compute ground truth with a full-corpus brute-force scan. Note the shard each top-1 lands in.
- Run the production retriever over the same queries. Measure recall conditional on the ground-truth shard. If recall@10 is 0.97 for queries whose answer lives in the freshest shard and 0.62 for queries whose answer lives in shards older than 90 days, the aggregate number is hiding a 35-point gap.
The same construction tells you whether the gap is closing or opening over time. If your corpus has aged for six months since the original sharding decision, the population that was once 80% in the freshest shard might now be 50%, and your aggregate recall will silently drift even though no parameter has changed.
A second diagnostic worth running: replay the same query across each shard in isolation, capture each shard's local top-k, and ask whether the global top-k that the merge step produced would have changed if the locally-cut candidates were re-included. The fraction of queries where the answer changes is the recall debt you owe the merge step. In well-conditioned indexes it is below 2%. In an ingestion-date-sharded index serving documentation queries, it can be 15 to 25%.
What to do once you can see it
The first instinct is to re-shard. Sometimes that is the right call: if the corpus and query distribution both have stable temporal locality (a news search product, a recent-feed recommender), ingestion-date sharding survives this analysis and can stay. For most RAG systems on a slow-moving corpus, centroid-based or hybrid sharding is the better fit. The cost of re-sharding is one-time and large; the cost of living with the wrong scheme is silent and continuous.
The second instinct is to fan out wider — more probes, larger per-shard k, more shards searched per query. This is sometimes the right intermediate step, especially when re-sharding is blocked by an ingestion pipeline you do not control. The trick is to make the fan-out adaptive to the query rather than uniform across queries. If you can cheaply estimate the age of the document a query is likely to need — using a small auxiliary classifier on the query embedding, a rules-based date extractor, or a per-tenant prior — you can route deep into older shards only for the queries that need it. Most queries pay the original latency cost; the long tail pays for itself.
The third move is at the eval layer. Even if the index stays the same, the eval should change. Aggregate recall@10 should be reported alongside recall@10 stratified by ground-truth shard, by document age, and by query age. The dashboards should refuse to display a single number when the conditional numbers diverge by more than a few points. A single recall number on a sharded index is almost always lying about something; the discipline is to force the eval to expose what the architecture chose to hide.
The broader pattern
This failure mode is one instance of a more general one: any time the data layout and the eval data have a shared bias, the eval becomes a measurement of the bias. Sampling queries from production traffic is the natural default, but production traffic is shaped by the system you are evaluating — which means the eval inherits the system's blind spots. The data that does not get retrieved does not generate queries that get sampled. The shard that does not get probed does not contribute to the test set. The corpus drift that opens a gap between freshest and oldest is the same drift that quietly reweights the test set toward what already works.
The fix is structural, not metric. Construct evaluation populations against a model of where the answers should come from, not where the queries happen to live. Stratify by the dimensions of the architecture you suspect are uneven — shards, tenants, time buckets, embedding versions, language, document type. Aggregate metrics survive intact when the strata are balanced. When they are not, the aggregate is a story the architecture is telling itself.
The retrieval team's job is to refuse to let that story stay coherent. Recall is not 0.94. Recall is 0.97 on the slice your sampler liked and 0.62 on the slice that ships the support tickets. Anyone who confuses those two numbers will keep being surprised by the same class of bug.
- https://sarthakai.substack.com/p/a-vectordb-doesnt-actually-work-the
- https://aakashsharan.com/distributed-vector-database-architecture-sharding-routing/
- https://apxml.com/courses/advanced-vector-search-llms/chapter-4-scaling-vector-search-production/sharding-strategies-vector-indexes
- https://apxml.com/courses/advanced-vector-search-llms/chapter-5-advanced-tuning-evaluation/online-offline-evaluation
- https://medium.com/@Nexumo_/the-8-retrieval-benchmarks-lying-to-your-rag-5811ca3ee057
- https://redis.io/blog/common-challenges-working-with-vector-databases/
- https://opensourceconnections.com/blog/2025/02/27/vector-search-navigating-recall-and-performance/
- https://dev.to/kuldeep_paul/ten-failure-modes-of-rag-nobody-talks-about-and-how-to-detect-them-systematically-7i4
