Choosing a Vector Database for Production: What Benchmarks Won't Tell You

· 10 min read
Tian Pan
Software Engineer

When engineers evaluate vector databases, they typically load ANN benchmarks and pick whichever system tops the recall-at-10 chart. Three months later, they're filing migration tickets. The benchmarks measured query throughput on a static, perfectly indexed dataset with a single client. Production looks nothing like that.

This guide covers the five dimensions that predict whether a vector database holds up under real workloads — and a decision framework for matching those dimensions to your stack.

The Benchmark Problem

The most widely cited vector database benchmarks — ANN Benchmarks, VectorDBBench — share a fundamental flaw: they test immediately after ingestion, when indexes are fully optimized and no concurrent writes are happening. Production systems insert, delete, and update vectors continuously while queries run. Under that churn, indexes need constant re-optimization, and this is where many vendors' out-of-memory failures hide.

A related problem: these benchmarks test with a single client. Production means 100+ concurrent clients hitting different metadata subsets simultaneously. The latency numbers fall apart under concurrency.

There's also a less-discussed legal wrinkle. Roughly 30% of major vector database vendors include "DeWitt Clause" provisions in their EULAs, prohibiting customers from publishing independent benchmarks without permission. If a vendor's product performs poorly under real conditions, you may be legally prevented from sharing that publicly.

The takeaway: benchmark position correlates weakly with production outcomes. The dimensions that actually matter are filtering accuracy, update latency, tenant isolation, hybrid search quality, and total cost of ownership — and most benchmarks measure none of them.

Filtering Accuracy: The Silent Killer at Scale

Almost every production RAG system needs metadata filtering. You're not searching all vectors — you're searching vectors belonging to a specific user, document type, or time range. How the database handles this filter is more consequential than raw recall.

There are three approaches:

Post-filtering retrieves a candidate set via vector search, then discards non-matching results. For low-cardinality filters (say, filtering to 2% of your data), you discard 98% of what you retrieved. You're wasting computation in direct proportion to how selective your filter is.

Pre-filtering applies the metadata filter first, then runs vector search on the remaining subset. This works for large, high-cardinality result sets but breaks HNSW graph connectivity for small subsets, causing recall to collapse.

Filterable HNSW (Qdrant's approach) integrates the filter into graph traversal itself. It creates subgraphs per payload value, merged back into the full graph, and uses adaptive query planning: if a filter matches many points, it does standard HNSW traversal skipping non-matching nodes; if the filter is highly selective, it may skip HNSW entirely and do a direct scan. The result is 1.6x faster queries than post-filtering at equivalent recall, with roughly 6% memory overhead for filterable fields.
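The cost of post-filtering is easy to quantify with back-of-envelope math. The sketch below assumes filter matches are uniformly distributed through the candidate set (a simplification — real distributions can be worse), and shows why a 2%-selective filter forces massive over-fetching:

```python
import math

def post_filter_survivors(top_k: int, selectivity: float) -> int:
    """Expected matches left after post-filtering a top_k candidate set,
    assuming matches are uniformly distributed (a simplification)."""
    return int(top_k * selectivity)

def candidates_needed(target: int, selectivity: float) -> int:
    """How many candidates to over-fetch so post-filtering still yields
    `target` results."""
    return math.ceil(target / selectivity)

# A 2%-selective filter: retrieving 100 candidates leaves ~2 usable
# results, and getting 10 usable results requires fetching ~500.
print(post_filter_survivors(100, 0.02))  # 2
print(candidates_needed(10, 0.02))       # 500
```

This is the arithmetic behind the "98% discarded" figure above: the over-fetch factor is the reciprocal of filter selectivity, which is why post-filtering cost explodes precisely when filters are most useful.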

Reddit's production deployment is instructive. Managing 340M+ vectors, they found metadata filtering — not similarity computation — became the primary performance bottleneck as concurrent users scaled. P99 latency jumped 10x due to disk I/O overhead from moving data between the vector graph and relational metadata store. The lesson: if filtering is central to your query pattern, evaluate it explicitly under concurrent load.

Update Latency: When Does a New Vector Become Searchable?

This question almost never appears in benchmarks, but it's critical for recommendation systems, live inventory search, and any application where the corpus changes frequently.

The approaches differ significantly:

  • Zero-delay freshness (GaussDB-Vector): newly inserted vectors are immediately visible upon commit via incremental on-disk index updates. Achieves sub-50ms latency with 95%+ recall on billion-scale datasets.
  • CDC-based eventual consistency (Vespa, some Milvus configurations): writes append to a change log that the index consumes asynchronously. Suitable when slight lag is acceptable and throughput is paramount.
  • Bulk reindexing: ingest off-peak and rebuild. Appropriate only for static or slowly changing corpora.

For most RAG pipelines, eventual consistency is acceptable — a new document becoming searchable within seconds is fine. But for product search or real-time recommendations, the delta between write and search visibility directly affects user experience. Clarify which model your candidate database uses before assuming freshness.
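One way to make the write-to-visibility delta concrete is a sentinel probe: insert a known document, then poll search until it appears. The sketch below uses a toy in-memory index with a fixed visibility delay standing in for a real database client (`LaggyIndex` and its methods are hypothetical; real clients differ), but the `measure_freshness_lag` pattern transfers directly:

```python
import time

class LaggyIndex:
    """Toy eventually-consistent index with a fixed visibility delay.
    Hypothetical stand-in for a real vector DB client."""
    def __init__(self, delay_s: float):
        self.delay_s = delay_s
        self._commits = {}  # doc_id -> commit timestamp

    def insert(self, doc_id: str) -> None:
        self._commits[doc_id] = time.monotonic()

    def is_searchable(self, doc_id: str) -> bool:
        t = self._commits.get(doc_id)
        return t is not None and time.monotonic() - t >= self.delay_s

def measure_freshness_lag(index, doc_id, poll_s=0.01, timeout_s=5.0):
    """Insert a sentinel document and poll until it shows up in search,
    returning the observed write-to-search lag in seconds."""
    start = time.monotonic()
    index.insert(doc_id)
    while not index.is_searchable(doc_id):
        if time.monotonic() - start > timeout_s:
            raise TimeoutError("sentinel never became searchable")
        time.sleep(poll_s)
    return time.monotonic() - start

lag = measure_freshness_lag(LaggyIndex(delay_s=0.05), "sentinel-1")
print(f"write-to-search lag: {lag:.3f}s")
```

Run continuously in production, this probe doubles as the index-freshness monitor discussed later — lag trending upward is an early warning sign.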

Multi-Tenancy: Three Approaches, Three Tradeoffs

If you're building a multi-tenant application — one where each customer's data must be logically or physically isolated — your tenant isolation architecture shapes every subsequent performance and cost decision.

Per-tenant indexes give each tenant their own index. Search is unfiltered, and the index structure can be tuned to each tenant's vector distribution. The cost: memory grows linearly with tenant count, and vectors are duplicated across indexes when shared documents exist. Weaviate tested this pattern at 100,000 collections (one per tenant), managing 100M+ vectors — it's viable but the memory overhead is substantial.

Shared index with metadata filtering puts all vectors in one index with per-tenant identifiers. Memory-efficient and flexible. The cost: runtime permission checks add latency, and a shared index structure may perform poorly for tenants with unusual vector distributions.

Namespaces (Pinecone's recommended approach) partition an index into isolated segments. Simpler than full per-tenant indexes, cheaper than per-tenant memory overhead. The cost: namespace management becomes its own operational complexity at large tenant counts.

The right choice depends on your isolation requirements and scale. If regulatory compliance mandates strict data isolation per tenant, per-tenant indexes may be unavoidable regardless of cost. If isolation is logical rather than legal, shared-index filtering is usually sufficient.
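The shared-index and namespace approaches can be contrasted in a few lines. This is a deliberately naive in-memory sketch (brute-force dot-product scoring, no HNSW, hypothetical class names) meant only to show where the tenant boundary sits in each design:

```python
from collections import defaultdict

class SharedIndex:
    """All tenants in one index; isolation is a runtime metadata filter."""
    def __init__(self):
        self.vectors = []  # (tenant_id, doc_id, vec)

    def insert(self, tenant_id, doc_id, vec):
        self.vectors.append((tenant_id, doc_id, vec))

    def search(self, tenant_id, query, top_k=3):
        # Runtime permission check: only this tenant's vectors are scored.
        scored = [(sum(a * b for a, b in zip(vec, query)), doc_id)
                  for t, doc_id, vec in self.vectors if t == tenant_id]
        return [d for _, d in sorted(scored, reverse=True)[:top_k]]

class NamespacedIndex:
    """One partition per tenant; a search never sees other segments."""
    def __init__(self):
        self.namespaces = defaultdict(list)

    def insert(self, tenant_id, doc_id, vec):
        self.namespaces[tenant_id].append((doc_id, vec))

    def search(self, tenant_id, query, top_k=3):
        scored = [(sum(a * b for a, b in zip(vec, query)), doc_id)
                  for doc_id, vec in self.namespaces[tenant_id]]
        return [d for _, d in sorted(scored, reverse=True)[:top_k]]

shared, spaced = SharedIndex(), NamespacedIndex()
for idx in (shared, spaced):
    idx.insert("acme", "doc-a", [1.0, 0.0])
    idx.insert("acme", "doc-b", [0.0, 1.0])
    idx.insert("globex", "doc-c", [1.0, 0.0])

query = [1.0, 0.2]
print(shared.search("acme", query))  # same results, different isolation
print(spaced.search("acme", query))
```

Both return identical results for a tenant; the difference is that the shared index pays the filter predicate on every query, while the namespaced index pays memory and management overhead per partition.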

Hybrid Search Quality: Dense + Sparse at Scale

Most production search benefits from combining semantic (dense vector) and keyword (sparse BM25) retrieval. Code queries, named entities, product SKUs, and technical jargon are all cases where pure semantic similarity fails.

Hybrid search runs both retrieval methods in parallel, then fuses the ranked lists — typically with Reciprocal Rank Fusion (RRF): for each document, it accumulates 1/(k + rank) across methods, where k is a smoothing constant. Many implementations additionally expose an alpha parameter that weights each method's contribution before fusion.
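The RRF formula above is small enough to show in full. A minimal sketch, using the commonly cited default of k=60:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over ranked lists of
    1 / (k + rank), with rank starting at 1. k=60 is a common default."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d1", "d2", "d3"]   # semantic retrieval order
sparse = ["d3", "d1", "d4"]   # BM25 retrieval order
print(rrf_fuse([dense, sparse]))  # ['d1', 'd3', 'd2', 'd4']
```

Documents appearing high in both lists (d1, d3) dominate the fused ranking; documents seen by only one retriever still survive but rank below them. A weighted variant multiplies each list's contribution by a per-method weight, which is what an alpha knob controls.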

At scale (100M+ vectors), maintaining both sparse and dense indexes simultaneously adds storage and computational overhead. The quality of fusion matters as much as the quality of individual retrievals. Weaviate supports two fusion algorithms with different quality-latency tradeoffs and recommends evaluating both against your specific query distribution.

Product Quantization (PQ) enables single-server deployment of billion-scale datasets by reducing vector memory by up to 20x — you store compressed vectors for candidate retrieval, then refine with exact distances. At this scale, you need enough precision to find candidates, not perfect precision for every comparison.
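The memory arithmetic behind PQ is worth doing explicitly. The sketch below ignores codebook and graph overhead, and the compression ratio depends entirely on the chosen number of subquantizers — the parameters here are illustrative, not a specific vendor's defaults:

```python
def pq_memory_bytes(n_vectors: int, dim: int, m_subspaces: int, bits: int = 8):
    """Raw float32 storage vs. PQ codes: each vector is reduced to
    m_subspaces codes of `bits` bits each. Codebook overhead ignored."""
    raw = n_vectors * dim * 4                      # 4 bytes per float32
    compressed = n_vectors * m_subspaces * bits // 8
    return raw, compressed

raw, pq = pq_memory_bytes(1_000_000_000, 768, 96)
print(f"raw: {raw / 2**30:.0f} GiB, PQ: {pq / 2**30:.0f} GiB, "
      f"ratio: {raw / pq:.0f}x")
```

With 768-dim vectors and 96 subquantizers, a billion vectors drop from roughly 2.9 TiB to about 90 GiB — a 32x reduction in this configuration; coarser or finer settings land elsewhere, which is why compression claims are always quoted as "up to" some factor.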

The Full Comparison: Picking the Right Architecture

With those dimensions in mind, the main architectural categories split as follows:

pgvector (PostgreSQL extension) is compelling for teams that can't justify running a separate system. At single-digit million scale, Supabase benchmarks show pgvector's HNSW index matching or beating dedicated databases at 99% accuracy. The 60-80% cost savings over managed vector DBs are real, and keeping vector and relational data in one ACID-consistent store eliminates a category of consistency bugs. The practical limits: HNSW indexing runs on your primary database server, competing with application queries. Heavy filtering degrades as the dataset grows. Plan for a migration around 50-100M vectors.

Pinecone offers zero operational overhead, which is genuinely valuable. The production gotcha is P99 latency: tail latency randomly spikes 1-2 seconds during internal scaling events even when average latency looks healthy. The pricing model scales linearly with no volume efficiency — at 100M vectors, monthly bills can rival your entire remaining infrastructure. Observability is limited to what Pinecone surfaces in their console; you have no access to shard-level metrics or internal logs.

Qdrant is the strongest choice when metadata filtering is central to your workload. The filterable HNSW implementation solves the pre/post filtering tradeoff at the algorithm level. Self-hosted, so operational burden is real, but the filtering performance advantage compounds as your filter cardinality varies.

Weaviate provides full observability through Prometheus metrics, dashboards, and tracing — you can see resource contention at the shard level. Hybrid search is well-implemented. The tradeoff: setup and maintenance of the observability stack is your responsibility.

Milvus is the established choice for 100M-1B+ scale deployments that need distributed architecture. It runs on Kafka, MinIO, and etcd — significant operational complexity that requires Kubernetes expertise. Teams at ByteDance, e-commerce platforms, and genomics companies run Milvus in production at this scale. The operational burden is justifiable when cost savings from self-hosting at scale outweigh the infrastructure investment.

TurboPuffer (object-storage native) costs a fraction of managed alternatives by storing vectors on S3 and pulling data up cache hierarchies on demand. Its JIT compilation model means query latency improves as data gets cached. The surprise is S3 request costs: 1M GETs/day adds roughly $12/month — acceptable, but teams that benchmark query count without modeling request costs discover this late. A separate concern: embeddings are lossy compressions of source text, meaning unencrypted vectors in S3 expose more recoverable data than teams expect.
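The request-cost math is simple enough that there's no excuse for discovering it late. The sketch below uses S3 standard-tier GET pricing of $0.0004 per 1,000 requests, which was the list price at the time of writing — check current pricing before relying on it:

```python
def s3_get_cost_per_month(gets_per_day: int,
                          price_per_1k: float = 0.0004,
                          days: int = 30) -> float:
    """Monthly S3 GET request cost. price_per_1k assumes standard-tier
    list pricing; verify against current AWS pricing."""
    return gets_per_day * days / 1000 * price_per_1k

print(f"${s3_get_cost_per_month(1_000_000):.2f}/month")  # $12.00/month
```

The same function makes it easy to stress-test assumptions: ten million GETs a day is $120/month, still modest, but a cache-hostile access pattern can push request counts far beyond naive query-count estimates.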

Vespa handles high-velocity updates well — up to 100K writes/second per node with consistent freshness — and supports complex hybrid queries at billion scale. It's less commonly used in web application contexts but excellent for search infrastructure that needs to combine text ranking signals with semantic similarity.

The Decision Framework

Rather than a head-to-head feature matrix, the more useful framing is: what's the dominant constraint?

  • < 10M vectors, team has PostgreSQL expertise: start with pgvector. Zero incremental infrastructure, ACID semantics, familiar tooling. Revisit when filtering or scale forces the conversation.
  • Filtering cardinality is variable and unpredictable: evaluate Qdrant. The filterable HNSW advantage compounds when filter selectivity varies widely across queries.
  • Compliance requires data residency in your VPC: self-hosted only (Qdrant, Weaviate, Milvus). Pinecone and most managed services are non-starters.
  • Scale > 100M, team can absorb operational complexity: Milvus or Vespa. The cost savings at scale justify the Kubernetes investment.
  • Variable or bursty workloads, cost is primary constraint: TurboPuffer or pgvector. Avoid Pinecone's linear pricing model.
  • Zero operational overhead is mandatory: Pinecone — but load test against your concurrent workload specifically and measure P99, not P50.
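The bullets above can be encoded as a first-pass triage function. This is a sketch of one reasonable check ordering (compliance first, then ops constraints, then scale), not a substitute for load testing your shortlist:

```python
def recommend(vectors: int, needs_vpc: bool, variable_filters: bool,
              zero_ops: bool, cost_sensitive: bool, has_pg: bool) -> str:
    """First-pass triage over the decision bullets; the check order
    reflects one judgment about which constraints dominate."""
    if needs_vpc:                           # compliance is non-negotiable
        return "self-hosted: Qdrant / Weaviate / Milvus"
    if zero_ops:                            # managed service mandatory
        return "Pinecone (load test P99 under concurrency)"
    if vectors > 100_000_000:               # distributed scale
        return "Milvus or Vespa"
    if variable_filters:                    # filterable HNSW advantage
        return "Qdrant"
    if cost_sensitive:
        return "TurboPuffer or pgvector"
    if vectors < 10_000_000 and has_pg:     # simplest viable start
        return "pgvector"
    return "benchmark shortlist against your own workload"

print(recommend(5_000_000, False, False, False, False, True))  # pgvector
```

The useful property of writing it down this way is that it forces you to rank constraints: a team that puts `cost_sensitive` above `zero_ops` has made a different decision, and making that ordering explicit is the point of the exercise.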

What Ages Poorly

The retrieval system is more likely to become a liability than the model layer. Models improve and can be swapped; a vector database with 200M embeddings built on a proprietary format creates meaningful migration friction.

Three things degrade silently:

Index freshness signals: without monitoring for lag between writes and searchability, production systems drift toward serving stale results that look correct but aren't.

Filtering correctness under schema change: if your metadata schema evolves — a new field, a changed enum value — indexes built on old field definitions can return incorrect results without raising errors.

Embedding compatibility: upgrading your embedding model invalidates your entire index. Every major vector database requires a full reindex when you change embedding models. The retrieval system that "just works" today has a latency-critical migration event hidden somewhere in its future.

None of these are reasons to avoid building — but they're reasons to treat your vector infrastructure with the same architectural care you'd give your primary database. The retrieval layer is becoming load-bearing infrastructure, and the teams that treat it that way early avoid the 3 a.m. incidents later.
