Reranking Is the Real Work: Why Your Retrieval System's Bottleneck Is Never the Index
Teams building RAG systems almost universally hit the same wall: they spend a week tuning their HNSW index parameters, add product quantization, push recall@100 from 0.81 to 0.87 — and then watch LLM output quality barely budge. The assumption baked into all that effort is that a better index equals better answers. It doesn't. The bottleneck was never the index.
The actual chokepoint is the ranking step between your candidate set and your context window. What you put into the LLM determines what comes out, and the job of ranking is to ensure that the most genuinely relevant documents, not just the most semantically similar ones, make it through. That distinction matters more than any HNSW configuration you'll ever tune.
Why Embeddings Lie About Relevance
Bi-encoder embedding models encode the query and each document independently, then score by cosine similarity. The model has no awareness of the query while encoding a document — it encodes each text as a standalone point in vector space and then measures geometric proximity. This works well enough for topically related content. It fails reliably whenever relevance is contextual.
Consider a legal corpus where a query asks "penalties for late payment under UCC §2-709." The top retrieved chunks by cosine similarity might include: a general overview of UCC Article 2, a passage about payment terms in contract law, and a paragraph about remedy limitations. All semantically adjacent. None of them answer the question. The right passage — buried at rank 23 — mentions §2-709 explicitly but in a context the embedder mapped far from the query's embedding.
This isn't a pathological edge case. It's the rule in professional domains. Embeddings capture topic proximity; relevance is about intent, specificity, and context. Index tuning cannot fix a model that lacks awareness of what the query actually needs.
The structural gap: bi-encoders score query and document independently, so there's no interaction signal. A cross-encoder model takes the query-document pair as a combined input and runs full attention across both. It can see that a document answers a specific question, even when its topical fingerprint is generic. That's the insight that makes reranking powerful.
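To make that distinction concrete, here's a minimal sketch of the two scoring paths using sentence-transformers. The model checkpoints are illustrative stand-ins, not recommendations, and the toy documents echo the legal example above.

```python
# Minimal sketch of the bi-encoder vs. cross-encoder scoring gap,
# using sentence-transformers. Model names are illustrative only.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "penalties for late payment under UCC §2-709"
docs = [
    "General overview of UCC Article 2 and the sale of goods.",
    "Payment terms and remedy limitations in commercial contracts.",
    "Under §2-709, the seller may recover the price of goods accepted ...",
]

# Bi-encoder: query and documents are embedded independently,
# then compared by cosine similarity -- no query-document interaction.
bi = SentenceTransformer("all-MiniLM-L6-v2")
q_emb = bi.encode(query, convert_to_tensor=True)
d_emb = bi.encode(docs, convert_to_tensor=True)
cosine_scores = util.cos_sim(q_emb, d_emb)[0]        # shape: (len(docs),)

# Cross-encoder: each (query, document) pair goes through one joint
# forward pass, so the model attends across both texts at once.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
ce_scores = ce.predict([(query, d) for d in docs])   # one score per pair

for d, cos, rel in zip(docs, cosine_scores.tolist(), ce_scores):
    print(f"cosine={cos:.3f}  cross-encoder={rel:.3f}  {d[:60]}")
```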
The Two-Stage Architecture and Why It Exists
The reason you can't just run a cross-encoder against your full corpus is latency. A cross-encoder requires a full forward pass per candidate document at query time — there's no precomputation. At 40 QPS with a cross-encoder scoring 10,000 documents per query, your p99 latency collapses. Cross-encoder overhead at that scale pushes p99.9 above 21 seconds.
The two-stage pipeline resolves this:
Stage 1 — Candidate generation: Run fast approximate nearest-neighbor retrieval (ANN with HNSW) or BM25 to generate 100–500 candidates in 5–30ms. This stage optimizes for recall — the goal is to not miss anything relevant, even at the cost of including irrelevant documents.
Stage 2 — Reranking: Apply an expensive, accurate model to only the shortlist. Cross-encoders working on 50–100 documents add 100–200ms, keeping total latency under 300ms for most production use cases.
The key design constraint: Stage 1's job is recall, not precision. You don't need it to be great at ranking. You need it to ensure that the truly relevant documents appear somewhere in the top 500. Then Stage 2 handles the precision work.
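As a sketch of how the two stages fit together in code, assuming a sentence-transformers bi-encoder and cross-encoder, with a hypothetical `ann_index.search()` call standing in for whatever Stage 1 backend you already run:

```python
# Sketch of the two-stage pipeline: a cheap, recall-oriented Stage 1
# followed by an expensive, precision-oriented Stage 2. `ann_index` is a
# stand-in for your ANN/BM25 backend; only the shape of the calls matters.
from sentence_transformers import SentenceTransformer, CrossEncoder

bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")              # Stage 1 model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # Stage 2 model

def retrieve_and_rerank(query, ann_index, documents, k_candidates=200, k_final=10):
    # Stage 1: over-fetch a large candidate set. Optimize for recall,
    # not ranking quality -- misses here cannot be recovered later.
    q_vec = bi_encoder.encode(query)
    candidate_ids = ann_index.search(q_vec, top_k=k_candidates)   # hypothetical API

    # Stage 2: score only the shortlist with the cross-encoder and keep
    # the handful of documents that actually fit the context window.
    pairs = [(query, documents[i]) for i in candidate_ids]
    scores = reranker.predict(pairs)
    reranked = sorted(zip(candidate_ids, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in reranked[:k_final]]
```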
What this means in practice: stop measuring Stage 1 with NDCG. Measure it with recall@k. You want "did any relevant document appear in the top 200?" and you're optimizing for "yes." NDCG penalizes ranking failures within the candidate set, which is irrelevant when you're handing off to a reranker.
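A minimal version of that check, assuming an evaluation set where each query carries the ids of its labeled relevant documents:

```python
# Minimal recall@k over a labeled evaluation set: the only question asked
# of Stage 1 is "did at least one relevant document make the candidate list?"
def recall_at_k(retrieved_ids, relevant_ids, k=200):
    """1.0 if any relevant document appears in the top-k candidates, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]))

def stage1_recall(eval_queries, retrieve_fn, k=200):
    """Average recall@k across queries; `retrieve_fn` returns a ranked id list."""
    hits = [recall_at_k(retrieve_fn(q["query"]), set(q["relevant_ids"]), k)
            for q in eval_queries]
    return sum(hits) / len(hits)
```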
How Cross-Encoders, ColBERT, and Sparse Models Each Fit
Not all rerankers behave the same way. The choice between them is a tradeoff between latency, throughput, and accuracy.
Cross-encoders (e.g., BGE Reranker, SBERT cross-encoders, Cohere Rerank) encode query-document pairs jointly. They achieve state-of-the-art accuracy — top cross-encoders hit MRR@10 above 0.40 on MS MARCO — but carry high per-query cost. They cannot precompute document representations, so every request requires scoring all candidates from scratch. They're the right fit for applications where quality matters more than throughput and candidate sets are bounded to ~50–100 documents.
ColBERT (Contextualized Late Interaction) takes a different approach: it precomputes token-level embeddings for documents offline, then scores at query time using MaxSim operations over token interactions rather than a full forward pass. This yields p50 latency around 23ms at 40 QPS — roughly 10× faster than cross-encoders at comparable quality. It's the right call when you need near-cross-encoder quality at higher throughput.
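A rough sketch of the MaxSim scoring step, assuming L2-normalized token embeddings have already been computed (document matrices offline, the query matrix at request time):

```python
# Sketch of ColBERT's late-interaction scoring. Document token embeddings are
# precomputed offline; at query time the score is a MaxSim sum: for each query
# token, take its best-matching document token and add those maxima up.
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """query_tokens: (Q, dim), doc_tokens: (D, dim); both L2-normalized."""
    sim = query_tokens @ doc_tokens.T     # (Q, D) token-level similarities
    return float(sim.max(axis=1).sum())   # best doc token per query token, summed

# At query time: encode the query once, then score precomputed doc matrices.
# scores = [maxsim_score(q_tok, d_tok) for d_tok in precomputed_doc_embeddings]
```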
Learned sparse models like SPLADE function differently from both. Rather than dense vectors, they produce sparse representations over the vocabulary space, identifying which terms matter and at what weight. The result: 71% less index memory than dense models, CPU-compatible inverted indexes at query time, and built-in term expansion that catches variants the original query missed. SPLADE doesn't replace a reranker; it changes the Stage 1 architecture, producing candidate sets that are semantically richer than BM25 but cheaper to operate than dense ANN at scale.
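A toy illustration of how those sparse representations are scored; the term weights below are made up for the example and would come from a real SPLADE model in practice:

```python
# Sketch of learned sparse scoring. A SPLADE-style model maps each text to a
# sparse {term: weight} map over the vocabulary; relevance is the dot product
# over shared terms, which an inverted index can serve on CPU.
def sparse_dot(query_terms: dict[str, float], doc_terms: dict[str, float]) -> float:
    shared = query_terms.keys() & doc_terms.keys()
    return sum(query_terms[t] * doc_terms[t] for t in shared)

# Illustrative weights only -- a real model produces these, and its term
# expansion adds related terms (e.g. "fee") the raw query never contained.
query = {"penalty": 1.8, "late": 1.2, "payment": 1.4, "fee": 0.6}
doc   = {"late": 0.9, "payment": 1.1, "fee": 1.3, "interest": 0.7}
print(sparse_dot(query, doc))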
Hybrid first stage + cross-encoder reranker is where the performance ceiling currently sits. Combining BM25 with dense ANN retrieval (using Reciprocal Rank Fusion to merge their ranked lists), then applying a cross-encoder to the merged top-50, produces a 12% NDCG improvement over either approach alone on BEIR benchmarks, and a 24–48% improvement on TREC evaluations compared to dense-only retrieval.
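A minimal RRF sketch for merging the two Stage 1 ranked lists before the reranker sees them; the `rerank_with_cross_encoder` call in the usage comment is hypothetical, standing in for Stage 2:

```python
# Sketch of Reciprocal Rank Fusion for the hybrid first stage: merge the BM25
# and dense ANN ranked lists before handing the fused top-k to the cross-encoder.
# k=60 is the smoothing constant used in the original RRF paper.
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([bm25_ids, dense_ids])[:50]
# final = rerank_with_cross_encoder(query, fused)   # hypothetical Stage 2 call
```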
What Happens When You Only Optimize the Index
The failure mode from over-investing in vector index tuning is subtle and dangerous: infrastructure metrics look healthy while output quality degrades.
