
The Reranker Gap: Why Most RAG Pipelines Skip the Most Important Layer

· 8 min read
Tian Pan
Software Engineer

Most RAG pipelines have an invisible accuracy ceiling, and the engineers who built them don't know it's there. You tune your chunking strategy, upgrade your embedding model, swap vector databases — and the system still returns plausible but subtly wrong documents for a stubborn class of queries. The retrieval looks reasonable. The LLM sounds confident. But downstream accuracy has quietly plateaued at a level that no amount of prompt engineering will break through.

The gap almost always traces to the same missing piece: a reranker. Specifically, the absence of a cross-encoder in a second retrieval stage. It's the layer that's technically optional, practically expensive to skip, and systematically omitted from the canonical "embed, index, query" tutorials that most RAG pipelines are built from.

Why Bi-Encoders Are the Wrong Tool for the Job (Alone)

A bi-encoder — the embedding model at the core of every vector search — does one thing well: it projects text into a high-dimensional space where semantically similar content lands nearby. This is useful at scale. You encode your entire corpus once at index time, then at query time you encode the query and run an approximate nearest-neighbor lookup. The per-query cost is a single encoding plus a sublinear ANN lookup — effectively independent of corpus size in practice. It's why vector search can scan millions of documents in under 100ms.
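The index-once, query-cheaply mechanics fit in a few lines. This is a toy sketch — `toy_embed` is a bag-of-words stand-in for a real embedding model, not part of any actual pipeline — but the shape of the computation is the same:

```python
import numpy as np

def toy_embed(text: str, vocab: dict) -> np.ndarray:
    """Stand-in for a real embedding model: bag-of-words, L2-normalized."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

corpus = [
    "the benefits of encryption for data at rest",
    "the risks of not using encryption",
    "grilling vegetables over a charcoal fire",
]
vocab = {tok: i for i, tok in
         enumerate(sorted({t for d in corpus for t in d.lower().split()}))}

# Index time: encode the whole corpus once.
doc_matrix = np.stack([toy_embed(d, vocab) for d in corpus])

# Query time: one encode plus one similarity lookup.
query_vec = toy_embed("encryption for my data", vocab)
scores = doc_matrix @ query_vec      # cosine similarity: vectors are unit-norm
ranked = np.argsort(-scores)
print([corpus[i] for i in ranked])
```

Note that the query vector is computed with no knowledge of any document, and each document vector with no knowledge of any query — which is exactly the constraint discussed next.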

But this speed comes from an architectural constraint that fundamentally limits precision: the query and document never see each other during encoding. A bi-encoder encodes the query and each document in complete isolation. The similarity score is just cosine distance between two independently generated vectors.

This creates a specific failure mode. The model must compress every possible meaning of a document into a single fixed-size vector before it knows what the query will be. Domain-specific terminology, negations, quantifiers, and subtle phrasing differences — the things that often determine whether a document is actually relevant — get blurred or lost during that compression. A document about "the risks of not using encryption" and one about "the benefits of encryption" can land very close to each other in vector space. For a bi-encoder, they look nearly identical.

The result: vector search retrieves plausible documents. Not necessarily correct ones.

What a Reranker Actually Does

A cross-encoder reranker takes a fundamentally different approach. Instead of encoding the query and document independently, it concatenates them and runs both through a transformer together. The model sees the full query text alongside the full document text in a single forward pass and outputs a relevance score for that pair.

This joint encoding enables interaction signals that bi-encoders structurally can't capture. The attention mechanism can correlate specific words in the query against specific words in the document. Negation in the query actually affects the score. Rare domain terminology gets weighted against its appearance in context. The reranker is doing something much closer to reading comprehension than pattern matching.
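The difference in what the model can see is easy to illustrate. The scorer below is purely a toy — a real cross-encoder is a transformer over the concatenated pair, not a hand-written rule — but it mimics the interface: because it receives query and document together, it can react to a negation that independently generated embeddings would blur away:

```python
# Toy illustration of the cross-encoder *interface*: the scorer sees the
# (query, document) pair jointly, so polarity mismatches can affect the score.
NEGATORS = {"not", "no", "without", "never"}

def toy_cross_score(query: str, doc: str) -> float:
    q_toks, d_toks = set(query.lower().split()), set(doc.lower().split())
    overlap = len(q_toks & d_toks)
    # Joint view of the pair: penalize when one side negates and the other doesn't.
    polarity_flip = (q_toks & NEGATORS) != (d_toks & NEGATORS)
    return overlap - (2.0 if polarity_flip else 0.0)

query = "benefits of using encryption"
docs = [
    "the benefits of using encryption",
    "the risks of not using encryption",
]
scores = [toy_cross_score(query, d) for d in docs]
print(scores)  # → [4.0, 1.0]
```

A bi-encoder has no equivalent move: by the time similarity is computed, both texts have already been collapsed into vectors that can't condition on each other.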

The practical consequence is accuracy that consistently beats bi-encoders on hard queries. Benchmarks across production deployments show reranking improves retrieval precision by 15–48% over embedding-only approaches, with consistent NDCG@10 gains across diverse domains. In one representative evaluation, top-K accuracy without a reranker plateaued at 0.83. With a cross-encoder reranker over the same initial retrieval set, it reached 0.93.

For RAG specifically, this gap compounds. When your retriever sends slightly wrong context to the LLM, the LLM hallucinates to fill the gap — confidently, fluently, and incorrectly. Several production evaluations have demonstrated 28–40% reductions in hallucination rate when a reranker is inserted into the pipeline, without changing the LLM, the prompt, or the chunking strategy.

The Computational Reason Everyone Skips It

Cross-encoders can't precompute anything. Because the query isn't known until request time, there's nothing to encode ahead of time. Every query requires a full transformer forward pass for each candidate document. If you retrieve 50 candidates and then rerank them, you're running 50 inference passes — batchable on a GPU, but none of them cacheable or amortizable the way index-time embeddings are.

On CPU, reranking 30 candidates takes 100–150ms. On GPU it's 30–50ms. Push past 100 candidates and you're over 300ms just for the reranking step. For an LLM-based reranker (using a large generative model rather than a specialized cross-encoder), latency jumps to 1–6 seconds per query.

This cost is real, but it's frequently overestimated as a reason to skip reranking entirely. The key insight is that you don't rerank your full corpus — you rerank a small candidate set. The two-stage pattern is:

  1. Stage 1: Fast bi-encoder or BM25 retrieval with high recall. Retrieve 50–100 candidates in milliseconds. Optimize for not missing relevant documents, accepting that you'll include some irrelevant ones.
  2. Stage 2: Cross-encoder reranker over those candidates. Reorder by true relevance. Pass the top 5–10 to the LLM.
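The two stages above can be sketched as follows. The scorers here are deliberate stand-ins — in production, `stage1_score` would be a bi-encoder or BM25 lookup and `stage2_score` a cross-encoder model — but the control flow is the pattern itself:

```python
def stage1_score(query: str, doc: str) -> float:
    # Cheap, recall-oriented score (bi-encoder/BM25 stand-in): raw token overlap.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def stage2_score(query: str, doc: str) -> float:
    # Expensive, precision-oriented score on the (query, doc) pair
    # (cross-encoder stand-in): Jaccard similarity as a toy "relevance".
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q | d), 1)

def retrieve(query, corpus, k_candidates=50, k_final=5):
    # Stage 1: high-recall shortlist over the whole corpus.
    candidates = sorted(corpus, key=lambda d: stage1_score(query, d),
                        reverse=True)[:k_candidates]
    # Stage 2: rerank only the shortlist; pass the head to the LLM.
    reranked = sorted(candidates, key=lambda d: stage2_score(query, d),
                      reverse=True)
    return reranked[:k_final]

corpus = ["how to rotate encryption keys",
          "encryption key rotation policy and schedule",
          "team offsite agenda"]
print(retrieve("key rotation policy", corpus, k_candidates=2, k_final=1))
```

Retrieving wide and reranking narrow is the whole trick: stage 1 only has to get the right document somewhere into the top 50–100, and stage 2 only has to order 50–100 pairs rather than the whole corpus.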

Stage 2 adds 100–200ms for the cross-encoder case. That's the real cost of not skipping it.

When Skipping Reranking Is Defensible

There are legitimate scenarios where skipping the reranker makes sense, and conflating them with the general case is how teams end up under-investing in retrieval quality.

Sub-100ms latency requirements at all costs. If your application sits in a latency-critical path — real-time autocomplete, synchronous API with strict SLAs — and you have no budget for an additional 100ms, a well-tuned bi-encoder plus a large initial retrieval window is a reasonable tradeoff. Retrieve 20 candidates, take the top 5, accept the accuracy penalty.

Queries that are inherently simple. Document lookup by category, FAQ matching against a small fixed corpus, or retrieval where the query vocabulary closely mirrors the document vocabulary — these cases don't stress the limitations of bi-encoders. The semantic compression artifacts that hurt cross-domain or nuanced queries don't appear. If your internal evaluation shows bi-encoder recall is already above 0.95 for your actual query distribution, you may not need the extra layer.

Index quality is the actual bottleneck. If your documents are poorly chunked, inconsistently formatted, or contain large amounts of boilerplate noise, a reranker will reorder bad candidates more accurately and still return bad context to the LLM. Reranking doesn't fix index quality — it assumes the first-stage retriever already has the answer somewhere in its candidate set. If you're seeing systematically missing relevant documents (not just wrong ordering), better chunking or a domain-specific embedding model will buy you more than a reranker will.

The diagnostic distinction matters: wrong ordering is a reranking problem; missing relevant documents is a recall or index quality problem.

Choosing a Reranker

The reranker landscape has matured significantly. Three categories cover most production use cases:

Specialized cross-encoders (e.g., Cohere Rerank, BGE-reranker, Jina Reranker) are purpose-built for relevance scoring. They're fast relative to LLM-based approaches — 30–100ms per batch of 50 candidates on GPU — and consistently outperform embedding models on retrieval benchmarks. For most teams, this is the starting point.

Newer learned rerankers (e.g., zerank-1, RankLLM) incorporate instruction-following and domain adaptation. The ZeroEntropy zerank-1 delivers +28% NDCG@10 over baseline retrievers in benchmarks and shows measurable correlation with lower downstream hallucination rates. These are worth evaluating if you operate in a specialized domain where generic cross-encoders underperform.

LLM-based rerankers (using a large generative model to score pairs) achieve the highest accuracy on complex queries but add 1–6 seconds of latency and nontrivial cost. At scale, the unit economics are usually prohibitive. They're most defensible in async pipelines — document pre-processing, research workflows, or batch enrichment — where latency doesn't compound.

The reranker benchmark leaderboard on MTEB is the most reliable public reference for comparing options on your query type. Don't rely on vendor-reported numbers; benchmark with your actual queries and your own candidate documents.

Making the Decision

If you're building or auditing a RAG pipeline, the reranker question is straightforward to resolve:

First, measure retrieval recall separately from end-to-end accuracy. Sample 100 representative queries from your production distribution, retrieve your normal top-K, and manually verify how often the truly relevant document appears in that set. If recall is already failing — the right document isn't in your candidate set — fix retrieval before adding a reranker.
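That recall measurement is a few lines to compute, assuming you've stored, per query, the retrieved document IDs and the known-relevant IDs (names here are illustrative):

```python
def recall_at_k(retrieved_per_query, relevant_per_query, k=10):
    """Fraction of queries where at least one relevant doc appears in the top k."""
    hits = sum(
        any(doc in relevant for doc in retrieved[:k])
        for retrieved, relevant in zip(retrieved_per_query, relevant_per_query)
    )
    return hits / len(retrieved_per_query)

# Two queries: the first finds its relevant doc in the top 2, the second doesn't.
print(recall_at_k([["d1", "d2", "d3"], ["d7", "d8"]],
                  [{"d2"}, {"d9"}], k=2))  # → 0.5
```

Run this at your production top-K before touching anything else — it tells you which problem you actually have.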

Second, if the right document is in the candidate set but often ranked below position 3 or 5, you have a reranking problem. Add a cross-encoder over your existing retrieval candidates and benchmark the accuracy change. The implementation is typically under a day of engineering work. The accuracy improvement is usually immediate and measurable.
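To separate "missing" from "misordered" in that benchmark, it helps to record where the relevant document actually lands per query — a small helper along these lines (names are illustrative):

```python
def first_relevant_rank(retrieved, relevant):
    """1-based rank of the first relevant doc in the retrieved list, or None if absent."""
    return next((i + 1 for i, doc in enumerate(retrieved) if doc in relevant), None)

ranks = [first_relevant_rank(r, rel) for r, rel in [
    (["d3", "d1", "d2"], {"d2"}),  # present but ranked 3rd → reranking problem
    (["d5", "d6"], {"d9"}),        # absent entirely → recall problem
]]
print(ranks)  # → [3, None]
```

A pile of `None`s means fix retrieval; a pile of ranks stuck at 4–20 means a reranker will pay for itself.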

Third, instrument latency before and after. Most teams find that 100–150ms of reranking latency is invisible to users in a conversational or search interface where LLM generation already takes 1–3 seconds. If it's not, optimize candidate set size — reranking 20 candidates instead of 50 often preserves 80% of the accuracy gain at half the latency cost.
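Instrumenting the before/after can be as simple as wrapping each stage (a sketch; `rerank` in the comment is a placeholder for whatever your stage-2 call is):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - t0) * 1000.0

# e.g. reranked, rerank_ms = timed(rerank, query, candidates)
```

Logging the per-stage milliseconds alongside the final answer makes the "is 150ms visible next to 2s of generation?" question empirical rather than rhetorical.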

The engineers who skip reranking because it "adds complexity" are accepting a permanent accuracy ceiling in exchange for a simpler architecture that's easier to explain. The engineers who skip it because they've measured their specific recall and latency constraints and found it unnecessary have made a defensible tradeoff. The gap between those two groups — in production accuracy and in understanding what's actually limiting their system — is exactly what the name implies.
