Cross-Encoder Reranking in Practice: What Cosine Similarity Misses
Your RAG pipeline retrieves the top 10 documents and your LLM still gives a wrong answer. You increase the retrieval count to 50. Still wrong. The frustrating part: the correct document was in your vector store the whole time—it was just ranked 23rd. This is not a recall problem. It's a ranking problem, and cosine similarity is the culprit.
Vector search does a decent job of finding semantically adjacent content. But "semantically adjacent" and "most useful for this specific query" are not the same thing. Cosine similarity measures the angle between two vectors in embedding space, and that angle only captures a coarse notion of topical proximity. What it cannot capture is the fine-grained interaction between the specific words in your query and the specific words in a document—the difference between "how to prevent buffer overflows" and "buffer overflow exploit techniques" is subtle at the vector level but critical for your retrieval system.
Reranking is the architectural fix. This post covers how cross-encoder reranking works, when to combine it with BM25 hybrid search and MMR diversity, and the latency math that tells you whether adding it to your pipeline is actually worth it.
What Cosine Similarity Actually Computes (and What It Doesn't)
To understand why reranking is necessary, it helps to be precise about what cosine similarity measures. A bi-encoder model—the kind that powers most embedding-based vector search—encodes the query and each document into separate fixed-dimension vectors, then measures the angle between them. The semantic content of a 512-dimensional vector is frozen at encoding time, with no knowledge of what query it will eventually face.
This creates a structural blind spot. The cosine of the angle between two vectors tells you they're "talking about similar things," but it doesn't tell you whether the document actually answers the query. A document that extensively covers buffer overflow theory might be geometrically close to a query about prevention—but a document with one specific paragraph about mitigation techniques might score lower despite being far more useful.
Two additional failure modes make this worse in practice:
Hubness in high dimensions. In high-dimensional embedding spaces, certain vectors become "hubs"—they appear as nearest neighbors to a disproportionately large fraction of query vectors, regardless of actual content relevance. These hubs pollute top-k results consistently.
Magnitude discarding. Cosine similarity normalizes all vectors to unit length, discarding information encoded in vector magnitude. Some embedding models encode confidence or informativeness in magnitude; cosine throws it away by design.
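The magnitude point is easy to demonstrate in plain Python. The vectors below are toy values, not real embeddings; the sketch just shows that scaling a vector changes its magnitude but not its cosine score:

```python
import math

def cosine(a, b):
    """Cosine similarity: measures angle only; magnitude is normalized away."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
doc = [0.6, 0.8]
doc_scaled = [6.0, 8.0]  # 10x the magnitude, same direction

# Both documents score identically: any information a model encoded
# in magnitude (e.g. confidence) is invisible to cosine similarity.
print(cosine(query, doc))         # 0.6
print(cosine(query, doc_scaled))  # 0.6
```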
The core issue is that bi-encoders optimize for efficient retrieval over millions of documents. That efficiency requires encoding documents independently of queries. Cross-encoders give that up and get much better ranking in return.
How Cross-Encoders Actually Work
A cross-encoder takes the query and a candidate document as a joint input, feeding them together through a transformer's attention layers. Every token in the query can attend to every token in the document. The model outputs a single relevance score.
This is fundamentally different from computing similarity between two pre-computed vectors. The cross-encoder can detect that "overflow prevention" in the query specifically aligns with "stack canary implementation" in the document, even if the embedding model would have mapped both to nearly the same vector region as "memory safety in C++."
The tradeoff is compute. You cannot encode documents in advance and cache them—each query-document pair requires a fresh forward pass. On CPU with a standard MiniLM-size model (~22M parameters), scoring 50 document pairs takes 100–150ms. That's fine when you're reranking a small candidate set, but it's why cross-encoders exist in the second stage of a two-stage pipeline—not the first.
The quality improvement is substantial. Cross-encoders consistently outperform bi-encoder retrieval by 5–10 nDCG points on standard benchmarks. On harder benchmarks, independent evaluations show accuracy improvements in the 33–40% range for only ~120ms of added latency. The gains show up across query types, and cross-encoders are particularly valuable on multi-hop reasoning queries, where the relevance signal depends on subtle interactions between query intent and document content.
Building the Two-Stage Pipeline
The standard production setup combines three components: BM25 sparse retrieval, dense vector search, and cross-encoder reranking.
Stage 1: Hybrid retrieval with Reciprocal Rank Fusion
Running both BM25 and vector search in parallel gives you complementary signals. BM25 excels at exact keyword matching—if your user searches for a specific function name or error code, BM25 finds it reliably. Vector search handles paraphrase and semantic variation. Neither alone is sufficient.
The standard way to merge these result sets is Reciprocal Rank Fusion (RRF). Rather than trying to normalize and combine incompatible score scales, RRF converts each result list into rank positions and combines them:
RRF_score(doc) = 1/(rank_BM25 + 60) + 1/(rank_vector + 60)
The constant 60 was determined empirically to produce stable rankings; it prevents top-ranked documents in one list from overwhelming the signal from the other. Documents that appear highly ranked in both systems bubble up reliably. Documents that only appear in one list get partial credit. The result is a merged list of ~50–100 candidates ready for the second stage.
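RRF itself is only a few lines. A sketch of the merge using the same constant of 60 as the formula above; the doc IDs are placeholders:

```python
def rrf_merge(bm25_ranking, vector_ranking, k=60):
    """Reciprocal Rank Fusion over two ranked lists of doc IDs.

    Each list contributes 1 / (rank + k) per document, where rank is the
    1-based position and k=60 is the standard damping constant.
    """
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" ranks well in both lists; "d1", "d3", "d4" appear in only one
# or rank lower, so they get partial credit.
merged = rrf_merge(["d1", "d2", "d4"], ["d2", "d3", "d1"])
print(merged)  # ['d2', 'd1', 'd3', 'd4']
```

Note that ties in quality between the two retrievers never need score normalization: only rank positions enter the formula, which is the whole point of RRF.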
Stage 2: Cross-encoder reranking
Pass your top 50–100 RRF candidates through a cross-encoder. The cross-encoder scores each query-document pair and returns a ranked list. The document your LLM actually reads should now be in position 1 or 2 much more often.
For most teams, the go-to open-source model is cross-encoder/ms-marco-MiniLM-L-6-v2. It's fast (roughly 100–150ms for a 50-pair batch on CPU, per the numbers above), stable, and delivers ~35% accuracy improvement over pure vector retrieval with minimal infrastructure. If you need higher quality and have the latency budget, Jina Reranker v3 achieves 61.94 nDCG@10 on the BEIR benchmark (best in class among open-source models) at under 200ms for typical candidate sets. Cohere Rerank API leads on quality (ELO rating of 1627 on public benchmarks) but adds API cost and latency.
Adding MMR When Diversity Matters
Reranking gets you the most relevant document at position 1. But sometimes you want the top 5 positions to be different from each other—covering different aspects of a multi-part question, or reducing the risk that your LLM context window contains five documents that all say the same thing.
Maximal Marginal Relevance (MMR) is the standard fix. After reranking, apply MMR as a post-processing step:
MMR_score(doc) = (1 − λ) × relevance_score − λ × max_similarity_to_already_selected
The λ parameter controls the diversity-relevance tradeoff. At λ = 0, you get the pure relevance ranking from the cross-encoder. At λ = 1, you maximize diversity. (Be aware that the original MMR paper and some libraries flip this convention, with λ = 1 meaning pure relevance; check your implementation's docs.) Values between 0.3 and 0.7 give useful intermediate behavior. The algorithm greedily selects documents one at a time, penalizing each remaining candidate by its similarity to what's already been selected.
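A greedy MMR sketch using the convention in the formula above (λ = 0 is pure relevance). The relevance scores and similarity function here are toy stand-ins for cross-encoder scores and embedding similarity:

```python
def mmr_select(candidates, relevance, similarity, lam=0.5, k=5):
    """Greedily pick k documents balancing relevance against redundancy.

    candidates: list of doc IDs; relevance: doc_id -> relevance score;
    similarity(a, b): pairwise similarity in [0, 1].
    """
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(doc):
            # Penalty is the max similarity to anything already chosen.
            redundancy = max((similarity(doc, s) for s in selected), default=0.0)
            return (1 - lam) * relevance[doc] - lam * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy setup: "a" and "b" are near-duplicates, "c" covers a different aspect.
rel = {"a": 0.9, "b": 0.85, "c": 0.5}
def sim(x, y):
    return 0.95 if {x, y} == {"a", "b"} else 0.1

print(mmr_select(["a", "b", "c"], rel, sim, lam=0.5, k=2))  # ['a', 'c']
print(mmr_select(["a", "b", "c"], rel, sim, lam=0.0, k=2))  # ['a', 'b']
```

At λ = 0.5 the near-duplicate "b" is skipped in favor of "c" despite its higher relevance score; at λ = 0 the selection collapses back to the plain relevance ranking.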
MMR adds negligible latency—it's simple arithmetic applied after scoring is complete. Apply it when your queries have multiple sub-components or when users consistently complain about redundant results. Skip it when queries are focused enough that the top-k documents naturally cover different aspects.
The Latency Budget Math
Here's a concrete calculation for a 200ms total budget:
- Hybrid retrieval (BM25 + vector search, parallel): 40–60ms
- Cross-encoder reranking, top 50 candidates on CPU: 100–150ms
- Context preparation for LLM: 10–20ms
- Total: ~150–230ms
That's at the edge of a 200ms budget on CPU. Two ways to stay within it:
- Reduce the reranking window. Top 30 candidates instead of 50 cuts reranking time roughly 40% with minimal quality loss—the marginal value of candidates 31–50 is low.
- Use GPU inference. A T4 GPU cuts cross-encoder latency to 30–50ms for 50 candidates, making 200ms comfortably achievable even with a higher-quality model.
The mistake teams make is computing the latency cost in isolation rather than asking what they're trading it for. Retrieval costs 40ms no matter what. The question is whether the additional 100ms buys you enough ranking improvement. For most use cases where retrieval quality has a measurable business impact—customer support accuracy, code search, legal document retrieval—it does.
The case where it doesn't: your queries are simple and your vector search already returns the right document at rank 1 most of the time. Adding reranking on top of already-good retrieval is overhead without payoff. Instrument first; add reranking only once you can demonstrate the ranking problem exists.
Production Failure Modes to Watch For
Domain mismatch. Off-the-shelf cross-encoders are trained on MS MARCO (web search queries and passages). If your corpus is medical documentation, legal filings, or internal engineering specs, the relevance judgments embedded in the model may not transfer. Symptom: the reranker confidently assigns high scores to unhelpful documents. Fix: fine-tune on domain-specific query-document pairs with human relevance labels.
Silent regression on model updates. A reranker model version update—even a minor one—can shift scoring distributions enough to degrade downstream LLM output without any obvious error signal. Pin exact model versions in production and maintain a regression suite of 20–50 query-document pairs with known expected rankings. Run this suite before any model update goes live.
Cascade failure without fallback. If your cross-encoder is an external API call and it times out, your entire retrieval pipeline fails unless you have a fallback. Implement circuit-breaker logic: on cross-encoder failure, fall back to the BM25+vector RRF ranking and serve that. It's worse than reranked results, but it's not a 500 error.
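A minimal version of that fallback, with the reranker as a stand-in callable (hypothetical; substitute your actual client). The key property is that the stage-1 RRF ranking is already in hand, so degrading costs nothing extra:

```python
def rank_with_fallback(query, candidates, rerank, rrf_order, timeout_s=0.15):
    """Serve reranked results when possible; degrade to RRF order on failure.

    `rerank` is any callable that may raise (e.g. a timeout from a hosted
    reranker API); `rrf_order` is the stage-1 hybrid ranking we already have.
    """
    try:
        return rerank(query, candidates, timeout=timeout_s)
    except Exception:
        # Degraded but correct: stage-1 ranking instead of a 500 error.
        return rrf_order

# Simulate an unavailable reranker.
def flaky_rerank(query, candidates, timeout):
    raise TimeoutError("reranker unavailable")

result = rank_with_fallback("q", ["d1", "d2"], flaky_rerank, ["d2", "d1"])
print(result)  # ['d2', 'd1'] -- the RRF order, served without an error
```

In production you would also count these fallbacks in your metrics; a rising fallback rate is the early-warning signal for the reranker dependency.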
Reranking can't fix upstream retrieval gaps. Reranking improves ranking precision on candidates that were retrieved. If the correct document is not in your top 100 candidate set at all, reranking does nothing. Before investing in reranking infrastructure, verify that recall@100 is high for your query distribution. If it isn't, your problem is retrieval, not ranking—fix your chunking strategy and embedding model first.
Decision Framework
Add reranking to your pipeline when: your retrieval recall@100 is acceptable but recall@5 is poor (documents are being retrieved but ranked low), your queries are complex enough to exhibit ranking failures, and your latency budget allows 100–200ms of overhead.
Skip it when: your precision problem is actually a recall problem (right documents not retrieved at all), your queries are simple with obvious keyword signals, or your scale makes per-query inference cost prohibitive.
The default starting point: deploy ms-marco-MiniLM-L-6-v2, rerank top 50 candidates, measure the delta in your recall@1 and recall@3 metrics against a holdout set of known-good queries. If you see a meaningful improvement, the 100ms overhead is almost certainly worth paying. If you don't see improvement, you've diagnosed that your problem is elsewhere—which is equally valuable information.
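Measuring that delta needs nothing fancy. A sketch of recall@k over a holdout set; the query and document IDs are hypothetical:

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of queries whose known-good doc appears in the top k.

    ranked: {query: ordered list of doc IDs}; relevant: {query: gold doc ID}.
    """
    hits = sum(1 for q, docs in ranked.items() if relevant[q] in docs[:k])
    return hits / len(ranked)

# Hypothetical holdout: rankings before and after adding the reranker.
before = {"q1": ["d9", "d3", "d1"], "q2": ["d2", "d5"], "q3": ["d7", "d4"]}
after  = {"q1": ["d1", "d9", "d3"], "q2": ["d2", "d5"], "q3": ["d4", "d7"]}
gold = {"q1": "d1", "q2": "d2", "q3": "d4"}

print(recall_at_k(before, gold, 1))  # 1/3 of queries hit at rank 1
print(recall_at_k(after, gold, 1))   # 3/3 after reranking
```

Run the same comparison at k = 3; if both deltas are flat, the ranking problem you set out to fix likely isn't there.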
Vector search solves the hard part of retrieval: finding candidates across millions of documents in milliseconds. Reranking solves the part that matters most to users: making sure the right answer is what they actually see.
Sources

- https://milvus.io/ai-quick-reference/what-are-biencoders-and-crossencoders-and-when-should-i-use-each
- https://www.pinecone.io/learn/series/rag/rerankers/
- https://arxiv.org/html/2403.05440v1
- https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking
- https://www.assembled.com/blog/better-rag-results-with-reciprocal-rank-fusion-and-hybrid-search
- https://qdrant.tech/blog/mmr-diversity-aware-reranking/
- https://www.zeroentropy.dev/articles/the-latency-myth-why-reranking-is-still-the-smartest-optimization-you-can-make/
- https://developer.nvidia.com/blog/enhancing-rag-pipelines-with-re-ranking/
- https://agentset.ai/rerankers
- https://particula.tech/blog/reranking-rag-when-you-need-it
