The Reranker You Added That Slowed Recall More Than It Improved Precision

· 11 min read
Tian Pan
Software Engineer

The offline eval was unambiguous. After bolting a cross-encoder on top of the top-50 from vector search, nDCG@5 went up four points. The team shipped it on a Tuesday. By Thursday, p99 retrieval latency had crossed the SLO by 700 milliseconds, and customer success was forwarding screenshots of empty results pages that the old pipeline would have populated. The graph that mattered — user-perceived answer quality — was down. The reranker was a regression that the team had branded as an improvement, and the eval rubric was the thing that hid the regression in plain sight.

This is one of the most common failure modes in production retrieval, and it is rarely described as what it actually is: an evaluation bug. The reranker did what it was advertised to do. It reordered the top-50 with finer-grained precision. The problem is that the metric used to justify it — offline nDCG, computed at infinite budget, against the full reranked list — describes a world the production system does not live in. In production, the answer that ships is not the best-scored reranked list. It is whatever the system can return before the request deadline. And once you write the metric that way, the reranker's contribution is no longer a four-point lift. It is a curve.

Offline Eval Lives at Infinite Budget. Production Does Not.

Most reranker evals are computed in a notebook. The dataset is loaded, the candidate set is generated, the reranker scores every candidate, and nDCG@5 is computed against the final ranking. There is no clock. There is no timeout. There is no notion of "what would the system have returned if the reranker took 900ms and the request budget was 800."

The production answer path has all three of those properties. Vector search returns in 40ms. The cross-encoder, batching 50 candidates, takes another 180ms under steady state and 900ms when the batch window queues during burst load. The orchestrator has a deadline — usually a few hundred milliseconds — beyond which it must either truncate, fall back to the unreranked list, or fail open. Each of those branches has a precision number. None of them is the offline nDCG.

This gap has a name in the literature now: latency-conditional precision. The idea is to compute precision not as a single point but as a function of the time budget you are willing to spend. At 50ms the reranker contributes nothing because it has not produced output yet. At 200ms it might be returning a partial reordering. At 500ms it is returning the full reranked list. Each of those points has a different precision number, and the production system is sampling from a distribution of which point gets reached on which request. The offline metric is the right edge of that distribution, evaluated as if every request reached it. The production reality is the weighted average across the distribution, with the timeout-induced fallback path eating the left tail.
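One way to make that curve concrete is to replay logged requests against a range of budgets. The sketch below is a minimal illustration, not a standard routine: `RequestSample`, `ndcg_at_budget`, and `precision_curve` are hypothetical names, the sample numbers are invented, and it assumes every request that misses the budget ships the fallback list rather than a partial one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestSample:
    """One logged request (all fields are hypothetical log columns)."""
    rerank_ready_ms: float   # time until the full reranked list was available
    reranked_ndcg: float     # nDCG@5 of the reranked list for this query
    fallback_ndcg: float     # nDCG@5 of whatever ships when the deadline is missed


def ndcg_at_budget(samples, budget_ms):
    """Mean nDCG of what would actually ship under a given time budget."""
    total = 0.0
    for s in samples:
        # A request only gets credit for the reranker if it finished in time.
        if s.rerank_ready_ms <= budget_ms:
            total += s.reranked_ndcg
        else:
            total += s.fallback_ndcg
    return total / len(samples)


def precision_curve(samples, budgets_ms):
    """The metric as a curve: one (budget, shipped nDCG) point per budget."""
    return [(b, ndcg_at_budget(samples, b)) for b in budgets_ms]


if __name__ == "__main__":
    # Invented data: 90% of requests finish at 220ms, 10% at 940ms under burst.
    samples = [RequestSample(220.0, 0.74, 0.70)] * 9 + [RequestSample(940.0, 0.74, 0.70)]
    for budget, score in precision_curve(samples, [100, 300, 500, 1000]):
        print(f"budget={budget}ms  shipped nDCG@5={score:.3f}")
```

The offline number corresponds to the last point on this curve. The number that matters is the point at your actual deadline, computed with the fallback quality you actually measured.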

The team that ships on offline nDCG without modeling this distribution is making a specific implicit assumption: that the production system always reaches the right edge. In bursty traffic, against a model whose queueing delay grows superlinearly with load, that assumption is false in a way that gets worse as you succeed. The more users you have, the more often the cross-encoder is running under contention, and the more often the answer that ships came from the fallback path the eval never measured.

The Batch Window Is the Architecture

The single most under-discussed property of a cross-encoder reranker in production is its batching behavior. A cross-encoder runs query and candidate through a transformer together. To make this efficient on a GPU, you batch — 32, 64, or 128 query-candidate pairs per forward pass. The batch window is the time the inference server waits to assemble a batch before firing.

At low QPS, the batch window dominates the tail. The server waits for the window to close before processing, so a query that arrives just after a batch fires gets the worst-case wait. At medium QPS, batches fill before the window expires and latency stabilizes. At burst load, batches fill while the GPU is still processing the previous batch, and the queue depth grows. Now the marginal request waits not just for its own batch to close but for every batch ahead of it to finish first. The latency curve goes vertical.
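The knee is easy to reproduce in a toy model before you ever touch a GPU. The following is a rough single-server simulation under stated assumptions, not a model of any particular inference server: arrivals are Poisson, each request counts as one slot in a batch, a batch fires when it is full or when the window expires (whichever comes first, and never before the GPU is free), and every batch takes a fixed `service_ms`. All parameter values and the name `simulate_p99` are hypothetical.

```python
import random


def simulate_p99(qps, n=5000, window_ms=20.0, max_batch=8, service_ms=60.0, seed=0):
    """Return the p99 end-to-end latency (ms) of a toy batching server."""
    rng = random.Random(seed)

    # Poisson arrivals: exponential gaps with mean 1000/qps milliseconds.
    t = 0.0
    arrivals = []
    for _ in range(n):
        t += rng.expovariate(qps / 1000.0)
        arrivals.append(t)

    latencies = []
    gpu_free_at = 0.0
    i = 0
    while i < n:
        first = arrivals[i]
        # Latest moment this batch fires: window expiry, or later if the GPU is busy.
        fire = max(first + window_ms, gpu_free_at)
        j = i
        while j < n and j - i < max_batch and arrivals[j] <= fire:
            j += 1
        if j - i == max_batch:
            # A full batch does not wait for the window to expire.
            fire = max(arrivals[j - 1], gpu_free_at)
        done = fire + service_ms
        latencies.extend(done - a for a in arrivals[i:j])
        gpu_free_at = done
        i = j

    latencies.sort()
    return latencies[int(0.99 * len(latencies))]


if __name__ == "__main__":
    for qps in (5, 50, 200):
        print(f"qps={qps:>3}  p99={simulate_p99(qps):.0f}ms")
```

With these made-up parameters the server can sustain roughly 8 requests per 60ms, or about 133 requests per second. Push arrivals past that and the p99 stops settling: it grows with the length of the run, which is the vertical part of the curve.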

This is the failure mode behind "p99 latency exceeds SLO by 700ms." The mean is fine. The median is fine. The mode is fine. The thing that breaks is the upper tail under burst, which is exactly the cohort of requests where the user is least patient. An eval built on average-case offline data cannot tell you whether the reranker's latency curve has a knee at your peak QPS. You have to load-test it. Almost nobody does.

Falling Back Is Not Failing Gracefully

When the cross-encoder misses the deadline, the orchestrator needs a story. The three common stories are: return the unreranked top-K from vector search (the "raw recall" fallback), return whatever partial reranking has completed (the "torn list" fallback), or return an error (the "fail closed" path). Each one has a different production precision number, and most teams never measured any of them.
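Measuring those paths starts with the orchestrator saying which one it took. Here is a minimal sketch of the raw recall story using `asyncio.wait_for`, assuming `vector_search` and `rerank` are async callables you supply (both are placeholders, as are the function name and the path labels). The point is the second return value: every response is tagged with the path that produced it, so the fallback cohort can be evaluated separately.

```python
import asyncio


async def retrieve_with_deadline(query, vector_search, rerank, deadline_s, top_k=5):
    """Return (documents, path), where path records which branch served the request."""
    loop = asyncio.get_running_loop()
    start = loop.time()

    candidates = await vector_search(query)

    # The reranker only gets whatever is left of the request budget.
    remaining = deadline_s - (loop.time() - start)
    if remaining <= 0:
        return candidates[:top_k], "raw_recall"

    try:
        ranked = await asyncio.wait_for(rerank(query, candidates), timeout=remaining)
    except asyncio.TimeoutError:
        # Deadline missed: ship the unreranked list and say so.
        return candidates[:top_k], "raw_recall"
    return ranked[:top_k], "reranked"


if __name__ == "__main__":
    async def fake_search(query):
        return ["a", "b", "c"]

    async def slow_rerank(query, candidates):
        await asyncio.sleep(0.5)  # stands in for a reranker stuck behind a queue
        return list(reversed(candidates))

    docs, path = asyncio.run(
        retrieve_with_deadline("q", fake_search, slow_rerank, deadline_s=0.1)
    )
    print(path, docs)
```

Logging `path` alongside each request is what lets you compute a precision number per branch instead of one number for a pipeline that does not exist.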

The raw recall fallback is what most teams assume happens. It looks fine in expectation: vector search alone was the baseline before the reranker existed, so falling back should return the system to that baseline on the cohort of requests that timed out. The hidden problem is that the upstream retriever is usually no longer the one that produced that baseline. With the reranker in place, the upstream system was often re-tuned to retrieve more candidates, return looser matches, or relax its similarity threshold, on the assumption that the reranker would catch any precision misses. When the reranker times out, the raw recall fallback now serves a looser set of candidates than the pre-reranker baseline would have. Precision on the fallback path is worse than in the pre-reranker world.

The torn list fallback is worse. Partial reranking returns whatever has been scored so far, and candidates are scored in arbitrary order, so some of the documents that belong in the reranked top 10 may not have been processed yet. The ranking is no longer monotonic in relevance. The team that does this without warning the downstream LLM is feeding it a candidate set whose ordering signal is meaningless: the LLM weights position, and the position is now noise. This is the failure mode that produces the empty-results screenshots: the answer model, fed a torn list, can't synthesize anything coherent and bails to a no-results response.
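A tiny example shows how a torn list scrambles position. This sketch assumes one plausible merge policy (scored documents sorted by reranker score, unscored documents appended in vector order); the policy, the function name `torn_list`, and all scores are invented for illustration and may not match what any given orchestrator does.

```python
def torn_list(candidates, partial_scores):
    """Naive partial rerank: scored docs by score, then unscored docs in vector order."""
    scored = sorted(
        (d for d in candidates if d in partial_scores),
        key=lambda d: partial_scores[d],
        reverse=True,
    )
    unscored = [d for d in candidates if d not in partial_scores]
    return scored + unscored


# Candidates in vector-search order, with the scores a full rerank would assign.
candidates = ["d1", "d2", "d3", "d4", "d5", "d6"]
full_scores = {"d1": 0.20, "d2": 0.90, "d3": 0.10, "d4": 0.40, "d5": 0.95, "d6": 0.30}

# The deadline hits after only the first four candidates were scored.
partial_scores = {d: full_scores[d] for d in candidates[:4]}

full = sorted(candidates, key=lambda d: full_scores[d], reverse=True)
torn = torn_list(candidates, partial_scores)

print("full rerank:", full)
print("torn list:  ", torn)
print("best doc position in torn list:", torn.index(full[0]) + 1)
```

In this toy case the document the reranker would have put first sits below three documents it would have ranked lower, and nothing in the list tells the downstream model where the scored portion ends.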
