
The Reranker Is the Silent Second Model Your RAG Eval Never Measures

10 min read
Tian Pan
Software Engineer

A typical RAG pipeline ships with two models, not one. The retriever pulls 50 to 100 candidates from the vector store, and a reranker — a cross-encoder, an LLM-as-judge prompt, or a hybrid — re-scores those candidates and hands the top 5 to the answer model. Your eval suite measures end-to-end answer quality. It measures retriever recall@k. It does not measure the reranker. So when the reranker quietly drifts, the dashboard renders "answer quality dropped 4 points" with no causal arrow, and the team spends three days debugging a prompt that is not the problem.

The reranker is the silent second model. It sits between the retriever and the generator, it has its own scoring distribution, its own prompt (if it's LLM-based) or its own weights (if it's a cross-encoder), and it can regress independently of every other component. Most teams never grade it in isolation. The eval suite they wrote treats the pipeline like one model with a long context window, when it's actually two models in series with an interface neither team owns.

This is the failure mode worth designing against. RAG quality regressions are not always retriever regressions or generator regressions; sometimes they are reranker regressions wearing the costume of a generator regression, because the only metric exposed to the dashboard is the generator's output. The fix is not more end-to-end eval. It is a separate eval suite for the reranker, with separate metrics, separate ownership, and separate CI gates.

Why the reranker hides

The retriever's failure modes are visible because retriever evaluation is a solved problem. Recall@k, precision@k, and the question "did the gold chunk make it into the candidate set" are first-class metrics on every RAG dashboard built in the last two years. The generator's failure modes are visible because answer-quality eval — LLM-as-judge, human grading, faithfulness scores — is the metric the product team cares about and the metric the executive dashboard tracks.

The reranker sits between them. Its output is an ordering, not an answer and not a recall number. If the gold chunk was in the retriever's top-50 and the generator's final answer is correct, the reranker did its job — even if the score it assigned the gold chunk dropped from 0.92 to 0.61 between releases. The dashboard never sees the score. The team only sees the symptom when the reranker drifts far enough that the gold chunk falls out of the top-5 the generator actually reads.

That symptom looks like a generator regression. The team will spend the first day adjusting the answer-generation prompt. The second day re-tuning the temperature. The third day combing through the retriever to see if recall regressed. None of those are the problem. The problem is that the reranker's score distribution shifted, the gold chunk's relative rank dropped from #2 to #7, and the top-5 the generator now reads is a different five than the top-5 it read last week.

The three drift sources

Reranker drift has three distinct causes, and the eval suite needs to detect each of them.

Model upgrades to the cross-encoder. Teams using a hosted reranker — Cohere Rerank, Voyage, Jina, ZeroEntropy — are exposed to the vendor's training schedule. A point release on the cross-encoder shifts the score distribution; the model that scored a borderline chunk at 0.65 now scores it at 0.48, which sounds like noise but is the difference between top-3 and top-8 on a typical candidate set. The vendor's release notes say "5% improvement on BEIR." Your traffic is not BEIR.

Prompt edits to the LLM-as-reranker. Teams using an LLM as a reranker — pointwise scoring, listwise reranking, or a "rate this passage's relevance from 1 to 10" pattern — own the reranker prompt themselves. That prompt is checked into the same repo as the answer-generation prompt and reviewed by the same engineers, who will edit it for the same reasons (clarity, formatting, length) without realizing the edit changes the score distribution on the long tail. Pointwise LLM rerankers in particular are known to exhibit score drift across prompts and candidate sets — the same passage, scored under two prompt variants, can land at 0.7 and 0.4 respectively. The dashboards never show that the prompt changed; they show that the answer quality dropped.
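
For concreteness, here is a minimal sketch of a pointwise LLM reranker. The `llm_complete` callable and the prompt text are hypothetical stand-ins, not any particular vendor's API; the point is that every edit to that prompt string, however cosmetic, is an edit to the score distribution this function produces.

```python
# Hypothetical pointwise LLM reranker. `llm_complete` is a stand-in for whatever
# completion client the pipeline uses, not a real vendor API.
RERANK_PROMPT = """Rate the relevance of the passage to the query on a scale of 0 to 10.
Query: {query}
Passage: {passage}
Answer with a single integer."""

def score_passage(query: str, passage: str, llm_complete) -> float:
    raw = llm_complete(RERANK_PROMPT.format(query=query, passage=passage))
    try:
        # Normalize the 0-10 integer to 0-1 so scores can be logged and compared
        # across prompt variants.
        return int(raw.strip()) / 10.0
    except ValueError:
        return 0.0  # unparseable output is treated as irrelevant

def rerank(query: str, candidates: list[str], llm_complete, top_k: int = 5) -> list[str]:
    scored = [(score_passage(query, c, llm_complete), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```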

Upstream retriever changes. A change to the retriever — a new embedding model, a chunking-strategy edit, a vector-index parameter sweep — alters the distribution of candidates the reranker sees. The reranker's job got harder or easier depending on which way the retriever shifted. A reranker tuned against a retriever that returned dense, semantically similar candidates may underperform when the retriever is upgraded to return more diverse candidates (or vice versa). The reranker did not change. The world it lives in changed.

In all three cases, the dashboard that tracks end-to-end answer quality is the wrong instrument. It is too far downstream to localize the regression, and the noise floor of LLM-as-judge eval is large enough that small reranker drifts get smoothed away in the aggregate score.

What a reranker eval looks like

The reranker needs its own eval set, distinct from the retriever's and distinct from the generator's. The eval set is a set of queries, each paired with a candidate list and a golden ordinal ranking — not a binary "relevant or not," but a "for this query, the correct order of these 20 candidates is X, Y, Z, ...". The golden ranking is expensive to produce; it requires human judges to put passages in order, not just label them. But it is the only artifact that lets you grade the reranker on the axis the reranker actually optimizes.
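
One way to represent such a case, sketched as a Python dataclass with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class RerankerEvalCase:
    """One reranker eval case: a query, a fixed candidate list, and the
    human-judged golden ordering of those candidates, best first."""
    query_id: str
    query: str
    candidates: list[str]       # passage IDs, in the order the retriever returned them
    golden_ranking: list[str]   # the same passage IDs, in the human-judged order
    slice_tags: list[str] = field(default_factory=list)  # e.g. ["technical-docs", "long-query"]
```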

Standard metrics apply: nDCG@10 measures ranking quality across the full top-10, MRR measures how well the first relevant document is positioned. The community has converged on nDCG@10 as the gold standard for reranker evaluation in production, because it captures the long-tail ranking quality that MRR alone misses. A reranker that always puts the right answer at #1 but ranks #2 through #10 randomly will have a perfect MRR and a mediocre nDCG — and will fail in production the moment the generator's prompt asks for "context from the top three passages."
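
A minimal sketch of both metrics, assuming graded gains derived from the golden ordering (for example, gain = number of candidates minus golden position — one convention among several):

```python
import math

def ndcg_at_k(ranked_ids: list[str], gains: dict[str, float], k: int = 10) -> float:
    """gains maps passage ID to a graded relevance, e.g. derived from the golden
    ordering as len(golden) - position. DCG discounts each gain by log2(rank + 1)."""
    dcg = sum(gains.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """Reciprocal rank of the first relevant passage; 0 if none appears."""
    for rank, d in enumerate(ranked_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0
```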

The eval set should be sliced. A single global nDCG number tells you nothing about which query types are regressing. Slice by domain (technical docs vs. policy docs vs. user-generated content), by query length, by ambiguity, by whether the gold answer is in a single chunk or spans multiple. Reranker regressions are usually localized to one or two slices; the global score moves a fraction of a point while a specific slice cratered.
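
A small sketch of that per-slice aggregation, assuming each eval case carries slice tags like the dataclass above:

```python
from collections import defaultdict
from statistics import mean

def ndcg_by_slice(per_case: list[tuple[list[str], float]]) -> dict[str, float]:
    """per_case holds (slice_tags, nDCG@10) for each eval case. Returns the mean
    nDCG per slice, so a regression localized to one slice is visible even when
    the global mean barely moves."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for tags, score in per_case:
        for tag in tags:
            buckets[tag].append(score)
    return {tag: mean(scores) for tag, scores in buckets.items()}
```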

This eval should run in CI on every PR that touches the reranker prompt, the reranker model version, the cross-encoder weights file, or any preprocessing that affects the candidate list the reranker sees. It should not run only when the answer-generation prompt changes; the reranker has its own change frequency, and gating it on the answer-generation cadence misses the drift that happens between answer-prompt releases.
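
One way to wire that gate, sketched as a pytest-style check. The file paths, the 2-point threshold, and the run_reranker_eval helper are stand-ins for whatever the repo actually uses:

```python
import json

MAX_NDCG_DROP = 0.02  # fail the PR if any slice drops more than 2 points of nDCG@10

def test_reranker_ndcg_does_not_regress():
    with open("evals/reranker/baseline_ndcg.json") as f:
        baseline = json.load(f)                                # slice -> nDCG@10 from the pinned baseline
    current = run_reranker_eval("evals/reranker/cases.jsonl")  # hypothetical helper, same shape
    regressions = {
        s: (baseline[s], current.get(s, 0.0))
        for s in baseline
        if baseline[s] - current.get(s, 0.0) > MAX_NDCG_DROP
    }
    assert not regressions, f"reranker nDCG@10 regressed on slices: {regressions}"
```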

The score distribution dashboard

Independent of the eval suite, the production reranker needs a score-distribution dashboard. For every request, log the reranker's scores on the candidates it ranked, and aggregate the distribution over a rolling window. What you want to see is a stable distribution: roughly the same shape, the same mean, the same variance, the same tail behavior, day over day.
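
A minimal sketch of that rolling aggregation; the window size and the particular statistics are illustrative choices:

```python
import statistics
from collections import deque

class RerankerScoreMonitor:
    """Rolling window of raw reranker scores with simple shape statistics."""
    def __init__(self, window: int = 50_000):
        self.scores: deque[float] = deque(maxlen=window)

    def log_request(self, candidate_scores: list[float]) -> None:
        self.scores.extend(candidate_scores)

    def snapshot(self) -> dict[str, float]:
        s = sorted(self.scores)
        if not s:
            return {}
        n = len(s)
        return {
            "mean": statistics.fmean(s),
            "stdev": statistics.pstdev(s),
            "p10": s[int(0.10 * (n - 1))],
            "p50": s[int(0.50 * (n - 1))],
            "p90": s[int(0.90 * (n - 1))],
        }
```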

What you actually want to alert on is shape change. If the reranker historically produces a bimodal score distribution (clear winners and clear losers) and that distribution suddenly compresses into a single mode around 0.5, the reranker has stopped discriminating. The end-to-end answer quality might still be acceptable today because the gold chunk is still squeaking into the top-5 by the thinnest margin — but the next retriever change, or the next prompt edit, will push it out. The dashboard catches the regression before it becomes user-visible.
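
One crude but serviceable detector for that compression, sketched with an illustrative threshold:

```python
import statistics

def distribution_compressed(reference: list[float], current: list[float],
                            min_stdev_ratio: float = 0.6) -> bool:
    """Shape-change alarm: fire when the current window's spread collapses
    relative to the reference window. A bimodal distribution compressing into a
    single mode around 0.5 shows up as a sharp stdev drop. The 0.6 threshold is
    an illustrative choice; a two-sample KS test is a stricter alternative."""
    ref_spread = statistics.pstdev(reference)
    cur_spread = statistics.pstdev(current)
    return ref_spread > 0 and (cur_spread / ref_spread) < min_stdev_ratio
```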

A good baseline: log the reranker's score for the chunk that ultimately got cited in the answer. Track that score's distribution over time. When the cited chunk's reranker score starts trending toward the score of the rejected chunks, the reranker is no longer separating signal from noise, and you have weeks of warning before the answer-quality dashboard catches up.
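
Sketched as a single function over the per-request scores, with the cited-chunk ID assumed to be available from answer attribution:

```python
def cited_chunk_margin(candidate_scores: dict[str, float], cited_id: str) -> float:
    """Margin between the reranker score of the chunk the answer actually cited
    and the mean score of the rejected chunks. A margin trending toward zero
    means the reranker is no longer separating signal from noise."""
    cited = candidate_scores[cited_id]
    rejected = [s for doc_id, s in candidate_scores.items() if doc_id != cited_id]
    return cited - (sum(rejected) / len(rejected)) if rejected else 0.0
```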

Ownership is where this breaks

The reason this rarely gets done is organizational, not technical. The retriever is owned by the search or ML platform team. The generator is owned by the product or AI team. The reranker is owned by whoever shipped it last quarter, which is usually the same team that owns the generator — and that team is rewarded for end-to-end answer quality, not for reranker quality.

The result: prompt edits to the reranker get reviewed in the same PRs as prompt edits to the generator, by the same reviewers, against the same eval. The reranker prompt is one heading inside prompts.yaml and the answer-generation prompt is another. They look the same. They are reviewed the same. They regress at different rates and the eval suite only sees one of them.

The discipline that has to land is a separate ownership boundary for the reranker, even if the same engineer is on both sides. The reranker prompt should live in a different file (or at least a different CODEOWNERS path), require a reviewer fluent in retrieval metrics rather than answer-quality metrics, and be gated by its own eval — independent of any change to the answer model. When the reranker is a hosted vendor model, the team needs an explicit policy on vendor version pinning: which version are we on, when do we evaluate the next version against our eval set, who decides to upgrade.
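
A sketch of what that pinning looks like in code; the client wrapper and the version string are hypothetical, not any specific vendor's API:

```python
# The reranker model version is an explicit, reviewed constant rather than a
# floating "latest" alias; it is bumped only after the reranker eval suite
# passes on the candidate version.
PINNED_RERANKER_MODEL = "vendor-rerank-v3.1"

def rerank_with_pinned_model(client, query: str, documents: list[str], top_n: int = 5):
    return client.rerank(model=PINNED_RERANKER_MODEL, query=query,
                         documents=documents, top_n=top_n)
```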

This is the same pattern the embedding team has already learned. Nobody upgrades their embedding model in production without re-indexing and re-evaluating; the cost of the wrong model has been beaten into the field by enough quality regressions. The reranker has not yet had its embedding-upgrade moment, but it will, and the teams that pre-build the eval discipline will be the ones who survive it without a postmortem.

A two-model pipeline needs two-model eval

The architectural realization at the heart of this is simple. A RAG pipeline is not one model with retrieved context; it is two models in series, with an interface — the candidate list and its scores — that is itself a critical artifact. The team that evals only the terminal model is debugging the system one model short of the truth. The team that evals only the retriever is grading the warmup, not the closer.

The fix scales beyond rerankers. Any time a system has a "silent second model" — a query-rewriting LLM in front of the retriever, a router that picks which RAG corpus to query, a post-hoc filter that drops low-confidence answers — that model needs its own eval. The pattern is general: components with their own learned behavior need their own metric, their own dashboard, and their own gate. End-to-end eval is necessary but not sufficient; it diagnoses the existence of a regression but obscures its location.

Build the reranker eval before you need it. The PR that takes you from "answer quality dropped, no idea why" to "reranker nDCG@10 dropped 6 points on the technical-docs slice between v3.1 and v3.2 of the cross-encoder" is the PR that turns a three-day mystery into a thirty-minute fix. That delta — three days versus thirty minutes — is the cost of treating the second model in your pipeline like it isn't one.
