The RAG Threshold Pinned to an Absolute Score the Embedding Upgrade Silently Moved
A RAG pipeline ships with a reranker score threshold of 0.4. Anything below gets dropped from the prompt. Six months in, a routine index rebuild swaps the embedding model for a newer checkpoint in the same family — a transparent upgrade, the change log says. Two days later answer relevance falls 6%. The team blames the LLM, runs a model bake-off, finds no candidate that recovers the loss, and spends a quarter chasing a regression that lives in none of the models they were comparing.
The regression lives in the gate. The reranker — untouched, same checkpoint, same weights — is now scoring a different candidate set. The new embeddings pull different chunks into the top-50, the reranker scores them lower on its own calibration, and the gate at 0.4 drops 37% more candidates than it did the week before. The number 0.4 didn't change. What 0.4 meant changed.
A Constant Is Only a Constant if Its Inputs Are
A threshold is a coupling. When you write "if score > 0.4", you are asserting that the distribution of score is stable enough that 0.4 will continue to mean what it meant when you tuned it. That assertion holds for as long as nothing upstream of score changes.
In a two-stage retrieval pipeline, score is produced by a cross-encoder reranker. The reranker's input is the candidate set produced by the first-stage retriever. The first-stage retriever's behavior depends on the embedding model. The embedding model is updated on a schedule that the reranker team does not own. The threshold is tuned by a third team that owns neither.
The dependency chain looks like this: embedding model → candidate set composition → reranker input distribution → reranker output distribution → gate pass rate. A change at any node propagates downstream. The constant at the end is a constant only as long as every node above it is.
Most teams know this in the abstract. The failure happens when the abstract knowledge does not translate into a deploy-time check. The embedding upgrade lands as a routine index rebuild. The release note says "we swapped to a newer checkpoint, scores are higher-quality." The reranker team does not get paged. The threshold owner does not get paged. The gate's pass rate falls silently because nothing in the pipeline emits a metric for "fraction of candidates above the gate," and the alert that would have caught it does not exist.
Why Score Distributions Move Even When the Model Doesn't
The intuition that "I didn't change the reranker, so its scores can't change" is almost right and dangerously wrong. The reranker as a function does not change. The inputs do.
A cross-encoder takes a (query, document) pair and returns a relevance score in some range. The score depends on the pair. When the embedding model changes, the top-50 candidate set for a given query shifts. Some of those new candidates are chunks the old embeddings would have ranked lower — chunks that are topically adjacent rather than topically central, or chunks from longer documents whose embeddings now cluster differently in vector space.
The reranker scores these new candidates on its own calibration. If the new candidates skew toward harder cases — pairs the reranker is less confident about — the per-candidate score distribution shifts down. The median drops. The histogram tilts. The chunks that used to score 0.55 are still in the top-50; the new chunks that join them score 0.25. The pass rate at 0.4 collapses.
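The arithmetic behind that collapse can be sketched with made-up numbers. The score lists below are hypothetical, chosen only to mirror the scenario described above (strong chunks stay, low-scoring newcomers displace the rest); `pass_rate` is an illustrative helper, not part of any library.

```python
def pass_rate(scores, threshold=0.4):
    """Fraction of candidate scores strictly above the gate."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s > threshold) / len(scores)


# Hypothetical reranker scores for one query's top-50 (illustrative values only).
# Old embeddings: 20 strong chunks, 15 marginal passes, 15 clear misses.
before = [0.55] * 20 + [0.45] * 15 + [0.30] * 15
# New embeddings: the strong chunks remain, but 25 newcomers scoring 0.25
# displace most of the marginal passes and all of the old misses.
after = [0.55] * 20 + [0.45] * 5 + [0.25] * 25

print(f"pass rate before: {pass_rate(before):.2f}")  # 0.70
print(f"pass rate after:  {pass_rate(after):.2f}")   # 0.50
```

The reranker function and the 0.4 gate are identical in both runs; only the composition of the candidate set differs.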
Practitioners describe this in milder forms when they tune thresholds for production. Cohere's documentation advises picking 30–50 representative queries with borderline-relevant documents and averaging the rerank scores to derive a threshold — a procedure that produces a value calibrated to a specific corpus and retrieval setup. The procedure has no notion of "what to do when the retrieval setup changes." It produces a number, the team writes the number into a config file, and the number becomes a constant in the codebase regardless of whether the conditions that produced it still hold.
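A minimal sketch of that averaging procedure, with one addition that is an assumption of this example rather than part of the documented procedure: the threshold is stored together with the identifiers of the models it was tuned against, so a later mismatch can be detected. The model names and scores are hypothetical.

```python
from statistics import mean


def derive_threshold(borderline_scores):
    """Average the reranker scores of borderline-relevant (query, document) pairs."""
    if not borderline_scores:
        raise ValueError("need at least one scored borderline pair")
    return mean(borderline_scores)


def calibrate(borderline_scores, embedding_model, reranker_model):
    """Bundle the threshold with the setup it was tuned against."""
    return {
        "threshold": derive_threshold(borderline_scores),
        "embedding_model": embedding_model,
        "reranker_model": reranker_model,
        "n_pairs": len(borderline_scores),
    }


def is_stale(calibration, embedding_model, reranker_model):
    """True if either upstream model differs from the one used at tuning time."""
    return (calibration["embedding_model"] != embedding_model
            or calibration["reranker_model"] != reranker_model)


# Hypothetical scores and model identifiers, for illustration only.
scores = [0.38, 0.42, 0.41, 0.39]
cal = calibrate(scores, "embed-v1", "rerank-v1")
print(cal["threshold"])
print(is_stale(cal, "embed-v2", "rerank-v1"))  # True
```

Recording provenance does not retune anything by itself, but it turns "the conditions that produced this number no longer hold" into a condition a deploy script can check.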
Recent work like MAIN-RAG (Chang et al., 2025) builds adaptive filtering directly into the retrieval loop, dynamically adjusting the relevance threshold per query based on that query's score distribution. The framing is the same: the threshold is a function of the distribution, not a number you carve into the pipeline.
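One simple way to make the cutoff a function of each query's score distribution is sketched below. This is a generic illustration and is not claimed to reproduce MAIN-RAG's algorithm; the function name `adaptive_filter`, the parameter `k`, and the sample scores are all assumptions of this example.

```python
from statistics import mean, pstdev


def adaptive_filter(scored_chunks, k=0.0):
    """Keep chunks scoring at or above (mean - k * std) of this query's own scores."""
    if not scored_chunks:
        return []
    scores = [score for _, score in scored_chunks]
    cutoff = mean(scores) - k * pstdev(scores)
    return [(chunk, score) for chunk, score in scored_chunks if score >= cutoff]


# Hypothetical (chunk_id, score) pairs for the same query before and after a shift.
old = [("a", 0.7), ("b", 0.6), ("c", 0.2)]
new = [("a", 0.5), ("b", 0.4), ("c", 0.0)]

print([c for c, _ in adaptive_filter(old)])  # ['a', 'b']
print([c for c, _ in adaptive_filter(new)])  # ['a', 'b']
```

A fixed "score > 0.4" gate would keep both a and b from the first list but only a from the second, even though the ranking is unchanged.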
The Diagnosis Is Hard Because the Metric Is Missing
When answer relevance falls and the team starts an investigation, the obvious suspect is the model. The eval set runs against the LLM. The LLM scores look fine on the eval set because the eval set has its own canonical contexts. The eval set does not exercise the retrieval path, so a retrieval regression does not surface there.
The next suspect is the embedding model. Someone notices the index was rebuilt and runs a retrieval-only eval, measuring recall@50 against a labeled query-document set. Recall@50 looks fine. The new embeddings are pulling roughly the same number of relevant documents into the top-50 as the old ones did. The first-stage retriever is doing its job.
The reranker is interrogated last because it was not the thing that changed. The team pulls the reranker's average score for the past month, sees it drop from 0.51 to 0.36 on the day of the index rebuild, and the cause becomes obvious. The diagnosis takes weeks because no dashboard surfaces the pre-gate score histogram. No alert fires when the histogram moves. The team's monitoring covers latency, error rate, and end-to-end answer quality. The intermediate metric that would have caught this — distribution of reranker scores over a fixed time window — is a metric nobody thought to instrument because nothing seemed likely to move it.
This is the deeper failure: a pipeline's instrumentation reflects the failure modes its designers anticipated. The designers anticipated that the reranker might be swapped, the embedding model might be swapped, the threshold might be tuned wrong. They did not anticipate that one of those three would silently change another. The instrumentation has a gap that maps exactly to the dependency the designers did not name.
What to Build Instead of a Number in a Config File
The patterns that close the gap are not exotic. They are mostly the discipline of treating thresholds as derived rather than declared.
Percentile-based gates instead of absolute scores. Replace "score > 0.4" with "score in the top X% of the candidate set" or "score > p70 of the last N queries' scores". A percentile gate adapts to distribution shift by construction. The strictness of the gate stays constant relative to the current data even as the underlying numbers drift. The cost is that the gate no longer commits to a fixed quality floor: if all candidates are bad, the gate still lets the best fraction through. A percentile gate is therefore often paired with an absolute floor as a safety net.
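A minimal sketch of a per-query percentile gate with an absolute floor follows. The function name, the 30% keep rate, and the 0.15 floor are illustrative assumptions, not recommended values.

```python
def percentile_gate(scores, keep_percent=30, floor=0.15):
    """Indices of candidates in the top keep_percent by score that also clear floor."""
    if not scores:
        return []
    # Integer ceiling avoids float rounding surprises such as 10 * 0.3 > 3.
    n_keep = max(1, (len(scores) * keep_percent + 99) // 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:n_keep] if scores[i] >= floor]


# Hypothetical score lists; `shifted` is the same ranking moved down by 0.2.
original = [0.55, 0.52, 0.48, 0.30, 0.25, 0.22, 0.20, 0.18, 0.16, 0.15]
shifted = [round(s - 0.2, 2) for s in original]
all_bad = [0.10, 0.05, 0.02]

print(percentile_gate(original))  # [0, 1, 2]
print(percentile_gate(shifted))   # [0, 1, 2]
print(percentile_gate(all_bad))   # []
```

The same three chunks pass before and after the shift, while the floor still rejects a candidate set in which nothing scores well. The floor is itself an absolute number, so it inherits a milder version of the original problem and should be set loosely.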
Re-calibrate the threshold on every upstream change. Bake the threshold tuning procedure into the deployment of any upstream component. If the embedding model is updated, the reranker threshold is automatically retuned against a held-out set as part of the deploy. The threshold lives in a config file that gets rewritten by a CI step, not in a config file that gets edited by hand. This requires the held-out set to exist, to be representative, and to be cheap to score — three preconditions that often quietly fail.
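As a sketch of what such a CI step might run, the code below picks a cutoff from a labeled held-out set and renders the config that the step would write. Maximizing F1 is an assumption of this example (the text does not prescribe a tuning objective), as are the function names, the config keys, and the sample data. The gate here is taken to be "score >= threshold".

```python
import json


def retune_threshold(labeled):
    """Pick the cutoff that maximizes F1 on (score, is_relevant) pairs."""
    if not labeled:
        raise ValueError("held-out set is empty")
    best_t, best_f1 = None, -1.0
    for t in sorted({score for score, _ in labeled}):
        tp = sum(1 for s, rel in labeled if s >= t and rel)
        fp = sum(1 for s, rel in labeled if s >= t and not rel)
        fn = sum(1 for s, rel in labeled if s < t and rel)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict '>' keeps the lowest cutoff among ties
            best_t, best_f1 = t, f1
    return best_t


def render_gate_config(threshold, embedding_model):
    """JSON text for the CI step to write; never edited by hand."""
    return json.dumps(
        {"threshold": threshold, "calibrated_against": embedding_model}, indent=2
    )


# Hypothetical held-out scores from the reranker under old and new embeddings.
old_run = [(0.60, True), (0.50, True), (0.45, True), (0.30, False), (0.20, False)]
new_run = [(0.40, True), (0.30, True), (0.25, True), (0.10, False), (0.00, False)]

print(retune_threshold(old_run))  # 0.45
print(retune_threshold(new_run))  # 0.25
print(render_gate_config(retune_threshold(new_run), "embed-v2"))
```

The retuned value is only as good as the held-out set, which is where the three preconditions above come in.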
Monitor the pre-gate score distribution. Emit the reranker's pre-gate score histogram as a first-class metric. Run a Kolmogorov-Smirnov (KS) or Population Stability Index (PSI) comparison against a rolling baseline. Alert when the chosen statistic exceeds a configured value between deploys. KS measures the largest gap between two empirical cumulative distribution functions and is well-suited to comparing two samples of raw scores; PSI is a coarser bucketed comparison that is easier to alert on but less sensitive to subtle shifts. Pick one, instrument it, and treat its alert as a deploy-gating signal rather than a paging signal. The goal is to catch the shift before the symptom surfaces, not to react to it.
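Both statistics are small enough to sketch with the standard library. The implementations below are illustrative, assume non-empty samples, and for PSI assume scores lie in [0, 1] with ten equal-width buckets; the alert value `KS_ALERT` and the sample data are hypothetical.

```python
import bisect
import math


def ks_statistic(baseline, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def psi(baseline, current, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Population Stability Index over equal-width buckets on [lo, hi]."""
    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / (hi - lo) * bins)
            counts[min(bins - 1, max(0, idx))] += 1
        # eps keeps empty buckets from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p, q = shares(baseline), shares(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical samples: the current window is the baseline shifted down by 0.15.
baseline = [0.30 + 0.004 * i for i in range(100)]
current = [s - 0.15 for s in baseline]

KS_ALERT = 0.2  # hypothetical alert value; tune on your own traffic
ks = ks_statistic(baseline, current)
print(f"KS={ks:.3f} PSI={psi(baseline, current):.3f} block_deploy={ks > KS_ALERT}")
```

In a real pipeline the baseline would be a rolling window of pre-gate scores and the comparison would run as part of the deploy, so that a failed check blocks the rollout instead of paging someone afterwards.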
Regression test the gate's pass rate. Maintain a fixed set of representative queries that get run end-to-end through the pipeline on every deploy. Assert that the count of chunks above the gate stays within a band — say, ±10% of the baseline. A test that asserts on the gate's behavior catches the case where an upstream change moves the distribution under the gate even when neither the gate nor the upstream component reports an obvious problem.
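A minimal sketch of such an assertion, assuming the pipeline can return the reranker scores for each query in the fixed set. The helper names, the 0.4 gate, and the score lists are hypothetical.

```python
def count_above_gate(score_lists, threshold=0.4):
    """Total number of chunks across all test queries that clear the gate."""
    return sum(1 for scores in score_lists for s in scores if s > threshold)


def within_band(current, baseline, tolerance=0.10):
    """True if current is within +/- tolerance of the recorded baseline count."""
    return abs(current - baseline) <= tolerance * baseline


# Hypothetical per-query reranker scores for a fixed regression set.
baseline_runs = [[0.62, 0.51, 0.44, 0.31], [0.58, 0.47, 0.29, 0.22]]
candidate_runs = [[0.48, 0.37, 0.30, 0.17], [0.44, 0.33, 0.15, 0.08]]

baseline_count = count_above_gate(baseline_runs)    # 5
candidate_count = count_above_gate(candidate_runs)  # 2

assert within_band(baseline_count, baseline_count)
if not within_band(candidate_count, baseline_count):
    print(f"gate pass count moved: {baseline_count} -> {candidate_count}")
```

In a test suite the final check would be an assertion that fails the deploy. The baseline count has to be re-recorded whenever the threshold is deliberately retuned, otherwise the test flags intended changes.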
The Architectural Realization
The lesson here is bigger than rerankers. Any pipeline where a downstream constant depends on an upstream distribution has the same shape. A latency budget pinned to a tail percentile of last quarter's traffic. A confidence threshold on a fraud model trained on a snapshot of last year's labels. A retry policy tuned to a provider's old failure rate. A cost cap calibrated to a model whose tokenizer has since been upgraded. In every case, a number that looks like a constant in the config file is actually a function of a distribution that lives outside the config file.
The architectural move is to make this dependency explicit. Either remove the constant, replacing it with a function that recomputes from current data, or make the upstream changes that would invalidate it impossible to deploy without retuning. The middle path, "we'll watch for it and react," is the path that produces the quarter-long regression. The team that watches always watches the things it expected might move. The thing that moved was the thing that was supposed to be a constant.
In a RAG pipeline, the constant is almost always the gate. The thing that moves it is almost always something the team upgraded without flagging downstream. The patch is to stop writing absolute scores into config files, or — if you must — to wire the deploy of every upstream component to a retune of every downstream threshold. Neither is glamorous. Both are cheaper than spending a quarter chasing a regression that lives in a number nobody changed.
- https://docs.cohere.com/docs/reranking-best-practices
- https://nickberens.me/blog/understanding-rag-score-thresholds/
- https://fireworks.ai/blog/Understanding-Embeddings-and-Reranking-at-Scale
- https://galileo.ai/blog/mastering-rag-how-to-select-a-reranking-model
- https://machinelearningmastery.com/detecting-handling-data-drift-in-production/
- https://www.evidentlyai.com/blog/data-drift-detection-large-datasets
- https://towardsdatascience.com/understanding-kolmogorov-smirnov-ks-tests-for-data-drift-on-profiled-data-5c8317796f78/
- https://arxiv.org/pdf/2501.00332
- https://arxiv.org/html/2511.09803v2
