The RAG Threshold Pinned to an Absolute Score the Embedding Upgrade Silently Moved
A RAG pipeline ships with a reranker score threshold of 0.4. Anything below gets dropped from the prompt. Six months in, a routine index rebuild swaps the embedding model for a newer checkpoint in the same family — a transparent upgrade, the change log says. Two days later answer relevance falls 6%. The team blames the LLM, runs a model bake-off, finds no candidate that recovers the loss, and spends a quarter chasing a regression that lives in none of the models they were comparing.
The regression lives in the gate. The reranker — untouched, same checkpoint, same weights — is now scoring a different candidate set. The new embeddings pull different chunks into the top-50, the reranker scores them lower on its own calibration, and the gate at 0.4 drops 37% more candidates than it did the week before. The number 0.4 didn't change. What 0.4 meant changed.
