
The Freshness-Relevance Tradeoff in RAG: Why You Can't Optimize Both at Query Time

11 min read
Tian Pan
Software Engineer

A user asks your assistant what the company's parental leave policy is. The bot returns 12 weeks, with a citation. The cited document was the right answer in 2023; HR posted an update last quarter that raised it to 16 weeks. Both versions are in your knowledge base. Cosine similarity scored the 2023 version 0.87 and the 2024 version 0.84, because the older page has the cleaner phrasing and fewer hedges. The fresher document loses by 0.03, and the user gets a wrong answer that looks audited.

This is the freshness-relevance tradeoff, and the uncomfortable part is that it has no clean solution at query time. If you weight recency, you bias retrieval toward whatever was edited yesterday — which in most knowledge bases is the noisy, high-churn surface area that should not be the source of truth. If you don't weight recency, you ship answers grounded in documents that were superseded months ago. There is no single global knob that gets both right, and most teams discover this only after a few embarrassing answers leak past their eval suite.

The default architecture makes the problem invisible. Retrieval scores documents on semantic similarity, returns the top-k, and lets the model write fluent prose around whatever shows up. The query "what's our parental leave policy" looks identical to "what was our parental leave policy in 2023" to a vanilla embedder; both score high against any document that uses those words. The model sees grounded context, the trace looks healthy, and the wrong answer ships with a citation pointing at a stale URL.

Why Pure Semantic Retrieval Fails on Time

Embeddings are trained to capture meaning, not chronology. Two versions of the same policy page will be near-neighbors in vector space precisely because they cover the same topic. Whichever phrasing scored a hair higher on cosine similarity wins, and that ranking is stable across queries — meaning a single bad document with crisp prose can dominate retrieval indefinitely once it's in the index.

Recent academic work makes this concrete. The ChronoQA benchmark, built from 300,000+ news articles published between 2019 and 2024, evaluates retrievers on questions with explicit temporal constraints. Standard semantic retrievers score poorly because their embeddings can't distinguish "the current CEO" from "the CEO three administrations ago" when both documents discuss the same role. The relevance signal is correct; the temporal signal is missing entirely.

The corollary is that adding a recency prior, alone, doesn't fix things either. Recency-weighted retrieval will happily surface a document edited yesterday by an intern fixing a typo, ranked above the canonical reference that hasn't been touched in eight months because nothing about it needed to change. Edit recency is a poor proxy for content recency, and treating "recently modified" as "more relevant" rewards activity over correctness.

The Recency Prior — When It Works

The simplest mitigation is a fused score that blends semantic similarity with a temporal decay factor. A common form looks like:

score(q, d, t) = α · cos(q, d) + (1 − α) · 0.5^(age_days(t) / h)

where α weights relevance versus recency (a common default is ~0.7) and h is the half-life in days (a common starting point is 14). Documents lose half their recency contribution every h days, and because the semantic term is untouched, an ancient document is floored at α · cos(q, d) rather than decaying to zero.
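As a concrete sketch, here is what that fused score looks like in code. The function name, the default α = 0.7, and the 14-day half-life are illustrative choices, not a library API:

```python
import math
from datetime import datetime, timezone

def fused_score(cos_sim: float, doc_updated_at: datetime,
                alpha: float = 0.7, half_life_days: float = 14.0,
                now: datetime | None = None) -> float:
    """Blend semantic similarity with an exponential recency decay.

    cos_sim is the cosine similarity between query and document embeddings.
    The recency term halves every `half_life_days`, so a document updated
    exactly one half-life ago contributes (1 - alpha) * 0.5 to the score.
    """
    now = now or datetime.now(timezone.utc)
    age_days = max((now - doc_updated_at).total_seconds() / 86400.0, 0.0)
    recency = 0.5 ** (age_days / half_life_days)
    return alpha * cos_sim + (1 - alpha) * recency
```

Ranking then means sorting candidates by this fused score instead of raw cosine similarity; a stale-but-on-topic document still keeps its full semantic term, so it only loses when a fresher document is close enough in meaning to compete.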

This works well — sometimes spectacularly well — on queries where freshness is the answer. A 2025 paper benchmarking this approach reported near-perfect accuracy on a "latest-document retrieval" task, where pure semantic retrieval scored zero. If the user asks "what's the most recent earnings number," the recency prior makes the right document outrank older documents that are slightly tighter semantic matches.

The catch is that this only works if you apply it on the right queries. Apply the same prior to "what was the founding mission of the company" and you'll bias toward whichever recent document happens to mention the mission, ahead of the actual founding document that hasn't been touched in five years. The recency prior is correct for time-sensitive queries and actively harmful for time-stable ones. A single global α chosen for the average query will be wrong half the time on the queries that matter.

The Time-Sensitivity Classifier

The next layer is recognizing time-sensitivity at query time and applying the recency prior selectively. Think of it as routing: for a query like "what is our refund window," you want freshness pushed to maximum because the answer is genuinely time-varying. For a query like "explain the architecture of our pricing model," you want recency to barely register because the canonical explanation is old by design.

The classifier doesn't need to be sophisticated. A small model — even a few-shot LLM call — can reliably tag queries with a time-sensitivity score. The signals are well-defined: explicit temporal markers ("now," "current," "this quarter"), entity types that are inherently volatile (prices, headcount, leadership), and content domains where ground truth shifts (policy, pricing, inventory, news) versus those where it doesn't (architecture, history, philosophy, brand). Production systems often combine an LLM classifier with a rule layer for the high-confidence cases, falling back to the model for the ambiguous ones.
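A minimal sketch of that routing layer follows. The signal lists and thresholds are illustrative, and the LLM fallback is modeled as an injected callable rather than any particular provider's API:

```python
import re
from typing import Callable, Optional

# Illustrative signal lists; a real deployment would curate these per domain.
TEMPORAL_MARKERS = re.compile(
    r"\b(now|current(ly)?|today|latest|recent|this (week|month|quarter|year))\b", re.I)
VOLATILE_TOPICS = re.compile(
    r"\b(price|pricing|headcount|policy|leave|refund|inventory|ceo|earnings)\b", re.I)
STABLE_TOPICS = re.compile(
    r"\b(founding|history|architecture|philosophy|mission|brand)\b", re.I)

def time_sensitivity(query: str,
                     llm_fallback: Optional[Callable[[str], float]] = None) -> float:
    """Return a 0..1 time-sensitivity score for a query.

    High-confidence cases are decided by rules; ambiguous ones fall back to
    an LLM scorer if one is provided, otherwise to a neutral 0.5.
    """
    has_temporal = bool(TEMPORAL_MARKERS.search(query))
    has_volatile = bool(VOLATILE_TOPICS.search(query))
    has_stable = bool(STABLE_TOPICS.search(query))

    if has_temporal or (has_volatile and not has_stable):
        return 0.9   # clearly time-sensitive
    if has_stable and not has_volatile:
        return 0.1   # clearly time-stable
    if llm_fallback is not None:
        return llm_fallback(query)   # few-shot LLM call, scored 0..1
    return 0.5       # ambiguous, no fallback available
```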

What this buys you is per-query α tuning. A volatile query gets α=0.4 and a 7-day half-life. A reference query gets α=0.95 and a 365-day half-life. The static-knob problem dissolves because there is no static knob; the score function is parameterized by the query, not the system.

This routing layer is also where you get to refuse certain combinations gracefully. If the classifier says the query is highly time-sensitive but no document in the corpus has been updated in 200+ days, that's a signal to add a "this information may be out of date" hedge to the response, rather than silently serving stale content as authoritative.
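One way to express that refusal, assuming a sensitivity score like the one above and a per-document `updated_at` timestamp in the retrieval metadata (both hypothetical field names):

```python
from datetime import datetime, timezone

STALE_HEDGE = ("Note: the most recent source on this topic was last updated "
               "{days} days ago, so this answer may be out of date.")

def maybe_hedge(query_sensitivity: float, retrieved_docs: list[dict],
                max_age_days: int = 200) -> str | None:
    """Return a hedge string when a time-sensitive query only finds old sources."""
    if query_sensitivity < 0.7 or not retrieved_docs:
        return None
    newest_age = min(
        (datetime.now(timezone.utc) - d["updated_at"]).days
        for d in retrieved_docs
    )
    return STALE_HEDGE.format(days=newest_age) if newest_age > max_age_days else None
```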

Source Volatility Scoring

The complement to query-side time-sensitivity is corpus-side volatility. Not every document in your knowledge base ages at the same rate. A page describing the company's mission decays slowly. A page listing current pricing tiers decays fast. Treating both with the same recency curve is the wrong shape of decay.

Volatility scores can be set manually for a small corpus or learned from edit history for a large one. The signals are in the version control: a document that gets edited every six weeks is high-volatility; a document that has had three commits in five years is low-volatility. A "knowledge entropy" score per document, attached as metadata at index time, lets the retrieval layer apply different decay curves to different content types.
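A rough way to learn that half-life from edit history, assuming each document carries a list of edit timestamps gathered at ingestion time (for example, from git log); the median-gap heuristic and the clamping range are assumptions, not a standard formula:

```python
from datetime import datetime

def half_life_from_edits(edit_times: list[datetime],
                         min_days: float = 7.0, max_days: float = 365.0) -> float:
    """Estimate a per-document half-life from how often it actually changes.

    The median gap between edits becomes the half-life, clamped to a sane
    range. Zero or one edits means the document is treated as stable.
    """
    if len(edit_times) < 2:
        return max_days
    times = sorted(edit_times)
    gaps = sorted((b - a).days for a, b in zip(times, times[1:]))
    median_gap = gaps[len(gaps) // 2]
    return float(min(max(median_gap, min_days), max_days))
```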

The practical effect is that the retrieval scoring becomes:

score(q, d, t) = α(q) · cos(q, d) + (1 − α(q)) · 0.5^(age_days(t) / h(d))

where α(q) is the per-query weight from the time-sensitivity classifier and h(d) is the per-document half-life from the volatility score. The semantic similarity is unchanged; the temporal half of the score is now query-aware and document-aware. The pricing page gets a 7-day half-life, the architecture overview gets a 365-day half-life, and the user query decides how much either side matters.
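Putting the two sides together, again as a sketch built on the helpers above, with an illustrative linear mapping from the classifier's score to α (the 0.4 and 0.95 endpoints mirror the examples earlier; the linearity is an assumption):

```python
def alpha_for_query(sensitivity: float) -> float:
    """Map time-sensitivity (0..1) to the relevance weight alpha.

    A highly volatile query (sensitivity ~1) gets alpha ~0.4; a reference
    query (sensitivity ~0) gets alpha ~0.95.
    """
    return 0.95 - 0.55 * sensitivity

def score(query: str, doc: dict, cos_sim: float, llm_fallback=None) -> float:
    """Query-aware, document-aware fused score.

    doc is assumed to carry 'updated_at' (datetime) and 'half_life_days'
    (float, from the volatility scorer) as metadata set at index time.
    """
    sensitivity = time_sensitivity(query, llm_fallback)
    alpha = alpha_for_query(sensitivity)
    return fused_score(cos_sim, doc["updated_at"],
                       alpha=alpha, half_life_days=doc["half_life_days"])
```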

A second-order benefit of explicit volatility scoring is that it gives you a staleness audit. Once every document has a volatility tag and an age, you can compute how many "high-volatility, high-age" documents exist in your corpus — and that number is a leading indicator of failure. A knowledge base where 12% of high-volatility documents exceed their expected refresh window is a knowledge base whose answer quality is decaying right now. Most teams discover this only after a complaint; the audit lets you discover it from the index.
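The audit itself is a one-pass scan over the index metadata. A minimal version, assuming the same per-document fields as above and treating "expected refresh window" as a multiple of the half-life (both thresholds are assumptions to tune):

```python
from datetime import datetime, timezone

def staleness_audit(docs: list[dict], refresh_multiple: float = 2.0) -> float:
    """Fraction of high-volatility documents past their expected refresh window.

    A document is flagged when its age exceeds `refresh_multiple` times its
    half-life; documents with long half-lives are treated as stable and skipped.
    """
    now = datetime.now(timezone.utc)
    flagged = total = 0
    for d in docs:
        if d["half_life_days"] >= 180:   # skip low-volatility documents
            continue
        total += 1
        age_days = (now - d["updated_at"]).days
        if age_days > refresh_multiple * d["half_life_days"]:
            flagged += 1
    return flagged / total if total else 0.0
```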

Where the Tradeoff Becomes Architectural

Once you accept that no static formula works for every query, the architecture has to change to support per-query parameterization. The retrieval layer is no longer "embed the query, return top-k." It's "classify the query, look up the classifier-derived parameters, compute a fused score with document-side metadata, return top-k, then optionally hedge in the response when the classifier and the corpus disagree."

This is more code than the textbook RAG diagram, and most teams underestimate it. The volatility metadata has to be maintained — either inferred from edit history during ingestion or curated as part of the document's frontmatter. The classifier has to be trained or prompted, and its output has to feed the scoring function rather than living off in a separate analytics dashboard. The eval suite has to include both time-sensitive and time-stable queries, because optimizing for one will silently regress the other.

The deeper realization is that the freshness-relevance tradeoff isn't a tuning problem; it's a modeling problem. The mistake is treating "relevance" as a single scalar that ranks documents against a query. Real relevance is at minimum two-dimensional: how well does this document answer the query, and how likely is it that the answer it gives is still correct. Pure semantic retrieval optimizes the first and ignores the second. Recency-weighted retrieval optimizes a proxy of the second and contaminates the first. You can't fold them into one number that works everywhere; you have to score them separately and combine them with a weight that depends on the query and the document.

What This Means for Your Eval Suite

If your retrieval evals only score top-k accuracy on a static set of question-document pairs, your system can pass them all and still be wrong on time. Two evaluation gaps tend to bite teams here.

The first is that static eval sets age out of validity. A test case written when the policy said 60 days will keep marking the stale 60-day answer as correct after the policy changes to 14, because nobody updated the eval. The eval suite has to include a freshness check: when source documents change, regenerate the affected expected answers, or at least flag them for review.

The second is that average-accuracy metrics hide the time-sensitive failure mode. A retriever can be 88% accurate overall and 30% accurate on the subset of queries that depend on recent updates — and the dashboard reads as "things are fine" because the time-sensitive subset is a small fraction of total traffic. Slicing accuracy by time-sensitivity class makes the gap visible. So does adding "stale retrieval rate" as a first-class metric: the fraction of retrievals whose top result has been superseded by a more recent document the retriever did not surface.
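A sketch of the stale-retrieval-rate metric, assuming each eval record carries the retrieved top document's id and, if one exists, the id of a newer document covering the same fact (hypothetical field names):

```python
def stale_retrieval_rate(eval_results: list[dict]) -> float:
    """Fraction of retrievals whose top result was superseded by a newer document.

    Each result is assumed to carry 'top_doc_id' and 'superseded_by': the id
    of a more recent document covering the same fact, or None if none exists.
    """
    stale = sum(
        1 for r in eval_results
        if r["superseded_by"] is not None and r["superseded_by"] != r["top_doc_id"]
    )
    return stale / len(eval_results) if eval_results else 0.0
```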

These metrics are downstream of having a time-sensitivity classifier and per-document volatility scores in the first place. The same machinery that powers the per-query α also powers the eval slicing. Build it once, instrument it once, and the freshness behavior of the system becomes legible instead of a vibes-based debate triggered by user complaints.

The Honest Conclusion

The freshness-relevance tradeoff is one of the rare cases in retrieval engineering where the textbook architecture is structurally inadequate, not just under-tuned. A single similarity score across documents of varying ages and queries of varying time-sensitivities will be wrong on a meaningful slice of traffic, and no amount of prompt engineering downstream will repair retrieval that returned the wrong document.

The fix is to give up on the single global knob and parameterize the scoring on both sides — the query, via a time-sensitivity classifier, and the document, via a volatility score. The cost is a more complex retrieval layer with explicit metadata pipelines. The payoff is a system that's correct on both "what was our founding mission" and "what's our refund window today," from the same index, with the same model, and without the operator having to remember to tune α for each new query type.

If your RAG system's wrongness is concentrated on questions about anything that changed recently, the answer is not to retrain the embedder. It's to admit that retrieval is a two-axis problem and start scoring it that way.
