The Marketing Page Your RAG Cited as an Engineering Spec
A support engineer pastes a customer ticket into your internal assistant. The question is sharp: "Does our API support multi-region writes on the free tier?" The assistant comes back instantly, citing a chunk it retrieved with 0.91 cosine similarity. The answer is yes. The chunk is from a landing page written by marketing in 2023 to win a head-to-head against a competitor. Engineering removed multi-region writes from the free tier eighteen months ago and posted a terse internal RFC that nobody linked from a customer-facing page. The RFC is also in the vector store. It scored 0.74.
The assistant didn't hallucinate. It retrieved the highest-scoring document and faithfully grounded its answer in the text. The retriever did its job. The job was wrong.
This is the failure mode that standard RAG benchmarks rarely catch and that faithfulness metrics do not flag. Your assistant is citing its sources. The sources are real. The pipeline is doing exactly what it was designed to do. The bug is that retrieval relevance and source authority are two different dimensions, and the standard RAG stack collapses them into a single score.
Polished Prose Wins the Vector Race
Marketing copy is dense with high-signal vocabulary. It was written by people whose job is to make sentences memorable. Every noun is a product term. Every verb is an action verb. Every claim is phrased as a benefit. When you embed that text, you get a vector that points crisply at any query a user might ask, because the text was optimized to mirror how users describe their needs.
Engineering documentation is the opposite. It is conservative. It uses precise terms, qualifies every claim, and buries the actual capability statement three paragraphs into a context-setting preamble. The embedding of an engineering doc is diffuse. It scatters its signal across "deprecation timeline," "migration path," "service-level constraints," and "known limitations." A query asking "does X support Y" lights up the marketing chunk like a lighthouse and skims past the engineering chunk like fog.
This is not a tuning problem. It is a content-shape mismatch. The thing your retriever is most confident about is the thing your business is least sure about. Researchers studying retriever-generator alignment have found that generators ignore the retriever's top-ranked documents in 47% to 67% of queries and rely on lower-ranked documents 48% to 66% of the time. The reason is not that the generator is broken. The reason is that the top-ranked document was wrong in a way the generator could pick up on even when the similarity score could not.
Relevance Is a Ranking Decision, Not a Truth Guarantee
A relevance score is a statement about geometric proximity in an embedding space. It tells you that the retrieved chunk is about the same topic as the query. It does not tell you that the chunk is current, that it was written by someone authorized to make the claim, or that the claim survived contact with the next sprint.
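To make the point concrete, here is a minimal sketch of the score itself, written in plain Python with a hypothetical function name. The only inputs are two vectors; nothing about the document's age, owner, or source type can enter the computation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Note what is absent: no timestamp, no author, no source type.
    return dot / (norm_a * norm_b)
```

Any notion of authority has to be added outside this function, because the function has no argument through which it could arrive.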
The standard RAG pipeline treats authority as if it were a property of the corpus rather than a property of each document. "We curated the knowledge base" becomes a once-a-quarter checkpoint, while documents leak in continuously from every team that has write access to Confluence. The vector store becomes a layered geological record: a Cambrian explosion of marketing pages from 2022, a Devonian fossil bed of deprecated wiki pages, an active topsoil of last week's RFCs. Cosine similarity is indifferent to the strata.
Worse, the moment your assistant cites a source, faithfulness metrics light up green. The output is grounded. The chain of evidence is clean. The eval suite is satisfied. The customer is misinformed. Studies of production RAG systems have documented this exact pattern: faithfulness scores rise because answers cite sources, regardless of whether those sources are relevant or correct. You built a system that confidently launders bad information through the veneer of citation.
What Authority Actually Looks Like in a Retrieval Pipeline
There is a class of fixes that treat authority as a first-class input to the scorer rather than a vibe applied to the corpus. They share a structural commitment: authority must travel with the document, not with the human who indexed it.
The first move is source-type weighting. Every document in the knowledge base gets a tag at ingestion time: engineering-rfc, marketing-page, support-macro, deprecated-wiki, customer-ticket. The retrieval scorer applies a multiplier to the cosine similarity based on the tag and the inferred query intent. Queries that look technical (they mention entities like API names, version numbers, or configuration keys) penalize marketing-tagged chunks. Queries that look like positioning questions penalize internal RFCs. The multiplier is small; you are not vetoing documents, you are nudging the ranking when authority and relevance disagree. The IEEE work on metadata-enriched retrieval shows roughly a 12-point precision lift from metadata-based pre-filtering, a stricter relative of this weighting, not because the embedding got better, but because the embedding stopped being the only signal.
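A minimal sketch of source-type weighting, under loud assumptions: the weight table, the `Hit` type, and the keyword-based `infer_intent` are all hypothetical stand-ins, and the multiplier values are illustrative rather than tuned. The similarity scores reuse the 0.91 and 0.74 from the opening example.

```python
from dataclasses import dataclass

# Hypothetical multipliers per (query intent, source type). Illustrative only.
AUTHORITY_WEIGHTS = {
    "technical": {
        "engineering-rfc": 1.0,
        "support-macro": 0.95,
        "marketing-page": 0.8,
        "deprecated-wiki": 0.6,
    },
    "positioning": {
        "marketing-page": 1.0,
        "support-macro": 0.95,
        "engineering-rfc": 0.85,
        "deprecated-wiki": 0.6,
    },
}
DEFAULT_WEIGHT = 0.9  # assumed fallback for source types missing from the table

@dataclass
class Hit:
    doc_id: str
    source_type: str
    similarity: float  # cosine similarity returned by the vector store

def infer_intent(query: str) -> str:
    """Crude keyword stand-in for a real intent classifier."""
    technical_markers = ("api", "version", "config", "tier", "endpoint")
    q = query.lower()
    return "technical" if any(m in q for m in technical_markers) else "positioning"

def rerank(query: str, hits: list[Hit]) -> list[Hit]:
    """Order hits by similarity multiplied by an intent-specific authority weight."""
    weights = AUTHORITY_WEIGHTS[infer_intent(query)]
    return sorted(
        hits,
        key=lambda h: h.similarity * weights.get(h.source_type, DEFAULT_WEIGHT),
        reverse=True,
    )

hits = [
    Hit("landing-page", "marketing-page", 0.91),
    Hit("rfc", "engineering-rfc", 0.74),
]
ranked = rerank("Does our API support multi-region writes on the free tier?", hits)
print([h.doc_id for h in ranked])
```

With these assumed weights, a 0.8 multiplier on the marketing chunk is enough to put the RFC first for a technical query, while a positioning query still ranks the landing page on top. Nothing is vetoed; the tag only breaks the tie that similarity alone gets wrong.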
The second move is conflict detection as a retrieval primitive. Instead of returning the top three chunks and shipping them to the model, the retriever returns the top three and runs a structural check: do these chunks make compatible claims? If two of the three chunks say "supported" and one says "deprecated since v3.1," the retriever does not silently average the disagreement away. It surfaces the conflict to the orchestration layer, which decides whether to escalate, to re-rank with authority weights, or to ask the model to explicitly reason about which source to trust. The recent work on conflict-driven summarization formalizes this: treat contradictory evidence as a signal that the corpus has structural disagreement, not as noise to be averaged out by the generator.
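One way to sketch that structural check, assuming an upstream step (not shown, and hypothetical) has already extracted a normalized claim from each retrieved chunk. The `Claim` shape and the subject and status vocabularies are illustrative assumptions, not a standard.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    doc_id: str
    subject: str  # normalized key for what the claim is about
    status: str   # e.g. "supported", "deprecated"

def find_conflicts(claims):
    """Group claims by subject and keep only subjects with disagreeing statuses."""
    by_subject = defaultdict(list)
    for claim in claims:
        by_subject[claim.subject].append(claim)
    return {
        subject: group
        for subject, group in by_subject.items()
        if len({c.status for c in group}) > 1
    }

top_three = [
    Claim("landing-page", "multi-region-writes/free-tier", "supported"),
    Claim("sales-faq", "multi-region-writes/free-tier", "supported"),
    Claim("rfc", "multi-region-writes/free-tier", "deprecated"),
]
conflicts = find_conflicts(top_three)
if conflicts:
    # Hand the disagreement to the orchestration layer instead of
    # letting the generator average it away.
    print("conflict on:", sorted(conflicts))
```

The hard part in practice is the claim extraction, which this sketch assumes away. The point is only that the retriever's return value can carry a conflict signal instead of a flat list of chunks.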
The third move is the unglamorous one: content-pipeline discipline. Marketing copy does not belong in the engineering knowledge base. Sales decks do not belong in the support knowledge base. Deprecated wiki pages do not belong in any knowledge base. The fix is not a smarter retriever. The fix is an ingestion gate that asks, for every document entering the index, who owns the claim this document makes, and what happens to the document when the claim changes. If nobody owns the claim, the document is a liability. If nothing happens when the claim changes, the document is already a liability that just hasn't fired yet.
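An ingestion gate of this kind can be very small. The sketch below assumes a hypothetical `Doc` record with an owner and a review deadline; the field names and the blocked-type list are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

BLOCKED_TYPES = {"deprecated-wiki"}  # assumed: never indexed, whatever the content

@dataclass
class Doc:
    doc_id: str
    source_type: str
    owner_team: Optional[str]   # who owns the claim this document makes
    review_by: Optional[date]   # when the claim must be re-confirmed

def admit(doc: Doc, today: date):
    """Return (admitted, reasons). A document with any reason is rejected."""
    reasons = []
    if doc.source_type in BLOCKED_TYPES:
        reasons.append(f"source type {doc.source_type!r} is not indexable")
    if not doc.owner_team:
        reasons.append("no owner for the claim")
    if doc.review_by is None:
        reasons.append("no review date, so nothing happens when the claim changes")
    elif doc.review_by < today:
        reasons.append("review date has passed")
    return (not reasons, reasons)

ok, why = admit(
    Doc("landing-page", "marketing-page", owner_team=None, review_by=None),
    today=date(2025, 6, 1),
)
print(ok, why)
```

The two questions from the paragraph above map directly onto the `owner_team` and `review_by` checks. A document that fails either one is rejected with a reason the submitting team can act on.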
The Freshness Problem Is the Authority Problem in Disguise
Production teams often try to patch this with a freshness score: weight recent documents more heavily, decay older ones. This helps a little and misleads a lot. A 2024 marketing page describing a deprecated feature is still recent enough to dominate retrieval over a 2022 engineering RFC that correctly described the constraint. Freshness is a proxy for authority that breaks the moment the authoritative source is older than the misleading one.
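A small worked example shows how this goes wrong. It assumes an exponential decay with a one-year half-life, which is one common way to implement a freshness score; the publication dates, the evaluation date, and the half-life are all hypothetical, and the similarities are the 0.91 and 0.74 from the opening example.

```python
from datetime import date

HALF_LIFE_DAYS = 365  # hypothetical: a document's score halves every year

def freshness_score(similarity, published, today, half_life_days=HALF_LIFE_DAYS):
    """Cosine similarity multiplied by an exponential age decay."""
    age_days = (today - published).days
    return similarity * 0.5 ** (age_days / half_life_days)

today = date(2025, 6, 1)
marketing = freshness_score(0.91, date(2024, 3, 1), today)  # 2024 marketing page
rfc = freshness_score(0.74, date(2022, 9, 1), today)        # 2022 engineering RFC

raw_gap = 0.91 / 0.74          # marketing's lead before decay
decayed_gap = marketing / rfc  # marketing's lead after decay
print(f"marketing={marketing:.3f} rfc={rfc:.3f}")
print(f"lead before decay: {raw_gap:.2f}x, after decay: {decayed_gap:.2f}x")
```

Under these assumptions the decay does not rescue the RFC. It widens the marketing page's lead, because the correct document happens to be the older one.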
The deeper structure is that authority is a graph, not a number. The engineering RFC is authoritative because the RFC owner is on-call for the system it describes, because changes to the system require updating the RFC, and because the RFC links to the deployment pipeline that enforces its claims. Strip those edges and you have a document, not a source of truth. A vector store that treats every chunk as equally authoritative is treating every document as equally rootless. Approaches like document-level provenance tracking and knowledge-graph-anchored RAG are early attempts to put those edges back, but they only work if your ingestion pipeline preserves them in the first place.
This is why the patch of "just re-rank with a cross-encoder" tends to disappoint. The cross-encoder is a more expensive relevance scorer. It is not an authority scorer. Throwing a stronger model at the wrong dimension gets you a more confident wrong ranking.
A Practical Pattern for the Next Time
If you are running a production RAG system today, the path forward is concrete:
- Tag every document at ingestion with its source type, the team that owns it, the date of the last review, and the claim-class it makes (capability, policy, pricing, status).
- Build a small intent classifier on incoming queries that routes them to the source types that should be allowed to answer. Technical questions retrieve from engineering-tagged sources first; pricing questions retrieve from finance-owned sources first.
- Surface conflicts. If the top-N chunks disagree on a factual claim, do not pick one. Show the model both, and let it reason explicitly about which to trust, citing the conflict in its answer.
- Quarantine the corpus by audience. The knowledge base your customer-facing assistant queries is not the knowledge base your engineering assistant queries. The temptation to unify them is a temptation to optimize for storage cost at the expense of trust.
- Run an eval slice that specifically tests conflicting-source scenarios. Pick ten cases where engineering and marketing disagree about the same feature. Measure whether the assistant gets it right when the conflict is present.
These patterns will not eliminate the problem. They will move it from invisible to observable, which is the only move that matters.
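The eval slice from the list above could start as small as the harness below. Everything here is a sketch under assumptions: `ConflictCase`, `run_slice`, and the `(answer, cited_doc_ids)` return shape of the assistant are hypothetical, and exact string matching stands in for whatever answer grading you already use.

```python
from dataclasses import dataclass

@dataclass
class ConflictCase:
    question: str
    expected: str           # ground truth, taken from the authoritative source
    authoritative_doc: str  # the source that should win
    misleading_doc: str     # the source that disagrees with it

def run_slice(cases, assistant):
    """assistant(question) is assumed to return (answer, cited_doc_ids)."""
    if not cases:
        raise ValueError("the eval slice needs at least one case")
    correct = 0
    cited_authoritative = 0
    cited_misleading = 0
    for case in cases:
        answer, cited = assistant(case.question)
        correct += answer.strip().lower() == case.expected.strip().lower()
        cited_authoritative += case.authoritative_doc in cited
        cited_misleading += case.misleading_doc in cited
    n = len(cases)
    return {
        "accuracy": correct / n,
        "cites_authoritative": cited_authoritative / n,
        "cites_misleading": cited_misleading / n,
    }

def stub_assistant(question):
    # Stand-in that always trusts the landing page; replace with a real call.
    return "yes", ["landing-1"]

cases = [
    ConflictCase("Multi-region writes on the free tier?", "no", "rfc-1", "landing-1"),
    ConflictCase("Multi-region writes on the paid tier?", "yes", "rfc-2", "landing-2"),
]
report = run_slice(cases, stub_assistant)
print(report)
```

Tracking which source was cited separately from whether the answer was right matters here: an assistant can land on the correct answer while citing the misleading document, and that is still a failure waiting to happen.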
The Knowledge Strategy You Don't Have
"We put everything in the vector store" is not a knowledge strategy. It is the absence of one. It defers the question of what your organization considers authoritative to a similarity function that was never asked the question.
The assistants that are quietly more accurate in production are not the ones with the fanciest retrievers or the largest context windows. They are the ones whose teams treated their knowledge base as a curated artifact with owners, expiration dates, and conflict-resolution rules. The retrieval pipeline is downstream of those decisions. When the decisions are missing, the retriever fills the vacuum with whatever has the highest cosine similarity, and a marketing page from three years ago will beat a quietly correct RFC every time.
The next time a customer-facing answer looks confidently wrong, do not start by tuning the embedding model. Start by asking which source the model cited and what its claim-class is. The bug is rarely in the retriever. It is in the corpus the retriever was handed and the silent assumption that putting a document in a vector store was the same as endorsing it.
References
- https://medium.com/@duckweave/rag-retrieval-relevant-docs-wrong-answers-24f736b56386
- https://www.brainfishai.com/blog/rag-accuracy-degradation-in-production
- https://arxiv.org/pdf/2507.01281
- https://arxiv.org/pdf/2511.10375
- https://arxiv.org/pdf/2510.12460
- https://arxiv.org/pdf/2512.05411
- https://www.dataquest.io/blog/metadata-filtering-and-hybrid-search-for-vector-databases/
- https://docs.weaviate.io/weaviate/search/hybrid
- https://community.databricks.com/t5/technical-blog/the-ultimate-guide-to-chunking-strategies-for-rag-applications/ba-p/113089
- https://www.regal.ai/blog/rag-playbook-structuring-knowledge-bases
