
The Query Rewriting Layer Your RAG Pipeline Skipped

· 10 min read
Tian Pan
Software Engineer

When a RAG system answers wrong, the first instinct on most teams is to blame the encoder. Swap to a bigger embedding model. Try a domain-tuned one. Bump the dimension count. Three sprints later the recall curve has nudged a few points and the user complaints look the same.

The diagnosis was wrong. Most retrieval failures aren't embedding failures. They're query-shape failures — and no amount of vector tuning fixes a vocabulary mismatch that exists before the encoder ever runs.

A user types "how do I cancel." The relevant document is titled "Subscription Lifecycle Management" and uses words like "termination," "billing cycle close," and "service deactivation." There is no encoder in the world that pulls those two strings into the same neighborhood by lexical luck. The cosine similarity gap is real, and it lives in the input, not the model. The query rewriting layer that goes ahead of retrieval is the thing most pipelines skip and then spend a quarter trying to compensate for downstream.

The Vocabulary Gap You Keep Misdiagnosing

Standard RAG retrieval treats the user's question as a finished query. It encodes it, scores it against your index by dot product or cosine similarity, and returns the top-k. This works on the demo dataset where someone wrote both the questions and the docs. It breaks in production because real users write the way people write to humans: short, underspecified, full of pronouns and implicit context, often phrased as a problem rather than a topic.

Documents, on the other hand, are written by people who think they're explaining something. They're long, structured, full of section headers and abstract nouns, and almost never start with "how do I." The geometry of the embedding space reflects this asymmetry. Short user queries cluster with other short user queries. Long expository passages cluster with other long expository passages. As a result, the distance between a real query and the doc that actually answers it can be larger than the distance between two unrelated docs.

This is why fine-tuning the embedding model usually disappoints. You can squeeze 5–10% recall out of a domain-tuned encoder if you have labeled pairs and patience, but you're optimizing the wrong end of the pipeline. The cheaper move is to reshape the query so it lives in the same neighborhood as the documents. That's what query rewriting does.

HyDE: Embed the Answer You Wish You Had

Hypothetical Document Embeddings — HyDE — flips the retrieval problem on its head. Instead of embedding the question, you ask an LLM to write a fake answer to the question, then embed that and retrieve against it.

The intuition is geometric. A real answer document and a hallucinated one are written in the same register: same length, same vocabulary, same noun-phrase density, same expository tone. They sit in the same region of embedding space. A short user question does not. So the hallucinated answer acts as a translator that walks the query into the document neighborhood before the index ever gets touched.

HyDE doesn't need the hypothetical to be factually correct. It only needs to be stylistically and lexically representative. The LLM can confidently invent a wrong answer and the retrieval still improves, because what you're matching on is shape, not truth. Reported gains range from a few points to tens of points of precision and recall depending on the corpus, with the largest lifts on domains where the user vocabulary diverges most from the document vocabulary — exactly the case where fine-tuning the encoder would have been most expensive.

The cost is one extra LLM call before retrieval. On a small model that's typically 200–500 ms. The trade is real, and on latency-critical paths you may want to gate HyDE behind a confidence check (run vanilla retrieval first; if the top-1 similarity score is below a threshold, fall back to HyDE and re-retrieve). For batch ingestion or analytical queries, you can almost always afford the extra latency without thinking twice.
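The gating idea can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `embed`, `search`, and `generate` are hypothetical callables standing in for whatever encoder, vector index, and LLM client you already use, and the `0.75` threshold is an assumed placeholder that would need tuning on your own score distribution.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Tuple[str, float]  # (doc_id, similarity score), best hit first


def retrieve_with_hyde_fallback(
    question: str,
    embed: Callable[[str], Vector],              # hypothetical: your encoder
    search: Callable[[Vector, int], List[Hit]],  # hypothetical: your vector index
    generate: Callable[[str], str],              # hypothetical: your LLM client
    k: int = 5,
    min_top1_score: float = 0.75,  # assumed threshold; tune on your own data
) -> List[Hit]:
    """Run plain retrieval first; re-retrieve with HyDE only if top-1 looks weak."""
    hits = search(embed(question), k)
    if hits and hits[0][1] >= min_top1_score:
        return hits  # confident enough, so skip the extra LLM call

    # HyDE: ask the model for a plausible answer passage and embed that instead.
    prompt = (
        "Write a short documentation passage that answers the question. "
        "It does not need to be factually correct.\n\n"
        f"Question: {question}"
    )
    hypothetical_answer = generate(prompt)
    return search(embed(hypothetical_answer), k)
```

Passing the three dependencies in as callables keeps the gate independent of any particular vendor SDK, and it means the extra LLM call is only paid on the queries where plain retrieval already looks weak.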

Decomposition: One Question Hides Three Sub-Questions

Multi-hop questions break retrieval in a different way. "Which of the engineering managers reporting to the CTO joined before the Series B?" cannot be answered by a single document. The answer requires three retrievals — who is the CTO, who reports to them, when did each join — and the embedding of the original question doesn't point at any of those individually. It points at a vague middle that contains pieces of all three and is the right document for none.

Decomposition rewrites the multi-hop query into a list of single-hop sub-queries, retrieves for each, then merges. Done in one shot ("split this into independent sub-questions"), it already beats single-query retrieval. Done iteratively (retrieve, look at what you found, generate the next sub-query conditioned on the partial answer), it can match the recall of much heavier multi-hop systems while pulling fewer chunks per step.
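The iterative variant is essentially a bounded loop. The sketch below assumes two hypothetical callables: `next_subquery`, which would wrap an LLM prompt and return the next sub-question given the evidence so far (or `None` when it has enough), and `retrieve`, which returns text chunks for a single sub-query. The `max_hops` cap is an assumption added here to keep cost bounded.

```python
from typing import Callable, List, Optional


def iterative_retrieve(
    question: str,
    next_subquery: Callable[[str, List[str]], Optional[str]],  # hypothetical LLM wrapper
    retrieve: Callable[[str], List[str]],                      # hypothetical retriever
    max_hops: int = 3,  # assumed cap so a confused model cannot loop forever
) -> List[str]:
    """Retrieve hop by hop, conditioning each sub-query on the evidence so far."""
    evidence: List[str] = []
    seen = set()
    for _ in range(max_hops):
        sub_query = next_subquery(question, evidence)
        if sub_query is None:
            break  # the model signalled it has enough to answer
        for chunk in retrieve(sub_query):
            if chunk not in seen:  # merge step: drop chunks already collected
                seen.add(chunk)
                evidence.append(chunk)
    return evidence
```

For the CTO example, the first hop would ask who the CTO is, the second who reports to that person, and the third when each of them joined, with every hop able to use names surfaced by the previous one.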

The failure mode to watch for is over-decomposition. An LLM asked to decompose will happily split a single-hop question into three redundant sub-queries, tripling your retrieval cost for no gain. The cheap mitigation is a router: a small classifier that decides whether a question is single-hop or multi-hop before you reach for the decomposer. The cost-aware version of this routes simple questions through plain retrieval and only escalates the complex ones, which keeps p50 latency flat and p99 honest.
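A router of this kind can be very small. In the sketch below, `looks_multi_hop` is a deliberately crude keyword heuristic used only as a placeholder for the small classifier described above; the cue list, `plain_retrieve`, and `decomposed_retrieve` are all assumptions made for illustration.

```python
from typing import Callable, List

# Placeholder cues, assumed for illustration only. A trained classifier
# would replace this list in any real deployment.
MULTI_HOP_CUES = ("which of", "reporting to", "before", "after", "compared to")


def looks_multi_hop(question: str) -> bool:
    """Crude stand-in for a single-hop vs multi-hop classifier."""
    q = question.lower()
    return any(cue in q for cue in MULTI_HOP_CUES)


def route(
    question: str,
    plain_retrieve: Callable[[str], List[str]],       # hypothetical cheap path
    decomposed_retrieve: Callable[[str], List[str]],  # hypothetical expensive path
    is_multi_hop: Callable[[str], bool] = looks_multi_hop,
) -> List[str]:
    """Send simple questions down the cheap path; escalate only the complex ones."""
    if is_multi_hop(question):
        return decomposed_retrieve(question)
    return plain_retrieve(question)
```

With these placeholder cues, "how do I cancel" stays on the plain path, while the CTO question above is escalated to decomposition. Swapping in a real classifier only requires passing a different `is_multi_hop`.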

Multi-Query Fan-Out and Rank Fusion

Even on a single-hop question, the user's phrasing is one sample from a distribution of plausible phrasings. "How do I cancel" has siblings: "ending my subscription," "stop being billed," "remove my plan." Each sibling embeds to a slightly different point. Each retrieves a slightly different top-k. Documents that the phrasings agree on are more reliable than the results of any single retrieval.

Multi-query fan-out generates N paraphrases of the user's query, retrieves for each, and fuses the results. The standard fusion algorithm is Reciprocal Rank Fusion (RRF), which scores each document by the sum of 1/(k + rank) across the lists it appears in, with k typically set to 60. Documents that show up high in multiple lists win; documents that show up high in only one list collect a single small term and sink in the fused ranking, which dampens topic drift.
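The scoring rule is short enough to write out directly. The sketch below assumes each input list holds unique document IDs ordered best first, with ranks starting at 1; the function name and the alphabetical tie-break are choices made here for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    ranked_lists: Iterable[Sequence[str]],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse several rankings: score(doc) = sum of 1 / (k + rank) over all lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# One ranking per paraphrase of the user's query.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "e", "a"],
])
```

Here `b` ranks first because it appears at or near the top of all three lists, `a` follows, and `c`, `d`, and `e`, which each appear in only one list, trail behind.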

RRF is rank-based rather than score-based, which is why it works well when fusing retrievers that produce incomparable scores — dense vector similarity and BM25, for example. It also requires no training data. It is one constant and a sort. RAG-Fusion, the architecture that combines multi-query generation with RRF, has reported answer-quality lifts of around 8–10% in accuracy and 30–40% in comprehensiveness over vanilla RAG, which is more than most embedding swaps deliver.
