
The Query Rewriting Layer Your RAG Pipeline Skipped

10 min read
Tian Pan
Software Engineer

When a RAG system answers wrong, the first instinct on most teams is to blame the encoder. Swap to a bigger embedding model. Try a domain-tuned one. Bump the dimension count. Three sprints later the recall curve has nudged a few points and the user complaints look the same.

The diagnosis was wrong. Most retrieval failures aren't embedding failures. They're query-shape failures — and no amount of vector tuning fixes a vocabulary mismatch that exists before the encoder ever runs.

A user types "how do I cancel." The relevant document is titled "Subscription Lifecycle Management" and uses words like "termination," "billing cycle close," and "service deactivation." There is no encoder in the world that pulls those two strings into the same neighborhood by lexical luck. The cosine similarity gap is real, and it lives in the input, not the model. The query rewriting layer that sits in front of retrieval is the thing most pipelines skip and then spend a quarter trying to compensate for downstream.
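If you want to see the gap concretely, here is a minimal way to measure it yourself. This sketch assumes the sentence-transformers library and an off-the-shelf checkpoint (both are illustrative choices, not a recommendation); the rewritten query string is an invented example of document-vocabulary phrasing.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint, any encoder works

doc = ("Subscription Lifecycle Management: termination, billing cycle close, "
       "and service deactivation.")
raw_query = "how do I cancel"
rewritten = "terminate subscription and deactivate service at billing cycle close"

d, q, r = model.encode([doc, raw_query, rewritten], convert_to_tensor=True)
print("raw query vs doc:      ", util.cos_sim(q, d).item())
print("rewritten query vs doc:", util.cos_sim(r, d).item())
```

Compare the two numbers on your own corpus; the spread between them is the thing the rest of this post is about.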

The Vocabulary Gap You Keep Misdiagnosing

Standard RAG retrieval treats the user's question as a finished query. It encodes it, dots it against your index, and returns the top-k. This works on the demo dataset where someone wrote both the questions and the docs. It breaks in production because real users write the way people write to humans: short, underspecified, full of pronouns and implicit context, often phrased as a problem rather than a topic.
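For reference, that vanilla step is essentially the following. A minimal sketch, assuming a pre-built matrix of L2-normalized document embeddings:

```python
import numpy as np

def vanilla_retrieve(query_vec, doc_matrix, k=5):
    """Dot the query against every L2-normalized doc embedding, return top-k indices."""
    sims = doc_matrix @ query_vec           # cosine similarity when both sides are normalized
    top = np.argpartition(-sims, k)[:k]     # top-k, unordered
    return top[np.argsort(-sims[top])]      # ordered best-first
```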

Documents, on the other hand, are written by people who think they're explaining something. They're long, structured, full of section headers and abstract nouns, and almost never start with "how do I." The geometry of the embedding space reflects this asymmetry. Short user queries cluster with other short user queries. Long expository passages cluster with other long expository passages. The distance between a real query and the doc that actually answers it is often larger than the distance between two unrelated docs.

This is why fine-tuning the embedding model usually disappoints. You can squeeze 5–10% recall out of a domain-tuned encoder if you have labeled pairs and patience, but you're optimizing the wrong end of the pipeline. The cheaper move is to reshape the query so it lives in the same neighborhood as the documents. That's what query rewriting does.

HyDE: Embed the Answer You Wish You Had

Hypothetical Document Embeddings — HyDE — flips the retrieval problem on its head. Instead of embedding the question, you ask an LLM to write a fake answer to the question, then embed that and retrieve against it.

The intuition is geometric. A real answer document and a hallucinated one are written in the same register: same length, same vocabulary, same noun-phrase density, same expository tone. They sit in the same region of embedding space. A short user question does not. So the hallucinated answer acts as a translator that walks the query into the document neighborhood before the index ever gets touched.

HyDE doesn't need the hypothetical to be factually correct. It only needs to be stylistically and lexically representative. The LLM can confidently invent a wrong answer and the retrieval still improves, because what you're matching on is shape, not truth. Reported gains range from a few points to tens of points of precision and recall depending on the corpus, with the largest lifts on domains where the user vocabulary diverges most from the document vocabulary — exactly the case where fine-tuning the encoder would have been most expensive.
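In code, HyDE is a few lines. A minimal sketch, with llm_complete, embed, and search standing in for whatever LLM client and vector index you already have (all three are assumed hooks, not a specific library):

```python
def hyde_retrieve(question, llm_complete, embed, search, k=5):
    """HyDE: retrieve with the embedding of a hypothetical answer, not the question."""
    prompt = (
        "Write a short passage that plausibly answers the question below. "
        "It does not need to be factually correct, only written like a real document.\n\n"
        f"Question: {question}"
    )
    hypothetical = llm_complete(prompt)       # the one extra LLM call
    return search(embed(hypothetical), k=k)   # retrieve against the fake answer's vector
```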

The cost is one extra LLM call before retrieval. On a small model that's 200–500ms. The trade is real, and on latency-critical paths you may want to gate HyDE behind a confidence check (run vanilla retrieval first; if the top-1 similarity score is below a threshold, fall back to HyDE and re-retrieve). For batch ingestion or analytical queries, you almost always pay the latency without thinking twice.
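The gated variant described above might look like this. The 0.35 threshold is an arbitrary placeholder; you would tune it against your own score distribution, and search is assumed to return (doc_id, similarity) pairs.

```python
def gated_retrieve(question, llm_complete, embed, search, k=5, min_top1_sim=0.35):
    """Run vanilla retrieval first; escalate to HyDE only when the best hit looks weak."""
    hits = search(embed(question), k=k)
    if hits and hits[0][1] >= min_top1_sim:
        return hits                            # cheap path: no extra LLM call
    hypothetical = llm_complete(
        f"Write a short passage that plausibly answers: {question}"
    )
    return search(embed(hypothetical), k=k)    # HyDE fallback for low-confidence queries
```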

Decomposition: One Question Hides Three Sub-Questions

Multi-hop questions break retrieval in a different way. "Which of the engineering managers reporting to the CTO joined before the Series B?" cannot be answered by a single document. The answer requires three retrievals — who is the CTO, who reports to them, when did each join — and the embedding of the original question doesn't point at any of those individually. It points at a vague middle that contains pieces of all three and is the right document for none.

Decomposition rewrites the multi-hop query into a list of single-hop sub-queries, retrieves for each, then merges. Done in one shot ("split this into independent sub-questions"), it already beats single-query retrieval. Done iteratively — retrieve, look at what you found, generate the next sub-query conditioned on the partial answer — it can approach the recall of much heavier multi-hop systems on standard benchmarks while pulling fewer chunks per step.
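A minimal one-shot version, again with llm_complete, embed, and search as assumed hooks; the JSON-list output format is just one convenient way to get a parseable split:

```python
import json

def decompose_and_retrieve(question, llm_complete, embed, search, k_per_hop=3):
    """One-shot decomposition: split into single-hop sub-queries, retrieve each, merge."""
    prompt = (
        "Split the question into the smallest set of independent, single-hop "
        "sub-questions needed to answer it. Reply with a JSON list of strings.\n\n"
        f"Question: {question}"
    )
    sub_questions = json.loads(llm_complete(prompt))
    merged, seen = [], set()
    for sq in sub_questions:
        for doc_id, score in search(embed(sq), k=k_per_hop):
            if doc_id not in seen:             # de-duplicate across sub-queries
                seen.add(doc_id)
                merged.append((doc_id, score))
    return sub_questions, merged
```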

The failure mode to watch for is over-decomposition. An LLM asked to decompose will happily split a single-hop question into three redundant sub-queries, tripling your retrieval cost for no gain. The cheap mitigation is a router: a small classifier that decides whether a question is single-hop or multi-hop before you reach for the decomposer. The cost-aware version of this routes simple questions through plain retrieval and only escalates the complex ones, which keeps p50 latency flat and p99 honest.
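The router can start life as a single cheap classification call; a small trained classifier does the same job with less latency once you have traffic to train on. A sketch, reusing decompose_and_retrieve from the previous snippet:

```python
def route_and_retrieve(question, llm_complete, embed, search):
    """Cheap router: escalate to decomposition only for multi-hop questions."""
    verdict = llm_complete(
        "Does answering this question require combining facts from more than one "
        f"document? Answer only 'yes' or 'no'.\n\nQuestion: {question}"
    )
    if verdict.strip().lower().startswith("yes"):
        return decompose_and_retrieve(question, llm_complete, embed, search)
    return search(embed(question), k=5)        # single-hop: plain retrieval, no extra cost
```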

Multi-Query Fan-Out and Rank Fusion

Even on a single-hop question, the user's phrasing is one sample from a distribution of plausible phrasings. "How do I cancel" has siblings: "ending my subscription," "stop being billed," "remove my plan." Each sibling embeds to a slightly different point. Each retrieves a slightly different top-k. The intersection across phrasings is more reliable than any single retrieval.

Multi-query fan-out generates N paraphrases of the user's query, retrieves for each, and fuses the results. The standard fusion algorithm is Reciprocal Rank Fusion (RRF), which scores each document by the sum of 1/(k + rank) across the lists it appears in, with k typically set to 60. Documents that show up high in multiple lists win; documents that show up high in only one are filtered out as topic drift.
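RRF itself is a handful of lines. A sketch of the fan-out-and-fuse loop, with the same assumed hooks as before and k=60 as in the standard formulation:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """score(d) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def multi_query_retrieve(question, llm_complete, embed, search, n_paraphrases=4, k=10):
    """Fan out over paraphrases, retrieve per paraphrase, fuse with RRF."""
    prompt = f"Write {n_paraphrases} different phrasings of this question, one per line:\n{question}"
    queries = [question] + [q for q in llm_complete(prompt).splitlines() if q.strip()]
    ranked_lists = [[doc_id for doc_id, _ in search(embed(q), k=k)] for q in queries]
    return reciprocal_rank_fusion(ranked_lists)
```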

RRF is rank-based rather than score-based, which is why it works well when fusing retrievers that produce incomparable scores — dense vector similarity and BM25, for example. It also requires no training data. It is one constant and a sort. RAG-Fusion, the architecture that combines multi-query generation with RRF, has reported answer-quality lifts of around 8–10% in accuracy and 30–40% in comprehensiveness over vanilla RAG, which is more than most embedding swaps deliver.

The architectural lesson is that retrieval recall is not maximized by one perfect query. It is maximized by an ensemble of plausible queries whose intersection filters the noise. The query rewriter is the thing that creates the ensemble.

The Latency-Cost Tax — and Why It Pays Off

Every technique above adds an LLM call before retrieval, and engineers are right to be suspicious. Doubling p50 latency to add 20% recall is not free. The honest accounting:

  • HyDE adds one call. Use a small fast model (the rewriter doesn't need a frontier model). Cache by query hash for repeated questions.
  • Multi-query fan-out adds one call to generate paraphrases plus N parallel retrievals. The retrievals fan out, so wall-clock cost is one retrieval round, not N (see the caching and fan-out sketch after this list).
  • Decomposition adds one call to split, then sequential or parallel retrieval depending on dependency. The router that gates decomposition is the budget-saver.
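As a rough illustration of the first two points, a sketch that caches the paraphrase call by query hash and runs the retrievals concurrently; search_async is an assumed async hook into your index, and the in-process dict is a stand-in for whatever cache you actually run:

```python
import asyncio
import hashlib

_rewrite_cache: dict[str, list[str]] = {}   # query hash -> paraphrases

async def fan_out_retrieve(question, llm_complete, embed, search_async, k=10):
    """Cache the paraphrase call by query hash; run the N retrievals concurrently."""
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _rewrite_cache:
        lines = llm_complete(
            f"Write 4 different phrasings of this question, one per line:\n{question}"
        ).splitlines()
        _rewrite_cache[key] = [question] + [ln for ln in lines if ln.strip()]
    queries = _rewrite_cache[key]
    # Wall-clock cost is one retrieval round, not len(queries).
    ranked_lists = await asyncio.gather(*(search_async(embed(q), k=k) for q in queries))
    return list(ranked_lists)
```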

For interactive chat, the budget is tight and gating is essential. For analytical, batch, or agent-driven retrieval where the tool call already takes seconds, the rewriter cost is a rounding error. The teams that report bad ROI on rewriting are usually the teams that turned it on globally without a router, billed the latency tax to every "what time is it" query, and concluded the technique was expensive. The technique was fine; the deployment was undisciplined.

The cost comparison that matters is rewriter vs. encoder fine-tune. Fine-tuning an embedding model requires labeled pairs, a training pipeline, an evaluation harness, and a re-embedding pass over the corpus every time you ship a new model. A query rewriter is fifty lines and a prompt. The rewriter ships in a week. The encoder fine-tune ships in a quarter. The recall they buy you is often comparable, and the rewriter is a strict prerequisite for the fine-tune to even be measurable — if your queries are misshapen, you're tuning the encoder against noise.

Evaluation: Separate "Wrong Doc Found" from "Right Doc Never Asked For"

The reason teams misdiagnose retrieval failures is that their evaluation lumps two distinct error classes together. A failed retrieval can mean:

  1. The retriever found the wrong doc. The query was reasonable; the index returned an irrelevant document. This is a retriever or encoder problem.
  2. The query never asked for the right doc. The user's phrasing is so far from the document vocabulary that no retriever could have bridged the gap. This is a query-shape problem.

A single recall@k metric can't tell these apart, which is why teams keep buying encoder upgrades when they need rewriters. Split the evaluation (a sketch of this diagnostic follows the list):

  • Run vanilla retrieval. Measure recall@k.
  • Run an oracle rewriter — a human or strong LLM rewriting each query into something close to the target document's vocabulary. Measure recall@k again.
  • The gap between the two is your query-shape ceiling. If the gap is large, your encoder is fine; your queries are misshapen and a rewriter will close most of the gap.
  • If the gap is small but absolute recall is still low, the encoder is genuinely the bottleneck and fine-tuning is justified.
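A minimal sketch of that diagnostic, assuming an eval set of (query, relevant_doc_id) pairs and an oracle_rewrite function backed by a human or a strong LLM; embed and search are the same assumed hooks as earlier:

```python
def recall_at_k(pairs, embed, search, k=10):
    """pairs: list of (query, relevant_doc_id)."""
    hits = 0
    for query, target in pairs:
        retrieved = [doc_id for doc_id, _ in search(embed(query), k=k)]
        hits += target in retrieved
    return hits / len(pairs)

def query_shape_gap(eval_set, embed, search, oracle_rewrite, k=10):
    """The gap between oracle-rewritten and raw recall@k is your query-shape ceiling."""
    vanilla = recall_at_k(eval_set, embed, search, k)
    oracle = recall_at_k([(oracle_rewrite(q), t) for q, t in eval_set], embed, search, k)
    return {"vanilla": vanilla, "oracle": oracle, "gap": oracle - vanilla}
```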

This is the diagnostic that prevents the "three sprints fine-tuning the embedding model" failure. Most teams that run it discover their encoder was already adequate and a fifty-line query rewriter would have won the same gain in a week.

What to Build First

If you're sitting on a RAG pipeline that's plateaued and considering an encoder swap, work in this order:

  1. Build the diagnostic split above. It takes a day. It tells you whether you have a query-shape problem or a retrieval problem.
  2. If the gap is large, ship a rewriter. Start with multi-query fan-out plus RRF — it's the lowest-effort, highest-yield technique and requires no per-domain tuning. It also stacks cleanly with anything you do later.
  3. Add HyDE on the queries that vanilla retrieval scores low confidence on. Gate it behind a similarity threshold so you only pay the latency where it matters.
  4. Add a router and decomposer for multi-hop questions. This is the highest-effort step; do it last and only after you've confirmed multi-hop questions are a meaningful share of your traffic.
  5. Then, if the residual gap is real, fine-tune the encoder. You'll be tuning against well-shaped queries, which is the only condition under which the result is interpretable.

The query rewriting layer is the cheapest, most undervalued piece of RAG architecture. Skipping it doesn't make your pipeline simpler; it makes the rest of the pipeline carry weight it shouldn't have to. Build the rewriter first, measure honestly, and the encoder question usually answers itself.
