The Query Rewrite Layer Your RAG System Is Missing
Most teams tuning a RAG system focus on two levers: chunking strategy and embedding model selection. When retrieval quality degrades, they re-chunk. When recall numbers look bad, they upgrade the embedding model. Both are reasonable moves — but they're optimizing the middle of the pipeline while leaving the highest-leverage point untouched.
The user's query is almost never in the ideal form for vector retrieval. It's terse, colloquial, ambiguous, or assumes context that the index doesn't have. No matter how good your embeddings are, if you're searching with a poorly formed query, you're going to retrieve poorly. The fix isn't downstream — it's transforming the query before it reaches the vector index.
This is what a query rewrite layer does. It's an extra step (sometimes an extra LLM call) that transforms the raw user input into one or more better-formed retrieval queries. Teams that implement it consistently see retrieval recall improvements that dwarf what they gained from chunking experiments. The engineering tradeoff — the added latency and cost of an additional LLM call — is real, but it's often the wrong reason to skip it.
Why Raw User Queries Fail Vector Search
Vector search works by finding documents whose embeddings are close to the query embedding in semantic space. The implicit assumption is that the user's query and the relevant document share similar semantic territory. In practice, they often don't.
Users write questions conversationally. They reference prior context from a session. They use domain terms that differ from the terminology in your documents. They ask broad questions when they need specific answers, or specific questions when the relevant knowledge is framed at a higher level of abstraction.
The embedding model can't fix this. It faithfully encodes whatever query you give it. If your query is "how do I fix the auth thing from last week," the embedding will capture that meaning — and it won't match your documentation that says "resolving JWT expiration errors in the session middleware."
Query rewriting solves the vocabulary mismatch problem at the source.
HyDE: Search With the Answer, Not the Question
HyDE (Hypothetical Document Embeddings) is the most counterintuitive of the query rewriting techniques. Instead of embedding the user's question and searching for similar documents, you prompt an LLM to generate a hypothetical document that would answer the question — and then embed that hypothetical document for retrieval.
The insight is that questions and answers occupy different regions of semantic space. A document explaining JWT expiration errors looks much more like other documents about JWT expiration than it looks like a user question about authentication problems. By generating a plausible (not necessarily accurate) answer, you're searching in the same semantic region as the target documents.
Factual accuracy in the hypothetical document isn't required. HyDE works because it captures the form and vocabulary of an answer, not because it gets the content right. A hallucinated but plausible-sounding document still occupies semantically useful territory.
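The whole technique fits in a few lines once the model calls are abstracted away. A minimal sketch, assuming injected `llm_complete` and `embed` callables (placeholder names, not any specific library's API):

```python
# Minimal HyDE sketch. `llm_complete` and `embed` are injected so the
# retrieval logic stays model-agnostic; both are placeholders, not a
# specific vendor's API. The prompt wording is illustrative only.

HYDE_PROMPT = (
    "Write a short passage that directly answers the question below, in "
    "the style of technical documentation. Plausible guesses are fine; "
    "form and vocabulary matter more than factual accuracy.\n\n"
    "Question: {question}\nPassage:"
)

def hyde_query_vector(question, llm_complete, embed):
    """llm_complete: prompt -> str; embed: text -> vector."""
    hypothetical_doc = llm_complete(HYDE_PROMPT.format(question=question))
    # Embed the hypothetical *answer*, not the question: the resulting
    # vector sits in the same semantic region as real answer documents.
    return embed(hypothetical_doc)
```

The returned vector replaces the raw query embedding in your existing nearest-neighbor search; nothing downstream changes.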
The performance gains are meaningful. On the BEIR benchmark, HyDE improves average recall by around 6 percentage points over standard dense retrieval. More recent variants push higher — HyPE (2025) reported precision improvements of up to 42 percentage points on specific datasets compared to baseline retrieval.
The tradeoff is latency. On smaller LLMs in the 1–4B parameter range, HyDE adds 43–60% to query-time latency. On larger models, the latency cost is real but more manageable if you can run inference on dedicated hardware. The practical implication: HyDE shines when your queries and documents have a significant vocabulary gap, and it's worth the latency cost when retrieval failure has high downstream impact (wrong answer, not just incomplete answer). It works best for factoid questions with direct answers. For personal, ambiguous, or highly context-dependent queries, the hypothetical document generation can veer off-target and hurt precision.
A production-safe approach: use similarity confidence thresholds. If the embedding similarity between a raw query and retrieved documents is above a threshold, skip HyDE. Only invoke it when baseline retrieval looks weak.
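That gating logic is simple to express. A sketch, where the 0.75 threshold and the `search`/`rewrite` callables are illustrative assumptions to be tuned against your own similarity distribution:

```python
# Confidence-gated HyDE: only pay for the extra LLM call when baseline
# retrieval looks weak. The 0.75 default is an assumption, not a
# universal constant; tune it on your own similarity scores.

def retrieve_with_optional_hyde(question, search, rewrite, threshold=0.75):
    """search(query) -> list of (doc, similarity), best first;
    rewrite(question) -> HyDE-style rewritten query string."""
    baseline = search(question)
    top_score = baseline[0][1] if baseline else 0.0
    if top_score >= threshold:
        return baseline               # fast path: raw query was good enough
    return search(rewrite(question))  # slow path: invoke the HyDE rewrite
```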
Step-Back Prompting: Retrieve the Background, Not Just the Answer
Step-back prompting takes the opposite approach from HyDE. Instead of generating a hypothetical answer, it generates a more abstract version of the question — a step up the ladder of abstraction — and retrieves background knowledge that frames the answer.
If a user asks "why did the authentication middleware start returning 503 errors after the Tuesday deploy?", step-back prompting might generate: "what are common causes of authentication service failures during deployments?" The retrieval then surfaces foundational knowledge about deployment-related auth failures — runbooks, architecture docs, known failure modes — before any attempt at a specific answer.
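A minimal sketch of that flow, again with injected callables; the prompt wording is an illustration, not a canonical step-back prompt, and retrieving for both the abstract and the original question is one common variant:

```python
# Step-back retrieval sketch: abstract the question, retrieve background
# for the abstraction, and keep the original query's specifics alongside.
# `llm_complete` and `search` are placeholder callables.

STEP_BACK_PROMPT = (
    "Rewrite the question below as a more general question about the "
    "underlying concepts, dropping incident-specific details.\n\n"
    "Question: {question}\nGeneral question:"
)

def step_back_retrieve(question, llm_complete, search, k=5):
    abstract_q = llm_complete(STEP_BACK_PROMPT.format(question=question))
    background = search(abstract_q, k)   # runbooks, architecture docs
    specifics = search(question, k)      # keep the concrete details too
    return background + specifics        # both sets go to the generator
```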
This is most useful for queries that require synthesis across multiple knowledge sources, or where the user's specific question can only be answered after establishing context. Complex diagnostic questions, architectural decisions, and questions that embed unstated assumptions are good candidates.
The risk is over-abstraction. Step-back works when the abstracted question genuinely covers the relevant territory. If the abstraction loses the specifics that matter — the particular middleware, the specific error code, the timestamp — it retrieves correct but irrelevant background and the generation step has to work harder to arrive at a useful answer.
Sub-Query Decomposition: Multi-Hop Retrieval Done Right
Complex questions often contain multiple independent information needs packed into one query. "What are the latency implications of enabling query rewriting in a RAG pipeline, and how do the costs compare to improving retrieval with a better embedding model?" is really two questions: one about latency, one about comparative costs. Treating it as a single retrieval query means the vector index has to find documents that address both simultaneously — a much harder constraint than finding documents for each independently.
Sub-query decomposition breaks the original query into atomic sub-queries, executes retrieval for each, and then synthesizes the results before generation. There are two modes:
Parallel decomposition runs all sub-queries simultaneously and merges results. It's faster (total latency ≈ single sub-query latency, not sum) and works when the sub-queries are independent — when answering one doesn't depend on knowing the answer to another.
Sequential decomposition uses the answer to one sub-query to inform the next. This handles dependent reasoning chains: "What version of the library are we using, and what are the known vulnerabilities in that version?" You need the answer to the first before you can answer the second. The tradeoff is latency proportional to chain depth — N sequential retrievals at N× the latency.
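The two modes can be sketched side by side. Here `decompose`, `search`, and `answer` are placeholder callables for the LLM decomposer, the vector search, and a per-hop answering step; the merge and context-folding strategies are illustrative choices:

```python
# Sketch of the two decomposition modes. All callables are placeholders.
from concurrent.futures import ThreadPoolExecutor

def parallel_decompose_retrieve(question, decompose, search):
    sub_queries = decompose(question)
    # Independent sub-queries: fan out, so total latency is roughly one
    # retrieval round-trip rather than the sum.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(search, sub_queries))
    # Merge, preserving order and de-duplicating documents.
    seen, merged = set(), []
    for hits in results:
        for doc in hits:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def sequential_decompose_retrieve(question, decompose, search, answer):
    sub_queries = decompose(question)
    context, docs = "", []
    for sq in sub_queries:
        # Dependent chain: fold the previous hop's answer into this query,
        # at the cost of N sequential retrieval round-trips.
        hits = search(f"{sq} {context}".strip())
        docs.extend(hits)
        context = answer(sq, hits)
    return docs
```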
Sub-query decomposition is the right default for multi-hop questions and queries with multiple distinct facets. The main failure mode is over-decomposition: breaking a simple question into unnecessary sub-queries adds latency with no recall benefit. A lightweight pre-classifier to route queries — simple vs. complex, single-intent vs. multi-facet — prevents this and keeps the fast path fast.
Multi-Query Expansion: Recall at the Cost of Throughput
Multi-query expansion (sometimes called RAG-Fusion) is the brute-force option. Generate N rephrased versions of the query, run all N through retrieval, and fuse the results using Reciprocal Rank Fusion (RRF). RRF combines ranked lists in a way that surfaces documents appearing in multiple result sets — a rough consensus signal for relevance.
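RRF itself is only a few lines. A sketch over document-id rankings, using the k=60 smoothing constant from the original RRF paper:

```python
# Reciprocal Rank Fusion: each document scores 1/(k + rank) per list it
# appears in, so documents near the top of many lists win. k=60 is the
# conventional constant from the original RRF paper.

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: list of ranked document-id lists, best first.
    Returns a single fused ranking, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked second in all three lists will typically beat one ranked first in only one of them, which is the consensus behavior you want.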
The recall improvement is substantial. Documents that would miss the top-k cutoff for any single query formulation often surface in the merged list. But the cost is linear in N: three query variants mean three embedding calls, three vector searches, and three times the retrieval infrastructure cost. For a service handling significant query volume, multi-query expansion requires either cost acceptance or selective application.
The selective-application pattern: route queries above a complexity threshold to multi-query, keep simple queries on the fast path. A lightweight intent classifier — even a fine-tuned BERT-class model — can do this cheaply.
The Latency Math
The concern engineers raise about query rewriting is always latency. An additional LLM call before retrieval means an additional 100–500ms at p50, depending on model size and whether you're hitting an API or local inference.
ElevenLabs shared specific numbers from their production RAG refactor: query rewriting via a single externally hosted LLM accounted for more than 80% of their total RAG latency. Their solution was to switch to parallel inference across self-hosted Qwen3-4B and Qwen3-30B instances with a one-second timeout fallback. Median latency dropped from 326ms to 155ms — the rewriting step got cheaper by going smaller and faster, not by being removed.
This is the right framing. The question isn't "should we pay the cost of query rewriting?" — for most systems with real retrieval quality problems, the answer is yes. The question is "how do we make the rewriting step fast enough?" The answers are:
- Use a smaller model for query rewriting than for generation. 7B-class models handle rewriting well at a fraction of the latency of frontier models.
- Cache rewritten queries. Users ask similar questions. A TTL cache on rewritten forms pays for itself quickly in high-traffic systems.
- Run rewriting in parallel with other pipeline steps where possible — embedding the original query while the rewritten form is being generated.
- Apply rewriting selectively. Route through the rewrite layer only when baseline retrieval confidence is below a threshold, or when query complexity signals suggest it's needed.
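The caching point is the easiest of these to get wrong in a subtle way (cache keys must be normalized, entries must expire). A minimal sketch, assuming `rewrite_fn` is your LLM rewrite call; the normalization and TTL default are illustrative choices:

```python
# TTL cache wrapped around the rewrite call. Keys are normalized so
# trivially different phrasings of the same query share an entry.
import time

class TTLRewriteCache:
    def __init__(self, rewrite_fn, ttl_seconds=3600):
        self._rewrite = rewrite_fn
        self._ttl = ttl_seconds
        self._store = {}  # normalized query -> (rewritten, expiry)

    def get(self, query):
        key = " ".join(query.lower().split())  # cheap normalization
        hit = self._store.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]                      # cache hit: no LLM call
        rewritten = self._rewrite(query)
        self._store[key] = (rewritten, time.monotonic() + self._ttl)
        return rewritten
```

A real deployment would bound the store's size (LRU eviction) and probably key on an embedding rather than normalized text, but the shape is the same.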
The extra LLM call pays for itself when retrieval failure costs more than the latency budget. For a customer support RAG system where a retrieval miss means a wrong answer to a paying customer, the math almost always works out in favor of rewriting.
A Decision Framework for Matching Technique to Query
Not every query needs every technique. Here's how to match the transformation to the query type:
- Simple factoid queries ("what is the rate limit on the API?") — try HyDE first. The semantic space gap between a question and an answer is the core problem, and HyDE addresses it directly.
- Multi-facet or multi-hop queries ("compare latency and cost between X and Y, then recommend an approach for our use case") — sub-query decomposition, parallel if the sub-queries are independent.
- Queries requiring background context (complex diagnostic or architectural questions) — step-back prompting to retrieve foundational knowledge before attempting a specific answer.
- Ambiguous or underspecified queries ("how do I fix authentication?") — multi-query expansion to explore the semantic neighborhood, combined with intent classification to narrow scope.
- High-traffic systems with mixed query types — implement a router that classifies query complexity and applies the appropriate transformation, keeping the fast path (no rewriting) for simple queries.
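A toy router makes the framework concrete. A real system would use a trained classifier (the fine-tuned BERT-class model mentioned earlier); the keyword heuristics below are illustrative assumptions, not production rules:

```python
# Toy strategy router. The heuristics are placeholders for a trained
# intent/complexity classifier; only the routing shape matters here.

def classify_query(query: str) -> str:
    q = query.lower()
    multi_facet = (" and " in q) or ("compare" in q) or (q.count("?") > 1)
    diagnostic = any(w in q for w in ("why", "diagnose", "root cause"))
    underspecified = len(q.split()) <= 4
    if multi_facet:
        return "decompose"    # multiple distinct information needs
    if diagnostic:
        return "step_back"    # needs background context first
    if underspecified:
        return "multi_query"  # explore the semantic neighborhood
    return "hyde"             # well-formed factoid: single cheap rewrite

def route(query, handlers):
    """handlers: dict mapping strategy name -> callable(query)."""
    return handlers[classify_query(query)](query)
```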
Where This Leaves Chunking and Embeddings
Improving chunking and embedding models still matters. But those optimizations operate on fixed inputs — they make your index better at matching whatever queries you send it. Query rewriting changes the queries themselves, which shifts the distribution you're matching against. The two are complementary, not competing.
Teams that implement query rewriting consistently report retrieval gains that no amount of chunking experimentation had delivered. That's because the retrieval quality ceiling isn't set by the index — it's set by the query quality. Invest there first.
The engineering work is modest: one additional prompt, one LLM call (with a smaller model than your generator), and a routing layer to apply it selectively. The return — measured in retrieval recall, answer quality, and user-facing error rates — is usually the highest ROI change you can make to a RAG system that's already in production.
References
- https://www.emergentmind.com/topics/hypothetical-document-embeddings-hyde
- https://arxiv.org/html/2412.17558
- https://dev.to/jamesli/in-depth-understanding-of-rag-query-transformation-optimization-multi-query-problem-decomposition-and-step-back-27jg
- https://medium.com/@chinmayd49/rag-production-optimizations-and-trade-offs-a623e5834e65
- https://elevenlabs.io/blog/engineering-rag
- https://arxiv.org/html/2504.14891v1
- https://docs.nvidia.com/rag/latest/query_decomposition.html
- https://medium.com/theultimateinterviewhack/hyde-query-expansion-supercharging-retrieval-in-rag-pipelines-f200955929f1
