
Semantic Search as a Product: What Changes When Retrieval Understands Intent

Tian Pan · Software Engineer · 11 min read

Most teams building semantic search start from a RAG proof-of-concept: chunk documents, embed them, store vectors, query with cosine similarity. It works well enough in demos. Then they ship it to users, and half the queries fail in ways that have nothing to do with retrieval quality.

The reason is that RAG and user-facing semantic search are solving different problems. RAG asks "given a question, retrieve context for an LLM to answer it." Semantic search asks "given a user's query, surface results that match what they actually want." The second problem has a layer of complexity that RAG benchmarks systematically ignore — and that complexity lives almost entirely before retrieval begins.

The Pre-Retrieval Problem Everyone Ignores

In a RAG pipeline, the query is typically well-formed. Someone asks a question, the system retrieves context, and the LLM answers. The query is rarely ambiguous because there's usually a human or another LLM composing it deliberately.

Real users don't compose queries deliberately. They type "python sort list reverse" when they want to know the difference between sort() and sorted(). They search "login not working" when they mean "OAuth refresh token expiry handling." They abbreviate, they misspell, they use internal jargon that doesn't appear in your documentation, and they reformulate three times in the same session without ever finding what they need.

The consequence: a large fraction of failures in user-facing semantic search happen before any vector is touched. If you instrument your system the same way you instrument a RAG pipeline — tracking retrieval recall, checking whether returned documents are relevant — you'll miss the failures that occur when the query itself was never a fair test of your index.

Pre-retrieval failures take a few specific forms:

  • Query-corpus vocabulary mismatch: The user's phrasing has no semantic overlap with how your corpus describes the same concept. Embedding models can handle paraphrases but struggle with domain-specific abbreviations, acronyms, or jargon that didn't appear in training data.
  • Under-specified queries: Short, vague queries produce mediocre embeddings. "How to do authentication" generates a vector that sits roughly equidistant from dozens of documents, so retrieval returns a noise sample rather than a ranked result.
  • Multi-intent queries: "Is product X worth buying and where can I get it cheapest" contains two independent intents. A single embedding vector can't represent both faithfully. Traditional embedding models produce one representation per query; that representation interpolates between the intents rather than serving either.

The solution to pre-retrieval failures isn't a better embedding model. It's query pre-processing — a step that happens before the vector lookup.
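To make this concrete, here is a minimal pre-processing pass that could run before any vector lookup. The glossary, the token-count threshold, and the output shape are illustrative assumptions, not a recommended design:

```python
# Minimal query pre-processing, run before the embedding lookup.
# GLOSSARY and MIN_CONTENT_TOKENS are made-up examples for illustration.
import re

# Map user jargon and abbreviations to the vocabulary the corpus uses.
GLOSSARY = {
    "k8s": "kubernetes",
    "auth": "authentication",
    "repro": "reproduce",
}

MIN_CONTENT_TOKENS = 3  # below this, treat the query as under-specified


def preprocess(query: str) -> dict:
    """Normalize a raw user query and flag pre-retrieval risks."""
    text = re.sub(r"\s+", " ", query).strip().lower()
    tokens = text.split()

    # Vocabulary mismatch: swap known abbreviations for corpus terms.
    expanded = [GLOSSARY.get(tok, tok) for tok in tokens]

    return {
        "normalized": " ".join(expanded),
        "under_specified": len(tokens) < MIN_CONTENT_TOKENS,
    }


print(preprocess("k8s   auth broken"))
# {'normalized': 'kubernetes authentication broken', 'under_specified': False}
```

A query flagged as under-specified might trigger a clarification prompt or fall back to broader keyword search rather than a low-confidence vector lookup.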

Query Reformulation Is the Biggest Leverage Point

Teams that build search products with high retention all do some form of query normalization. The simplest version: spell correction and synonym expansion. The more sophisticated version: neural query rewriting that transforms vague user inputs into queries that match the structure of your corpus.
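The simplest version fits in a few lines. In this sketch, spell correction snaps unknown tokens to the nearest term in the corpus vocabulary and a synonym table adds corpus phrasing; both tables are made-up examples:

```python
# Spell correction plus synonym expansion, using only the stdlib.
# VOCAB and SYNONYMS are illustrative; in practice they come from
# your corpus and query logs.
from difflib import get_close_matches

VOCAB = ["login", "oauth", "token", "expiry", "refresh", "timeout"]
SYNONYMS = {"login": ["sign-in", "authentication"]}


def normalize(query: str) -> str:
    out = []
    for tok in query.lower().split():
        # Snap likely misspellings to the closest known corpus term.
        if tok not in VOCAB:
            match = get_close_matches(tok, VOCAB, n=1, cutoff=0.8)
            tok = match[0] if match else tok
        out.append(tok)
        # Expand with synonyms so the query overlaps corpus phrasing.
        out.extend(SYNONYMS.get(tok, []))
    return " ".join(out)


print(normalize("logn token expirey"))
# login sign-in authentication token expiry
```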

The key insight from production search systems is that user queries and document language are written by different people with different goals. Documents are written to inform; queries are written to retrieve. A user searching for "why does my connection keep dropping" is most likely looking for documents that describe "intermittent connectivity issues" or "TCP timeout configuration." The surface forms share almost no vocabulary.

Query reformulation bridges this gap by rewriting the user's query into a form that better matches the index — not by changing what the user wants, but by translating how they expressed it into a form your retrieval system can actually serve.
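One way to implement the rewrite step is with an instruction-tuned model. In the sketch below, `complete` is a placeholder for whatever LLM client you use, and the prompt wording is an assumption rather than a tested recipe:

```python
# Neural query rewriting as a thin prompt wrapper. `complete` is any
# callable that maps a prompt string to a completion string.
REWRITE_PROMPT = """\
Rewrite the user's search query so it matches the vocabulary of technical
documentation. Preserve the user's intent. Return only the rewritten query.

User query: {query}
Rewritten query:"""


def rewrite_query(query: str, complete) -> str:
    """Translate user phrasing into index-friendly phrasing."""
    return complete(REWRITE_PROMPT.format(query=query)).strip()


# rewrite_query("why does my connection keep dropping", complete)
# might return: "intermittent connectivity issues TCP timeout configuration"
```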

Behavioral signals make this loop self-improving. When users reformulate a query in a session — typing "python list sort" and then, after seeing disappointing results, trying "python sorted function descending" — they've handed you a high-quality training pair. The first query is the input; the second is a better version. At scale, these session-level reformulation chains become training data for a query rewriting model that continuously improves without any annotation overhead.
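A sketch of that mining step, assuming a log of (session_id, query, clicked) events ordered by session and time; the schema and the no-click heuristic are illustrative assumptions:

```python
# Mine (worse query, better query) training pairs from session logs.
# Assumes `events` is ordered by session and timestamp.
from itertools import groupby
from operator import itemgetter


def mine_training_pairs(events: list[dict]) -> list[tuple[str, str]]:
    """Treat an in-session reformulation after a no-click result page
    as an (input query, improved query) supervision pair."""
    pairs = []
    for _, session in groupby(events, key=itemgetter("session_id")):
        session = list(session)
        for prev, curr in zip(session, session[1:]):
            if not prev["clicked"] and prev["query"] != curr["query"]:
                pairs.append((prev["query"], curr["query"]))
    return pairs


log = [
    {"session_id": "s1", "query": "python list sort", "clicked": False},
    {"session_id": "s1", "query": "python sorted function descending", "clicked": True},
]
print(mine_training_pairs(log))
# [('python list sort', 'python sorted function descending')]
```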

This is why search teams that instrument their products like a RAG pipeline miss half their failures. A RAG evaluation asks "were the retrieved documents relevant to the query?" It doesn't ask "did the user have to reformulate three times before giving up?" The second question is the one that tells you whether your search product actually works.

Multi-Intent Queries Require a Different Architecture

Single-embedding retrieval works when a query has one intent. Many user queries don't.

A standard approach to multi-intent queries is to detect intent at the query level and route to specialized retrievers. A query about product availability might go to a catalog index; a query about troubleshooting a product might go to a support index; a query that combines both gets handled by something that can merge results from both. This isn't sophisticated — it's a classifier and a routing layer — but it dramatically outperforms trying to serve all intents from a single general-purpose embedding search.
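A sketch of that routing layer, with keyword rules standing in for what would normally be a small trained classifier; the term lists and retriever callables are placeholders:

```python
# Intent detection plus routing to per-intent indexes.
CATALOG_TERMS = {"buy", "price", "cheapest", "availability", "in stock"}
SUPPORT_TERMS = {"error", "not working", "troubleshoot", "fix", "broken"}


def detect_intents(query: str) -> set[str]:
    q = query.lower()
    intents = set()
    if any(t in q for t in CATALOG_TERMS):
        intents.add("catalog")
    if any(t in q for t in SUPPORT_TERMS):
        intents.add("support")
    return intents or {"general"}


def route(query: str, retrievers: dict) -> list:
    """Fan out to one retriever per detected intent and merge results."""
    results = []
    for intent in detect_intents(query):
        results.extend(retrievers[intent](query))
    return results


# route("my X is broken, where can I buy a replacement",
#       {"catalog": catalog_search, "support": support_search,
#        "general": general_search})
```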

More advanced architectures use Neural Multiple Intent Representation (NMIR) models, which generate distinct query embeddings for each detected intent. Rather than one vector representing the query, you get a set of vectors, each targeting a different facet. Retrieval happens against all of them, and a re-ranking step decides which results best satisfy the combination of intents.
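The retrieval side of that design can be sketched independently of the intent model. Here `embed_intents` (one vector per detected intent) and `index.search` are placeholders, and the max-score fusion is one of several reasonable re-ranking choices:

```python
# Retrieve against a set of per-intent query vectors, then fuse.
def multi_intent_search(query: str, embed_intents, index, k: int = 10):
    best = {}  # doc_id -> best score under any intent vector
    for vec in embed_intents(query):           # one vector per intent
        for doc_id, score in index.search(vec, k=k):
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    # Re-rank: documents that strongly satisfy at least one intent rise
    # above documents that weakly match the interpolated middle.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
```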

The architectural implication: user-facing semantic search needs a query understanding layer that sits upstream of the retrieval system. This layer handles intent detection, query expansion, disambiguation, and rewriting. In RAG, this layer either doesn't exist or is handled implicitly by the LLM constructing the retrieval query. In search products, it needs to be explicit and instrumented — because it's where most of the opportunity and most of the failures are.
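Making the layer explicit can be as simple as a staged pipeline where every stage is logged. The stage list and log fields below are illustrative; the point is that each pre-retrieval step becomes individually observable:

```python
# An instrumented query understanding pipeline, upstream of retrieval.
import json
import time


def understand(query: str, stages: list) -> str:
    """Run (name, fn) stages in order, logging input, output, latency."""
    text = query
    for name, fn in stages:
        start = time.monotonic()
        out = fn(text)
        print(json.dumps({
            "stage": name,
            "in": text,
            "out": out,
            "ms": round((time.monotonic() - start) * 1000, 2),
        }))
        text = out
    return text


# understand(raw_query, [("normalize", normalize),
#                        ("rewrite", lambda q: rewrite_query(q, complete))])
```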

Behavioral Signals Replace Manual Relevance Judgments
