
Semantic Search as a Product: What Changes When Retrieval Understands Intent

· 11 min read
Tian Pan
Software Engineer

Most teams building semantic search start from a RAG proof-of-concept: chunk documents, embed them, store vectors, query with cosine similarity. It works well enough in demos. Then they ship it to users, and half the queries fail in ways that have nothing to do with retrieval quality.

The reason is that RAG and user-facing semantic search are solving different problems. RAG asks "given a question, retrieve context for an LLM to answer it." Semantic search asks "given a user's query, surface results that match what they actually want." The second problem has a layer of complexity that RAG benchmarks systematically ignore — and that complexity lives almost entirely before retrieval begins.

The Pre-Retrieval Problem Everyone Ignores

In a RAG pipeline, the query is typically well-formed. Someone asks a question, the system retrieves context, and the LLM answers. The query is rarely ambiguous because there's usually a human or another LLM composing it deliberately.

Real users don't compose queries deliberately. They type "python sort list reverse" when they want to know the difference between sort() and sorted(). They search "login not working" when they mean "OAuth refresh token expiry handling." They abbreviate, they misspell, they use internal jargon that doesn't appear in your documentation, and they reformulate three times in the same session without ever finding what they need.

The consequence: a large fraction of failures in user-facing semantic search happen before any vector is touched. If you instrument your system the same way you instrument a RAG pipeline — tracking retrieval recall, checking whether returned documents are relevant — you'll miss the failures that occur when the query itself was never a fair test of your index.

Pre-retrieval failures take a few specific forms:

  • Query-corpus vocabulary mismatch: The user's phrasing has no semantic overlap with how your corpus describes the same concept. Embedding models can handle paraphrases but struggle with domain-specific abbreviations, acronyms, or jargon that didn't appear in training data.
  • Under-specified queries: Short, vague queries produce mediocre embeddings. "How to do authentication" generates a vector that's equidistant from dozens of documents, and retrieval returns a noise sample rather than a ranked result.
  • Multi-intent queries: "Is product X worth buying and where can I get it cheapest" contains two independent intents. A single embedding vector can't represent both faithfully. Traditional embedding models produce one representation per query; that representation interpolates between the intents rather than serving either.

The solution to pre-retrieval failures isn't a better embedding model. It's query pre-processing — a step that happens before the vector lookup.
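To make the idea concrete, here is a minimal sketch of such a pre-processing step. The correction and expansion tables are hypothetical stand-ins for the spell-correction and jargon dictionaries a real deployment would maintain:

```python
# Minimal sketch of query pre-processing that runs before any vector
# lookup. SPELL_FIXES and JARGON_EXPANSIONS are illustrative tables,
# not a real dictionary.

SPELL_FIXES = {"authentcation": "authentication", "pyton": "python"}
JARGON_EXPANSIONS = {"oauth": "oauth open authorization", "k8s": "kubernetes"}

def preprocess_query(raw: str) -> str:
    tokens = raw.lower().split()
    tokens = [SPELL_FIXES.get(t, t) for t in tokens]        # fix typos
    tokens = [JARGON_EXPANSIONS.get(t, t) for t in tokens]  # expand jargon
    return " ".join(tokens)

print(preprocess_query("pyton OAuth login"))
# "python oauth open authorization login"
```

Even this trivial version changes what the embedding model sees: the expanded query carries vocabulary that actually appears in the corpus, which is the whole point of the step.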

Query Reformulation Is the Biggest Leverage Point

Teams that build search products with high retention all do some form of query normalization. The simplest version: spell correction and synonym expansion. The more sophisticated version: neural query rewriting that transforms vague user inputs into queries that match the structure of your corpus.

The key insight from production search systems is that user queries and document language are written by different people with different goals. Documents are written to inform; queries are written to retrieve. A user searching for "why does my connection keep dropping" is looking for documents that likely describe "intermittent connectivity issues" or "TCP timeout configuration." The surface forms share almost no vocabulary.

Query reformulation bridges this gap by rewriting the user's query into a form that better matches the index — not by changing what the user wants, but by translating how they expressed it into a form your retrieval system can actually serve.

Behavioral signals make this loop self-improving. When users reformulate a query in a session — typing "python list sort" and then, after seeing disappointing results, trying "python sorted function descending" — they've handed you a high-quality training pair. The first query is the input; the second is a better version. At scale, these session-level reformulation chains become training data for a query rewriting model that continuously improves without any annotation overhead.
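Mining those pairs from logs is mechanical. A sketch, assuming a simplified log format where each session is an ordered list of queries plus a flag for whether it ended in a satisfied click:

```python
# Sketch of mining (bad_query -> better_query) training pairs from
# session logs. The last query of a satisfied session is treated as
# the "good" form; every earlier attempt pairs with it.

def mine_reformulation_pairs(sessions):
    """sessions: list of (ordered_queries, ended_satisfied) tuples."""
    pairs = []
    for queries, satisfied in sessions:
        if satisfied and len(queries) >= 2:
            target = queries[-1]            # final, successful query
            for q in queries[:-1]:          # every earlier attempt
                pairs.append((q, target))
    return pairs

sessions = [
    (["python list sort", "python sorted function descending"], True),
    (["login not working"], False),         # no reformulation, no pair
]
print(mine_reformulation_pairs(sessions))
```

A real pipeline would add filtering (drop pairs where the two queries share no intent, cap chains per user), but the shape of the data is exactly this.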

This is why search teams that instrument their products like RAG miss half their failures. A RAG evaluation asks "were the retrieved documents relevant to the query?" It doesn't ask "did the user have to reformulate three times before giving up?" The second question is the one that tells you whether your search product actually works.

Multi-Intent Queries Require a Different Architecture

Single-embedding retrieval works when a query has one intent. Many user queries don't.

A standard approach to multi-intent queries is to detect intent at the query level and route to specialized retrievers. A query about product availability might go to a catalog index; a query about troubleshooting a product might go to a support index; a query that combines both gets handled by something that can merge results from both. This isn't sophisticated — it's a classifier and a routing layer — but it dramatically outperforms trying to serve all intents from a single general-purpose embedding search.
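The classifier-plus-routing layer can be this small. A toy version with a keyword-based classifier standing in for a trained model; the index names and trigger words are illustrative:

```python
# Toy classify-and-route layer: detect intent(s), fan out to the
# matching per-intent indexes. TROUBLESHOOT_WORDS / CATALOG_WORDS are
# stand-ins for a trained intent classifier.

TROUBLESHOOT_WORDS = {"error", "broken", "fix", "not", "failing"}
CATALOG_WORDS = {"buy", "price", "cheapest", "available", "stock"}

def route(query: str) -> list[str]:
    tokens = set(query.lower().split())
    targets = []
    if tokens & CATALOG_WORDS:
        targets.append("catalog_index")
    if tokens & TROUBLESHOOT_WORDS:
        targets.append("support_index")
    return targets or ["general_index"]     # fallback retriever

print(route("where to buy X and why is it not charging"))
# both indexes fire: ["catalog_index", "support_index"]
```

The value isn't in the classifier's sophistication; it's that a query touching two intents now hits two specialized indexes instead of averaging into one general-purpose vector.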

More advanced architectures use Neural Multiple Intent Representation (NMIR) models, which generate distinct query embeddings for each detected intent. Rather than one vector representing the query, you get a set of vectors, each targeting a different facet. Retrieval happens against all of them, and a re-ranking step decides which results best satisfy the combination of intents.
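The merge step at the end of that pipeline is worth sketching, since it's where multi-vector retrieval differs from ordinary top-K. Here the per-intent result lists are precomputed; in the NMIR setup they would come from retrieval against each intent's embedding:

```python
# Sketch of merging per-intent result lists: each doc keeps the best
# score any intent gave it, then one ranked list comes out. Doc ids
# and scores are illustrative.

def merge_intent_results(per_intent_results, k=3):
    """per_intent_results: one [(doc_id, score), ...] list per intent."""
    best = {}
    for results in per_intent_results:
        for doc_id, score in results:
            best[doc_id] = max(best.get(doc_id, 0.0), score)
    merged = sorted(best.items(), key=lambda x: -x[1])
    return [doc_id for doc_id, _ in merged[:k]]

reviews = [("review_x", 0.91), ("spec_x", 0.40)]   # "worth buying" intent
pricing = [("store_a", 0.88), ("review_x", 0.35)]  # "cheapest" intent
print(merge_intent_results([reviews, pricing]))
```

Max-over-intents is one of several plausible merge policies; a production system would usually replace it with a learned re-ranker over the union, as the text describes.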

The architectural implication: user-facing semantic search needs a query understanding layer that sits upstream of the retrieval system. This layer handles intent detection, query expansion, disambiguation, and rewriting. In RAG, this layer either doesn't exist or is handled implicitly by the LLM constructing the retrieval query. In search products, it needs to be explicit and instrumented — because it's where most of the opportunity and most of the failures are.

Behavioral Signals Replace Manual Relevance Judgments

Evaluating search quality is expensive when done correctly. The gold standard is human relevance judgments: trained annotators rating each query-document pair. At scale, this is prohibitively slow and costly, and the judgments go stale as your corpus changes.

Production search teams use behavioral signals as a scalable proxy. The two most reliable:

Dwell time — how long a user spends on a result after clicking. Short dwell time (under 10 seconds) combined with a return to search results is a strong signal of irrelevance. Long dwell time is a strong signal of relevance. The signal is noisy on its own — some pages are dense, some are fast to scan — but in aggregate across thousands of sessions, it reliably tracks result quality. Research consistently finds that combining click signals with dwell time produces higher-quality implicit relevance judgments than click data alone.
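Turning those raw events into labels is a few lines. A sketch using the 10-second heuristic from above; real systems tune the threshold per content type:

```python
# Sketch of converting click + dwell logs into implicit relevance
# labels: +1 relevant, -1 irrelevant, 0 no signal. The 10s threshold
# is the heuristic from the text, not a universal constant.

SHORT_DWELL_SECONDS = 10

def implicit_label(clicked: bool, dwell_seconds: float,
                   returned_to_results: bool) -> int:
    if not clicked:
        return 0                            # no behavior to read
    if dwell_seconds < SHORT_DWELL_SECONDS and returned_to_results:
        return -1                           # quick bounce back to results
    if dwell_seconds >= SHORT_DWELL_SECONDS:
        return 1                            # satisfied dwell
    return 0                                # short dwell, no return: ambiguous

print(implicit_label(True, 4.0, True))      # prints -1
```

Note the deliberate 0 for a short dwell without a return to results: the user may have found a one-line answer, so treating it as irrelevance would inject noise.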

Query reformulation patterns — when a user searches, clicks, then reformulates, the reformulation tells you the previous result didn't satisfy the intent. At the aggregate level, queries that consistently trigger reformulation are surfacing bad results. At the model level, successful sessions (where a user found what they wanted without reformulating) and failed sessions (where they reformulated or abandoned) become implicit labels for training improved ranking.

The operational advantage is that these signals are collected automatically, they're available immediately, and they scale with query volume. An annotation pipeline that takes weeks to produce relevance labels produces click and dwell data in real time. For fast-moving products where relevance changes frequently — new content, changing user needs, updated corpus — behavioral signals are the only evaluation loop that keeps pace.

One important caveat: behavioral signals have selection bias. You only observe behavior for results that appeared in the top-K, which means results that never ranked highly never generate signal. A result that's highly relevant but never shown can't improve its position through behavioral feedback. This is the explore-exploit problem in search ranking, and it requires deliberate countermeasures — injecting occasional random or diversified results to expose the system to results outside its current top-K.
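The simplest countermeasure is an epsilon-greedy swap. A sketch, with illustrative doc ids — with small probability, one slot in the top-K is replaced by a result from outside it, so under-exposed documents start generating signal:

```python
# Epsilon-greedy sketch of the explore step: with probability epsilon,
# the last top-K slot is replaced by a result from outside the top-K.
# Slot choice and epsilon value are illustrative.

import random

def with_exploration(ranked, pool, k=5, epsilon=0.1, rng=random):
    top = list(ranked[:k])
    if rng.random() < epsilon:
        candidates = [d for d in pool if d not in top]
        if candidates:
            top[-1] = rng.choice(candidates)  # swap in an explore result
    return top

ranked = ["a", "b", "c", "d", "e", "f"]
print(with_exploration(ranked, ranked + ["g", "h"], epsilon=0.0))
# epsilon=0.0 never explores: ["a", "b", "c", "d", "e"]
```

Production systems use more careful designs (interleaving, result randomization with inverse-propensity weighting), but all of them are answers to the same bias this toy version addresses.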

Eval Methodology for Search Without Annotated Datasets

The standard information retrieval eval pipeline — build a query set, annotate relevant documents, compute NDCG or MAP — is expensive, slow, and produces benchmarks that diverge from production quality within weeks of corpus changes.

There are two approaches production teams use to avoid this.

The first is overlap-based evaluation. If you have multiple retrieval systems (keyword, dense, hybrid), the documents that only one system retrieves — that no other system agreed on — tend to be lower quality. Consensus across diverse retrieval strategies is a weak signal of relevance, and lack of consensus is a signal of questionable quality. This isn't a perfect eval, but it's entirely automated and requires no labeled data.
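A minimal version of the consensus computation, assuming each system exposes its retrieved doc ids as a set (system names and doc ids are illustrative):

```python
# Sketch of overlap-based evaluation: each document's score is the
# fraction of retrieval systems that surfaced it. Low-consensus docs
# are flagged as questionable; no labeled data required.

from collections import Counter

def consensus_scores(results_by_system):
    """results_by_system: dict of system name -> set of doc ids."""
    counts = Counter()
    for docs in results_by_system.values():
        counts.update(docs)
    n = len(results_by_system)
    return {doc: c / n for doc, c in counts.items()}

systems = {
    "bm25":   {"doc1", "doc2", "doc3"},
    "dense":  {"doc1", "doc2", "doc4"},
    "hybrid": {"doc1", "doc3", "doc4"},
}
print(sorted(consensus_scores(systems).items()))
```

Here doc1 scores 1.0 (all three systems agree) and every other doc scores 2/3; a doc only one system retrieved would score 1/3 and get flagged for review.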

The second is online evaluation using business metrics. Define success operationally: a successful search session ends in a click and a dwell time above your threshold, or a conversion, or zero reformulations. These proxies are imperfect, but they're measuring actual user behavior rather than a simulation of it. A/B testing retrieval changes against these metrics is more reliable than comparing against static annotated benchmarks because it tests against real user queries on the real corpus.
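The operational success definition translates directly into a metric you can A/B test against. A sketch with illustrative field names:

```python
# Sketch of the operational success metric: a session succeeds if it
# ended in a click, dwell cleared the threshold, and the user never
# reformulated. Session field names are illustrative.

def session_succeeded(session, min_dwell=10.0):
    return (session["clicked"]
            and session["dwell_seconds"] >= min_dwell
            and session["reformulations"] == 0)

def success_rate(sessions):
    if not sessions:
        return 0.0
    return sum(session_succeeded(s) for s in sessions) / len(sessions)

good = {"clicked": True, "dwell_seconds": 42.0, "reformulations": 0}
bad = {"clicked": True, "dwell_seconds": 3.0, "reformulations": 2}
print(success_rate([good, bad]))  # 0.5
```

Whatever the exact definition, the important property is that it's computed on live traffic, so a retrieval change can be judged against it the day it ships.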

The trap is optimizing only for clicks. Click-through rate is easily gamed — better titles produce more clicks regardless of content quality, and clickbait headlines are the classic failure mode. Pairing CTR with dwell time or downstream engagement metrics closes this gap. A result that gets clicked but immediately abandoned is worse than a result that doesn't get clicked, and your eval methodology should reflect that.

What Actually Breaks in Production

Uber migrated their user-facing semantic search to a system handling 1.5 billion vectors with embeddings of nearly 400 dimensions. The engineering wasn't the hard part. The hard part was that their initial prototype measured retrieval recall against a fixed test set and shipped. Production traffic immediately exposed gaps: queries that worked fine in the test set failed on the long tail of real user phrasing, and the feedback loop to detect and fix those failures didn't exist.

The pattern is common enough that it's worth naming explicitly. Search teams building on top of embedding retrieval typically hit the same sequence of failures:

  1. The demo works — a curated set of queries on a clean corpus produces impressive results. Recall is high, results look good.
  2. Production query distribution is different — real users phrase things the demo queries didn't cover. The head of the query distribution is fine; the tail is broken.
  3. There's no instrumentation to see the tail — the team is tracking average precision on a test set that doesn't represent production traffic.
  4. Users reformulate and abandon — the failure is invisible in metrics but visible in retention. Users stop using the search feature because it doesn't reliably find what they're looking for.

The fix is straightforward but requires building infrastructure that most teams skip: session-level logging that tracks the full query sequence including reformulations and abandonments, not just individual query-result pairs. This is the data that tells you whether your search product is working.

Building a Search Product That Compounds

The difference between a search feature and a search product is feedback loops. A feature retrieves documents; a product learns from user behavior to retrieve better documents over time.

The compounding loop has three components: behavioral signal collection (clicks, dwell time, reformulations), a training pipeline that converts those signals into improved query understanding and ranking, and an eval methodology that measures whether the system is improving on real traffic rather than a static benchmark.

Teams that get this loop working see steady improvement in search quality without any changes to the underlying retrieval infrastructure. The embedding model doesn't change; the vector index doesn't change; what changes is the query pre-processing layer and the ranking function, both of which learn continuously from user behavior.

The semantic search market is growing at 23% CAGR as organizations recognize that keyword matching is no longer sufficient for the volume and variety of queries their products need to handle. The infrastructure is now commodity — hosted vector databases, managed embedding APIs, open-source retrieval libraries. The differentiation is in the layers that RAG benchmarks don't measure: how well you understand queries before retrieval, how well you learn from user behavior after retrieval, and how well you eval on real traffic rather than curated test sets.

Build those layers first. The embedding model is the easy part.
