The Discovery Problem: Why Semantic Search Fails Browsing Users
Vector search is eating the world. Embedding-based retrieval now powers product search at every major e-commerce platform, drives the retrieval layer of RAG systems, and sits at the core of most AI-powered search rewrites. But there is a category of user that these systems fail silently and consistently: the browsing user. Not because the embeddings are bad. Because they were built to solve a different problem.
The fundamental assumption behind semantic search is that users arrive with a query that approximates what they want. Optimize for proximity in embedding space to that query, and you win. But a significant fraction of real users arrive with something closer to curiosity than a query — and for them, the nearest neighbors in vector space are exactly the wrong answer.
Lookup vs. Exploration: Two Fundamentally Different Search Modes
Every search system serves two populations simultaneously, and most treat them as one.
Lookup users know what they want. They have a precise information need and are looking for a specific fact, product, or document. Their session looks like: one query, click the first or second result, done. Success means precision — the right answer is at the top.
Exploratory users do not know what they want, or they know the domain but not the specific answer. They might be searching for "gift ideas for an outdoor person," "machine learning papers on recommendation systems," or "warm jackets for men." Their session looks like: several reformulations, scattered clicks across diverse results, longer dwell time, and frequent backtracking. Success means coverage and serendipity — seeing enough of the space to make a good decision.
Research on query reformulation patterns shows exploratory users reformulate queries in characteristic ways: specialization (moving from broad to narrow), generalization (broadening when stuck), and lateral moves (exploring adjacent concepts). None of these behaviors is well-served by "find the nearest neighbors to this embedding."
The failure is invisible in standard metrics. If you measure precision@5 averaged across all query types, lookup queries carry your aggregate score. Exploratory queries quietly underperform without dragging down the headline number.
Why Embeddings Are the Wrong Tool for Exploration
Embedding-based retrieval encodes queries and documents into high-dimensional vectors, then finds documents whose vectors are closest to the query vector. The implicit assumption is that "nearest" means "most relevant." For lookup queries, that assumption is close enough to work. For exploration, it produces exactly the failure mode you'd expect: a pile of similar documents, all near the query, with no diversity.
The theoretical explanation is not just intuition — it's a mathematical constraint. Embedding dimension directly limits the number of distinct top-k document combinations a model can represent. For any fixed embedding dimension, there exist combinations of relevant documents that are unreachable regardless of how you phrase the query. This is not a gap that scaling dimensionality will close; it's a fundamental property of the compression.
The practical consequence is measurable. Recent benchmarks on retrieval tasks found that state-of-the-art embedding models achieved less than 20% recall@100 on relatively simple matching tasks where the answer set had fewer than 50 documents. Nearly half of the top-5 results from cosine similarity retrieval were irrelevant to the query intent — just nearby in embedding space.
For exploratory users, the problem compounds. They do not know enough about the space to formulate a precise query. Short, vague queries — "vegetable garden," "trip to Japan," "investing basics" — provide insufficient signal for the embedding model to distinguish intent. The model returns results that cluster around the dominant sense of those tokens, missing the tail of the distribution where discovery actually lives.
Detecting Browsing Intent at Query Time
Intent signals are noisy but detectable. Engineers building search systems can instrument for them:
Query characteristics:
- Short, high-entropy terms without specificity markers (brand names, product IDs, proper nouns) correlate with exploratory intent
- Comparative terms ("vs", "alternatives", "types of") are strong exploration signals
- Absence of qualifiers, negations, or constraints suggests the user is not yet committed to a particular subspace
Session behavior:
- Click entropy — how spread out clicks are across different URL domains — is a strong disambiguation signal. Lookup sessions concentrate clicks; exploratory sessions scatter them
- Multiple rapid query reformulations within a short time window indicate the user is navigating, not fetching
- Long dwell time on result pages, followed by back-navigation, indicates evaluation of options rather than extraction of a fact
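Click entropy is the easiest of these signals to compute. A minimal sketch (the domain strings are illustrative): Shannon entropy over the domains a session's clicks land on, in bits. Concentrated clicks give values near zero; scattered clicks push the value toward log2 of the number of distinct domains.

```python
import math
from collections import Counter

def click_entropy(clicked_domains):
    """Shannon entropy (in bits) of a session's clicks across domains.

    Near 0 -> clicks concentrate on one domain (lookup-like).
    Higher values -> clicks scatter across domains (exploration-like).
    """
    if not clicked_domains:
        return 0.0
    counts = Counter(clicked_domains)
    total = len(clicked_domains)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A lookup-like session vs. an exploratory one:
lookup_session = ["docs.python.org"] * 4
explore_session = ["flask.dev", "djangoproject.com",
                   "fastapi.tiangolo.com", "bottlepy.org"]
```

Here `click_entropy(lookup_session)` is 0.0 and `click_entropy(explore_session)` is 2.0 bits, which is why a simple threshold on this value already separates the two session types reasonably well.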
Click-through patterns over time:
- Users who consistently reformulate and click diverse results for a given query domain signal that the domain is inherently exploratory for that person
- Personalized intent modeling at session start can weight these signals dynamically
These signals can route users to different retrieval configurations in real time — tighter precision for lookup patterns, diversity-weighted results for exploratory patterns. The routing logic does not need to be perfect; even crude intent classification improves aggregate satisfaction.
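A crude version of that routing classifier can be sketched as a handful of additive heuristics. Everything here is illustrative: the feature thresholds, the weights, and the comparative-term list are assumptions to be tuned against your own session data, not established values.

```python
def classify_intent(query, reformulations_in_window=0, click_entropy_bits=None):
    """Heuristic lookup-vs-exploration classifier. Positive score leans
    exploratory. All thresholds and weights are illustrative placeholders."""
    score = 0
    tokens = query.lower().split()
    if len(tokens) <= 3:                       # short, vague queries
        score += 1
    if any(t in ("vs", "alternatives", "types") for t in tokens):
        score += 2                             # strong exploration signal
    if any(ch.isdigit() for ch in query):      # IDs / SKUs / model numbers
        score -= 2                             # specificity marker: lookup
    if reformulations_in_window >= 3:          # rapid reformulation
        score += 1
    if click_entropy_bits is not None and click_entropy_bits > 1.5:
        score += 1                             # scattered clicks this session
    return "exploration" if score >= 2 else "lookup"
```

For example, `classify_intent("python vs ruby")` returns `"exploration"` while `classify_intent("iphone 15 pro case sku 12345")` returns `"lookup"`. The point is not accuracy of this particular rule set; it is that even a scoring function this blunt gives the ranker something to condition on.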
Ranking Strategies That Serve Discovery
Once you accept that exploration requires different retrieval logic, several practical approaches work.
Maximum Marginal Relevance (MMR)
MMR reranks a candidate result set by iteratively selecting results that are both relevant to the query and maximally different from already-selected results. Each selection step scores candidates as: λ × relevance − (1 − λ) × max_similarity_to_selected. The λ parameter controls the relevance-diversity tradeoff.
For exploratory intent, λ in the range 0.3–0.5 works well. For lookup intent, push λ toward 0.8–0.9. The algorithm is straightforward to implement as a post-retrieval reranking step on any existing system — it operates on the candidate pool, not the retrieval mechanism itself.
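The greedy selection loop described above fits in a few lines. This sketch assumes unit-normalized vectors so that dot products are cosine similarities; the sample vectors in the usage note are made up for illustration.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lam=0.4):
    """Maximal Marginal Relevance over unit-normalized vectors.

    Greedily picks k documents, scoring each candidate as
    lam * sim(query, doc) - (1 - lam) * max sim(doc, already selected).
    lam near 0.9 -> pure relevance (lookup); lam near 0.4 -> diversity.
    """
    relevance = doc_vecs @ query_vec          # cosine sims (unit vectors)
    pairwise = doc_vecs @ doc_vecs.T          # doc-doc similarity matrix
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(pairwise[i][j] for j in selected) if selected else 0.0
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a query vector `[1, 0]` and three documents (one exact match, one near-duplicate of it, one off-axis), `lam=0.4` selects the exact match and the off-axis document, while `lam=0.9` selects the exact match and its near-duplicate: the λ knob directly trades the duplicate for coverage.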
Reciprocal Rank Fusion (RRF) in Hybrid Search
Hybrid search — running both keyword (BM25/sparse) and semantic (dense vector) retrieval in parallel, then merging — addresses a different layer of the problem. BM25 captures lexical precision: exact product names, SKUs, and domain-specific jargon, the out-of-distribution terminology that embedding models handle poorly. Dense retrieval captures semantic intent and paraphrasing.
RRF merges the two ranked lists without relying on raw scores: each document's fusion score is the sum of 1 / (rank + k) across lists it appears in, where k is a smoothing constant (typically 60). Documents that appear in both lists get a large bonus — this tends to surface results that are both lexically relevant and semantically appropriate, which skews toward diversity compared to pure semantic retrieval.
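The fusion step is small enough to show in full. A minimal sketch, with document IDs invented for the usage example:

```python
def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank),
    with rank starting at 1. k=60 is the conventional smoothing constant.
    Raw retrieval scores are never consulted, only ranks."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["sku-123", "doc-a", "doc-b"]   # lexical list
dense_results = ["doc-b", "doc-c", "doc-a"]    # semantic list
```

Fusing these two lists ranks `doc-b` and `doc-a` first, because appearing in both lists outweighs a single top position in either one, which is exactly the both-lexically-and-semantically-relevant bias described above.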
Benchmarks consistently show hybrid search outperforming either pure approach by 15–25% on mixed-intent query sets. For exploration-heavy product search, the improvement is larger.
Faceted Navigation as a Discovery Surface
Faceted interfaces offload the exploration problem from the ranking algorithm to the user. Rather than trying to predict what subset of results the user wants, facets make the dimensions of the result space visible and manipulable. Users who do not know what they want can navigate by constraint — price, category, attribute, rating — discovering the space incrementally.
The engineering tradeoff is real: faceted interfaces add UI complexity and require consistent structured metadata across your corpus. For product catalogs and document libraries, the investment usually pays off. For unstructured corpora (general web search, raw document retrieval), facets are harder to generate reliably.
Serendipity Metrics in Evaluation
Standard relevance metrics (NDCG, precision@k) cannot measure exploration quality because they do not penalize redundancy. A list of five nearly-identical documents scores the same as a diverse list covering five different aspects of the query.
α-nDCG (alpha-NDCG) addresses this directly: it's an extension of NDCG that tracks which subtopics of a query have been covered by each successive result, penalizing documents that repeat already-covered ground. Systems optimized for α-nDCG instead of standard NDCG produce measurably more diverse result sets.
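The gain computation at the heart of α-nDCG is straightforward; the full metric divides this by the score of an ideal ordering, which is usually approximated greedily. A sketch of the unnormalized α-DCG, where `doc_subtopics` maps each document to the set of query subtopics it covers:

```python
import math

def alpha_dcg(ranked_docs, doc_subtopics, alpha=0.5, k=10):
    """alpha-DCG: a document's gain is summed over the subtopics it covers,
    each discounted by (1 - alpha) ** (times that subtopic was already
    covered earlier in the list), then position-discounted by log2.
    Dividing by the (greedily approximated) ideal score gives alpha-nDCG."""
    covered = {}      # subtopic -> number of earlier docs covering it
    total = 0.0
    for pos, doc in enumerate(ranked_docs[:k], start=1):
        subtopics = doc_subtopics.get(doc, ())
        gain = sum((1 - alpha) ** covered.get(t, 0) for t in subtopics)
        for t in subtopics:
            covered[t] = covered.get(t, 0) + 1
        total += gain / math.log2(pos + 1)
    return total
```

With `alpha=0.5`, a second document repeating an already-covered subtopic contributes only half the gain of one covering new ground, so a diverse list scores strictly higher than a redundant one of the same length — the penalty standard NDCG lacks.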
Serendipity in recommender research is often formalized as unexpectedness × relevance: a result must be both relevant and non-obvious to count as serendipitous. Building explicit serendipity signals into your eval suite forces you to measure and optimize for what exploration users actually need.
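One common simplification makes that formula concrete: treat unexpectedness as a binary indicator of "outside the obvious baseline" (for example, the most popular items or the top results of a trivial ranker). This sketch assumes graded relevance labels in `[0, 1]`; both the baseline definition and the labels are inputs you must supply.

```python
def serendipity_at_k(results, relevance, obvious_baseline, k=5):
    """Mean of (unexpectedness * relevance) over the top-k results, where
    unexpectedness is 1 if the item falls outside the obvious/popular
    baseline set and 0 otherwise -- one common simplification."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(
        (0.0 if r in obvious_baseline else 1.0) * relevance.get(r, 0.0)
        for r in top
    ) / len(top)
```

An item only scores if it is both relevant and non-obvious, so a list of popular-but-relevant results and a list of surprising-but-irrelevant results both score zero.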
Where the Architecture Ends Up
Production systems at scale — e-commerce platforms, large document retrieval, enterprise search — converge on a common pattern:
- Retrieve broadly: Run BM25 and dense retrieval in parallel over a large candidate pool
- Merge with RRF: Blend the two lists using rank fusion to promote documents that appear in both
- Detect intent signals: Classify query-level and session-level intent from observable signals
- Rerank with diversity weight: Apply MMR or a learned reranker that takes intent class as input — tighter relevance for lookup, MMR λ toward 0.4 for exploration
- Expose facets where the corpus supports it: For structured corpora, surface facets as a parallel navigation surface
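The five steps above can be sketched as one orchestration function. Every callable here is a hypothetical stand-in for a component described in this section (the retrievers, the fusion step, the intent classifier, the MMR reranker), and the λ values and pool sizes are illustrative defaults, not recommendations.

```python
def search(query, session, bm25_search, dense_search, fuse, classify, rerank_mmr):
    """End-to-end routing sketch. All callables are injected stand-ins:
    bm25_search / dense_search return ranked doc lists, fuse merges them
    (e.g. RRF), classify returns 'lookup' or 'exploration', and rerank_mmr
    applies diversity-aware reranking with the chosen lambda."""
    # 1-2. Retrieve broadly in parallel, then merge with rank fusion.
    candidates = fuse([bm25_search(query, n=200), dense_search(query, n=200)])
    # 3. Classify intent from query- and session-level signals.
    intent = classify(query, session)
    # 4. Route: tight relevance for lookup, diversity for exploration.
    lam = 0.85 if intent == "lookup" else 0.4
    return rerank_mmr(query, candidates, lam=lam, k=20)
```

Because the components are injected, each stage can be swapped or A/B-tested independently, which is where the intent-classification investment pays off.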
The intent classification step is where most teams underinvest. Even simple heuristics — query length, presence of comparative terms, session reformulation count — improve the quality of the routing decision substantially. More sophisticated approaches train classifiers on click entropy labels derived from historical sessions.
The Gap That Pure Semantic Search Cannot Close
The deeper issue is that exploratory users are not just underserved by semantic search — they are actively misled by it. Embedding models return results with high confidence scores that look relevant but are all the same kind of relevant. A user searching for "python web frameworks" who gets five results about Flask, three about Django, and two about FastAPI has not actually seen the space. They have seen the center of the space, weighted by how frequently those topics appear in the training data.
Good discovery is not about finding the nearest neighbors. It is about giving users a faithful sample of the relevant landscape — including corners they would not have thought to query for directly. That requires the system to model the space, not just the query.
Vector search is a powerful tool for lookup. For discovery, it is a necessary but insufficient component. The systems that serve browsing users well are the ones that have instrumented intent, built diversity into their ranking objectives, and treated "help the user understand what exists" as a first-class product requirement alongside "find what the user is asking for."
Most search systems have only optimized for the second.
Sources
- https://arxiv.org/pdf/2110.04640
- https://bergum.medium.com/four-mistakes-when-introducing-embeddings-and-vector-search-d39478a568c5
- https://www.elastic.co/search-labs/blog/maximum-marginal-relevance-diversify-results
- https://qdrant.tech/blog/mmr-diversity-aware-reranking/
- https://www.elastic.co/what-is/hybrid-search
- https://opensearch.org/blog/building-effective-hybrid-search-in-opensearch-techniques-and-best-practices/
- https://www.uber.com/blog/evolution-and-scale-of-ubers-delivery-search-platform/
- https://www.mdpi.com/2079-9292/14/4/821
