
5 posts tagged with "search"


The Query Rewriting Layer Your RAG Pipeline Skipped

· 10 min read
Tian Pan
Software Engineer

When a RAG system gives a wrong answer, the first instinct on most teams is to blame the encoder. Swap in a bigger embedding model. Try a domain-tuned one. Bump the dimension count. Three sprints later the recall curve has nudged up a few points and the user complaints look the same.

The diagnosis was wrong. Most retrieval failures aren't embedding failures. They're query-shape failures — and no amount of vector tuning fixes a vocabulary mismatch that exists before the encoder ever runs.

A user types "how do I cancel." The relevant document is titled "Subscription Lifecycle Management" and uses words like "termination," "billing cycle close," and "service deactivation." No encoder in the world pulls those two strings into the same neighborhood by lexical luck. The cosine similarity gap is real, and it lives in the input, not the model. The query rewriting layer that sits in front of retrieval is the piece most pipelines skip, then spend a quarter trying to compensate for downstream.
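
A minimal sketch of what that layer can look like, assuming you already have an LLM client and a vector index. `call_llm`, `vector_search`, the prompt, and the fan-out strategy are illustrative placeholders, not a prescribed implementation:

```python
# Pre-retrieval rewriting: expand the user's wording into the corpus's
# vocabulary before the encoder ever sees the query. `call_llm` and
# `vector_search` are hypothetical callables standing in for whatever
# client and index the pipeline already uses.

def rewrite_query(call_llm, user_query: str, n_variants: int = 3) -> list[str]:
    prompt = (
        "Rewrite the support query below into formal product/documentation "
        f"vocabulary. Give {n_variants} distinct rewrites, one per line.\n\n"
        f"Query: {user_query}"
    )
    variants = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    # Always keep the original query alongside the rewrites.
    return [user_query] + variants[:n_variants]

def retrieve_with_rewrites(call_llm, vector_search, user_query: str, k: int = 10):
    # Retrieve for the original query and every rewrite, then dedupe by doc id,
    # keeping the best score each document achieved across variants.
    best: dict[str, float] = {}
    for q in rewrite_query(call_llm, user_query):
        for doc_id, score in vector_search(q, k=k):
            best[doc_id] = max(best.get(doc_id, float("-inf")), score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

For the "how do I cancel" example, the hope is that at least one rewrite lands on "subscription termination" or "service deactivation," so the encoder is comparing like with like instead of bridging a vocabulary gap it was never trained to close.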

Live Web Grounding in Production: Why Calling a Search API Is Only the Beginning

· 10 min read
Tian Pan
Software Engineer

Most engineers discover the limits of live web grounding the same way: they wire up a search API in an afternoon, ship it to production, and spend the next three weeks explaining why the latency is six seconds, the answers are wrong about recent events, and users are occasionally getting directed to fake phone numbers.

The underlying assumption — that search-augmented LLMs are just "regular RAG but with fresh data" — is the source of most of the pain. Live web grounding shares almost nothing with static retrieval beyond the word "retrieval." It is a distributed systems problem wearing an NLP hat.

The Discovery Problem: Why Semantic Search Fails Browsing Users

· 9 min read
Tian Pan
Software Engineer

Vector search is eating the world. Embedding-based retrieval now powers product search at every major e-commerce platform, drives the retrieval layer of RAG systems, and sits at the core of most AI-powered search rewrites. But there is a category of user that these systems fail silently and consistently: the browsing user. Not because the embeddings are bad, but because the systems were built to solve a different problem.

The fundamental assumption behind semantic search is that users arrive with a query that approximates what they want. Optimize for proximity in embedding space to that query, and you win. But a significant fraction of real users arrive with something closer to curiosity than a query — and for them, the nearest neighbors in vector space are exactly the wrong answer.

Semantic Search as a Product: What Changes When Retrieval Understands Intent

· 11 min read
Tian Pan
Software Engineer

Most teams building semantic search start from a RAG proof-of-concept: chunk documents, embed them, store vectors, query with cosine similarity. It works well enough in demos. Then they ship it to users, and half the queries fail in ways that have nothing to do with retrieval quality.
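
That proof-of-concept shape, as a rough sketch. `embed` stands in for whatever embedding model is in use, and the fixed-window chunking is deliberately the simplest possible choice:

```python
# Chunk, embed, store, score by cosine similarity: the demo-grade pipeline
# most semantic search efforts start from. `embed` is assumed to be a
# callable mapping list[str] -> np.ndarray of shape (n, d).
import numpy as np

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

class NaiveVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.chunks: list[str] = []
        self.matrix = None  # (n_chunks, d), rows L2-normalized

    def add(self, documents: list[str]) -> None:
        new_chunks = [c for doc in documents for c in chunk(doc)]
        vecs = self.embed(new_chunks)
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        self.chunks.extend(new_chunks)
        self.matrix = vecs if self.matrix is None else np.vstack([self.matrix, vecs])

    def query(self, text: str, k: int = 5) -> list[tuple[str, float]]:
        q = self.embed([text])[0]
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q  # cosine similarity on normalized vectors
        top = np.argsort(scores)[::-1][:k]
        return [(self.chunks[i], float(scores[i])) for i in top]
```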

The reason is that RAG and user-facing semantic search are solving different problems. RAG asks "given a question, retrieve context for an LLM to answer it." Semantic search asks "given a user's query, surface results that match what they actually want." The second problem has a layer of complexity that RAG benchmarks systematically ignore — and that complexity lives almost entirely before retrieval begins.

Hybrid Search in Production: Why BM25 Still Wins on the Queries That Matter

· 11 min read
Tian Pan
Software Engineer

BM25 was published in 1994. The math is simple enough to fit on a whiteboard. Yet in production retrieval benchmarks in 2025, it still outperforms multi-billion-parameter dense embedding models on a meaningful slice of real-world queries. Teams that discover this after deploying pure vector search tend to discover it in the worst possible way: through hallucination complaints they can't reproduce in evaluation, because their eval set was built from queries that already worked.

This is the retrieval equivalent of sampling bias. Dense retrieval fails on a specific and predictable query shape. The failure is silent — the LLM still produces fluent, confident-sounding answers from whatever fragments it retrieved. No error log fires. No latency spike. Just quietly wrong answers for users querying product SKUs, error codes, API names, or anything that is lexically specific rather than semantically general.

The fix is hybrid search. But "hybrid search" is underspecified as an engineering decision. This post covers what the failure modes actually look like, how to fuse retrieval signals correctly, where the reranking layer goes, and — most critically — how to find the query types your current pipeline is silently failing on before users find them for you.
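
As one hedged example of the fusion step, reciprocal rank fusion (RRF) is a common way to merge a BM25 ranking with a dense ranking before any reranker sees the candidates. The rankings are assumed to be doc-id lists ordered best-first, and k=60 is the conventional constant rather than a tuned value:

```python
# Reciprocal rank fusion: combine multiple rankings by summing 1 / (k + rank)
# per document, which rewards documents that appear near the top of any list
# without requiring the underlying scores to be comparable.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Usage sketch: fused = reciprocal_rank_fusion([bm25_ids, dense_ids])[:50],
# then hand the fused candidate set to a cross-encoder reranker.
```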