Hybrid Search in Production: Why BM25 Still Wins on the Queries That Matter
BM25 was published in 1994. The math is simple enough to fit on a whiteboard. Yet in production retrieval benchmarks in 2025, it still outperforms multi-billion-parameter dense embedding models on a meaningful slice of real-world queries. Teams that discover this after deploying pure vector search tend to discover it in the worst possible way: through hallucination complaints they can't reproduce in evaluation, because their eval set was built from queries that already worked.
This is the retrieval equivalent of sampling bias. Dense retrieval fails on a specific and predictable query shape. The failure is silent — the LLM still produces fluent, confident-sounding answers from whatever fragments it retrieved. No error log fires. No latency spike. Just quietly wrong answers for users querying product SKUs, error codes, API names, or anything that is lexically specific rather than semantically general.
The fix is hybrid search. But "hybrid search" is underspecified as an engineering decision. This post covers what the failure modes actually look like, how to fuse retrieval signals correctly, where the reranking layer goes, and — most critically — how to find the query types your current pipeline is silently failing on before users find them for you.
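As a preview of the fusion step, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge a BM25 ranking with a dense-retrieval ranking. The function name, document IDs, and example lists below are illustrative assumptions, not part of any specific system described in this post; `k=60` is the constant from the original RRF paper.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over ranked lists of doc IDs.

    Each document scores sum(1 / (k + rank)) across every ranking
    it appears in; k damps the dominance of top-ranked positions.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical case: BM25 nails the exact SKU, dense retrieval buries it.
bm25_ranking = ["sku-4471", "doc-a", "doc-b"]
dense_ranking = ["doc-a", "doc-c", "sku-4471"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)  # sku-4471 stays near the top despite the dense miss
```

The appeal of RRF is that it needs only ranks, not scores, so there is no need to normalize BM25's unbounded scores against cosine similarities before combining them.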
