Beyond RAG: Hybrid Search, Agentic Retrieval, and the Database Design Decisions That Actually Matter

· 8 min read
Tian Pan
Software Engineer

Most teams ship RAG and call it a retrieval strategy. They chunk documents, embed them, store the vectors, and run nearest-neighbor search at query time. It works well enough in demos. In production, users start reporting that the system can't find an article they know exists, fails to match error codes that appear verbatim in the docs, or returns semantically similar but factually wrong passages.

The problem isn't RAG. The problem is treating retrieval as a one-dimensional problem when it's always been multi-dimensional.

Pure vector search is great at finding conceptually similar content. It's bad at finding exact matches — product SKUs, error codes, version numbers, names of people or files. BM25-based full-text search is the opposite: precise on exact terms, blind to synonyms and paraphrase. Neither alone is sufficient for a production system that real users depend on.

This post covers the retrieval architecture decisions that actually move the needle: when to combine search modalities, how agentic workloads are reshaping retrieval system design, and what this means for the databases you choose and build on top of.

Why Pure Vector Search Fails at the Edges

The intuition behind dense retrieval is compelling. Embed your documents and queries into the same high-dimensional space, and semantically similar things end up geometrically close. This works remarkably well for concept-level matching.

But real workloads are messier. Consider:

  • A user searching for "TypeError: Cannot read properties of undefined (reading 'map')" — the exact error string
  • A code assistant looking for a function called processUserPayload in a 500-file codebase
  • A legal research tool querying for a specific statute number

Embedding-based retrieval will approximate these. It won't nail them. Recall studies consistently show that BM25 achieves around 72% recall on keyword-dominated queries, while hybrid search bumps that to 91% — a 19-point gain that matters enormously in high-stakes domains.

The failure mode is subtle enough that it often doesn't surface in your eval set, which tends to be built from cases where retrieval worked. The cases you're missing are systematically excluded from the benchmark.

Hybrid Search: Combining BM25 and Dense Vectors

Hybrid search isn't a new idea — it's just finally getting proper tooling. The core pattern:

  1. Run the query through a sparse retrieval system (BM25 or SPLADE) to get keyword-matched results
  2. Run the same query through your embedding model to get semantically similar results
  3. Fuse the ranked lists using Reciprocal Rank Fusion (RRF) or a learned ranker
  4. Optionally re-rank the merged list with a cross-encoder

RRF is the practical default. It's simple, robust, and doesn't require training data:

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            # Each list contributes 1 / (k + rank + 1); k = 60 damps the
            # influence of top ranks so no single list dominates the fusion.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

The choice between BM25 and SPLADE for the sparse leg matters. BM25 is fast and interpretable. SPLADE learns sparse representations that can handle vocabulary mismatch — users asking "how do I cancel my subscription" when docs say "terminate your plan." For enterprise knowledge bases where document and query vocabulary diverge, SPLADE wins. For code search or structured data where exact terms matter, BM25 is usually sufficient.

The practical decision tree:

  • Exact-match dominated queries (error codes, SKUs, identifiers): weight BM25 higher
  • Concept-level queries (how-to, explanations): weight dense retrieval higher
  • Most production systems: hybrid with roughly equal weighting and RRF fusion
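
The weighting in the decision tree above can be expressed as a weighted variant of RRF. The function name and weights below are illustrative, not from the original post:

```python
# Weighted Reciprocal Rank Fusion: each result list gets a per-list
# multiplier, e.g. boost the BM25 list for exact-match-heavy queries.
def weighted_rrf(
    ranked_lists: list[list[str]],
    weights: list[float],
    k: int = 60,
) -> list[str]:
    scores: dict[str, float] = {}
    for ranked, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Boosting one leg flips the winner when the two lists disagree:
# weighted_rrf([["A", "B"], ["B", "A"]], [2.0, 1.0]) -> ["A", "B"]
# weighted_rrf([["A", "B"], ["B", "A"]], [1.0, 2.0]) -> ["B", "A"]
```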

How Agentic Workloads Are Breaking Retrieval Infrastructure

Traditional RAG does retrieval once per user turn: user sends a message, system embeds it, fetches top-K chunks, injects into context, LLM responds. One retrieval call per interaction.

Agentic systems shatter this assumption. A coding agent working through a large codebase might fire 50–100 parallel retrieval calls in a single "turn" — semantic search for relevant functions, grep-style pattern matching for specific syntax, file-path lookups, dependency graph traversal. Retrieval becomes a high-concurrency tool-call workload rather than a single-threaded pipeline.

This changes the infrastructure requirements dramatically:

Latency profile shifts: You used to care about P50 latency per query. Now you care about the P99 of 100 parallel queries, because your agent waits for all of them. A system with 40ms P50 and 500ms P99 is much worse for agents than one with 60ms P50 and 200ms P99.
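
The arithmetic behind this shift is simple. Assuming independent queries where each exceeds the P99 with probability 1%, the chance that a turn gets dragged into the tail grows fast with fan-out (a back-of-envelope sketch, not a measurement):

```python
# Probability that at least one of n parallel retrieval calls lands in
# the slow tail, where a single call exceeds the P99 with probability tail_p.
def tail_hit_probability(n: int, tail_p: float = 0.01) -> float:
    return 1.0 - (1.0 - tail_p) ** n

# A single query rarely hits the tail; a 100-call agent turn almost always does.
single = tail_hit_probability(1)    # 0.01
turn = tail_hit_probability(100)    # ~0.63
```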

Query diversity increases: Agents issue semantically diverse queries in parallel — they're not running the same query N times. A hybrid system that can handle semantic search, full-text search, and metadata filtering in the same request becomes essential, not optional.

Pricing models matter: Per-query pricing that made sense for human-paced interaction becomes unworkable when a single agent turn fires 100 queries. This is pushing vector database vendors toward storage-plus-writes pricing with cheap reads, which better matches how agents actually consume the system.

Azure AI Search's agentic retrieval reports up to 40% latency reduction by running subqueries in parallel. Cursor's coding assistant found a 25% improvement on specific workloads by combining semantic search with grep-based retrieval — and invested in a custom embedding model to close the domain gap further.

The Database Design Decisions That Follow

Once you understand the workload — hybrid queries, high concurrency, variable query types — the database architecture decisions become clearer.

Object storage as the primary store is now viable. S3 achieving strong consistency in December 2020 was a quiet inflection point. You can now build systems where the canonical data lives in object storage and compute nodes are stateless, with NVMe SSDs used only as a cache layer. This wasn't safely possible before. The implication: your vector store can scale storage and compute independently, which matters when you have 100 billion vectors but relatively bursty query load.

You need a retrieval system that handles mixed query types natively. A system that does vector search but punts on full-text search forces you to maintain two separate indexes and merge results yourself. This works but adds operational complexity. A system that handles both in a unified query interface — and can apply metadata filters alongside both retrieval types — dramatically simplifies the application layer.

Re-ranking is not optional for quality-sensitive applications. Approximate nearest-neighbor search trades some precision for speed. For most use cases, this is the right tradeoff. For applications where retrieval quality directly impacts user trust — legal, medical, financial — a cross-encoder re-ranker on the top-20 results (not the top-1000) is worth the added latency. Cross-encoders see both query and document jointly rather than independently, catching relevance that bi-encoders miss.
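
The shape of that pipeline can be sketched as follows. The function names are illustrative, and the `score_pair` callable stands in for a real cross-encoder that scores (query, document) pairs jointly:

```python
from typing import Callable

# "Filter early, re-rank late": apply cheap metadata filters to the fused
# candidate list first, then spend the expensive cross-encoder only on the
# small slice that survives.
def filter_then_rerank(
    query: str,
    candidates: list[dict],            # each: {"id", "text", "source", ...}
    allowed_sources: set[str],
    score_pair: Callable[[str, str], float],  # stand-in for a cross-encoder
    rerank_top: int = 20,
    final_top: int = 5,
) -> list[dict]:
    filtered = [c for c in candidates if c["source"] in allowed_sources]
    head = filtered[:rerank_top]       # re-rank only the top of the filtered set
    head.sort(key=lambda c: score_pair(query, c["text"]), reverse=True)
    return head[:final_top]
```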

A Practical Architecture for 2025

Here's what a production-grade retrieval stack looks like for a document-heavy application:

User query
├── Sparse retrieval (BM25/SPLADE)
│   └── Document index (inverted)
├── Dense retrieval (ANN search)
│   └── Vector index (HNSW or IVF)
└── RRF fusion (merges both ranked lists)
    └── Metadata filter (date, source, access control)
        └── Cross-encoder re-rank (top 20 → top 5)
            └── LLM context injection

For agentic applications, wrap the retrieval layer in a tool interface that the agent can call with different parameters:

from typing import Literal

@tool  # `tool` and `Document` come from your agent framework of choice
def retrieve(
    query: str,
    mode: Literal["semantic", "keyword", "hybrid"] = "hybrid",
    filters: dict | None = None,  # avoid a mutable default argument
    top_k: int = 10,
) -> list[Document]:
    """Retrieve relevant documents. Use 'keyword' for exact matches,
    'semantic' for concept-level search, 'hybrid' for unknown cases."""
    ...

Giving the agent explicit control over retrieval mode lets it make better decisions than defaulting to one-size-fits-all hybrid search. A coding agent searching for a function signature should specify mode="keyword". An agent answering a conceptual question should use mode="semantic". The agent that can reason about retrieval strategy outperforms the one that can't.
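
One cheap way to seed that decision, for an agent or a router in front of it, is a heuristic over the query text. The regex and question-word list below are illustrative assumptions, not a tested classifier:

```python
import re

# Hypothetical mode router: identifier-like tokens (camelCase, snake_case,
# digits, code punctuation) suggest keyword search; question-style openers
# suggest semantic search; both signals together suggest hybrid.
QUESTION_WORDS = {"how", "why", "what", "when", "where", "explain"}

def choose_mode(query: str) -> str:
    tokens = query.split()
    has_exact = any(re.search(r"[a-z][A-Z]|_|\d|[:()\[\]']", t) for t in tokens)
    is_conceptual = bool(tokens) and tokens[0].lower() in QUESTION_WORDS
    if has_exact and is_conceptual:
        return "hybrid"
    if has_exact:
        return "keyword"
    if is_conceptual:
        return "semantic"
    return "hybrid"
```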

What This Means for Your Stack

If you're building a new retrieval system today:

  • Don't default to pure vector search. Budget for hybrid from the start. Retrofitting it is painful.
  • Measure retrieval quality separately from end-to-end task quality. Retrieval recall and precision are independent metrics. You need to know which layer is failing.
  • Design for agent-level concurrency. If you're building anything agents will call, your retrieval layer needs to handle high concurrent query load with consistent P99 latency.
  • Treat embedding model choice as a first-class decision. Off-the-shelf models underperform domain-specific ones for specialized corpora. Cursor's custom embedding model produced measurable gains. The cost of fine-tuning an embedding model is often less than teams expect.
  • Re-rank late, filter early. Apply metadata filters before re-ranking to reduce the candidate set. Apply re-ranking on the small set that passes filters. Getting the order wrong is expensive.

The gap between teams that ship retrieval as an afterthought and teams that treat it as a core engineering problem is widening. As agents issue more and more retrieval calls, retrieval quality compounds — small gains per call become large gains across a workflow. It's worth getting right.
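
To make "retrieval quality compounds" concrete: if a workflow needs every one of its retrieval calls to surface the right context, per-call recall multiplies. A toy model, assuming independent calls:

```python
# Probability that all n independent retrieval calls succeed. A 5-point
# per-call gain roughly triples the odds of a clean 20-call workflow.
def workflow_success(per_call_recall: float, n_calls: int) -> float:
    return per_call_recall ** n_calls

# workflow_success(0.90, 20) ≈ 0.12
# workflow_success(0.95, 20) ≈ 0.36
```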


The key architectural insight is that retrieval has never been a single problem. Keyword search and semantic search solve different subproblems. Agentic workloads just made the stakes high enough that the gap between "good enough" and "designed correctly" became impossible to ignore.

Let's stay in touch. Follow me for more thoughts and updates.