5 posts tagged with "reranking"

RAG Position Bias: Why Chunk Order Changes Your Answers

8 min read
Tian Pan
Software Engineer

You've spent weeks tuning your embedding model. Your retrieval precision looks solid. Chunk size, overlap, metadata filters — all dialed in. And yet users keep reporting that the system "ignores" information it clearly has access to. The relevant passage is in the top-5 retrieved results every time. The model just doesn't seem to use it.

The culprit is often position bias: a systematic tendency for language models to over-rely on information at the beginning and end of their context window, while dramatically under-attending to content in the middle. In controlled experiments, moving a relevant passage from position 1 to position 10 in a 20-document context produces accuracy drops of 30–40 percentage points. Your retriever found the right content. The ordering killed it.
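
One common mitigation, consistent with the "lost in the middle" pattern, is to reorder retrieved chunks so the strongest evidence sits at the edges of the context rather than in raw retrieval order. A minimal sketch (the function name is mine, not from any library):

```python
def reorder_for_position_bias(docs_by_relevance):
    """Interleave docs so the most relevant land at the start and end
    of the context, where models attend most; the weakest end up in
    the middle, where attention is lowest."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

# Ranks 1..5 become [1, 3, 5, 4, 2]: the top two results bracket the context.
print(reorder_for_position_bias([1, 2, 3, 4, 5]))
```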

The Reranker Gap: Why Most RAG Pipelines Skip the Most Important Layer

8 min read
Tian Pan
Software Engineer

Most RAG pipelines have an invisible accuracy ceiling, and the engineers who built them don't know it's there. You tune your chunking strategy, upgrade your embedding model, swap vector databases — and the system still returns plausible but subtly wrong documents for a stubborn class of queries. The retrieval looks reasonable. The LLM sounds confident. But downstream accuracy has quietly plateaued at a level that no amount of prompt engineering will break through.

The gap almost always traces to the same missing piece: a reranker. Specifically, the absence of a cross-encoder in a second retrieval stage. It's the layer that's technically optional, practically expensive to skip, and systematically omitted from the canonical "embed, index, query" tutorials that most RAG pipelines are built from.
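
For a concrete picture, here is a minimal two-stage sketch using the sentence-transformers CrossEncoder. The model name is one common public checkpoint, and the toy candidate list stands in for a real first-stage retriever:

```python
from sentence_transformers import CrossEncoder

query = "how do I rotate an expired API key?"
# Stand-in for stage 1: in a real pipeline these candidates come from
# your vector store, typically a generous top-k (e.g. 50).
candidates = [
    "API keys can be rotated from the dashboard under Settings > Keys.",
    "Expired TLS certificates must be reissued by the CA.",
    "Rate limits for the API are documented per endpoint.",
]

# Stage 2: the cross-encoder reads query and document together, so it
# scores their word-level interaction instead of two frozen vectors.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])

# Highest score first; only the top few go into the LLM context.
for score, doc in sorted(zip(scores, candidates),
                         key=lambda p: p[0], reverse=True):
    print(f"{score:+.2f}  {doc}")
```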

Cross-Encoder Reranking in Practice: What Cosine Similarity Misses

10 min read
Tian Pan
Software Engineer

Your RAG pipeline retrieves the top 10 documents and your LLM still gives a wrong answer. You increase the retrieval count to 50. Still wrong. The frustrating part: the correct document was in your vector store the whole time—it was just ranked 23rd. This is not a recall problem. It's a ranking problem, and cosine similarity is the culprit.

Vector search does a decent job of finding semantically adjacent content. But "semantically adjacent" and "most useful for this specific query" are not the same thing. Cosine similarity measures the angle between two vectors in embedding space, and that angle only captures a coarse notion of topical proximity. What it cannot capture is the fine-grained interaction between the specific words in your query and the specific words in a document—the difference between "how to prevent buffer overflows" and "buffer overflow exploit techniques" is subtle at the vector level but critical for your retrieval system.
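
To make that concrete, here's the computation in miniature. The checkpoint is one common bi-encoder and the printed number will vary by model; the point is only that a single dot product has no way to see which words interact:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common bi-encoder
a, b = model.encode([
    "how to prevent buffer overflows",
    "buffer overflow exploit techniques",
])

# Cosine similarity is just the angle between the two vectors: coarse
# topical proximity, blind to the defend-vs-attack intent of the words.
cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cos:.3f}")  # typically high despite opposite intent
```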

When Embeddings Aren't Enough: A Decision Framework for Hybrid Retrieval Architecture

11 min read
Tian Pan
Software Engineer

Most RAG implementations start the same way: spin up a vector database, embed documents with a decent model, run cosine similarity at query time, and ship it. The demo looks great. Relevance feels surprisingly good. Then you deploy it to production and discover that "Error 221" retrieves documents about "Error 222," that searching for a specific product SKU surfaces semantically similar but wrong items, and that adding a date filter causes retrieval quality to crater.

Vector search is a genuinely powerful tool. It's also not sufficient on its own for most production retrieval workloads. The teams winning with RAG in 2025 aren't choosing between dense embeddings and keyword search — they're using both, deliberately.

This is a decision framework for when hybrid retrieval is worth the added complexity, and how to build each layer without destroying your latency budget.
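
As a taste of the "both, deliberately" layer: the standard way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF), which needs no score calibration between the two systems. A minimal sketch:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs. k=60 is the constant from the
    original RRF paper and a common default."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a BM25 (keyword) ranking with a dense-vector ranking:
fused = reciprocal_rank_fusion([
    ["doc7", "doc2", "doc9"],  # keyword results: exact matches win here
    ["doc2", "doc4", "doc7"],  # vector results: semantic neighbors win here
])
print(fused)  # docs found by both retrievers (doc2, doc7) rise to the top
```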

The Production Retrieval Stack: Why Pure Vector Search Fails and What to Do Instead

12 min read
Tian Pan
Software Engineer

Most RAG systems are deployed with a vector database, a few thousand embeddings, and the assumption that semantic similarity is close enough to correctness. It is not. That gap between "semantically similar" and "actually correct" is why 73% of RAG systems fail in production, and almost all of those failures happen at the retrieval stage — before the LLM ever generates a word.

The standard playbook of "embed your documents, query with cosine similarity, pass top-k to the LLM" works in demos because demo queries are designed to work. Production queries are not. Users search for product IDs, invoice numbers, regulation codes, competitor names spelled wrong, and multi-constraint questions that a single embedding vector cannot geometrically satisfy. Dense vector search is not wrong — it is incomplete. Building a retrieval stack that actually works in production requires understanding why, and layering in the components that compensate.
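
One of those compensating components is a cheap routing layer in front of retrieval: queries that look like IDs or codes go through exact or keyword match before (or instead of) dense search. A rough sketch; the index interfaces here are hypothetical placeholders, not a real library API:

```python
import re

# Matches ID-shaped tokens like "SKU-12345" or "ERR221" (tune per domain).
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-?\d{3,}\b")

def route_query(query, keyword_index, vector_index):
    """Send ID-like queries to exact/keyword match first; fall back to
    dense retrieval for open-ended natural-language questions."""
    if ID_PATTERN.search(query):
        hits = keyword_index.exact_match(query)  # hypothetical interface
        if hits:
            return hits
    return vector_index.search(query, k=10)      # hypothetical interface
```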