44 posts tagged with "retrieval"

When Embeddings Aren't Enough: A Decision Framework for Hybrid Retrieval Architecture

· 11 min read
Tian Pan
Software Engineer

Most RAG implementations start the same way: spin up a vector database, embed documents with a decent model, run cosine similarity at query time, and ship it. The demo looks great. Relevance feels surprisingly good. Then you deploy it to production and discover that "Error 221" retrieves documents about "Error 222," that searching for a specific product SKU surfaces semantically similar but wrong items, and that adding a date filter causes retrieval quality to crater.

Vector search is a genuinely powerful tool. It's also not sufficient on its own for most production retrieval workloads. The teams winning with RAG in 2025 aren't choosing between dense embeddings and keyword search — they're using both, deliberately.

This is a decision framework for when hybrid retrieval is worth the added complexity, and how to build each layer without destroying your latency budget.
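As a rough illustration of what "using both, deliberately" can look like, here is a minimal sketch of reciprocal rank fusion over a keyword ranking and a vector ranking. The document IDs are made up, and the k=60 smoothing constant is the conventional default rather than anything prescribed by the post.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc IDs into one hybrid ranking.

    `rankings` is a list of lists, each ordered best-first, e.g. one
    from BM25 and one from a vector index. k=60 is the commonly used
    smoothing constant for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Keyword search nails the exact token ("Error 221"); vector search
# contributes semantically related documents. Fusion keeps both signals.
bm25_hits = ["err-221-guide", "err-221-faq", "err-222-guide"]
vector_hits = ["err-222-guide", "err-221-guide", "timeout-guide"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
```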

The Knowledge Contamination Problem: When Your RAG System Ignores Its Own Retrieval

· 8 min read
Tian Pan
Software Engineer

A team ships a RAG pipeline for internal documentation. Retrieval looks solid — the right passages come back. But in production, users keep getting stale answers. They dig into the logs and find the model is returning facts from its training data, not from the documents it was handed. The retrieval worked. The model just didn't use it.

This is the knowledge contamination problem: the model's parametric memory — the knowledge baked into its weights during training — overrides the retrieved context. It's quiet, it's confident, and it's one of the most common failure modes in production RAG systems.

Poisoned at the Source: RAG Corpus Decay and Data Governance for Vector Stores

· 11 min read
Tian Pan
Software Engineer

Your RAG system was working fine at launch. Three months later it's confidently wrong about a third of user queries — and your traces show nothing broken. The retriever is fetching documents. The model is generating responses. The pipeline looks healthy. The problem is invisible: every vector in your store still has a similarity score, but half of them are pointing to facts that no longer exist.

This is corpus decay. It doesn't throw errors. It doesn't trigger alerts. It accumulates quietly in the background, and by the time you notice it through user complaints or quality degradation, your vector store has become a liability.

The RAG Eval Antipattern That Hides Retriever Bugs

· 10 min read
Tian Pan
Software Engineer

There's a failure mode common in RAG systems that goes undetected for months: your retriever is returning the wrong documents, but your generator is good enough at improvising that end-to-end quality scores stay green. You keep tuning the prompt. You upgrade the model. Nothing helps. The bug is three layers upstream, and your metrics can't see it.

This is the retriever eval antipattern — evaluating your entire RAG pipeline as a single unit, which lets the generator absorb and hide retrieval failures. The result is a system where you cannot distinguish between "the generator failed" and "the retriever failed," making systematic improvement nearly impossible.
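One way to see the split the post argues for: score the retriever on its own against labeled relevant documents, independent of whatever the generator produces. A minimal recall@k sketch, with made-up document IDs standing in for a small hand-built eval set:

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the labeled relevant docs that appear in the top-k."""
    if not relevant_ids:
        return None  # nothing to measure for this query
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Labels come from a hand-verified eval set, not from the generator.
retrieved = ["doc-14", "doc-02", "doc-91", "doc-07", "doc-33"]
relevant = {"doc-07", "doc-55"}
print(recall_at_k(retrieved, relevant))  # 0.5: a gap no prompt tuning can hide
```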

The Discovery Problem: Why Semantic Search Fails Browsing Users

· 9 min read
Tian Pan
Software Engineer

Vector search is eating the world. Embedding-based retrieval now powers product search at every major e-commerce platform, drives the retrieval layer of RAG systems, and sits at the core of most AI-powered search rewrites. But there is a category of user that these systems fail silently and consistently: the browsing user. Not because the embeddings are bad. Because they were built to solve a different problem.

The fundamental assumption behind semantic search is that users arrive with a query that approximates what they want. Optimize for proximity in embedding space to that query, and you win. But a significant fraction of real users arrive with something closer to curiosity than a query — and for them, the nearest neighbors in vector space are exactly the wrong answer.

Semantic Search as a Product: What Changes When Retrieval Understands Intent

· 11 min read
Tian Pan
Software Engineer

Most teams building semantic search start from a RAG proof-of-concept: chunk documents, embed them, store vectors, query with cosine similarity. It works well enough in demos. Then they ship it to users, and half the queries fail in ways that have nothing to do with retrieval quality.

The reason is that RAG and user-facing semantic search are solving different problems. RAG asks "given a question, retrieve context for an LLM to answer it." Semantic search asks "given a user's query, surface results that match what they actually want." The second problem has a layer of complexity that RAG benchmarks systematically ignore — and that complexity lives almost entirely before retrieval begins.

Knowledge Graphs as a RAG Alternative: When Structured Retrieval Beats Embeddings

· 9 min read
Tian Pan
Software Engineer

Most RAG implementations fail in exactly the same way: the vector search retrieves something plausible but not what the user actually needed, the LLM wraps it in confident prose, and the user gets an answer that's approximately right but specifically wrong. The frustrating part is that the failure mode is invisible — cosine similarity scores look fine, the retrieved passages mention the right topics, but the answer is still wrong because the question required reasoning across relationships, not just semantic proximity.

Vector embeddings are excellent at one thing: finding text that sounds like your query. That's a powerful capability, and it covers an enormous range of production use cases. But it breaks predictably when the question depends on how entities connect to each other rather than how closely their descriptions match. For those queries, a knowledge graph — a property graph you traverse with Cypher or SPARQL — is not an optimization. It's a fundamentally different kind of retrieval that solves a different class of problem.
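To make the contrast concrete, here is a toy sketch of a relationship question answered by traversing an in-memory property graph rather than by nearest-neighbor lookup. The schema, edge names, and the Cypher shown in the comment are illustrative assumptions, not taken from the post.

```python
# Tiny in-memory property graph: node -> list of (relation, target) edges.
graph = {
    "ServiceA": [("DEPENDS_ON", "ServiceB")],
    "ServiceB": [("DEPENDS_ON", "ServiceC")],
    "ServiceC": [("OWNED_BY", "TeamPlatform")],
}

def traverse(start, relation, max_hops=3):
    """Follow edges of one relation type, returning every node reached."""
    frontier, seen = [start], set()
    for _ in range(max_hops):
        nxt = []
        for node in frontier:
            for rel, target in graph.get(node, []):
                if rel == relation and target not in seen:
                    seen.add(target)
                    nxt.append(target)
        frontier = nxt
    return seen

# "Which services does ServiceA transitively depend on?" -- roughly what
# MATCH (a {name:'ServiceA'})-[:DEPENDS_ON*1..3]->(s) RETURN s expresses
# in Cypher. No embedding of the question can see these hops.
print(traverse("ServiceA", "DEPENDS_ON"))  # {'ServiceB', 'ServiceC'}
```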

The Query Rewrite Layer Your RAG System Is Missing

· 10 min read
Tian Pan
Software Engineer

Most teams tuning a RAG system focus on two levers: chunking strategy and embedding model selection. When retrieval quality degrades, they re-chunk. When recall numbers look bad, they upgrade the embedding model. Both are reasonable moves — but they're optimizing the middle of the pipeline while leaving the highest-leverage point untouched.

The user's query is almost never in the ideal form for vector retrieval. It's terse, colloquial, ambiguous, or assumes context that the index doesn't have. No matter how good your embeddings are, if you're searching with a poorly formed query, you're going to retrieve poorly. The fix isn't downstream — it's transforming the query before it reaches the vector index.
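A hedged sketch of that rewrite step: pass the raw query through an LLM before it reaches the vector index. The `complete()` helper is a stand-in for whatever LLM client the pipeline already uses, and the prompt wording is an assumption, not the post's recipe.

```python
REWRITE_PROMPT = """Rewrite the user's search query so it is explicit and
self-contained for a documentation search index. Expand abbreviations,
resolve pronouns using the conversation so far, and keep it to one sentence.

Conversation so far:
{history}

User query: {query}

Rewritten query:"""

def rewrite_query(query: str, history: str, complete) -> str:
    """`complete` is any callable that sends a prompt to an LLM and
    returns its text response (hypothetical stand-in, not a real client)."""
    prompt = REWRITE_PROMPT.format(history=history, query=query)
    rewritten = complete(prompt).strip()
    # Fall back to the raw query if the rewrite comes back empty.
    return rewritten if rewritten else query

# retrieve(rewrite_query("does it work on 3.12?", history, complete))
```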

The Retrieval Emptiness Problem: Why Your RAG Refuses to Say 'I Don't Know'

· 10 min read
Tian Pan
Software Engineer

Ask a production RAG system a question your corpus cannot answer and watch what happens. It rarely says "I don't have that information." Instead, it retrieves the five highest-ranked chunks — which, having nothing better to match, are the five least-bad chunks of unrelated content — and hands them to the model with a prompt that reads something like "answer the user's question using the context below." The model, trained to be helpful and now holding text that sort of resembles the topic, produces a confident answer. The answer is wrong in a way that's architecturally invisible: the retrieval succeeded, the generation succeeded, every span was grounded in a retrieved document, and the user walked away misled.

This is the retrieval emptiness problem. It isn't a bug in any single layer. It's the emergent behavior of a pipeline that treats "top-k" as a contract and never asks whether the top-k is any good. Research published at ICLR 2025 on "sufficient context" quantified the effect: when Gemma receives sufficient context, its hallucination rate on factual QA is around 10%. When it receives insufficient context — retrieved documents that don't actually contain the answer — that rate jumps to 66%. Adding retrieved documents to an under-specified query makes the model more confidently wrong, not less.
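One minimal way to stop treating top-k as a contract is to gate generation on retrieval quality: if the best matches are weak, return "I don't know" instead of handing the model the least-bad chunks. The thresholds and score scale below are placeholders that would need calibrating on a real corpus; the retriever and generator callables are stand-ins for whatever the pipeline already uses.

```python
IDK_RESPONSE = "I don't have that information in the indexed documents."

def answer_with_guardrail(query, retriever, generate,
                          min_top_score=0.45, min_supporting=2):
    """Refuse to answer when retrieval looks insufficient.

    `retriever(query)` -> list of (chunk_text, similarity_score), best first.
    `generate(query, chunks)` -> model answer grounded in the chunks.
    The threshold values are illustrative, not recommendations.
    """
    hits = retriever(query)
    strong = [(text, score) for text, score in hits if score >= min_top_score]
    if not hits or hits[0][1] < min_top_score or len(strong) < min_supporting:
        return IDK_RESPONSE
    return generate(query, [text for text, _ in strong])
```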

Stale Retrieval: The Data Quality Problem Your RAG Pipeline Is Hiding

· 10 min read
Tian Pan
Software Engineer

Your RAG system is lying to you about the past. When a user asks about current pricing, active security policies, or a feature that shipped last quarter, the retrieval pipeline returns the most semantically similar document in the index — not the most recent one. An 18-month-old pricing page and this morning's update look identical to cosine similarity. Nothing in the standard RAG stack has any concept of whether the retrieved document is still true.

This is stale retrieval, and it fails differently than hallucination. The model isn't inventing anything. It accurately summarizes real content that once existed. Standard evaluation metrics — faithfulness, groundedness, context precision — all pass. The system is confidently correct about a fact that stopped being correct months ago.
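A sketch of one mitigation direction: carry a last-updated timestamp in chunk metadata and fold recency into the ranking, so an 18-month-old page and this morning's update stop scoring identically. The half-life and blend weight are illustrative assumptions, not tuned values.

```python
import math
from datetime import datetime, timezone

def recency_weighted(hits, half_life_days=180, recency_weight=0.3):
    """Re-rank (doc_id, similarity, last_updated) hits with exponential decay.

    `last_updated` must be a timezone-aware datetime. A document loses half
    of its recency credit every `half_life_days`; the blend between
    similarity and recency is a tuning knob, not a rule.
    """
    now = datetime.now(timezone.utc)
    rescored = []
    for doc_id, sim, last_updated in hits:
        age_days = (now - last_updated).total_seconds() / 86400
        recency = math.exp(-math.log(2) * age_days / half_life_days)
        score = (1 - recency_weight) * sim + recency_weight * recency
        rescored.append((score, doc_id))
    return [doc_id for _, doc_id in sorted(rescored, reverse=True)]
```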

Knowledge Graphs Are Back: Why RAG Teams Are Adding Structure to Their Retrieval

· 8 min read
Tian Pan
Software Engineer

Your RAG pipeline answers single-fact questions beautifully. Ask it "What is our refund policy?" and it nails it every time. But ask "Which customers on the enterprise plan filed support tickets about the billing API within 30 days of their contract renewal?" and it falls apart. The answer exists in your data — scattered across three different document types, connected by relationships that cosine similarity cannot see.

This is the multi-hop reasoning problem, and it's the reason a growing number of production RAG teams are grafting knowledge graphs onto their vector retrieval pipelines. Not because graphs are trendy again, but because they've hit a concrete accuracy ceiling that no amount of chunk-size tuning or reranking can fix.

Dynamic Few-Shot Retrieval: Why Your Static Examples Are Costing You Accuracy

· 11 min read
Tian Pan
Software Engineer

When a team hardcodes three example input-output pairs at the top of a system prompt, it feels like a reasonable engineering decision. The examples are hand-verified, formatting is consistent, and the model behavior predictably improves. Six months later, the same three examples are still there — covering 30% of incoming queries well and the rest indifferently — and nobody has run the numbers to find out which is which.

Static few-shot prompting is the most underexamined performance sink in production LLM systems. The alternative — selecting examples per request based on semantic similarity to the actual query — consistently outperforms fixed examples by double-digit quality margins across diverse task types. But the transition is neither free nor risk-free, and the failure modes on the dynamic side are less obvious than on the static side.

This post covers what the research actually shows, how the retrieval stack works in production, the ordering and poisoning risks that most practitioners miss, and the specific cases where static examples should win.
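As a rough sketch of the dynamic alternative: embed a pool of verified examples once, then pick the nearest ones to each incoming query at request time. The embedding model is assumed to be whatever the system already uses; only the selection step is shown.

```python
import numpy as np

def select_examples(query_vec, example_vecs, examples, n=3):
    """Return the n examples whose embeddings are closest to the query.

    `example_vecs` is an (N, d) array built once from a hand-verified
    example pool; `query_vec` is a (d,) embedding of the incoming request.
    """
    pool = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    sims = pool @ q                     # cosine similarity against the pool
    best = np.argsort(sims)[::-1][:n]
    return [examples[i] for i in best]

# The selected examples are formatted into the prompt in place of the
# fixed three, using the same input/output template as before.
```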