50 posts tagged with "retrieval"

RAG Against a Phantom Inventory: When Your Corpus Describes Features Your Product Removed

· 11 min read
Tian Pan
Software Engineer

A customer asks your support agent how to do something. The agent retrieves three documentation chunks with high relevance scores, synthesizes a confident answer, and walks the customer through a five-step procedure that ends on a button that hasn't existed for four months. The customer files a ticket. The on-call engineer pulls the eval suite, finds it green, pulls the retrieval traces, finds them green too — the model didn't hallucinate, it faithfully quoted documentation describing a feature your product team renamed in the last quarterly release.

This is the failure mode I want to name: not a hallucination, not a retrieval miss, but a phantom inventory problem. Your retrieval corpus is a snapshot of a product surface that no longer exists. The vector store doesn't know the product changed. The eval suite doesn't know either. The only system that consistently catches it is the support ticket queue, and by the time a ticket is filed the customer has already been told to click a button that isn't there.
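One way to catch this before the ticket queue does is to compare what the corpus references against what the product currently ships. The sketch below is a minimal illustration, not a description of any particular system: it assumes each chunk carries a hypothetical `features` metadata list and that you can export a set of live feature names from somewhere.

```python
def phantom_chunks(chunks, live_features):
    """Return ids of chunks that reference at least one feature the product no longer has.

    `chunks` is a list of dicts with hypothetical keys "id" and "features";
    `live_features` is an iterable of feature names that exist today.
    """
    live = set(live_features)
    return [chunk["id"] for chunk in chunks if set(chunk["features"]) - live]


# Illustrative data only: the feature names are made up.
chunks = [
    {"id": "export-guide#3", "features": ["export_button", "csv_format"]},
    {"id": "billing-faq#1", "features": ["invoice_page"]},
]
live_features = {"csv_format", "invoice_page", "download_menu"}

stale = phantom_chunks(chunks, live_features)
```

Here `stale` contains only `"export-guide#3"`, because `export_button` is no longer in the live set. The hard part in practice is producing the `features` tags and the live inventory, which this sketch takes as given.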

The Freshness-Relevance Tradeoff in RAG: Why You Can't Optimize Both at Query Time

· 11 min read
Tian Pan
Software Engineer

A user asks your assistant what the company's parental leave policy is. The bot returns 12 weeks, with a citation. The cited document was the right answer in 2023; HR posted an update last quarter that took it to 16. Both versions are in your knowledge base. Cosine similarity scored the 2023 version 0.87 and the 2024 version 0.84, because the older page has the cleaner phrasing and fewer hedges. The fresher document loses by 0.03 and the user gets a wrong answer that looks audited.

This is the freshness-relevance tradeoff, and the uncomfortable part is that it has no clean solution at query time. If you weight recency, you bias retrieval toward whatever was edited yesterday — which in most knowledge bases is the noisy, high-churn surface area that should not be the source of truth. If you don't weight recency, you ship answers grounded in documents that were superseded months ago. There is no single global knob that gets both right, and most teams discover this only after a few embarrassing answers leak past their eval suite.
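The arithmetic behind that tradeoff is easy to see in a toy example. The sketch below blends similarity with an exponential recency decay; the blending formula, the 180-day half-life, the document ages, and the third "edited yesterday" draft are all assumptions made for illustration. Only the 0.87 and 0.84 similarity scores come from the example above.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    similarity: float  # cosine similarity reported by the retriever
    age_days: float    # days since the document was last edited (assumed values)


def blended_score(hit, recency_weight, half_life_days=180.0):
    """Linear blend of similarity and an exponential recency decay (hypothetical formula)."""
    recency = 0.5 ** (hit.age_days / half_life_days)
    return (1.0 - recency_weight) * hit.similarity + recency_weight * recency


def rank(hits, recency_weight):
    """Return doc ids ordered from best to worst blended score."""
    ordered = sorted(hits, key=lambda h: blended_score(h, recency_weight), reverse=True)
    return [h.doc_id for h in ordered]


hits = [
    Hit("leave-policy-2023", 0.87, age_days=700),
    Hit("leave-policy-2024", 0.84, age_days=90),
    Hit("draft-edited-yesterday", 0.60, age_days=1),
]

no_recency = rank(hits, recency_weight=0.0)     # superseded 2023 page wins
some_recency = rank(hits, recency_weight=0.2)   # 2024 page wins
heavy_recency = rank(hits, recency_weight=0.6)  # yesterday's draft wins
```

With these made-up ages, a weight of 0.2 fixes the leave-policy query, but a weight of 0.6 promotes a weakly relevant draft simply because it was touched yesterday. The weight that fixes one query can break another, which is the sense in which a single global knob fails.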

Retrieval Cascade Failure: How Document Deletion Poisons Your RAG Pipeline

· 9 min read
Tian Pan
Software Engineer

A user asks your support bot when the refund window closes. The bot answers "60 days" with cheerful confidence and a citation. The policy page that says "60 days" was deleted from the CMS three months ago. The new policy is 14. Nobody on your team knows the bot is wrong until a customer escalates.

This is a retrieval cascade failure: the document is gone from the source of truth, but its embedding is still in the index, still ranking high on cosine similarity, still feeding the model a ghost. RAG pipelines treat embedding indexes as caches of source content, but most teams build the cache without building the invalidation. Inserts get all the engineering attention. Deletes get a TODO comment.
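The missing invalidation step can be as simple as a periodic reconciliation between the ids the source of truth still has and the ids the index still holds. This is a minimal sketch under that assumption; `InMemoryIndex` is a stand-in written for this example, not the API of any real vector store.

```python
class InMemoryIndex:
    """Toy stand-in for a vector index, keyed by document id."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, doc_id, vector):
        self._vectors[doc_id] = vector

    def delete(self, doc_id):
        self._vectors.pop(doc_id, None)

    def ids(self):
        return set(self._vectors)


def reconcile(index, source_ids):
    """Delete every indexed document that no longer exists in the source of truth.

    Returns the set of ids that were removed, so the caller can log or alert on it.
    """
    orphans = index.ids() - set(source_ids)
    for doc_id in orphans:
        index.delete(doc_id)
    return orphans


index = InMemoryIndex()
index.upsert("refund-policy-old", [0.1, 0.9])
index.upsert("refund-policy-new", [0.2, 0.8])

# The CMS now only contains the new policy page.
removed = reconcile(index, source_ids={"refund-policy-new"})
```

After the call, `removed` is `{"refund-policy-old"}` and the ghost embedding is gone. A real pipeline would more likely react to delete events than scan, but a scheduled scan like this is a reasonable backstop for events that get dropped.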

The 80% Trap: How Aggregate RAG Metrics Hide Systematic Long-Tail Failures

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline hit 80% retrieval accuracy on the eval set. The team ships it. Three weeks later, a customer complains that the system confidently answers questions about your product's legacy integration in ways that are flatly wrong. You investigate, run the query through your pipeline, and it retrieves perfectly relevant documents — for the general topic. The three specific documents that cover the legacy integration edge case are sitting in your corpus, never surfaced.

That 80% number was real. It was also nearly useless as a signal for what just happened.
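A quick way to see how the aggregate hides the tail is to break the same eval results down by slice. The numbers below are synthetic and chosen only to reproduce an 80% overall figure; the slice names and the `(slice, hit)` record shape are assumptions for this sketch.

```python
from collections import defaultdict


def hit_rate_by_slice(results):
    """results: iterable of (slice_name, hit) pairs, where hit is a bool."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [hits, count]
    for slice_name, hit in results:
        totals[slice_name][0] += int(hit)
        totals[slice_name][1] += 1
    return {name: hits / count for name, (hits, count) in totals.items()}


# Synthetic eval set: 90 general queries, 10 legacy-integration queries.
results = (
    [("general", True)] * 80
    + [("general", False)] * 10
    + [("legacy-integration", False)] * 10
)

overall = sum(hit for _, hit in results) / len(results)
by_slice = hit_rate_by_slice(results)
```

In this toy set `overall` is 0.8 while `by_slice["legacy-integration"]` is 0.0: every query in the long-tail slice fails and the headline number barely registers it.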

The Data Contract Problem in RAG: When Your Ingestion Pipeline Silently Breaks Retrieval Quality

· 10 min read
Tian Pan
Software Engineer

Your RAG system has a bug that doesn't throw exceptions. It doesn't spike your error rate. It doesn't show up in your latency dashboards. Instead, it quietly delivers confident, plausible-sounding answers that are wrong — and nobody notices for weeks.

This is the data contract problem in RAG: your ingestion pipeline is the source of truth for everything downstream, but it has no schema enforcement, no freshness guarantees, and no alerting when the shape of the world changes underneath it. Every time an upstream data source adds a field, a chunking parameter shifts, or an embedding model gets updated, your retrieval quality silently degrades.
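Schema enforcement at ingestion does not need heavy tooling to start. The sketch below validates each chunk record against a small contract before it reaches the index. The field names, the 4-dimensional embedding, the `embed-v1` model name, and the 2000-character limit are all placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkContract:
    required_fields: frozenset
    embedding_dim: int
    embedding_model: str
    max_chunk_chars: int


# Placeholder values; a real contract would come from your pipeline config.
CONTRACT = ChunkContract(
    required_fields=frozenset({"doc_id", "text", "embedding", "embedding_model"}),
    embedding_dim=4,
    embedding_model="embed-v1",
    max_chunk_chars=2000,
)


def violations(record, contract=CONTRACT):
    """Return a list of human-readable contract violations for one chunk record."""
    problems = []
    missing = contract.required_fields - set(record)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    embedding = record.get("embedding")
    if embedding is not None and len(embedding) != contract.embedding_dim:
        problems.append(f"embedding dim {len(embedding)} != {contract.embedding_dim}")
    model = record.get("embedding_model")
    if model is not None and model != contract.embedding_model:
        problems.append(f"embedding model {model!r} != {contract.embedding_model!r}")
    text = record.get("text")
    if text is not None and len(text) > contract.max_chunk_chars:
        problems.append(f"chunk length {len(text)} > {contract.max_chunk_chars}")
    return problems


good = {"doc_id": "a", "text": "hello", "embedding": [0.1, 0.2, 0.3, 0.4], "embedding_model": "embed-v1"}
drifted = {"doc_id": "b", "text": "hello", "embedding": [0.1, 0.2, 0.3], "embedding_model": "embed-v2"}
```

`violations(good)` returns an empty list, while `violations(drifted)` reports both the dimension mismatch and the model change. Rejecting or quarantining records with a non-empty result, and alerting on the rate, turns a silent degradation into a visible one.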

Eighty percent of enterprise RAG projects experience critical failures in production. The most insidious of those failures don't announce themselves.

The Knowledge Half-Life Problem: Why Your RAG System Is Already Wrong

· 9 min read
Tian Pan
Software Engineer

Your RAG system passed all the retrieval benchmarks. Precision looks solid. The LLM-as-judge eval scores are green. And yet, somewhere in your index, there is a document describing an API endpoint that was deprecated eight months ago, a pricing tier that no longer exists, and a compliance policy that was superseded by new regulations in Q3. Your retriever has no idea. Semantic similarity has no concept of time.

This is the knowledge half-life problem: the silent failure mode where RAG systems appear healthy on every metric you're measuring while serving increasingly stale answers to users. Seventy-three percent of organizations report accuracy degradation in RAG deployments within 90 days. The cause is not poor retrieval architecture or embedding model quality, but knowledge staleness that no one modeled as a reliability concern.

The Embedding Fine-Tuning Gap: Generic Vectors Don't Know What Relevant Means in Your Domain

· 11 min read
Tian Pan
Software Engineer

Your RAG pipeline looks solid on paper: chunking is clean, the vector store is indexed, latency is acceptable. But users keep complaining that the results are wrong — not completely wrong, just slightly wrong in ways that matter. The retrieved passage discusses the right concept but from the wrong time period. It covers the right topic but from the wrong jurisdiction. It mentions the right product but is missing the inventory signal that would make it actually useful.

This is the embedding fine-tuning gap. Generic embedding models are trained to encode semantic similarity — the property of two texts meaning roughly the same thing. That's not the same as relevance. Relevance is domain-specific, context-sensitive, and often invisible to a model trained on web-scale generic corpora.

When RAG Makes Your AI Worse: The Creativity-Grounding Tradeoff

· 8 min read
Tian Pan
Software Engineer

A team at a product company built a brainstorming assistant for their marketing department. They added RAG over their document corpus — campaign briefs, brand guidelines, competitor analyses — figuring the richer context would produce better ideas. Usage dropped within three weeks. The qualitative feedback: outputs felt "too safe," "too predictable," "like it just remixed our existing stuff." They removed retrieval from the brainstorming feature. Ideas improved. Engagement recovered.

This pattern repeats more often than practitioners admit. Retrieval-augmented generation has become the default architecture for grounding LLM outputs in facts, and for factual tasks it earns that default. But for generative tasks — ideation, creative writing, novel solution generation — adding a retrieval layer can silently cap the ceiling of what your model produces. Not because retrieval is broken, but because it's working exactly as designed.

Tool Discovery at Scale: Why Embedding-Only Retrieval Fails Past 20 Tools

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same problem on their fifth sprint: the agent can't reliably pick the right tool anymore. At ten tools, it mostly works. At twenty, accuracy starts to slip. At fifty, you're watching the agent call search_documents when it should call update_record, and the logs offer no explanation. The usual reaction is to tweak the tool descriptions — add more context, be more explicit, rewrite the examples. This occasionally helps. But it misses the root cause: flat embedding retrieval is architecturally wrong for large tool inventories, and better descriptions cannot fix an architectural problem.

Tool selection is retrieval, and retrieval has known scaling limits. Understanding those limits — and the structured metadata patterns that work around them — is what separates agent systems that hold up in production from ones that require constant babysitting.
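One structured-metadata pattern is to filter the tool inventory on a declared attribute before ranking by embedding similarity. The sketch below is a toy illustration: the `action` field, the two-dimensional vectors, and the query vector are invented for the example, and only the tool names `search_documents` and `update_record` come from the text above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical tool registry with a structured "action" field and toy embeddings.
TOOLS = [
    {"name": "search_documents", "action": "read", "vector": [0.9, 0.1]},
    {"name": "update_record", "action": "write", "vector": [0.8, 0.3]},
    {"name": "delete_record", "action": "write", "vector": [0.1, 0.9]},
]


def select_tool_flat(query_vector, tools=TOOLS):
    """Embedding-only selection over the whole inventory."""
    return max(tools, key=lambda t: cosine(query_vector, t["vector"]))["name"]


def select_tool(query_vector, required_action, tools=TOOLS):
    """Filter on structured metadata first, then rank the survivors by similarity."""
    candidates = [t for t in tools if t["action"] == required_action]
    if not candidates:
        return None
    return max(candidates, key=lambda t: cosine(query_vector, t["vector"]))["name"]


query = [1.0, 0.15]  # toy embedding of a request that needs a write
flat_choice = select_tool_flat(query)
filtered_choice = select_tool(query, required_action="write")
```

With these toy vectors the flat ranking picks `search_documents` because its description happens to sit closest to the query, while the filtered version picks `update_record`. Deciding `required_action` is its own problem (a classifier, a planner step, or the caller), which this sketch leaves out.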

Knowledge Age Routing: Matching Queries to the Right Temporal Layer in Production AI

· 9 min read
Tian Pan
Software Engineer

Here is a scenario that surfaces in production more often than anyone likes to admit. A user asks your AI assistant what the current interest rate policy is. Your RAG system fetches a highly relevant Federal Reserve document (it scores 0.91 on semantic similarity) and the model confidently returns an answer. The answer is six months out of date. The RAG index was last refreshed in October. The parametric knowledge is older still. A live API call would have returned the correct current figure in 400 milliseconds, but nobody wired up the routing logic to ask: how old is this question's answer allowed to be?

That failure is not a retrieval failure. It is a temporal routing failure. The system had access to correct information somewhere in its stack. It just sent the query to the wrong layer.
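The routing question can be made explicit in a few lines. This is a minimal sketch, assuming each layer can report how stale its knowledge may be; the layer names and staleness figures are illustrative, and how you decide the maximum acceptable age for a given query is left out.

```python
from datetime import timedelta

# Layers ordered from cheapest to most expensive, with an assumed worst-case staleness.
LAYERS = [
    ("parametric", timedelta(days=365)),
    ("rag_index", timedelta(days=180)),
    ("live_api", timedelta(0)),
]


def route(max_answer_age, layers=LAYERS):
    """Return the cheapest layer whose staleness fits within the allowed answer age."""
    for name, staleness in layers:
        if staleness <= max_answer_age:
            return name
    # Nothing is fresh enough; fall back to the freshest layer available.
    return layers[-1][0]


current_rate_query = route(timedelta(days=1))
historical_query = route(timedelta(days=400))
```

A question about the current interest rate tolerates about a day of staleness, so it routes to `live_api`; a question about settled history can be served from `parametric`. The point of the sketch is only that the age budget is an input to routing rather than an afterthought.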

The 'What Changed' Query Is the RAG Question Your Index Can't Answer

· 10 min read
Tian Pan
Software Engineer

A user asks your assistant, "what changed about our refund policy this quarter?" The system returns a confident, well-formatted summary of the current refund policy. The user nods, closes the chat, and acts on information that has nothing to do with the question they asked. Nothing in your eval suite caught this. Nothing in your faithfulness metric flagged it. The retrieval looked perfect: it returned highly relevant chunks. The synthesis looked perfect: it cited every chunk it used. The only problem is that the question was about change, and your index has no concept of change.

This is the failure mode that vector-similarity retrieval cannot fix by tuning. Two versions of the same document have nearly identical embeddings. That is what good embeddings do: they collapse semantically equivalent text into the same neighborhood. So when you ask "what changed," the retriever returns one of the versions, the LLM summarizes that version, and the answer silently implies that nothing changed. The user cannot tell. Your eval set probably cannot tell either, because your eval set is built around "what is X" questions, not "what's different about X now."
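Answering a change question requires both versions and an explicit comparison, which is something similarity ranking never performs. The sketch below assumes you keep prior versions of a document around (the policy text here is invented) and uses Python's standard `difflib` to extract only the lines that differ.

```python
import difflib


def changed_lines(old_text, new_text):
    """Return only the removed ("- ") and added ("+ ") lines between two versions."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Invented policy text for illustration.
previous_version = (
    "Refunds are accepted within 60 days.\n"
    "Contact support to start a refund."
)
current_version = (
    "Refunds are accepted within 14 days.\n"
    "Contact support to start a refund."
)

changes = changed_lines(previous_version, current_version)
```

`changes` holds the old and new refund-window lines and omits the unchanged contact line. Passing that diff to the model, instead of a single retrieved version, gives it something that is actually about change.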

Your Embedding Model Choice Sets the Ceiling Your LLM Can't Raise

· 11 min read
Tian Pan
Software Engineer

A team I was advising had spent two months swapping LLMs in their RAG pipeline. Claude, GPT, Gemini, then back again. Each swap shaved a few percentage points off hallucination rate but never moved the needle on the metric that mattered: their support agents still couldn't find the right knowledge base article more than 60% of the time. They were tuning the wrong layer. The retriever was returning irrelevant chunks, and no amount of LLM cleverness can answer a question from documents the retriever never surfaced.

The embedding model is the part of a RAG system that decides what the LLM is even allowed to see. It draws the geometry of your corpus — which documents land near which queries in vector space. Once that geometry is wrong, the LLM is just a confident narrator of bad context. Swapping it for a smarter one usually makes the answers more articulate, not more correct.