141 posts tagged with "rag"

RAG Against a Phantom Inventory: When Your Corpus Describes Features Your Product Removed

May 14, 2026 · 11 min read

Tian Pan

Software Engineer

A customer asks your support agent how to do something. The agent retrieves three documentation chunks with high relevance scores, synthesizes a confident answer, and walks the customer through a five-step procedure that ends on a button that hasn't existed for four months. The customer files a ticket. The on-call engineer pulls the eval suite, finds it green, pulls the retrieval traces, finds them green too — the model didn't hallucinate, it faithfully quoted documentation describing a feature your product team renamed in the last quarterly release.

This is the failure mode I want to name: not a hallucination, not a retrieval miss, but a phantom inventory problem. Your retrieval corpus is a snapshot of a product surface that no longer exists. The vector store doesn't know the product changed. The eval suite doesn't know either. The only system that consistently catches it is the support ticket queue, and by the time a ticket is filed the customer has already been told to click a button that isn't there.

The Retrieval Citation Tax: Why Compliance Adds 30% to Your RAG Token Bill

May 14, 2026 · 10 min read

Tian Pan

Software Engineer

A team I talked to recently sold their legal-AI product into a Fortune 500 in-house counsel office and added one line to their system prompt: "every factual claim must include an inline citation to the retrieved source." The product roadmap allocated a 5% buffer on their token budget for the new behavior. Sixty days after the regulated tenant went live, finance flagged a 34% jump in monthly inference spend. Nobody had broken the product. Nobody had shipped new features. The compliance requirement that closed the deal also quietly rewrote the unit economics underneath it.

This is the retrieval citation tax, and almost every RAG system serving a regulated industry — legal, healthcare, finance, audit-bound enterprise — eventually pays it. The tax is structural, not a bug. It comes from the way citation discipline forces the model into a different generation regime, and it shows up nowhere on the procurement spec the customer signed.

The Embedding Migration Black Hole: How a Vector Model Bump Silently Rewrites Your Business Rules

May 13, 2026 · 11 min read

Tian Pan

Software Engineer

The migration ticket is one line: "Upgrade embedding model from v3-small to v3-large." The new model wins on the public benchmark by 12%. The pipeline change is six lines of Python. The team estimates two days of engineering plus a re-embedding job that runs over a weekend. Two months later, the duplicate-detection feature is producing twice as many false positives as it did before the swap, the "related items" carousel on the marketing site has quietly become a slop generator, and the semantic cache hit rate has fallen off a cliff because the threshold of 0.95 that worked perfectly in the old space now matches almost nothing.

Nobody touched those features. Nobody filed a bug. The model swap that the migration plan called "infrastructure" silently rewrote every business rule that consumed a similarity score.
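One way to make the threshold problem concrete is to treat every hard-coded similarity cutoff as something that has to be re-derived in the new space. The sketch below is an illustration only: it assumes you can score the same sample of item pairs under both models, and the function name and sample scores are hypothetical. It carries the old threshold over by rank, so the new cutoff accepts the same number of sampled pairs as the old one did.

```python
from bisect import bisect_left


def recalibrate_threshold(old_scores, new_scores, old_threshold):
    """Map a similarity threshold from the old embedding space to the new one.

    old_scores and new_scores are similarity scores for the SAME sample of
    item pairs under the old and new models (hypothetical inputs). The new
    threshold is chosen so it accepts as many sampled pairs as the old one.
    """
    if not old_scores or len(old_scores) != len(new_scores):
        raise ValueError("need equal-length, non-empty score lists")
    old_sorted = sorted(old_scores)
    new_sorted = sorted(new_scores)
    # Number of pairs the old threshold accepted (score >= threshold).
    accepted = len(old_sorted) - bisect_left(old_sorted, old_threshold)
    if accepted == 0:
        return None  # the old threshold accepted nothing, so no rank to keep
    # Lowest new score that still accepts the same number of pairs.
    return new_sorted[len(new_sorted) - accepted]


# Made-up scores for six pairs, scored under each model.
old = [0.99, 0.97, 0.96, 0.90, 0.80, 0.60]
new = [0.91, 0.88, 0.86, 0.70, 0.55, 0.40]
print(recalibrate_threshold(old, new, 0.95))  # prints 0.86
```

Rank matching keeps the acceptance rate stable, not the precision. It is a starting point for re-validating each feature against labeled examples, not a replacement for doing so.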

The Five Definitions of 'Now' Inside Your LLM Prompt

May 10, 2026 · 11 min read

Tian Pan

Software Engineer

A customer support agent told a user "based on our latest pricing, as of today" and quoted last quarter's price sheet. The system prompt interpolated "today is {current_date}" correctly. The retrieval layer pulled the document with the highest freshness score. The model answered confidently. Every component did exactly what it was specified to do, and the user got a wrong answer that the on-call engineer could not reproduce because, by the time they replayed the trace at 9pm, "today" was a different day.

This is not a rare bug. It is a failure mode that lives in almost every production LLM pipeline because "now" is implicit in the prompt at five different layers, and those layers were authored at different times, by different people, against different definitions of the present. As long as a request runs synchronously from a foreground user session, the layers mostly agree. The moment the request is replayed for debugging, batch-processed overnight, run from an eval harness pinned in March, or queued and consumed an hour later, the layers start disagreeing — and the model produces an answer that is internally consistent within its prompt but externally wrong.
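A minimal sketch of one mitigation, assuming you control how each layer reads the time: capture "now" once when the request is received, store it with the trace, and have every layer read from that value instead of the wall clock. The `RequestClock` class and `build_prompt` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RequestClock:
    """A single 'now', captured when the user request was received."""

    received_at: datetime

    def prompt_date(self) -> str:
        # The date string interpolated into the system prompt.
        return self.received_at.date().isoformat()

    def is_fresh(self, doc_updated_at: datetime, max_age_days: int) -> bool:
        # Freshness is measured against the request time, not the wall clock.
        return (self.received_at - doc_updated_at).days <= max_age_days


def build_prompt(clock: RequestClock, question: str) -> str:
    return f"Today is {clock.prompt_date()}.\n\nQuestion: {question}"


# A replay hours or months later reuses the stored timestamp, so the prompt
# date and the freshness filter agree with the original run.
clock = RequestClock(datetime(2026, 5, 10, 14, 30, tzinfo=timezone.utc))
prompt = build_prompt(clock, "What is the current price?")
```

Persisting `received_at` alongside the trace is the design choice that matters here: a replay, a batch job, or an eval harness can then reconstruct the same present the user saw.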

The Freshness-Relevance Tradeoff in RAG: Why You Can't Optimize Both at Query Time

May 9, 2026 · 11 min read

Tian Pan

Software Engineer

A user asks your assistant what the company's parental leave policy is. The bot returns 12 weeks, with a citation. The cited document was the right answer in 2023; HR posted an update last quarter that took it to 16. Both versions are in your knowledge base. Cosine similarity scored the 2023 version 0.87 and the 2024 version 0.84, because the older page has the cleaner phrasing and fewer hedges. The fresher document loses by 0.03 of cosine similarity and the user gets a wrong answer that looks audited.

This is the freshness-relevance tradeoff, and the uncomfortable part is that it has no clean solution at query time. If you weight recency, you bias retrieval toward whatever was edited yesterday — which in most knowledge bases is the noisy, high-churn surface area that should not be the source of truth. If you don't weight recency, you ship answers grounded in documents that were superseded months ago. There is no single global knob that gets both right, and most teams discover this only after a few embarrassing answers leak past their eval suite.
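To see why a single recency weight is fragile, here is a toy sketch. The linear blend, the recency values, and the third document are assumptions made up for illustration; only the 0.87 and 0.84 similarity scores come from the example above.

```python
def rank(docs, recency_weight):
    """Order documents by similarity plus a global recency bonus."""
    return sorted(
        docs,
        key=lambda d: d["similarity"] + recency_weight * d["recency"],
        reverse=True,
    )


def top(docs, recency_weight):
    return rank(docs, recency_weight)[0]["name"]


# recency is a hypothetical 0..1 score where 1.0 means "edited yesterday".
docs = [
    {"name": "policy_2023", "similarity": 0.87, "recency": 0.0},
    {"name": "policy_update", "similarity": 0.84, "recency": 0.8},
    {"name": "edited_yesterday_note", "similarity": 0.80, "recency": 1.0},
]

for weight in (0.0, 0.1, 0.3):
    print(weight, top(docs, weight))
```

With no recency weight the superseded policy wins; with a small weight the update wins; with a larger one the freshly edited note wins. The window that works for this query depends on these particular scores, and nothing guarantees the same window works for the next query.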

Retrieval Cascade Failure: How Document Deletion Poisons Your RAG Pipeline

May 9, 2026 · 9 min read

Tian Pan

Software Engineer

A user asks your support bot when the refund window closes. The bot answers "60 days" with cheerful confidence and a citation. The policy page that says "60 days" was deleted from the CMS three months ago. The new policy is 14. Nobody on your team knows the bot is wrong until a customer escalates.

This is a retrieval cascade failure: the document is gone from the source of truth, but its embedding is still in the index, still ranking high on cosine similarity, still feeding the model a ghost. RAG pipelines treat embedding indexes as caches of source content, but most teams build the cache without building the invalidation. Inserts get all the engineering attention. Deletes get a TODO comment.
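The missing invalidation step can be sketched as a reconciliation pass. This is a simplified illustration that assumes the index can be enumerated by document ID and the CMS can list its live IDs; the in-memory dict and the document names stand in for a real vector store and are hypothetical.

```python
def reconcile(index, live_source_ids):
    """Remove index entries whose source document no longer exists.

    index maps document ID to embedding; live_source_ids is the set of IDs
    currently present in the source of truth. Returns the IDs that were removed.
    """
    stale = [doc_id for doc_id in index if doc_id not in live_source_ids]
    for doc_id in stale:
        del index[doc_id]
    return stale


index = {
    "refund-policy-60-days": [0.1, 0.9],
    "refund-policy-14-days": [0.2, 0.8],
    "shipping-faq": [0.7, 0.3],
}
live_ids = {"refund-policy-14-days", "shipping-faq"}

removed = reconcile(index, live_ids)
print(removed)  # prints ['refund-policy-60-days']
```

A periodic full reconciliation like this is a backstop. Propagating delete events from the CMS as they happen closes the window faster, but it requires the source system to emit them.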

The Attack Vector You Ship With Every Open RAG System

May 8, 2026 · 9 min read

Tian Pan

Software Engineer

Five carefully crafted documents. A corpus of 2.6 million. A 97% success rate at manipulating specific AI responses. That's the benchmark result from PoisonedRAG, presented at USENIX Security 2025 — and the attack didn't require model access, prompt injection at inference time, or any direct interaction with the system at all. The attacker simply contributed content to the knowledge base.

If your RAG system lets users add content — helpdesk tickets, wiki edits, customer feedback, shared notes — you've already shipped the attack vector. The question is whether you've also shipped the defenses.

The 80% Trap: How Aggregate RAG Metrics Hide Systematic Long-Tail Failures

May 8, 2026 · 9 min read

Tian Pan

Software Engineer

Your RAG pipeline hit 80% retrieval accuracy on the eval set. The team ships it. Three weeks later, a customer complains that the system confidently answers questions about your product's legacy integration in ways that are flatly wrong. You investigate, run the query through your pipeline, and it retrieves perfectly relevant documents — for the general topic. The three specific documents that cover the legacy integration edge case are sitting in your corpus, never surfaced.

That 80% number was real. It was also nearly useless as a signal for what just happened.
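A small sketch of how the aggregate hides the slice, using made-up eval results: the slice names and counts are hypothetical, chosen so the overall hit rate lands on 80%.

```python
from collections import defaultdict


def hit_rate_by_slice(results):
    """Compute retrieval hit rate per query slice.

    results is a list of (slice_name, hit) pairs, one per eval query.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, hit in results:
        totals[slice_name] += 1
        hits[slice_name] += int(hit)
    return {name: hits[name] / totals[name] for name in totals}


# 16 general queries (15 hits) and 4 legacy-integration queries (1 hit).
results = (
    [("general", True)] * 15
    + [("general", False)]
    + [("legacy_integration", True)]
    + [("legacy_integration", False)] * 3
)

overall = sum(hit for _, hit in results) / len(results)
by_slice = hit_rate_by_slice(results)
print(overall)   # prints 0.8
print(by_slice)
```

The overall number is 0.8 while the legacy slice sits at 0.25. Reporting the per-slice table next to the aggregate is what turns the metric into a signal for this kind of failure.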

Code-Specific RAG: Why General Retrieval Fails for Codebases

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building AI coding assistants reach for the same off-the-shelf RAG pipeline they use for document retrieval: chunk the source files by token count, embed the chunks, store them in a vector database, query by semantic similarity. The pipeline works well enough on prose. On code, it quietly fails — and the failures are hard to see in aggregate metrics, because the retrieved chunks look plausible right up until the model generates code with the wrong return type, calls a function with the wrong signature, or misses a dependency that only exists three hops down the call graph.

The problem isn't the embedding model or the vector database. It's the chunking strategy. Code is not prose. It has structural properties — dependency graphs, call chains, type signatures, scope hierarchies — that token-based chunking destroys before the retriever ever sees them. Fixing this requires rethinking how you decompose code before it ever reaches the embedding step.
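As one illustration of structure-aware decomposition (not necessarily the approach the full post recommends), the sketch below uses Python's standard `ast` module to split a source file at top-level function and class boundaries, so each chunk keeps its full signature and body. The `chunk_by_definition` helper and the sample source are hypothetical.

```python
import ast


def chunk_by_definition(source):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "kind": type(node).__name__,
                    # Exact source text of the definition, signature included.
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks


SAMPLE = '''
import math

def area(radius: float) -> float:
    return math.pi * radius ** 2

class Circle:
    def __init__(self, radius: float) -> None:
        self.radius = radius
'''

chunks = chunk_by_definition(SAMPLE)
print([(c["kind"], c["name"]) for c in chunks])
```

This keeps return types and parameter lists attached to the code they describe. It does not capture cross-file dependencies or call chains, which need additional metadata beyond what a single-file parse provides.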

The Cross-User Consistency Problem: When Your AI Gives Different Answers to the Same Question

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

Two analysts at the same company both ask your AI assistant: "What was our Q3 churn rate?" One gets 4.2%. The other gets 4.8%. Neither is wrong — they just queried at different times, in different session contexts, against a retrieval index that ranked slightly different chunks. The AI answered both confidently, without hedging, without flagging the discrepancy. The analysts go into the same meeting with different numbers and your tool has just become a liability.

This is the cross-user consistency problem, and it's one of the most common reasons enterprise AI deployments quietly lose trust. The failure isn't a hallucination in the classic sense — no facts were invented. The failure is that your system is non-deterministic at scale, and that non-determinism is invisible until two users compare notes.

The Domain Expert Bottleneck in RAG: Why Knowledge Curation Breaks Production AI

May 7, 2026 · 7 min read

Tian Pan

Software Engineer

Most teams building RAG systems spend their first month on the pipeline — chunking strategy, embedding model selection, vector store configuration, retrieval tuning. They get that working. The demo passes. Stakeholders are impressed.

Then six months later, the system starts quietly degrading. Support tickets reference wrong procedures. The bot cites a pricing tier that was retired in Q3. A customer gets a confident answer about a product feature that was deprecated before they even signed up. The pipeline is fine. The knowledge base is the problem.

Embedding Model Churn: When Your Provider Silently Invalidates Your Entire Vector Index

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

You spent weeks building a retrieval pipeline. Chunking strategy tuned, similarity thresholds calibrated, user feedback looking positive. Then one Monday morning, without any deployment on your end, retrieval quality starts degrading. Queries that used to surface the right documents now return loosely related noise. No error logs. No exceptions. The pipeline runs clean.

What changed is that your embedding provider updated its model. Your entire vector index, millions of documents painstakingly embedded, is now populated with vectors from a coordinate system that no longer matches what your query encoder produces. The result is not a crash. It's invisible garbage.
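One hedged way to turn the silent failure into a loud one is a canary check: store the embeddings of a few fixed probe texts at index time, re-embed them on a schedule, and alert when the fresh vectors no longer match the stored ones. In the sketch below, `embed` stands for whatever function calls your provider; the two fake embedders, the probe texts, and the 0.999 tolerance are assumptions for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def drifted_probes(embed, stored_probes, min_similarity=0.999):
    """Return probe texts whose fresh embedding no longer matches the stored one."""
    return [
        text
        for text, stored in stored_probes.items()
        if cosine(embed(text), stored) < min_similarity
    ]


# Stand-ins for the provider before and after a silent model update.
def old_embed(text):
    return [float(len(text)), 1.0]


def new_embed(text):
    return [1.0, float(len(text))]


probes = ["refund policy", "reset my password"]
stored = {text: old_embed(text) for text in probes}  # captured at index time

print(drifted_probes(old_embed, stored))  # prints []
print(drifted_probes(new_embed, stored))
```

If the check fires, the index and the query encoder disagree, and the safe response is to pin the old model if the provider allows it or to re-embed the corpus before serving further queries.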
