163 posts tagged with "rag"

The Marketing Page Your RAG Cited as an Engineering Spec

· 9 min read
Tian Pan
Software Engineer

A support engineer pastes a customer ticket into your internal assistant. The question is sharp: "Does our API support multi-region writes on the free tier?" The assistant comes back instantly, citing a chunk it retrieved with 0.91 cosine similarity. The answer is yes. The chunk is from a landing page written by marketing in 2023 to win a head-to-head against a competitor. Engineering removed multi-region writes from the free tier eighteen months ago and posted a terse internal RFC that nobody linked from a customer-facing page. The RFC is also in the vector store. It scored 0.74.

The assistant didn't hallucinate. It retrieved the highest-scoring document and faithfully grounded its answer in the text. The retriever did its job. The job was wrong.
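
One way to picture the gap is a re-rank step that scales similarity by how authoritative the source is. This is a minimal sketch, not the post's method: the `source_type` metadata field and the weights in `AUTHORITY` are assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical authority weights per source type; the values are illustrative.
AUTHORITY = {"engineering_rfc": 1.0, "marketing_page": 0.5}


@dataclass
class Chunk:
    text: str
    similarity: float  # cosine similarity reported by the retriever
    source_type: str   # assumed metadata attached at ingestion time


def rerank(chunks):
    """Order chunks by similarity scaled by the authority of their source."""
    return sorted(
        chunks,
        key=lambda c: c.similarity * AUTHORITY.get(c.source_type, 0.5),
        reverse=True,
    )


chunks = [
    Chunk("Landing page: multi-region writes on every tier", 0.91, "marketing_page"),
    Chunk("RFC: multi-region writes removed from free tier", 0.74, "engineering_rfc"),
]
top = rerank(chunks)[0]
print(top.source_type)
```

With these weights the RFC outranks the landing page even though its raw similarity is lower. The weights themselves are a policy choice that someone has to own.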

Your Embeddings Don't Know the Contractor Was Off-Boarded

· 9 min read
Tian Pan
Software Engineer

A contractor finished a six-month engagement last quarter. HR ran the off-boarding checklist: SSO disabled, laptop wiped, GitHub seat removed, Slack archived, Notion access revoked. Compliance signed off. Six weeks later, an internal RAG assistant answered a question by quoting a confidential strategy document the contractor had authored — and the chunk it cited was still tagged with the contractor's user ID in the vector store's allow-list. Nothing in the access logs of the source-of-truth ever recorded a read, because there was no read. The retrieval came from a copy of the data that nobody wired into the off-boarding flow.

This is the structural problem nobody puts on the architecture diagram. Your vector index is not just a similarity-search engine. It is a permission cache — a derived store of who-can-see-what, frozen at the moment you ran your embedding job — and almost nobody is invalidating it the way they invalidate everything else.
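
A common mitigation is to treat the index's permission metadata as a hint and re-check access against the live system at query time. The sketch below assumes a hypothetical `can_read(user_id, doc_id)` callable backed by the source-of-truth permission system; the `Hit` type and the in-memory ACL are stand-ins for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    doc_id: str
    score: float


def filter_by_live_acl(hits, user_id, can_read):
    """Keep only hits whose source document the user can read right now.

    `can_read` is a hypothetical call into the live permission system,
    not into metadata frozen in the vector store.
    """
    return [h for h in hits if can_read(user_id, h.doc_id)]


# Stand-in for the live permission system: the contractor is already revoked.
live_acl = {("alice", "strategy-doc")}


def can_read(user_id, doc_id):
    return (user_id, doc_id) in live_acl


hits = [Hit("c1", "strategy-doc", 0.88)]
print(filter_by_live_acl(hits, "contractor", can_read))
```

The extra lookup costs latency on every query, which is the usual trade against serving from a stale permission snapshot.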

The Agent That Read Last Week's Slack Like It Was Yesterday

· 10 min read
Tian Pan
Software Engineer

Your operations agent answers a question about the upcoming launch by quoting a Slack message that says "we'll ship tomorrow." The agent treats that as a present-tense plan and starts writing comms. The message was posted six weeks ago. The ship happened. The retrieval pipeline pulled the right chunk by every metric you measure — semantic similarity to "launch date," top-1 confidence above your threshold, source channel matching the project — and the agent built a plan on a sentence that meant something only inside the meeting where it was written.

The bug is not in the model. The bug is that "tomorrow" is not a date. It is a pointer to a clock, and the clock the message was written against is not the clock the agent is reading it on. Your retrieval pipeline indexed the body of the message and discarded the frame.
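
One way to keep the frame is to resolve relative dates against the message timestamp at ingestion, before the text is embedded. A minimal sketch, assuming the pipeline has the post time available; it handles only three words, and the function name and sample date are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Only a handful of relative terms, for illustration.
OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}
PATTERN = re.compile(r"\b(yesterday|today|tomorrow)\b", re.IGNORECASE)


def anchor_relative_dates(text, posted_at):
    """Annotate relative day words with the absolute date they pointed to."""

    def annotate(match):
        word = match.group(0)
        resolved = posted_at.date() + timedelta(days=OFFSETS[word.lower()])
        return f"{word} [{resolved.isoformat()}]"

    return PATTERN.sub(annotate, text)


posted = datetime(2025, 3, 3, 15, 0, tzinfo=timezone.utc)  # illustrative timestamp
indexed = anchor_relative_dates("we'll ship tomorrow", posted)
print(indexed)  # we'll ship tomorrow [2025-03-04]
```

The annotated text gives the agent an absolute date to compare against the current clock instead of a word that silently rebinds every day.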

The Embedding That Aged Out of Meaning

· 9 min read
Tian Pan
Software Engineer

You embedded the knowledge base eighteen months ago. The model has not changed. The chunks have not changed. The index is healthy, the latency is fine, the recall dashboard is a flat line at 0.86. And yet support is quietly pasting the wrong article links into ticket replies, the sales bot keeps surfacing a deprecated SKU when a prospect asks about the new one, and an internal user just told you the assistant "feels dumber" without being able to say why.

Nothing broke. Your embeddings aged. The word post used to mean blog post in your domain; now half the corpus uses it for a Slack post, a forum post, and a job posting, and your eighteen-month-old vectors still treat it as one concept. The model that encoded those vectors never saw the new senses, never saw the new product names, never saw the rebrand, never saw the regulation that introduced three new terms your customers now use without thinking. The retrieval system answers the question it knows how to answer, which is no longer the question your users are asking.

The Token Budget Is a Product Decision, Not a Config Value

· 10 min read
Tian Pan
Software Engineer

Somewhere in your codebase there is a line that looks like retriever.search(query, top_k=8). An engineer wrote that 8 in an afternoon. It was never reviewed by anyone outside the team, never appeared in a spec, and has never been revisited. That single integer decides how much of your context window goes to retrieved documents instead of conversation history, how much each request costs, how slow the response feels, and — because of how language models actually behave at length — how accurate the answer is.

That is a product decision. It is sitting in a hard-coded keyword argument.
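
One way to make the decision visible is to derive `top_k` from an explicit budget instead of typing the integer directly. This is a sketch under assumptions: the `ContextBudget` type, its field names, and every number below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    context_window: int    # total tokens the model accepts
    reserved_output: int   # tokens held back for the answer
    reserved_history: int  # tokens held back for conversation history
    reserved_system: int   # tokens held back for the system prompt

    @property
    def retrieval_tokens(self):
        """Tokens left for retrieved documents after the reservations."""
        reserved = self.reserved_output + self.reserved_history + self.reserved_system
        return max(0, self.context_window - reserved)


def top_k_for(budget, avg_chunk_tokens, max_k=20):
    """Largest number of average-sized chunks that fits the retrieval share."""
    return min(max_k, budget.retrieval_tokens // avg_chunk_tokens)


budget = ContextBudget(
    context_window=8000,
    reserved_output=1000,
    reserved_history=2500,
    reserved_system=500,
)
k = top_k_for(budget, avg_chunk_tokens=500)
print(k)  # 8
```

The result is the same 8, but now each reservation is a named number that can be reviewed, argued over, and changed on purpose.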

When RAG Should Have Been a JOIN

· 9 min read
Tian Pan
Software Engineer

A support team asked their new AI assistant a simple question: "Which enterprise customers opened a ticket last week?" The assistant came back with a confident, fluent answer naming six accounts. Five were right. One had churned two months ago, and one enterprise account that had filed three tickets was missing entirely. Nobody caught it until a renewal call went sideways.

The bug was not in the model. It was in the architecture. Somewhere in the design review, a question with hard predicates — a plan tier, a date range, a ticket count — got routed to a vector index. The team had a retrieval system, so they retrieved. They embedded the ticket records, embedded the question, and asked cosine similarity to do the job of a WHERE clause. It cannot. It never could.

This is one of the most common and least discussed failure modes in production AI systems: reaching for semantic search when the real query was relational. The data lived in tidy rows with foreign keys. The answer was one JOIN away. Instead it went through an embedding model, and the precision evaporated.
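
For contrast, here is roughly what the relational version of that question looks like. The schema, table names, and rows are invented for illustration and use an in-memory SQLite database; the real system's tables are not described in the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY, name TEXT, plan TEXT, churned INTEGER
    );
    CREATE TABLE tickets (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        opened_on TEXT
    );
    INSERT INTO customers VALUES
        (1, 'Acme', 'enterprise', 0),
        (2, 'Globex', 'enterprise', 1),
        (3, 'Initech', 'free', 0);
    INSERT INTO tickets VALUES
        (1, 1, '2025-03-04'),
        (2, 1, '2025-03-05'),
        (3, 2, '2025-01-02'),
        (4, 3, '2025-03-04');
    """
)

# Hard predicates (plan tier, churn status, date range) belong in WHERE.
rows = conn.execute(
    """
    SELECT c.name, COUNT(t.id) AS ticket_count
    FROM customers AS c
    JOIN tickets AS t ON t.customer_id = c.id
    WHERE c.plan = 'enterprise'
      AND c.churned = 0
      AND t.opened_on BETWEEN ? AND ?
    GROUP BY c.id, c.name
    ORDER BY c.name
    """,
    ("2025-03-03", "2025-03-09"),
).fetchall()
print(rows)  # [('Acme', 2)]
```

Every row either satisfies the predicates or it does not, so the churned account and the free-tier account cannot leak into the answer on the strength of textual resemblance.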

The Embedding Upgrade That Silently Re-Ranks Your Entire Corpus

· 9 min read
Tian Pan
Software Engineer

A new embedding model lands on the leaderboard. It scores higher than the one you shipped eighteen months ago, the API is a one-line change, and the dimensions even match. Someone files a ticket: "upgrade embedding model." It looks like swapping a logging library.

It is not. The embedding model is not a component of your retrieval system — it is the coordinate system your retrieval system lives in. Changing it does not improve your index. It invalidates it. And the cruelest part is that nothing crashes. No exception, no failed health check. Your search just starts returning subtly different results, and "subtly different" in a RAG pipeline means a different document feeds the model, which means a different answer reaches the user.
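
A simple way to see the re-ranking before users do is to run the same queries through both models and measure how much of the top-k survives. The sketch below uses hand-written toy vectors in place of real model output; `overlap_at_k` is a hypothetical helper name, and a real check would run over a representative query set.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, doc_vecs, k):
    """IDs of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def overlap_at_k(old_ids, new_ids):
    """Fraction of the old top-k that is still in the new top-k."""
    return len(set(old_ids) & set(new_ids)) / len(old_ids)


# Toy vectors standing in for the same three documents under two models.
old_docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
new_docs = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [0.9, 0.1]}

old_top = top_k([1.0, 0.0], old_docs, k=2)
new_top = top_k([1.0, 0.0], new_docs, k=2)
overlap = overlap_at_k(old_top, new_top)
print(old_top, new_top, overlap)
```

Note that each query is embedded with the same model as the index it searches. Mixing a new-model query with old-model document vectors compares coordinates from two different spaces.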

The Semantic Cache That Confidently Returns the Wrong Answer

· 9 min read
Tian Pan
Software Engineer

Two support users ask your agent almost the same question within a minute of each other. The first asks, "What's our refund window for EU orders?" The second asks, "What's our refund window for US orders?" The embeddings of those two sentences sit a hair's breadth apart — same length, same structure, one two-letter token of difference. Your semantic cache, tuned to a similarity threshold that looked perfectly reasonable in the demo, scores them as a match. The second user gets the first user's answer. The EU's 14-day cooling-off period is presented to a US customer as fact, in fluent prose, with no asterisk.

Nobody gets paged for this. The cache returned a 200. Latency was great. The cost dashboard shows a hit, which is the outcome everyone wanted. The only signal that anything went wrong is a customer acting on policy that does not apply to them — and that signal arrives days later, through a refund dispute, not through your monitoring.

This is the failure mode that makes semantic caching different from every cache you have built before. An exact-match cache can be stale, but it is never wrong — the key either matches or it doesn't. A semantic cache trades that guarantee away on purpose. It is designed to return answers for keys it has never seen, and the price of that latency win is a correctness risk that most teams never put a number on.
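
One hedge against this is to require an exact match on the terms that change the answer before a similarity hit is allowed to count. In the sketch below, the `REGIONS` pattern is a hypothetical list of such terms, and `difflib.SequenceMatcher` is only a stand-in for embedding similarity so the example runs without a model.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of terms that must match exactly for a cache hit.
REGIONS = re.compile(r"\b(EU|US|UK)\b")


def guard_key(query):
    """Set of answer-changing terms found in the query."""
    return frozenset(REGIONS.findall(query))


def similarity(a, b):
    # Stand-in for embedding cosine similarity.
    return SequenceMatcher(None, a, b).ratio()


class GuardedSemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # (query, guard, answer)

    def put(self, query, answer):
        self.entries.append((query, guard_key(query), answer))

    def get(self, query):
        guard = guard_key(query)
        for cached_query, cached_guard, answer in self.entries:
            # Both conditions must hold: same guard terms AND similar enough.
            if cached_guard == guard and similarity(query, cached_query) >= self.threshold:
                return answer
        return None


eu = "What's our refund window for EU orders?"
us = "What's our refund window for US orders?"
cache = GuardedSemanticCache()
cache.put(eu, "answer for EU orders")
print(cache.get(us))  # None
```

The two questions clear the similarity threshold on their own, so the guard is what turns the US query into a miss. The guard only covers terms someone thought to list, which is its main limitation.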

Your Vector Index Is a Cache With No Invalidation Strategy

· 9 min read
Tian Pan
Software Engineer

A vector index feels like a database. You write documents into it, you query it, it returns results. But it is not a database — it is a derived, denormalized copy of data that lives somewhere else. Your source of truth is a wiki, a ticket system, a CRM, a folder of PDFs. The embeddings are a projection of that truth, frozen at the moment you ran the ingestion job.

That makes your vector index a cache. And like every cache, it goes stale. The difference is that most teams build a caching layer on purpose, with a TTL and an invalidation hook, while almost nobody builds a vector index on purpose as a cache. They build it as a "knowledge base" and then act surprised when it serves knowledge that stopped being true three weeks ago.
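
If the index is a cache, the smallest useful invalidation hook is a comparison between the source and what was recorded at embedding time. A minimal sketch, assuming the pipeline stores a content hash per document when it embeds; `plan_sync` and the document IDs are hypothetical.

```python
import hashlib


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_docs, indexed_hashes):
    """Compare source documents with hashes recorded at embedding time.

    Returns (to_embed, to_delete): documents that are new or changed,
    and indexed documents that no longer exist in the source.
    """
    to_embed = sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source_docs)
    return to_embed, to_delete


source_docs = {"wiki/setup": "setup steps, version 2", "wiki/faq": "unchanged text"}
indexed_hashes = {
    "wiki/setup": content_hash("setup steps, version 1"),
    "wiki/faq": content_hash("unchanged text"),
    "wiki/old": content_hash("page deleted from the wiki"),
}
to_embed, to_delete = plan_sync(source_docs, indexed_hashes)
print(to_embed, to_delete)
```

The delete list matters as much as the re-embed list: a page removed from the source keeps answering questions until its vectors are removed too.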

The Vector Index Has a Staleness SLO Nobody Set

· 10 min read
Tian Pan
Software Engineer

A user asks your agent what the current price tier is for an enterprise plan. The agent retrieves a chunk, reads it, and answers: "$2,000 per month." Confident, sourced, formatted nicely. The problem is that pricing changed four days ago. The number the agent quoted was true last week. The chunk it retrieved was embedded before the change, and the index has not caught up.

Nobody decided this would happen. There was no design review where someone said "the agent may answer from data up to four days old." There is just a re-indexing job that runs nightly, or weekly, and a content team that edits the source whenever they feel like it, and a gap between those two clocks that nobody measures. That gap is a service level objective. It exists whether or not you wrote it down. The only question is whether you set it on purpose or inherited it by accident.
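
Measuring that gap takes two timestamps per chunk: when the source last changed and when the chunk was last embedded. The sketch below is one possible definition of staleness, with a hypothetical 24-hour SLO and invented timestamps; a real pipeline would read both values from its own metadata.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
SLO = timedelta(hours=24)  # hypothetical target, set on purpose


def staleness(source_updated_at, embedded_at, now):
    """How long the index has been serving content older than the source."""
    if embedded_at >= source_updated_at:
        return timedelta(0)
    return now - source_updated_at


now = datetime(2025, 3, 10, 12, 0, tzinfo=UTC)
# chunk id -> (source last edited, chunk last embedded); illustrative values
chunks = {
    "pricing": (datetime(2025, 3, 6, 12, 0, tzinfo=UTC), datetime(2025, 3, 1, 2, 0, tzinfo=UTC)),
    "faq": (datetime(2025, 2, 1, 9, 0, tzinfo=UTC), datetime(2025, 3, 1, 2, 0, tzinfo=UTC)),
}

violations = {}
for chunk_id, (updated, embedded) in chunks.items():
    age = staleness(updated, embedded, now)
    if age > SLO:
        violations[chunk_id] = age
print(violations)
```

Here the pricing chunk has been out of date for four days and breaches the target, while the FAQ chunk was embedded after its last edit and counts as fresh.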

RAG Against a Phantom Inventory: When Your Corpus Describes Features Your Product Removed

· 11 min read
Tian Pan
Software Engineer

A customer asks your support agent how to do something. The agent retrieves three documentation chunks with high relevance scores, synthesizes a confident answer, and walks the customer through a five-step procedure that ends on a button that hasn't existed for four months. The customer files a ticket. The on-call engineer pulls the eval suite, finds it green, pulls the retrieval traces, finds them green too — the model didn't hallucinate, it faithfully quoted documentation describing a feature your product team renamed in the last quarterly release.

This is the failure mode I want to name: not a hallucination, not a retrieval miss, but a phantom inventory problem. Your retrieval corpus is a snapshot of a product surface that no longer exists. The vector store doesn't know the product changed. The eval suite doesn't know either. The only system that consistently catches it is the support ticket queue, and by the time a ticket is filed the customer has already been told to click a button that isn't there.

The Retrieval Citation Tax: Why Compliance Adds 30% to Your RAG Token Bill

· 10 min read
Tian Pan
Software Engineer

A team I talked to recently sold their legal-AI product into a Fortune 500 in-house counsel office and added one line to their system prompt: "every factual claim must include an inline citation to the retrieved source." The product roadmap allocated a 5% buffer on their token budget for the new behavior. Sixty days after the regulated tenant went live, finance flagged a 34% jump in monthly inference spend. Nobody had broken the product. Nobody had shipped new features. The compliance requirement that closed the deal also quietly rewrote the unit economics underneath it.

This is the retrieval citation tax, and almost every RAG system serving a regulated industry — legal, healthcare, finance, audit-bound enterprise — eventually pays it. The tax is structural, not a bug. It comes from the way citation discipline forces the model into a different generation regime, and it shows up nowhere on the procurement spec the customer signed.