171 posts tagged with "rag"

How PII Redaction Sentinels Quietly Collapse Your Vector Index

· 10 min read
Tian Pan
Software Engineer

A support engineer pulled up your RAG console to debug a complaint. The customer had asked "what does my account look like right now," the answer had come back coherent and confident, and it had been about somebody else's account entirely. The top-3 retrieved chunks all belonged to other customers. The engineer ran the same query against a fresh corpus snapshot to rule out indexing lag. Same result. Then she ran it against a snapshot from six months ago, before the privacy redactor had shipped. The right customer's chunk came back at rank 1.

The redactor was working as designed. Every name was a [NAME], every email an [EMAIL], every account number an [ACCOUNT]. The legal team had a clean audit trail and the security team had a closed compliance ticket. What nobody on either team had modeled was that those sentinels, dropped into the same syntactic slots across millions of documents, were being seen by the embedding model as ordinary tokens — tokens that co-occurred more reliably with each other than any real content did. The redactor had not just removed information. It had added a new, very strong signal that every redacted document shared and nothing else did.
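The effect is easy to see even without an embedding model. The sketch below is a crude stand-in: it measures plain token overlap (Jaccard similarity), not embedding distance, and the two sample documents are invented for illustration. It only shows the direction of the problem: two unrelated tickets share nothing before redaction and share the sentinels afterward.

```python
def tokens(text):
    """Lowercased whitespace tokens; a deliberately naive tokenizer."""
    return set(text.lower().split())

def jaccard(a, b):
    """Token-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Invented sample documents about two different customers.
raw_a = "Alice Chen alice@example.com reported a login failure"
raw_b = "Omar Diaz omar@example.com asked about an invoice"

# The same documents after a redactor that emits shared sentinels.
red_a = "[NAME] [EMAIL] reported a login failure"
red_b = "[NAME] [EMAIL] asked about an invoice"

overlap_before = jaccard(raw_a, raw_b)
overlap_after = jaccard(red_a, red_b)
```

Here the only tokens the redacted documents have in common are the sentinels themselves. A real embedding model will weigh them differently, so treat this as intuition rather than a measurement.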

The Citation Index Your Chunker Shifted by One When It Started Prefixing Line Numbers

· 11 min read
Tian Pan
Software Engineer

The chunker started prepending [line N] to every chunk. The eval went green. Every citation the model produced after that day pointed to the paragraph one position before the actual evidence, on every document, in the regulated industry the product serves. The team did not find out from the eval. The team found out from an auditor who looked at the cited sentence, read it, and pointed out that it contradicted the claim it was supposed to support.

This is the kind of regression that survives a code review, a manual QA pass on three sample documents, and a feature-flag rollout. None of those checks were wrong in isolation. They were all asking the same question — does a citation appear where one is expected — and none of them were asking the question the auditor asked, which is whether the citation points at the sentence the claim came from. The gap between those two questions is where the off-by-one lived for as long as it lived.

What makes this failure mode worth a separate write-up is not the bug itself. Off-by-one errors are old news. The interesting part is that the failure was produced by two systems that continued to agree on the structure of an integer while silently disagreeing about what the integer meant.
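One way to ask the auditor's question in a test is a round-trip check: label the paragraphs, resolve each label back, and confirm the resolved text is the text the label was attached to. The sketch below is hypothetical throughout. The `[line N]` format comes from the description above, while `label_paragraphs`, `resolve`, and the 1-based convention are assumptions made for illustration, not the team's actual code.

```python
import re

# Assumed label format, taken from the "[line N]" prefix described above.
LABEL = re.compile(r"^\[line (\d+)\] ")

def label_paragraphs(paragraphs):
    """Hypothetical chunker step: prefix each paragraph with a 1-based label."""
    return [f"[line {n}] {text}" for n, text in enumerate(paragraphs, start=1)]

def resolve(paragraphs, n):
    """Hypothetical resolver: map a cited 1-based label back to its paragraph."""
    if not 1 <= n <= len(paragraphs):
        raise IndexError(f"citation [line {n}] is out of range")
    return paragraphs[n - 1]

def mismatched_labels(paragraphs, resolver=resolve):
    """Return the labels whose resolved text differs from the labeled text."""
    bad = []
    for chunk in label_paragraphs(paragraphs):
        match = LABEL.match(chunk)
        n = int(match.group(1))
        attached_text = chunk[match.end():]
        try:
            resolved = resolver(paragraphs, n)
        except IndexError:
            resolved = None
        if resolved != attached_text:
            bad.append(n)
    return bad
```

A resolver that reads the same integer under a different convention, for example `lambda ps, n: ps[n - 2]`, passes any check that only asks whether a citation exists and fails this one on every label.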

The Citation URL That Resolved But No Longer Said What the Model Quoted

· 10 min read
Tian Pan
Software Engineer

A RAG agent answers a customer's regulatory question with a tidy paragraph and a citation. The verification layer fetches the URL, sees a 200 OK, ticks the box, and ships. Six months later a compliance audit pulls the transcript, clicks the same link, and finds a page that now says the opposite of what the agent quoted. The URL is fine. The quote is fine in the transcript. The two no longer match. The customer's compliance officer asks whether the agent fabricated the quote, and the team cannot prove it didn't, because the only surviving evidence of what the URL used to say is the agent's own assertion of what it said.

This is not a hallucination in the usual sense. The model retrieved real content, faithfully extracted a real sentence, and emitted a real URL that still resolves. Every link-checker on earth would call this citation valid. The audit fails anyway, because the verification layer was measuring the wrong property. Reachability is not fidelity. A URL is a pointer to a mutable document under someone else's editorial control, and the moment the document changes, every transcript that quoted it becomes a hallucination report waiting to happen.
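If reachability is not fidelity, the verification layer needs evidence of what the page said at answer time. A minimal sketch, assuming the pipeline can keep the fetched page text next to the transcript: `CitationRecord`, `record_citation`, and `audit` are hypothetical names, and the storage and retention policy are out of scope here.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class CitationRecord:
    url: str
    quote: str
    content_sha256: str  # hash of the page text as fetched at answer time
    snapshot: str        # the page text itself, kept as evidence

def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def record_citation(url, page_text, quote):
    """Refuse to cite a quote that is not literally in the fetched text."""
    if quote not in page_text:
        raise ValueError("quote not found in fetched page")
    return CitationRecord(url, quote, _digest(page_text), page_text)

def audit(record, current_page_text):
    """Compare the stored evidence against what the URL serves today."""
    return {
        "page_changed": _digest(current_page_text) != record.content_sha256,
        "quote_still_present": record.quote in current_page_text,
        "quote_present_at_answer_time": record.quote in record.snapshot,
    }
```

With a record like this, "the page changed after we quoted it" becomes a claim the team can demonstrate instead of one it can only assert.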

The Cost Dashboard Your Finance Team Built That Excluded the Embeddings Re-index

· 10 min read
Tian Pan
Software Engineer

Your finance team built a beautiful AI cost dashboard. Token spend, sliced by feature. Embedding spend, sliced by provider. Every quarter, the per-feature pane gets reviewed in a leadership meeting and somebody asks why the support-chat workflow is up 12%, and a product manager has a defensible answer. Every quarter, the per-provider pane gets reviewed in an infra meeting and somebody asks why OpenAI is up 8%, and a platform engineer has a defensible answer. And every quarter, the line that actually doubles your AI bill — the corpus re-index — lands in a third bucket called "infrastructure" that nobody reviews because nobody owns it.

That bucket is where forty percent of your AI spend goes to die unattributed. The teams who could have optimized it never see it. The teams who see it can't tell you which feature it serves. The dashboard is honest about every cost it can explain and silent about the cost it can't, which is exactly the cost that matters most.

The Embedding Deprecation That Halved Your Retrieval Recall Without a Deploy

· 10 min read
Tian Pan
Software Engineer

The most expensive embedding bug a RAG system can ship is the one where nothing in your repository changes. Your retrieval code is the same. Your index is the same. Your query path is the same. And one Tuesday in week six, somebody notices that the answers used to be better.

The provider posted a sunset notice for the embedding family your index was built against twelve months ago. The platform team filed it in a deprecations dashboard with a year of runway and moved on. The sunset path wasn't a hard cutoff — it was a quiet quality regression where the deprecated endpoint started routing to a "compatibility" successor that returned vectors in the same dimensionality and a subtly different semantic geometry. Query embeddings began drifting against the corpus you embedded a year ago. Recall@10 on your standing eval slid by 47% over six weeks. The team only traced it back when an unrelated quality dashboard crossed a threshold, dragging a senior engineer into a root-cause exercise that ended at an embedding endpoint no one on the call had touched in a year.
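A cheap guard against this kind of drift is a canary: keep the vectors for a handful of fixed probe texts from the day the index was built, re-embed the same texts on a schedule, and alert when they stop matching. The sketch below assumes `embed` is whatever callable wraps your provider; the `0.999` threshold and the probe set are placeholders to tune, not recommended values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-length vector")
    return dot / (norm_a * norm_b)

def drift_report(embed, baseline, min_similarity=0.999):
    """Return {probe_text: similarity} for probes that moved past the threshold.

    baseline maps each probe text to the vector stored at index build time.
    embed is a hypothetical callable: text -> list of floats.
    """
    drifted = {}
    for text, stored_vector in baseline.items():
        similarity = cosine(embed(text), stored_vector)
        if similarity < min_similarity:
            drifted[text] = similarity
    return drifted
```

Same dimensionality tells you nothing here; the check compares geometry, which is the property that changed.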

The RAG Dedup Step That Broke Silently and Flooded Your Top-K With Near-Duplicates

· 10 min read
Tian Pan
Software Engineer

A retrieval-augmented generation pipeline can degrade for weeks without a single metric noticing. The relevance scores look fine. The retrieval latency is unchanged. The eval slice that touches the broken topic moves a quarter of a point in the wrong direction, and your weekly review chalks it up to noise. Then someone reads the actual context window the model received for a customer ticket and sees the same paragraph three times — once in title case, once in lowercase, once with the punctuation stripped — and you understand that your top-five has secretly been a top-two for a month.

This is the class of failure where the system is doing exactly what it was told to do. The retriever is returning the most similar vectors to the query. Each of those vectors is genuinely about the right topic. The index has no idea that three of them came from the same paragraph indexed three ways, because the ingestion-time dedup pass that was supposed to catch that case is silently skipping it.
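For the specific variants described above (title case, lowercase, punctuation stripped), a dedup key computed on normalized text is enough. This is a minimal sketch, not the pipeline's actual dedup pass: `normalize` and `dedup` are illustrative names, and `string.punctuation` only covers ASCII punctuation, so curly quotes and similar characters would need extra handling.

```python
import hashlib
import string
import unicodedata

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Fold case, drop ASCII punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = text.translate(_STRIP_PUNCT)
    return " ".join(text.split())

def dedup_key(text):
    """Stable key shared by all surface variants of the same paragraph."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def dedup(chunks):
    """Keep the first chunk seen for each normalized key, preserving order."""
    seen = set()
    kept = []
    for chunk in chunks:
        key = dedup_key(chunk)
        if key not in seen:
            seen.add(key)
            kept.append(chunk)
    return kept
```

Exact-hash dedup like this will not catch paraphrases. It only closes the gap where the same paragraph is indexed under trivially different surface forms.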

The RAG Threshold Pinned to an Absolute Score the Embedding Upgrade Silently Moved

· 9 min read
Tian Pan
Software Engineer

A RAG pipeline ships with a reranker score threshold of 0.4. Anything below gets dropped from the prompt. Six months in, a routine index rebuild swaps the embedding model for a newer checkpoint in the same family — a transparent upgrade, the change log says. Two days later answer relevance falls 6%. The team blames the LLM, runs a model bake-off, finds no candidate that recovers the loss, and spends a quarter chasing a regression that lives in none of the models they were comparing.

The regression lives in the gate. The reranker — untouched, same checkpoint, same weights — is now scoring a different candidate set. The new embeddings pull different chunks into the top-50, the reranker scores them lower on its own calibration, and the gate at 0.4 drops 37% more candidates than it did the week before. The number 0.4 didn't change. What 0.4 meant changed.
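Two things would have surfaced this sooner: tracking the gate's drop rate as a metric, and gating relative to each query's own score distribution instead of a fixed number. The sketch below illustrates both under stated assumptions: scores are positive, and the `ratio` and `min_keep` values are placeholders. The score lists in the usage note are invented.

```python
def gate_absolute(scores, threshold=0.4):
    """The gate as described: drop anything below a fixed score."""
    return [s for s in scores if s >= threshold]

def drop_rate(scores, threshold=0.4):
    """Fraction of candidates the absolute gate discards; worth alerting on."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s < threshold) / len(scores)

def gate_relative(scores, ratio=0.5, min_keep=3):
    """Keep candidates within a ratio of the top score, with a floor on count.

    Assumes positive scores. ratio and min_keep are illustrative defaults.
    """
    if not scores:
        return []
    ranked = sorted(scores, reverse=True)
    cutoff = ranked[0] * ratio
    kept = [s for s in ranked if s >= cutoff]
    return kept if len(kept) >= min_keep else ranked[:min_keep]
```

Take `[0.9, 0.7, 0.5, 0.45, 0.3]` before the upgrade and `[0.7, 0.5, 0.38, 0.36, 0.2]` after: the absolute gate goes from keeping four candidates to keeping two, while the relative gate still keeps four. A relative gate has its own failure mode (it happily keeps the best of a bad set), so the drop-rate metric is the part worth adopting regardless.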

Your RAG Corpus Trust Boundary Is Whoever Can Write to Its Sources

· 10 min read
Tian Pan
Software Engineer

A support agent gives the right answer to the wrong audience. A customer asks about their account, the model dutifully calls a URL-fetch tool, and a snapshot of that account's context lands on a server the security team has never heard of. No credentials leaked. No API keys exposed. The exfiltration vector was a five-star product review written by a competitor three weeks earlier, retrieved as relevant context because the visible praise actually was relevant to the user's question.

This is the failure mode that breaks the mental model engineers carry from years of web security. The threat model in RAG systems is usually phrased as "we own the corpus" because we own the ingestion pipeline, the embedding model, and the vector database. But owning the code that pulls the content is not the same as owning the content. If your corpus includes any source whose writes are not gated by your authorization, you have handed a prompt-engineering channel to whoever can post.

Retrieval Pipeline Residency: The Embedding That Crossed the Border Your LLM Call Didn't

· 9 min read
Tian Pan
Software Engineer

The team that ships "AI for EU customers" usually ships exactly one residency control: an inference endpoint pinned to an EU region. The procurement team gets a DPA, the architecture diagram gets a green checkmark next to "model hosted in Frankfurt," and the launch proceeds. What the diagram doesn't show is that the customer's verbatim query gets vectorized by a US-hosted embedding API on its way to the model, that the vector store the query is matched against has its operational plane in us-east-1, that the rerank model is a third-party SaaS deployed wherever the vendor chose, that the prompt cache is keyed regionally on hits and globally on misses, and that the trace store logging the retrieved chunks has a 30-day retention bucket that replicates cross-region for redundancy.

The inference layer respects residency. The retrieval pipeline doesn't even know it's a participant.

This is the gap where most "GDPR-compliant" RAG deployments fail an audit the team didn't realize was coming. The fix isn't another control on the model call — it's recognizing that data residency is a property of every component the customer's bytes touch, and that the team owning "the LLM" owns at most one of the six surfaces involved.
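Treating residency as a property of every component suggests an inventory that can be checked mechanically. A minimal sketch, assuming you can enumerate each surface and the regions it runs or replicates in: the component names mirror the list above, the region assignments are invented for illustration, and `residency_violations` is a hypothetical helper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    name: str
    regions: frozenset  # every region the component runs in or replicates to

def residency_violations(components, allowed_regions):
    """Map each offending component to the regions outside the allowed set."""
    violations = {}
    for component in components:
        outside = component.regions - allowed_regions
        if outside:
            violations[component.name] = sorted(outside)
    return violations

# Invented inventory; the region values are assumptions for the example.
pipeline = [
    Component("inference", frozenset({"eu-central-1"})),
    Component("embedding_api", frozenset({"us-east-1"})),
    Component("vector_store", frozenset({"eu-central-1", "us-east-1"})),
    Component("trace_store", frozenset({"eu-central-1", "us-west-2"})),
]

report = residency_violations(pipeline, frozenset({"eu-central-1"}))
```

The hard part is the inventory, not the check. Replication targets, cache fallbacks, and vendor-chosen regions are exactly the entries that tend to be missing.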

The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter

· 10 min read
Tian Pan
Software Engineer

You ran the experiment cleanly. Two arms, one feature flag, a clear metric, the stats team blessed the design. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning your readout meeting happened. Your A/B test was real. The substrate it ran on was not.

This is the failure mode every team running retrieval-augmented generation eventually walks into and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed against. The experiment you thought was measuring a feature change was measuring a retrieval regime change with the feature flag noise on top.


The Middle-Context Blindness Your Retrieval Pipeline Never Measured

· 8 min read
Tian Pan
Software Engineer

The retrieval logs are clean. Recall@10 against your hand-labeled query set has not regressed in months. The answer-quality dashboard says faithfulness is holding above 90%. Then a customer pastes a question into your support agent, the gold passage is right there at position 7 of 12 in the assembled prompt, and the model answers as if it were never retrieved.

The retrieval team will tell you the chunk was there. The prompt team will tell you the prompt was correct. Both are technically right. The model attended to the first thousand tokens, attended to the last thousand tokens, and skimmed the middle band where the answer lived. Your pipeline is hitting a positional attention bias that neither team owns, neither dashboard tracks, and neither benchmark catches.
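Two small helpers make this measurable and partly mitigable. The first builds one prompt ordering per position so you can score answers with the gold passage at each slot; the second reorders ranked chunks so the strongest land at the edges and the weakest in the middle. Both are sketches with hypothetical names, and whether edge placement helps your particular model is something to verify with the sweep, not assume.

```python
def gold_at_each_position(distractors, gold):
    """Return one chunk ordering per slot, with the gold passage moved through."""
    return [
        distractors[:i] + [gold] + distractors[i:]
        for i in range(len(distractors) + 1)
    ]

def edges_first(chunks_by_rank):
    """Reorder best-first chunks so top ranks sit at the start and end.

    Ranks alternate between the front and the back, which pushes the
    lowest-ranked chunks toward the middle of the prompt.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_rank):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked `r1` to `r5`, `edges_first` yields `r1, r3, r5, r4, r2`: the best chunk leads, the second best closes, and the weakest sits in the middle band.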

The Prompt Injection That Survived Your Sanitizer Because the Agent Read It Through a Tool

· 11 min read
Tian Pan
Software Engineer

A team I talked to last month had a clean prompt-injection story. Their gateway ran every user message through a classifier. Anything that scored above a threshold got bounced with a polite error. They benchmarked it against a public adversarial set, hit 99.4% block rate, and shipped. Two weeks later, a customer-success ticket revealed that the agent had quietly drafted, approved, and sent an email instructing an internal billing tool to refund a stranger's invoice to a new account. The malicious instruction had never touched the user input. It came in through a Confluence page the agent fetched when the user asked, perfectly innocently, "what does our refund policy say?"

That is the failure mode no input sanitizer catches, and it is now the dominant prompt-injection vector in production agents. The classifier you trained on user prompts never saw the payload, because the payload arrived through a different door. By the time the bytes hit the model, the agent had already labeled them as "context I retrieved to help the user," not "untrusted text from a stranger on the internet." The model treats both with the same compliance instinct, because the model has no concept of trust at all.
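One partial mitigation is to make every byte pass through the same door: screen tool output with the same check applied to user messages, and carry an origin label on everything that enters the context. The sketch below is illustrative only. `admit`, `ContextItem`, and the origin strings are hypothetical, and `toy_scorer` is a keyword stand-in for a real classifier, which would itself need evaluating on retrieved documents and not only on user prompts.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ContextItem:
    text: str
    origin: str  # e.g. "user" or "tool:wiki_fetch" (hypothetical labels)

class BlockedContent(Exception):
    """Raised when screened text scores at or above the threshold."""

def admit(text: str, origin: str,
          score_injection: Callable[[str], float],
          threshold: float = 0.5) -> ContextItem:
    """Screen text from any origin before it may enter the model context."""
    score = score_injection(text)
    if score >= threshold:
        raise BlockedContent(f"{origin} scored {score:.2f}")
    return ContextItem(text=text, origin=origin)

def toy_scorer(text: str) -> float:
    """Keyword stand-in for a real injection classifier (assumption)."""
    return 1.0 if "ignore previous instructions" in text.lower() else 0.0
```

Screening does not give the model a concept of trust. The origin label matters because downstream policy, such as refusing to run write actions on the strength of tool-sourced text alone, needs to know where each instruction came from.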