
65 posts tagged with "retrieval"


How PII Redaction Sentinels Quietly Collapse Your Vector Index

· 10 min read
Tian Pan
Software Engineer

A support engineer pulled up your RAG console to debug a complaint. The customer had asked "what does my account look like right now," the answer had come back coherent and confident, and it had been about somebody else's account entirely. The top-3 retrieved chunks all belonged to other customers. The engineer ran the same query against a fresh corpus snapshot to rule out indexing lag. Same result. Then she ran it against a snapshot from six months ago, before the privacy redactor had shipped. The right customer's chunk came back at rank 1.

The redactor was working as designed. Every name was a [NAME], every email an [EMAIL], every account number an [ACCOUNT]. The legal team had a clean audit trail and the security team had a closed compliance ticket. What nobody on either team had modeled was that those sentinels, dropped into the same syntactic slots across millions of documents, were being seen by the embedding model as ordinary tokens — tokens that co-occurred more reliably with each other than any real content did. The redactor had not just removed information. It had added a new, very strong signal that every redacted document shared and nothing else did.
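A rough way to see the effect without an embedding model is to compare token overlap before and after redaction. This is only a proxy, assuming lexical overlap loosely tracks what an embedding model picks up; the sample sentences and the `jaccard` helper are made up for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, from 0.0 (disjoint) to 1.0 (identical sets)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


# Hypothetical records for two different customers.
raw_a = "Alice Smith emailed alice@example.com about account 1111"
raw_b = "Bob Jones emailed bob@example.com about account 2222"

# The same records after a sentinel-style redactor has run.
redacted_a = "[NAME] emailed [EMAIL] about account [ACCOUNT]"
redacted_b = "[NAME] emailed [EMAIL] about account [ACCOUNT]"

before = jaccard(raw_a, raw_b)
after = jaccard(redacted_a, redacted_b)
```

In this toy case the two customers' records become indistinguishable after redaction, which is the shared signal described above in its most extreme form.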

The Embedding Deprecation That Halved Your Retrieval Recall Without a Deploy

· 10 min read
Tian Pan
Software Engineer

The most expensive embedding bug a RAG system can ship is the one where nothing in your repository changes. Your retrieval code is the same. Your index is the same. Your query path is the same. And one Tuesday in week six, somebody notices that the answers used to be better.

The provider posted a sunset notice for the embedding family your index was built against twelve months ago. The platform team filed it in a deprecations dashboard with a year of runway and moved on. The sunset path wasn't a hard cutoff. It was a quiet quality regression: the deprecated endpoint started routing to a "compatibility" successor that returned vectors with the same dimensionality but a subtly different semantic geometry. Query embeddings began drifting against the corpus you embedded a year ago. Recall@10 on your standing eval slid by 47% over six weeks. The team traced it back only when an unrelated quality dashboard crossed a threshold, dragging a senior engineer into a root-cause exercise that ended at an embedding endpoint no one on the call had touched in a year.
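One way to catch this kind of drift earlier is a canary: keep the vectors for a small fixed set of probe texts from index-build time, re-embed the same texts on a schedule, and compare. The sketch below assumes you already have both sets of vectors in hand; `drift_report` and the 0.99 similarity floor are illustrative choices, not anything a provider documents.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drift_report(stored, fresh, min_similarity=0.99):
    """Return probe ids whose fresh embedding no longer matches the stored one.

    stored / fresh: dict mapping probe id -> vector for the SAME probe text,
    embedded at index-build time and today respectively.
    """
    return sorted(
        probe_id
        for probe_id, vector in stored.items()
        if probe_id in fresh and cosine(vector, fresh[probe_id]) < min_similarity
    )
```

If the endpoint is stable, the same text should come back as nearly the same vector, so a non-empty report is a signal to investigate before the eval slides.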

The RAG Dedup Step That Broke Silently and Flooded Your Top-K With Near-Duplicates

· 10 min read
Tian Pan
Software Engineer

A retrieval-augmented generation pipeline can degrade for weeks without a single metric noticing. The relevance scores look fine. The retrieval latency is unchanged. The eval slice that touches the broken topic moves a quarter of a point in the wrong direction, and your weekly review chalks it up to noise. Then someone reads the actual context window the model received for a customer ticket and sees the same paragraph three times (once in title case, once in lowercase, once with the punctuation stripped), and you understand that your top-five has secretly been a top-three for a month.

This is the class of failure where the system is doing exactly what it was told to do. The retriever is returning the most similar vectors to the query. Each of those vectors is genuinely about the right topic. The index has no idea that three of them came from the same paragraph indexed three ways, because the ingestion-time dedup pass that was supposed to catch that case is silently skipping it.
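A minimal sketch of an ingestion-time dedup key that would treat those three variants as one, assuming exact-match dedup after normalization is enough for your corpus. The function names are hypothetical, and true near-duplicates (paraphrases, small edits) would need something fuzzier, such as MinHash.

```python
import hashlib
import re
import unicodedata


def dedup_key(text: str) -> str:
    """Hash of the text after Unicode, case, punctuation, and whitespace normalization."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    no_punct = re.sub(r"[^\w\s]", "", folded)  # drop punctuation, keep word chars and spaces
    collapsed = " ".join(no_punct.split())     # collapse runs of whitespace
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def dedup_chunks(chunks: list[str]) -> list[str]:
    """Keep the first chunk seen for each normalized key, preserving input order."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        key = dedup_key(chunk)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

Logging how many chunks the pass drops per run is worth the extra line: a dedup step that suddenly drops zero is the silent skip described above.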

The RAG Threshold Pinned to an Absolute Score the Embedding Upgrade Silently Moved

· 9 min read
Tian Pan
Software Engineer

A RAG pipeline ships with a reranker score threshold of 0.4. Anything below gets dropped from the prompt. Six months in, a routine index rebuild swaps the embedding model for a newer checkpoint in the same family — a transparent upgrade, the change log says. Two days later answer relevance falls 6%. The team blames the LLM, runs a model bake-off, finds no candidate that recovers the loss, and spends a quarter chasing a regression that lives in none of the models they were comparing.

The regression lives in the gate. The reranker — untouched, same checkpoint, same weights — is now scoring a different candidate set. The new embeddings pull different chunks into the top-50, the reranker scores them lower on its own calibration, and the gate at 0.4 drops 37% more candidates than it did the week before. The number 0.4 didn't change. What 0.4 meant changed.
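One possible guard is to track the gate's pass rate rather than its raw value, and to re-derive the threshold from the new score distribution when the upstream model changes. The sketch below assumes that keeping the pass rate steady is the behavior you want, which is itself a judgment call; both function names are hypothetical.

```python
def pass_rate(scores, threshold):
    """Fraction of candidates scoring at or above the gate."""
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)


def recalibrate_threshold(new_scores, target_pass_rate):
    """Pick a threshold on the new scores that keeps roughly target_pass_rate of them."""
    if not new_scores:
        raise ValueError("need at least one score to recalibrate")
    ranked = sorted(new_scores, reverse=True)
    keep = max(1, round(target_pass_rate * len(ranked)))
    return ranked[keep - 1]
```

Even without recalibrating, alerting when `pass_rate` at the fixed threshold moves sharply after an index rebuild would point at the gate instead of the LLM.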

Retrieval Pipeline Residency: The Embedding That Crossed the Border Your LLM Call Didn't

· 9 min read
Tian Pan
Software Engineer

The team that ships "AI for EU customers" usually ships exactly one residency control: an inference endpoint pinned to an EU region. The procurement team gets a DPA, the architecture diagram gets a green checkmark next to "model hosted in Frankfurt," and the launch proceeds. What the diagram doesn't show is that the customer's verbatim query gets vectorized by a US-hosted embedding API on its way to the model, that the vector store the query is matched against has its operational plane in us-east-1, that the rerank model is a third-party SaaS deployed wherever the vendor chose, that the prompt cache is keyed regionally on hits and globally on misses, and that the trace store logging the retrieved chunks has a 30-day retention bucket that replicates cross-region for redundancy.

The inference layer respects residency. The retrieval pipeline doesn't even know it's a participant.

This is the gap where most "GDPR-compliant" RAG deployments fail an audit the team didn't realize was coming. The fix isn't another control on the model call — it's recognizing that data residency is a property of every component the customer's bytes touch, and that the team owning "the LLM" owns at most one of the six surfaces involved.
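A small inventory check makes the gap concrete. The sketch below lists the six surfaces with illustrative region labels (only us-east-1 and the Frankfurt hosting come from the scenario above; the rest are placeholders) and flags any surface that touches a region outside an allowed prefix. The surface names, the "eu-" prefix convention, and `residency_violations` are all assumptions.

```python
def residency_violations(surfaces, allowed_prefix="eu-"):
    """Map each surface to the regions it uses that fall outside the allowed prefix."""
    violations = {}
    for name, regions in surfaces.items():
        outside = sorted(r for r in regions if not r.startswith(allowed_prefix))
        if outside:
            violations[name] = outside
    return violations


# Hypothetical inventory; region labels are placeholders for illustration.
PIPELINE = {
    "inference": {"eu-central-1"},
    "embedding_api": {"us-east-1"},
    "vector_store_control_plane": {"us-east-1"},
    "reranker": {"unknown"},                      # vendor-chosen, not documented
    "prompt_cache": {"eu-central-1", "global"},   # regional on hits, global on misses
    "trace_store": {"eu-central-1", "us-west-2"}, # cross-region replication
}

violations = residency_violations(PIPELINE)
```

Treating "unknown" as a violation is deliberate in this sketch: a surface whose region nobody can name is not one you can defend in an audit.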

The Middle-Context Blindness Your Retrieval Pipeline Never Measured

· 8 min read
Tian Pan
Software Engineer

The retrieval logs are clean. Recall@10 against your hand-labeled query set has not regressed in months. The answer-quality dashboard says faithfulness is holding above 90%. Then a customer pastes a question into your support agent, the gold passage is right there at position 7 of 12 in the assembled prompt, and the model answers as if it were never retrieved.

The retrieval team will tell you the chunk was there. The prompt team will tell you the prompt was correct. Both are technically right. The model attended to the first thousand tokens, attended to the last thousand tokens, and skimmed the middle band where the answer lived. Your pipeline is hitting a positional attention bias that neither team owns, neither dashboard tracks, and neither benchmark catches.
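Measuring the bias can start with something as simple as bucketing answer correctness by where the gold passage sat in the prompt. A minimal sketch, assuming you log the gold passage's 1-based position, the number of chunks in the prompt, and a correctness label per query; the record shape and the three-band split are illustrative.

```python
from collections import defaultdict


def accuracy_by_position(records):
    """Answer accuracy grouped by which third of the prompt held the gold passage.

    records: iterable of (gold_position, total_chunks, answered_correctly),
    where gold_position is 1-based.
    """
    totals = defaultdict(lambda: [0, 0])  # band -> [correct, seen]
    for position, total, correct in records:
        third = (position - 1) * 3 // total  # 0 = head, 1 = middle, 2 = tail
        band = ("head", "middle", "tail")[third]
        totals[band][0] += int(correct)
        totals[band][1] += 1
    return {band: correct / seen for band, (correct, seen) in totals.items()}
```

A gold passage at position 7 of 12 lands in the "middle" band here. If accuracy in that band trails the head and tail bands, the gap is the number neither dashboard was tracking.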

The Reranker You Added That Slowed Recall More Than It Improved Precision

· 11 min read
Tian Pan
Software Engineer

The offline eval was unambiguous. After bolting a cross-encoder on top of the top-50 from vector search, nDCG@5 went up four points. The team shipped it on a Tuesday. By Thursday, p99 retrieval latency had crossed the SLO by 700 milliseconds, and customer success was forwarding screenshots of empty results pages that the old pipeline would have populated. The graph that mattered — user-perceived answer quality — was down. The reranker was a regression that the team had branded as an improvement, and the eval rubric was the thing that hid the regression in plain sight.

This is one of the most common failure modes in production retrieval, and it is rarely described as what it actually is: an evaluation bug. The reranker did what it was advertised to do. It reordered the top-50 with finer-grained precision. The problem is that the metric used to justify it — offline nDCG, computed at infinite budget, against the full reranked list — describes a world the production system does not live in. In production, the answer that ships is not the best-scored reranked list. It is whatever the system can return before the request deadline. And once you write the metric that way, the reranker's contribution is no longer a four-point lift. It is a curve.
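Writing the metric "that way" might look like the sketch below, which scores a query as zero when its retrieval latency exceeds the deadline (the empty results page) and sweeps the deadline to produce the curve. The per-query `(score, latency_ms)` record shape and the zero-on-timeout rule are assumptions; a system that falls back to un-reranked results on timeout would substitute the baseline score instead.

```python
def quality_at_deadline(queries, deadline_ms):
    """Mean per-query score where a query that misses the deadline contributes 0.

    queries: list of (score, latency_ms) pairs, e.g. nDCG@5 and retrieval latency.
    """
    if not queries:
        return 0.0
    total = sum(score if latency <= deadline_ms else 0.0 for score, latency in queries)
    return total / len(queries)


def quality_curve(queries, deadlines_ms):
    """Deadline-aware quality at each candidate deadline."""
    return {deadline: quality_at_deadline(queries, deadline) for deadline in deadlines_ms}
```

Plotting `quality_curve` for the baseline and the reranked pipeline on the same axes shows where, if anywhere, the reranker's lift survives the production deadline.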

The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training

· 9 min read
Tian Pan
Software Engineer

A retrieval team ships an off-the-shelf embedding model against their product catalogue. The eval set — a few hundred queries scraped from the search logs of the last month — comes back at recall@10 of 0.91. They promote to production. Three weeks in, support starts forwarding tickets: a user searched for the actual SKU of a part and got back five plausible-looking but wrong parts. Another user searched for the internal codename of a feature and got the marketing name of an unrelated feature. The eval set never caught it because the eval set was drawn from queries the system already handled — queries about common terms. The long tail of jargon, where the business actually lives, was never sampled.

The model didn't fail. The model did exactly what it was trained to do, against a vocabulary distribution that did not include the corpus the team handed it. The team treated the embedding as a domain-neutral primitive — a function from text to vector — when it was actually a contract about which vocabulary it could resolve, signed with someone else's training corpus.

The Vector Index That Was Sharded by Ingestion Date

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of recall lie that hides inside time-partitioned vector indexes, and the people who built the offline eval are usually the last to find it. The dashboard says recall@10 is 0.94. The retriever is shipping the right snippet 94% of the time. The product team is shipping more retrieval-grounded features on the back of that number. And then the support tickets arrive: "the assistant cited a guide that does not match the answer," "the assistant linked to last week's version of the policy," "the assistant could not find a document I uploaded two months ago." None of those tickets contradict the 0.94. They are evidence that the 0.94 is measuring the wrong thing.

The mechanism is simple and easy to miss. The vector index is sharded by ingestion date because that is the easiest way to keep write throughput high, retire old data, and keep the hot working set in fast memory. The offline test set is generated nightly from production logs, which means the queries are drawn from the same recent window that the freshest shard happens to hold. Recall is measured against ground truth that lives one or two shards deep. The retriever performs beautifully on those queries because, in production, those queries are the ones the routing layer keeps inside the same shard.
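One way to expose the skew is to report recall per document-age bucket instead of as a single number. A minimal sketch, assuming each eval record carries the age in days of its gold document and whether the retriever found it; the 30-day bucket width is arbitrary.

```python
from collections import defaultdict


def recall_by_age(results, bucket_days=30):
    """Recall grouped by the age of the gold document.

    results: iterable of (gold_doc_age_days, retrieved_in_top_k).
    Returns {bucket_start_day: recall}, ordered from newest to oldest bucket.
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket start -> [hits, total]
    for age_days, hit in results:
        start = (age_days // bucket_days) * bucket_days
        buckets[start][0] += int(hit)
        buckets[start][1] += 1
    return {start: hits / total for start, (hits, total) in sorted(buckets.items())}
```

If nearly all eval records fall in the newest bucket, that imbalance is itself the finding: the headline number says little about documents uploaded two months ago.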

The Vector Index Whose Source Updates Never Reached the Embeddings

· 10 min read
Tian Pan
Software Engineer

A support engineer pings the on-call channel. A customer pasted a sentence the assistant retrieved last week, and the policy team replied: we don't say that anymore. They haven't said it for four months. The document in the CMS reads correctly. The embedded chunk in the vector index still reads the old way, with a confident similarity score, surfaced to the model on every relevant query. Nobody changed the retrieval code. Nobody changed the model. The source-of-truth changed, and the index never heard about it.

This is the failure mode of an ingestion pipeline that was designed for creates and grew into a system that also handles updates without anyone designing for updates. The "embed on create" job ran the day each document was first written. The CMS shipped an edit endpoint a quarter later, owned by a different team, who plumbed it into search and into the public-facing renderer and into the changelog feed — every consumer except the one that was a derived dataset hiding behind a different name. Months pass. The corpus drifts. Retrieval starts answering questions about a world the company has formally left behind, and the only signal is a confused customer.
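A common way to close this gap is to store a content hash next to each embedding and periodically reconcile it against the source. The sketch below assumes you can list current source text by document id and that the index keeps the hash of the text it embedded; `find_stale` and its input shapes are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of the text that was (or should be) embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(source_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare the source of truth against what the index embedded.

    Returns (stale, orphaned):
      stale    - ids that are new or whose source text changed since embedding
      orphaned - ids still in the index but gone from the source
    """
    stale = sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    orphaned = sorted(set(indexed_hashes) - set(source_docs))
    return stale, orphaned
```

Run as a scheduled reconciliation, this catches edits even when the update event never reaches the embedding job, because it does not depend on any upstream team remembering the index exists.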

The Chunk Boundary That Bisected the Sentence Your Answer Depended On

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline chunks documents into 512-token spans with 50-token overlap. It is a clean industry default. Somewhere in your corpus there is a sentence — "Refunds are processed within five business days unless the order originated from the EU region, in which case the regulatory window is fourteen days" — that landed across a chunk boundary. Chunk N contains the first half. Chunk N+1 contains the second.

A user asks "how long do EU refunds take." Retrieval scores chunk N highest because the query embedding aligns with "EU region" in the first fragment. Chunk N+1, which contains the only actual answer, ranks too low to be retrieved alongside it. The agent answers "five business days" with a confident citation to chunk N. The customer is in Frankfurt. The answer is wrong. The pipeline behaved exactly as designed.

This is the failure mode that does not show up in your chunk-quality eval. The chunks are well-formed. The corpus is well-formed. The embedding model is well-formed. The boundaries between chunks — the lines you drew through your own documents — are where the answer lives.
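A boundary audit can be sketched as: split each document into sentences, then list the sentences that no single chunk contains in full. The version below works on character offsets rather than tokens and uses a naive punctuation-based splitter, both simplifying assumptions; the function names are hypothetical.

```python
import re


def sentence_spans(text):
    """Naive sentence splitter returning (start, end) character spans."""
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        stripped = raw.strip()
        if stripped:
            start = match.start() + (len(raw) - len(raw.lstrip()))
            spans.append((start, start + len(stripped)))
    return spans


def uncovered_sentences(text, chunk_spans):
    """Sentences that are not fully contained in any single chunk.

    chunk_spans: list of (start, end) character offsets, overlap allowed.
    """
    return [
        text[start:end]
        for start, end in sentence_spans(text)
        if not any(c_start <= start and end <= c_end for c_start, c_end in chunk_spans)
    ]
```

Because containment is checked against every chunk, a sentence rescued by the overlap window is not reported; only sentences that are split in every chunk that touches them show up.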

The Wiki Edit Mid-Flight When Your RAG Pipeline Read It

· 11 min read
Tian Pan
Software Engineer

A tech writer on your platform team is moving a paragraph. Not metaphorically — literally cutting a section from the onboarding page, pasting it into the runbook, deleting a stub draft on a third page, and rewording a deprecated warning on a fourth. The whole edit takes her about eleven minutes. Your RAG ingest job runs every fifteen. It happens to fire at minute six.

For the next fifteen minutes, your retrieval index contains a state of the wiki that did not exist at any single moment in her mind. The onboarding page still has the section. The runbook still doesn't. The stub draft is captured halfway through being deleted, with a placeholder sentence she never intended to publish. The old deprecated warning is still indexed. When an engineer asks the agent "how do we handle credential rotation in this service," the model retrieves contradictory chunks from the same source and confidently synthesizes whichever was ranked higher. The answer is wrong in a shape no one wrote.

This is a failure mode most teams ship without noticing: the source-of-truth is transactional, the ingest is a poll, and the gap between them is where dirty reads live.