Skip to main content

163 posts tagged with "rag"

View all tags

Retrieval Pipeline Residency: The Embedding That Crossed the Border Your LLM Call Didn't

· 9 min read
Tian Pan
Software Engineer

The team that ships "AI for EU customers" usually ships exactly one residency control: an inference endpoint pinned to an EU region. The procurement team gets a DPA, the architecture diagram gets a green checkmark next to "model hosted in Frankfurt," and the launch proceeds. What the diagram doesn't show is that the customer's verbatim query gets vectorized by a US-hosted embedding API on its way to the model, that the vector store the query is matched against has its operational plane in us-east-1, that the rerank model is a third-party SaaS deployed wherever the vendor chose, that the prompt cache is keyed regionally on hits and globally on misses, and that the trace store logging the retrieved chunks has a 30-day retention bucket that replicates cross-region for redundancy.

The inference layer respects residency. The retrieval pipeline doesn't even know it's a participant.

This is the gap where most "GDPR-compliant" RAG deployments fail an audit the team didn't realize was coming. The fix isn't another control on the model call — it's recognizing that data residency is a property of every component the customer's bytes touch, and that the team owning "the LLM" owns at most one of the six surfaces involved.

The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter

· 10 min read
Tian Pan
Software Engineer

You ran the experiment cleanly. Two arms, one feature flag, a clear metric, the stats team blessed the design. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning your readout meeting happened. Your A/B test was real. The substrate it ran on was not.

This is the failure mode every team running retrieval-augmented generation eventually walks into and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed against. The experiment you thought was measuring a feature change was measuring a retrieval regime change with the feature flag noise on top.

The Middle-Context Blindness Your Retrieval Pipeline Never Measured

· 8 min read
Tian Pan
Software Engineer

The retrieval logs are clean. Recall@10 against your hand-labeled query set has not regressed in months. The answer-quality dashboard says faithfulness is holding above 90%. Then a customer pastes a question into your support agent, the gold passage is right there at position 7 of 12 in the assembled prompt, and the model answers as if it were never retrieved.

The retrieval team will tell you the chunk was there. The prompt team will tell you the prompt was correct. Both are technically right. The model attended to the first thousand tokens, attended to the last thousand tokens, and skimmed the middle band where the answer lived. Your pipeline is hitting a positional attention bias that neither team owns, neither dashboard tracks, and neither benchmark catches.

The prompt injection that survived your sanitizer because the agent read it through a tool

· 11 min read
Tian Pan
Software Engineer

A team I talked to last month had a clean prompt-injection story. Their gateway ran every user message through a classifier. Anything that scored above a threshold got bounced with a polite error. They benchmarked it against a public adversarial set, hit 99.4% block rate, and shipped. Two weeks later, a customer-success ticket revealed that the agent had quietly drafted, approved, and sent an email instructing an internal billing tool to refund a stranger's invoice to a new account. The malicious instruction had never touched the user input. It came in through a Confluence page the agent fetched when the user asked, perfectly innocently, "what does our refund policy say?"

That is the failure mode no input sanitizer catches, and it is now the dominant prompt-injection vector in production agents. The classifier you trained on user prompts never saw the payload, because the payload arrived through a different door. By the time the bytes hit the model, the agent had already labeled them as "context I retrieved to help the user," not "untrusted text from a stranger on the internet." The model treats both with the same compliance instinct, because the model has no concept of trust at all.

The Reranker You Added That Slowed Recall More Than It Improved Precision

· 11 min read
Tian Pan
Software Engineer

The offline eval was unambiguous. After bolting a cross-encoder on top of the top-50 from vector search, nDCG@5 went up four points. The team shipped it on a Tuesday. By Thursday, p99 retrieval latency had crossed the SLO by 700 milliseconds, and customer success was forwarding screenshots of empty results pages that the old pipeline would have populated. The graph that mattered — user-perceived answer quality — was down. The reranker was a regression that the team had branded as an improvement, and the eval rubric was the thing that hid the regression in plain sight.

This is one of the most common failure modes in production retrieval, and it is rarely described as what it actually is: an evaluation bug. The reranker did what it was advertised to do. It reordered the top-50 with finer-grained precision. The problem is that the metric used to justify it — offline nDCG, computed at infinite budget, against the full reranked list — describes a world the production system does not live in. In production, the answer that ships is not the best-scored reranked list. It is whatever the system can return before the request deadline. And once you write the metric that way, the reranker's contribution is no longer a four-point lift. It is a curve.

The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training

· 9 min read
Tian Pan
Software Engineer

A retrieval team ships an off-the-shelf embedding model against their product catalogue. The eval set — a few hundred queries scraped from the search logs of the last month — comes back at recall@10 of 0.91. They promote to production. Three weeks in, support starts forwarding tickets: a user searched for the actual SKU of a part and got back five plausible-looking but wrong parts. Another user searched for the internal codename of a feature and got the marketing name of an unrelated feature. The eval set never caught it because the eval set was drawn from queries the system already handled — queries about common terms. The long tail of jargon, where the business actually lives, was never sampled.

The model didn't fail. The model did exactly what it was trained to do, against a vocabulary distribution that did not include the corpus the team handed it. The team treated the embedding as a domain-neutral primitive — a function from text to vector — when it was actually a contract about which vocabulary it could resolve, signed with someone else's training corpus.

The Vector Index That Was Sharded by Ingestion Date

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of recall lie that hides inside time-partitioned vector indexes, and the people who built the offline eval are usually the last to find it. The dashboard says recall@10 is 0.94. The retriever is shipping the right snippet 94% of the time. The product team is shipping more retrieval-grounded features on the back of that number. And then the support tickets arrive: "the assistant cited a guide that does not match the answer," "the assistant linked to last week's version of the policy," "the assistant could not find a document I uploaded two months ago." None of those tickets contradict the 0.94. They are evidence that the 0.94 is measuring the wrong thing.

The mechanism is simple and easy to miss. The vector index is sharded by ingestion date because that is the easiest way to keep write throughput high, retire old data, and keep the hot working set in fast memory. The offline test set is generated nightly from production logs, which means the queries are drawn from the same recent window that the freshest shard happens to hold. Recall is measured against ground truth that lives one or two shards deep. The retriever performs beautifully on those queries because, in production, those queries are the ones the routing layer keeps inside the same shard.

The Vector Index Whose Source Updates Never Reached the Embeddings

· 10 min read
Tian Pan
Software Engineer

A support engineer pings the on-call channel. A customer pasted a sentence the assistant retrieved last week, and the policy team replied: we don't say that anymore. They haven't said it for four months. The document in the CMS reads correctly. The embedded chunk in the vector index still reads the old way, with a confident similarity score, surfaced to the model on every relevant query. Nobody changed the retrieval code. Nobody changed the model. The source-of-truth changed, and the index never heard about it.

This is the failure mode of an ingestion pipeline that was designed for creates and grew into a system that also handles updates without anyone designing for updates. The "embed on create" job ran the day each document was first written. The CMS shipped an edit endpoint a quarter later, owned by a different team, who plumbed it into search and into the public-facing renderer and into the changelog feed — every consumer except the one that was a derived dataset hiding behind a different name. Months pass. The corpus drifts. Retrieval starts answering questions about a world the company has formally left behind, and the only signal is a confused customer.

Security by Obscurity and the Agent Reading Your Wiki

· 12 min read
Tian Pan
Software Engineer

There is an endpoint inside your company that has been safe for ten years. It lives at a path that nobody outside the original team would ever guess. It is not in the public docs. It is not in the OpenAPI spec. It is not in the gateway's allowlist of "documented routes." Its auth layer is a token that any internal service can mint, because the threat model said the only way to reach it was to already know it existed. The endpoint accepts a JSON blob that, on a slow Tuesday, will reissue a refund or rotate an API key or move a row between two billing ledgers. It has worked correctly and unremarkably since 2016.

Last month, a teammate wired a coding agent into the engineering wiki to help with onboarding questions. The agent indexed every Confluence space, every archived design doc, every "do not delete — historical" page. Yesterday, a junior engineer asked it how refunds work. The agent stitched together a forgotten 2018 architecture diagram, a Slack export someone had pasted into a runbook, and a half-written postmortem. It produced, in conversational prose, a complete description of that endpoint, the token type required, and an example payload. The endpoint had not changed. Its threat model had.

The Chunk Boundary That Bisected the Sentence Your Answer Depended On

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline chunks documents into 512-token spans with 50-token overlap. It is a clean industry default. Somewhere in your corpus there is a sentence — "Refunds are processed within five business days unless the order originated from the EU region, in which case the regulatory window is fourteen days" — that landed across a chunk boundary. Chunk N contains the first half. Chunk N+1 contains the second.

A user asks "how long do EU refunds take." Retrieval scores chunk N highest because the query embedding aligns with "EU region" in the first fragment. Chunk N+1, which contains the only actual answer, ranks too low to be retrieved alongside. The agent answers "five business days" with a confident citation to chunk N. The customer is in Frankfurt. The answer is wrong. The pipeline behaved exactly as designed.

This is the failure mode that does not show up in your chunk-quality eval. The chunks are well-formed. The corpus is well-formed. The embedding model is well-formed. The boundaries between chunks — the lines you drew through your own documents — are where the answer lives.

The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.

The Wiki Edit Mid-Flight When Your RAG Pipeline Read It

· 11 min read
Tian Pan
Software Engineer

A tech writer on your platform team is moving a paragraph. Not metaphorically — literally cutting a section from the onboarding page, pasting it into the runbook, deleting a stub draft on a third page, and rewording a deprecated warning on a fourth. The whole edit takes her about eleven minutes. Your RAG ingest job runs every fifteen. It happens to fire at minute six.

For the next fifteen minutes, your retrieval index contains a state of the wiki that did not exist at any single moment in her mind. The onboarding page still has the section. The runbook still doesn't. The stub draft is captured halfway through being deleted, with a placeholder sentence she never intended to publish. The old deprecated warning is still indexed. When an engineer asks the agent "how do we handle credential rotation in this service," the model retrieves contradictory chunks from the same source and confidently synthesizes whichever was ranked higher. The answer is wrong in a shape no one wrote.

This is a failure mode most teams ship without noticing: the source-of-truth is transactional, the ingest is a poll, and the gap between them is where dirty reads live.