34 posts tagged with "embeddings"

How PII Redaction Sentinels Quietly Collapse Your Vector Index

· 10 min read
Tian Pan
Software Engineer

A support engineer pulled up your RAG console to debug a complaint. The customer had asked "what does my account look like right now," the answer had come back coherent and confident, and it had been about somebody else's account entirely. The top-3 retrieved chunks all belonged to other customers. The engineer ran the same query against a fresh corpus snapshot to rule out indexing lag. Same result. Then she ran it against a snapshot from six months ago, before the privacy redactor had shipped. The right customer's chunk came back at rank 1.

The redactor was working as designed. Every name was a [NAME], every email an [EMAIL], every account number an [ACCOUNT]. The legal team had a clean audit trail and the security team had a closed compliance ticket. What nobody on either team had modeled was that those sentinels, dropped into the same syntactic slots across millions of documents, were being seen by the embedding model as ordinary tokens — tokens that co-occurred more reliably with each other than any real content did. The redactor had not just removed information. It had added a new, very strong signal that every redacted document shared and nothing else did.
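
A toy illustration of that added signal, under a loud assumption: a bag-of-words cosine stands in for the embedding model, which it is not. The two tickets and the `bow_cosine` helper are hypothetical. The only point is that two unrelated tickets share more surface tokens after redaction than before it.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between whitespace-token count vectors.

    Assumption: a crude stand-in for a learned embedding, used only to
    show how shared sentinel tokens inflate overlap.
    """
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0

# Two hypothetical tickets from different customers about different issues.
raw_a = "Alice Ng alice@example.com account 4471 asked about a late refund"
raw_b = "Bob Ruiz bob@example.org account 9920 asked to change a shipping address"

# The same tickets after redaction: every identifier becomes a shared sentinel.
red_a = "[NAME] [EMAIL] account [ACCOUNT] asked about a late refund"
red_b = "[NAME] [EMAIL] account [ACCOUNT] asked to change a shipping address"

print("raw similarity:     ", round(bow_cosine(raw_a, raw_b), 3))
print("redacted similarity:", round(bow_cosine(red_a, red_b), 3))
```

A real embedding model will not reproduce these numbers, but the direction of the effect is the thing to test for on your own corpus: compare pairwise similarity of unrelated documents before and after redaction.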

The Cost Dashboard Your Finance Team Built That Excluded the Embeddings Re-index

· 10 min read
Tian Pan
Software Engineer

Your finance team built a beautiful AI cost dashboard. Token spend, sliced by feature. Embedding spend, sliced by provider. Every quarter, the per-feature pane gets reviewed in a leadership meeting and somebody asks why the support-chat workflow is up 12%, and a product manager has a defensible answer. Every quarter, the per-provider pane gets reviewed in an infra meeting and somebody asks why OpenAI is up 8%, and a platform engineer has a defensible answer. And every quarter, the line that actually doubles your AI bill — the corpus re-index — lands in a third bucket called "infrastructure" that nobody reviews because nobody owns it.

That bucket is where forty percent of your AI spend goes to die unattributed. The teams who could have optimized it never see it. The teams who see it can't tell you which feature it serves. The dashboard is honest about every cost it can explain and silent about the cost it can't, which is exactly the cost that matters most.

The Embedding Deprecation That Halved Your Retrieval Recall Without a Deploy

· 10 min read
Tian Pan
Software Engineer

The most expensive embedding bug a RAG system can ship is the one where nothing in your repository changes. Your retrieval code is the same. Your index is the same. Your query path is the same. And one Tuesday in week six, somebody notices that the answers used to be better.

The provider posted a sunset notice for the embedding family your index was built against twelve months ago. The platform team filed it in a deprecations dashboard with a year of runway and moved on. The sunset path wasn't a hard cutoff — it was a quiet quality regression where the deprecated endpoint started routing to a "compatibility" successor that returned vectors in the same dimensionality and a subtly different semantic geometry. Query embeddings began drifting against the corpus you embedded a year ago. Recall@10 on your standing eval slid by 47% over six weeks. The team only traced it back when an unrelated quality dashboard crossed a threshold, dragging a senior engineer into a root-cause exercise that ended at an embedding endpoint no one on the call had touched in a year.
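
One way to catch this kind of drift before a quality dashboard does is a canary: keep the vectors for a handful of fixed probe texts from the day the index was built, re-embed the same texts on a schedule, and alert when they stop matching. The sketch below assumes a hypothetical `embed` callable wrapping whatever endpoint you use; the `min_cosine` default of 0.999 is an illustrative value, not a recommendation.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def drifted_probes(
    baseline: dict[str, Vector],
    embed: Callable[[str], Vector],  # hypothetical wrapper around your provider
    min_cosine: float = 0.999,       # assumed tolerance; tune for your model
) -> list[str]:
    """Return probe texts whose fresh embedding no longer matches the stored one."""
    return [
        text
        for text, stored in baseline.items()
        if cosine(stored, embed(text)) < min_cosine
    ]

# Usage with a fake embedder that has drifted on one probe.
baseline = {"refund policy": [1.0, 0.0], "shipping times": [0.0, 1.0]}
fake_embed = lambda text: [0.0, 1.0]  # stands in for the live endpoint
print(drifted_probes(baseline, fake_embed))
```

The check only works if the baseline vectors are stored outside the index and never regenerated along with it.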

The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter

· 10 min read
Tian Pan
Software Engineer

You ran the experiment cleanly. Two arms, one feature flag, a clear metric, the stats team blessed the design. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning your readout meeting happened. Your A/B test was real. The substrate it ran on was not.

This is the failure mode every team running retrieval-augmented generation eventually walks into and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed against. The experiment you thought was measuring a feature change was measuring a retrieval regime change, with the feature flag's effect as noise on top.

The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training

· 9 min read
Tian Pan
Software Engineer

A retrieval team ships an off-the-shelf embedding model against their product catalogue. The eval set — a few hundred queries scraped from the search logs of the last month — comes back at recall@10 of 0.91. They promote to production. Three weeks in, support starts forwarding tickets: a user searched for the actual SKU of a part and got back five plausible-looking but wrong parts. Another user searched for the internal codename of a feature and got the marketing name of an unrelated feature. The eval set never caught it because the eval set was drawn from queries the system already handled — queries about common terms. The long tail of jargon, where the business actually lives, was never sampled.

The model didn't fail. The model did exactly what it was trained to do, against a vocabulary distribution that did not include the corpus the team handed it. The team treated the embedding as a domain-neutral primitive — a function from text to vector — when it was actually a contract about which vocabulary it could resolve, signed with someone else's training corpus.

The Vector Index Whose Source Updates Never Reached the Embeddings

· 10 min read
Tian Pan
Software Engineer

A support engineer pings the on-call channel. A customer pasted a sentence the assistant retrieved last week, and the policy team replied: we don't say that anymore. They haven't said it for four months. The document in the CMS reads correctly. The embedded chunk in the vector index still reads the old way, with a confident similarity score, surfaced to the model on every relevant query. Nobody changed the retrieval code. Nobody changed the model. The source-of-truth changed, and the index never heard about it.

This is the failure mode of an ingestion pipeline that was designed for creates and grew into a system that also handles updates without anyone designing for updates. The "embed on create" job ran the day each document was first written. The CMS shipped an edit endpoint a quarter later, owned by a different team, who plumbed it into search and into the public-facing renderer and into the changelog feed — every consumer except the one that was a derived dataset hiding behind a different name. Months pass. The corpus drifts. Retrieval starts answering questions about a world the company has formally left behind, and the only signal is a confused customer.
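
A minimal sketch of how an index can notice edits it was never told about, assuming you store a content hash alongside each embedded document. The function names and the `indexed_hashes` mapping are hypothetical; the idea is a periodic reconciliation that does not depend on the CMS remembering to send an event.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of the text that was (or would be) embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_documents(
    source: dict[str, str],          # doc_id -> current text in the source of truth
    indexed_hashes: dict[str, str],  # doc_id -> hash stored when the doc was embedded
) -> list[str]:
    """IDs whose current text differs from what was embedded, or was never embedded."""
    return sorted(
        doc_id
        for doc_id, text in source.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )

# Usage: one policy page was edited after it was embedded.
cms = {"policy-1": "Refunds take fourteen days.", "faq-7": "We ship worldwide."}
index_state = {
    "policy-1": content_hash("Refunds take five days."),
    "faq-7": content_hash("We ship worldwide."),
}
print(stale_documents(cms, index_state))  # IDs that need re-embedding
```

This sketch covers edits and missed creates. Deletions need the reverse comparison: IDs present in the index but absent from the source.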

The Chunk Boundary That Bisected the Sentence Your Answer Depended On

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline chunks documents into 512-token spans with 50-token overlap. It is a clean industry default. Somewhere in your corpus there is a sentence — "Refunds are processed within five business days unless the order originated from the EU region, in which case the regulatory window is fourteen days" — that landed across a chunk boundary. Chunk N contains the first half. Chunk N+1 contains the second.

A user asks "how long do EU refunds take." Retrieval scores chunk N highest because the query embedding aligns with "EU region" in the first fragment. Chunk N+1, which contains the only actual answer, ranks too low to be retrieved alongside it. The agent answers "five business days" with a confident citation to chunk N. The customer is in Frankfurt. The answer is wrong. The pipeline behaved exactly as designed.

This is the failure mode that does not show up in your chunk-quality eval. The chunks are well-formed. The corpus is well-formed. The embedding model is well-formed. The boundaries between chunks — the lines you drew through your own documents — are where the answer lives.
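
One common mitigation is to draw boundaries only between sentences. The sketch below is a simplified assumption, not the approach of any particular library: it counts whitespace-separated words rather than model tokens, and it splits sentences with a naive regex that will mis-split on abbreviations such as "e.g.".

```python
import re

def sentence_chunks(text: str, max_words: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized
    chunk instead of being cut in half.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if this sentence would overflow it.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

doc = (
    "Orders ship in two days. "
    "Refunds are processed within five business days unless the order "
    "originated from the EU region, in which case the regulatory window "
    "is fourteen days. "
    "Contact support for help."
)
for chunk in sentence_chunks(doc, max_words=20):
    print(repr(chunk))
```

With this packing the refund sentence stays in one chunk, so the "five business days" clause can never be retrieved without its EU exception. The cost is uneven chunk sizes, which is a trade-off to measure rather than assume away.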

The Embedding That Aged Out of Meaning

· 9 min read
Tian Pan
Software Engineer

You embedded the knowledge base eighteen months ago. The model has not changed. The chunks have not changed. The index is healthy, the latency is fine, the recall dashboard is a flat line at 0.86. And yet support is quietly pasting the wrong article links into ticket replies, the sales bot keeps surfacing a deprecated SKU when a prospect asks about the new one, and an internal user just told you the assistant "feels dumber" without being able to say why.

Nothing broke. Your embeddings aged. The word "post" used to mean "blog post" in your domain; now half the corpus uses it for a Slack post, a forum post, and a job posting, and your eighteen-month-old vectors still treat it as one concept. The model that encoded those vectors never saw the new senses, never saw the new product names, never saw the rebrand, never saw the regulation that introduced three new terms your customers now use without thinking. The retrieval system answers the question it knows how to answer, which is no longer the question your users are asking.

The Embedding Upgrade That Silently Re-Ranks Your Entire Corpus

· 9 min read
Tian Pan
Software Engineer

A new embedding model lands on the leaderboard. It scores higher than the one you shipped eighteen months ago, the API is a one-line change, and the dimensions even match. Someone files a ticket: "upgrade embedding model." It looks like swapping a logging library.

It is not. The embedding model is not a component of your retrieval system — it is the coordinate system your retrieval system lives in. Changing it does not improve your index. It invalidates it. And the cruelest part is that nothing crashes. No exception, no failed health check. Your search just starts returning subtly different results, and "subtly different" in a RAG pipeline means a different document feeds the model, which means a different answer reaches the user.
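
Because matching dimensions prove nothing about matching coordinate systems, one cheap guard is to record which model built the index and refuse to query it with anything else. This is a sketch under assumptions: `IndexManifest`, `assert_same_space`, and the model names are hypothetical, and where you persist the manifest depends on your vector store.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexManifest:
    """Metadata written once, when the index is built."""
    model_id: str
    dimensions: int

class EmbeddingSpaceMismatch(RuntimeError):
    """Raised when a query encoder does not match the index's encoder."""

def assert_same_space(
    index: IndexManifest, query_model_id: str, query_dimensions: int
) -> None:
    # Equal dimensions are necessary but not sufficient: the model ID must match too.
    if index.model_id != query_model_id or index.dimensions != query_dimensions:
        raise EmbeddingSpaceMismatch(
            f"index built with {index.model_id} ({index.dimensions}d), "
            f"query encoded with {query_model_id} ({query_dimensions}d)"
        )

manifest = IndexManifest(model_id="embed-v1", dimensions=1536)  # hypothetical names
assert_same_space(manifest, "embed-v1", 1536)  # passes silently
try:
    assert_same_space(manifest, "embed-v2", 1536)  # same size, different space
except EmbeddingSpaceMismatch as exc:
    print("refused:", exc)
```

The guard turns a silent re-ranking into a loud failure at query time, which is the only property it is meant to provide.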

The Embedding Migration Black Hole: How a Vector Model Bump Silently Rewrites Your Business Rules

· 11 min read
Tian Pan
Software Engineer

The migration ticket is one line: "Upgrade embedding model from v3-small to v3-large." The new model wins on the public benchmark by 12%. The pipeline change is six lines of Python. The team estimates two days of engineering plus a re-embedding job that runs over a weekend. Two months later, the duplicate-detection feature is producing twice as many false positives as it did before the swap, the "related items" carousel on the marketing site has quietly become a slop generator, and the semantic cache hit rate has fallen off a cliff because the threshold of 0.95 that worked perfectly in the old space now matches almost nothing.

Nobody touched those features. Nobody filed a bug. The model swap that the migration plan called "infrastructure" silently rewrote every business rule that consumed a similarity score.
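
A hedged sketch of one way to carry a similarity threshold across a model swap: score the same sample of pairs under both models, then pick the new threshold that lets the same number of pairs through. `recalibrate_threshold` is a hypothetical helper. It preserves the pass rate only, which is an assumption about what the business rule needs; a labeled set of true and false matches is a stronger basis when you have one.

```python
def recalibrate_threshold(
    old_scores: list[float],
    new_scores: list[float],
    old_threshold: float,
) -> float:
    """Map a threshold from the old similarity space to the new one.

    old_scores[i] and new_scores[i] must be the similarity of the same
    pair under the old and new model. The returned threshold admits the
    same number of sample pairs as the old one did (ties may admit more).
    """
    if not old_scores or len(old_scores) != len(new_scores):
        raise ValueError("need the same non-empty sample scored under both models")
    passing = sum(score >= old_threshold for score in old_scores)
    if passing == 0:
        raise ValueError("old threshold admits no pairs in this sample")
    ranked = sorted(new_scores, reverse=True)
    return ranked[passing - 1]

# Illustrative scores only: the new model spreads similarities lower.
old = [0.99, 0.97, 0.96, 0.90, 0.80]
new = [0.91, 0.88, 0.85, 0.70, 0.60]
print(recalibrate_threshold(old, new, old_threshold=0.95))
```

Each feature that consumes a similarity score (duplicate detection, related items, the semantic cache) needs its own sample and its own recalibration, since their thresholds sit at different points of the distribution.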

Code-Specific RAG: Why General Retrieval Fails for Codebases

· 10 min read
Tian Pan
Software Engineer

Most teams building AI coding assistants reach for the same off-the-shelf RAG pipeline they use for document retrieval: chunk the source files by token count, embed the chunks, store them in a vector database, query by semantic similarity. The pipeline works well enough on prose. On code, it quietly fails — and the failures are hard to see in aggregate metrics, because the retrieved chunks look plausible right up until the model generates code with the wrong return type, calls a function with the wrong signature, or misses a dependency that only exists three hops down the call graph.

The problem isn't the embedding model or the vector database. It's the chunking strategy. Code is not prose. It has structural properties — dependency graphs, call chains, type signatures, scope hierarchies — that token-based chunking destroys before the retriever ever sees them. Fixing this requires rethinking how you decompose code before it ever reaches the embedding step.
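
As a minimal sketch of structure-aware decomposition, the example below uses Python's standard `ast` module to emit one chunk per top-level function or class instead of one chunk per token window. It is deliberately limited: it handles Python source only, drops imports and other module-level statements, and does nothing about call graphs or cross-file dependencies. `chunk_by_definition` is a hypothetical name.

```python
import ast

def chunk_by_definition(source: str) -> list[tuple[str, str]]:
    """Return (name, source_text) for each top-level function or class."""
    tree = ast.parse(source)
    chunks: list[tuple[str, str]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text spanned by the node.
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append((node.name, segment))
    return chunks

example = '''import os

def load(path):
    with open(path) as handle:
        return handle.read()

class Parser:
    def parse(self, text):
        return text.split()
'''

for name, text in chunk_by_definition(example):
    print(f"--- {name} ---")
    print(text)
```

Each chunk is a complete definition with its signature attached, so the retriever never sees a function body separated from its parameters. Other languages need their own parser; the principle of cutting on syntactic boundaries is what carries over.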

Embedding Model Churn: When Your Provider Silently Invalidates Your Entire Vector Index

· 9 min read
Tian Pan
Software Engineer

You spent weeks building a retrieval pipeline. Chunking strategy tuned, similarity thresholds calibrated, user feedback looking positive. Then one Monday morning, without any deployment on your end, retrieval quality starts degrading. Queries that used to surface the right documents now return loosely related noise. No error logs. No exceptions. The pipeline runs clean.

What changed was that your embedding provider updated its model. Your entire vector index, millions of documents painstakingly embedded, is now populated with vectors from a coordinate system that no longer matches what your query encoder produces. The result is not a crash. It's invisible garbage.