35 posts tagged with "vector-search"

How PII Redaction Sentinels Quietly Collapse Your Vector Index

· 10 min read
Tian Pan
Software Engineer

A support engineer pulled up your RAG console to debug a complaint. The customer had asked "what does my account look like right now," the answer had come back coherent and confident, and it had been about somebody else's account entirely. The top-3 retrieved chunks all belonged to other customers. The engineer ran the same query against a fresh corpus snapshot to rule out indexing lag. Same result. Then she ran it against a snapshot from six months ago, before the privacy redactor had shipped. The right customer's chunk came back at rank 1.

The redactor was working as designed. Every name was a [NAME], every email an [EMAIL], every account number an [ACCOUNT]. The legal team had a clean audit trail and the security team had a closed compliance ticket. What nobody on either team had modeled was that those sentinels, dropped into the same syntactic slots across millions of documents, were being seen by the embedding model as ordinary tokens — tokens that co-occurred more reliably with each other than any real content did. The redactor had not just removed information. It had added a new, very strong signal that every redacted document shared and nothing else did.

The RAG Dedup Step That Broke Silently and Flooded Your Top-K With Near-Duplicates

· 10 min read
Tian Pan
Software Engineer

A retrieval-augmented generation pipeline can degrade for weeks without a single metric noticing. The relevance scores look fine. The retrieval latency is unchanged. The eval slice that touches the broken topic moves a quarter of a point in the wrong direction, and your weekly review chalks it up to noise. Then someone reads the actual context window the model received for a customer ticket and sees the same paragraph three times — once in title case, once in lowercase, once with the punctuation stripped — and you understand that your top-five has secretly been a top-two for a month.

This is the class of failure where the system is doing exactly what it was told to do. The retriever is returning the most similar vectors to the query. Each of those vectors is genuinely about the right topic. The index has no idea that three of them came from the same paragraph indexed three ways, because the ingestion-time dedup pass that was supposed to catch that case is silently skipping it.
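As a minimal sketch of what that ingestion-time pass might look like, the snippet below collapses the three variants described above (title case, lowercase, punctuation stripped) onto one key before anything is embedded. The names `normalized_key` and `dedup_chunks` are hypothetical, and the normalization rules are an assumption about what "the same paragraph" should mean for your corpus.

```python
import hashlib
import string

# Translation table that deletes ASCII punctuation (an assumption: extend it
# if your corpus carries typographic quotes or other non-ASCII marks).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalized_key(text: str) -> str:
    """Hash the text after case folding, punctuation removal, and whitespace collapse."""
    folded = text.casefold().translate(_PUNCT_TABLE)
    collapsed = " ".join(folded.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def dedup_chunks(chunks: list[str]) -> list[str]:
    """Keep the first chunk seen for each normalized key, preserving input order."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        key = normalized_key(chunk)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

Logging `len(chunks) - len(dedup_chunks(chunks))` per ingestion batch is one cheap way to notice when the pass stops catching anything.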

The Vector Index That Was Sharded by Ingestion Date

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of recall lie that hides inside time-partitioned vector indexes, and the people who built the offline eval are usually the last to find it. The dashboard says recall@10 is 0.94. The retriever is shipping the right snippet 94% of the time. The product team is shipping more retrieval-grounded features on the back of that number. And then the support tickets arrive: "the assistant cited a guide that does not match the answer," "the assistant linked to last week's version of the policy," "the assistant could not find a document I uploaded two months ago." None of those tickets contradict the 0.94. They are evidence that the 0.94 is measuring the wrong thing.

The mechanism is simple and easy to miss. The vector index is sharded by ingestion date because that is the easiest way to keep write throughput high, retire old data, and keep the hot working set in fast memory. The offline test set is generated nightly from production logs, which means the queries are drawn from the same recent window that the freshest shard happens to hold. Recall is measured against ground truth that lives one or two shards deep. The retriever performs beautifully on those queries because, in production, those queries are the ones the routing layer keeps inside the same shard.

The Embedding That Aged Out of Meaning

· 9 min read
Tian Pan
Software Engineer

You embedded the knowledge base eighteen months ago. The model has not changed. The chunks have not changed. The index is healthy, the latency is fine, the recall dashboard is a flat line at 0.86. And yet support is quietly pasting the wrong article links into ticket replies, the sales bot keeps surfacing a deprecated SKU when a prospect asks about the new one, and an internal user just told you the assistant "feels dumber" without being able to say why.

Nothing broke. Your embeddings aged. The word post used to mean blog post in your domain; now half the corpus uses it for a Slack post, a forum post, and a job posting, and your eighteen-month-old vectors still treat it as one concept. The model that encoded those vectors never saw the new senses, never saw the new product names, never saw the rebrand, never saw the regulation that introduced three new terms your customers now use without thinking. The retrieval system answers the question it knows how to answer, which is no longer the question your users are asking.

When RAG Should Have Been a JOIN

· 9 min read
Tian Pan
Software Engineer

A support team asked their new AI assistant a simple question: "Which enterprise customers opened a ticket last week?" The assistant came back with a confident, fluent answer naming six accounts. Five were right. One had churned two months ago, and one enterprise account that had filed three tickets was missing entirely. Nobody caught it until a renewal call went sideways.

The bug was not in the model. It was in the architecture. Somewhere in the design review, a question with hard predicates — a plan tier, a date range, a ticket count — got routed to a vector index. The team had a retrieval system, so they retrieved. They embedded the ticket records, embedded the question, and asked cosine similarity to do the job of a WHERE clause. It cannot. It never could.

This is one of the most common and least discussed failure modes in production AI systems: reaching for semantic search when the real query was relational. The data lived in tidy rows with foreign keys. The answer was one JOIN away. Instead it went through an embedding model, and the precision evaporated.
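To make "one JOIN away" concrete, here is a small sketch using Python's built-in `sqlite3` module. The `customers` and `tickets` tables, their columns, and the sample rows are all hypothetical; the point is only that the plan tier and date range become exact predicates instead of similarity scores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, plan TEXT);
    CREATE TABLE tickets (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        opened_on TEXT  -- ISO 8601 date, so string comparison orders correctly
    );
    """
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "enterprise"), (2, "Globex", "enterprise"), (3, "Initech", "pro")],
)
conn.executemany(
    "INSERT INTO tickets (customer_id, opened_on) VALUES (?, ?)",
    [
        (1, "2024-06-03"), (1, "2024-06-05"), (1, "2024-06-07"),  # three tickets in range
        (2, "2024-03-15"),  # old ticket from an account that has since gone quiet
        (3, "2024-06-04"),  # in range, but not an enterprise plan
    ],
)


def enterprise_customers_with_tickets(conn, start, end):
    """Enterprise customers with at least one ticket opened in [start, end)."""
    return conn.execute(
        """
        SELECT c.name, COUNT(t.id) AS ticket_count
        FROM customers AS c
        JOIN tickets AS t ON t.customer_id = c.id
        WHERE c.plan = 'enterprise'
          AND t.opened_on >= ? AND t.opened_on < ?
        GROUP BY c.id, c.name
        ORDER BY c.name
        """,
        (start, end),
    ).fetchall()


result = enterprise_customers_with_tickets(conn, "2024-06-03", "2024-06-10")
```

Every row either satisfies the predicates or it does not, so the account with three tickets cannot go missing and the stale one cannot sneak in.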

The Embedding Upgrade That Silently Re-Ranks Your Entire Corpus

· 9 min read
Tian Pan
Software Engineer

A new embedding model lands on the leaderboard. It scores higher than the one you shipped eighteen months ago, the API is a one-line change, and the dimensions even match. Someone files a ticket: "upgrade embedding model." It looks like swapping a logging library.

It is not. The embedding model is not a component of your retrieval system — it is the coordinate system your retrieval system lives in. Changing it does not improve your index. It invalidates it. And the cruelest part is that nothing crashes. No exception, no failed health check. Your search just starts returning subtly different results, and "subtly different" in a RAG pipeline means a different document feeds the model, which means a different answer reaches the user.

The Vector Index Has a Staleness SLO Nobody Set

· 10 min read
Tian Pan
Software Engineer

A user asks your agent what the current price tier is for an enterprise plan. The agent retrieves a chunk, reads it, and answers: "$2,000 per month." Confident, sourced, formatted nicely. The problem is that pricing changed four days ago. The number the agent quoted was true last week. The chunk it retrieved was embedded before the change, and the index has not caught up.

Nobody decided this would happen. There was no design review where someone said "the agent may answer from data up to four days old." There is just a re-indexing job that runs nightly, or weekly, and a content team that edits the source whenever they feel like it, and a gap between those two clocks that nobody measures. That gap is a service level objective. It exists whether or not you wrote it down. The only question is whether you set it on purpose or inherited it by accident.
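Measuring that gap can start very small. The sketch below assumes you can get two timestamps per document, when the source was last edited and when it was last embedded; the function names, the dictionary layout, and the idea of a single SLO value are illustrative assumptions, not a prescribed design.

```python
from datetime import datetime, timedelta


def staleness(source_updated_at: datetime, indexed_at: datetime, now: datetime) -> timedelta:
    """How long the index has been serving content older than the source.

    Zero when the document was re-indexed at or after its last edit.
    """
    if indexed_at >= source_updated_at:
        return timedelta(0)
    return now - source_updated_at


def slo_violations(
    docs: dict[str, tuple[datetime, datetime]],
    slo: timedelta,
    now: datetime,
) -> list[str]:
    """Return ids of documents whose staleness exceeds the SLO.

    `docs` maps a document id to (source_updated_at, indexed_at).
    """
    return [
        doc_id
        for doc_id, (updated, indexed) in docs.items()
        if staleness(updated, indexed, now) > slo
    ]
```

With a pricing page edited four days ago and last embedded five days ago, `staleness` reports four days, which is the number nobody wrote down in the scenario above.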

The Embedding Migration Black Hole: How a Vector Model Bump Silently Rewrites Your Business Rules

· 11 min read
Tian Pan
Software Engineer

The migration ticket is one line: "Upgrade embedding model from v3-small to v3-large." The new model wins on the public benchmark by 12%. The pipeline change is six lines of Python. The team estimates two days of engineering plus a re-embedding job that runs over a weekend. Two months later, the duplicate-detection feature is producing twice as many false positives as it did before the swap, the "related items" carousel on the marketing site has quietly become a slop generator, and the semantic cache hit rate has fallen off a cliff because the threshold of 0.95 that worked perfectly in the old space now matches almost nothing.

Nobody touched those features. Nobody filed a bug. The model swap that the migration plan called "infrastructure" silently rewrote every business rule that consumed a similarity score.
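One way to see why a fixed 0.95 stops working is to re-score the same reference pairs in both spaces and ask what cutoff in the new space accepts the same number of them. The sketch below does exactly that; `recalibrate_threshold` is a hypothetical helper, and matching the acceptance count is an assumption that preserves how often the rule fires, not how precise it is.

```python
def recalibrate_threshold(
    old_scores: list[float],
    new_scores: list[float],
    old_threshold: float,
) -> float:
    """Pick a new-space threshold that accepts as many pairs as the old one did.

    old_scores[i] and new_scores[i] are similarities for the same pair of
    items, computed with the old and new embedding models respectively.
    """
    if not old_scores or len(old_scores) != len(new_scores):
        raise ValueError("need the same non-empty set of pairs scored in both spaces")
    accepted = sum(score >= old_threshold for score in old_scores)
    if accepted == 0:
        raise ValueError("the old threshold accepted no pairs, nothing to match")
    # The accepted-th highest new score is the loosest cutoff that still
    # admits that many pairs (ties may admit a few more).
    ranked = sorted(new_scores, reverse=True)
    return ranked[accepted - 1]
```

Each feature that consumes a similarity score (duplicate detection, related items, the semantic cache) would need its own reference pairs, because each one encodes a different business rule.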

The Freshness-Relevance Tradeoff in RAG: Why You Can't Optimize Both at Query Time

· 11 min read
Tian Pan
Software Engineer

A user asks your assistant what the company's parental leave policy is. The bot returns 12 weeks, with a citation. The cited document was the right answer in 2023; HR posted an update last quarter that took it to 16. Both versions are in your knowledge base. Cosine similarity scored the 2023 version 0.87 and the 2024 version 0.84, because the older page has the cleaner phrasing and fewer hedges. The fresher document loses by 0.03 and the user gets a wrong answer that looks audited.

This is the freshness-relevance tradeoff, and the uncomfortable part is that it has no clean solution at query time. If you weight recency, you bias retrieval toward whatever was edited yesterday — which in most knowledge bases is the noisy, high-churn surface area that should not be the source of truth. If you don't weight recency, you ship answers grounded in documents that were superseded months ago. There is no single global knob that gets both right, and most teams discover this only after a few embarrassing answers leak past their eval suite.
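A toy version of the global knob shows both failure modes. The similarity scores 0.87 and 0.84 come from the example above; the document ages, the 180-day half-life, the blending formula, and the third "draft" document are assumptions added purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    name: str
    similarity: float  # cosine similarity to the query
    age_days: float    # days since the document was last edited


def blended_score(doc: Doc, recency_weight: float, half_life_days: float = 180.0) -> float:
    """Linear blend of similarity and an exponentially decaying recency term."""
    recency = 0.5 ** (doc.age_days / half_life_days)
    return (1.0 - recency_weight) * doc.similarity + recency_weight * recency


def top_result(docs: list[Doc], recency_weight: float) -> str:
    """Name of the document with the highest blended score."""
    return max(docs, key=lambda d: blended_score(d, recency_weight)).name


docs = [
    Doc("policy_2023", similarity=0.87, age_days=500),
    Doc("policy_2024", similarity=0.84, age_days=90),
    Doc("draft_edited_yesterday", similarity=0.80, age_days=1),  # hypothetical noisy page
]
```

With `recency_weight=0.0` the superseded 2023 policy ranks first; with `recency_weight=0.2` the page edited yesterday overtakes both policies. A weight in between happens to work for this query, but nothing guarantees the same value works for the next one.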

Embedding Model Churn: When Your Provider Silently Invalidates Your Entire Vector Index

· 9 min read
Tian Pan
Software Engineer

You spent weeks building a retrieval pipeline. Chunking strategy tuned, similarity thresholds calibrated, user feedback looking positive. Then one Monday morning, without any deployment on your end, retrieval quality starts degrading. Queries that used to surface the right documents now return loosely related noise. No error logs. No exceptions. The pipeline runs clean.

What changed is that your embedding provider updated their model. Your entire vector index, millions of documents painstakingly embedded, is now populated with vectors from a coordinate system that no longer matches what your query encoder produces. The result is not a crash. It's invisible garbage.

The Vector Dimension Tax: How Embedding Size Quietly Drains Your Budget

· 8 min read
Tian Pan
Software Engineer

Most teams building RAG systems spend zero time thinking about embedding dimensions. They grab text-embedding-3-large, leave the dimensions at the default 3072, and move on. At 10,000 documents that's fine. At 10 million, you've handed your cloud provider a $30/month storage bill that should have been $3.75. At 100 million documents, you're staring at a terabyte of float32 values that mostly aren't earning their keep.

The relationship between embedding dimensions and actual retrieval quality is far weaker than the relationship between dimensions and operational cost. That gap — between the cost you're paying and the quality you're getting — is the vector dimension tax.
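The storage side of that tax is plain arithmetic: vectors times dimensions times four bytes per float32. The sketch below covers raw vector storage only and ignores index overhead and metadata; the 384-dimension comparison is an assumption, chosen because 3072 / 384 = 8, the same ratio as the two storage bills quoted above.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Bytes needed to store the vectors alone as float32."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


def to_gb(num_bytes: int) -> float:
    """Convert bytes to decimal gigabytes (10**9 bytes)."""
    return num_bytes / 1e9


full_10m = raw_vector_bytes(10_000_000, 3072)    # 122.88 GB
full_100m = raw_vector_bytes(100_000_000, 3072)  # 1228.8 GB, about 1.2 TB
small_10m = raw_vector_bytes(10_000_000, 384)    # 15.36 GB, one eighth of full_10m
```

Storage scales linearly with dimensions, so any reduction that retrieval quality can tolerate shows up directly on the bill.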

Chunking for Agents vs. RAG: Why One Strategy Breaks Both

· 9 min read
Tian Pan
Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.