33 posts tagged with "vector-search"

The Vector Index That Was Sharded by Ingestion Date

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of recall lie that hides inside time-partitioned vector indexes, and the people who built the offline eval are usually the last to find it. The dashboard says recall@10 is 0.94. The retriever is shipping the right snippet 94% of the time. The product team is shipping more retrieval-grounded features on the back of that number. And then the support tickets arrive: "the assistant cited a guide that does not match the answer," "the assistant linked to last week's version of the policy," "the assistant could not find a document I uploaded two months ago." None of those tickets contradict the 0.94. They are evidence that the 0.94 is measuring the wrong thing.

The mechanism is simple and easy to miss. The vector index is sharded by ingestion date because that is the easiest way to keep write throughput high, retire old data, and keep the hot working set in fast memory. The offline test set is generated nightly from production logs, which means the queries are drawn from the same recent window that the freshest shard happens to hold. Recall is measured against ground truth that lives one or two shards deep. The retriever performs beautifully on those queries because, in production, those queries are the ones the routing layer keeps inside the same shard.
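One way to make that bias visible is to stop reporting a single aggregate and break recall out by the age of the ground-truth document. The sketch below is a minimal illustration, not the eval the post describes: the record shape, the bucket edges, and the sample data are all assumptions.

```python
from collections import defaultdict

def recall_by_age_bucket(records, bucket_edges=(7, 30, 90)):
    """Return recall per age bucket of the ground-truth document.

    records: iterable of (doc_age_days, hit) pairs, where hit is True when
    the ground-truth document appeared in the top-k results.
    bucket_edges: upper bounds in days for each bucket (hypothetical values).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for age_days, hit in records:
        # Anything older than the last edge falls into the overflow bucket.
        label = f">{bucket_edges[-1]}d"
        for edge in bucket_edges:
            if age_days <= edge:
                label = f"<={edge}d"
                break
        totals[label] += 1
        hits[label] += int(hit)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up eval records: recent documents dominate the sample, as they
# would when the test set is drawn from recent production logs.
sample = [(1, True), (2, True), (3, True), (5, False),
          (20, True), (60, False), (200, False), (400, True)]
per_bucket = recall_by_age_bucket(sample)
```

If the older buckets hold only a handful of records, that thin coverage is itself the finding: the headline number says little about them.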

The Embedding That Aged Out of Meaning

· 9 min read
Tian Pan
Software Engineer

You embedded the knowledge base eighteen months ago. The model has not changed. The chunks have not changed. The index is healthy, the latency is fine, the recall dashboard is a flat line at 0.86. And yet support is quietly pasting the wrong article links into ticket replies, the sales bot keeps surfacing a deprecated SKU when a prospect asks about the new one, and an internal user just told you the assistant "feels dumber" without being able to say why.

Nothing broke. Your embeddings aged. The word "post" used to mean a blog post in your domain; now half the corpus uses it for a Slack post, a forum post, and a job posting, and your eighteen-month-old vectors still treat it as one concept. The model that encoded those vectors never saw the new senses, never saw the new product names, never saw the rebrand, never saw the regulation that introduced three new terms your customers now use without thinking. The retrieval system answers the question it knows how to answer, which is no longer the question your users are asking.

When RAG Should Have Been a JOIN

· 9 min read
Tian Pan
Software Engineer

A support team asked their new AI assistant a simple question: "Which enterprise customers opened a ticket last week?" The assistant came back with a confident, fluent answer naming six accounts. Five were right. One had churned two months ago, and one enterprise account that had filed three tickets was missing entirely. Nobody caught it until a renewal call went sideways.

The bug was not in the model. It was in the architecture. Somewhere in the design review, a question with hard predicates — a plan tier, a date range, a ticket count — got routed to a vector index. The team had a retrieval system, so they retrieved. They embedded the ticket records, embedded the question, and asked cosine similarity to do the job of a WHERE clause. It cannot. It never could.

This is one of the most common and least discussed failure modes in production AI systems: reaching for semantic search when the real query was relational. The data lived in tidy rows with foreign keys. The answer was one JOIN away. Instead it went through an embedding model, and the precision evaporated.
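For contrast, here is roughly what the relational version of that question looks like. This is a minimal sketch using an in-memory SQLite database; the table names, columns, sample rows, and the choice to exclude churned accounts are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    id INTEGER PRIMARY KEY, name TEXT, plan TEXT, status TEXT
);
CREATE TABLE tickets (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    opened_at TEXT
);
INSERT INTO customers VALUES
    (1, 'Acme', 'enterprise', 'active'),
    (2, 'Globex', 'enterprise', 'churned'),
    (3, 'Initech', 'starter', 'active');
INSERT INTO tickets VALUES
    (10, 1, '2024-05-06'),
    (11, 1, '2024-05-07'),
    (12, 2, '2024-03-01'),
    (13, 3, '2024-05-08');
""")

def enterprise_customers_with_tickets(conn, start, end):
    """Active enterprise customers with tickets opened in [start, end)."""
    return conn.execute(
        """
        SELECT c.name, COUNT(t.id) AS ticket_count
        FROM customers AS c
        JOIN tickets AS t ON t.customer_id = c.id
        WHERE c.plan = 'enterprise'
          AND c.status = 'active'
          AND t.opened_at >= ? AND t.opened_at < ?
        GROUP BY c.id, c.name
        ORDER BY c.name
        """,
        (start, end),
    ).fetchall()

result = enterprise_customers_with_tickets(conn, "2024-05-06", "2024-05-13")
```

Every predicate in the question maps to an exact filter, so a churned account or an out-of-range ticket cannot slip in on similarity alone.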

The Embedding Upgrade That Silently Re-Ranks Your Entire Corpus

· 9 min read
Tian Pan
Software Engineer

A new embedding model lands on the leaderboard. It scores higher than the one you shipped eighteen months ago, the API is a one-line change, and the dimensions even match. Someone files a ticket: "upgrade embedding model." It looks like swapping a logging library.

It is not. The embedding model is not a component of your retrieval system — it is the coordinate system your retrieval system lives in. Changing it does not improve your index. It invalidates it. And the cruelest part is that nothing crashes. No exception, no failed health check. Your search just starts returning subtly different results, and "subtly different" in a RAG pipeline means a different document feeds the model, which means a different answer reaches the user.
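A cheap way to see the shift before it reaches users is to run the same queries through both models and compare the top-k document IDs. The sketch below uses toy two-dimensional vectors in place of real embeddings; the vectors and helper names are assumptions, and Jaccard overlap is only one of several reasonable measures.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k):
    """Return the k document IDs most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

def overlap_at_k(old_top, new_top):
    """Jaccard overlap between two top-k result lists."""
    old_set, new_set = set(old_top), set(new_top)
    return len(old_set & new_set) / len(old_set | new_set)

# Toy stand-ins for the same three documents embedded by two models.
old_docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
new_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.8, 0.2]}

old_top = top_k([1.0, 0.0], old_docs, k=2)
new_top = top_k([1.0, 0.0], new_docs, k=2)
overlap = overlap_at_k(old_top, new_top)
```

An overlap well below 1.0 across a representative query set means different documents will feed the model, even though nothing has failed.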

The Vector Index Has a Staleness SLO Nobody Set

· 10 min read
Tian Pan
Software Engineer

A user asks your agent what the current price tier is for an enterprise plan. The agent retrieves a chunk, reads it, and answers: "$2,000 per month." Confident, sourced, formatted nicely. The problem is that pricing changed four days ago. The number the agent quoted was true last week. The chunk it retrieved was embedded before the change, and the index has not caught up.

Nobody decided this would happen. There was no design review where someone said "the agent may answer from data up to four days old." There is just a re-indexing job that runs nightly, or weekly, and a content team that edits the source whenever they feel like it, and a gap between those two clocks that nobody measures. That gap is a service level objective. It exists whether or not you wrote it down. The only question is whether you set it on purpose or inherited it by accident.
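Measuring the gap needs only two timestamps per document: when the source last changed and when the indexed copy was last embedded. The sketch below assumes both are available as plain dictionaries. The function name, the one-day SLO, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

def stale_chunks(source_updated_at, indexed_at, now, slo):
    """Return {doc_id: staleness} for documents breaching the staleness SLO.

    A document counts as stale when its indexed copy is missing or older
    than the latest source edit. Staleness is the time since that edit.
    """
    violations = {}
    for doc_id, updated in source_updated_at.items():
        embedded = indexed_at.get(doc_id)
        if embedded is None or embedded < updated:
            staleness = now - updated
            if staleness > slo:
                violations[doc_id] = staleness
    return violations

now = datetime(2024, 6, 10, 12, 0)
source_updated_at = {
    "pricing": datetime(2024, 6, 6, 12, 0),  # edited four days ago
    "faq": datetime(2024, 6, 1, 9, 0),
}
indexed_at = {
    "pricing": datetime(2024, 6, 5, 2, 0),   # embedded before the edit
    "faq": datetime(2024, 6, 2, 2, 0),       # embedded after the edit
}
violations = stale_chunks(source_updated_at, indexed_at, now, slo=timedelta(days=1))
```

Running a check like this on a schedule turns the inherited gap into a number you can alert on.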

The Embedding Migration Black Hole: How a Vector Model Bump Silently Rewrites Your Business Rules

· 11 min read
Tian Pan
Software Engineer

The migration ticket is one line: "Upgrade embedding model from v3-small to v3-large." The new model wins on the public benchmark by 12%. The pipeline change is six lines of Python. The team estimates two days of engineering plus a re-embedding job that runs over a weekend. Two months later, the duplicate-detection feature is producing twice as many false positives as it did before the swap, the "related items" carousel on the marketing site has quietly become a slop generator, and the semantic cache hit rate has fallen off a cliff because the threshold of 0.95 that worked perfectly in the old space now matches almost nothing.

Nobody touched those features. Nobody filed a bug. The model swap that the migration plan called "infrastructure" silently rewrote every business rule that consumed a similarity score.
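One hedged way to carry a threshold across spaces is to score the same reference pairs under both models and pick the new cutoff that passes the same number of pairs as the old one did. The sketch below assumes such a paired score list exists; the scores are made up, and calibrating against labeled pairs for a target precision would be the more careful route.

```python
def matched_threshold(old_scores, new_scores, old_threshold):
    """Return the new-space threshold that passes as many pairs as
    old_threshold passed in the old space, or None if none passed.

    old_scores and new_scores are similarity scores for the same reference
    pairs under the old and new model. Tied scores in the new space can
    let a few extra pairs through.
    """
    passed = sum(1 for score in old_scores if score >= old_threshold)
    if passed == 0:
        return None
    ranked = sorted(new_scores, reverse=True)
    return ranked[passed - 1]

# Made-up scores for four reference pairs under each model.
old_scores = [0.97, 0.96, 0.80, 0.70]
new_scores = [0.88, 0.85, 0.60, 0.55]

new_threshold = matched_threshold(old_scores, new_scores, old_threshold=0.95)
still_passing_at_old = sum(1 for score in new_scores if score >= 0.95)
```

In this toy data the old cutoff of 0.95 passes nothing in the new space, while the matched cutoff restores the old pass count. Every feature that consumes a similarity score needs its own version of this check.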

The Freshness-Relevance Tradeoff in RAG: Why You Can't Optimize Both at Query Time

· 11 min read
Tian Pan
Software Engineer

A user asks your assistant what the company's parental leave policy is. The bot returns 12 weeks, with a citation. The cited document was the right answer in 2023; HR posted an update last quarter that took it to 16 weeks. Both versions are in your knowledge base. Cosine similarity scored the 2023 version 0.87 and the 2024 version 0.84, because the older page has the cleaner phrasing and fewer hedges. The fresher document loses by 0.03 and the user gets a wrong answer that looks audited.

This is the freshness-relevance tradeoff, and the uncomfortable part is that it has no clean solution at query time. If you weight recency, you bias retrieval toward whatever was edited yesterday — which in most knowledge bases is the noisy, high-churn surface area that should not be the source of truth. If you don't weight recency, you ship answers grounded in documents that were superseded months ago. There is no single global knob that gets both right, and most teams discover this only after a few embarrassing answers leak past their eval suite.
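To see why a single knob struggles, take the two scores above and add a recency decay. The sketch below multiplies cosine similarity by an exponential half-life decay. The document ages, the third candidate, and the half-life values are all assumptions chosen to illustrate the sensitivity, not measurements.

```python
def blended_score(similarity, age_days, half_life_days):
    """Cosine similarity multiplied by an exponential recency decay."""
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

# (similarity, age in days). Similarities for the two policy pages come
# from the example above; ages and the draft page are hypothetical.
candidates = {
    "policy_2023": (0.87, 540),
    "policy_2024": (0.84, 90),
    "draft_edited_yesterday": (0.70, 1),
}

def winner(half_life_days):
    """Return the top-ranked candidate for a given half-life."""
    return max(
        candidates,
        key=lambda doc_id: blended_score(*candidates[doc_id], half_life_days),
    )

no_recency = winner(float("inf"))  # decay is 1.0 for every document
gentle = winner(3650)              # ten-year half-life
aggressive = winner(180)           # six-month half-life
```

With no decay the superseded page wins, with a gentle decay the current policy wins, and with an aggressive decay a weakly related page edited yesterday wins. The setting that fixes this query is not guaranteed to fix the next one.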

Embedding Model Churn: When Your Provider Silently Invalidates Your Entire Vector Index

· 9 min read
Tian Pan
Software Engineer

You spent weeks building a retrieval pipeline. Chunking strategy tuned, similarity thresholds calibrated, user feedback looking positive. Then one Monday morning, without any deployment on your end, retrieval quality starts degrading. Queries that used to surface the right documents now return loosely related noise. No error logs. No exceptions. The pipeline runs clean.

What changed is that your embedding provider updated its model. Your entire vector index, millions of documents painstakingly embedded, is now populated with vectors from a coordinate system that no longer matches what your query encoder produces. The result is not a crash. It's invisible garbage.
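A simple guard is to keep reference vectors for a few fixed canary strings, captured when the index was built, and periodically re-embed them. The sketch below takes the embedding call as a plain callable, so `embed`, the canary texts, and the 0.999 tolerance are all hypothetical stand-ins for whatever your provider and pipeline use.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def drifted_canaries(embed, canaries, min_similarity=0.999):
    """Return canary texts whose fresh embedding no longer matches the
    reference vector stored at index-build time.

    embed: callable mapping text to a vector (hypothetical provider call).
    canaries: {text: reference_vector}.
    """
    drifted = []
    for text, reference in canaries.items():
        fresh = embed(text)
        if len(fresh) != len(reference) or cosine(fresh, reference) < min_similarity:
            drifted.append(text)
    return drifted

# Toy reference vectors standing in for real embeddings.
canaries = {"refund policy": [1.0, 0.0], "reset password": [0.0, 1.0]}

# Fake encoders: one unchanged, one simulating a silent model update.
stable = drifted_canaries(lambda text: canaries[text], canaries)
changed = drifted_canaries(lambda text: [0.6, 0.8], canaries)
```

If any canary drifts, queries and the stored index no longer share a space, and re-embedding is the only real fix.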

The Vector Dimension Tax: How Embedding Size Quietly Drains Your Budget

· 8 min read
Tian Pan
Software Engineer

Most teams building RAG systems spend zero time thinking about embedding dimensions. They grab text-embedding-3-large, leave the dimensions at the default 3072, and move on. At 10,000 documents that's fine. At 10 million, you've handed your cloud provider a $30/month storage bill that should have been $3.75. At 100 million documents, you're staring at a terabyte of float32 values that mostly aren't earning their keep.

The relationship between embedding dimensions and actual retrieval quality is far weaker than the relationship between dimensions and operational cost. That gap — between the cost you're paying and the quality you're getting — is the vector dimension tax.
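The raw storage arithmetic is easy to check. The sketch below counts only the float32 vector payload, ignoring index overhead, metadata, and replication. The 384-dimension target is an assumption; it is simply the size that gives an eightfold reduction from 3072.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Bytes needed to store the raw vectors (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value

GB = 10**9

full_10m = raw_vector_bytes(10_000_000, 3072)    # default dimensions
small_10m = raw_vector_bytes(10_000_000, 384)    # assumed reduced size
full_100m = raw_vector_bytes(100_000_000, 3072)

print(f"10M docs at 3072 dims: {full_10m / GB:.2f} GB")
print(f"10M docs at 384 dims: {small_10m / GB:.2f} GB")
print(f"100M docs at 3072 dims: {full_100m / GB:.2f} GB")
```

Storage scales linearly with dimension count, so any cost priced per gigabyte follows the same ratio.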

Chunking for Agents vs. RAG: Why One Strategy Breaks Both

· 9 min read
Tian Pan
Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.

The Embedding Fine-Tuning Gap: Generic Vectors Don't Know What Relevant Means in Your Domain

· 11 min read
Tian Pan
Software Engineer

Your RAG pipeline looks solid on paper: chunking is clean, the vector store is indexed, latency is acceptable. But users keep complaining that the results are wrong — not completely wrong, just slightly wrong in ways that matter. The retrieved passage discusses the right concept but from the wrong time period. It covers the right topic but from the wrong jurisdiction. It mentions the right product but is missing the inventory signal that would make it actually useful.

This is the embedding fine-tuning gap. Generic embedding models are trained to encode semantic similarity — the property of two texts meaning roughly the same thing. That's not the same as relevance. Relevance is domain-specific, context-sensitive, and often invisible to a model trained on web-scale generic corpora.

Reranking Is the Real Work: Why Your Retrieval System's Bottleneck Is Never the Index

· 10 min read
Tian Pan
Software Engineer

Teams building RAG systems almost universally hit the same wall: they spend a week tuning their HNSW index parameters, add product quantization, push recall@100 from 0.81 to 0.87 — and then watch LLM output quality barely budge. The assumption baked into months of effort is that a better index equals better answers. It doesn't. The bottleneck was never the index.

The actual chokepoint is the ranking step between your candidate set and your context window. What you put into the LLM determines what comes out, and the job of ranking is to ensure that the most genuinely relevant documents, not just the most semantically similar ones, make it through. That distinction matters more than any HNSW configuration you'll ever tune.
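The shape of that ranking step can be sketched in a few lines. The relevance scorer here is a toy token-overlap function standing in for a real reranker such as a cross-encoder; the function names, the query, and the candidate documents are all made up for illustration.

```python
def token_overlap(query, doc):
    """Toy relevance score: number of distinct lowercase tokens shared by
    query and document. A stand-in for a real reranking model."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rerank(query, candidates, score_relevance, top_n):
    """Re-score index candidates for relevance and keep the best top_n."""
    scored = [(score_relevance(query, doc), doc) for doc in candidates]
    # sort() is stable, so ties keep their original index order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

query = "refund policy for annual plans"
# Candidates in the order the vector index returned them (hypothetical).
candidates = [
    "Our plans and pricing overview",
    "Refund policy for annual plans and monthly plans",
    "Annual report 2023",
]
context_docs = rerank(query, candidates, token_overlap, top_n=2)
```

The index only has to get the right document into the candidate set. The reranker decides whether it actually reaches the context window.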