Skip to main content

31 posts tagged with "embeddings"

View all tags

Vector DB Sharding: Why HNSW Breaks at Partition Boundaries and What to Do About It

· 9 min read
Tian Pan
Software Engineer

Most vector database tutorials show you how to insert a million embeddings and run a query. What they don't show you is what happens six months later, when your corpus has grown past what a single node can hold, and you're trying to shard the HNSW index your entire retrieval pipeline depends on. The answer, which vendors leave out of the marketing copy, is that HNSW graphs resist partitioning in ways that cause silent recall degradation — and the operational patterns needed to recover that quality add real complexity.

This post covers the technical reasons HNSW sharding breaks down, what recall loss looks like in practice, and the operational patterns teams use to maintain retrieval accuracy when they've outgrown a single node.

Your Embedding Model Choice Sets the Ceiling Your LLM Can't Raise

· 11 min read
Tian Pan
Software Engineer

A team I was advising had spent two months swapping LLMs in their RAG pipeline. Claude, GPT, Gemini, then back again. Each swap shaved a few percentage points off hallucination rate but never moved the needle on the metric that mattered: their support agents still couldn't find the right knowledge base article more than 60% of the time. They were tuning the wrong layer. The retriever was returning irrelevant chunks, and no amount of LLM cleverness can answer a question from documents the retriever never surfaced.

The embedding model is the part of a RAG system that decides what the LLM is even allowed to see. It draws the geometry of your corpus — which documents land near which queries in vector space. Once that geometry is wrong, the LLM is just a confident narrator of bad context. Swapping it for a smarter one usually makes the answers more articulate, not more correct.

Per-Vector Version Tags: The Missing Column Behind Every Embedding Migration

· 10 min read
Tian Pan
Software Engineer

A new embedding model lands. The benchmark numbers are 4% better. A staff engineer files the ticket: "Upgrade embeddings to v3." Two weeks later the index has been re-embedded, the alias has been swapped, and the team has shipped the change behind a feature flag. Six weeks later, support tickets pile up. Search results "feel off." A retro is scheduled. Nobody can explain what regressed because nothing crashed and every dashboard is green.

The problem is not the model swap. The problem is that the vector store has no idea which vectors came from which model. There is no column for it. There is no migration table tracking which records have been backfilled. There is no alembic_version row, no schema_migrations table, no pg_dump of the previous state. The team treated an embedding upgrade like a config flip, and the vector store had no schema-level concept that would have stopped them.

Embedding migrations need the same artifact that database migrations have relied on for two decades: a per-record version tag, written into every vector, queried on every read, and used as the gating criterion for cutover and rollback. It is the single column most teams forget to add, and adding it later costs more than adding it up front.

Embedding Migrations Are the New Schema Migrations

· 12 min read
Tian Pan
Software Engineer

The first time most teams swap an embedding model in production, they treat it as a batch job. Re-run the embedder, build a new index, swap the alias, deploy. Latency stays normal. Error rates stay zero. Every query returns results. And retrieval quality silently regresses for weeks before anyone notices, because the symptom is "users complain the answers feel off," not a red dashboard.

This is not a deployment problem. It is a schema migration that the team has decided to run blind. The old embedding space and the new one are different reference frames; the cosine geometry that used to mean "these two paragraphs are about the same topic" no longer means that with the same numerical confidence. Documents and queries that used to cluster together drift apart non-uniformly. Re-rankers trained on the old distribution start firing on examples that no longer match what they learned. The eval suite that scores green on pointwise relevance misses all of it, because no individual document moved very far while the entire graph rotated.

Treat the swap like a database migration and almost everything that goes wrong becomes preventable. Treat it like a batch job and the regressions arrive on a schedule that nobody owns.

The Embedding API Hidden Tax: Why Vector Spend Quietly Eclipses Generation

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter had a moment of quiet panic when their finance partner flagged the AI bill. They had assumed, like most teams do, that the expensive line item would be generation — the GPT-class calls behind chat, summarization, and agent reasoning. It wasn't. Their monthly embedding spend had silently crossed generation in January, doubled it by March, and was on track to triple it by mid-year. Nobody had modeled it because per-token pricing on embedding models looks like rounding error: two cents per million tokens for small, thirteen cents for large. At that rate, who budgets for it?

The answer is: anyone whose product survives past prototype and starts indexing things at scale. Semantic search over a growing corpus, duplicate detection, classification, clustering, reindexing when you swap models — every one of these workloads burns embedding tokens by the billion, not by the million. And unlike generation, which is gated by user requests, embedding throughput is only gated by what you decide to index. That decision rarely gets a cost review.

This post is about the specific mechanics of how embedding spend escalates, the architectural levers that bend the curve, and the breakeven math for moving off a hosted API onto something you run yourself.

Embedding Model Rotation Is a Database Migration, Not a Deploy

· 11 min read
Tian Pan
Software Engineer

Somewhere in a staging channel, an engineer writes "bumping the embedder to v3, new model scored +4 on MTEB, merging after the smoke test." Two days later support tickets start trickling in about search results that feel "weirdly off." A week later retrieval precision is down fourteen points, cosine scores have collapsed from 0.85 into the 0.65 range, and nobody can explain why — because the deploy looked identical to the last five model bumps. It wasn't a deploy. It was a database migration wearing a deploy's costume.

Embedding model rotation is the most misfiled change type in AI infrastructure. It lands in your system through the same channels as a prompt tweak or a generation-model pin update — a config file, a PR, a CI check — so it gets the governance of a config change. But under the hood, a new embedder does not produce a better version of your old vectors. It produces vectors that live in a different coordinate system entirely, where cosine similarity across the two manifolds is a category error. The correct mental model is not "rev the dependency." It is "swap the primary key encoding on a fifty-million-row table while serving reads."

Chunking Strategy Is the Hidden Load-Bearing Decision in Your RAG Pipeline

· 10 min read
Tian Pan
Software Engineer

Most RAG quality conversations focus on the wrong things. Teams debate embedding model selection, tweak retrieval top-K, and experiment with prompt templates — while a single architectural decision made during ingestion quietly caps how good the system can ever be. That decision is chunking strategy: how you cut documents into pieces before indexing them.

A 2025 benchmark study found that chunking configuration has as much or more influence on retrieval quality as embedding model choice. And yet teams routinely pick a default — 512 tokens with RecursiveCharacterTextSplitter, usually — and then spend months wondering why their retrieval precision keeps disappointing them. The problem was baked in at index time. Swapping models cannot fix it.

RAG Knowledge Base Freshness: The Staleness Problem Teams Solve Last

· 11 min read
Tian Pan
Software Engineer

Most RAG teams spend months tuning chunk sizes, experimenting with embedding models, and debating hybrid search configurations. Then they ship to production, declare success, and move on. Six months later, users start complaining that the system gives wrong answers — and the team discovers that the index they so carefully built has quietly rotted.

Index freshness is the problem that gets solved last, usually after a customer incident rather than before. Unlike retrieval quality failures that show up immediately in evals, staleness degrades silently: latency stays flat, retrieval appears functional, and standard RAG metrics like context recall and faithfulness score well — right up until the moment your system confidently returns a policy that was updated months ago.

Corpus Architecture for RAG: The Indexing Decisions That Determine Quality Before Retrieval Starts

· 12 min read
Tian Pan
Software Engineer

When a RAG system returns the wrong answer, the post-mortem almost always focuses on the same suspects: the retrieval query, the similarity threshold, the reranker, the prompt. Teams spend days tuning these components while the actual cause sits untouched in the indexing pipeline. The failure happened weeks ago when someone decided on a chunk size.

Most RAG quality problems are architectural, not operational. They stem from decisions made at index time that silently shape what the LLM will ever be allowed to see. By the time a user complains, the retrieval system is doing exactly what it was designed to do — it's just that the design was wrong.

Embedding Drift: The Silent Degradation Killing Your Long-Lived RAG System

· 10 min read
Tian Pan
Software Engineer

Your RAG system is running fine. Latency is normal. Error rate is zero. But a user asking about "California employment law" keeps getting results about real estate — and your logs show nothing wrong.

This is embedding drift in action: the retrieval failure mode that doesn't throw exceptions, doesn't spike error rates, and doesn't show up in standard observability dashboards. It happens when your vector store accumulates embeddings produced under different conditions — different model versions, different chunking rules, different preprocessing pipelines — and the vectors start pointing in incompatible directions. The system keeps serving requests, but the semantic coordinates are no longer aligned, and retrieval quality erodes quietly over weeks or months.

Retrieval Monoculture: Why Your RAG System Has Systematic Blind Spots

· 10 min read
Tian Pan
Software Engineer

Your RAG system's evals look fine. NDCG is acceptable. The demo works. But there's a category of failure no single-metric eval catches: the queries your retriever never even gets close on, consistently, because your entire embedding space was never equipped to handle them in the first place.

That's retrieval monoculture. One embedding model. One similarity metric. One retrieval path — and therefore one set of systematic blind spots that look like model errors, hallucination, or user confusion until you actually examine the retrieval layer.

The fix is not a bigger model or more data. It's understanding that different query structures need different retrieval mechanisms, and building a system that stops routing everything through the same funnel.

Retrieval Debt: Why Your RAG Pipeline Degrades Silently Over Time

· 10 min read
Tian Pan
Software Engineer

Six months after you shipped your RAG pipeline, something changed. Users aren't complaining loudly — they're just trusting the answers a little less. Feedback ratings dropped from 4.2 to 3.7. A few support tickets reference "outdated information." Your engineers look at the logs and see no errors, no timeouts, no obvious regression. The retrieval pipeline looks healthy by every metric you've configured.

It isn't. It's rotting.

Retrieval debt is the accumulated technical decay in a vector index: stale embeddings that no longer represent current document content, tombstoned chunks from deleted records that pollute search results, and semantic drift between the encoder version that indexed your corpus and the encoder version now computing query embeddings. Unlike code rot, retrieval debt produces no stack traces. It produces subtly wrong answers with confident-looking citations.