
Your Embedding Model Choice Sets the Ceiling Your LLM Can't Raise

· 11 min read
Tian Pan
Software Engineer

A team I was advising had spent two months swapping LLMs in their RAG pipeline. Claude, GPT, Gemini, then back again. Each swap shaved a few percentage points off hallucination rate but never moved the needle on the metric that mattered: their support agents still couldn't find the right knowledge base article more than 60% of the time. They were tuning the wrong layer. The retriever was returning irrelevant chunks, and no amount of LLM cleverness can answer a question from documents the retriever never surfaced.

The embedding model is the part of a RAG system that decides what the LLM is even allowed to see. It draws the geometry of your corpus — which documents land near which queries in vector space. Once that geometry is wrong, the LLM is just a confident narrator of bad context. Swapping it for a smarter one usually makes the answers more articulate, not more correct.

This post is about why the embedding layer is the highest-leverage choice in a retrieval pipeline, and the framework I use to pick one: domain match first, then the trade-off between dimensionality and recall, then multilingual behavior, then instruction tuning. Most teams underinvest in this layer until quality plateaus, then discover their generation budget was being wasted on retrieval debt the whole time.

The Retrieval Ceiling Is Real and It's Lower Than You Think

There's a thought experiment worth running before any RAG project: assume the LLM is perfect. Assume it never hallucinates, follows every instruction, reasons flawlessly. What's the upper bound on your system's accuracy? It's whatever fraction of queries your retriever returns the right documents for. That number is your retrieval ceiling, and it caps everything downstream.

Teams discover this the hard way. They run the pipeline, see mediocre output, and reach for the most visible knob: the LLM. But if the retriever surfaces the right chunk only 70% of the time, the system can't exceed 70% even with a model trained on the heat death of the universe. The ceiling is set in the embedding step — the moment a query gets converted to a vector and matched against the corpus.
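Here is a minimal sketch of that thought experiment, assuming you already have a set of queries labeled with their correct documents. The function name `retrieval_ceiling`, the dictionary shapes, and the toy IDs are illustrative assumptions, not part of any library.

```python
def retrieval_ceiling(retrieved, relevant, k=10):
    """Fraction of queries with at least one correct doc in the top k.

    retrieved: dict of query id -> ranked list of doc ids (assumed shape)
    relevant:  dict of query id -> set of doc ids labeled as correct
    """
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = sum(
        1
        for qid, gold in relevant.items()
        if gold & set(retrieved.get(qid, [])[:k])
    )
    return hits / len(relevant)


# Toy data: 10 labeled queries; the retriever finds the right doc for 7.
relevant = {f"q{i}": {f"doc{i}"} for i in range(10)}
retrieved = {
    f"q{i}": [f"doc{i}", "noise"] if i < 7 else ["noise", "other"]
    for i in range(10)
}

ceiling = retrieval_ceiling(retrieved, relevant, k=10)
print(f"retrieval ceiling: {ceiling:.0%}")  # prints "retrieval ceiling: 70%"
```

Even a perfect LLM sitting behind this retriever would answer at most 7 of these 10 queries from the provided context.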

What makes this insidious is that the symptoms look like LLM problems. The model "ignores" provided context (because the wrong context was provided). The model "makes up" answers (because the right snippet wasn't in the top-k). The model "contradicts" itself across queries (because semantically equivalent queries are returning different neighborhoods). All of these get diagnosed as generation issues and triaged with prompt engineering. The fix is upstream.

A useful diagnostic: instrument your pipeline to log retrieved chunks alongside model outputs, then have a human grade two things separately: whether the right chunk was retrieved, and whether the answer was correct. Most teams find that retrieval is wrong on 30–50% of failed queries. Those failures are the retrieval ceiling at work, and no LLM swap touches them.
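One way to tally that grading, sketched under the assumption that each graded query is a small record with two human judgments. The field names `retrieval_ok` and `answer_ok`, the helper `attribute_failures`, and the sample queries are all hypothetical.

```python
from collections import Counter


def attribute_failures(graded):
    """Count failed queries by the layer that most plausibly caused them.

    graded: list of dicts with boolean keys "retrieval_ok" and "answer_ok"
    (assumed shape; both values come from human grading).
    """
    counts = Counter()
    for row in graded:
        if row["answer_ok"]:
            continue  # only failed queries are attributed
        layer = "generation" if row["retrieval_ok"] else "retrieval"
        counts[layer] += 1
    return counts


graded = [
    {"query": "reset password", "retrieval_ok": True, "answer_ok": True},
    {"query": "refund policy", "retrieval_ok": False, "answer_ok": False},
    {"query": "export invoices", "retrieval_ok": True, "answer_ok": False},
    {"query": "sso setup", "retrieval_ok": False, "answer_ok": False},
]

failures = attribute_failures(graded)
total = sum(failures.values())
retrieval_share = failures["retrieval"] / total if total else 0.0
print(f"failures caused by retrieval: {failures['retrieval']} of {total}")
```

If the retrieval share is large, prompt engineering is working on the wrong layer.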

Domain Match Beats Benchmark Position Almost Every Time

The MTEB leaderboard is what most teams reach for when picking an embedding model. It's the wrong primary signal. MTEB measures performance on web text, news, Wikipedia, and academic abstracts. If your corpus is biomedical literature, financial filings, legal contracts, source code, or industry-specific support tickets, the leaderboard ranking tells you almost nothing about how the model will behave on your data.

This is partly a coverage problem and partly a contamination problem. MTEB datasets have been around long enough that newer models include them in their training distribution, and the line between "tested zero-shot" and "trained on similar splits" has blurred. The benchmark also overweights tasks that have lots of public data — semantic textual similarity, generic retrieval — and underrepresents the domains most enterprises actually care about.

The practical move is to build a small in-domain eval set before picking a model. Two hundred queries, hand-labeled with the correct documents, drawn from real user behavior or production logs. Run the top three or four candidate models against it and look at recall@10 and MRR. The rankings will often invert relative to MTEB. A general-purpose model that placed seventh on the leaderboard can outperform the leader on your specific corpus by ten or fifteen points.
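A small harness for that comparison might look like the sketch below. It assumes you have already run each candidate model and saved its ranked results per query; the names `recall_at_k`, `reciprocal_rank`, `evaluate`, and the `model_a`/`model_b` runs are placeholders, not real models or library functions.

```python
def recall_at_k(ranked, gold, k=10):
    """Share of the labeled-correct docs that appear in the top k results."""
    if not gold:
        return 0.0
    return len(gold & set(ranked[:k])) / len(gold)


def reciprocal_rank(ranked, gold):
    """1 / rank of the first correct doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


def evaluate(run, labels, k=10):
    """Average recall@k and MRR over every labeled query."""
    n = len(labels)
    recall = sum(recall_at_k(run.get(q, []), g, k) for q, g in labels.items()) / n
    mrr = sum(reciprocal_rank(run.get(q, []), g) for q, g in labels.items()) / n
    return {"recall@k": recall, "mrr": mrr}


# Hand-labeled eval set: query id -> set of correct doc ids.
labels = {"q1": {"a"}, "q2": {"b", "c"}}

# Ranked results from two hypothetical candidate models.
runs = {
    "model_a": {"q1": ["a", "x"], "q2": ["x", "b", "y"]},
    "model_b": {"q1": ["x", "a"], "q2": ["c", "b", "x"]},
}

scores = {name: evaluate(run, labels, k=10) for name, run in runs.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]["recall@k"]):
    print(name, s)
```

In this toy data the two models tie on MRR but differ on recall@10, which is why it is worth looking at both numbers rather than one.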

For specialized domains, prefer fine-tuned variants when they exist. PubMedBERT and BioLORD for medical text. Voyage Finance and BGE Financial Matryoshka for financial filings. Code-specific embeddings for repository search. The gains over general models in these domains routinely run 10–30%. If a tuned variant doesn't exist for your domain, fine-tuning a base model on a few thousand query-document pairs from your own data is usually a higher-ROI investment than upgrading the LLM.

Dimensions Are a Cost Lever, Not a Quality Knob

There's a folk belief that more dimensions means better retrieval. Sometimes it's true at the margin. Mostly it's not, and Matryoshka Representation Learning has changed the calculus enough that any decision made before MRL became standard is probably worth revisiting.

MRL trains an embedding so that the early dimensions carry the most semantically important information, with later dimensions adding finer detail. The result is that you can truncate a 3,072-dim vector down to 1,024 or 512 dimensions and lose only a few percent of retrieval quality. Gemini Embedding 2 shows less than 1% recall@10 loss going from 3,072 to 2,048 dimensions, and Jina v3 maintains 92% of full-dimension performance even at 64 dimensions.
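Mechanically, truncation is just slicing off the tail and rescaling to unit length so cosine and dot-product scores stay comparable. The sketch below uses plain Python lists and a made-up four-dimensional vector; `truncate_embedding` is a hypothetical helper, and the approach only preserves quality for models that were actually trained with MRL.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` components and rescale to unit L2 norm."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]


# Made-up vector standing in for a full-size embedding.
full = [3.0, 4.0, 0.1, 0.2]
short = truncate_embedding(full, 2)
print(short)  # prints [0.6, 0.8]
```

Truncate queries and documents to the same length; mixing dimensions in one index does not work.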

Why does this matter? Vector storage and search costs scale roughly linearly with dimensionality. At 4 bytes per float32 value, a 3,072-dim index of 100 million vectors is ~1.2TB before any indexing overhead. The same corpus at 768 dimensions is about 300GB. The recall difference might be 2–3 points; the cost difference is 4x. For most production workloads, that trade is overwhelmingly in favor of the smaller representation, especially when paired with a reranker that recovers the lost precision on the top-k.
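The arithmetic is easy to check. This sketch assumes uncompressed float32 storage (4 bytes per value) and ignores index overhead and quantization; `raw_index_bytes` is a hypothetical helper name.

```python
def raw_index_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in bytes, before any index overhead."""
    return num_vectors * dims * bytes_per_value


full = raw_index_bytes(100_000_000, 3072)
small = raw_index_bytes(100_000_000, 768)

print(f"3,072 dims: {full / 1e12:.2f} TB")   # prints "3,072 dims: 1.23 TB"
print(f"768 dims:   {small / 1e9:.0f} GB")   # prints "768 dims:   307 GB"
print(f"ratio:      {full / small:.0f}x")    # prints "ratio:      4x"
```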
