The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training
A retrieval team ships an off-the-shelf embedding model against their product catalogue. The eval set — a few hundred queries scraped from the search logs of the last month — comes back at recall@10 of 0.91. They promote to production. Three weeks in, support starts forwarding tickets: a user searched for the actual SKU of a part and got back five plausible-looking but wrong parts. Another user searched for the internal codename of a feature and got the marketing name of an unrelated feature. The eval set never caught it because the eval set was drawn from queries the system already handled — queries about common terms. The long tail of jargon, where the business actually lives, was never sampled.
The model didn't fail. The model did exactly what it was trained to do, against a vocabulary distribution that did not include the corpus the team handed it. The team treated the embedding as a domain-neutral primitive — a function from text to vector — when it was actually a contract about which vocabulary it could resolve, signed with someone else's training corpus.
The Bimodal Coverage Problem
Embedding models are trained on web-scale corpora — Common Crawl, Wikipedia, books, code. The vocabulary distribution they internalize reflects what the open internet talks about most often. Domain-specific terms — drug names, internal product codenames, contract clauses, part numbers, scientific notation, jargon that lives only inside one industry — appear rarely or not at all.
The result is a bimodal coverage pattern. Common tokens (verbs, prepositions, popular product names, well-known concepts) are densely represented in the embedding space; the model has seen them in many contexts and learned to distinguish their semantic neighborhoods. Rare tokens, the long tail where domain vocabulary lives, sit in a sparse, poorly differentiated region. Multiple distinct rare terms collapse into the same neighborhood because the model never had enough signal to push them apart.
Research from 2025 documenting this failure mode is blunt about it: dense retrievers fail on tail entities because the token distributions inside the model "forget" some tokens of those entities. The embedding for a rare drug name is not really an embedding of the drug. To a first approximation, it is an embedding of the subword pieces the tokenizer broke it into, pooled into a vector that points toward whatever common words happen to share those pieces. Two different drug names with overlapping subwords end up neighbors in the embedding space not because they are semantically similar but because their tokenizations are. Retrieval cheerfully returns one for the other.
This is the failure mode the team in the opening paragraph hit. The catalogue contained part numbers and SKUs the embedding model had never seen as cohesive tokens. The model embedded each one as a smear over its subword components. Retrieval was then a kind of fuzzy match against tokenizer artifacts rather than against meaning.
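The subword collapse described above can be made concrete with a toy splitter. The sketch below is an illustration, not how any particular embedding model tokenizes: the vocabulary `TOY_VOCAB`, the two SKUs, and the helper names are all hypothetical, and real BPE or WordPiece tokenizers differ in detail. It only shows how two distinct part numbers can share most of their pieces.

```python
# Toy illustration only: a greedy longest-match subword splitter over a
# hypothetical vocabulary. Real tokenizers (BPE, WordPiece) behave differently.

TOY_VOCAB = {"br", "k", "-", "47", "10", "a", "b"}  # hypothetical pieces


def split_subwords(term, vocab=TOY_VOCAB):
    """Split `term` left to right, always taking the longest known piece."""
    term = term.lower()
    pieces, i = [], 0
    while i < len(term):
        for j in range(len(term), i, -1):
            if term[i:j] in vocab:
                pieces.append(term[i:j])
                i = j
                break
        else:
            # No known piece starts here: emit the single character.
            pieces.append(term[i])
            i += 1
    return pieces


def piece_overlap(a, b):
    """Jaccard overlap between the sets of subword pieces of two terms."""
    pa, pb = set(split_subwords(a)), set(split_subwords(b))
    return len(pa & pb) / len(pa | pb)


print(split_subwords("BRK-4710A"))  # ['br', 'k', '-', '47', '10', 'a']
print(split_subwords("BRK-4710B"))  # ['br', 'k', '-', '47', '10', 'b']
print(round(piece_overlap("BRK-4710A", "BRK-4710B"), 3))  # 0.714
```

Five of the seven distinct pieces are shared, so anything built mostly from those pieces has little to separate the two SKUs. Running the same check with the real tokenizer of the model under evaluation is a cheap first diagnostic.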
Why The Eval Suite Did Not Catch It
The eval suite did not catch it because it was drawn from the wrong distribution. The team built their eval from queries the production system had already handled, and those were, by selection, queries the system was good at: common terms, paraphrases, navigational lookups. The long tail of jargon-heavy queries either was absent from the log (because users had given up and used the SKU column directly) or was present in such small numbers that the aggregate recall figure swamped it.
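A little arithmetic shows how a headline number can hide a broken tail. The split below is hypothetical, chosen only so the total matches the 0.91 in the story; the source gives no per-stratum figures.

```python
def aggregate_recall(strata):
    """Weighted mean of per-stratum recall.

    `strata` maps a stratum name to (share_of_queries, recall_at_10).
    """
    total_share = sum(share for share, _ in strata.values())
    weighted = sum(share * recall for share, recall in strata.values())
    return weighted / total_share


# Hypothetical numbers: 95% head queries, 5% jargon-heavy tail queries.
strata = {
    "head (common terms)": (0.95, 0.95),
    "tail (SKUs, codenames)": (0.05, 0.15),
}

print(round(aggregate_recall(strata), 2))  # 0.91
```

Under these assumed shares, a tail recall of 0.15 moves the aggregate by less than five points, which is why the per-stratum numbers need to be reported separately.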
Even teams who know to balance their eval often balance it the wrong way. They pull a stratified sample by query length, by user segment, by category. They do not stratify by how much of each query's vocabulary falls outside what the embedding model saw in training. That stratification is the one that matters and the one almost nobody computes.
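One way to approximate that stratification is sketched below. The model's training vocabulary is usually not available, so this assumes a proxy: a set of common words (`COMMON` here is a tiny hypothetical stand-in for a general-English word list). The 0.5 threshold, the sample queries, and the hit flags are likewise made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase word tokens; hyphenated codes such as 'zx-22' stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def rare_share(query, common_vocab):
    """Fraction of query tokens that are absent from the common vocabulary."""
    tokens = tokenize(query)
    if not tokens:
        return 0.0
    return sum(t not in common_vocab for t in tokens) / len(tokens)


def hit_rate_by_stratum(results, common_vocab, threshold=0.5):
    """Group (query, hit) pairs by rare-token share and average the hits.

    `hit` is True when a relevant document appeared in the top k, so the
    average equals recall@k if each query has one relevant document.
    """
    buckets = defaultdict(list)
    for query, hit in results:
        is_jargon = rare_share(query, common_vocab) >= threshold
        buckets["jargon-heavy" if is_jargon else "common"].append(hit)
    return {name: sum(hits) / len(hits) for name, hits in buckets.items()}


# Hypothetical proxy vocabulary and eval results.
COMMON = {"brake", "pads", "for", "front", "wheel", "how", "to",
          "replace", "a", "filter", "oil"}
results = [
    ("brake pads for front wheel", True),
    ("how to replace a filter", True),
    ("oil filter zx-22", True),
    ("brk-4710a", False),
    ("project bluefin", False),
    ("kx-9 gasket", True),
]

by_stratum = hit_rate_by_stratum(results, COMMON)
print(by_stratum)
```

The proxy is crude: a word list says nothing about how often the model saw a term. Swapping in token counts from the model's own tokenizer, if those are obtainable, would be a closer measure.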
The MTEB and BEIR benchmarks have the same problem at a larger scale. They aggregate across many tasks and produce a single number that lets you rank models against each other, but a top-of-leaderboard model can still underperform on a specific domain where the vocabulary distribution differs from the benchmark's. Treating a leaderboard score as a domain-quality signal is the same category error as treating a query-log eval as a coverage signal — both measure the model against text it has already been trained or evaluated against, not against the text your corpus actually contains.
The discipline the team needed is a domain-vocab eval that holds the rare terms fixed and measures retrieval against them specifically. Build the eval set from the long tail of your corpus's vocabulary, not from the head of your query log. The two distributions look nothing alike, and a model that scores well on one can be silently broken on the other.
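A minimal sketch of such a domain-vocab eval follows, under stated assumptions: rare terms are defined by document frequency in the corpus, each rare term is used as a query, and the documents containing it count as relevant. The corpus, the `common` exclusion list, and `confused_retrieve` (a stand-in for the real retriever that deliberately mixes up two SKUs) are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase word tokens; hyphenated codes such as 'zx-22' stay whole."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_tail_eval(corpus, max_doc_freq=1, exclude=frozenset()):
    """Map each rare term to the set of doc ids that contain it.

    `corpus` is {doc_id: text}. A term is kept if it appears in at most
    `max_doc_freq` documents and is not in `exclude`.
    """
    postings = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in set(tokenize(text)):
            postings[term].add(doc_id)
    return {
        term: ids
        for term, ids in postings.items()
        if len(ids) <= max_doc_freq and term not in exclude
    }


def recall_at_k(eval_set, retrieve, k=10):
    """Mean recall@k, where retrieve(query, k) returns a list of doc ids."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    scores = []
    for term, relevant in eval_set.items():
        returned = set(retrieve(term, k))
        scores.append(len(returned & relevant) / len(relevant))
    return sum(scores) / len(scores)


# Hypothetical three-document catalogue.
corpus = {
    "d1": "Brake pad set BRK-4710A for front axle",
    "d2": "Brake pad set BRK-4710B for rear axle",
    "d3": "Oil filter ZX-22 for diesel engines",
}
common = {"brake", "pad", "set", "for", "front", "rear", "axle",
          "oil", "filter", "diesel", "engines"}


def confused_retrieve(query, k):
    """Stand-in retriever that returns d2 for every BRK query."""
    return ["d2"] if query.startswith("brk") else ["d3"]


eval_set = build_tail_eval(corpus, exclude=common)
print(sorted(eval_set))  # ['brk-4710a', 'brk-4710b', 'zx-22']
print(recall_at_k(eval_set, confused_retrieve, k=1))
```

Bare terms are the hardest version of the query. A fuller eval would probably also wrap each term in realistic query templates, but the bare-term score already reveals whether the model can tell neighboring SKUs apart.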
What "Bad" Looks Like In Practice
- https://arxiv.org/pdf/2506.08592
- https://arxiv.org/pdf/2409.18511
- https://arxiv.org/pdf/2212.10380
- https://huggingface.co/blog/nvidia/domain-specific-embedding-finetune
- https://weaviate.io/blog/fine-tune-embedding-model
- https://arxiv.org/html/2412.17364v1
- https://www.systemoverflow.com/learn/ml-embeddings/embedding-quality-evaluation/mteb-and-beir-benchmark-evaluation
- https://app.ailog.fr/en/blog/news/beir-benchmark-update
- https://knowledgesdk.com/blog/embedding-model-comparison-2026
