The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training
A retrieval team ships an off-the-shelf embedding model against their product catalogue. The eval set — a few hundred queries scraped from the search logs of the last month — comes back at recall@10 of 0.91. They promote to production. Three weeks in, support starts forwarding tickets: a user searched for the actual SKU of a part and got back five plausible-looking but wrong parts. Another user searched for the internal codename of a feature and got the marketing name of an unrelated feature. The eval set never caught it because the eval set was drawn from queries the system already handled — queries about common terms. The long tail of jargon, where the business actually lives, was never sampled.
The model didn't fail. The model did exactly what it was trained to do, against a vocabulary distribution that did not include the corpus the team handed it. The team treated the embedding as a domain-neutral primitive — a function from text to vector — when it was actually a contract about which vocabulary it could resolve, signed with someone else's training corpus.
