The Semantic Cache That Confidently Returns the Wrong Answer

· 9 min read
Tian Pan
Software Engineer

Two support users ask your agent almost the same question within a minute of each other. The first asks, "What's our refund window for EU orders?" The second asks, "What's our refund window for US orders?" The embeddings of those two sentences sit a hair's breadth apart — same length, same structure, one two-letter token of difference. Your semantic cache, tuned to a similarity threshold that looked perfectly reasonable in the demo, scores them as a match. The second user gets the first user's answer. The EU's 14-day cooling-off period is presented to a US customer as fact, in fluent prose, with no asterisk.

Nobody gets paged for this. The cache returned a 200. Latency was great. The cost dashboard shows a hit, which is the outcome everyone wanted. The only signal that anything went wrong is a customer acting on policy that does not apply to them — and that signal arrives days later, through a refund dispute, not through your monitoring.

This is the failure mode that makes semantic caching different from every cache you have built before. An exact-match cache can be stale, but it never answers a question other than the one it stored: the key either matches or it doesn't. A semantic cache trades that guarantee away on purpose. It is designed to return answers for keys it has never seen, and the price of that latency win is a correctness risk that most teams never put a number on.

A similarity match is not an equality match

The whole appeal of semantic caching is that "How do I reset my password?" and "I forgot my password, what now?" should hit the same cached answer despite sharing almost no tokens. A traditional cache keyed on the literal string misses that entirely. So you embed the query, search for the nearest stored vector, and if the cosine similarity clears a threshold, you serve the cached response.
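The lookup described above fits in a few lines. This is a minimal sketch, not a production design: `SemanticCache` and its methods are hypothetical names, the embeddings are passed in as plain lists of floats (the embedding model is out of scope here), and the linear scan stands in for a real vector index.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy in-memory cache; a real one would use a vector index, not a scan."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        # Each entry is (embedding of the stored query, cached response).
        self.entries: list[tuple[list[float], str]] = []

    def put(self, embedding: list[float], response: str) -> None:
        self.entries.append((embedding, response))

    def get(self, embedding: list[float]) -> Optional[str]:
        # Find the nearest stored vector, then apply the threshold to it.
        best_score, best_response = -1.0, None
        for stored, response in self.entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        # The only gate between "similar" and "served" is this comparison.
        return best_response if best_score >= self.threshold else None
```

Note what `get` does not have: any access to the text of either query. Everything that follows in this post comes from that one comparison on the last line.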

The quiet assumption buried in that design is that similarity above the threshold implies the same correct answer. It does not. Similarity is a continuous score over the whole sentence; correctness often hinges on a single discriminating token. "EU" versus "US," "before tax" versus "after tax," "2024" versus "2025," "include" versus "exclude" — these are the words that determine which answer is right, and they are exactly the words an embedding model treats as minor perturbations because the surrounding 95% of the sentence is identical.

You have inverted the cache's contract without noticing. A normal cache fails closed: a miss costs you a recomputation. A semantic cache can fail open: a false hit costs you a wrong answer delivered with the same confidence as a right one. The output is byte-for-byte indistinguishable from a correct response. There is no exception, no low-confidence flag, no log line that says "I guessed." The system did exactly what you told it to do.

The false-hit rate is a number, not an assumption

Ask a team running a semantic cache what their false-hit rate is and you will usually get a hit rate instead — 40%, 60%, "we're saving a ton." Those are different metrics. Hit rate tells you how often the cache answered. False-hit rate tells you how often it answered wrong. A cache can have a beautiful hit rate and a quietly corrosive false-hit rate, and the dashboard that shows the first will never surface the second.

The false-hit rate has to be measured, because it cannot be reasoned about from the threshold alone. Published analyses of production caches put false positives in the neighborhood of 1% of served hits — and crucially, those errors cluster right at the threshold boundary, where similarity squeaks just over the cutoff but intent has already diverged. Worse, the similarity distributions for "correct to reuse" and "incorrect to reuse" candidate pairs overlap heavily. There is no clean cutoff that separates them. Any single threshold either admits false hits or collapses toward exact-match behavior and gives up the savings you adopted the cache for.
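To see the overlap problem concretely, you can sweep thresholds over a set of candidate pairs that a human has labeled as safe or unsafe to reuse. The sketch below assumes exactly that: `PAIRS` is invented toy data for illustration, not a measurement, and `sweep` is a hypothetical helper.

```python
# (similarity score, safe_to_reuse) for candidate pairs.
# Invented illustrative data: note the unsafe pairs sit among the safe ones.
PAIRS = [
    (0.97, True), (0.95, False), (0.93, True), (0.92, True),
    (0.91, False), (0.89, True), (0.86, True), (0.84, False),
]


def sweep(pairs, thresholds):
    """Return (threshold, hit_rate, false_hit_rate) for each threshold."""
    rows = []
    for t in thresholds:
        # Labels of the pairs that this threshold would serve from cache.
        served = [safe for sim, safe in pairs if sim >= t]
        hit_rate = len(served) / len(pairs)
        false_hit_rate = served.count(False) / len(served) if served else 0.0
        rows.append((t, hit_rate, false_hit_rate))
    return rows


for t, hit_rate, false_hit_rate in sweep(PAIRS, [0.85, 0.90, 0.96]):
    print(f"threshold={t:.2f} hit_rate={hit_rate:.3f} false_hit_rate={false_hit_rate:.3f}")
```

In this toy data, no threshold below 0.96 excludes both of the high-scoring unsafe pairs, and at 0.96 the cache serves only one candidate in eight. Real distributions are messier, but the shape of the trade is the same.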

So you build the measurement instead of assuming it away. Take a sample of cache hits — real ones, from production — and for each one, run the live model anyway and compare the cached answer to the fresh answer. Where they disagree on substance, that hit was false. This costs you the LLM call you were trying to save, but only on a sample, and it converts "the cache is probably fine" into a number you can put an error budget around. If your false-hit rate is 0.3% and a wrong answer here is a minor annoyance, fine. If it is 2% and the domain is medical dosing or refund policy, the cache is a liability and you need to know that before a customer does.
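One way to sketch that audit loop, under stated assumptions: `call_model` and `answers_agree` are placeholders you would supply (the second is the hard part, since "disagree on substance" usually needs a human or a grader model), and `CacheHit` is a hypothetical record of a served hit.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class CacheHit:
    query: str          # the incoming query that was served from cache
    cached_answer: str  # what the cache returned for it


def measure_false_hit_rate(
    hits: list[CacheHit],
    call_model: Callable[[str], str],
    answers_agree: Callable[[str, str], bool],
    sample_rate: float,
    rng: random.Random,
) -> tuple[int, int]:
    """Audit a random sample of served hits; return (false_hits, audited)."""
    false_hits = audited = 0
    for hit in hits:
        if rng.random() >= sample_rate:
            continue  # not sampled: the cache saving is kept for this hit
        audited += 1
        fresh = call_model(hit.query)  # pay for the call the cache skipped
        if not answers_agree(hit.cached_answer, fresh):
            false_hits += 1
    return false_hits, audited


def exceeds_budget(false_hits: int, audited: int, budget: float) -> bool:
    """True when the observed false-hit rate is above the error budget."""
    return audited > 0 and false_hits / audited > budget
```

With a small sample the observed rate is noisy, so treat a single audit as a rough estimate and keep the sampling running rather than checking once.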

Negation and the tokens embeddings shrug off

The refund-window example is not a corner case; it is the center of the distribution. Embedding models are trained to capture topical and structural similarity, and they are demonstrably weak at the linguistic features that flip an answer.

Negation is the sharpest example. Research on text embeddings keeps finding that state-of-the-art models lack negation awareness — they score a sentence and its negation as roughly similar because the two share every content word and differ only by a "not." "Is this charge refundable?" and "Is this charge non-refundable?" are opposites that a semantic cache will happily treat as the same key. The same blind spot covers antonym swaps, quantifier changes ("all" versus "some"), and scope words ("only," "except," "unless"). These are not exotic phrasings. They are how people ask precise questions.

This is why the embedding model is a correctness dependency, not a performance detail. Swapping a general-purpose embedder for a domain-tuned one can move your false-hit rate more than any amount of threshold fiddling, because the failure is upstream of the threshold — two genuinely different questions were mapped to nearby points before the comparison ever ran. A reranker that reads the full text of the candidate pair, rather than comparing two compressed vectors, catches some of what the embedder misses and is worth the extra hop on a cache that gates anything consequential. The general lesson: when a single word decides the answer, do not trust a model that was built to summarize the sentence.
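A reranker is a trained model, and this is not one. As a much cruder stand-in for the same idea (read the text of both queries before trusting the vectors), here is a lexical guard that vetoes a hit when the two queries differ in a token that can flip the answer. The word list, `safe_to_reuse`, and the `domain_terms` parameter are all hypothetical; a real list would be built from your own audited false hits.

```python
import re

# Hand-picked for illustration only: negation, quantifier, and scope words.
DISCRIMINATING = {
    "not", "no", "non", "never", "all", "some", "only", "except",
    "unless", "before", "after", "include", "exclude",
}


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens; 'non-refundable' yields 'non' and 'refundable'."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def safe_to_reuse(new_query: str, cached_query: str,
                  domain_terms: frozenset[str] = frozenset()) -> bool:
    """Return False if the queries differ in a token likely to change the answer."""
    # Tokens present in exactly one of the two queries.
    differing = tokens(new_query) ^ tokens(cached_query)
    for tok in differing:
        if tok in DISCRIMINATING or tok in domain_terms or tok.isdigit():
            return False
    return True


# A negation flips the answer, so the guard vetoes the hit.
print(safe_to_reuse("Is this charge refundable?", "Is this charge non-refundable?"))
# A paraphrase differs only in ordinary words, so the hit is allowed.
print(safe_to_reuse("How do I reset my password?", "I forgot my password, what now?"))
```

A guard like this only catches the words you thought to list. It belongs after the similarity check as a veto, not in place of measuring the false-hit rate.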

Cache the retrieval, not the generation

Some of the risk evaporates if you cache the right layer. A retrieval-augmented agent runs a chain — embed the query, search a vector store, then call the LLM to synthesize an answer from the retrieved chunks. The expensive, slow steps are the retrieval and the generation. The tempting move is to cache the final generated answer, because that is the biggest single saving. It is also the riskiest, because the generated answer is the most specific artifact in the chain and the one a false hit corrupts most completely.

Caching the retrieval instead is a much safer trade. When a semantically close query arrives, you reuse the retrieved document set rather than the finished answer, and you still run the generation step against the new query. The model gets one more chance to notice that this question wanted the US policy, not the EU one, because it is reading the actual query and the actual chunks. You give up some of the latency win — generation still runs — but published work on retrieval-stage caching shows large latency drops with negligible accuracy loss, precisely because the irreversible step stays live. The rule of thumb: cache the work that is expensive and reusable, not the work that is expensive and final.
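In code, the difference is simply which value the cache is allowed to hand back. A minimal sketch, assuming the four callables are supplied by your own stack (all of the names here are hypothetical): `lookup_chunks` is the semantic lookup over cached retrievals, and it returns `None` on a miss.

```python
from typing import Callable, Optional


def answer(
    query: str,
    lookup_chunks: Callable[[str], Optional[list[str]]],
    store_chunks: Callable[[str, list[str]], None],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Reuse cached retrieval results, but always generate for the current query."""
    chunks = lookup_chunks(query)   # semantic lookup over cached chunk sets
    if chunks is None:
        chunks = retrieve(query)    # miss: run the vector search
        store_chunks(query, chunks)
    # Generation is never served from cache: the model sees this exact query
    # alongside the chunks, so it can still notice "US" where the cached
    # retrieval was triggered by "EU".
    return generate(query, chunks)
```

A false hit in `lookup_chunks` now hands the model somewhat off-target context rather than handing the user a finished wrong answer.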

This also reframes where the threshold belongs. A loose threshold gating retrieval reuse is a minor inefficiency — worst case, you synthesize an answer from slightly off-topic chunks and the model can still ignore them. A loose threshold gating answer reuse is a wrong answer. Same number, very different blast radius, depending on which layer it controls.

Threshold tuning is a safety decision, and provenance is the debugger

Treat the similarity threshold as a safety parameter, not a knob you turn until the hit rate looks nice. The right value is not global. A cache serving FAQ-style questions where a wrong answer damages trust should be tuned for precision — accept fewer hits, demand near-certainty. A cache serving low-stakes lookups where a miss only costs money can lean toward recall. Teams that split their cache this way report needing thresholds as far apart as 0.88 and 0.94 for different query classes in the same product. One number cannot serve both.
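A per-class threshold table is a small amount of code. In this sketch the class names are hypothetical, and pairing 0.94 with the high-stakes class and 0.88 with the low-stakes one is an assumption that follows from the precision-versus-recall argument above; your own values should come from measurement.

```python
# Hypothetical query classes; the two values echo the spread quoted above.
THRESHOLDS = {
    "high_stakes": 0.94,  # tuned for precision: fewer hits, near-certainty
    "low_stakes": 0.88,   # tuned for recall: a miss only costs money
}
STRICTEST = max(THRESHOLDS.values())


def threshold_for(query_class: str) -> float:
    # Unclassified traffic gets the strictest threshold, so a routing
    # mistake degrades into extra misses rather than extra false hits.
    return THRESHOLDS.get(query_class, STRICTEST)


def is_hit(similarity: float, query_class: str) -> bool:
    return similarity >= threshold_for(query_class)
```

The same 0.91 score is a hit for a low-stakes lookup and a miss for a high-stakes one, which is the point of splitting the cache.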

And whatever the threshold, a false hit will eventually ship. The question is whether you can debug it when it does. That requires provenance on every cached entry: which original query produced this answer, what its embedding was, what similarity score let the current query in, and which source documents the answer was built from. When a wrong answer surfaces, you want to open the cache entry and see "this was served because query X scored 0.91 against stored query Y" — not stare at an opaque blob and guess.
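The fields listed above map directly onto a record type. A sketch with hypothetical names (`CacheEntry`, `ServedHit`, `explain`): the entry holds what was known when the answer was stored, and the served-hit record adds what was known when it was reused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheEntry:
    original_query: str               # the query that produced this answer
    embedding: tuple[float, ...]      # its embedding at write time
    answer: str
    source_doc_ids: tuple[str, ...]   # documents the answer was built from


@dataclass(frozen=True)
class ServedHit:
    entry: CacheEntry
    incoming_query: str   # the query that was served this entry
    similarity: float     # the score that let it in
    threshold: float      # the threshold in force at serve time


def explain(hit: ServedHit) -> str:
    """One-line account of why a cached answer was served."""
    return (
        f"served because query {hit.incoming_query!r} scored "
        f"{hit.similarity:.2f} (threshold {hit.threshold:.2f}) against "
        f"stored query {hit.entry.original_query!r}; "
        f"sources: {', '.join(hit.entry.source_doc_ids)}"
    )
```

Logging a `ServedHit` per cache hit is also what makes the sampled audit possible, since it preserves the incoming query alongside the answer it received.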

Provenance also gives you invalidation. If you record the source document IDs behind each cached answer, then when an underlying document changes you can expire exactly the entries that depended on it, instead of blowing away the whole cache or — worse — letting answers built on last quarter's policy linger. A cache entry that cannot tell you where its answer came from is not a cache entry. It is an unattributed claim with a TTL.
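Targeted invalidation needs one extra structure: a reverse index from document ID to the entries built on it. A minimal in-memory sketch, with `ProvenanceIndex` and its methods as hypothetical names:

```python
class ProvenanceIndex:
    """Cached answers plus a reverse index from source document to entries."""

    def __init__(self) -> None:
        self.answers: dict[str, str] = {}       # entry_id -> cached answer
        self.by_doc: dict[str, set[str]] = {}   # doc_id -> entry_ids using it

    def put(self, entry_id: str, answer: str, source_doc_ids: list[str]) -> None:
        self.answers[entry_id] = answer
        for doc_id in source_doc_ids:
            self.by_doc.setdefault(doc_id, set()).add(entry_id)

    def invalidate_document(self, doc_id: str) -> set[str]:
        """Expire every live entry that depended on doc_id; return their IDs."""
        candidates = self.by_doc.pop(doc_id, set())
        # An entry may already have been expired via another of its documents.
        expired = {eid for eid in candidates if eid in self.answers}
        for eid in expired:
            del self.answers[eid]
        return expired
```

When a policy document changes, `invalidate_document` removes only the answers that cited it, and entries built on other documents stay warm.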

The discipline the latency win demands

Semantic caching is genuinely good infrastructure. Cutting LLM spend by half and latency by an order of magnitude is not a marginal improvement, and the technique earns its place in a serious AI system. But it is the first cache most teams deploy that can be confidently, silently wrong, and it should be operated with that fact in front of you.

That means four habits. Measure the false-hit rate as a real number against a real error budget, instead of inferring safety from the hit rate. Pick the embedding model deliberately, because it decides which different questions get mapped to the same key. Cache the retrieval rather than the generation wherever the chain allows it, so the irreversible step stays live. And attach provenance to every entry, so the false hit you will eventually ship is one you can find and explain.

A cache that is sometimes wrong is not automatically a bad cache. A cache that is sometimes wrong at a rate nobody has measured is shipping a risk nobody priced.