Skip to main content

When RAG Should Have Been a JOIN

· 9 min read
Tian Pan
Software Engineer

A support team asked their new AI assistant a simple question: "Which enterprise customers opened a ticket last week?" The assistant came back with a confident, fluent answer naming six accounts. Five were right. One had churned two months ago, and one enterprise account that had filed three tickets was missing entirely. Nobody caught it until a renewal call went sideways.

The bug was not in the model. It was in the architecture. Somewhere in the design review, a question with hard predicates — a plan tier, a date range, a ticket count — got routed to a vector index. The team had a retrieval system, so they retrieved. They embedded the ticket records, embedded the question, and asked cosine similarity to do the job of a WHERE clause. It cannot. It never could.

This is one of the most common and least discussed failure modes in production AI systems: reaching for semantic search when the real query was relational. The data lived in tidy rows with foreign keys. The answer was one JOIN away. Instead it went through an embedding model, and the precision evaporated.

Embeddings Smear the Things You Needed Sharp

A vector embedding is a lossy compression of meaning into a few hundred dimensions. That compression is exactly what makes semantic search powerful for unstructured text — "cancel my subscription" and "how do I stop being billed" land near each other even though they share almost no words. But the same compression is destructive when the question depends on a precise value.

Embedding models routinely treat "the product costs between $50 and $100" and "the product costs exactly $101" as nearly identical — similarity scores around 0.98 — even though one satisfies a filter and the other does not. The model captures the topic (price, this product) and discards the predicate (the boundary that actually decides the answer). A WHERE price <= 100 clause never makes that mistake. It is not smarter; it is just operating on the value instead of a blurred picture of the value.

Negation is worse. Dense retrievers are well documented to ignore it: research on contradiction-aware retrieval describes a "semantic collapse" where "patient has diabetes" and "patient does not have diabetes" sit almost on top of each other in vector space, because the surrounding tokens dominate the embedding and the single negating word barely moves it. Ask a vector index "which customers are not on the enterprise plan" and you are rolling dice.

Counting and aggregation fall outside the paradigm entirely. A nearest-neighbor search returns the k most similar chunks. It has no notion of "all of them," no COUNT, no SUM, no GROUP BY. Ask a vector store for "average resolution time in Q3" and the best it can do is hand you a document that happens to mention those words. The honest answer to "how many" is not something similarity search can produce, because the operation it performs is "find things like this," not "enumerate things that match."

The Tell: Hard Predicates, Aggregations, Multi-Hop Joins

You can usually diagnose a misrouted query before you write a line of code. Read the question and look for three signatures.

Hard predicates. Any clause that is true or false with no middle ground — a status, a tier, a boolean flag, an exact ID, a date range. "Active subscriptions," "tickets opened after May 1," "accounts in the EU region." These are filters. A filter has a correct answer and a wrong answer, and similarity search gives you neither — it gives you a ranking. Rankings are the right tool for "most relevant," never for "matches the condition."

Aggregations. "How many," "what is the average," "total revenue by month," "the top five by usage." The moment a question asks you to collapse many rows into a number or a sorted list, you are describing a GROUP BY, an ORDER BY, a window function. Retrieval returns rows; it does not compute over them.

Multi-hop relationships. "Which customers on the enterprise plan opened a ticket last week" is three hops: customer → plan, customer → ticket, ticket → date. A vector search over a denormalized blob might stumble onto documents mentioning enterprise customers and recent tickets, but the relationship connecting them is implicit in prose the model has to re-derive by reading. In a relational schema the relationship is explicit — it is the foreign key — and a JOIN traverses it exactly. In a knowledge graph it is an edge, and a graph query walks it directly instead of hoping semantic similarity reconstructs it.

If a question carries any of these three signatures, the data store that answers it correctly is a database or a graph, not a vector index. The presence of an embedding model in your stack is not a reason to use it here.

Why Smart Teams Still Get This Wrong

The mistake is rarely ignorance. It comes from three forces that quietly push every question toward the vector index.

The first is that RAG became the default mental model. "Add AI to the product" got translated, almost reflexively, into "stand up a retrieval-augmented pipeline." Once the vector database is the centerpiece of the architecture, every question looks like a retrieval problem, including the ones that are obviously relational. The tool you built becomes the lens you see through.

The second is that embedding everything feels uniform and clean. One ingestion pipeline, one index, one query path. Routing some questions to SQL and others to a vector store means building a classifier, maintaining two retrieval paths, and owning the seam between them. The uniform design is simpler to draw on a whiteboard. It is also wrong, and the wrongness does not show up in the diagram.

The third is that the failures are silent and fluent. A misrouted relational query does not throw an error. It returns a confident, well-written answer that is subtly incomplete — five right names and one wrong one. There is no stack trace, no 500, no red dashboard. The system looks like it is working, and it keeps looking that way until someone with domain knowledge notices the missing account. By then the architecture is load-bearing and the bad pattern has been copied into four more features.

Benchmark numbers reinforce the complacency. Text-to-SQL systems report 80–90% execution accuracy on academic datasets, and that sounds production-ready. But enterprise schemas have hundreds of tables, ambiguous column names, and business logic that lives in tribal knowledge, and accuracy on real workloads drops well below the benchmark. The lesson is not "SQL is also unreliable" — it is that every retrieval path needs scrutiny, and the relational one at least fails in ways you can test deterministically.

Let the Model Write the Query, Not Guess the Answer

The fix is not to abandon RAG. It is to put each question on the path that can answer it correctly, and to recognize that for structured data the model's job is to write a query, not to be handed a pile of approximately-relevant chunks and reason over them.

Start with routing. Before retrieval, classify the question. Does it contain hard predicates, aggregations, or multi-hop joins? Send it to the structured path. Is it genuinely about the content of unstructured text — "what did customers say about the onboarding flow," "summarize the complaints about latency"? Send it to the vector index. This router can be a small model or even rules over the parsed question; it does not need to be clever, it needs to be consistent. The seam between the two paths is real engineering, but it is engineering you can see and test, unlike the silent corruption it replaces.

On the structured path, the model translates intent into a query and the database executes it. The LLM turns "enterprise customers with a ticket last week" into SQL or a graph traversal; the database returns exact rows. The model never invents the rows — it only describes what it wants, and a deterministic engine produces them. This inverts the usual RAG flow: instead of retrieve-then-generate, it is generate-a-query-then-execute. Factual accuracy stops being probabilistic, because the filtering, counting, and joining happen in a system that does those operations exactly.

Reserve the vector index for what it is uniquely good at: unstructured text with no schema, where "find things like this" is the actual question. A free-text knowledge base, a corpus of call transcripts, documentation. That is a large and valuable surface — just not the surface that has foreign keys.

Many real systems need both, composed. The metadata — plan tier, dates, region, status — lives in columns and gets filtered with SQL. The unstructured remainder — the actual ticket text, the transcript body — gets embedded and searched semantically. The query first narrows to the right rows with exact predicates, then runs similarity search only inside that filtered set. You get precision where the question demands it and fuzzy matching where prose demands it, instead of forcing one mechanism to do both jobs badly.

The Question to Ask in Every Design Review

When you see "AI-powered search" or "intelligent assistant" on a spec, ask one thing before you ask anything else: what does the data look like, and what does the question look like?

If the data has a schema — rows, columns, foreign keys, types — and the question has predicates, counts, or joins, then the answer is a query. Putting an embedding model in front of it does not add intelligence; it adds a lossy layer that turns exact answers into plausible ones. The vector database is a genuinely powerful tool for the problem it was built for. It is also the wrong default, and "we already have a RAG pipeline" is not a reason to route a JOIN through cosine similarity.

The best AI systems are not the ones that embed everything. They are the ones that know which questions deserve an exact answer and route them to a system that can give one.

References:Let's stay in touch and Follow me for more thoughts and updates