
Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases

Tian Pan · Software Engineer · 11 min read

The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.

The benchmark number was real. The problem is that benchmarks don't use your schema.

Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.

This post is about the engineering required to deploy SQL agents safely—not as a wrapper around a model call, but as a system with schema grounding, runtime validation, and layered permissions. Get any of these wrong and you're either getting confidently wrong answers or, worse, running unintended queries against a live database.

SQL Access Is Not Document RAG

When you build a document RAG system, approximate answers are acceptable. A retrieval miss returns a slightly less relevant passage; the model fills the gap with plausible reasoning. The user gets a response that's close enough, and the worst failure mode is a vague answer.

SQL doesn't work like that. A query either maps to the correct table and column names or it doesn't execute—or worse, it executes successfully against the wrong data. An agent that confuses user_events with user_sessions, or uses order_date instead of created_at, doesn't return a slightly imprecise answer. It returns a number with full confidence that is simply wrong. The analyst trusts it. A decision gets made.

The root difference is determinism. Document RAG tolerates semantic approximation at retrieval time. SQL generation requires exact schema mapping—table names, column names, join conditions, filter semantics—and there's no partial credit. An LLM that "almost" gets a SQL query right produces queries that look plausible, parse correctly, and silently return the wrong result.
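
To make that concrete, here's a self-contained sketch (SQLite, with hypothetical `user_events` and `user_sessions` tables) of two queries that both parse and both execute cleanly, yet answer different questions:

```python
import sqlite3

# Hypothetical schema for illustration: two similarly named tables
# that an agent could plausibly confuse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_events   (user_id INT, created_at TEXT);
    CREATE TABLE user_sessions (user_id INT, created_at TEXT);

    -- One user generated three events within a single session.
    INSERT INTO user_events   VALUES (1, '2024-06-01'), (1, '2024-06-01'), (1, '2024-06-01');
    INSERT INTO user_sessions VALUES (1, '2024-06-01');
""")

# Question: "How many user interactions happened on 2024-06-01?"
# Both candidate queries parse, both execute, both return a number.
right = conn.execute(
    "SELECT COUNT(*) FROM user_events WHERE created_at = '2024-06-01'"
).fetchone()[0]
wrong = conn.execute(
    "SELECT COUNT(*) FROM user_sessions WHERE created_at = '2024-06-01'"
).fetchone()[0]

print(right, wrong)  # 3 1 -- no error, no warning, just a different answer
```

Nothing in the execution path distinguishes the right answer from the wrong one; only someone who knows the schema can.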

Production teams that have shipped text-to-SQL systems consistently report this gap. Pinterest found that first-shot acceptance rates in production hovered around 20%, far below what the benchmark numbers suggest. The queries weren't failing to execute; they were executing and returning results that users then flagged as wrong.

How Agents Hallucinate Schema

The common assumption is that injecting the database schema into the model's context solves the grounding problem. It doesn't.

When a schema contains 700 columns across 200 tables, the model has to identify which 5–10 columns are relevant to a given query, construct the correct joins, and apply the right filter conditions—all before writing a single line of SQL. Schema linking (mapping natural language phrases to specific table/column names) accounts for roughly 27% of SQL errors on enterprise-scale benchmarks even when the full schema is in context.

The failure modes are predictable once you've seen them:

Naming convention drift. Column names like acct_revenue_usd_net_30d are internally consistent but opaque to anyone outside the team that created them. The model maps "net revenue last month" to plausible-sounding alternatives that may or may not exist. When they don't exist, you get an error. When they do but mean something slightly different, you get a wrong number.

Compositional reasoning failures. Queries involving multiple joins, subqueries, or aggregations require the model to hold the semantic intent of the original question against an increasingly complex SQL structure. Models that handle single-table queries at high accuracy drop significantly when forced to compose multiple clauses correctly.

Cross-table hallucination. The model knows a concept exists somewhere in the schema but isn't certain which table owns it, so it picks the closest-sounding name. Sometimes that guess is right. Just as often it lands on a table from a different but related system that happens to share naming patterns, and the query runs anyway.

The solution isn't a bigger model. The performance collapse on enterprise schemas comes from schema complexity and missing business context, not from the model's underlying reasoning capability. The teams that reach 90%+ accuracy in production do it through schema management, not model selection.

Schema Subsetting: Give Agents Less

The counterintuitive fix for schema hallucination is to show the model less schema, not more.

The approach is called schema subsetting, and it has two components. First, build a retrieval layer that identifies which tables are relevant to a given query before any SQL is generated. This uses embeddings of table summaries, column descriptions, and example queries to retrieve a candidate set of tables—similar to how document RAG retrieves relevant passages, but operating over schema metadata rather than document chunks.

Pinterest's implementation is illustrative. They found that embedding table documentation (plain-language descriptions of what a table contains) was more useful than embedding raw DDL: documentation provides the semantic signal that links natural-language questions to tables, while DDL provides the structural detail needed for SQL construction. They weighted documentation more heavily in retrieval and saw table-search hit rates improve from 40% to 90% after investing in metadata quality.
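
Here's a minimal sketch of that retrieval layer, assuming each table carries both documentation and DDL. The `embed` function is a bag-of-words stand-in for a real embedding model, and the 0.7/0.3 weighting is illustrative, not Pinterest's actual configuration:

```python
import math
from dataclasses import dataclass

@dataclass
class TableMeta:
    name: str
    doc: str   # plain-language description -- the high-signal field
    ddl: str   # raw CREATE TABLE -- structural detail

def embed(text: str) -> dict[str, float]:
    # Stand-in for a real embedding model (e.g. a sentence-transformer);
    # a bag-of-words vector is enough to show the shape of the pipeline.
    vec: dict[str, float] = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

DOC_WEIGHT, DDL_WEIGHT = 0.7, 0.3  # assumption: documentation dominates

def retrieve_tables(question: str, tables: list[TableMeta], k: int = 5) -> list[TableMeta]:
    q = embed(question)
    scored = [
        (DOC_WEIGHT * cosine(q, embed(t.doc)) + DDL_WEIGHT * cosine(q, embed(t.ddl)), t)
        for t in tables
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]
```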

Second, once the relevant tables are identified, strip the schema down to only those tables and their relevant columns before injecting it into the model's context. A query about customer churn doesn't need the payments schema visible. A query about order fulfillment times doesn't need user authentication tables. This isn't just a token optimization—it reduces the noise the model has to reason through and makes hallucination less likely because there are fewer wrong answers in context.
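
Continuing the sketch, the second component might look like the following. The token-overlap column ranking and the 12-column cutoff are illustrative placeholders; a production version would reuse the retrieval layer's embedding scores:

```python
# Given the tables chosen by retrieval, render only those tables --
# and only their most relevant columns -- into the model's context.

def prune_columns(question: str, columns: dict[str, str], keep: int = 12) -> dict[str, str]:
    """columns maps column name -> plain-language description."""
    q_tokens = set(question.lower().split())

    def overlap(item: tuple[str, str]) -> int:
        name, desc = item
        tokens = set(name.replace("_", " ").lower().split()) | set(desc.lower().split())
        return len(q_tokens & tokens)

    ranked = sorted(columns.items(), key=overlap, reverse=True)
    return dict(ranked[:keep])

def build_schema_context(question: str, tables: list[tuple[str, str, dict[str, str]]]) -> str:
    """Render a pruned, annotated schema: only retrieved tables, only
    the columns that plausibly matter to this question."""
    lines = []
    for name, doc, columns in tables:
        lines.append(f"-- {name}: {doc}")
        lines.append(f"CREATE TABLE {name} (")
        lines.append(",\n".join(
            f"  {col}  -- {desc}"
            for col, desc in prune_columns(question, columns).items()
        ))
        lines.append(");")
    return "\n".join(lines)
```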

The subsetting approach requires investment in schema documentation. If your tables have no descriptions and your columns have names like flg_prm_a, retrieval is guessing. The metadata layer is load-bearing infrastructure, not a nice-to-have.

Runtime Validation Before Execution

Even with schema subsetting, agents generate queries that shouldn't execute. A runtime validation layer catches these before they reach the database.

The validation chain has a specific order:

1. Parse. Run the generated SQL through a real parser for the target dialect and reject anything that fails. This is the cheapest check and it catches the most failures.

2. Schema check. Verify that every table and column the query references exists in the subsetted schema. This catches hallucinated names before the database ever sees them.

3. Statement policy. Allow read-only SELECT statements only. DDL and DML never reach a live database from an agent, no matter what the model produced.

4. Cost caps. Enforce row limits and query timeouts so a runaway query degrades into a truncated result instead of a thirty-minute reporting outage.
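
Here's a minimal sketch of that chain using sqlglot, an open-source SQL parser. The catalog, the SELECT-only policy, and the row cap are illustrative choices, and database-level read-only permissions still belong behind all of this:

```python
import sqlglot
from sqlglot import exp

# Hypothetical catalog: the subsetted schema the agent was shown,
# mapping table name -> set of valid column names.
CATALOG = {
    "user_events": {"user_id", "created_at", "event_type"},
}

MAX_LIMIT = 10_000  # illustrative cost cap

def validate(sql: str, dialect: str = "postgres") -> exp.Expression:
    # 1. Parse: reject anything that isn't valid SQL in the target dialect.
    try:
        tree = sqlglot.parse_one(sql, read=dialect)
    except sqlglot.errors.ParseError as e:
        raise ValueError(f"unparseable SQL: {e}")

    # 2. Schema: every referenced table and column must exist in the
    #    subsetted catalog, which catches hallucinated names pre-execution.
    for table in tree.find_all(exp.Table):
        if table.name not in CATALOG:
            raise ValueError(f"unknown table: {table.name}")
    valid_columns = set().union(*CATALOG.values())
    for column in tree.find_all(exp.Column):
        if column.name not in valid_columns:
            raise ValueError(f"unknown column: {column.name}")

    # 3. Policy: allow read-only SELECTs only -- no DDL, no DML.
    if not isinstance(tree, exp.Select):
        raise ValueError("only SELECT statements are permitted")

    # 4. Cost: force a row cap so a bad query degrades gracefully
    #    instead of scanning a table into the ground.
    if tree.args.get("limit") is None:
        tree = tree.limit(MAX_LIMIT)
    return tree

safe = validate("SELECT event_type, COUNT(*) FROM user_events GROUP BY event_type")
print(safe.sql(dialect="postgres"))  # now carries LIMIT 10000
```

One design note on this sketch: the validator checks names against the same subsetted catalog the model was prompted with, so the prompt and the validation layer can't silently drift apart.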