Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases
The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.
The benchmark number was real. The problem is that benchmarks don't use your schema.
Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.
This post is about the engineering required to deploy SQL agents safely—not as a wrapper around a model call, but as a system with schema grounding, runtime validation, and layered permissions. Get any of these wrong and you're either getting confidently wrong answers or, worse, running unintended queries against a live database.
SQL Access Is Not Document RAG
When you build a document RAG system, approximate answers are acceptable. A retrieval miss returns a slightly less relevant passage; the model fills the gap with plausible reasoning. The user gets a response that's close enough, and the worst failure mode is a vague answer.
SQL doesn't work like that. A query either maps to the correct table and column names or it doesn't execute—or worse, it executes successfully against the wrong data. An agent that confuses user_events with user_sessions, or uses order_date instead of created_at, doesn't return a slightly imprecise answer. It returns a number with full confidence that is simply wrong. The analyst trusts it. A decision gets made.
The root difference is determinism. Document RAG tolerates semantic approximation at retrieval time. SQL generation requires exact schema mapping—table names, column names, join conditions, filter semantics—and there's no partial credit. An LLM that "almost" gets a SQL query right produces queries that look plausible, parse correctly, and silently return the wrong result.
Production teams that have shipped text-to-SQL systems consistently report this gap. Pinterest found that first-shot acceptance rates in production were around 20%, compared to benchmark numbers that suggest much higher performance. The queries weren't failing to execute—they were executing and returning results that users then flagged as wrong.
How Agents Hallucinate Schema
The common assumption is that injecting the database schema into the model's context solves the grounding problem. It doesn't.
When a schema contains 700 columns across 200 tables, the model has to identify which 5–10 columns are relevant to a given query, construct the correct joins, and apply the right filter conditions—all before writing a single line of SQL. Schema linking (mapping natural language phrases to specific table/column names) accounts for roughly 27% of SQL errors on enterprise-scale benchmarks even when the full schema is in context.
The failure modes are predictable once you've seen them:
Naming convention drift. Column names like acct_revenue_usd_net_30d are internally consistent but opaque to anyone outside the team that created them. The model maps "net revenue last month" to plausible-sounding alternatives that may or may not exist. When they don't exist, you get an error. When they do but mean something slightly different, you get a wrong number.
Compositional reasoning failures. Queries involving multiple joins, subqueries, or aggregations require the model to hold the semantic intent of the original question against an increasingly complex SQL structure. Models that handle single-table queries at high accuracy drop significantly when forced to compose multiple clauses correctly.
Cross-table hallucination. The model knows a concept exists somewhere in the schema but isn't certain which table owns it. It picks the closest-sounding table. Sometimes that guess lands on the right one. Just as often it lands on a table from a different but related system that happens to share naming patterns.
The solution isn't a bigger model. The performance collapse on enterprise schemas comes from schema complexity and missing business context, not from the model's underlying reasoning capability. The teams that reach 90%+ accuracy in production do it through schema management, not model selection.
Schema Subsetting: Give Agents Less
The counterintuitive fix for schema hallucination is to show the model less schema, not more.
The approach is called schema subsetting, and it has two components. First, build a retrieval layer that identifies which tables are relevant to a given query before any SQL is generated. This uses embeddings of table summaries, column descriptions, and example queries to retrieve a candidate set of tables—similar to how document RAG retrieves relevant passages, but operating over schema metadata rather than document chunks.
Pinterest's implementation is illustrative. They found that embedding table documentation (plain-language descriptions of what a table contains) was more useful than embedding raw DDL. Documentation provides the semantic signal that links natural language questions to tables; DDL provides structural detail for SQL construction. They weight documentation more heavily in retrieval and saw search hit rates improve from 40% to 90% after properly investing in metadata quality.
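A minimal sketch of that retrieval step, with toy hand-written vectors standing in for real embeddings of each table's documentation (the table names, vectors, and dimensions are illustrative; in production the vectors would come from an embedding model):

```python
import math

# Toy stand-ins for embeddings of each table's plain-language documentation.
TABLE_DOC_VECTORS = {
    "user_events":   [0.9, 0.1, 0.0],   # "One row per user interaction event..."
    "user_sessions": [0.7, 0.3, 0.1],   # "Login sessions with start/end times..."
    "orders":        [0.1, 0.9, 0.2],   # "One row per customer order..."
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def candidate_tables(query_vector, k=2):
    """Return the k tables whose documentation is most similar to the query."""
    scored = sorted(
        TABLE_DOC_VECTORS.items(),
        key=lambda item: cosine(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# A question about user activity should surface the event/session tables,
# not the orders table.
print(candidate_tables([0.8, 0.2, 0.0]))  # → ['user_events', 'user_sessions']
```

The retrieval math is the easy part; the hit-rate gains Pinterest describes came from the quality of the documentation being embedded, not from the similarity function.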
Second, once the relevant tables are identified, strip the schema down to only those tables and their relevant columns before injecting it into the model's context. A query about customer churn doesn't need the payments schema visible. A query about order fulfillment times doesn't need user authentication tables. This isn't just a token optimization—it reduces the noise the model has to reason through and makes hallucination less likely because there are fewer wrong answers in context.
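A sketch of the pruning step, rendering only the retrieved tables into the prompt context (the schema metadata here is illustrative; in practice it would come from the warehouse's information_schema plus maintained documentation):

```python
# Illustrative schema metadata with plain-language descriptions.
SCHEMA = {
    "orders": {
        "description": "One row per customer order.",
        "columns": {"order_id": "INT", "customer_id": "INT",
                    "created_at": "TIMESTAMP", "total_usd": "DECIMAL"},
    },
    "payments": {
        "description": "Payment attempts, including failures and refunds.",
        "columns": {"payment_id": "INT", "order_id": "INT", "status": "TEXT"},
    },
}

def schema_context(relevant_tables):
    """Render only the retrieved tables into a compact context block."""
    blocks = []
    for name in relevant_tables:
        meta = SCHEMA[name]
        cols = ", ".join(f"{col} {typ}" for col, typ in meta["columns"].items())
        blocks.append(f"-- {meta['description']}\nTABLE {name} ({cols})")
    return "\n".join(blocks)

# A question that retrieved only `orders` never sees `payments`, so the
# model can't hallucinate a join against it.
print(schema_context(["orders"]))
```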
The subsetting approach requires investment in schema documentation. If your tables have no descriptions and your columns have names like flg_prm_a, retrieval is guessing. The metadata layer is load-bearing infrastructure, not a nice-to-have.
Runtime Validation Before Execution
Even with schema subsetting, agents generate queries that shouldn't execute. A runtime validation layer catches these before they reach the database.
The validation chain has a specific order:
Parse the SQL, don't regex it. Structural validation using a real SQL parser (not pattern matching) catches malformed queries that would fail at execution time. This is also where you validate that the query references only tables and columns from the allowed set.
Enforce an allowlist. Maintain an explicit list of tables and columns the agent is permitted to query. Any query that references something outside the allowlist fails validation. This is separate from database permissions—it's an application-layer check that fails early with a specific error message the model can use to self-correct.
Block forbidden operations. For read-only agents, reject any statement that parses as INSERT, UPDATE, DELETE, DROP, ALTER, or TRUNCATE. Don't rely on the model not generating these—verify at parse time that none are present.
Run EXPLAIN before executing. A query that would scan a billion-row table without using an index will return, eventually, but it will degrade performance for every other user on the system. Run an EXPLAIN (not EXPLAIN ANALYZE, which actually executes the query) to check the estimated cost. If the planner estimates a full table scan on a large table, reject the query and ask the model to add appropriate filters or use indexed columns.
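A sketch of the cost gate, operating on Postgres-style EXPLAIN (FORMAT JSON) output; the plan dicts and thresholds below are canned examples rather than output from a live EXPLAIN call:

```python
# Cost ceiling and seq-scan row threshold are illustrative tuning knobs.
MAX_TOTAL_COST = 1_000_000.0
MAX_SEQ_SCAN_ROWS = 10_000_000

def plan_is_acceptable(plan: dict) -> bool:
    """Reject plans that are too expensive or seq-scan a very large table."""
    node = plan["Plan"]
    if node["Total Cost"] > MAX_TOTAL_COST:
        return False

    def has_big_seq_scan(n):
        if n.get("Node Type") == "Seq Scan" and n.get("Plan Rows", 0) > MAX_SEQ_SCAN_ROWS:
            return True
        return any(has_big_seq_scan(child) for child in n.get("Plans", []))

    return not has_big_seq_scan(node)

# Canned plans in the shape Postgres emits for EXPLAIN (FORMAT JSON).
cheap = {"Plan": {"Node Type": "Index Scan", "Total Cost": 42.5, "Plan Rows": 100}}
scan  = {"Plan": {"Node Type": "Seq Scan", "Total Cost": 2_500_000.0,
                  "Plan Rows": 40_000_000}}
print(plan_is_acceptable(cheap), plan_is_acceptable(scan))  # True False
```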
Enforce LIMIT by default. Append a LIMIT clause to any query that doesn't include one. An unbounded SELECT * on an analytics table isn't the analyst's fault—the system should have caught it.
This validation layer also doubles as the feedback signal for self-correction. When a query fails validation, the error message tells the model exactly what it did wrong: "Column order_ts not found in table orders. Available columns: order_date, created_at, updated_at." The model can generate a corrected query in the next step. Teams that implement structured error feedback see meaningful improvement in multi-turn accuracy compared to systems that surface raw database error messages or nothing at all.
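For example, a validation layer can turn a bad column reference into a correction hint instead of surfacing the raw database error (the schema contents below are illustrative):

```python
import difflib

# Illustrative column metadata for the feedback message.
SCHEMA_COLUMNS = {
    "orders": ["order_id", "order_date", "created_at", "updated_at"],
}

def column_error(table: str, bad_column: str) -> str:
    """Build a correction hint the model can act on, not a raw DB error."""
    available = SCHEMA_COLUMNS[table]
    close = difflib.get_close_matches(bad_column, available, n=3)
    msg = (f"Column {bad_column} not found in table {table}. "
           f"Available columns: {', '.join(available)}.")
    if close:
        msg += f" Did you mean: {', '.join(close)}?"
    return msg

print(column_error("orders", "order_ts"))
```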
The Read-Only Boundary Is Not Optional
A SQL agent connected to a database with read-write permissions is a system that can corrupt or delete production data if something goes wrong. Something always goes wrong.
The read-only boundary is enforced at the database layer, not the application layer. Create a dedicated database user with SELECT-only grants, scoped to the specific tables the agent should access. Even if the agent generates a DELETE statement that somehow passes application-layer validation, the database will reject it. This is defense in depth: your code can have a bug, but the database permission model doesn't.
The scope of "read-only" deserves more precision. A database user with SELECT on all tables isn't least-privilege—it's the broadest read permission possible. Structure permissions so the agent's database user only has SELECT on the tables in the allowed schema subset. A customer analytics agent shouldn't be able to query the internal tools schema; a product usage agent shouldn't touch financial data.
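One way to keep the two layers aligned is to generate the database grants from the same allowlist the application layer enforces, so they can't silently drift apart. A sketch with hypothetical role and table names, emitting Postgres-style DDL:

```python
# Hypothetical role name and table allowlist.
AGENT_ROLE = "sql_agent_ro"
ALLOWED_TABLES = ["analytics.orders", "analytics.customers"]

def grant_statements(role, tables):
    """Emit Postgres-style DDL for a SELECT-only agent role, scoped to
    exactly the tables in the application allowlist."""
    schemas = sorted({t.split(".")[0] for t in tables})
    stmts = [f"CREATE ROLE {role} WITH LOGIN;"]
    stmts += [f"GRANT USAGE ON SCHEMA {s} TO {role};" for s in schemas]
    stmts += [f"GRANT SELECT ON {t} TO {role};" for t in tables]
    return stmts

for stmt in grant_statements(AGENT_ROLE, ALLOWED_TABLES):
    print(stmt)
```

Because the grants are derived, adding a table to the agent's scope is one change in one place, reviewed once.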
For agents that need both read and write access—generating SQL updates as part of an automated workflow—the permission structure changes significantly. Write access requires:
- Scoping to specific tables only
- Requiring a human approval step before execution for any mutating query
- Maintaining a full audit log of every query executed and the natural language request that produced it
- Limiting to INSERT-only (append) rather than UPDATE/DELETE where possible
The audit trail is often underestimated. When something goes wrong—and with write access, "when" is more appropriate than "if"—you need to reconstruct exactly what the agent executed and why. A log that stores only the SQL query is insufficient; you need the original user request, the intermediate queries the agent considered, the validation results, and the final execution result.
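A sketch of what a sufficient audit record might capture; the field names and sample values are illustrative:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """One entry per agent interaction; field names here are illustrative."""
    user_request: str            # the original natural language question
    candidate_queries: list      # intermediate SQL the agent considered
    validation_results: list     # outcome of each validation step
    executed_sql: str            # what actually ran ("" if everything was rejected)
    row_count: int
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

record = AuditRecord(
    user_request="revenue by region last month",
    candidate_queries=["SELECT region, SUM(total_usd) FROM orders GROUP BY region"],
    validation_results=["parse: ok", "allowlist: ok", "explain: ok"],
    executed_sql="SELECT region, SUM(total_usd) FROM orders GROUP BY region LIMIT 1000",
    row_count=12,
)
print(json.dumps(asdict(record), indent=2))  # ship this to durable storage
```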
The Real Cost of Getting This Wrong
A bad document retrieval returns an irrelevant passage. A bad SQL query has a different cost profile:
A full table scan on a 500GB table ties up read replicas and degrades query performance for every other user until it times out or completes. In a system without resource controls, a single agent query can knock out a reporting cluster for 20–30 minutes.
A query against the wrong tenant's data—because the agent failed to apply row-level security filters—is a data breach. The agent had SELECT permission; it just used it incorrectly. Whether a customer's data got exposed is a legal question, not just an engineering one.
A confidently wrong answer that feeds downstream decisions is the failure mode that's hardest to detect. The query ran, it returned results, the number looked plausible. The agent hallucinated that net_revenue excluded refunds when it didn't. The analyst built a board deck on it.
The asymmetry between document RAG and SQL agents comes down to this: bad retrieval produces answers that are obviously imprecise. Bad SQL produces answers that are indistinguishable from correct until someone checks the underlying data.
Building Toward Production Reliability
Teams that reach 90%+ accuracy on production text-to-SQL systems share a common architecture, and none of it is a prompting trick.
The foundation is a semantic layer: centralized, maintained definitions for every table and column the agent can access, written in plain language that maps business terminology to schema elements. "Monthly active users" maps to a specific combination of user_events rows, not to a guess. This metadata is what retrieval uses, what the model reasons over, and what makes schema subsetting precise.
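As a sketch, a semantic layer can start as little more than a maintained registry keyed by business term. The metric definition below is a hypothetical example, not a canonical MAU query:

```python
# Hypothetical semantic-layer entries: each business term maps to one
# maintained SQL definition, so "monthly active users" is never a guess.
METRICS = {
    "monthly active users": {
        "table": "user_events",
        "sql": (
            "SELECT COUNT(DISTINCT user_id) AS mau "
            "FROM user_events "
            "WHERE event_date >= DATE_TRUNC('month', CURRENT_DATE)"
        ),
        "description": "Distinct users with at least one event this month.",
    },
}

def resolve_metric(term: str):
    """Return the canonical definition, or None so the caller can refuse
    to guess rather than let the model improvise a definition."""
    return METRICS.get(term.lower().strip())
```

The important behavior is the None branch: an unrecognized business term should route to clarification, not to the model's best guess at a definition.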
On top of that, a multi-agent structure outperforms a single agent trying to do everything. A schema agent responsible for identifying relevant tables, a SQL generation agent focused on constructing the query, and a validation agent that checks the result before surfacing it to the user. Each agent has a narrow job and can be tested independently. Errors in schema selection don't propagate silently into the SQL generation step—they fail at a boundary where you can catch and correct them.
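To make those boundaries concrete, here's a minimal orchestration sketch with stub functions standing in for the three agents (each would be an LLM call in practice; the structure, not the stubs, is the point):

```python
def run_pipeline(question, schema_agent, sql_agent, validator):
    """Each stage fails at its own boundary instead of propagating silently."""
    tables = schema_agent(question)
    if not tables:
        return {"ok": False, "stage": "schema", "error": "no relevant tables found"}
    sql = sql_agent(question, tables)
    problems = validator(sql, tables)
    if problems:
        return {"ok": False, "stage": "validation", "error": problems}
    return {"ok": True, "sql": sql}

# Stub agents for illustration.
result = run_pipeline(
    "revenue by region",
    schema_agent=lambda q: ["orders"],
    sql_agent=lambda q, tables: "SELECT region, SUM(total_usd) FROM orders GROUP BY region",
    validator=lambda sql, tables: [],
)
print(result["ok"])  # True
```

Because each failure names its stage, a schema-selection miss is logged and corrected as a schema problem, never misdiagnosed as a SQL generation bug.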
Finally, evaluation matters more for SQL agents than for most LLM applications. The evaluation signal is strong—queries either return the correct result or they don't—and continuous evaluation against a test set of known queries gives you a real signal on whether changes to schema documentation, model versions, or prompt structure improved or regressed accuracy. Teams without this signal are flying blind, optimizing for benchmark numbers while production accuracy drifts.
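A minimal version of that evaluation loop compares each generated query's result set against a gold result; the test cases and executor below are stubs:

```python
def execution_accuracy(test_cases, run_query):
    """Fraction of test questions whose generated query returns the gold
    result. `run_query` executes SQL and returns rows."""
    passed = 0
    for case in test_cases:
        try:
            got = run_query(case["generated_sql"])
        except Exception:
            continue  # a query that fails to execute counts as a miss
        # Compare as sorted lists: row order usually isn't significant.
        if sorted(got) == sorted(case["gold_rows"]):
            passed += 1
    return passed / len(test_cases)

# Stub executor: canned results standing in for a real database.
results = {"SELECT 1": [(1,)], "SELECT 2": [(3,)]}
cases = [
    {"generated_sql": "SELECT 1", "gold_rows": [(1,)]},   # correct result
    {"generated_sql": "SELECT 2", "gold_rows": [(2,)]},   # wrong result
]
print(execution_accuracy(cases, lambda sql: results[sql]))  # 0.5
```

Run this against a fixed set of known questions on every change to schema documentation, prompts, or model version, and regressions surface as a number instead of a support ticket.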
The engineers who've shipped reliable SQL agents aren't the ones who found a better prompt. They're the ones who built better infrastructure around the model.
SQL agents are worth building. Enabling analysts to query data without writing SQL is genuinely valuable, and the underlying model capability is real. But deploying one without schema subsetting, runtime validation, and least-privilege database permissions isn't cutting corners—it's skipping the parts of the system that make it safe to run.
Sources
- https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff
- https://www.lamini.ai/blog/llm-text-to-sql
- https://www.salesforce.com/blog/text-to-sql-agent/
- https://dev.to/kowshik_jallipalli_a7e0a5/safe-text-to-sql-giving-an-agent-database-access-without-dropping-tables-or-leaking-pii-i47
- https://medium.com/wrenai/reducing-hallucinations-in-text-to-sql-building-trust-and-accuracy-in-data-access-176ac636e208
- https://arxiv.org/html/2410.06011v1
- https://medium.com/dataherald/text-to-sql-benchmarks-and-the-current-state-of-the-art-63dd3b3943fe
- https://cloud.google.com/blog/products/databases/techniques-for-improving-text-to-sql
- https://openreview.net/forum?id=XmProj9cPs
- https://www.osohq.com/learn/ai-agent-permissions-delegated-access
- https://workos.com/blog/ai-agent-access-control
