Text-to-SQL in Production: Why Natural Language Queries Fail at the Schema Boundary
The demo works every time. The LLM translates "show me last quarter's top ten customers by revenue" into pristine SQL, the results pop up instantly, and everyone in the room nods. Then you deploy it against your actual warehouse — 130 tables, 1,400 columns, a decade of organic naming conventions — and the model starts confidently generating queries that return the wrong numbers. No errors. Just wrong answers.
This is the schema boundary problem, and it's why text-to-SQL has the widest gap of any AI capability between benchmark performance and production reality. A model that scores 86% on Spider 1.0 (the canonical academic benchmark) drops to around 6% accuracy on Spider 2.0, which approximates real enterprise schema complexity. Vendors demo on clean, toy schemas. You're deploying on yours.
Understanding why this gap exists — and what actually closes it — requires looking at four failure modes that compound each other: schema context overload, join graph blindness, ambiguous column names, and the silent wrong-answer problem.
The Schema Context Packaging Problem
LLMs don't understand your schema. They pattern-match against whatever schema context you inject into the prompt. That creates an immediate engineering dilemma: include too little context and the model hallucinates table and column names; include too much and the model drowns in irrelevant tokens, producing worse results while burning your budget.
Academic benchmarks test models on schemas with 50–100 total columns across 10–20 tables. Real enterprise databases run 1,000+ columns across 100+ tables. Dumping the full schema into a GPT-4o prompt consumes roughly 8,400 input tokens to produce 16 tokens of SQL. Across 10,000 queries per day, that's around $43,800 annually in wasted context — and the model's accuracy suffers from the noise.
The right approach is schema pruning before the model ever sees the schema. A deterministic pruning stage, using FK graph traversal and layered entity resolution, can reduce injected schema tokens by 93% while retaining the tables and columns the query actually needs. The three layers are: direct table name matching from the question, business terminology mapping (if the user says "revenue," which column in which table represents that), and column-level detection for specific filters or groupings.
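The three layers can be sketched as a simple cascade. Everything here — the toy schema, the synonym map, the matching rules — is illustrative, not a production implementation:

```python
# Sketch of the three pruning layers. Tables, columns, and synonyms are
# hypothetical stand-ins for a real catalog.
SCHEMA = {
    "orders": ["order_id", "customer_id", "order_date", "total_amt"],
    "customers": ["customer_id", "name", "region"],
    "web_sessions": ["session_id", "user_id", "started_at"],
}

# Layer 2: business terminology -> (table, column) that represents it
SYNONYMS = {"revenue": ("orders", "total_amt"), "client": ("customers", "name")}

def prune_schema(question: str) -> dict:
    q = question.lower()
    keep = set()
    # Layer 1: direct table name matching (crude singular/plural handling)
    for table in SCHEMA:
        if table in q or table.rstrip("s") in q:
            keep.add(table)
    # Layer 2: business terminology mapping
    for term, (table, _col) in SYNONYMS.items():
        if term in q:
            keep.add(table)
    # Layer 3: column-level detection for specific filters or groupings
    for table, cols in SCHEMA.items():
        if any(col in q for col in cols):
            keep.add(table)
    return {t: SCHEMA[t] for t in keep}
```

A question like "total revenue by customer region last month" keeps `orders` (via the "revenue" synonym) and `customers` (via the table name and the `region` column) while dropping `web_sessions` entirely.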
The problem is that purely text-based matching misses tables that are structurally necessary but semantically invisible — specifically, junction tables in many-to-many relationships. A table named order_line_items will match a question about orders; a table named xref_product_campaign won't match anything in the question even though the query may require it to bridge two entities. This is where schema pruning alone breaks down.
Join Graph Injection
Foreign key relationships encode knowledge that text similarity can't recover. When a user asks "which campaigns drove sales in Q3," the query needs a join path that may pass through two or three intermediate tables. A vector similarity search over table summaries will retrieve the campaigns table and the orders table; it won't retrieve the bridge table with a generic name that has no semantic similarity to the question but is the only way to connect the two.
Injecting the foreign key graph into the prompt — either as a structured schema linking representation or as a subgraph of tables reachable from the entities mentioned in the question — substantially reduces join errors. Research on the BIRD benchmark (which uses real, messy databases across 37 professional domains) shows that graph-aware schema linking approaches achieve 92% recall on required schema elements while reducing injected context by 83%.
The practical implementation: before constructing the LLM prompt, run a graph traversal from each entity mentioned in the question, walking FK relationships to discover structurally necessary intermediate tables. Include those tables in the schema context even when they don't match the question semantically. This is cheaper and more reliable than asking the LLM to reason about join paths from a subset of the schema.
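A minimal version of that traversal is a breadth-first search over the FK adjacency graph; the edge list below is hypothetical, but the shape is the point — the bridge tables fall out of the path, not out of semantics:

```python
# BFS over FK relationships to find the join path between two matched entity
# tables, including structurally necessary bridge tables. Edges are illustrative.
from collections import deque

FK_EDGES = {  # undirected adjacency over foreign key relationships
    "campaigns": ["xref_product_campaign"],
    "xref_product_campaign": ["campaigns", "products"],
    "products": ["xref_product_campaign", "order_line_items"],
    "order_line_items": ["products", "orders"],
    "orders": ["order_line_items"],
}

def join_path(src: str, dst: str) -> list[str]:
    """Shortest FK path from src to dst, bridge tables included."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in FK_EDGES.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return []  # no FK path exists
```

For "which campaigns drove sales in Q3," semantic retrieval finds `campaigns` and `orders`; the traversal recovers `xref_product_campaign` and `order_line_items`, the tables no embedding would surface.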
Graph-based techniques also help with a subtler problem: the LLM may know a join is needed but construct it incorrectly. Models frequently "invent plausible-looking join keys" when the actual FK relationship uses a non-obvious column pair. Injecting the exact FK constraints removes that ambiguity.
The Ambiguous Column Name Problem
Enterprise schemas are full of columns that mean different things in different tables: created_at appears in 40 tables, status appears in 60, user_id might reference different user populations depending on the table. When the model isn't given disambiguation context, it resolves these ambiguities by pattern-matching against what "should" exist in a schema — which is whatever e-commerce or SaaS schema dominated its training data.
One production failure pattern that appears repeatedly in engineering postmortems: the query executes cleanly and returns plausible-looking numbers, but it averaged prices at the wrong granularity, used order_date instead of shipped_date, or pulled from payments instead of revenue_recognition. The SQL is syntactically correct and semantically coherent — to the model. It just doesn't match the business definition.
Addressing this requires enriching the schema context with human-readable metadata: column descriptions, which columns represent canonical business metrics, which tables contain "ground truth" vs. derived or cached data, and domain-specific synonyms. In documented schemas, this metadata can be auto-extracted. In under-documented schemas (most real warehouses), it requires a one-time annotation pass. The return on investment is significant: teams that documented table descriptions and canonical metric definitions reported correct-query rates improving from 40% to 90% in production.
Beyond metadata, multi-shot prompting with representative query-SQL pairs from your specific schema teaches the model your conventions. If your schema uses _amt suffixes for monetary values and _cnt for counts, a handful of examples makes that pattern explicit in a way that descriptions alone don't.
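A minimal sketch of that prompt assembly, with hypothetical example pairs standing in for ones you'd curate from your own vetted query log:

```python
# Few-shot prompt construction. The example pairs are illustrative; in practice
# they come from reviewed question/SQL pairs against your actual schema, chosen
# to surface conventions like _amt (money) and _cnt (count) suffixes.
EXAMPLES = [
    ("total revenue last month",
     "SELECT SUM(total_amt) FROM orders "
     "WHERE order_date >= date_trunc('month', now()) - interval '1 month';"),
    ("how many orders per region",
     "SELECT c.region, COUNT(*) AS order_cnt FROM orders o "
     "JOIN customers c ON o.customer_id = c.customer_id GROUP BY c.region;"),
]

def build_prompt(schema_context: str, question: str) -> str:
    shots = "\n\n".join(f"Question: {q}\nSQL: {sql}" for q, sql in EXAMPLES)
    return (
        f"Schema:\n{schema_context}\n\n"
        f"{shots}\n\n"
        f"Question: {question}\nSQL:"
    )
```

The examples do double duty: they demonstrate naming conventions and they anchor the model to your dialect's date arithmetic, which descriptions alone rarely accomplish.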
The Clarification Oracle
The most insidious failure mode in text-to-SQL isn't a broken query — it's an executed query that returns wrong results with no signal that anything went wrong. The user asks "what's our best-selling product?" and gets an answer. The answer is based on unit volume. The user meant by revenue. Both are plausible interpretations of the question, and the model chose one silently.
Building a clarification oracle into the pipeline is the production-grade fix. Before committing to SQL generation, the system evaluates whether the question is sufficiently specified:
- Can the intent be satisfied by a single, unambiguous SQL query?
- Are there multiple plausible interpretations that would produce different results?
- Is a required filter implicit (e.g., "active customers" — active by what definition)?
When ambiguity exceeds a threshold, the system generates a targeted follow-up question rather than proceeding to generation. "Are you measuring sales by unit count or revenue?" costs one interaction turn. Returning the wrong answer costs a business decision.
The engineering decision is where to set the threshold. Too aggressive, and the system asks clarifying questions for simple queries, frustrating users. Too permissive, and silent errors accumulate. A practical heuristic: ask when the question contains under-specified aggregate metrics (best, top, most), relative time references without a clear anchor ("last month" when the schema has both calendar months and fiscal months), or entity references that map to multiple tables with different granularity.
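The heuristic above reduces to a cheap trigger check that runs before generation. The trigger lists here are illustrative starting points, tuned per deployment:

```python
import re

# Trigger check for the clarification oracle. Both trigger sets are
# hypothetical seeds; real deployments extend them from logged failures.
VAGUE_SUPERLATIVES = {"best", "top", "most", "worst"}
DISAMBIGUATORS = {"revenue", "units", "count"}  # terms that resolve the metric
RELATIVE_TIME = re.compile(r"\blast (week|month|quarter|year)\b")

def needs_clarification(question: str) -> list[str]:
    q = question.lower()
    words = set(re.findall(r"[a-z]+", q))
    reasons = []
    if words & VAGUE_SUPERLATIVES and not (words & DISAMBIGUATORS):
        reasons.append("Under-specified aggregate: by revenue or by unit count?")
    if RELATIVE_TIME.search(q):
        reasons.append("Relative time reference: calendar or fiscal period?")
    return reasons  # empty list means proceed straight to generation
```

"What's our best-selling product?" trips the first trigger; "revenue last month" trips only the second; "count of orders in 2024" passes through clean.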
The expected information gain framework formalizes this: score each possible clarification question by the probability it resolves ambiguity multiplied by the entropy it would eliminate. Ask the question that maximally reduces uncertainty per turn. In practice, most queries benefit from at most one clarification round; questions that require more are typically underspecified enough that the user themselves doesn't know exactly what they want.
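The scoring itself is a few lines once you attach an interpretation distribution to each candidate question. The probabilities below are illustrative:

```python
import math

# Toy expected-information-gain scoring. Each candidate clarification carries
# the interpretation distribution that would remain after the user answers;
# all numbers here are made up for illustration.
def entropy(probs: list[float]) -> float:
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_gain(prior: list[float], p_resolves: float,
                  posterior: list[float]) -> float:
    """P(answer resolves ambiguity) times entropy eliminated."""
    return p_resolves * (entropy(prior) - entropy(posterior))

prior = [0.5, 0.5]  # "by revenue" vs "by unit count", equally likely
candidates = {
    "By unit count or revenue?": expected_gain(prior, 0.95, [1.0]),
    "Which region?":             expected_gain(prior, 0.90, [0.5, 0.5]),
}
best = max(candidates, key=candidates.get)  # ask the highest-gain question
```

The second candidate scores zero: even a perfectly answered "which region?" leaves the metric ambiguity untouched, which is exactly the behavior the framework is meant to encode.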
Closing the Loop: Iterative Error Correction
Even with good schema pruning, FK injection, and a clarification oracle, some queries will be wrong. The final defensive layer is execution-based feedback: run the generated SQL, inspect whether it executes, whether it returns results, and whether those results pass basic sanity checks — and feed failures back to the model.
A generate-execute-critique loop catches a large class of errors automatically. Syntax errors fail immediately. Schema reference errors (hallucinated column names) fail with database error messages that can be re-injected into the prompt with the instruction to fix the specific reference. Semantic errors are harder — a query that returns zero results when results are expected is suspicious; a query that returns a single row for a question expecting a distribution is suspicious. These signals aren't definitive, but they catch the most egregious misses.
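The loop itself is short; the value is in what gets fed back. In this sketch, `generate_sql` stands in for your model call and `conn` for any DB-API connection — both are assumptions about your stack:

```python
# Skeleton of a generate-execute-critique loop. `generate_sql(question,
# schema_ctx, feedback)` is a stand-in for the LLM call; `conn` is any
# DB-API-style connection (sqlite3 works for testing).
def run_with_feedback(question, schema_ctx, generate_sql, conn, max_tries=3):
    feedback = ""
    for _ in range(max_tries):
        sql = generate_sql(question, schema_ctx, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except Exception as exc:
            # Syntax and schema-reference errors: re-inject the DB error
            feedback = f"Previous attempt failed: {exc}. Fix that reference."
            continue
        if not rows:
            # Suspicious, not definitive: zero rows when results are expected
            feedback = "Previous query returned zero rows; re-check filters."
            continue
        return sql, rows
    raise RuntimeError("no acceptable query after retries")
```

The zero-row check is deliberately a soft signal — some questions legitimately return nothing — so production versions gate it on whether the question implies a non-empty answer.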
Multi-agent architectures help here: a generator agent produces SQL, a critic agent evaluates whether the query is consistent with the question and schema, and a loop agent orchestrates retries. The generator-critic pattern catches approximately 30–40% of initial errors in production deployments before they reach the user.
An important constraint: never execute LLM-generated SQL directly against a production database on the first attempt. Parse and validate the AST first using a library like sqlglot. Confirm all referenced tables and columns exist. Run a dry-run EXPLAIN to catch execution-level issues without returning data. Only after these gates pass should the query execute.
What Actually Works in Production
The teams shipping text-to-SQL systems that users trust have generally converged on the same architecture: aggressive schema pruning driven by FK graph traversal, metadata enrichment for canonical business terminology, a pre-flight clarification step for ambiguous questions, and a post-execution validation loop with automatic retry.
The benchmark numbers will continue to improve — the best models on BIRD are approaching 72% execution accuracy as of early 2026, up from under 50% two years ago. But the benchmark-to-production gap is structural, not just a matter of model capability. Benchmarks test on clean, normalized schemas with documented semantics. Your warehouse has fifteen years of naming drift, undocumented business rules in the ETL, and columns that mean different things in different contexts.
Closing that gap is an engineering problem, not a model problem. The LLM is pattern-matching; your job is to give it accurate, sufficient, and unambiguous patterns to match against. That means investing in schema metadata quality, building the infrastructure to prune context intelligently, and designing the pipeline to fail informatively rather than fail silently. Models will keep getting better. The schemas won't clean themselves.
Sources
- https://medium.com/google-cloud/the-six-failures-of-text-to-sql-and-how-to-fix-them-with-agents-ef5fd2b74b68
- https://www.nirmalya.net/posts/2026/02/text-to-sql-schema-pruning/
- https://arxiv.org/html/2411.00073v1
- https://bird-bench.github.io/
- https://www.usedatabrain.com/blog/llm-sql-evaluation
- https://medium.com/pinterest-engineering/how-we-built-text-to-sql-at-pinterest-30bad30dabff
- https://aws.amazon.com/blogs/machine-learning/build-a-robust-text-to-sql-solution-generating-complex-queries-self-correcting-and-querying-diverse-data-sources/
- https://arxiv.org/html/2402.14851v1
- https://arxiv.org/html/2507.06467
- https://cloud.google.com/blog/products/databases/techniques-for-improving-text-to-sql
