Text-to-SQL in Production: Why Correct SQL Is the Easy Part
GPT-4o scores 86.6% on the Spider benchmark. Deploy it against your actual data warehouse and you might get 10%. That gap is not a rounding error—it is the entire problem. The queries that make up the missing 76% execute without errors, return rows with the correct schema, and are completely wrong.
Text-to-SQL is not a syntax problem. Every serious production deployment discovers the same uncomfortable truth: the hard failures are silent ones. A query that scans a 10TB Snowflake table, returns revenue figures that are 30% too high due to a duplicated join, or quietly bypasses row-level security looks identical to a correct query from the outside. It finishes, it returns data, and nobody flags it.
This post covers the failure modes that actually bite teams in production, and the layered architecture that prevents them.
