4 posts tagged with "sql"

Text-to-SQL in Production: Why Natural Language Queries Fail at the Schema Boundary

· 9 min read
Tian Pan
Software Engineer

The demo works every time. The LLM translates "show me last quarter's top ten customers by revenue" into pristine SQL, the results pop up instantly, and everyone in the room nods. Then you deploy it against your actual warehouse — 130 tables, 1,400 columns, a decade of organic naming conventions — and the model starts confidently generating queries that return the wrong numbers. No errors. Just wrong answers.

This is the schema boundary problem, and it's why text-to-SQL has the widest gap of any AI capability between benchmark performance and production reality. A model that scores 86% on Spider 1.0 (the canonical academic benchmark) drops to around 6% accuracy on Spider 2.0, which approximates real enterprise schema complexity. Vendors demo on clean, toy schemas. You're deploying on yours.

Text-to-SQL at Scale: What Nobody Tells You Before Production

· 11 min read
Tian Pan
Software Engineer

Text-to-SQL demos are deceptively easy to build. You paste a schema into a prompt, ask GPT-4 a question, get back a clean SELECT statement, and suddenly your Slack is full of "what if we built this into our data platform?" messages. Then you try to actually ship it. The benchmark says 85% accuracy. Your internal data team reports that about half the answers are wrong. Your security team asks who reviewed the generated queries before they hit production. Nobody has a good answer.

This is the gap between text-to-SQL as a research problem and text-to-SQL as an engineering problem. The research problem is about getting models to produce syntactically valid SQL. The engineering problem is about schema ambiguity, access control, query validation, and the fact that your enterprise database looks nothing like Spider or BIRD.
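One concrete piece of that engineering problem is query validation: never execute generated SQL as-is. A minimal sketch of such a gate, using only string-level checks (a real deployment would parse the query with a proper SQL parser; the function name and keyword list here are illustrative, not from any particular library):

```python
import re

# Keywords that should never appear in agent-generated analytics queries.
BLOCKED_KEYWORDS = {"insert", "update", "delete", "drop", "alter", "truncate", "grant"}

def validate_generated_sql(sql: str, max_limit: int = 1000) -> str:
    """Reject anything that is not a single, bounded SELECT statement."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        raise ValueError("multiple statements are not allowed")
    if not statement.lower().startswith(("select", "with")):
        raise ValueError("only SELECT queries may be executed")
    tokens = set(re.findall(r"[a-z_]+", statement.lower()))
    forbidden = tokens & BLOCKED_KEYWORDS
    if forbidden:
        raise ValueError(f"blocked keywords: {sorted(forbidden)}")
    if "limit" not in tokens:
        statement += f" LIMIT {max_limit}"  # cap result size defensively
    return statement
```

A gate like this answers the security team's question with something better than "nobody": every generated query passes through a reviewable, testable policy before it touches the warehouse.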

Why SQL Agents Fail in Production: Grounding LLMs Against Live Relational Databases

· 11 min read
Tian Pan
Software Engineer

The Spider benchmark looks great. GPT-4 scores above 85% on text-to-SQL translation across hundreds of test queries. Teams read those numbers, wire up a LangChain SQLDatabaseChain, and ship an "ask your data" feature. Two weeks later, an analyst's innocent question about revenue by region triggers a full table scan that takes down reporting for thirty minutes.

The benchmark number was real. The problem is that benchmarks don't use your schema.

Spider 1.0 tests models on databases with 5–30 tables and 50–100 columns. Your production data warehouse has 200 tables, 700+ columns, three dialects of SQL depending on which system you're querying, and column names that made sense to the engineer who wrote them four years ago but are meaningless to anyone else. When researchers introduced Spider 2.0—a benchmark with enterprise-scale schemas and real-world complexity—GPT-4o dropped from 86.6% to 10.1% success rate. That collapse is what production actually looks like.
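The full-table-scan incident above is preventable with a plan check before execution. A sketch of the idea, using SQLite's `EXPLAIN QUERY PLAN` purely for illustration (every warehouse dialect exposes its own plan or cost interface, and the table names here are hypothetical):

```python
import sqlite3

def plan_is_safe(conn: sqlite3.Connection, sql: str) -> bool:
    """Return False if the query plan contains a full table scan."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}").fetchall()
    # The last column of each plan row is a textual description such as
    # "SCAN orders" (full scan) or "SEARCH orders USING INDEX ..." (indexed).
    return not any(row[-1].startswith("SCAN") for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE INDEX idx_region ON orders(region)")

print(plan_is_safe(conn, "SELECT * FROM orders WHERE region = 'EMEA'"))
print(plan_is_safe(conn, "SELECT * FROM orders"))
```

Rejecting or rewriting unsafe plans costs one round trip to the planner and would have saved that thirty minutes of downed reporting.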

Your Database Schema Is Your Agent's Mental Model

· 9 min read
Tian Pan
Software Engineer

Most teams building agents treat their database schema as a backend concern. The schema was designed by engineers, for engineers, following decades of relational database best practices: normalize aggressively, avoid redundancy, split reference tables, enforce foreign keys. This approach is correct for OLTP systems. It is often wrong for AI agents.

When an agent reads your schema to figure out how to answer a question, it is not parsing a data structure. It is constructing a mental model of your business. If your schema was built for application code that already understands the domain, the agent will be working against a map drawn for someone else. The result is hallucinated joins, incorrect aggregations, and tool call chains that should take two steps but take eight.
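One common remedy is to leave the normalized OLTP tables alone and expose the agent a pre-joined view named in business vocabulary. A minimal sketch (table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized OLTP layout: correct for application code, opaque to agents.
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    cust_id INTEGER REFERENCES customers(id),
    amt_cents INTEGER,
    created_at TEXT
);

-- An agent-facing view that pre-joins the tables and renames columns
-- into business terms, so the model needs no tribal knowledge about
-- what cust_id means or that amounts are stored in cents.
CREATE VIEW customer_revenue AS
SELECT c.name AS customer_name,
       SUM(o.amt_cents) / 100.0 AS revenue_dollars
FROM customers c JOIN orders o ON o.cust_id = c.id
GROUP BY c.name;
""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO orders VALUES (1, 1, 250000, '2024-01-05')")
print(conn.execute("SELECT * FROM customer_revenue").fetchall())
```

The view is the map drawn for the agent: one object, business names, units resolved, and the two-step join it would otherwise have to discover on its own.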