780 posts tagged with "ai-engineering"

Text-to-SQL in Production: Why Natural Language Queries Fail at the Schema Boundary

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

The demo works every time. The LLM translates "show me last quarter's top ten customers by revenue" into pristine SQL, the results pop up instantly, and everyone in the room nods. Then you deploy it against your actual warehouse — 130 tables, 1,400 columns, a decade of organic naming conventions — and the model starts confidently generating queries that return the wrong numbers. No errors. Just wrong answers.

This is the schema boundary problem, and it's why text-to-SQL has the widest gap of any AI capability between benchmark performance and production reality. A model that scores 86% on Spider 1.0 (the canonical academic benchmark) drops to around 6% accuracy on Spider 2.0, which approximates real enterprise schema complexity. Vendors demo on clean, toy schemas. You're deploying on yours.

The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.

The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, it is not: every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.
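A rough sketch of that arithmetic, assuming each call re-sends the full conversation history and each turn appends a fixed number of tokens. The token counts are hypothetical, and the sketch ignores output tokens and any prompt caching a provider may offer:

```python
def naive_input_tokens(turns: int, tokens_per_call: int) -> int:
    """Whiteboard estimate: every call costs the same."""
    return turns * tokens_per_call


def cumulative_input_tokens(turns: int, base_context: int, added_per_turn: int) -> int:
    """Input tokens billed when every call re-sends the whole history."""
    total = 0
    context = base_context
    for _ in range(turns):
        total += context            # this call pays for everything so far
        context += added_per_turn   # tool call + result grow the history
    return total


# Hypothetical numbers: 2,000-token system prompt, 1,500 tokens added per turn.
BASE, ADDED = 2_000, 1_500

for turns in (10, 20, 40):
    naive = naive_input_tokens(turns, BASE + ADDED)
    actual = cumulative_input_tokens(turns, BASE, ADDED)
    print(f"{turns:>2} turns: naive={naive:>9,} actual={actual:>9,} ratio={actual / naive:.1f}x")
```

The loop is equivalent to the closed form `turns * base + added * turns * (turns - 1) / 2`, so doubling the number of turns roughly quadruples the history-driven part of the bill.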

What Your Vendor's Model Card Doesn't Tell You

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

A model card will tell you that the model scores 88.7 on MMLU. It will not tell you that the model systematically attributes blame to whichever technology appears first in a list of possibilities, causing roughly 10% of its attribution answers to be semantically wrong even when factually correct. It will not tell you that adding "you are a helpful assistant" to your system prompt degrades performance on structured reasoning tasks compared to leaving the system prompt blank. It will not tell you that under load the 99th-percentile latency is 4x the median, or that the model's behavior on legal and financial queries changes measurably depending on whether you include a compliance disclaimer.

None of this is in the model card. You will learn it by shipping to production and watching things break.

Vibe Code at Scale: Managing Technical Debt When AI Writes Most of Your Codebase

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

In March 2026, a major e-commerce platform lost 6.3 million orders in a single day — 99% of its U.S. order volume gone. The cause wasn't a rogue deployment or a database failure. An AI coding tool had autonomously generated and deployed code based on outdated internal documentation, corrupting delivery time estimates across every marketplace. The company had mandated that 80% of engineers use the tool weekly. Adoption metrics were green. Engineering discipline was not.

This is what vibe coding at scale actually looks like. Not the fast demos that ship in four days. The 6.3 million orders that vanish on day 365.

The Vibe Coding Productivity Plateau: Why AI Speed Gains Reverse After Month Three

April 20, 2026 · 8 min read

Tian Pan

Software Engineer

In a controlled randomized trial, developers using AI coding assistants predicted they'd be 24% faster. They were actually 19% slower. The kicker: they still believed they had gotten faster. This cognitive gap — where the feeling of productivity diverges from actual delivery — is the early warning signal of a failure mode that plays out over months, not hours.

The industry has reached near-universal AI adoption. Ninety-three percent of developers use AI coding tools. Productivity gains have stalled at around 10%. The gap between those numbers is not a tool problem. It is a compounding debt problem that most teams don't notice until it's expensive to reverse.

A/B Testing AI Features When the Treatment Is Non-Deterministic

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your team ships a new LLM-powered feature, runs a clean A/B test for two weeks, and sees a statistically significant improvement. You roll it out. Three weeks later, retention metrics are flat and support tickets are up. What went wrong? You ran a textbook experiment on a non-textbook treatment — and the textbook assumption that "the treatment is stable" broke silently.

Standard A/B testing was designed for deterministic or near-deterministic treatments: a button color change, a ranking algorithm with fixed parameters, a checkout flow. LLM features violate almost every assumption that makes classical frequentist experiments reliable. The treatment variance is high, the treatment itself mutates mid-experiment when providers push model updates, success is hard to operationalize, and novelty effects are strong enough to produce results that evaporate after users adapt.

This post is about the adjustments that make experimentation work anyway.
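To make the variance point concrete, here is a minimal sketch using the standard two-sample size formula for a difference in means. It assumes, for illustration only, that LLM sampling noise adds independently to ordinary user-level variance. The variance and effect-size values are hypothetical, and `samples_per_arm` is a helper name introduced here:

```python
from math import ceil
from statistics import NormalDist


def samples_per_arm(effect: float, variance: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per arm for a two-sided test on a difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical metric variances; real values have to be measured.
user_variance = 1.0          # user-to-user variation in the metric
llm_sampling_variance = 1.0  # extra noise from non-deterministic outputs

deterministic_n = samples_per_arm(effect=0.1, variance=user_variance)
llm_n = samples_per_arm(effect=0.1, variance=user_variance + llm_sampling_variance)

print(f"deterministic treatment: {deterministic_n} users per arm")
print(f"non-deterministic treatment: {llm_n} users per arm")
```

Required sample size scales linearly with variance, so an experiment sized for a deterministic treatment is underpowered once output sampling contributes noise of its own.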

The Cascade Problem: Why Agent Side Effects Explode at Scale

April 19, 2026 · 12 min read

Tian Pan

Software Engineer

A team ships a document-processing agent. It works flawlessly in development: reads files, extracts data, writes results to a database, sends a confirmation webhook. They run 50 test cases. All pass.

Two weeks after deployment, with a hundred concurrent agent instances running, the database has 40,000 duplicate records, three downstream services have received thousands of spurious webhooks, and a shared configuration file has been half-overwritten by two agents that ran simultaneously.

The agent didn't break. The system broke because no individual agent test ever had to share the world with another agent.

The Agent Specification Gap: Why Your Agents Ignore What You Write

April 19, 2026 · 12 min read

Tian Pan

Software Engineer

You wrote a careful spec. You described the task, listed the constraints, and gave examples. The agent ran — and did something completely different from what you wanted.

This is the specification gap: the distance between the instructions you write and the task the agent interprets. It's not a model capability problem. It's a specification problem. Research on multi-agent system failures published in 2025 found that specification-related issues account for 41.77% of all failures, and that 79% of production breakdowns trace back to how tasks were specified, not to what models can do.

Most teams writing agent specs make the same category of mistake: they write instructions the way they would write an email to a competent colleague, then expect an autonomous system with no shared context to execute them correctly across thousands of runs.

AI as a CI/CD Gate: What Agents Can and Cannot Reliably Block

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

An AI reviewer blocks a merge. A developer stares at the failing check, clicks "view details," skims three paragraphs of boilerplate, and files a "force-push exception" without reading the actual finding. Within a week, every engineer on the team has internalized that the AI gate is background noise — something to dismiss, not engage with.

This is the outcome most teams building AI CI/CD gates actually ship, even when the underlying model is technically capable. The problem is not whether AI can review code. The problem is what you ask it to block, and what you expect to happen when it does.

AI Coding Agents on Legacy Codebases: What Works and What Backfires

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most AI coding demos show an agent building a greenfield Todo app or implementing a clean API from scratch. Your codebase, however, is a fifteen-year-old monolith with undocumented implicit contracts, deprecated dependencies that three teams depend on in ways nobody fully understands, and a service layer that started as a single class and now spans forty files. The gap between demo and reality is not just a size problem — it's a structural one, and understanding it before you hand your agents the keys prevents a specific category of subtle, expensive failures.

AI coding agents genuinely help with legacy systems, but only within certain task boundaries. Outside those boundaries, they don't just fail noisily — they produce plausible-looking, syntactically valid, semantically wrong changes that slip through code review and surface in production.

Why Users Ignore the AI Feature You Spent Three Months Building

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your team spent three months integrating an LLM into your product. The model works. The latency is acceptable. The demo looks great. You ship. And then you watch the usage metrics flatline at 4%.

This is the typical arc. Most AI features fail not at the model level but at the adoption level. The underlying cause isn't technical — it's a cluster of product decisions that were made (or not made) around discoverability, trust, and habit formation. Understanding why adoption fails, and what to actually measure and change, separates teams that ship useful AI from teams that ship impressive demos.

When Your AI Feature Ages Out: Knowledge Cutoffs and Temporal Grounding in Production

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped in Q3. Evals looked good. Users were happy. Six months later, satisfaction scores have dropped 18 points, but your dashboards still show 99.9% uptime and sub-200ms latency. Nothing looks broken. Nothing is broken — in the traditional sense. The model is responding. The infrastructure is healthy. The feature is just quietly wrong.

This is what temporal decay looks like in production AI systems. It doesn't announce itself with errors. It accumulates as a gap between what the model knows and what the world has become — and by the time your support queue reflects it, the damage has been running for months.
