553 posts tagged with "ai-engineering"

Temperature Governance in Multi-Agent Systems: Why Variance Is a First-Class Budget

11 min read
Tian Pan
Software Engineer

Most production multi-agent systems apply a single temperature value—copied from a tutorial, set once, never revisited—to every agent in the pipeline. The classifier, the generator, the verifier, and the formatter all run at 0.7 because that's what the README said. This is the equivalent of giving every database query the same timeout regardless of whether it's a point lookup or a full table scan. It feels fine until you start debugging failure modes that look like model errors but are actually sampling policy errors.

Temperature is not a global dial. It's a per-role policy decision, and getting it wrong creates distinct failure signatures depending on which direction you miss in.
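
To make that concrete, here is a minimal sketch of what a per-role sampling policy can look like, assuming the official openai Python client; the role names, temperature values, and model string are illustrative assumptions, not recommendations from the post.

```python
# A sketch of per-role temperature policy, assuming an OpenAI-style
# chat completions client. Roles, values, and model are illustrative.
from openai import OpenAI

client = OpenAI()

TEMPERATURE_POLICY = {
    "classifier": 0.0,  # routing decisions: variance here is pure noise
    "generator": 0.8,   # drafting benefits from sampling diversity
    "verifier": 0.0,    # checks should be reproducible across runs
    "formatter": 0.2,   # near-deterministic, with slack for phrasing
}

def run_role(role: str, messages: list[dict]) -> str:
    """Route every call through the policy table instead of a global default."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=messages,
        temperature=TEMPERATURE_POLICY[role],
    )
    return response.choices[0].message.content
```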

Text-to-SQL in Production: Why Natural Language Queries Fail at the Schema Boundary

9 min read
Tian Pan
Software Engineer

The demo works every time. The LLM translates "show me last quarter's top ten customers by revenue" into pristine SQL, the results pop up instantly, and everyone in the room nods. Then you deploy it against your actual warehouse — 130 tables, 1,400 columns, a decade of organic naming conventions — and the model starts confidently generating queries that return the wrong numbers. No errors. Just wrong answers.

This is the schema boundary problem, and it's why text-to-SQL has the widest gap of any AI capability between benchmark performance and production reality. A model that scores 86% on Spider 1.0 (the canonical academic benchmark) drops to around 6% accuracy on Spider 2.0, which approximates real enterprise schema complexity. Vendors demo on clean, toy schemas. You're deploying on yours.

The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think

10 min read
Tian Pan
Software Engineer

Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.

The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.
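
The quadratic shape is easy to see with a toy calculation. In the sketch below, all token counts are assumptions rather than numbers from the post: each turn re-sends the entire conversation so far, so total input cost grows with the square of the turn count while the whiteboard estimate grows linearly.

```python
def cumulative_input_tokens(n_turns: int, base_prompt: int = 2_000,
                            tokens_per_tool_result: int = 1_500) -> int:
    """Total input tokens billed across an agent loop that re-sends the
    full conversation every turn. All token counts are illustrative."""
    total, context = 0, base_prompt
    for _ in range(n_turns):
        total += context                   # each call pays for the whole context so far
        context += tokens_per_tool_result  # tool output is appended for the next call
    return total

for n in (5, 10, 20):
    naive = n * 2_000                      # the whiteboard estimate: calls x prompt size
    actual = cumulative_input_tokens(n)
    print(f"{n:>2} turns: naive {naive:>6,} vs actual {actual:>7,} ({actual / naive:.1f}x)")
```

With these assumed sizes, a 10-turn loop already costs about 4x the naive estimate and a 20-turn loop about 8x, which is how a linear-looking budget lands in the 5x-and-up range the teaser describes.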

What Your Vendor's Model Card Doesn't Tell You

10 min read
Tian Pan
Software Engineer

A model card will tell you that the model scores 88.7 on MMLU. It will not tell you that the model systematically attributes blame to whichever technology appears first in a list of possibilities, causing roughly 10% of its attribution answers to be semantically wrong even when factually correct. It will not tell you that adding "you are a helpful assistant" to your system prompt degrades performance on structured reasoning tasks compared to leaving the system prompt blank. It will not tell you that under load the 99th-percentile latency is 4x the median, or that the model's behavior on legal and financial queries changes measurably depending on whether you include a compliance disclaimer.

None of this is in the model card. You will learn it by shipping to production and watching things break.

Vibe Code at Scale: Managing Technical Debt When AI Writes Most of Your Codebase

9 min read
Tian Pan
Software Engineer

In March 2026, a major e-commerce platform lost 6.3 million orders in a single day — 99% of its U.S. order volume gone. The cause wasn't a rogue deployment or a database failure. An AI coding tool had autonomously generated and deployed code based on outdated internal documentation, corrupting delivery time estimates across every marketplace. The company had mandated that 80% of engineers use the tool weekly. Adoption metrics were green. Engineering discipline was not.

This is what vibe coding at scale actually looks like. Not the fast demos that ship in four days. The 6.3 million orders that vanish on day 365.

The Vibe Coding Productivity Plateau: Why AI Speed Gains Reverse After Month Three

8 min read
Tian Pan
Software Engineer

In a randomized controlled trial, developers using AI coding assistants predicted they'd be 24% faster. They were actually 19% slower. The kicker: they still believed they had gotten faster. This cognitive gap — where the feeling of productivity diverges from actual delivery — is the early warning signal of a failure mode that plays out over months, not hours.

The industry has reached near-universal AI adoption. Ninety-three percent of developers use AI coding tools. Productivity gains have stalled at around 10%. The gap between those numbers is not a tool problem. It is a compounding debt problem that most teams don't notice until it's expensive to reverse.

A/B Testing AI Features When the Treatment Is Non-Deterministic

10 min read
Tian Pan
Software Engineer

Your team ships a new LLM-powered feature, runs a clean A/B test for two weeks, and sees a statistically significant improvement. You roll it out. Three weeks later, retention metrics are flat and support tickets are up. What went wrong? You ran a textbook experiment on a non-textbook treatment — and the textbook assumption that "the treatment is stable" broke silently.

Standard A/B testing was designed for deterministic or near-deterministic treatments: a button color change, a ranking algorithm with fixed parameters, a checkout flow. LLM features violate almost every assumption that makes classical frequentist experiments reliable. The treatment variance is high, the treatment itself mutates mid-experiment when providers push model updates, success is hard to operationalize, and novelty effects are strong enough to produce results that evaporate after users adapt.

This post is about the adjustments that make experimentation work anyway.
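
One of those adjustments is easy to quantify up front: higher outcome variance directly inflates the sample size a frequentist test needs. A back-of-the-envelope sketch using the standard two-sample formula n = 2(z_alpha/2 + z_beta)^2 * sigma^2 / delta^2, where the variance numbers are illustrative assumptions, not figures from the post:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(min_effect: float, sigma: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sample z-test sample size: n = 2 (z_{a/2} + z_b)^2 sigma^2 / effect^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / min_effect ** 2)

# Illustrative: the LLM arm adds response-level sampling variance on top of
# user-level variance, so detecting the same lift needs more traffic.
user_sigma = 1.0
llm_sigma = (user_sigma**2 + 0.6**2) ** 0.5  # 0.6 = assumed sampling component

print(n_per_arm(min_effect=0.05, sigma=user_sigma))  # ~6,280 users per arm
print(n_per_arm(min_effect=0.05, sigma=llm_sigma))   # ~8,540 users per arm
```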

The Cascade Problem: Why Agent Side Effects Explode at Scale

12 min read
Tian Pan
Software Engineer

A team ships a document-processing agent. It works flawlessly in development: reads files, extracts data, writes results to a database, sends a confirmation webhook. They run 50 test cases. All pass.

Two weeks after deployment, with a hundred concurrent agent instances running, the database has 40,000 duplicate records, three downstream services have received thousands of spurious webhooks, and a shared configuration file has been half-overwritten by two agents that ran simultaneously.

The agent didn't break. The system broke because no individual agent test ever had to share the world with another agent.
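
One concrete way to see the duplicate-record failure is a write path with no idempotency key: a retried or concurrent agent step inserts the same result twice. A minimal sketch of the mitigation follows; the schema, table, and key derivation are illustrative assumptions, not details from the post.

```python
# Derive an idempotency key from the work item so a retried or concurrent
# agent step cannot insert the same record twice. Names are illustrative.
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (idempotency_key TEXT PRIMARY KEY, payload TEXT)")

def write_result(document_id: str, step: str, payload: str) -> bool:
    """Returns True if this call actually wrote; False if another agent
    instance (or a retry) already did."""
    key = hashlib.sha256(f"{document_id}:{step}".encode()).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO results (idempotency_key, payload) VALUES (?, ?)",
        (key, payload),
    )
    conn.commit()
    return cur.rowcount == 1

assert write_result("doc-42", "extract", '{"total": 118}') is True
assert write_result("doc-42", "extract", '{"total": 118}') is False  # retry is a no-op
```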

The Agent Specification Gap: Why Your Agents Ignore What You Write

12 min read
Tian Pan
Software Engineer

You wrote a careful spec. You described the task, listed the constraints, and gave examples. The agent ran — and did something completely different from what you wanted.

This is the specification gap: the distance between the instructions you write and the task the agent interprets. It's not a model capability problem. It's a specification problem. Research on multi-agent system failures published in 2025 found that specification-related issues account for 41.77% of all failures, and that 79% of production breakdowns trace back to how tasks were specified, not to what models can do.

Most teams writing agent specs are making the same category of mistake: writing instructions the way they'd write an email to a competent colleague, then expecting an autonomous system with no shared context to execute them correctly across thousands of runs.

AI as a CI/CD Gate: What Agents Can and Cannot Reliably Block

9 min read
Tian Pan
Software Engineer

An AI reviewer blocks a merge. A developer stares at the failing check, clicks "view details," skims three paragraphs of boilerplate, and files a "force-push exception" without reading the actual finding. Within a week, every engineer on the team has internalized that the AI gate is background noise — something to dismiss, not engage with.

This is the outcome most teams building AI CI/CD gates actually ship, even when the underlying model is technically capable. The problem is not whether AI can review code. The problem is what you ask it to block, and what you expect to happen when it does.

AI Coding Agents on Legacy Codebases: What Works and What Backfires

10 min read
Tian Pan
Software Engineer

Most AI coding demos show an agent building a greenfield Todo app or implementing a clean API from scratch. Your codebase, however, is a fifteen-year-old monolith with undocumented implicit contracts, deprecated dependencies that three teams depend on in ways nobody fully understands, and a service layer that started as a single class and now spans forty files. The gap between demo and reality is not just a size problem — it's a structural one, and understanding it before you hand your agents the keys prevents a specific category of subtle, expensive failures.

AI coding agents genuinely help with legacy systems, but only within certain task boundaries. Outside those boundaries, they don't just fail noisily — they produce plausible-looking, syntactically valid, semantically wrong changes that slip through code review and surface in production.

Why Users Ignore the AI Feature You Spent Three Months Building

10 min read
Tian Pan
Software Engineer

Your team spent three months integrating an LLM into your product. The model works. The latency is acceptable. The demo looks great. You ship. And then you watch the usage metrics flatline at 4%.

This is the typical arc. Most AI features fail not at the model level but at the adoption level. The underlying cause isn't technical — it's a cluster of product decisions that were made (or not made) around discoverability, trust, and habit formation. Understanding why adoption fails, and what to actually measure and change, separates teams that ship useful AI from teams that ship impressive demos.