780 posts tagged with "ai-engineering"

Why Gradual Rollouts Don't Work for AI Features (And What to Do Instead)

· 9 min read
Tian Pan
Software Engineer

Canary deployments work because bugs are binary. Code either crashes or it doesn't. You route 1% of traffic to the new version, watch error rates and latency for 30 minutes, and either roll back or proceed. The system grades itself. A bad deploy announces itself loudly.

AI features don't do that. A language model that starts generating subtly wrong advice, outdated recommendations, or plausible-sounding nonsense will produce zero 5xx errors. Latency stays within SLOs. The canary looks green while the product is silently failing its users.

This isn't a tooling problem. It's a conceptual mismatch. The entire mental model behind gradual rollouts — deterministic code, self-grading systems, binary pass/fail — breaks down the moment you introduce a component whose correctness cannot be measured by observing the request itself.

The Hybrid Automation Stack: A Decision Framework for Mixing Rules and LLMs

· 9 min read
Tian Pan
Software Engineer

Teams that replace all their Zapier flows and RPA scripts with LLM agents tend to discover the same thing six months later: they've traded brittle-but-auditable for flexible-but-unmaintainable. The Zapier flows broke in predictable ways—step 14 failed because the API changed. The LLM workflows break invisibly—the model quietly routes support tickets to the wrong queue, and nobody finds out until a customer escalates. The audit log says "AI decision," which is lawyer-speak for "no one knows."

The answer isn't to avoid LLMs in automation. It's to be deliberate about which tasks go to which system, and to architect the seam between them so failures don't cross over.

The Multi-Tenant Prompt Problem: When One System Prompt Serves Many Masters

· 9 min read
Tian Pan
Software Engineer

You ship a new platform-level guardrail — a rule that prevents the AI from discussing competitor pricing. It goes live Monday morning. By Wednesday, your largest enterprise customer files a support ticket: their sales assistant, which they'd carefully tuned to compare vendor options for their procurement team, stopped working. They didn't change anything. You changed something, and the blast radius hit them invisibly.

This is the multi-tenant prompt problem. B2B AI products that allow customer customization are actually running a layered instruction system, and most teams don't treat it like one. They treat it like string concatenation: take the platform prompt, append the customer's instructions, maybe append user preferences, and call the LLM. The model figures out the rest.

The model doesn't figure it out. It silently picks a winner, and you don't find out which one until someone complains.
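One way to stop the model from picking the winner is to resolve conflicts between layers before the prompt is assembled. The sketch below is a minimal illustration, not the post's implementation: the `Rule` type, the `PRECEDENCE` order, and the idea of reducing each instruction to an allow/deny decision per topic are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    layer: str     # which layer wrote the rule (hypothetical layer names)
    topic: str     # what the rule is about, e.g. "competitor_pricing"
    allowed: bool  # whether the assistant may engage with the topic


# Assumed precedence: earlier layers win over later ones.
PRECEDENCE = ["platform", "tenant", "user"]


def resolve(rules: List[Rule]) -> Tuple[Dict[str, Rule], List[Tuple[Rule, Rule]]]:
    """Pick one winning rule per topic and report every overridden rule."""
    decisions: Dict[str, Rule] = {}
    conflicts: List[Tuple[Rule, Rule]] = []
    for rule in sorted(rules, key=lambda r: PRECEDENCE.index(r.layer)):
        winner = decisions.get(rule.topic)
        if winner is None:
            decisions[rule.topic] = rule
        elif winner.allowed != rule.allowed:
            # A lower-precedence layer disagrees: surface it instead of
            # letting the model settle the disagreement silently.
            conflicts.append((winner, rule))
    return decisions, conflicts


rules = [
    Rule("tenant", "competitor_pricing", allowed=True),
    Rule("platform", "competitor_pricing", allowed=False),
]
decisions, conflicts = resolve(rules)
```

Run against every tenant's configuration before a platform rule ships, a check like this would list the customers whose instructions the new guardrail overrides, which is the information that was missing on Monday morning.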

The Multi-Variable Regression Problem: Isolating AI Failures When Everything Changed at Once

· 11 min read
Tian Pan
Software Engineer

The ticket comes in on a Monday morning: user satisfaction for your AI-powered feature dropped 18% over the weekend. You open the deployment log and your stomach drops. Friday's release included a model version bump from your provider, a prompt refinement by the product team, a retrieval corpus refresh after a content audit, and a tool schema update for a renamed API field. Four changes. One regression. Zero idea which variable to blame.

This is the multi-variable regression problem, and it's the hardest class of failure in production AI systems. Not because the failure is exotic — behavioral regressions happen constantly — but because the conditions that produce it are nearly guaranteed when teams move fast. The changes that individually look safe pile up, release together, and then leave you debugging in the dark.
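A common first step for untangling a bundled release is to revert one change at a time and re-run the same eval. The sketch below assumes each of the four changes can be toggled independently and that an `evaluate` function returning a quality score exists; both are hypothetical, as are the change names and the toy scores.

```python
from typing import Callable, Dict

# The four changes from the Friday release (names are illustrative).
CHANGES = ["model_version", "prompt", "retrieval_corpus", "tool_schema"]


def ablate_one_at_a_time(
    evaluate: Callable[[Dict[str, bool]], float],
) -> Dict[str, float]:
    """Return how much the score recovers when each change is reverted alone."""
    released = {change: True for change in CHANGES}
    released_score = evaluate(released)
    recovered: Dict[str, float] = {}
    for change in CHANGES:
        config = dict(released)
        config[change] = False  # revert only this change
        recovered[change] = evaluate(config) - released_score
    return recovered


def toy_evaluate(config: Dict[str, bool]) -> float:
    # Hypothetical stand-in for a real eval run: only the prompt change hurts.
    return 0.62 if config["prompt"] else 0.80


deltas = ablate_one_at_a_time(toy_evaluate)
suspect = max(deltas, key=deltas.get)
```

One-at-a-time reverts cost five eval runs but miss regressions caused by two changes interacting. With four toggles, the full grid is only 16 configurations, so running all of them is often affordable when the single reverts do not explain the drop.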

Prompt Linting: The Pre-Deployment Gate Your AI System Is Missing

· 8 min read
Tian Pan
Software Engineer

Every serious engineering team runs a linter before merging code. ESLint catches undefined variables. Prettier enforces formatting. Semgrep flags security anti-patterns. Nobody ships JavaScript to production without running at least one static check first.

Now consider what your team does before shipping a prompt change. If you're like most teams, the answer is: review it in a PR, eyeball it, maybe test it manually against a few inputs. Then merge. The system prompt for your production AI feature — the instruction set that controls how the model behaves for every single user — gets less pre-deployment scrutiny than a CSS change.

This gap is not a minor process oversight. A study analyzing over 2,000 developer prompts found that more than 10% contained vulnerabilities to prompt injection attacks, and roughly 4% had measurable bias issues — all without anyone noticing before deployment. The tooling to catch these automatically exists. Most teams just haven't wired it in yet.

Schema Entropy: Why Your Tool Definitions Are Rotting in Production

· 10 min read
Tian Pan
Software Engineer

Your agent was working fine in January. By March, it started failing on 15% of tool calls. By May, it was silently producing wrong outputs on another 20%. Nothing in your deployment logs changed. No one touched the agent code. The tool definitions look exactly like they did six months ago — and that's the problem.

Tool schemas don't have to be edited to become wrong. The services they describe change underneath them. Enum values get added. Required fields become optional in a backend refactor. A parameter that used to accept strings now expects an ISO 8601 timestamp. The schema document stays frozen while the underlying API keeps moving, and your agent keeps calling it confidently, with no idea the contract has shifted.

This is schema entropy: the gradual divergence between the tool definitions your agent was given and the behavior your production services actually exhibit. It is one of the most underappreciated reliability problems in production AI systems, and research suggests tool versioning issues account for roughly 60% of production agent failures.
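Drift of this kind can be caught by periodically comparing the frozen tool definition against whatever the service currently publishes. The sketch below assumes both sides are available as JSON-Schema-style dictionaries; the `diff_schema` helper and the example fields are hypothetical and cover only the three kinds of drift mentioned above.

```python
from typing import Any, Dict, List


def diff_schema(frozen: Dict[str, Any], live: Dict[str, Any]) -> List[str]:
    """List differences between a frozen tool schema and the live one."""
    findings: List[str] = []
    frozen_props = frozen.get("properties", {})
    live_props = live.get("properties", {})
    for name in sorted(set(frozen_props) | set(live_props)):
        if name not in live_props:
            findings.append(f"{name}: removed from live API")
            continue
        if name not in frozen_props:
            findings.append(f"{name}: new in live API")
            continue
        old, new = frozen_props[name], live_props[name]
        if (old.get("type"), old.get("format")) != (new.get("type"), new.get("format")):
            findings.append(f"{name}: type or format changed")
        if set(old.get("enum", [])) != set(new.get("enum", [])):
            findings.append(f"{name}: enum values changed")
    if set(frozen.get("required", [])) != set(live.get("required", [])):
        findings.append("required fields changed")
    return findings


frozen_tool = {
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed"]},
        "created_after": {"type": "string"},
    },
    "required": ["status", "created_after"],
}
live_api = {
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed", "archived"]},
        "created_after": {"type": "string", "format": "date-time"},
    },
    "required": ["status"],
}
drift = diff_schema(frozen_tool, live_api)
```

A structural diff only sees changes the service declares. Behavioral drift behind an unchanged schema still needs contract tests against the real endpoint.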

The Selective Abstention Problem: Why AI Systems That Always Answer Are Broken

· 10 min read
Tian Pan
Software Engineer

Here is a pattern that appears in almost every production AI deployment: the team ships a feature that handles 90% of queries well. Then they start getting complaints. A user asked something outside the training distribution; the model confidently produced a wrong answer. A RAG pipeline retrieved a stale document; the model answered as though it were current. A legal query hit an edge case the prompt didn't cover; the model speculated its way through it. The fix, in each case, wasn't a better model. It was teaching the system to say "I don't know."

Abstention — the principled decision to not answer — is one of the hardest and most undervalued capabilities in AI system design. Virtually all product effort goes toward making answers better. Almost none goes toward making the system reliably know when to withhold one. That asymmetry is a design debt that compounds in production.

Staffing AI Engineering Teams: Who Owns What When Every Feature Has an AI Component

· 11 min read
Tian Pan
Software Engineer

Three years ago, "AI team" meant a group of specialists tucked into a corner of the org chart, mostly invisible to product engineers. Today, a senior software engineer at a fintech company ships a fraud-scoring feature using a fine-tuned model on Monday, wires up a RAG pipeline for customer support on Wednesday, and debugs LLM latency on Friday. The specialists didn't go away—but the boundary between "AI work" and "product engineering" dissolved faster than almost anyone planned for.

Most teams responded by bolting new titles onto existing job descriptions and calling it done. That's the wrong answer, and the dysfunction shows up quickly: unclear ownership, duplicated tooling, and an ML platform team that spends half its time explaining why product teams can't just call the OpenAI API directly.

This post is about getting the structure right—not in the abstract, but for the actual stages of AI adoption most engineering organizations go through.

Your LLM Eval Is Lying to You: The Statistical Power Problem

· 9 min read
Tian Pan
Software Engineer

Your team spent three days iterating on a system prompt. The eval score went from 82% to 85%. You ship it. Three weeks later, production metrics are flat. What happened?

The short answer: your eval lied to you. Not through malice, but through insufficient sample size and ignored variance. A 3-point accuracy lift on a 100-example test set is well within the noise floor of most LLM systems. You cannot tell signal from randomness at that scale — but almost no one does the math to verify this before acting on results.

This is the statistical power problem in LLM evaluation, and it is quietly corrupting the iteration loops of most teams building AI products.
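The math can be checked in a few lines. The sketch below treats each eval as an independent binomial sample and uses the standard two-proportion z-test approximation with a 5% two-sided significance level and 80% power; those thresholds and the helper names are assumptions, not figures from the post.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)


def samples_needed(p1: float, p2: float, z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate examples per arm needed to detect p1 vs p2.

    Defaults correspond to 5% two-sided significance and 80% power.
    """
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


baseline, candidate, n = 0.82, 0.85, 100

# Standard error of the difference between two independent accuracy estimates.
se_diff = math.sqrt(standard_error(baseline, n) ** 2 + standard_error(candidate, n) ** 2)
z_score = (candidate - baseline) / se_diff
required_n = samples_needed(baseline, candidate)

print(f"z = {z_score:.2f}, examples needed per arm = {required_n}")
```

Under these assumptions the 82% to 85% lift gives a z-score of about 0.57, far below the 1.96 needed for significance, and detecting a lift that small would take roughly 2,400 examples per arm. Scoring both prompts on the same examples and using a paired test can reduce that number, but not to 100.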

The Curriculum Trap: Why Fine-Tuning on Your Best Examples Produces Mediocre Models

· 10 min read
Tian Pan
Software Engineer

Every fine-tuning effort eventually hits the same intuition: better data means better models, and better data means higher-quality examples. So teams build elaborate annotation pipelines to filter out the mediocre outputs, keep only the gold-standard responses, and train on a dataset they're proud of. The resulting model then underperforms on the exact use cases that motivated the project. This failure is so common it deserves a name: the curriculum trap.

The trap is this — curating only your best, most confident, most authoritative outputs doesn't teach the model to be better. It teaches the model to perform confidence regardless of whether confidence is warranted. You produce something that looks impressive in demos and falls apart in production, because production is full of the messy edge cases your curation process systematically excluded.

The Integration Test Mirage: Why Mocked Tool Outputs Hide Your Agent's Real Failure Modes

· 11 min read
Tian Pan
Software Engineer

Your agent passes every test. The CI pipeline is green. You ship it.

A week later, a user reports that their bulk-export job silently returned 200 records instead of 14,000. The agent hit the first page of a paginated API, got a clean response, assumed there was nothing more, and moved on. Your mock returned all 200 items in one shot. The real API never told the agent there were 69 more pages.

This is not a model failure. The model reasoned correctly. This is a test infrastructure failure — and it's endemic to how teams build and test agentic systems.
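A test double that paginates the way the real service does would have exposed the bug. The sketch below is a hypothetical fake, not any real API: the `list_records` method, the `next_cursor` field, and the 200-item page size are assumptions chosen to match the numbers in the story.

```python
from typing import Dict, List, Optional


class PaginatedFakeApi:
    """In-memory fake that serves records one page at a time."""

    def __init__(self, total_records: int, page_size: int = 200):
        self.records = [{"id": i} for i in range(total_records)]
        self.page_size = page_size

    def list_records(self, cursor: Optional[int] = None) -> Dict:
        start = cursor or 0
        end = start + self.page_size
        # next_cursor is None only on the last page.
        next_cursor = end if end < len(self.records) else None
        return {"items": self.records[start:end], "next_cursor": next_cursor}


def fetch_first_page_only(api: PaginatedFakeApi) -> List[Dict]:
    """The buggy behavior: one call, no check for further pages."""
    return api.list_records()["items"]


def fetch_all(api: PaginatedFakeApi) -> List[Dict]:
    """Follow next_cursor until the API reports there are no more pages."""
    items: List[Dict] = []
    cursor: Optional[int] = None
    while True:
        page = api.list_records(cursor)
        items.extend(page["items"])
        cursor = page["next_cursor"]
        if cursor is None:
            return items


api = PaginatedFakeApi(total_records=14_000)
partial = fetch_first_page_only(api)
complete = fetch_all(api)
```

A test that asserts on the total record count fails against `fetch_first_page_only` with this fake, whereas a mock that returns everything in one response lets both versions pass.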

The Overclaiming Trap: When Being Right for the Wrong Reasons Destroys AI Product Trust

· 10 min read
Tian Pan
Software Engineer

Most AI product post-mortems focus on the same story: the model was wrong, users noticed, trust eroded. The fix is obvious — improve accuracy. But there is a more insidious failure mode that post-mortems rarely capture because standard accuracy metrics don't surface it: the model was right, but for the wrong reasons, and the power users who checked the reasoning never came back.

Call it the overclaiming trap. It is the failure mode where correct final answers are backed by fabricated, retrofitted, or structurally unsound reasoning chains. It is more dangerous than ordinary wrongness because it looks like success until your most sophisticated users start quietly leaving.