
553 posts tagged with "ai-engineering"


The Silent Regression: How to Communicate AI Behavioral Changes Without Losing User Trust

· 9 min read
Tian Pan
Software Engineer

Your power users are your canaries. When you ship a new model version or update a system prompt, aggregate evaluation metrics tick upward — task completion rates improve, hallucination scores drop, A/B tests declare victory. Then your most sophisticated users start filing bug reports. "It used to just do X. Now it lectures me first." "The formatting changed and broke my downstream parser." "I can't get it to stay in character anymore." They aren't imagining things. You shipped a regression; you just didn't see it in your dashboards.

This is the central paradox of AI product development: the users most harmed by behavioral drift are the ones who invested most in understanding the system's quirks. They built workflows around specific output patterns. They learned which prompts reliably triggered which behaviors. When you change the model, you don't just ship updates — you silently invalidate months of their calibration work.
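To make that concrete, here is a minimal sketch of the kind of behavioral regression check that catches what aggregate metrics miss: it pins the specific output properties power users calibrate to (output format, absence of preamble, persona adherence) and diffs them across model versions. The `generate` stub, the example prompts, and the pinned properties are illustrative assumptions, not anything from the post.

```python
import json

def generate(prompt: str, model: str) -> str:
    """Stand-in for your model call; wire this to whatever API you actually use."""
    raise NotImplementedError  # assumption: you supply this

def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Each pin is a behavior some power user quietly depends on.
BEHAVIORAL_PINS = [
    ("Summarize this ticket as JSON.", "valid_json", _is_json),
    ("List three risks. No preamble.", "no_preamble",
     lambda out: not out.lower().startswith(("sure", "certainly", "here"))),
    ("Stay in character as the narrator.", "stays_in_character",
     lambda out: "as an ai" not in out.lower()),
]

def behavioral_diff(old_model: str, new_model: str) -> list[str]:
    """Report properties that held on the old model but break on the new one."""
    regressions = []
    for prompt, name, holds in BEHAVIORAL_PINS:
        if holds(generate(prompt, old_model)) and not holds(generate(prompt, new_model)):
            regressions.append(f"{name} regressed on prompt: {prompt!r}")
    return regressions
```

A handful of pins like these won't replace an eval suite, but they surface the "it used to just do X" reports before users have to file them.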

The Debugging Regression: How AI-Generated Code Shifts the Incident-Response Cost Curve

· 9 min read
Tian Pan
Software Engineer

In March 2026, a single AI-assisted code change cost one major retailer 6.3 million lost orders and a 99% drop in North American order volume — a six-hour production outage traced to a change deployed without proper review. It wasn't a novel attack. There was no exotic failure mode. The system just did what the AI told it to do, and no one on-call had the mental model to understand why that was wrong until millions of customers had already seen errors.

This is the debugging regression. The productivity gains from AI-generated code are front-loaded and visible on dashboards. The costs are back-loaded and invisible until your alerting wakes you up at 3am.

AI-Assisted Codebase Migration at Scale: Automating the Upgrades Nobody Wants to Touch

· 11 min read
Tian Pan
Software Engineer

When Airbnb needed to migrate 3,500 React test files from Enzyme to React Testing Library, they estimated the project at 1.5 years of manual effort. They shipped it in 6 weeks using an LLM-powered pipeline. When Google studied 39 distinct code migrations executed over 12 months by a team of 3 developers — 595 code changes, 93,574 edits — they found that 74% of the edits were AI-generated, 87% of those were committed without human modification, and the overall migration timeline was cut by 50%.

These numbers are real. But so is this: during those same migrations, engineers spent approximately 50% of their time validating AI output — fixing context window failures, cleaning up hallucinated imports, and untangling business logic errors the tests didn't catch. The efficiency gains are genuine, and so are the pain points. The question isn't whether AI belongs in code migrations; it's where it helps and where it creates more cleanup than it saves.

The AI Engineering Career Ladder: Why Your SWE Leveling Framework Is Lying to You

· 10 min read
Tian Pan
Software Engineer

A senior engineer at a mid-sized startup recently got a mediocre performance review. Their velocity was inconsistent — some weeks they shipped a ton of code, others almost nothing. Their manager, trained on traditional SWE frameworks, marked them down for output variability. Six weeks later, that engineer left for a competing team. What the manager didn't understand: the engineer's "slow" weeks were spent building evaluation infrastructure that prevented three categories of silent failures. Without it, the product would have been subtly broken in ways nobody would have noticed for months.

This pattern is playing out across engineering orgs right now. Teams that built their career ladders for deterministic software systems are applying those same frameworks to AI engineers — and systematically misidentifying their best people.

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.
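As a caricature of the antipattern and its fix, here is a short sketch; the ticket-routing example and function names are invented for illustration, not drawn from any system described above.

```python
def llm(prompt: str) -> str:
    """Stand-in for an LLM API call: slow, metered, and nondeterministic."""
    raise NotImplementedError

# The antipattern: a deterministic decision routed through a model anyway.
def route_ticket_antipattern(ticket: dict) -> str:
    return llm(f"Which queue should this support ticket go to? {ticket['subject']}")

# The boring fix: ordinary code for the deterministic cases, the LLM for the rest.
KEYWORD_ROUTES = {"refund": "billing", "invoice": "billing", "crash": "engineering"}

def route_ticket(ticket: dict) -> str:
    subject = ticket["subject"].lower()
    for keyword, queue in KEYWORD_ROUTES.items():
        if keyword in subject:
            return queue  # deterministic, testable, effectively free
    return llm(f"Which queue should this support ticket go to? {ticket['subject']}")
```

The point isn't that keyword matching beats a model; it's that every step you keep deterministic is a step you can unit-test, cache, and debug when the pipeline misbehaves.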

The AI Feature Deprecation Playbook: Shutting Down LLM Features Without Destroying User Trust

· 12 min read
Tian Pan
Software Engineer

When OpenAI first tried to retire GPT-4o in August 2025, the backlash forced them to reverse course within days. Users flooded forums with petitions and farewell letters. One user wrote: "He wasn't just a program. He was part of my routine, my peace, my emotional balance." That is not how users react to a deprecated REST endpoint. That is how they react to losing a relationship.

AI features break the mental model engineers bring to deprecation planning. Traditional software has a defined behavior contract: given the same input, you get the same output, forever, until you change it. An LLM-powered feature has a personality. It has warmth, hedges, phrasing preferences, and a characteristic way of saying "I'm not sure." Users don't just use these features — they calibrate to them. They build workflows, emotional dependencies, and intuitions around specific behavioral quirks that will never appear in any spec document.

When you shut that down, you are not removing a function. You are changing the social contract.

What 'Done' Means for AI-Powered Features: Engineering the Perpetual Beta

· 10 min read
Tian Pan
Software Engineer

Shipping a feature in traditional software ends with a merge. The unit tests pass. The integration tests pass. QA signs off. You flip the flag, and unless a bug surfaces in production, you move on. The feature is done. For AI-powered features, that moment doesn't exist — and if you're pretending it does, you're accumulating a stability debt that will eventually show up as a user trust problem.

The reason is straightforward but rarely designed around: deterministic software produces the same output from the same input every time. AI features do not. Not because of a bug, but because the behavior is defined by a model that lives outside your codebase, trained on data that reflects a world that keeps changing, consumed by users whose expectations evolve as they see what's possible.

AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet

· 9 min read
Tian Pan
Software Engineer

Every engineering team building on LLMs right now is making infrastructure decisions with a hidden cost they're not measuring. You track tokens. You track latency. You track API spend. But almost nobody tracks the carbon output of the inference workload they're running — and that gap is closing fast, from both the regulatory side and the market side.

AI systems now account for 2.5–3.7% of global greenhouse gas emissions, officially surpassing aviation's 2% contribution, and growing at 15% annually. US data centers running AI-specific servers consumed 53–76 TWh in 2024 alone — enough to power 7.2 million homes for a year. The scale is not hypothetical anymore, and the expectation that engineering teams will have visibility into their contribution is becoming a real organizational pressure.
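If you want a starting point before proper tooling exists, a back-of-the-envelope estimate is easy to wire into the token metrics you already track. The two constants below are placeholder assumptions, not figures from this post; replace them with measured energy-per-token numbers for your model and the carbon intensity of your provider's grid region.

```python
# Placeholder assumptions -- substitute measured values for your own deployment.
ASSUMED_KWH_PER_1K_TOKENS = 0.0003   # inference energy per 1K tokens (assumed)
ASSUMED_KG_CO2_PER_KWH = 0.4         # grid carbon intensity (assumed)

def estimated_inference_co2_kg(tokens: int) -> float:
    """Rough CO2 estimate for an inference workload under the assumptions above."""
    kwh = (tokens / 1000) * ASSUMED_KWH_PER_1K_TOKENS
    return kwh * ASSUMED_KG_CO2_PER_KWH

# Under these assumed constants, 1B tokens/month ~= 300 kWh ~= 120 kg CO2.
print(f"{estimated_inference_co2_kg(1_000_000_000):.1f} kg CO2")
```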

AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.
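One practical response is to page on per-task spend and iteration budgets, not just on infrastructure errors. The sketch below is a hypothetical guard, not the team's actual setup; the thresholds, the `page_oncall` hook, and the agent-loop shape are all assumptions.

```python
MAX_STEPS_PER_TASK = 25     # assumed: legitimate tasks rarely need more steps
MAX_USD_PER_TASK = 5.00     # assumed: per-task spend budget

def page_oncall(message: str) -> None:
    """Stand-in for your paging integration (PagerDuty, Opsgenie, ...)."""
    print(f"PAGE: {message}")

def run_agent_task(task_id: str, is_done, take_step, step_cost_usd):
    """Run one agent task, paging when it blows its loop or spend budget."""
    steps, spend = 0, 0.0
    while not is_done():
        if steps >= MAX_STEPS_PER_TASK:
            page_oncall(f"task {task_id}: {steps} steps without finishing, likely loop")
            return
        if spend >= MAX_USD_PER_TASK:
            page_oncall(f"task {task_id}: ${spend:.2f} spent, over budget")
            return
        result = take_step()
        spend += step_cost_usd(result)
        steps += 1
```

A guard like this would have turned the eleven-day loop above into a page within the first hour, because it alarms on what the system is spending and doing rather than on whether the infrastructure is healthy.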

The AI Product Metrics Trap: When Engagement Looks Like Value but Isn't

· 11 min read
Tian Pan
Software Engineer

A METR study published in 2025 asked 16 experienced open-source developers to predict how much faster AI tools would make them. They guessed 24% faster. The study then measured what actually happened across 246 real tasks — bug fixes, features, refactors — randomly assigned to AI-allowed and AI-disallowed conditions. The result: developers with AI access were 19% slower. After the study concluded, participants were surveyed again. They still believed AI had made them 20% faster.

That gap — between perceived productivity and measured productivity — is not a quirk of one study. It is the central problem with how most teams currently measure AI features. The signals that feel like success are, in many cases, measuring the novelty of the tool rather than its usefulness. And the first 30 days are the worst time to look.

AI Succession Planning: What Happens When the Team That Knows the Prompts Leaves

· 11 min read
Tian Pan
Software Engineer

The engineer who built your customer support AI leaves for another job. On their last day, you do an offboarding interview and ask them to document what they know. They write a few paragraphs explaining how the system works. Six months later, customer satisfaction scores start slipping. Someone suggests tightening the tone of the system prompt. Another engineer makes the edit, runs a few manual tests, and ships it. Three weeks later, you discover that a specific phrasing in the original system prompt was load-bearing in ways nobody knew — it was the only thing preventing the model from over-escalating tickets on Friday afternoons, a pattern the original engineer had noticed and quietly fixed with a single sentence.

No one knew that sentence was there for a reason. It looked like an implementation detail. It was actually institutional knowledge.

Ambient AI Architecture: Designing Always-On Agents That Don't Get Disabled

· 9 min read
Tian Pan
Software Engineer

Most teams building ambient AI ship something users immediately turn off.

The pattern is consistent: the team demos the feature internally, everyone agrees it's useful in theory, and within two weeks of launch the disable rate exceeds 60%. This isn't a model quality problem. It's an architecture problem — and specifically an interrupt threshold problem. Teams design their ambient agents around what the AI can do rather than what users will tolerate when they didn't ask for help.

The gap between explicit invocation ("ask the AI") and ambient monitoring ("the AI watches and acts") is not just a UX question. It demands a fundamentally different system architecture, a different event model, and a different mental model for when an AI agent earns the right to speak.
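One way to make "earning the right to speak" concrete is an explicit interrupt gate that every ambient observation must clear before it surfaces. The signals, weights, and threshold below are assumptions for illustration, not a prescription from this post.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    confidence: float        # how sure the agent is that it found something (0-1)
    estimated_value: float   # expected benefit to the user if surfaced (0-1)
    user_is_focused: bool    # e.g. presenting, in a meeting, typing rapidly

INTERRUPT_THRESHOLD = 0.75   # assumed default; tune from dismiss/disable signals

def should_interrupt(obs: Observation, dismissals_today: int) -> bool:
    """Silence is the default; speaking has to clear the bar."""
    score = obs.confidence * obs.estimated_value
    if obs.user_is_focused:
        score *= 0.5                     # interrupting a focused user costs more
    score *= 0.8 ** dismissals_today     # each dismissal raises the bar further
    return score >= INTERRUPT_THRESHOLD

# Observations below the bar go to a digest or a log instead of interrupting.
```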