44 posts tagged with "engineering"

AI Co-Pilot vs. AI Pilot: The Evidence-Based Product Decision Framework

· 9 min read
Tian Pan
Software Engineer

Every product team building with AI faces the same fork in the road: should the AI advise humans, or should it act on its own? The framing sounds philosophical, but the answer is actually measurable — and getting it wrong is expensive in ways that don't show up until six months after launch, when your override metrics look fine and your user trust scores are quietly collapsing.

Klarna replaced 700 customer service agents with an autonomous AI system in early 2024. By 2025, the CEO admitted they had "gone too far" and began quietly rehiring humans for complex cases. The AI handled 2.3 million conversations in a month and resolved issues in under 2 minutes instead of 11. The numbers looked great. The underlying problem — that customer service for financial products requires empathy and judgment, not just resolution speed — showed up later, in declining satisfaction on anything outside the happy path.

The Dev-to-Prod Cost Shock: Why Your AI Feature Costs Pennies in Staging and Dollars in Production

· 8 min read
Tian Pan
Software Engineer

A proof-of-concept costs you $200 in API tokens. You get the green light to ship. Six weeks later, the invoice is $18,000. This is not a pricing change or a billing mistake — it is a failure of cost modeling, and it is the most predictable surprise in AI engineering.

The gap between staging and production costs for AI features is not random. It follows a consistent pattern: staging is structurally designed, often by accident, to hide every single cost driver that matters in production. Understanding those drivers is how you avoid the first invoice being a crisis.

The Organizational Immune System: Why Companies Kill AI Features That Actually Work

· 10 min read
Tian Pan
Software Engineer

Your AI feature works. It passes every benchmark you built. It handles edge cases your team spent weeks stress-testing. Users in the pilot loved it. Your model isn't hallucinating. Latency is under 300ms. The eval suite is green.

Then six months go by and it still isn't in production. Legal wants three more reviews. A senior VP is concerned about "scope." The team that owns the adjacent workflow says they weren't consulted. Finance says the ROI model needs rework. You're told to "socialize it more broadly."

This is the organizational immune system at work — and it kills more AI projects than bad models ever will.

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

· 8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, shaved 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency and it barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.
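One way to make the untraced stages visible is to wrap each one in a timer and compare their shares of total wall-clock time. The sketch below is a minimal illustration, not the post's own tooling: `StageTimer`, `report`, and the stage names (`retrieval`, `rerank`, `inference`) are hypothetical, and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested deterministically.
        self._clock = clock
        self.totals = {}

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raises.
            elapsed = self._clock() - start
            self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


def report(timer):
    """Format stages from slowest to fastest with milliseconds and share."""
    rows = sorted(timer.shares().items(), key=lambda kv: -kv[1])
    return "\n".join(
        f"{name:<12} {timer.totals[name] * 1000:8.1f} ms  {share:6.1%}"
        for name, share in rows
    )


if __name__ == "__main__":
    timer = StageTimer()
    # Assumed stage names; the sleeps are placeholders for real pipeline calls.
    with timer.stage("retrieval"):
        time.sleep(0.002)
    with timer.stage("rerank"):
        time.sleep(0.001)
    with timer.stage("inference"):
        time.sleep(0.005)
    print(report(timer))
```

In a real service the same role is usually played by a tracing library, but even a timer this small shows which non-inference stages deserve attention before another round of model tuning.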

Ship Your AI Feature Before It Feels Ready

· 9 min read
Tian Pan
Software Engineer

Most AI features that ship late don't ship late because they're broken. They ship late because the team is still optimizing for a test suite that doesn't reflect how real users behave. The benchmarks look better each week. The evals trend upward. And the gap between "lab performance" and "production value" quietly widens.

The uncomfortable truth is that the first 500 real users will surface more actionable problems in two weeks than four more weeks of prompt tuning ever could. This is not an argument for shipping garbage. It's an argument for recognizing that your current definition of "ready" is almost certainly miscalibrated, and that real usage data is the only thing that corrects it.

The Two-Speed Organization: Why AI Teams and Product Teams Run on Incompatible Clocks

· 10 min read
Tian Pan
Software Engineer

Your ML team ran a promising experiment. The model beat the baseline by 8 points on your eval set. Stakeholders are excited. Then it took four months to ship — and by the time the feature launched, the product roadmap had moved on, the team that requested it had a different priority, and half the infra work got redone because the deployment target changed mid-flight. Sound familiar?

This is the clock-mismatch problem: AI teams and product teams run on fundamentally different time scales, and most organizations treat this as a coordination failure when it is actually an architectural one. You cannot fix a structural mismatch with a better standup cadence.

When to Reach for an LLM vs. a Simple Heuristic: A Four-Factor Framework

· 10 min read
Tian Pan
Software Engineer

A logistics company spent $800K and twelve months trying to use AI for route optimization. At the end of the engagement, their routes were marginally better than the heuristics they already had. Leadership rejected the next three AI proposals. A food delivery company faced the same route problem and solved it in a single night with a set of explicit business rules.

The expensive lesson both teams discovered: route optimization with real-time constraints, driver preferences, and time windows is not an AI problem — it's a combinatorial scheduling problem. The patterns you need to learn aren't hidden in data; they're explicit domain logic that someone in operations already knows.

This plays out across every industry. A 2025 MIT study found 95% of enterprise AI pilots delivered zero measurable impact despite $30–40 billion in combined investment. The dominant failure mode wasn't bad models or insufficient data. It was teams building AI solutions for problems where AI was the wrong tool.

Adding AI to Trusted Features: How Variance Destroys the Trust You Spent Years Building

· 11 min read
Tian Pan
Software Engineer

Your most-trusted feature is also your most dangerous AI deployment target. That's the counterintuitive reality that product teams keep discovering the hard way: the features users rely on the most, the ones where trust is deep and automatic, are exactly the ones where AI-introduced variance causes the most catastrophic trust damage. A new feature that fails is a disappointment. An existing feature that suddenly behaves unpredictably is a betrayal.

This is the AI product retrofit trap. Not the decision to add AI — that's often right. The trap is the belief that adding AI to an established feature is safer than building a new one because you already have the users. In reality, the reverse is true. The trust you've spent months or years earning is not a foundation for AI experiments; it's a liability if the experiment fails.

The Dual Newspaper Test for AI Features: Catching the Failure Modes Your Post-Mortems Miss

· 9 min read
Tian Pan
Software Engineer

Your AI feature passed load testing. It hit the latency SLA. The rollback procedure works. Cost estimates came in under budget. Your post-mortem template has a green checkmark next to every line.

Two months after launch, the product appears in an investigative piece about discriminatory outcomes. You spend six weeks in legal review.

This is the gap the dual newspaper test is designed to close. Most engineering teams build thorough pre-ship processes for technical failures — reliability regressions, API instability, infrastructure cost blowouts. They read post-mortems about outages and optimize accordingly. But a second class of AI failures gets shipped right through those processes because it doesn't look like a bug: the feature works exactly as designed, and the harm happens anyway.

AI Feature Payback: The ROI Model Your Finance Team Won't Fight You On

· 10 min read
Tian Pan
Software Engineer

Every engineering team shipping AI features eventually hits the same wall: finance wants a spreadsheet that justifies the spend, and the spreadsheet you built doesn't actually work.

The problem isn't that AI features lack ROI. The problem is that AI economics break every assumption the standard ROI model was built on: fixed capital, linear cost curves, predictable timelines. Teams that treat AI spending like SaaS licensing get numbers that either look deceptively good before launch or collapse six months into production. The nearly ten-fold gap between measured AI initiatives (55% ROI) and ad-hoc deployments (5.9% ROI) comes almost entirely from whether teams got the measurement model right before they shipped.

Building Trust Recovery Flows: What Happens After Your AI Makes a Visible Mistake

· 9 min read
Tian Pan
Software Engineer

When Google's AI Overview told users to add glue to pizza sauce and eat rocks for digestive health, it didn't just embarrass a product team — it exposed a systemic gap in how we think about AI reliability. The failure wasn't just that the model was wrong. The failure was that the model was confidently wrong, in a high-visibility context, with no recovery path for the users it misled.

Trust in AI systems doesn't erode gradually. Research shows it follows a cliff-like collapse pattern: a single noticeable error can produce a disproportionate trust decline with measurable effect sizes. Only 29% of developers say they trust AI tools — an 11-point drop from the previous year, even as adoption climbs to 84%. We're building systems that people use but don't trust. That gap matters when your product ships agentic features that act on behalf of users.

This post is about what engineers and product builders should do after the mistake happens — not just how to prevent it.

The Inherited AI System Audit: How to Take Ownership of an LLM Feature You Didn't Build

· 10 min read
Tian Pan
Software Engineer

Someone left. The onboarding doc says "ask Sarah" but Sarah is at a different company now. You're staring at a 900-line system prompt with sections titled things like ## DO NOT REMOVE THIS SECTION, and you have no idea what happens if you do.

This is the inherited AI system problem, and it's different from inheriting regular code. With legacy code, a determined engineer can trace execution paths, read tests, and reconstruct intent from behavior. With an inherited LLM feature, the prompt is the logic — but it's written in natural language, its failure modes are probabilistic, and the author's intent is trapped inside their head. There are no stack traces that tell you which guardrail fired and why.