75 posts tagged with "ai"

The Show Your Work UX Trap: When the Reasoning Trace Is Debug Output Wearing a Product Costume

· 11 min read
Tian Pan
Software Engineer

A reasoning model emits a chain-of-thought trace because that is how it computes. A product team renders that trace in the UI because hiding it feels like throwing away tokens the user paid for. Those are two different decisions, and almost nobody on the product side notices they made the second one. The trace becomes a panel, the panel becomes a feature, the feature gets a docs page, and six months later someone in a quarterly review asks why the support queue is full of users arguing with the reasoning instead of the answer.

The trace is debug output. It exists for engineers who need to know why the model picked one tool, hedged on a date, or quietly switched personas mid-paragraph. Pushing it to the end user without a design pass is the AI-product equivalent of leaving console.log calls in production and calling them "transparency." It looks like a feature, it costs almost nothing to render, and it quietly degrades trust in ways that don't show up in any of the dashboards the team built.

AI Co-Pilot vs. AI Pilot: The Evidence-Based Product Decision Framework

· 9 min read
Tian Pan
Software Engineer

Every product team building with AI faces the same fork in the road: should the AI advise humans, or should it act on its own? The framing sounds philosophical, but the answer is actually measurable — and getting it wrong is expensive in ways that don't show up until six months after launch, when your override metrics look fine and your user trust scores are quietly collapsing.

Klarna replaced 700 customer service agents with an autonomous AI system in early 2024. By 2025, the CEO admitted they had "gone too far" and began quietly rehiring humans for complex cases. The AI handled 2.3 million conversations in a month and resolved issues in under 2 minutes instead of 11. The numbers looked great. The underlying problem — that customer service for financial products requires empathy and judgment, not just resolution speed — showed up later, in declining satisfaction on anything outside the happy path.

The AI Efficiency Paradox: When Your Best Feature Kills Your Revenue

· 9 min read
Tian Pan
Software Engineer

In early 2026, Atlassian reported something that hadn't happened in the company's history: a decline in enterprise seat counts. For a company whose entire growth model rests on expansion revenue — selling more seats as customer organizations grow — this was a structural alarm, not a blip. The proximate cause wasn't churn or product failure. It was that Atlassian's own AI features had made teams so much more productive that fewer seats were needed to do the same amount of work.

This is the AI efficiency paradox: build a feature that genuinely saves users time, and you may be training them to need less of your product. The more useful your AI, the faster your pricing model breaks.

AI Feature PMF Signals: Why Your Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships and the metrics light up — DAU spikes, NPS climbs, thumbs-up feedback floods in — you could be looking at genuine product-market fit. Or you could be watching the first act of a two-part story where the second act ends with a retention cliff nobody saw coming.

The problem is that these signals are structurally broken for probabilistic AI features. They were designed for deterministic software, where "activated" means something, where a five-star rating predicts future use, and where novelty fades in days rather than masking a six-month churn wave. AI features behave differently, and the standard PMF toolkit is calibrated for the wrong inputs.

Your System Prompts Are Still in English: The Silent Cost of Incomplete AI Localization

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI feature. You celebrate the localization work: every button label, tooltip, and error message has been translated into twelve languages. The product manager signs off. The feature goes live globally.

Then, six weeks later, a user in Germany posts a screenshot. The AI's response has the right words but wrong register — awkward formality for a casual support context. A Japanese user reports that structured outputs contain dates formatted as MM/DD/YYYY, confusing their downstream tooling. A Brazilian support engineer notices the AI occasionally slips into English mid-sentence when reasoning through complex queries. These aren't infrastructure failures. Your dashboards show green. But for non-English users, the product is quietly worse.

The root cause is almost always the same: teams translate UI strings but leave system prompts in English. It feels like localization. It isn't.

The Dev-to-Prod Cost Shock: Why Your AI Feature Costs Pennies in Staging and Dollars in Production

· 8 min read
Tian Pan
Software Engineer

A proof-of-concept costs you $200 in API tokens. You get the green light to ship. Six weeks later, the invoice is $18,000. This is not a pricing change or a billing mistake — it is a failure of cost modeling, and it is the most predictable surprise in AI engineering.

The gap between staging and production costs for AI features is not random. It follows a consistent pattern: staging is structurally designed, often by accident, to hide every single cost driver that matters in production. Understanding those drivers is how you avoid the first invoice being a crisis.

DORA in the Age of AI: When Deployment Frequency Lies

· 9 min read
Tian Pan
Software Engineer

Here is a number that should unsettle you: according to the 2025 DORA State of AI-Assisted Software Development report, developer PRs merged per person rose 98% while incidents per PR rose 242.7%. Deployment frequency looks elite. The system is breaking more often per unit of change than at any point DORA has measured.

Your dashboard is green. Your on-call engineers are exhausted. Something is wrong with the measuring tape.
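
To see why the two figures above compound rather than offset each other, here is a minimal arithmetic sketch. It assumes both percentages are measured against the same baseline and period, which the excerpt does not state, and `compound_change` is a hypothetical helper name, not something from the DORA report.

```python
def compound_change(pct_volume: float, pct_rate: float) -> float:
    """Percent change in a total when both volume and per-unit rate change.

    total = volume * rate, so the combined multiplier is
    (1 + volume change) * (1 + rate change).
    """
    multiplier = (1 + pct_volume / 100) * (1 + pct_rate / 100)
    return (multiplier - 1) * 100


# Figures quoted above: PRs merged per person +98%, incidents per PR +242.7%.
# Assumption: same baseline for both, so they can be multiplied.
incidents_per_person = compound_change(98.0, 242.7)
print(f"Incidents per person: +{incidents_per_person:.1f}%")  # +578.5%
```

Under that assumption, each developer is associated with nearly seven times as many incidents as before, which a deployment-frequency chart on its own would not reveal.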

Enterprise AI's Last Mile Problem: Why Most Pilots Never Reach Production

· 8 min read
Tian Pan
Software Engineer

A model that scores 94% on your internal benchmark, impresses stakeholders in a demo, and passes every offline evaluation can still reach production and drop to 7% effective accuracy on real customer data. This isn't a hypothetical. It's a documented outcome from multiple enterprise AI deployments, and it's one symptom of a broader pattern: the gap between "pilot success" and "production value" is where most enterprise AI quietly dies.

Across industries, roughly 85–88% of enterprise AI pilots never reach production. For every 33 PoCs an organization starts, only four ship, which is about 12%. That ratio has barely moved in three years despite massive increases in model capability. The failure mode has nothing to do with whether the model is good enough. It is almost always about what happens between the successful demo and the moment a real user relies on the system to do real work.

The Feedback Signal Timing Problem: Why Your AI Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When Klarna deployed its AI customer service chatbot in early 2024, it processed 2.3 million conversations in the first month. Satisfaction scores matched human agents. Executives declared victory. By 2025, the company was quietly hiring back the human agents it had replaced.

What went wrong? The metrics told one story while users experienced another. The chatbot aced simple, transactional queries—order status, payment questions—but fell apart on complex disputes, fraud claims, and emotionally difficult conversations. CSAT scores averaged across all interaction types couldn't detect this. The system appeared to be working even as it was slowly eroding user trust.
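
The averaging problem described above can be illustrated with a small sketch. The segment names, scores, and volumes below are invented for illustration only; they are not Klarna's data, and the helper functions are hypothetical.

```python
from collections import defaultdict


def overall_csat(interactions):
    """Mean score across all interactions, ignoring interaction type."""
    return sum(score for _, score in interactions) / len(interactions)


def csat_by_segment(interactions):
    """Mean score per interaction type."""
    totals = defaultdict(lambda: [0.0, 0])  # segment -> [score sum, count]
    for segment, score in interactions:
        totals[segment][0] += score
        totals[segment][1] += 1
    return {segment: total / count for segment, (total, count) in totals.items()}


# Hypothetical mix: many easy transactional queries, a few hard disputes.
sample = [("simple", 4.8)] * 90 + [("complex", 2.0)] * 10

print(f"overall: {overall_csat(sample):.2f}")  # overall: 4.52
for segment, score in csat_by_segment(sample).items():
    print(f"{segment}: {score:.2f}")  # simple: 4.80, complex: 2.00
```

In this made-up mix the blended score looks healthy because the high-volume segment dominates it, while the low-volume segment, where the stakes are highest, is failing. Reporting the per-segment breakdown alongside the average is one way to surface that.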

This isn't a Klarna-specific failure. It's a pattern that repeats across AI product development: teams collect satisfaction signals, optimize against them, and discover too late that the signals were measuring something other than actual value. The problem isn't the tools—it's the timing mismatch between when feedback arrives and when the consequences of a response become clear.

The Organizational Immune System: Why Companies Kill AI Features That Actually Work

· 10 min read
Tian Pan
Software Engineer

Your AI feature works. It passes every benchmark you built. It handles edge cases your team spent weeks stress-testing. Users in the pilot loved it. Your model isn't hallucinating. Latency is under 300ms. The eval suite is green.

Then six months go by and it still isn't in production. Legal wants three more reviews. A senior VP is concerned about "scope." The team that owns the adjacent workflow says they weren't consulted. Finance says the ROI model needs rework. You're told to "socialize it more broadly."

This is the organizational immune system at work — and it kills more AI projects than bad models ever will.

The Two-Speed Organization: Why AI Teams and Product Teams Run on Incompatible Clocks

· 10 min read
Tian Pan
Software Engineer

Your ML team ran a promising experiment. The model beat the baseline by 8 points on your eval set. Stakeholders were excited. Then it took four months to ship, and by the time the feature launched, the product roadmap had moved on, the team that requested it had a different priority, and half the infra work had been redone because the deployment target changed mid-flight. Sound familiar?

This is the clock-mismatch problem: AI teams and product teams run on fundamentally different time scales, and most organizations treat this as a coordination failure when it is actually an architectural one. You cannot fix a structural mismatch with a better standup cadence.

Adding AI to Trusted Features: How Variance Destroys the Trust You Spent Years Building

· 11 min read
Tian Pan
Software Engineer

Your most-trusted feature is also your most dangerous AI deployment target. That's the counterintuitive reality that product teams keep discovering the hard way: the features users rely on the most, the ones where trust is deep and automatic, are exactly the ones where AI-introduced variance causes the most catastrophic trust damage. A new feature that fails is a disappointment. An existing feature that suddenly behaves unpredictably is a betrayal.

This is the AI product retrofit trap. Not the decision to add AI — that's often right. The trap is the belief that adding AI to an established feature is safer than building a new one because you already have the users. In reality, the reverse is true. The trust you've spent months or years earning is not a foundation for AI experiments; it's a liability if the experiment fails.