16 posts tagged with "ai-product"

Free-Tier Traffic Is Your Real Eval Set

· 10 min read
Tian Pan
Software Engineer

The team optimizing the model against paid-cohort traces is grading itself on the easy distribution. Paying users have a workflow. They self-selected into the product because something about it justified pulling out a credit card, which means by the time they're in the eval set, they've already learned which prompts work, which features deliver, and which corners not to wander into. Free-tier users do none of that. They're anonymous, exploratory, often adversarial, often non-native English speakers stress-testing a product in their second language, and they exercise the long tail of failure modes the eval set was never built to cover.

This is the asymmetry that quietly eats the conversion funnel of every freemium AI product. The team grades the model against a curated sample drawn disproportionately from paid traces. The free-tier weird traces — the ones with no template, the ones where someone is genuinely trying to figure out what the product does — never get labeled, never get a regression test, and never inform the next prompt edit. The model gets better against the paid distribution and slowly worse against the distribution that decides whether free users ever upgrade.
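One way to act on this is to fix the free-tier share of the eval set up front, rather than letting it fall out of whichever traces happened to get labeled. The sketch below is a minimal illustration, assuming each trace is a dict with a hypothetical `tier` field set to `"free"` or `"paid"`. The function name, the field names, and the 50% default share are all assumptions made for the example.

```python
import random


def stratified_eval_sample(traces, size, free_share=0.5, seed=0):
    """Draw an eval sample with a fixed share of free-tier traces.

    `traces` is a list of dicts with a hypothetical "tier" key.
    `free_share` is an assumed target, not a recommended value.
    """
    rng = random.Random(seed)  # seeded so the eval set is reproducible
    free = [t for t in traces if t["tier"] == "free"]
    paid = [t for t in traces if t["tier"] == "paid"]

    # Cap each stratum at what is actually available.
    n_free = min(len(free), round(size * free_share))
    n_paid = min(len(paid), size - n_free)
    return rng.sample(free, n_free) + rng.sample(paid, n_paid)


# Hypothetical log: paid traces outnumber free traces four to one.
traces = [{"id": i, "tier": "free"} for i in range(20)]
traces += [{"id": 100 + i, "tier": "paid"} for i in range(80)]

sample = stratified_eval_sample(traces, size=10)
free_count = sum(t["tier"] == "free" for t in sample)
```

With a uniform random draw, the same log would yield roughly two free-tier traces out of ten. Fixing the share keeps the long tail in the set even when paid traffic dominates the logs.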

User Trust Half-Life: Why One Bad Session Erases Weeks of Calibration

· 10 min read
Tian Pan
Software Engineer

A user's calibration of an AI feature is one of the most expensive things you ship. It costs them weeks of attention: learning which prompts work, where the model's reliable, when to double-check, what to ignore entirely. Then a single visible failure — a wrong number in a generated report, a hallucinated citation the user pasted into a deck, a confidently-incorrect recommendation they acted on — can vaporize all of it in one session. The recovery curve isn't symmetric. The user's prior was "this is reliable," and the update doesn't land as a data point. It lands as a betrayal.

The team measuring DAU sees nothing for weeks. The user keeps opening the app out of habit, runs a few queries, doesn't act on the output, and then quietly stops. By the time engagement metrics flinch, the trust event that caused it is two months old and nobody on the team remembers shipping it.

The Quiet Quitter Pattern: Why Your AI Engagement Metrics Are Lying to You

· 10 min read
Tian Pan
Software Engineer

There's a specific failure mode that quietly destroys AI product metrics without anyone noticing. Your dashboard shows a 34% suggestion acceptance rate, strong DAU, and growing feature engagement. What the dashboard doesn't show is that 60% of those accepted suggestions get immediately rewritten, that the users who "engage" most are the ones who click the AI output, select all, and type their own response anyway, and that the feature has zero measurable effect on downstream task completion.

This is the quiet quitter pattern: users who systematically route around an AI feature while still generating all the surface metrics of engaged users. They don't disable the feature — they just ignore its output. In your analytics, they look identical to your best AI users.
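A rough way to separate the two groups is to track how much of an accepted suggestion survives into the text the user actually keeps. The sketch below is illustrative only: the `SuggestionEvent` fields, the `acceptance_metrics` helper, and the 0.6 similarity threshold are assumptions, and `difflib.SequenceMatcher` stands in for whatever edit-distance measure fits your product.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class SuggestionEvent:
    suggested: str  # text the AI proposed
    accepted: bool  # user clicked "accept"
    final: str      # text the user actually kept or sent


def retention(event: SuggestionEvent) -> float:
    """Similarity between the suggestion and the final text (0.0 to 1.0)."""
    return SequenceMatcher(None, event.suggested, event.final).ratio()


def acceptance_metrics(events, keep_threshold=0.6):
    """Return (raw acceptance rate, retained acceptance rate).

    An acceptance counts as retained only if the final text is still
    at least `keep_threshold` similar to the suggestion (assumed cutoff).
    """
    if not events:
        return 0.0, 0.0
    accepted = [e for e in events if e.accepted]
    retained = [e for e in accepted if retention(e) >= keep_threshold]
    return len(accepted) / len(events), len(retained) / len(events)
```

The gap between the two rates is the share of sessions where the user clicked accept and then wrote their own answer anyway.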

Ship Your AI Feature Before It Feels Ready

· 9 min read
Tian Pan
Software Engineer

Most AI features that ship late don't ship late because they're broken. They ship late because the team is still optimizing for a test suite that doesn't reflect how real users behave. The benchmarks look better each week. The evals trend upward. And the gap between "lab performance" and "production value" quietly widens.

The uncomfortable truth is that the first 500 real users will surface more actionable problems in two weeks than four more weeks of prompt tuning ever could. This is not an argument for shipping garbage. It's an argument for recognizing that your current sense of "ready" is almost certainly miscalibrated, and that real usage data is the only thing that corrects it.

Why The Weekly Transcript Review Beats Your AI Dashboard

· 12 min read
Tian Pan
Software Engineer

The most underpriced asset in your AI organization is the hour every week when three people sit in a room and read what your product actually said to users. Not the aggregate scores. Not the rolling averages. Not the dashboard. The actual transcripts. The verbatim outputs. The lazy phrasing the model has quietly settled into. The intent your taxonomy doesn't have a bucket for. The user trying for the third time to express what they want, in three different ways, while your eval rubric scores all three turns "satisfactory."

Teams who institutionalize this hour develop a mental model of their AI feature that their dashboards will never surface. Teams who skip it ship for six months on metrics that look fine and learn at the next QBR that the median experience drifted somewhere unfortunate when nobody was looking.

The 70% Reliability Uncanny Valley: Where AI Features Go to Lose User Trust

· 12 min read
Tian Pan
Software Engineer

A feature that fails 70% of the time is harmless. The user learns within a week that they have to verify every output, treats the system as an unreliable assistant, and adjusts. A feature that succeeds 70% of the time is worse than that. It is right often enough that the user stops verifying, and wrong often enough that the failures are concentrated, visible, and personal. The user's mental model collapses into "I cannot tell when to trust this" — which, as a product experience, is strictly worse than "I know not to trust this."

This is the 70% uncanny valley, and it is where most AI features built in the last two years live. The team measures aggregate accuracy, watches the number cross some "good enough" threshold, and ships. The realized user experience does not improve monotonically with that number. Between roughly 60% and 85% accuracy, the product gets worse as it gets more accurate, because the cost of a wrong answer the user did not think to check exceeds the value of a right answer they no longer have to verify.

The team that ships at 70% without designing for the predictability problem is not shipping a worse version of a 95% product. They are shipping a different product entirely: one whose primary failure mode is silent.
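The non-monotonic claim can be made concrete with a toy expected-value model. Everything here is an assumption chosen for illustration: users verify every output below 60% accuracy, verify nothing above 85%, and taper linearly in between; a right answer is worth 1, verification costs 0.5, and an unchecked wrong answer costs 10. The sketch reproduces the dip in the middle of that band; it does not calibrate where the dip starts or ends for any real product.

```python
def verify_rate(p, lo=0.60, hi=0.85):
    """Assumed share of outputs the user still double-checks at accuracy p."""
    if p <= lo:
        return 1.0
    if p >= hi:
        return 0.0
    return (hi - p) / (hi - lo)


def expected_value(p, verify_cost=0.5, unchecked_wrong_cost=10.0):
    """Expected value per output under the toy assumptions above."""
    v = verify_rate(p)
    # Right answers are worth 1, minus the effort spent verifying them.
    right = p * (1.0 - verify_cost * v)
    # Wrong answers cost the verification effort if caught,
    # and the much larger unchecked cost if not.
    wrong = (1.0 - p) * (-verify_cost * v - unchecked_wrong_cost * (1.0 - v))
    return right + wrong


for p in (0.30, 0.60, 0.70, 0.85, 0.95):
    print(f"accuracy={p:.2f}  expected value={expected_value(p):+.2f}")
```

Under these assumptions the 70% feature scores below both the 60% feature and the 30% feature, because users have stopped checking before the failure rate has dropped far enough to justify it.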

The Two-PM Problem: When Prompt Ownership and Product Ownership Drift Apart

· 11 min read
Tian Pan
Software Engineer

A support ticket lands on Tuesday morning: a customer was given a confidently wrong answer about their refund window. Engineering pulls the trace and finds the model picked the wrong intent. The product PM looks at the dashboard and sees the new "express refund" affordance — shipped last sprint — surfaced an intent the prompt was never tuned to handle. The platform PM points at the eval suite, which is green. Both are technically right. The customer is still wrong.

This is the two-PM problem, and most AI teams have it without naming it. The product PM owns the user-facing surface — intents, success metrics, the support escalation path. The platform or ML PM owns the prompt, the model choice, the eval suite, and the cost ceiling. The roadmaps are coordinated at the quarterly-planning level and drift at the weekly-shipping level, because the two PMs are optimizing for different metrics on different dashboards with different change-control processes.

The interesting failure mode isn't that the two PMs disagree. It's that they ship correctly relative to their own scope and still produce a regression nobody owns.

The AI Feature You Should Not Have Shipped: A Task-Shape Checklist

· 10 min read
Tian Pan
Software Engineer

The demo always works. That is the most expensive sentence in AI product development. The product manager sees the model handle the happy path, the engineer ships the obvious version of the feature, and six weeks later the support queue is full of complaints that the metric did not predict. Nothing in the model regressed. Nothing in the prompt got worse. The feature was simply not the shape the model could do well, and the team did not have a way to say so before the work began.

A meaningful fraction of shipped AI features fail this way — not because the model is bad, but because the task is wrong. The output the product needs is deterministic and the engine is stochastic. The user's tolerance for the tail is one bad answer per thousand and the model's failure distribution is heavier than that. The latency budget the unit economics require is half of what the model can deliver at any tier you can afford. The ground truth required to evaluate quality does not exist and cannot be cheaply created. None of these are model problems. They are task-shape problems, and they should have been screened before the first prompt was written.

The AI Off-Switch That Doesn't Exist: Retiring Features After Users Co-Author the Archive

· 11 min read
Tian Pan
Software Engineer

Six months after you launched the AI writing assistant, you open the analytics dashboard and find the metric you wanted: 40% of user-generated documents on the platform now contain AI-authored prose. The board meeting calls this engagement lift. Three weeks later, the model provider raises prices, the unit economics flip, and someone asks the obvious question: can we turn it off? You go looking for the toggle and discover that it isn't a toggle. It's a migration with product, legal, and UX surfaces attached, and pulling it cleanly will take two quarters and burn political capital with three teams who didn't know they were stakeholders.

This is the part of the AI product lifecycle that nobody planned for. The launch playbook covered prompt engineering, rate limits, eval harnesses, and a kill switch for runaway costs. It did not cover what happens when users have spent half a year producing artifacts that only exist because the generator existed, and now the read path through your archive depends on a feature you want to retire. The "off switch" was conceptual: a flag in a config file. The actual decommissioning is a coordinated set of decisions about grandfathering, versioning, content provenance, and the uncomfortable conversation about whether the engagement lift was ever value or just dependency.

The Missing Arm: Your AI Experiment Has No 'AI-Off' Control

· 9 min read
Tian Pan
Software Engineer

Look at the last six experiment readouts your team shipped on an AI feature. What were the arms? Odds are good you tested "new prompt vs. old prompt," or "GPT-5 router vs. GPT-4 fallback," or "reasoning model vs. fast model," or "with retrieval vs. without retrieval." You reported lift on engagement, task completion, or session length. You called it product impact. A quarter rolled by. Inference spend climbed. Nobody paused to ask the question the CFO eventually will: what would have happened if the feature simply weren't there?

That question is the missing arm. The lift your experiments keep measuring is "better AI vs. worse AI," but the one your business runs on is "AI vs. nothing" — or more uncomfortably, "AI vs. the three-line heuristic we never wrote down." These are different experiments with different conclusions, and most AI product programs in 2026 have only ever run the first one. The second is the one that tells you whether the feature is earning its inference bill.
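The difference between the two experiments shows up as soon as a holdout arm sits in the same readout. The sketch below uses made-up numbers: the arm names, user counts, completion counts, and inference cost are all hypothetical, and the helper functions are illustrative rather than taken from any experimentation library.

```python
def completion_rate(completions, users):
    """Share of users in an arm who completed the task."""
    return completions / users


def relative_lift(treatment_rate, control_rate):
    """Relative improvement of treatment over control."""
    return (treatment_rate - control_rate) / control_rate


# Hypothetical readout: (task completions, users) per arm.
arms = {
    "ai_off": (4100, 10000),      # the missing arm: feature hidden
    "old_prompt": (4200, 10000),
    "new_prompt": (4410, 10000),
}
rates = {name: completion_rate(c, n) for name, (c, n) in arms.items()}

# What most readouts report: better AI vs. worse AI.
lift_vs_old = relative_lift(rates["new_prompt"], rates["old_prompt"])

# What the business runs on: AI vs. nothing.
lift_vs_off = relative_lift(rates["new_prompt"], rates["ai_off"])

# Assumed inference spend for the new_prompt arm over the test window.
inference_cost = 500.0
extra_completions = arms["new_prompt"][0] - arms["ai_off"][0]
cost_per_extra_completion = inference_cost / extra_completions
```

Only the comparison against `ai_off` supports a cost-per-incremental-completion figure, which is the number that says whether the feature is paying for its inference.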

AI User Research: What Users Actually Need Before You Write the First Prompt

· 10 min read
Tian Pan
Software Engineer

Most teams decide they're building an AI feature, then ask users: "Would you want this?" Users say yes. The feature ships. Three months later, weekly active usage is at 12% and plateauing. The postmortem blames implementation or adoption, but the real failure happened before a single line of code was written — in the user research phase that felt thorough but was methodologically broken.

The core problem: users cannot accurately predict their preferences for capabilities they have never experienced. This isn't a minor wrinkle. A study on AI writing assistance found that systems designed from users' stated preferences achieved only 57.7% accuracy — actually underperforming naive baselines that ignored user-stated preferences entirely. You can do a user research sprint that runs for weeks, collect extensive qualitative feedback, and end up with a product nobody uses — not despite the research, but partly because of how it was conducted.

The AI Capability Ratchet: How One Smart Feature Breaks Your Entire Product

· 10 min read
Tian Pan
Software Engineer

Your AI-powered search just shipped. It's fast, conversational, and handles nuanced queries in ways your old keyword search never could. The feature review was glowing. The launch post got shared. And then, two weeks later, the support tickets start — not about search, but about the customer support widget, the help documentation, and the notification center. Nobody changed any of those things. But users are suddenly furious.

Welcome to the AI capability ratchet. The moment you ship one demonstrably intelligent feature, you have permanently recalibrated what users consider acceptable across your entire product. The ratchet clicks up. It does not click back down.

This pattern is one of the least-discussed failure modes in AI product development. Teams celebrate individual feature launches without accounting for the expectation debt they are distributing to every team that didn't ship anything.