22 posts tagged with "ai-product"

The Customer Who Cancelled Because Your Agent Was Too Confident

· 9 min read
Tian Pan
Software Engineer

The user asked the agent a routine question. The agent answered with the assured cadence of someone who knew. The user trusted the answer, took the action, and spent the afternoon walking back a customer email that was sent on bad information. Six weeks later the renewal call came and went. The line item in the churn deck read "low engagement." The actual reason — "I can't trust it anymore" — never made it onto any dashboard, because the user never opened the CSAT survey that would have asked.

This is the failure mode that most teams shipping AI products are systematically blind to. Not hallucinations — those are the visible tip. The submerged mass is confidence miscalibration: the gap between what the model actually knows and how certain it sounds when it says it. And the cost of that gap is not paid in a survey response. It is paid at the renewal table.
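
One way to put a number on that gap is an expected calibration error: bucket answers by stated confidence and compare each bucket's average confidence with how often it was actually right. The sketch below assumes you can attach a numeric confidence score to each answer and a correctness label from review. Both are assumptions for illustration, and the function name is hypothetical.

```python
from typing import List, Tuple


def expected_calibration_error(
    samples: List[Tuple[float, bool]], n_bins: int = 5
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    samples: (confidence in [0, 1], was_correct) pairs. Where the confidence
    score comes from is an assumption left to the caller.
    """
    if not samples:
        return 0.0

    # Group samples into equal-width confidence bins.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in samples:
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, correct))

    total = len(samples)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Each bin contributes its gap, weighted by its share of all samples.
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece
```

A score near zero means the agent sounds about as sure as it should. A large score in the high-confidence bins is the pattern described above: answers delivered with certainty that the record does not support.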

The Prompt Log Is the Product Roadmap You Threw Away

· 9 min read
Tian Pan
Software Engineer

Somewhere in your observability stack is a table that holds every prompt a user typed into your AI feature last quarter. If your team is like most, that table is used for three things: cost attribution, abuse detection, and the occasional debugging session when a customer reports a bad answer. Nobody on the product team has ever opened it. Nobody on the research team has clustered it. The PM running the AI roadmap has never read a single row.

This is the most expensive oversight in your product organization. The prompts your users typed — especially the ones your feature handled badly — are the highest-resolution form of "what users wish this product did" you will ever collect. You are paying inference costs to generate this signal in real time, and you are throwing it away because nobody decided whose job it was to read it.

The Token Budget That Ran Out Mid-Conversation: Why Free-Tier Users Think Your Model Got Dumber

· 12 min read
Tian Pan
Software Engineer

A product manager I know spent two weeks triaging a churn spike on her company's AI writing assistant. Free-tier session length had collapsed by 30%, the support inbox filled up with variations of "your model used to be smart, now it's lazy," and the team's first instinct was to blame a model upgrade that had shipped the same week. The model had not changed. What had changed was that finance had quietly tightened the per-user token budget mid-quarter, and the app had been silently truncating system prompts, dropping tool calls, and shortening responses for any user who crossed the new threshold. From the user's seat, the AI had degraded. From the dashboard, nothing was wrong. Both were true, and that is the failure mode.

This pattern is everywhere now. ChatGPT's free tier drops to a smaller model when the limit is hit, with no in-product label other than "responses may be shorter for a while." Anthropic's free tier behaves similarly. Build a feature on top of either, layer on your own per-user budget for cost control, and you have stacked two invisible cliffs in series — the platform's and yours — and the user, who only sees one chat box, has no way to tell which one they just walked off.
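
A minimal sketch of the alternative, assuming a per-user token counter already exists: make the budget decision an explicit value that carries a user-facing notice, so the degradation cannot happen silently. The names, token limits, and notice text here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BudgetDecision:
    max_output_tokens: int
    degraded: bool
    user_notice: Optional[str]  # shown in the UI whenever degraded is True


def plan_response(
    tokens_used: int,
    budget: int,
    full_output: int = 1024,     # illustrative limit, not a recommendation
    reduced_output: int = 256,   # illustrative limit, not a recommendation
) -> BudgetDecision:
    """Decide the response size and say so when the user is over budget."""
    if tokens_used < budget:
        return BudgetDecision(full_output, False, None)
    return BudgetDecision(
        reduced_output,
        True,
        "You've reached today's usage limit, so responses are shorter for now.",
    )
```

Logging the `degraded` flag alongside session metrics also gives the dashboard a way to separate budget effects from model changes.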

The Power User Who Learned Your Prompt By Trial

· 10 min read
Tian Pan
Software Engineer

There is a user in your product right now who is having a much better experience than the median. Not because they pay more, not because they have a different tier, not because they were rolled into a different cohort. They have figured out, through patient probing, that the AI feature responds beautifully if you ask in a certain way. They know which verbs trigger the structured output. They know that a one-word follow-up gives them the terse version and a complete sentence gives them the expansive one. They know that the assistant gets defensive about certain topics unless you frame the question as a hypothetical. None of this is written down anywhere on your site. They reverse-engineered it.

The interesting thing is not that this user exists. It is that this user is now your documentation. Your AI feature has a contract with its users — an undocumented one, encoded entirely in the system prompt — and the only way anyone learns the contract is by trial. A small fraction of users have the patience to run those trials. Everyone else gets a worse product.

When Two AI Features Compete for the Same Click

· 9 min read
Tian Pan
Software Engineer

A user lands on a search results page. Team A's smart summary fires in the top banner: "Here's the gist — skip the list." Team B's inline assistant pulses on the side: "Stay here, I'll keep reading with you." Both prompts compete for the same 800ms of attention, and the user — annoyed — closes the tab. The next morning, Team A reports a 6% lift in summary clicks; Team B reports a 4% lift in assistant opens; nobody in the room is wrong, and the product is worse than it was a quarter ago.

This is the failure mode that the standard playbook of independent feature teams and per-feature A/B tests cannot see. Each team locally optimized against its own metric. The user — who only has one attention budget, one mental model, and one click to give — paid the bill for the integration both teams declined to do.

The First-Time User Cliff Your Aggregate Metrics Are Hiding

· 10 min read
Tian Pan
Software Engineer

Your AI feature looks healthy. Weekly active is flat-to-up, satisfaction scores are positive, the dashboard says ship more of this. The PM cites the metric in the next planning round. The engineering lead nods. The roadmap gets another adjacent feature.

Then someone segments the chart by user tenure and the picture inverts. Long-time users — the ones who were already there when the feature shipped — go deep on it daily. First-time users bounce within two interactions. The "flat" line is two cohorts cancelling each other out: a power curve sloping up, and a churn curve sloping down, summed into a lie.
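
The segmentation itself is a few lines. This sketch assumes each user record has a tenure in days and a retained flag, and it uses a 30-day cutoff to split the cohorts. The field names and the cutoff are assumptions, not a standard.

```python
from collections import defaultdict
from typing import Dict, Iterable


def retention_by_cohort(
    users: Iterable[dict], new_user_days: int = 30
) -> Dict[str, float]:
    """Retention rate per tenure cohort, plus the blended aggregate.

    Each user dict needs hypothetical keys 'tenure_days' (int) and
    'retained' (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # cohort -> [retained, total]
    for user in users:
        cohort = "new" if user["tenure_days"] < new_user_days else "established"
        for key in (cohort, "aggregate"):
            counts[key][1] += 1
            if user["retained"]:
                counts[key][0] += 1
    return {key: retained / total for key, (retained, total) in counts.items()}
```

With ten established users of whom nine are retained and ten new users of whom one is retained, the aggregate comes out at 0.5 while the cohorts sit at 0.9 and 0.1.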

Free-Tier Traffic Is Your Real Eval Set

· 10 min read
Tian Pan
Software Engineer

The team optimizing the model against paid-cohort traces is grading itself on the easy distribution. Paying users have a workflow. They self-selected into the product because something about it justified pulling out a credit card, which means by the time they're in the eval set, they've already learned which prompts work, which features deliver, and which corners not to wander into. Free-tier users do none of that. They're anonymous, exploratory, often adversarial, often non-native English speakers stress-testing a product in their second language, and they exercise the long tail of failure modes the eval set was never built to cover.

This is the asymmetry that quietly eats the conversion funnel of every freemium AI product. The team grades the model against a curated sample drawn disproportionately from paid traces. The weird free-tier traces (the ones with no template, the ones where someone is genuinely trying to figure out what the product does) never get labeled, never get a regression test, and never inform the next prompt edit. The model gets better against the paid distribution and slowly worse against the distribution that decides whether free users ever upgrade.
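
One hedged way to counter the skew is to sample the eval set per tier rather than from the pooled traces, so that free-tier traffic gets a fixed quota regardless of its share of the volume. The sketch assumes each trace carries a `tier` field, which is a hypothetical name.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_eval_sample(
    traces: List[dict], per_tier: int, seed: int = 0
) -> List[dict]:
    """Draw up to per_tier traces from each tier instead of from the pool."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_tier: Dict[str, List[dict]] = defaultdict(list)
    for trace in traces:
        by_tier[trace["tier"]].append(trace)

    sample: List[dict] = []
    for tier in sorted(by_tier):
        items = by_tier[tier]
        sample.extend(rng.sample(items, min(per_tier, len(items))))
    return sample
```

Sampling is the easy half. The sampled free-tier traces still need labels and regression tests before they change what the team optimizes against.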

User Trust Half-Life: Why One Bad Session Erases Weeks of Calibration

· 10 min read
Tian Pan
Software Engineer

A user's calibration of an AI feature is one of the most expensive things you ship. It costs them weeks of attention: learning which prompts work, where the model's reliable, when to double-check, what to ignore entirely. Then a single visible failure — a wrong number in a generated report, a hallucinated citation the user pasted into a deck, a confidently-incorrect recommendation they acted on — can vaporize all of it in one session. The recovery curve isn't symmetric. The user's prior was "this is reliable," and the update doesn't land as a data point. It lands as a betrayal.

The team measuring DAU sees nothing for weeks. The user keeps opening the app out of habit, runs a few queries, doesn't act on the output, and then quietly stops. By the time engagement metrics flinch, the trust event that caused it is two months old and nobody on the team remembers shipping it.

The Quiet Quitter Pattern: Why Your AI Engagement Metrics Are Lying to You

· 10 min read
Tian Pan
Software Engineer

There's a specific failure mode that quietly destroys AI product metrics without anyone noticing. Your dashboard shows a 34% suggestion acceptance rate, strong DAU, and growing feature engagement. What the dashboard doesn't show is that 60% of those accepted suggestions get immediately rewritten, that the users who "engage" most are the ones who click the AI output, select all, and type their own response anyway, and that the feature has zero measurable effect on downstream task completion.

This is the quiet quitter pattern: users who systematically route around an AI feature while still generating all the surface metrics of engaged users. They don't disable the feature — they just ignore its output. In your analytics, they look identical to your best AI users.
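
A rough way to separate the two groups is to track how much of each accepted suggestion survives the user's edits. This sketch assumes every suggestion event records whether it was accepted and what fraction of the text was kept. The field names and the 0.5 threshold are hypothetical.

```python
from typing import Dict, List


def acceptance_metrics(
    events: List[dict], keep_threshold: float = 0.5
) -> Dict[str, float]:
    """Compare the raw acceptance rate with the rate of suggestions kept.

    Each event is a dict with hypothetical keys 'accepted' (bool) and
    'kept_ratio' (share of the suggestion still present after editing).
    """
    shown = len(events)
    if shown == 0:
        return {"acceptance_rate": 0.0, "effective_rate": 0.0}

    accepted = [e for e in events if e.get("accepted")]
    # An accepted suggestion only counts if most of it survived the edit.
    kept = [e for e in accepted if e.get("kept_ratio", 0.0) >= keep_threshold]
    return {
        "acceptance_rate": len(accepted) / shown,
        "effective_rate": len(kept) / shown,
    }
```

A wide gap between the two rates is the signal worth alerting on: the clicks are there and the output is not being used.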

Ship Your AI Feature Before It Feels Ready

· 9 min read
Tian Pan
Software Engineer

Most AI features that ship late don't ship late because they're broken. They ship late because the team is still optimizing for a test suite that doesn't reflect how real users behave. The benchmarks look better each week. The evals trend upward. And the gap between "lab performance" and "production value" quietly widens.

The uncomfortable truth is that the first 500 real users will surface more actionable problems in two weeks than four more weeks of prompt tuning ever could. This is not an argument for shipping garbage. It's an argument for recognizing that your current calibration of "ready" is almost certainly miscalibrated — and that real usage data is the only thing that corrects it.

Why The Weekly Transcript Review Beats Your AI Dashboard

· 12 min read
Tian Pan
Software Engineer

The most underpriced asset in your AI organization is the hour every week when three people sit in a room and read what your product actually said to users. Not the aggregate scores. Not the rolling averages. Not the dashboard. The actual transcripts. The verbatim outputs. The lazy phrasing the model has quietly settled into. The intent your taxonomy doesn't have a bucket for. The user trying for the third time to express what they want, in three different ways, while your eval rubric scores all three turns "satisfactory."

Teams that institutionalize this hour develop a mental model of their AI feature that their dashboards will never surface. Teams that skip it ship for six months on metrics that look fine and learn at the next QBR that the median experience drifted somewhere unfortunate when nobody was looking.

The 70% Reliability Uncanny Valley: Where AI Features Go to Lose User Trust

· 12 min read
Tian Pan
Software Engineer

A feature that fails 70% of the time is harmless. The user learns within a week that they have to verify every output, treats the system as an unreliable assistant, and adjusts. A feature that succeeds 70% of the time is worse than that. It is right often enough that the user stops verifying, and wrong often enough that the failures are concentrated, visible, and personal. The user's mental model collapses into "I cannot tell when to trust this" — which, as a product experience, is strictly worse than "I know not to trust this."

This is the 70% uncanny valley, and it is where most AI features built in the last two years live. The team measures aggregate accuracy, watches the number cross some "good enough" threshold, and ships. The realized user experience does not improve monotonically with that number. Between roughly 60% and 85% accuracy, the product gets worse as it gets more accurate, because the cost of a wrong answer the user did not think to check exceeds the value of a right answer they no longer have to verify.
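
A toy model makes the non-monotonic shape concrete. Every number below is an illustrative assumption, not a measurement: users verify everything up to 60% accuracy, verify nothing from 85% up, and taper linearly in between. A right answer is worth 1, a check costs 0.3, a checked wrong answer is caught at no further cost, and an unchecked wrong answer costs 5.

```python
def verify_rate(accuracy: float, lo: float = 0.60, hi: float = 0.85) -> float:
    """Assumed share of outputs the user still double-checks."""
    if accuracy <= lo:
        return 1.0
    if accuracy >= hi:
        return 0.0
    return (hi - accuracy) / (hi - lo)


def value_per_task(
    accuracy: float,
    gain: float = 1.0,         # value of a right answer (assumed)
    check_cost: float = 0.3,   # cost of verifying one output (assumed)
    harm: float = 5.0,         # cost of acting on an unchecked wrong answer (assumed)
) -> float:
    """Expected value of one output under the toy assumptions above."""
    v = verify_rate(accuracy)
    unchecked_wrong = (1.0 - accuracy) * (1.0 - v)
    return accuracy * gain - v * check_cost - unchecked_wrong * harm


if __name__ == "__main__":
    for acc in (0.50, 0.60, 0.70, 0.80, 0.95):
        print(f"accuracy={acc:.2f}  value={value_per_task(acc):+.2f}")
```

Under these assumptions the value per task is lower at 70% accuracy than at 60%, and it recovers only once accuracy is high enough that unchecked errors become rare. Different costs move the valley around, and that is the point of running it against your own numbers.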

The team that ships at 70% without designing for the predictability problem is not shipping a worse version of a 95% product. They are shipping a different product entirely: one whose primary failure mode is silent.