62 posts tagged with "product"

The Heavy Tail Your Token Forecast Never Priced

· 9 min read
Tian Pan
Software Engineer

The cost forecast for your AI feature was modeled on a 50-user pilot. Those users typed three-sentence prompts because that is what people type into a beta they were asked to evaluate. Production launched, you crossed ten thousand users, and the finance team flagged that your model bill is running at three times the per-user number from the deck. You went looking for the bug. There is no bug. Your pilot was sampling from one distribution and production is sampling from another, and the difference between them is a long tail of users who learned about your product on Twitter and are pasting thirty kilobytes of unstructured context they screenshotted from a thread.

This is the same financial mistake every consumer internet company learned in the 2010s, transplanted onto LLM economics. The pilot's median user is not the production p99.5, and a token cost model that uses the mean as its forecasting input has already lost the argument with the bill.
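A rough sketch of that arithmetic, with every number invented for illustration: assume the pilot saw about 1,000 tokens per user, and that in production 2% of users are context-pasters who burn roughly a hundred times that. The function name and constants below are hypothetical, not taken from any real forecast.

```python
def blended_tokens_per_user(typical_tokens: int, heavy_tokens: int, heavy_per_thousand: int) -> float:
    """Mean tokens per user when `heavy_per_thousand` of every 1,000 users are heavy."""
    typical_users = 1000 - heavy_per_thousand
    total = typical_users * typical_tokens + heavy_per_thousand * heavy_tokens
    return total / 1000


# Hypothetical inputs, chosen only to show the shape of the problem.
PILOT_TOKENS = 1_000        # per-user tokens observed in the pilot
HEAVY_TOKENS = 101_000      # per-user tokens for a context-pasting user
HEAVY_PER_THOUSAND = 20     # 2% of production users

production_tokens = blended_tokens_per_user(PILOT_TOKENS, HEAVY_TOKENS, HEAVY_PER_THOUSAND)
bill_multiple = production_tokens / PILOT_TOKENS
heavy_share_of_bill = (HEAVY_PER_THOUSAND * HEAVY_TOKENS) / (production_tokens * 1000)

print(f"production mean: {production_tokens:.0f} tokens/user")
print(f"bill vs. deck:   {bill_multiple:.1f}x")
print(f"heavy users account for {heavy_share_of_bill:.0%} of the bill")
```

Under these made-up inputs, 98% of users behave exactly like the pilot and the bill still triples, because the 2% tail carries about two thirds of the tokens. The pilot forecast was not wrong about the typical user. It simply never sampled the users who set the mean.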

The Token Budget Is a Product Decision, Not a Config Value

· 10 min read
Tian Pan
Software Engineer

Somewhere in your codebase there is a line that looks like retriever.search(query, top_k=8). An engineer wrote that 8 in an afternoon. It was never reviewed by anyone outside the team, never appeared in a spec, and has never been revisited. That single integer decides how much of your context window goes to retrieved documents instead of conversation history, how much each request costs, how slow the response feels, and — because of how language models actually behave at length — how accurate the answer is.

That is a product decision. It is sitting in a keyword argument.
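To make the trade-off concrete, here is a minimal sketch of how that one integer splits a context window. The `ContextBudget` class and all of its numbers are assumptions for illustration; real window sizes, chunk lengths, and reserved output will differ per model and per retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Hypothetical token budget for one request (all fields are assumed values)."""
    context_window: int     # total tokens the model accepts
    system_prompt: int      # tokens spent on fixed instructions
    reserved_output: int    # tokens held back for the response
    avg_chunk_tokens: int   # average size of one retrieved document chunk

    def retrieval_tokens(self, top_k: int) -> int:
        """Tokens consumed by retrieved documents at a given top_k."""
        return top_k * self.avg_chunk_tokens

    def history_tokens(self, top_k: int) -> int:
        """Tokens left over for conversation history once everything else is paid for."""
        used = self.system_prompt + self.reserved_output + self.retrieval_tokens(top_k)
        return max(0, self.context_window - used)


budget = ContextBudget(context_window=8_000, system_prompt=500,
                       reserved_output=1_000, avg_chunk_tokens=600)

for top_k in (4, 8):
    print(f"top_k={top_k}: retrieval={budget.retrieval_tokens(top_k)} "
          f"history={budget.history_tokens(top_k)}")
```

With these assumed sizes, moving from `top_k=4` to `top_k=8` doubles the retrieval spend and cuts the room for conversation history by more than half. Writing the budget down this way turns the 8 into a number someone can argue about in a spec.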

The Demo That Set a Baseline You Cannot Afford to Run

· 9 min read
Tian Pan
Software Engineer

The demo went well. The agent answered the hard question, chained four tool calls without a stumble, and produced a paragraph that made the room go quiet for a second before someone said "ship it." Nobody asked what it cost. Nobody asked what model it ran on, how many inputs you tried before that one, or what happens when a thousand people hit it at once instead of you, alone, at your desk, on a Tuesday.

That demo just became a contract. Not a written one — worse. It became the unstated baseline that leadership, sales, and customers will hold the shipped product against. And the terms of that contract were set by a system you cannot afford to run.

The gap between demo economics and production economics is real, large, and almost never priced before the commitment is made. Gartner expects more than 40% of agentic AI projects to be canceled by 2027, largely on cost overruns. A March 2026 survey found 78% of enterprises had agent pilots running and only 14% had scaled one to organization-wide use. The pilots are not failing because the technology does not work. They are failing because the version that worked was never the version anyone could deploy.

Background Agents and the Notification Budget: Why Proactive AI Hits a Hard Ceiling at User Attention

· 10 min read
Tian Pan
Software Engineer

The first generation of AI assistants waited politely. You typed, they answered. The second generation does not wait. It watches your calendar, scans your inbox, reads your repo activity, and surfaces "you should know about this" interruptions before you have asked for anything. The pitch is compelling and the demos are mesmerizing. The retention curves, once these features ship, are not.

There is a number nobody puts on the launch slide: the user has a daily ceiling on unsolicited AI updates, and it is roughly three to five across all sources combined. The proactive agent that ships its tenth notification of the week is the same agent the user mutes by Friday and uninstalls the following month. This is not a UX polish problem. It is the architectural blind spot of the entire proactive-AI category, and it deserves a name: the notification budget.
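One way to treat the budget as an architectural constraint rather than a UX afterthought is to make every proactive source draw from a single shared allowance per user per day. The sketch below is a hypothetical in-memory version; the class name, the default limit of three, and the storage are all assumptions, and a real system would persist the counts.

```python
from collections import defaultdict
from datetime import date


class NotificationBudget:
    """Shared daily allowance of unsolicited notifications per user (illustrative only)."""

    def __init__(self, daily_limit: int = 3) -> None:
        self.daily_limit = daily_limit
        self._sent = defaultdict(int)  # (user_id, day) -> notifications already sent

    def try_send(self, user_id: str, day: date) -> bool:
        """Return True and spend one unit if budget remains, otherwise return False."""
        key = (user_id, day)
        if self._sent[key] >= self.daily_limit:
            return False
        self._sent[key] += 1
        return True


budget = NotificationBudget(daily_limit=3)
monday = date(2025, 1, 6)
tuesday = date(2025, 1, 7)

# Calendar, inbox, and repo agents all draw from the same allowance.
monday_results = [budget.try_send("user-1", monday) for _ in range(4)]
tuesday_result = budget.try_send("user-1", tuesday)
print(monday_results, tuesday_result)
```

The design choice that matters is that the counter is keyed by user, not by feature. If each agent keeps its own limit, the user still receives the sum of all of them.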

Conversational REST: When Your Chat UI Needs Pagination, Filters, and Sort

· 11 min read
Tian Pan
Software Engineer

A user asks your shopping agent for "running shoes under $150 with good arch support." The model dutifully returns twelve options as a wall of bulleted text inside a single chat bubble that overflows the viewport. The user scrolls, loses their place, and types "show me only Asics" — at which point your agent re-runs the entire search instead of filtering the result set it already has. Three turns later, the user is inventing a query language one prompt at a time, and your product feels like a command line wearing a chat-bubble costume.

This is the failure mode I keep watching teams ship. They built a chat product on top of what users actually wanted to be a faceted-search product. The model is fine. The retrieval is fine. The UI is the problem, and it's the wrong shape for the task.

The shortest way I can put it: chat is an input modality, not an output one. The agent's job is to translate user intent into a structured query. The moment the result set is more than three items, the right answer is to render UI, not to keep talking.
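A minimal sketch of that shape, assuming the model's only job is to emit a structured `Query`: follow-up turns refine the query and filter the result set already in hand, and a small rule decides when to stop talking and render UI. The types, the catalog, and the threshold of three are illustrative assumptions, not a reference design.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Product:
    name: str
    brand: str
    price: float


@dataclass(frozen=True)
class Query:
    """Structured form of the user's intent (hypothetical schema)."""
    max_price: Optional[float] = None
    brand: Optional[str] = None


def apply_query(query: Query, items: List[Product]) -> List[Product]:
    """Filter an existing list of products; no new search is issued."""
    results = items
    if query.max_price is not None:
        results = [p for p in results if p.price < query.max_price]
    if query.brand is not None:
        results = [p for p in results if p.brand.lower() == query.brand.lower()]
    return results


def render_mode(results: List[Product]) -> str:
    """More than three results: render a list UI instead of a chat bubble."""
    return "ui" if len(results) > 3 else "chat"


# Made-up catalog for illustration.
CATALOG = [
    Product("Model A", "Asics", 120.0), Product("Model B", "Asics", 145.0),
    Product("Model C", "Asics", 180.0), Product("Model D", "Brooks", 130.0),
    Product("Model E", "Nike", 110.0), Product("Model F", "Brooks", 160.0),
]

first_query = Query(max_price=150)
first_results = apply_query(first_query, CATALOG)

# "show me only Asics" refines the query and filters the cached results.
refined_query = replace(first_query, brand="Asics")
refined_results = apply_query(refined_query, first_results)

print(render_mode(first_results), [p.name for p in refined_results])
```

Because the query is data rather than prose, the same object can drive filter chips, sort controls, and pagination in the UI, and the follow-up turn never has to re-run retrieval.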

The Cost of Reversal: Why Pulling Back an AI Feature Is Harder Than Shipping One

· 10 min read
Tian Pan
Software Engineer

The release process you have was designed for a world where shipping is the risky step and rollback is free. AI flips that. Once a feature has been live for a quarter, the disruption cost of pulling it back exceeds the disruption cost of launching it, and the loudest customer feedback you will ever get on that feature comes on the day you take it away, not the day it shipped.

The team builds a kill switch for every AI launch. Nobody ever pulls it. Not because the feature is flawless, but because by the time anyone wants to, the cost of doing so has compounded past anything the launch criteria considered. Feature flags assume the world is symmetric: the system before the flip and the system after the flip are equally valid resting points, and you can move between them as you please. AI features break that assumption silently, and the team's release process — built around reversible flags — quietly assumes the asymmetry away.

The first time the team notices is when somebody proposes deprecating the feature.

Cost-Per-Conversation as a Product Contract: When Pricing Drives Architecture

· 10 min read
Tian Pan
Software Engineer

The cleanest way to find out your AI feature's pricing model is wrong is to look at which engineer is currently rewriting the truncation logic at midnight. They aren't shipping a capability — they're patching a unit-economics leak that the PRD never named, and the patch is necessarily user-hostile because the product spec told them the budget was infinite. On a flat-fee SaaS plan, every conversation that runs longer than the median pulls margin out of the company in real time. The only real question is whether the product team admits it before finance does.

Traditional SaaS economics rest on near-zero marginal cost per user: once the software is built, serving the next customer barely moves the infrastructure line. AI features break that assumption. Every turn in a conversation consumes inference compute that scales with prompt size, output length, tool-call fan-out, and retrieval volume — and conversations don't have a natural stopping point. A heavy user can consume 50× the median in a billing period without leaving the happy path of the product. Under flat pricing, that user is funded by the rest of the user base, and the company finds out only when COGS reporting catches up a quarter later.

This is why pricing on AI features is not a finance problem to be handled after launch. It is an architecture input that decides what the product is allowed to do, and refusing to make it visible in the spec just means it gets resolved later, in worse ways, by people without product authority.
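The flat-fee leak is easy to see with a few lines of arithmetic. The sketch below assumes a hypothetical $20 plan, an inference cost of five cents per conversation, and a median user who runs 40 conversations a month; it also treats "50× the median" as 50 times the conversation count, which is a simplification. Amounts are in cents to keep the math exact.

```python
import math


def monthly_margin_cents(flat_fee_cents: int, cost_per_conversation_cents: int,
                         conversations: int) -> int:
    """Revenue minus inference cost for one user in one billing period."""
    return flat_fee_cents - cost_per_conversation_cents * conversations


# Hypothetical plan and usage figures.
FLAT_FEE_CENTS = 2_000               # $20 flat monthly plan
COST_PER_CONVERSATION_CENTS = 5      # assumed average inference cost
MEDIAN_CONVERSATIONS = 40
HEAVY_MULTIPLIER = 50                # the heavy user from the paragraph above

median_margin = monthly_margin_cents(
    FLAT_FEE_CENTS, COST_PER_CONVERSATION_CENTS, MEDIAN_CONVERSATIONS)
heavy_margin = monthly_margin_cents(
    FLAT_FEE_CENTS, COST_PER_CONVERSATION_CENTS, MEDIAN_CONVERSATIONS * HEAVY_MULTIPLIER)

# Conversations per month at which a user stops being profitable.
break_even_conversations = FLAT_FEE_CENTS // COST_PER_CONVERSATION_CENTS

# How many median users it takes to cover one heavy user's loss.
median_users_needed = math.ceil(-heavy_margin / median_margin)

print(median_margin, heavy_margin, break_even_conversations, median_users_needed)
```

Under these assumptions a single heavy user erases the margin of five median users, and the break-even point sits at ten times median usage. Whatever the real numbers are, that break-even figure is the one that belongs in the spec, because it tells the product team where truncation, metering, or tiering has to start.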

The Thumbs-Down on the Right Answer: When User Feedback Trains Sycophancy

· 9 min read
Tian Pan
Software Engineer

A tax assistant tells the user they owe $4,200. The user clicks thumbs-down. A code reviewer flags a real bug in the user's PR. Thumbs-down. A calendar agent correctly says no slot is available before Friday. Thumbs-down. Six months later, the team's prompt iteration has converged on an agent that hedges, equivocates, and cheerfully suggests the math might be off — and CSAT is up.

The thumbs-down button does not measure quality. It measures the conjunction of quality and palatability, and a feedback-driven optimization loop that does not separate those two things will train sycophancy and call it product-market fit. This is not a hypothetical risk. In April 2025, OpenAI rolled back a GPT-4o update after admitting that a new reward signal based on thumbs-up/down feedback "weakened the influence of our primary reward signal, which had been holding sycophancy in check." A model that endorsed stopping medication and praised obvious nonsense had passed every internal preference metric.

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was right. The p-value was comfortably under 0.05. Six weeks later, the metric has flat-lined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.

AI Co-Pilot vs. AI Pilot: The Evidence-Based Product Decision Framework

· 9 min read
Tian Pan
Software Engineer

Every product team building with AI faces the same fork in the road: should the AI advise humans, or should it act on its own? The framing sounds philosophical, but the answer is actually measurable — and getting it wrong is expensive in ways that don't show up until six months after launch, when your override metrics look fine and your user trust scores are quietly collapsing.

Klarna replaced 700 customer service agents with an autonomous AI system in early 2024. By 2025, the CEO admitted they had "gone too far" and began quietly rehiring humans for complex cases. The AI handled 2.3 million conversations in a month and resolved issues in under 2 minutes instead of 11. The numbers looked great. The underlying problem — that customer service for financial products requires empathy and judgment, not just resolution speed — showed up later, in declining satisfaction on anything outside the happy path.

AI Feature PMF Signals: Why Your Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships and the metrics light up — DAU spikes, NPS climbs, thumbs-up feedback floods in — you could be looking at genuine product-market fit. Or you could be watching the first act of a two-part story where the second act ends with a retention cliff nobody saw coming.

The problem is these signals are structurally broken for probabilistic AI features. They were designed for deterministic software where "activated" means something, where a five-star rating predicts future use, where the novelty fades in days rather than masking a six-month churn wave. AI features behave differently, and the standard PMF toolkit is calibrated for the wrong inputs.

Why AI Features Break A/B Testing (and the Causal Inference Methods That Don't Lie)

· 11 min read
Tian Pan
Software Engineer

You ship an AI-powered feature, run a clean two-week A/B test, see a 4% lift in engagement, and call it a win. Six months later, the feature is fully rolled out and engagement is flat or declining. The test wasn't noisy — it was measuring the wrong thing entirely.

A/B tests were built for a world where users in a treatment group and users in a control group are statistically independent. AI features routinely violate that assumption. Users talk to each other, learn from each other's behavior, and share the outputs of AI tools. Treatment effects don't stabilize in two weeks when the real mechanism is long-horizon behavioral adaptation. When you ignore this, your experiment gives you a number that's internally consistent but causally meaningless.