12 posts tagged with "product-design"

Trust Ceilings: The Autonomy Variable Your Product Team Can't See

· 10 min read
Tian Pan
Software Engineer

Every agentic feature has a maximum autonomy level above which users start checking work, intervening, or abandoning the feature entirely. That maximum is not a property of your model. It is a property of your users, your domain, and the cost of being wrong, and it does not move because a launch deck says it should. Most teams discover their ceiling the hard way: a feature designed for full autonomy ships, adoption stalls at "agent suggests, human approves," the metrics blame the model, and the next quarter is spent tuning a knob that was never the bottleneck.

The shape of the ceiling is consistent enough across products that it deserves a name. Anthropic's own usage data on Claude Code shows new users using full auto-approve about 20% of the time, climbing past 40% only after roughly 750 sessions. PwC's 2025 survey of 300 senior executives found 79% of companies are using AI agents, but most production deployments operate at "collaborator" or "consultant" levels — the model proposes, the human disposes — not at the fully autonomous tier the marketing implied. The story underneath those numbers is not that users are timid. It is that trust is calibrated to the cost of a recoverable mistake, and your product almost certainly does not let users see, undo, or bound that cost the way they need to.
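
One way to make the ceiling concrete is to treat autonomy as something computed per action from reversibility, blast radius, and the user's own track record, rather than set globally at launch. The sketch below is a minimal illustration of that idea; the names and thresholds are assumptions for the example, not anything the post prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Autonomy(Enum):
    SUGGEST = 1   # agent proposes, human applies the change
    APPROVE = 2   # agent prepares the change, human confirms before it runs
    AUTO = 3      # agent acts on its own, human can undo afterwards


@dataclass
class ActionContext:
    reversible: bool        # can the user undo this in one step?
    blast_radius: int       # rough count of records or users affected
    acceptance_rate: float  # share of this user's recent suggestions accepted


def autonomy_ceiling(ctx: ActionContext) -> Autonomy:
    """Bound autonomy by the cost of a mistake, not by model quality alone."""
    if not ctx.reversible or ctx.blast_radius > 100:
        # Irreversible or wide-impact actions never run unattended.
        return Autonomy.APPROVE if ctx.acceptance_rate > 0.8 else Autonomy.SUGGEST
    return Autonomy.AUTO if ctx.acceptance_rate > 0.6 else Autonomy.APPROVE
```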

Async Agents Need an Inbox, Not a Chat

· 11 min read
Tian Pan
Software Engineer

The chat metaphor has a fuse, and it burns out around thirty seconds. Past that, the spinner stops being a progress indicator and becomes a commitment device — the one making the commitment is your user, and most of them bail. You can watch it in session replays: the typing indicator appears, the user waits, tabs away at about twelve seconds, half never come back. The product team sees a completed agent run with no human on the other end and files it as a success. It is not a success. It is an abandoned artifact that happened to finish.

This is the first contact with a structural problem that most agent products paper over with spinners and streaming text: the chat interface was designed for turn-taking humans and fast models, and it fails silently when either assumption breaks. If your agent takes minutes, you are not shipping a chat feature with a longer wait. You are shipping a different product, and it needs a different UI primitive.
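
One way to read "a different UI primitive" is a persistent run record that outlives the request: submitted work goes into a queue, finishes in the background, and lands in an inbox for review. The sketch below uses hypothetical names (AgentRun, RunStatus) just to show the shape of that record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from uuid import uuid4


class RunStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    NEEDS_REVIEW = "needs_review"   # finished, waiting for a human to look
    ACCEPTED = "accepted"
    DISMISSED = "dismissed"


@dataclass
class AgentRun:
    """An inbox item: the unit of work is a reviewable artifact, not a chat turn."""
    prompt: str
    status: RunStatus = RunStatus.QUEUED
    run_id: str = field(default_factory=lambda: str(uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    completed_at: datetime | None = None
    result_summary: str | None = None   # the one-line preview the inbox row shows


def mark_complete(run: AgentRun, summary: str) -> AgentRun:
    # Completion updates the inbox and fires a notification; it does not assume
    # the user is still staring at a spinner.
    run.status = RunStatus.NEEDS_REVIEW
    run.completed_at = datetime.now(timezone.utc)
    run.result_summary = summary
    return run
```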

The Output Commitment Problem: Why Streaming Self-Correction Destroys User Trust More Than the Original Error

· 10 min read
Tian Pan
Software Engineer

A user asks your agent a question. Tokens start flowing. Three sentences in, the model writes "Actually, let me reconsider — " and pivots to a different answer. The revised answer is better. The user closes the tab.

This is the output commitment problem, and it is one of the most consistently underestimated UX failures in shipped AI products. The engineering mindset treats self-correction as a feature — the model noticed its own error, that is the system working as intended. The user-perception mindset treats it as a disaster — the product demonstrated, live, that its first confident claim was wrong. Those two readings are both correct, and they do not reconcile on their own.

The core asymmetry is that streaming makes thinking legible, and legible thinking is auditable thinking. A model that went down the wrong path silently and recovered before producing a clean final answer would look competent. The same model, streaming every half-thought, looks like it is flailing. The answer quality is identical. The perception is not.

The Enterprise AI Capability Discovery Problem

· 10 min read
Tian Pan
Software Engineer

You shipped the AI feature. You put it in the product. You wrote the help doc. And still, six months later, your most sophisticated enterprise users are copy-pasting text into ChatGPT to do the same thing your feature already does natively. This is not a training problem. It is a discoverability problem, and it is one of the most consistent sources of wasted AI investment in enterprise software today.

The pattern is well-documented: 49% of workers report they never use AI in their role, and 74% of companies struggle to scale value from AI deployments. But the interesting failure mode is not the late-adopters who explicitly resist. It is the engaged users who open your product every day, never knowing that the AI capability they would have paid for is sitting one click away from where their cursor already is.

Feedback Surfaces That Actually Train Your Model

· 10 min read
Tian Pan
Software Engineer

Most AI products ship with a thumbs-up/thumbs-down widget and call it feedback infrastructure. It isn't. What it is, in practice, is a survey that only dissatisfied or unusually conscientious users bother completing — and a survey that tells you nothing about what the correct output would have looked like.

The result is a dataset shaped not by what your users want, but by which users felt like clicking a button. That selection bias propagates into fine-tuning runs, reward models, and DPO pipelines, quietly steering your model toward the preferences of a tiny and unrepresentative minority. Implicit signals — edit rate, retry rate, session abandonment — cover every user who touches the product. They don't require a click. They're generated by the act of using the software.

Here's how to design feedback surfaces that produce high-fidelity training signal as a natural side effect of product use, and how to route those signals into your training pipeline.
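
As a taste of what that looks like in practice, a user's edit to a generated draft can be captured as a contrastive training pair without asking for a single click. The sketch below is an illustration under assumed names (PreferencePair, pair_from_edit) and an arbitrary 5% edit threshold, not the post's exact pipeline.

```python
import difflib
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str     # what the user actually kept or shipped
    rejected: str   # what the model originally produced


def pair_from_edit(prompt: str, draft: str, final_text: str,
                   min_edit_ratio: float = 0.05) -> PreferencePair | None:
    """Turn a user's edit into a DPO-style preference pair.

    If the user changed enough of the draft, the edited version counts as
    'chosen' and the original draft as 'rejected'; near-identical text is
    treated as acceptance and skipped as a preference signal.
    """
    similarity = difflib.SequenceMatcher(None, draft, final_text).ratio()
    if (1.0 - similarity) < min_edit_ratio:
        return None  # draft shipped almost as-is: no contrastive pair here
    return PreferencePair(prompt=prompt, chosen=final_text, rejected=draft)
```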

The Jagged Frontier: Why AI Fails at Easy Things and What It Means for Your Product

· 10 min read
Tian Pan
Software Engineer

A common assumption in AI product development goes something like this: if a model can handle a hard task, it can definitely handle an easier one nearby. This assumption is wrong, and it's responsible for a category of production failures that no amount of benchmark reading prepares you for.

The research term for the underlying phenomenon is the "jagged frontier" — AI's capability boundary isn't a smooth line that hard tasks sit outside of and easy tasks sit inside. It's a ragged, unpredictable shape. AI systems can write production-grade database query optimizers and still miscalculate whether two line segments on a diagram intersect. They can pass PhD-level science exams and fail children's riddles that hinge on spatial relationships. They can synthesize 50-page documents and then confidently hallucinate a summary of a paragraph they just read.

The Agent Loading State Problem: Designing for the 45-Second UX Abyss

· 11 min read
Tian Pan
Software Engineer

There is a hole in your product between second ten and second forty-five where nothing you designed still works. Users abandon a silent UI around the ten-second mark — Jakob Nielsen pinned that threshold back in the nineties, and modern eye-tracking studies have not moved it by more than a second or two. Modern agent work routinely takes thirty to one hundred twenty seconds. Multi-step planning, retrieval, a couple of tool calls, maybe a reflection pass before the final write — the latency budget is not a budget anymore, it is a crater.

Most teams discover this the first time they ship an agent feature and watch session recordings. Users hammer the submit button. They paste the query into a second tab. They close the window and retry from scratch, convinced it is broken. The feature works; the waiting does not. The gap between "spinner appeared" and "answer arrived" is the most neglected surface in AI product design, and it is the one that decides whether users perceive your agent as intelligent or stuck.
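
One common way to fill that gap is to structure the run as named steps and stream a progress event before each one, so the UI always has something truthful to say about what the agent is doing. The sketch below is a generic illustration; the event names and transport are assumptions, not a specific framework's API.

```python
import json
import time
from typing import Callable


def run_with_progress(steps: list[tuple[str, Callable[[], object]]],
                      emit: Callable[[str], None]) -> object:
    """Run agent steps in order, narrating the wait with step-level events.

    `steps` pairs a user-facing label with the work to do; `emit` pushes a
    JSON line to the client (e.g. over server-sent events) so the UI can show
    "Searching your workspace (2/4)" instead of a bare spinner.
    """
    result: object = None
    for index, (label, work) in enumerate(steps, start=1):
        emit(json.dumps({"event": "step_started", "step": index,
                         "total": len(steps), "label": label, "ts": time.time()}))
        result = work()
    emit(json.dumps({"event": "done", "ts": time.time()}))
    return result
```

A call might look like `run_with_progress([("Searching your workspace", search), ("Drafting the summary", draft)], emit=send_sse)`, where `search`, `draft`, and `send_sse` stand in for your own functions.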

Ambient AI Design: When the Chat Interface Is the Wrong Abstraction

· 8 min read
Tian Pan
Software Engineer

Most engineering teams default to building AI features as chat interfaces. A user types something; the model responds. The pattern feels natural because it maps to human conversation, and the tooling makes it easy. But when you watch those chat-based AI features in production, you often see the same dysfunction: the UI sits idle, waiting for a user who is too busy, too distracted, or simply unaware that they should be asking something.

Chat is a pull model. The user initiates. The AI reacts. For a meaningful subset of the valuable AI work in any product—monitoring, anomaly detection, workflow automation, proactive notification—pull is the wrong shape. The work needs to happen whether or not the user remembered to open the chat window.
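
Inverting that shape means the trigger is a schedule or an event stream, and the output is a notification the user can act on rather than a reply to a question nobody asked. The sketch below shows the skeleton of such a loop; the names (Finding, ambient_check) and severity levels are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Finding:
    title: str
    severity: str           # "info", "warn", or "critical"
    suggested_action: str


def ambient_check(fetch_events: Callable[[], Iterable[dict]],
                  detect: Callable[[Iterable[dict]], list[Finding]],
                  notify: Callable[[Finding], None],
                  min_severity: str = "warn") -> None:
    """Push-model loop: triggered by a schedule or an event, never by a prompt.

    The model (inside `detect`) decides whether anything is worth surfacing;
    the user only hears about it when a finding clears the severity bar, and
    the result lands in a notification surface rather than a chat reply.
    """
    rank = {"info": 0, "warn": 1, "critical": 2}
    for finding in detect(fetch_events()):
        if rank[finding.severity] >= rank[min_severity]:
            notify(finding)
```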

The Overclaiming Trap: When Being Right for the Wrong Reasons Destroys AI Product Trust

· 10 min read
Tian Pan
Software Engineer

Most AI product post-mortems focus on the same story: the model was wrong, users noticed, trust eroded. The fix is obvious — improve accuracy. But there is a more insidious failure mode that post-mortems rarely capture because standard accuracy metrics don't surface it: the model was right, but for the wrong reasons, and the power users who checked the reasoning never came back.

Call it the overclaiming trap. It is the failure mode where correct final answers are backed by fabricated, retrofitted, or structurally unsound reasoning chains. It is more dangerous than ordinary wrongness because it looks like success until your most sophisticated users start quietly leaving.

The Trust Calibration Gap: Why AI Features Get Ignored or Blindly Followed

· 9 min read
Tian Pan
Software Engineer

You shipped an AI feature. The model is good — you measured it. Precision is 91%, recall is solid, the P99 latency is under 400ms. Three months later, product analytics tell a grim story: power users have turned it off entirely, while a different cohort is accepting every suggestion without changing a word, including the ones that are clearly wrong.

This is the trust calibration gap. It's not a model problem. It's a design problem — and it's more common than most AI product teams admit.

Trust Transfer in AI Products: Why the Same Feature Ships at One Company and Dies at Another

· 9 min read
Tian Pan
Software Engineer

Two product teams at two different companies build the same AI writing assistant. Same model. Similar feature surface. Comparable accuracy numbers. One team celebrates record activation at launch. The other quietly disables the feature after three months of flat adoption and one scathing question at an internal all-hands.

The engineering debrief at the struggling company focuses on the obvious variables: latency, accuracy, UX polish. None of them fully explain the gap. The real variable was trust — specifically, whether the AI feature could borrow enough existing trust to earn the right to make mistakes while it proved itself.

Trust transfer is the invisible force that determines whether an AI feature lands or dies. And most teams shipping AI products have never explicitly designed for it.

The Accuracy Threshold Problem: When Your AI Feature Is Too Good to Ignore and Too Bad to Trust

· 10 min read
Tian Pan
Software Engineer

McDonald's deployed its AI voice ordering system to over 100 locations. In testing, it hit accuracy numbers that seemed workable — low-to-mid 80s percent. Customers started posting videos of the system adding nine sweet teas to their order unprompted, placing bacon on ice cream, and confidently mishearing simple requests. Within two years, the partnership was dissolved and the technology removed from every location. The lab accuracy was real. The real-world distribution was not what the lab tested.

This is the accuracy threshold problem. There is a zone — roughly 70 to 85 percent accuracy — where an AI feature is accurate enough to look like it works, but not reliable enough to actually work without continuous human intervention. Teams ship into this zone because the numbers feel close enough. Users get burned because the feature is just good enough to lure them into reliance and just bad enough to fail when it matters.