55 posts tagged with "ux"

The Are-You-Sure Confirmation Step Your Users Learned to Click Through

· 11 min read
Tian Pan
Software Engineer

The confirmation dialog is the cheapest safety layer in the AI agent toolbox. It's a string, a button, and a callback. The product manager who asked for it left the meeting believing the agent was now safe. The engineer who built it shipped it in an afternoon. The compliance reviewer who audited it ticked the box. And the user who saw it for the seventh time that morning had already moved their mouse to the Confirm button before their eyes finished reading the title.

Within a week, the confirmation step is no longer a decision point. It's a rhythm. The agent says "are you sure you want to send this email?" and the user says yes the way they say bless-you at a sneeze. The day the agent proposes an action that is actually wrong — wrong recipient, wrong amount, wrong tone — the user confirms it with the same automaticity they used for the six correct ones before it, and the email goes out, and the team writes a postmortem that says "user error."

It wasn't user error. It was a system that mistook the existence of a click for the existence of consent.

The Interrupt UI That Taught Your Users to Never Interrupt the Agent

· 10 min read
Tian Pan
Software Engineer

The interrupt button on your streaming agent has a 0.4% click rate. The product team reads that number and concludes the feature is working as intended — most generations don't need to be interrupted, the implementation is fine, ship it, move on. The actual reading is that the interrupt button taught your users not to press it. Within a week of using the product, they figured out that pressing stop discards the partial response, clears the context, and dumps them back at an empty input box. The lesson they learned is to wait through a bad answer rather than risk losing the thread.

That 0.4% is not a usage signal. It is an aversion signal. Your users are not happy with the answers — they are afraid of the cost of trying to redirect them, and their adaptation is to sit quietly while the agent finishes saying something they already know is wrong. The engineering team treated "stop generation" as a model-call cancellation. The user treated it as "redirect, don't restart." The two definitions never met, and the product shipped a feature that quietly drained user agency from every long-running conversation.

The Latency Budget Your Agent Loop Stole from the Search Box

· 12 min read
Tian Pan
Software Engineer

The launch metrics looked clean. Answer quality up, citation rate up, the eval suite green. The team that replaced the old keyword search with an agent-backed retriever shipped, took the win, and moved on. Six weeks later somebody noticed the weekly active number on that surface had drifted down twelve percent and nobody could find the regression. There was no regression. The agent worked. The users left because the box that used to answer in two hundred milliseconds now took four seconds, and nothing in the launch retro had a budget for that.

This is the latency-budget transfer problem, and almost nobody draws the org chart that catches it. A search box is not just a function call. It is a thirty-year contract with the user's nervous system: type, see results, scan, click. The 200-millisecond response is not a performance metric on a dashboard somewhere — it is the reason the user's attention is still on the screen when the results arrive. When the team underneath the box replaces a keyword index with an agent loop, the function-call surface looks identical and the SLA on the new call lives in a completely different regime. The latency budget moved from the team that owned the index to the team that owns the agent, and from the team that owns the agent to the user, and the only one who showed up to the meeting was the user.

The Streaming Response That Contradicts Itself

· 8 min read
Tian Pan
Software Engineer

The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no — and here is why." The end-state is correct. The user already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.

This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.

The Token Budget That Ran Out Mid-Conversation: Why Free-Tier Users Think Your Model Got Dumber

· 12 min read
Tian Pan
Software Engineer

A product manager I know spent two weeks triaging a churn spike on her company's AI writing assistant. Free-tier session length had collapsed by 30%, the support inbox filled up with variations of "your model used to be smart, now it's lazy," and the team's first instinct was to blame a model upgrade that had shipped the same week. The model had not changed. What had changed was that finance had quietly tightened the per-user token budget mid-quarter, and the app had been silently truncating system prompts, dropping tool calls, and shortening responses for any user who crossed the new threshold. From the user's seat, the AI had degraded. From the dashboard, nothing was wrong. Both were true, and that is the failure mode.

This pattern is everywhere now. ChatGPT's free tier drops to a smaller model when the limit is hit, with no in-product label other than "responses may be shorter for a while." Anthropic's free tier behaves similarly. Build a feature on top of either, layer on your own per-user budget for cost control, and you have stacked two invisible cliffs in series — the platform's and yours — and the user, who only sees one chat box, has no way to tell which one they just walked off.

The Power User Who Learned Your Prompt By Trial

· 10 min read
Tian Pan
Software Engineer

There is a user in your product right now who is having a much better experience than the median. Not because they pay more, not because they have a different tier, not because they were rolled into a different cohort. They have figured out, through patient probing, that the AI feature responds beautifully if you ask in a certain way. They know which verbs trigger the structured output. They know that a one-word follow-up gives them the terse version and a complete sentence gives them the expansive one. They know that the assistant gets defensive about certain topics unless you frame the question as a hypothetical. None of this is written down anywhere on your site. They reverse-engineered it.

The interesting thing is not that this user exists. It is that this user is now your documentation. Your AI feature has a contract with its users — an undocumented one, encoded entirely in the system prompt — and the only way anyone learns the contract is by trial. A small fraction of users have the patience to run those trials. Everyone else gets a worse product.

The Streaming Response That Committed Before the User Said Yes

· 12 min read
Tian Pan
Software Engineer

The user is reading the agent's reasoning as it streams in. Around token 1200, the model decides to call send_email, then create_ticket, then kick_off_deploy. The user, watching the partial output and realizing the agent has misread the request, hits the stop button half a second too late. The email is already sent. The ticket is already filed. The deploy is already running. The stop button cancelled the next token, not the consequences of the last one.

The bug is not in the cancel handler. The bug is the assumption — borrowed from every other streaming UI on the team's roadmap — that an incrementally rendered output is an incrementally reversible one. Tool calls do not honor that contract. They are point-in-time commits that the streaming layer happily fires while the rest of the response is still being generated, and the cancel button has no way to chase them down the wire.

This is one of those failure modes that nobody owns because it lives in the seam between two teams that each shipped their half cleanly. The UX team shipped streaming because it tested better in user studies. The platform team shipped tool calls because the framework supports them. Neither team had a meeting where someone asked: what is "stop" supposed to mean when the response has already left the building?
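
One way to give "stop" a meaning is to stage side-effecting tool calls instead of firing them mid-stream, so that a stop discards anything not yet committed. The sketch below is a minimal illustration under that assumption. `ToolCallStage`, `propose`, `stop`, and `commit` are hypothetical names, not part of any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StagedCall:
    name: str
    run: Callable[[], str]  # the side effect, deferred until commit


@dataclass
class ToolCallStage:
    """Holds side-effecting tool calls until the user commits or stops."""
    pending: list = field(default_factory=list)
    stopped: bool = False

    def propose(self, name: str, run: Callable[[], str]) -> None:
        # Called when the model emits a tool call; nothing executes yet.
        if not self.stopped:
            self.pending.append(StagedCall(name, run))

    def stop(self) -> list:
        # The stop button: discard every staged call and report what was dropped.
        self.stopped = True
        discarded = [call.name for call in self.pending]
        self.pending.clear()
        return discarded

    def commit(self) -> list:
        # Runs the staged calls only if the user never pressed stop.
        if self.stopped:
            return []
        results = [call.run() for call in self.pending]
        self.pending.clear()
        return results
```

The trade-off is that nothing with a side effect happens until the stream ends or the user approves, which costs latency on the actions that were correct. Read-only tool calls would not need staging.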

Streamed Tokens Are a Promise You Can't Take Back

· 9 min read
Tian Pan
Software Engineer

The model has streamed seventy percent of a confident-sounding answer to the user's screen. Then the tool call it was about to make returns an error, or no rows, or a 429. You now get to pick between two losses: let the model finish gracefully by inventing the rest, or stop mid-sentence with no clean way to walk it back. Neither is a recovery — both are damage.

This is the part of streaming UX that nobody priced when they turned the feature on. Streaming was framed as a perceived-latency win: time-to-first-token is the metric, the user starts reading sooner, the app feels alive. What the framing leaves out is that every token you stream is a commitment. You have published a draft of an answer that you do not yet know is correct, and the back half of your system has not yet finished running. When it finishes and disagrees, your UI has no native way to retract what it already showed.

The Confidence Score Your Users Learned to Ignore

· 11 min read
Tian Pan
Software Engineer

You wanted to be honest. You put a little "92%" next to every answer your agent gave. After the third time the agent was confidently wrong at 92%, your users stopped reading the number. They did not get angry about it. They just learned, the way humans always learn around a misbehaving signal, that the gauge on the dashboard is not connected to the engine. The number is still there. It costs you tokens to produce it. It informs no decision anyone makes.

This is the failure mode that calibration UX research keeps rediscovering: surfacing a probability is a trust commitment, and the commitment goes one direction. The moment the number turns out to be uncorrelated with correctness in the user's lived experience, the score is dead — and the trust you spent putting it there is dead with it. You cannot un-ring that bell by fixing the number later. The number is now decoration.
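
Whether a displayed score is correlated with correctness can be checked before users check it for you. The sketch below assumes you have logged pairs of stated confidence and a later correctness label. `calibration_table` and the record format are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def calibration_table(records, bucket_width=10):
    """Map each confidence bucket to its observed accuracy.

    records: iterable of (stated_confidence_percent, was_correct) pairs.
    Returns {bucket_lower_bound: accuracy}, e.g. {90: 0.5} means answers
    shown at 90-100% confidence were right half the time.
    """
    buckets = defaultdict(lambda: [0, 0])  # bucket -> [correct, total]
    for confidence, correct in records:
        low = int(confidence) // bucket_width * bucket_width
        low = min(low, 100 - bucket_width)  # fold 100% into the top bucket
        buckets[low][0] += int(correct)
        buckets[low][1] += 1
    return {low: round(c / n, 2) for low, (c, n) in sorted(buckets.items())}
```

If the 90s bucket and the 50s bucket come back with similar accuracy, the number on screen is carrying no information, which is the situation the paragraph above describes.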

The First-Time User Cliff Your Aggregate Metrics Are Hiding

· 10 min read
Tian Pan
Software Engineer

Your AI feature looks healthy. Weekly active is flat-to-up, satisfaction scores are positive, the dashboard says ship more of this. The PM cites the metric in the next planning round. The engineering lead nods. The roadmap gets another adjacent feature.

Then someone segments the chart by user tenure and the picture inverts. Long-time users — the ones who were already there when the feature shipped — go deep on it daily. First-time users bounce within two interactions. The "flat" line is two cohorts cancelling each other out: a power curve sloping up, and a churn curve sloping down, summed into a lie.
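
The cancellation is plain arithmetic. A small sketch with made-up weekly numbers (all values here are hypothetical, chosen only to show the effect):

```python
def weekly_active_total(tenured, first_time):
    """Sum two cohorts' weekly active counts into the aggregate line."""
    return [t + f for t, f in zip(tenured, first_time)]


# Hypothetical four-week series: tenured users grow, first-time users churn.
tenured = [400, 450, 500, 550]
first_time = [600, 550, 500, 450]

total = weekly_active_total(tenured, first_time)
# total is flat at 1000 every week, while each cohort moves by 150.
```

The aggregate shows no movement at all, even though one cohort grew by more than a third and the other shrank by a quarter.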

Your Agent Has No Concept of Business Hours

· 10 min read
Tian Pan
Software Engineer

A support agent at a mid-size SaaS company resolved a billing dispute correctly. It read the ticket, checked the customer's account, found the duplicate charge, issued the refund, and sent a polite confirmation email. Every step was right. The only problem was the timestamp: 3:14 a.m. in the customer's timezone. The customer woke up to a refund notification, assumed their card had been compromised, and opened a fraud case with their bank before anyone at the company was awake to explain.

Nothing in that workflow was a bug in the conventional sense. The agent didn't hallucinate, didn't pick the wrong account, didn't miscalculate the refund. It just had no idea that 3 a.m. is a bad time to tell someone money moved. The model has read more text about human sleep schedules than any person alive, and it still acted as if the recipient were a server endpoint that is awake whenever you call it.
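
One possible guard is to let the agent do the work whenever it likes but hold the customer-facing notification until the recipient's local daytime. The sketch below assumes the recipient's timezone is known and that 08:00 to 20:00 local is an acceptable window. Both the function name `next_send_time` and that window are illustrative assumptions.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def next_send_time(
    now_utc: datetime,
    recipient_tz: tzinfo,
    start: time = time(8, 0),
    end: time = time(20, 0),
) -> datetime:
    """Return when a notification may go out, as a UTC datetime.

    Sends immediately inside the recipient's local window, otherwise
    defers to the next local `start`.
    """
    local = now_utc.astimezone(recipient_tz)
    if start <= local.time() < end:
        return now_utc
    # Before the window opens: wait until today's start. After it closes:
    # wait until tomorrow's start.
    send_day = local.date() if local.time() < start else local.date() + timedelta(days=1)
    local_send = datetime.combine(send_day, start, tzinfo=recipient_tz)
    return local_send.astimezone(timezone.utc)
```

Any `tzinfo` works for `recipient_tz`, including a `zoneinfo.ZoneInfo` for a named region. In the story above, the refund could still be issued at 3:14 a.m. while the email waits until 8:00 a.m. local time.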

The Streaming Rollback Problem: You Can't Un-Say a Token

· 10 min read
Tian Pan
Software Engineer

Watch someone use a chat product for the first time and you'll notice they start reading before the model finishes. That reading-as-it-appears behavior is the entire reason streaming exists: it turns a multi-second wait into something that feels like a conversation. It is also the reason your output guardrails are quietly broken.

Here is the uncomfortable sequence. The model generates token 1, token 2, token 150. Each one is rendered the instant it arrives. At token 200, the model produces a hallucinated dosage, a leaked email address, or a sentence that violates your content policy. Your output-side guardrail fires correctly and immediately. But "immediately" is too late — the user has already read 200 tokens. You cannot un-render them. The guardrail did its job, and the violation still reached a human being.
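
A common mitigation is a holdback window: render tokens a few steps behind generation, so the guardrail sees each token before the user does. The sketch below is a minimal version under that assumption. `stream_with_holdback` and the `is_violation` callback are hypothetical names, and the check shown only inspects the buffered, not-yet-rendered text.

```python
from typing import Callable, Iterable, Iterator


def stream_with_holdback(
    tokens: Iterable[str],
    is_violation: Callable[[str], bool],
    holdback: int = 20,
) -> Iterator[str]:
    """Yield tokens for rendering, trailing generation by `holdback` tokens.

    If the guardrail flags the unrendered buffer, the stream ends and the
    buffered tokens are never shown.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)
        if is_violation("".join(buffer)):
            return  # drop everything still in the buffer
        if len(buffer) > holdback:
            yield buffer.pop(0)  # oldest token has cleared the window
    yield from buffer  # clean finish: flush the remainder
```

This trades perceived latency for a retraction window. Tokens that were already rendered stay on screen, so the window has to be at least as long as the span the guardrail needs to recognize a violation.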