Skip to main content

16 posts tagged with "metrics"

View all tags

The A/B Test Powered by Token Counts Instead of Outcomes

· 13 min read
Tian Pan
Software Engineer

A team I worked with shipped a prompt change that reduced output tokens by 22%. The experiment dashboard lit up green — variance was tight, the p-value was clean, and the cost savings extrapolated to six figures a year. Two weeks later, a product analyst poking at conversion funnels flagged that the downstream task completion rate had dropped 11% in the same window. The shorter outputs were leaving out a clarifying step that users had been quietly relying on to know what to click next.

The experiment platform had not lied. It had reported the exact metric the team configured as primary, and that metric had moved in the right direction. The problem was that the metric measured something the team did not actually care about. Tokens were cheap to count, the experiment infra had a turnkey integration for them, and outcomes were hard to instrument — so the team picked what the platform made easy. The result was a clean win on the dashboard and a regression in the product.

The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task

· 10 min read
Tian Pan
Software Engineer

A quarter into the rollout, the AI team reported a 25% reduction in average cost-per-API-call. The support team reported that average handle time on AI-routed tickets had drifted from four turns to seven. Both numbers were correct. Both teams were measuring the system they had been told to optimize. The finance team, sitting between them, could not reconcile the dashboards because neither one was denominated in the thing the customer was actually paying for: a resolved ticket. The cost-per-call had gone down. The cost-per-resolved-task had gone up 40%. Nobody owned that number, so nobody was watching it move.

This is the most common unit-economics failure I see in agentic deployments, and it is not a measurement bug. It is a definitional one. The vendor's pricing page exposes cost-per-call because that is the unit they bill. The spreadsheet line item inherits that unit because it fits in a cell. The engineering team optimizes against the unit they were given. By the time the gap between API economics and business economics becomes visible, it has been compounding for a quarter, and the agent has been quietly trained on the wrong loss function the entire time.

The ChatOps Bot That Mistook Silence for Consent

· 10 min read
Tian Pan
Software Engineer

Your deploy bot has been live for nine months. The dashboard says message volume is up and to the right. The thumbs-down rate is stable below two percent. The team that ships it interprets this as adoption. Then a staff engineer mentions, almost in passing, that everyone on his squad muted the channel back in February — they trust the bot's hourly digest about as much as they trust a vendor newsletter, and they got tired of the buzz. The bot is talking to an empty room and the metric calls that traction.

This is the failure mode most chatops teams hit and almost none of them measure. When a bot in Slack or Teams stops getting replies, the easy read is "the agent has reached a steady state — users don't need to argue with it anymore." The honest read is usually the opposite: users are routing around it, muting it, or learning that ignoring the prompt is cheaper than reading it. The engagement chart can't tell you which. The instrumentation has to be redesigned around the assumption that silence is the default and that interpreting it correctly is the whole job.

The Deflection Metric That Lied: When AI Support Success Hides User Churn

· 10 min read
Tian Pan
Software Engineer

A support leader I spoke with last quarter was glowing about a 78% deflection rate from the new AI agent. Tickets routed to humans had collapsed; cost per contact looked beautiful; the dashboard sparkled green for three straight months. Then revenue ops ran a cohort analysis. The customers who had hit the bot at least once during a billing question were churning at 1.7x the rate of customers who had not. The deflection metric had not measured help. It had measured silence — and silence turned out to be the sound of paying users walking out the door.

This is the failure mode that the industry is now naming aloud. Deflection counts conversations where the customer did not reach a human. It does not distinguish "I got my answer" from "I gave up." Treat those as the same number and you will optimize for the second one, because making the bot harder to escape is much easier than making it actually resolve issues. Klarna learned this publicly in 2026 when it began rehiring customer service staff a year after announcing AI had replaced roughly 700 agents; repeat contacts had jumped about 25%, and the savings line that justified the layoffs evaporated against the cost of re-handling everything the bot mishandled the first time.

Re-Ask Rate: The Failure Signal Your Eval Pipeline Never Extracts

· 10 min read
Tian Pan
Software Engineer

Open any production chat transcript long enough and you will find a user who asks the same question three times. The phrasing changes a little each turn — pronouns swap to nouns, a clarifier gets bolted on, the polite hedge falls away by the third try — but the underlying request is identical. They are not asking three questions. They are asking the same question, and the agent is failing to answer it, and the user is hoping that this time the words will land differently.

The transcript-level signal here is so loud it is almost obscene. The user has told you, with their own keystrokes, that the previous response did not help. They did not need to fill out a survey. They did not need to leave a thumbs-down. They told you by typing the question again. And in most production AI stacks, this signal is silently discarded by an eval pipeline that scores each turn in isolation and a satisfaction survey that only fires at session end — by which point the user who re-asked three times has usually already churned and will never grade anything.

Task Completion Goes Green While Users Quietly Suffer

· 8 min read
Tian Pan
Software Engineer

Your agent dashboard says 94% task completion. Leadership is happy. The roadmap gets funded. And yet support tickets are climbing, power users have gone quiet, and the one engineer who actually watches traces keeps muttering that something is wrong. Both things are true at once. The agent is completing tasks. It is also taking twelve minutes and four thousand tokens to do a two-step job, backtracking three times, and asking the user to confirm a fact it could have inferred from the first message.

Task completion is a binary that hides a distribution. "The agent finished" tells you nothing about the path it took to finish, and the path is most of what users actually experience. A completion-rate dashboard is structurally incapable of seeing a slow, expensive, annoying agent. It will stay green right up until users churn.

This is not a measurement gap you can patch with a better prompt. It is a category error in what you chose to measure. Completion is the easiest thing to instrument and the least of what people are paying for.

The Difficulty Concentrator: AI Support Deflection Burns Out the Humans Left Behind

· 9 min read
Tian Pan
Software Engineer

The dashboard says everything is going well. Deflection up to 65 percent. Ticket volume down. Cost-per-contact halved. Then the support team starts quitting, and the exit interviews say something the dashboard has no column for: "every shift is the bad one."

This is the hidden mechanic of AI-augmented support. The deflection rate is not a measure of difficulty removed. It is a measure of difficulty concentrated. The cases that reach a human are no longer a representative sample of customer reality — they are the residue, the cases the AI couldn't close. And the residue is heavier than the average.

AI Feature PMF Signals: Why Your Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships and the metrics light up — DAU spikes, NPS climbs, thumbs-up feedback floods in — you could be looking at genuine product-market fit. Or you could be watching the first act of a two-part story where the second act ends with a retention cliff nobody saw coming.

The problem is these signals are structurally broken for probabilistic AI features. They were designed for deterministic software where "activated" means something, where a five-star rating predicts future use, where the novelty fades in days rather than masking a six-month churn wave. AI features behave differently, and the standard PMF toolkit is calibrated for the wrong inputs.

The Feedback Signal Timing Problem: Why Your AI Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When Klarna deployed its AI customer service chatbot in early 2024, it processed 2.3 million conversations in the first month. Satisfaction scores matched human agents. Executives declared victory. By 2025, the company was quietly hiring back the human agents it had replaced.

What went wrong? The metrics told one story while users experienced another. The chatbot aced simple, transactional queries—order status, payment questions—but fell apart on complex disputes, fraud claims, and emotionally difficult conversations. CSAT scores averaged across all interaction types couldn't detect this. The system appeared to be working even as it was slowly eroding user trust.

This isn't a Klarna-specific failure. It's a pattern that repeats across AI product development: teams collect satisfaction signals, optimize against them, and discover too late that the signals were measuring something other than actual value. The problem isn't the tools—it's the timing mismatch between when feedback arrives and when the consequences of a response become clear.

The Quiet Quitter Pattern: Why Your AI Engagement Metrics Are Lying to You

· 10 min read
Tian Pan
Software Engineer

There's a specific failure mode that quietly destroys AI product metrics without anyone noticing. Your dashboard shows a 34% suggestion acceptance rate, strong DAU, and growing feature engagement. What the dashboard doesn't show is that 60% of those accepted suggestions get immediately rewritten, the users who "engage" most are the ones who click the AI output, select all, and type their own response anyway, and the feature has zero measurable effect on downstream task completion.

The Quiet Quitter Pattern: Why Your AI Engagement Metrics Are Lying to You

This is the quiet quitter pattern: users who systematically route around an AI feature while still generating all the surface metrics of engaged users. They don't disable the feature — they just ignore its output. In your analytics, they look identical to your best AI users.

Org-Level Goodhart: When Teams Game AI Adoption Metrics

· 9 min read
Tian Pan
Software Engineer

According to one study, 95% of generative AI pilots technically succeed—and 74% of companies using generative AI still haven't shown tangible business value. That gap is not a coincidence. It's a measurement problem dressed up as a technology problem, and most organizations won't diagnose it correctly because the people doing the measuring are the same ones being measured.

This is Goodhart's Law at the organizational level: once an AI adoption metric becomes a performance target, it ceases to measure what you care about. The metric keeps going up. The underlying outcome stays flat or gets worse.