9 posts tagged with "metrics"

The Difficulty Concentrator: AI Support Deflection Burns Out the Humans Left Behind

· 9 min read
Tian Pan
Software Engineer

The dashboard says everything is going well. Deflection up to 65 percent. Ticket volume down. Cost-per-contact halved. Then the support team starts quitting, and the exit interviews say something the dashboard has no column for: "every shift is the bad one."

This is the hidden mechanic of AI-augmented support. The deflection rate is not a measure of difficulty removed. It is a measure of difficulty concentrated. The cases that reach a human are no longer a representative sample of customer reality — they are the residue, the cases the AI couldn't close. And the residue is heavier than the average.
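The concentration effect can be shown with a toy calculation. The sketch below is illustrative only: the ticket list, the 1-to-5 difficulty scale, and the function names are hypothetical and not taken from any real support system.

```python
from statistics import mean

# Hypothetical tickets: (difficulty on an assumed 1-5 scale, closed by the AI?)
tickets = [
    (1, True), (1, True), (2, True), (2, True), (2, True),
    (3, True), (3, False), (4, False), (5, False), (5, False),
]


def deflection_rate(rows):
    """Share of tickets the AI closed without a human."""
    return sum(1 for _, deflected in rows if deflected) / len(rows)


def mean_difficulty(rows):
    """Average difficulty of a set of tickets."""
    return mean(difficulty for difficulty, _ in rows)


# The human queue is the residue: everything the AI could not close.
human_queue = [row for row in tickets if not row[1]]

overall_rate = deflection_rate(tickets)
overall_difficulty = mean_difficulty(tickets)
residue_difficulty = mean_difficulty(human_queue)
```

With this made-up data the deflection rate is 0.6, the average ticket difficulty is 2.8, and the average difficulty of what reaches a human is 4.25. The dashboard reports the first number; the agents live in the third.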

AI Feature PMF Signals: Why Your Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships and the metrics light up — DAU spikes, NPS climbs, thumbs-up feedback floods in — you could be looking at genuine product-market fit. Or you could be watching the first act of a two-part story where the second act ends with a retention cliff nobody saw coming.

The problem is that these signals are structurally broken for probabilistic AI features. They were designed for deterministic software where "activated" means something, where a five-star rating predicts future use, and where the novelty fades in days rather than masking a six-month churn wave. AI features behave differently, and the standard PMF toolkit is calibrated for the wrong inputs.

The Feedback Signal Timing Problem: Why Your AI Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When Klarna deployed its AI customer service chatbot in early 2024, it processed 2.3 million conversations in the first month. Satisfaction scores matched human agents. Executives declared victory. By 2025, the company was quietly hiring back the human agents it had replaced.

What went wrong? The metrics told one story while users experienced another. The chatbot aced simple, transactional queries—order status, payment questions—but fell apart on complex disputes, fraud claims, and emotionally difficult conversations. CSAT scores averaged across all interaction types couldn't detect this. The system appeared to be working even as it was slowly eroding user trust.
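A minimal sketch of why a blended CSAT average can hide this kind of failure. The ratings, the interaction-type labels, and the function names below are invented for illustration; they are not Klarna data.

```python
from collections import defaultdict

# Hypothetical ratings: (interaction type, CSAT on an assumed 1-5 scale)
ratings = [
    ("order_status", 5), ("order_status", 5), ("order_status", 4),
    ("payment_question", 5), ("payment_question", 4), ("payment_question", 5),
    ("dispute", 2), ("fraud_claim", 1),
]


def overall_csat(rows):
    """Single blended average, the number a headline dashboard would show."""
    return sum(score for _, score in rows) / len(rows)


def csat_by_type(rows):
    """Average CSAT per interaction type."""
    buckets = defaultdict(list)
    for kind, score in rows:
        buckets[kind].append(score)
    return {kind: sum(scores) / len(scores) for kind, scores in buckets.items()}


blended = overall_csat(ratings)
segmented = csat_by_type(ratings)
```

Here the blended score is 3.875, which looks tolerable, while the `dispute` and `fraud_claim` segments sit at 2.0 and 1.0. Because simple queries dominate volume, they also dominate the average.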

This isn't a Klarna-specific failure. It's a pattern that repeats across AI product development: teams collect satisfaction signals, optimize against them, and discover too late that the signals were measuring something other than actual value. The problem isn't the tools—it's the timing mismatch between when feedback arrives and when the consequences of a response become clear.

The Quiet Quitter Pattern: Why Your AI Engagement Metrics Are Lying to You

· 10 min read
Tian Pan
Software Engineer

There's a specific failure mode that corrupts AI product metrics without anyone noticing. Your dashboard shows a 34% suggestion acceptance rate, strong DAU, and growing feature engagement. What the dashboard doesn't show: 60% of those accepted suggestions get immediately rewritten; the users who "engage" most are the ones who click the AI output, select all, and type their own response anyway; and the feature has zero measurable effect on downstream task completion.

This is the quiet quitter pattern: users who systematically route around an AI feature while still generating all the surface metrics of engaged users. They don't disable the feature — they just ignore its output. In your analytics, they look identical to your best AI users.
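One way to separate the two groups is to compare what was accepted with what was actually sent. The sketch below assumes a hypothetical event log with `suggested`, `accepted`, and `final_text` fields, and an arbitrary 0.8 similarity threshold; both are illustrative choices, not a standard.

```python
from difflib import SequenceMatcher

# Hypothetical event log; field names are assumptions for this sketch.
events = [
    {"suggested": "Your refund was issued today.", "accepted": True,
     "final_text": "Your refund was issued today."},
    {"suggested": "Your refund was issued today and should arrive soon.",
     "accepted": True, "final_text": "Call me."},
    {"suggested": "Please restart the router.", "accepted": False,
     "final_text": "We are sending a technician."},
    {"suggested": "We apologize for the inconvenience caused by the delay.",
     "accepted": True, "final_text": "Fixed."},
    {"suggested": "Your order ships tomorrow.", "accepted": True,
     "final_text": "Your order ships tomorrow."},
]

RETAIN_THRESHOLD = 0.8  # arbitrary cutoff chosen for illustration


def was_retained(event, threshold=RETAIN_THRESHOLD):
    """True if the suggestion was accepted and mostly survived into the final text."""
    if not event["accepted"]:
        return False
    similarity = SequenceMatcher(None, event["suggested"], event["final_text"]).ratio()
    return similarity >= threshold


acceptance_rate = sum(e["accepted"] for e in events) / len(events)
retained_rate = sum(was_retained(e) for e in events) / len(events)
```

On this toy log the acceptance rate is 0.8 while the retained rate is 0.4. The gap between the two is where the quiet quitters are.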

Org-Level Goodhart: When Teams Game AI Adoption Metrics

· 9 min read
Tian Pan
Software Engineer

According to one study, 95% of generative AI pilots technically succeed—and 74% of companies using generative AI still haven't shown tangible business value. That gap is not a coincidence. It's a measurement problem dressed up as a technology problem, and most organizations won't diagnose it correctly because the people doing the measuring are the same ones being measured.

This is Goodhart's Law at the organizational level: once an AI adoption metric becomes a performance target, it ceases to measure what you care about. The metric keeps going up. The underlying outcome stays flat or gets worse.

Accept Rate Is a Vanity Metric: Your Copilot ROI Hides in the 90 Seconds After the Keystroke

· 11 min read
Tian Pan
Software Engineer

The dashboard says your engineers accepted 45% of AI suggestions last quarter. Leadership reads that as "45% of a developer's time saved" and signs the renewal. The engineers, meanwhile, are quietly rewriting half of what they accepted, debugging the other half, and wondering why their sprints still feel the same length. Both sides are looking at the same number. Only one of them knows what it leaves out.

The most quoted study of 2025 should have ended the vendor-dashboard era on its own. METR measured experienced open-source maintainers working on real issues in their own repos, with and without AI. The developers predicted AI would speed them up by 24%. After the experiment they still believed AI had sped them up by 20%. The stopwatch said they were 19% slower. A thirty-nine-point gap between the story and the data — and the story is what went into the quarterly review.
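The thirty-nine-point figure is simple signed arithmetic on the numbers quoted above: a perceived speedup of 20% set against a measured slowdown of 19%. A small sketch, with a hypothetical helper name:

```python
def perception_gap(perceived_speedup_pct, measured_speedup_pct):
    """Gap in percentage points between believed and measured speedup.

    A slowdown is expressed as a negative speedup.
    """
    return perceived_speedup_pct - measured_speedup_pct


# Figures as quoted in the text: developers believed +20% afterwards,
# predicted +24% beforehand, and were measured at 19% slower.
gap_after = perception_gap(20, -19)
gap_before = perception_gap(24, -19)
```

`gap_after` is the 39 points cited in the text; against the upfront prediction of 24%, the same calculation gives 43.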

Why Hallucination Rate Is the Wrong Primary Metric for Production LLM Systems

· 8 min read
Tian Pan
Software Engineer

Your LLM's hallucination rate is 3%. Your users hate it anyway. This isn't a contradiction — it's a symptom of measuring the wrong thing.

Hallucination rate has become the default headline metric for LLM quality because it's easy to explain to stakeholders and straightforward to compute on a benchmark. But in production, it correlates poorly with what users actually care about: did the task get done, was the result trustworthy enough to act on, and did the system save them time?

AI Product Metrics That Don't Lie: Behavioral Signals Over Thumbs-Up Scores

· 9 min read
Tian Pan
Software Engineer

Your AI feature has a 4.2/5 satisfaction score. Users click thumbs-up 68% of the time. The A/B test shows task completion rate is up 12%. Your team ships it. Six weeks later, users have quietly routed around it for anything they actually care about.

This is metric theater. You optimized for signals that look like success but aren't. The feedback you collected came from the 8% of users who bother rating anything — skewed toward the delighted and the furious, silent on the vast middle who found the feature unreliable just often enough to stop trusting it.

Building AI features requires a different measurement philosophy than traditional software. The signals you instrument from day one determine whether you learn fast enough to improve or spend six months chasing a satisfaction score that doesn't move.

Measuring Real AI Coding Productivity: The Metrics That Survive the 90-Day Lag

· 9 min read
Tian Pan
Software Engineer

Most teams adopting AI coding tools hit the same wall. Month one looks like a success story: PR throughput is up, sprint velocity is climbing, and the engineering manager is putting together a slide deck to share with leadership. By month three, something has quietly gone wrong. Incidents creep up. Senior engineers are spending more time in review. A simple bug fix now requires understanding code nobody on the team actually wrote. The productivity gains have evaporated — but the measurement system never caught it.

The problem is that the metrics most teams reach for first — lines generated, PRs merged, story points burned — are the wrong unit of measurement for AI-assisted development. They measure the cost of producing code, not the cost of owning it. And AI has made production nearly free while leaving ownership costs untouched.