17 posts tagged with "metrics"

Org-Level Goodhart: When Teams Game AI Adoption Metrics

May 5, 2026 · 9 min read

Software Engineer

According to one study, 95% of generative AI pilots technically succeed—and 74% of companies using generative AI still haven't shown tangible business value. That gap is not a coincidence. It's a measurement problem dressed up as a technology problem, and most organizations won't diagnose it correctly because the people doing the measuring are the same ones being measured.

This is Goodhart's Law at the organizational level: once an AI adoption metric becomes a performance target, it ceases to measure what you care about. The metric keeps going up. The underlying outcome stays flat or gets worse.

Accept Rate Is a Vanity Metric: Your Copilot ROI Hides in the 90 Seconds After the Keystroke

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

The dashboard says your engineers accepted 45% of AI suggestions last quarter. Leadership reads that as "45% of a developer's time saved" and signs the renewal. The engineers, meanwhile, are quietly rewriting half of what they accepted, debugging the other half, and wondering why their sprints still feel the same length. Both sides are looking at the same number. Only one of them knows it is the wrong number.

The most quoted study of 2025 should have ended the vendor-dashboard era on its own. METR measured experienced open-source maintainers working on real issues in their own repos, with and without AI. The developers predicted AI would speed them up by 24%. After the experiment they still believed AI had sped them up by 20%. The stopwatch said they were 19% slower. A thirty-nine-point gap between the story and the data — and the story is what went into the quarterly review.
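The gap is simple arithmetic, and it helps to see it written out. The sketch below uses only the three figures quoted above and, like the text, treats them as signed percentage changes in speed; the variable and function names are illustrative and do not come from the METR study.

```python
# Figures quoted above, as signed percentage changes in speed
# (positive = faster with AI, negative = slower). Names are illustrative.
predicted_speedup = 24   # what developers forecast before the experiment
perceived_speedup = 20   # what they still believed afterwards
measured_speedup = -19   # what the stopwatch recorded (19% slower)


def perception_gap(perceived: int, measured: int) -> int:
    """Percentage-point distance between the belief and the measurement."""
    return perceived - measured


gap = perception_gap(perceived_speedup, measured_speedup)
print(f"perception gap: {gap} percentage points")  # perception gap: 39 percentage points
```

The same function applied to the pre-experiment forecast gives an even wider distance, which is why self-reported speedups make a poor input to a renewal decision.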

Why Hallucination Rate Is the Wrong Primary Metric for Production LLM Systems

April 19, 2026 · 8 min read

Tian Pan

Software Engineer

Your LLM's hallucination rate is 3%. Your users hate it anyway. This isn't a contradiction — it's a symptom of measuring the wrong thing.

Hallucination rate has become the default headline metric for LLM quality because it's easy to explain to stakeholders and straightforward to compute on a benchmark. But in production, it correlates poorly with what users actually care about: did the task get done, was the result trustworthy enough to act on, and did the system save them time?

AI Product Metrics That Don't Lie: Behavioral Signals Over Thumbs-Up Scores

April 16, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature has a 4.2/5 satisfaction score. Users click thumbs-up 68% of the time. The A/B test shows task completion rate is up 12%. Your team ships it. Six weeks later, users have quietly routed around it for anything they actually care about.

This is metric theater. You optimized for signals that look like success but aren't. The feedback you collected came from the 8% of users who bother rating anything — skewed toward the delighted and the furious, silent on the vast middle who found the feature unreliable just often enough to stop trusting it.
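To see how thin that feedback is, combine the two figures above. This is a rough sketch that assumes the 68% thumbs-up rate is computed only over the 8% of users who rate anything; the function name is hypothetical.

```python
def share_of_all_users(rating_rate: float, thumbs_up_rate: float) -> float:
    """Fraction of ALL users who left a thumbs-up, given the fraction who
    rate at all and the thumbs-up rate among those raters."""
    return rating_rate * thumbs_up_rate


rating_rate = 0.08     # users who bother rating anything (figure from the text)
thumbs_up_rate = 0.68  # thumbs-up share, assumed to be among raters only

positive_share = share_of_all_users(rating_rate, thumbs_up_rate)
silent_share = 1 - rating_rate

print(f"thumbs-up from all users: {positive_share:.1%}")  # 5.4%
print(f"users who never rated:    {silent_share:.1%}")    # 92.0%
```

Under that assumption, the headline 68% describes roughly one user in twenty, and says nothing about the 92% who never rated.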

Building AI features requires a different measurement philosophy than traditional software. The signals you instrument from day one determine whether you learn fast enough to improve or spend six months chasing a satisfaction score that doesn't move.

Measuring Real AI Coding Productivity: The Metrics That Survive the 90-Day Lag

April 14, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams adopting AI coding tools hit the same wall. Month one looks like a success story: PR throughput is up, sprint velocity is climbing, and the engineering manager is putting together a slide deck to share with leadership. By month three, something has quietly gone wrong. Incidents creep up. Senior engineers are spending more time in review. A simple bug fix now requires understanding code nobody on the team actually wrote. The productivity gains have evaporated — but the measurement system never caught it.

The problem is that the metrics most teams reach for first — lines generated, PRs merged, story points burned — are the wrong unit of measurement for AI-assisted development. They measure the cost of producing code, not the cost of owning it. And AI has made production nearly free while leaving ownership costs untouched.
