
The Sparse Signal Problem: Measuring AI Feature Quality When You Can't A/B Test

· 11 min read
Tian Pan
Software Engineer

You've shipped an AI writing assistant to your enterprise customers. Twenty-three people use it every day. Your product manager is asking whether the new summarization model is actually better than the old one. You have two weeks before the next sprint, and you need a decision.

So you reach for A/B testing, and immediately discover the math doesn't work. To detect a 10% relative improvement over a 20% baseline task-completion rate (20% → 22%) at 80% power and a two-sided α of 0.05, you need roughly 6,500 users per arm. At 23 daily users, accumulating 13,000 observations would take more than 560 days. The feature will be deprecated before the test concludes.
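As a sanity check, here is the standard two-proportion sample-size calculation behind those numbers, sketched as a small function (the function name and the 23-users-per-day figure are illustrative, not from any library):

```python
from math import ceil

# Sample size per arm for a two-proportion z-test (normal approximation).
# z-values hardcoded: 1.959964 for two-sided alpha=0.05, 0.841621 for 80% power.
def users_per_arm(p_baseline, relative_lift, z_alpha=1.959964, z_power=0.841621):
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_lift)           # e.g. 0.20 -> 0.22
    variance = p1 * (1 - p1) + p2 * (1 - p2)        # sum of Bernoulli variances
    delta = p2 - p1                                 # absolute difference to detect
    return ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

n = users_per_arm(0.20, 0.10)       # roughly 6,500 users per arm
days = ceil(2 * n / 23)             # total observations needed at 23 users/day
```

Run it and the conclusion is hard to escape: both arms together need about 13,000 observations, which at 23 users a day is well over a year of traffic.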

This is the sparse signal problem. It isn't a B2B startup edge case. Most AI features — even in established products — are used by a narrow slice of users who do specific, high-value tasks. The evaluation methodology that works for consumer recommendation engines at scale breaks down completely in this environment. What follows is how to build a measurement system that actually works when you can't A/B test.