Why A/B Tests Fail for AI Features (And What to Use Instead)
Your AI feature shipped. The A/B test ran for two weeks. The treatment group looks better — 4% lift in engagement, p-value under 0.05. You ship it to everyone.
Six weeks later, the gains have evaporated. Engagement is back where it started, or lower. Your experiment said one thing; reality said another.
This is not a corner case. It is the default outcome when you apply standard two-sample A/B testing to AI-powered features without accounting for the ways these features break the assumptions baked into that methodology. The failure modes are structural, not statistical — you can run your experiment perfectly by the textbook and still get a wrong answer.
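For concreteness, here is the textbook procedure the opening scenario describes: a two-sample proportion z-test on engagement rates. The counts below are hypothetical, chosen to reproduce a roughly 4% relative lift that clears p < 0.05. The point is that nothing in this calculation is wrong by the book, and it can still mislead.

```python
import math

# Hypothetical counts for a two-week experiment: users who engaged / users exposed.
control_engaged, control_n = 10_000, 100_000      # 10.0% baseline engagement
treatment_engaged, treatment_n = 10_400, 100_000  # 10.4% -> a 4% relative lift

p_c = control_engaged / control_n
p_t = treatment_engaged / treatment_n

# Textbook two-sample proportion z-test with a pooled standard error.
p_pool = (control_engaged + treatment_engaged) / (control_n + treatment_n)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))
z = (p_t - p_c) / se

# Two-sided p-value from the standard normal CDF.
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"relative lift: {(p_t - p_c) / p_c:.1%}")  # ~4.0%
print(f"z = {z:.2f}, p = {p_value:.4f}")          # z ~ 2.96, p ~ 0.003: "significant"
```

Every step here is standard, and the p-value is real. What the test cannot tell you is whether the effect it measured will persist once the feature meets the dynamics described below.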
