The A/B Test Winner Whose Verbose Output Triggered Your Click Handler More Than the Better Answer
A prompt-variant experiment runs on the production traffic of an AI-assisted search product. The success metric is a click on any suggested action in the response. Variant B ships responses that are roughly forty percent longer and contain more enumerated options. Its click-through rate is eleven percent higher, and the difference is significant at the 99.9 percent confidence level. Variant B is declared the winner and shipped.
A month later, the weekly customer satisfaction survey drops two points. Nobody connects it to the launch because the experiment has already been written up as a success and the team has moved on. A quarterly review eventually traces the satisfaction drop back to the prompt change, and the diagnosis lands hard: variant B won not because it gave users better answers but because longer answers contained more clickable surfaces. The click handler fired more often per impression because there was more to click, not because what the user read was more worth acting on.
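To make the diagnosis concrete, here is a minimal sketch of how the same click log can favor variant B on the experiment's metric and variant A once clicks are divided by the number of clickable actions shown. Everything in it is an assumption for illustration: the `Impression` record, the two function names, and the eight log rows are invented and are not the product's real schema or data.

```python
from dataclasses import dataclass


@dataclass
class Impression:
    """One served response (hypothetical log record, not a real schema)."""
    variant: str
    actions_shown: int  # clickable suggested actions rendered in the response
    clicks: int         # clicks recorded on any of those actions


def click_through_rate(impressions):
    """Share of impressions with at least one click: the experiment's metric."""
    if not impressions:
        return 0.0
    clicked = sum(1 for imp in impressions if imp.clicks > 0)
    return clicked / len(impressions)


def clicks_per_action_shown(impressions):
    """Total clicks divided by total clickable actions rendered.

    Dividing by surface area asks whether each suggested action was more
    worth clicking, rather than whether there was simply more to click.
    """
    shown = sum(imp.actions_shown for imp in impressions)
    if shown == 0:
        return 0.0
    return sum(imp.clicks for imp in impressions) / shown


# Made-up rows: variant B renders twice as many actions per response.
variant_a = [
    Impression("A", actions_shown=3, clicks=1),
    Impression("A", actions_shown=3, clicks=0),
    Impression("A", actions_shown=3, clicks=1),
    Impression("A", actions_shown=3, clicks=0),
]
variant_b = [
    Impression("B", actions_shown=6, clicks=1),
    Impression("B", actions_shown=6, clicks=1),
    Impression("B", actions_shown=6, clicks=0),
    Impression("B", actions_shown=6, clicks=1),
]

if __name__ == "__main__":
    for name, rows in (("A", variant_a), ("B", variant_b)):
        print(name, click_through_rate(rows), clicks_per_action_shown(rows))
```

On these invented rows, variant B has the higher click-through rate while variant A has the higher click rate per action shown. A surface-normalized rate is only one possible guardrail, and it does not measure answer quality on its own, but tracking it alongside the headline metric would have shown that the lift came from having more to click.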
