The A/B Test Winner Whose Verbose Output Triggered Your Click Handler More Than the Better Answer
A prompt-variant experiment runs on the production traffic of an AI-assisted search product. The success metric is a click on any suggested action in the response. Variant B ships responses that are roughly forty percent longer, with more enumerated options. Its click-through rate is eleven percent higher, and the result is significant at the 99.9 percent confidence level. The experiment is declared a winner and shipped.
A month later, the weekly customer satisfaction survey drops two points. Nobody connects it to the launch because the experiment has already been written up as a success and the team has moved on. A quarterly review eventually traces the satisfaction drop back to the prompt change, and the diagnosis lands hard: variant B won not because it gave users better answers but because longer answers contained more clickable surfaces. The click handler fired more often per impression because there was more to click, not because what the user read was more worth acting on.
The mistake was not in the statistics. The p-value was real, the lift was real, the sample size was honest. The mistake was that the success metric measured the shape of the response, and the shape of the response was something the prompt variant could change directly without changing the underlying quality. The experiment was a fair fight on the wrong axis.
How Shape Wins Without Anyone Cheating
Engagement metrics on AI products carry a hidden coupling between the surface area of the output and the probability of any single engagement event firing. Click-through rate, action invocation rate, suggested-followup acceptance rate — each of these is computed against an event that the response itself produces. A response with three suggested actions has three chances to trigger the metric. A response with seven has seven. The user did not become more interested; the response became more clickable.
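The coupling can be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each suggested action is clicked independently with the same fixed probability; real users do not behave this way, and the function name `p_any_click` and the 5 percent figure are hypothetical. Under that assumption, the chance that an impression registers at least one click is 1 - (1 - p)^n for n actions.

```python
def p_any_click(p_per_action: float, n_actions: int) -> float:
    """Probability that at least one of n_actions is clicked.

    Assumes (hypothetically) independent clicks with an identical
    per-action probability. This is a simplification, not a user model.
    """
    if not 0.0 <= p_per_action <= 1.0:
        raise ValueError("p_per_action must be in [0, 1]")
    if n_actions < 0:
        raise ValueError("n_actions must be non-negative")
    return 1.0 - (1.0 - p_per_action) ** n_actions


if __name__ == "__main__":
    p = 0.05  # illustrative per-action click probability, not measured data
    for n in (3, 7):
        print(f"{n} actions: P(at least one click) = {p_any_click(p, n):.3f}")
```

With the per-action probability held constant at 5 percent, going from three actions to seven roughly doubles the per-impression click rate (about 14 percent versus about 30 percent) even though nothing about user interest has changed.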
This is not unique to LLMs. Product teams have measured engagement on long-form content for years and have known that pagination, infinite scroll, and recommendation carousels all inflate engagement counts by inflating the inventory of things to engage with. What is new is how cheaply an LLM can change the output's shape. A one-line prompt edit can take an answer from three bullets to ten. There is no design review, no engineering ticket, no UX trade-off to negotiate. The variation space of "how much surface area does the response present" is wide open to any prompt experiment, and any metric tied to per-impression engagement will silently reward expansion until something else breaks.
The phenomenon also extends beyond clicks. Time-on-page goes up when responses are longer to read. Copy-to-clipboard rates rise when there are more discrete blocks to copy. Thumbs-up button presses can increase because verbose answers feel more "complete" even when they are wrong in the same places a short answer would be wrong, just at greater length. Anything you measure that is downstream of the response producing more text will reward producing more text.
Why the Length Bias Has a Familiar Provenance
If the verbosity-rewards-itself failure mode sounds familiar, it should. The LLM evaluation community has been wrestling with length bias in judge-based comparisons for years. Models like GPT-4, when asked to pick between two candidate responses, systematically prefer the longer one even when an explicit rubric tells them to value brevity. The bias is robust enough that practitioners now publish length-normalized win rates as a matter of routine, and the literature names the phenomenon explicitly as length bias or verbosity bias.
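One quick way to see whether a judge is leaning on length is to split pairwise results by which response was longer. The sketch below is a simple stratification, not the regression-based length control used in published benchmarks; the `Judgment` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Judgment:
    """One pairwise comparison by a judge (hypothetical record shape)."""
    candidate_tokens: int
    baseline_tokens: int
    candidate_won: bool


def win_rate(judgments: List[Judgment]) -> Optional[float]:
    """Fraction of comparisons the candidate won, or None if empty."""
    if not judgments:
        return None
    return sum(j.candidate_won for j in judgments) / len(judgments)


def win_rate_by_length(judgments: List[Judgment]) -> Dict[str, Optional[float]]:
    """Candidate win rate overall and split by relative length."""
    longer = [j for j in judgments if j.candidate_tokens > j.baseline_tokens]
    not_longer = [j for j in judgments if j.candidate_tokens <= j.baseline_tokens]
    return {
        "overall": win_rate(judgments),
        "candidate_longer": win_rate(longer),
        "candidate_not_longer": win_rate(not_longer),
    }
```

A large gap between the `candidate_longer` and `candidate_not_longer` rates is a hint, not proof, that the judge is rewarding length rather than quality; longer answers can also be better for legitimate reasons.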
The same dynamic is showing up one layer out, in product analytics. The judge in the production case is not an LLM; it is the click handler. The mechanism is different — a user clicks more because there are more buttons, not because they cognitively prefer longer text — but the failure mode is structurally identical. A metric that is downstream of response shape is going to be biased toward the response shape that maximizes its own opportunities to fire. The team that ships the longer variant has discovered the product version of the same length bias that has been distorting LLM-as-judge benchmarks since the chat era began.
This connection matters because it tells you the fix is not a one-off correction for one experiment. The fix has to be a property of the experimentation system itself. Any team running prompt or model variants against an engagement metric on a product whose output shape is variable is exposed. The longer variant will tend to win. The team that does not track response length and clickable surface area as covariates in every experiment will keep shipping verbosity until a downstream satisfaction signal forces a reckoning months later.
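As a rough sketch of what treating output shape as a covariate could look like, the code below reports click-through rate per variant both overall and within strata of equal action count. The `Impression` record, its fields, and both function names are assumptions made for illustration, not part of any particular experimentation platform.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Impression:
    """One served response (hypothetical logging schema)."""
    variant: str
    n_actions: int   # number of clickable suggested actions shown
    clicked: bool    # whether any action was clicked


def ctr_overall(impressions: List[Impression]) -> Dict[str, float]:
    """Per-variant click-through rate, ignoring response shape."""
    totals = defaultdict(lambda: [0, 0])  # variant -> [clicks, impressions]
    for imp in impressions:
        totals[imp.variant][0] += imp.clicked
        totals[imp.variant][1] += 1
    return {v: c / n for v, (c, n) in totals.items()}


def ctr_by_stratum(impressions: List[Impression]) -> Dict[Tuple[str, int], float]:
    """Click-through rate per (variant, n_actions) stratum."""
    totals = defaultdict(lambda: [0, 0])
    for imp in impressions:
        key = (imp.variant, imp.n_actions)
        totals[key][0] += imp.clicked
        totals[key][1] += 1
    return {k: c / n for k, (c, n) in totals.items()}
```

If a variant's overall lift shrinks or vanishes once impressions with the same number of actions are compared, the lift is probably coming from surface area. Because the variant itself determines how many actions are shown, this readout is a diagnostic rather than an unbiased causal estimate.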
The Metric Was Doing Exactly What It Was Asked To
