The A/B Test Powered by Token Counts Instead of Outcomes
A team I worked with shipped a prompt change that reduced output tokens by 22%. The experiment dashboard lit up green — variance was tight, the p-value was clean, and the cost savings extrapolated to six figures a year. Two weeks later, a product analyst poking at conversion funnels flagged that the downstream task completion rate had dropped 11% in the same window. The shorter outputs were leaving out a clarifying step that users had been quietly relying on to know what to click next.
The experiment platform had not lied. It had reported the exact metric the team configured as primary, and that metric had moved in the right direction. The problem was that the metric measured cost, not the thing the team actually cared about: whether users finished their task. Tokens were cheap to count, the experiment infra had a turnkey integration for them, and outcomes were hard to instrument, so the team picked what the platform made easy. The result was a clean win on the dashboard and a regression in the product.
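One way to make that failure mode harder to hit is to treat the outcome as a guardrail that can veto a win on the cheap metric. The sketch below is a minimal illustration, not the team's actual tooling: `ArmSummary`, `ship_decision`, and the 2% guardrail threshold are all hypothetical names and values. The sample numbers are made up to mirror the story, and they read the 11% drop as a relative change, which is an assumption. It also skips significance testing entirely, which a real experiment readout would need.

```python
from dataclasses import dataclass


@dataclass
class ArmSummary:
    """Per-arm aggregates for one experiment arm (hypothetical schema)."""
    sessions: int
    total_output_tokens: int
    completed_tasks: int

    @property
    def tokens_per_session(self) -> float:
        return self.total_output_tokens / self.sessions

    @property
    def completion_rate(self) -> float:
        return self.completed_tasks / self.sessions


def relative_change(control: float, treatment: float) -> float:
    """Change of treatment relative to control, e.g. -0.22 for a 22% drop."""
    return (treatment - control) / control


def ship_decision(control: ArmSummary, treatment: ArmSummary,
                  max_completion_drop: float = 0.02) -> str:
    """Check the outcome guardrail first, then the cost metric.

    max_completion_drop is an illustrative threshold (assumption):
    the largest relative drop in completion rate the team will tolerate.
    """
    token_change = relative_change(control.tokens_per_session,
                                   treatment.tokens_per_session)
    completion_change = relative_change(control.completion_rate,
                                        treatment.completion_rate)

    # The guardrail is evaluated before any cost win is considered.
    if completion_change < -max_completion_drop:
        return "block: outcome guardrail regressed"
    if token_change < 0:
        return "ship: cheaper with outcomes intact"
    return "hold: no cost win"


# Illustrative numbers only: 22% fewer tokens, 11% lower completion rate.
control = ArmSummary(sessions=10_000, total_output_tokens=5_000_000,
                     completed_tasks=6_000)
treatment = ArmSummary(sessions=10_000, total_output_tokens=3_900_000,
                       completed_tasks=5_340)
print(ship_decision(control, treatment))
```

With these inputs the token metric alone would say ship, and the guardrail blocks it. The ordering is the point: the outcome check runs first, so a cost win never gets the chance to mask a product regression.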
