The A/B Test Powered by Token Counts Instead of Outcomes

· 13 min read
Tian Pan
Software Engineer

A team I worked with shipped a prompt change that reduced output tokens by 22%. The experiment dashboard lit up green — variance was tight, the p-value was clean, and the cost savings extrapolated to six figures a year. Two weeks later, a product analyst poking at conversion funnels flagged that the downstream task completion rate had dropped 11% in the same window. The shorter outputs were leaving out a clarifying step that users had been quietly relying on to know what to click next.

The experiment platform had not lied. It had reported the exact metric the team configured as primary, and that metric had moved in the right direction. The problem was that the metric measured something the team did not actually care about. Tokens were cheap to count, the experiment infra had a turnkey integration for them, and outcomes were hard to instrument — so the team picked what the platform made easy. The result was a clean win on the dashboard and a regression in the product.

This pattern shows up across AI-powered features the same way that vanity-metric A/B testing showed up across the rest of the web a decade earlier. The mechanism is identical: when the easy-to-measure proxy and the hard-to-measure outcome are not perfectly correlated, the experimentation infrastructure will optimize for whichever metric it makes easy to ship on. A team that does not force itself to measure the outcome ends up shipping local maxima it cannot distinguish from regressions.

The Proxy Substitution Trap

The substitution happens almost without anyone noticing. The team starts with a question phrased at the outcome layer — "does this prompt change make the assistant more useful?" — and ends with an experiment configured at the proxy layer — "does this prompt change reduce average output tokens?" The translation looks reasonable on the way down because tokens correlate with cost, cost correlates with margin, and margin correlates with "useful" in a sufficiently abstract sense. But each step throws away signal, and by the time the dashboard is built, the question being answered bears only a passing resemblance to the question that was asked.

Goodhart's Law describes the same dynamic in the abstract: when a measure becomes a target, it ceases to be a good measure. The literature on it gets dense, but the engineering version is simple. The moment your experiment promotion criteria depend on a proxy, your team will start producing changes that move the proxy without moving the underlying thing the proxy was supposed to track. This is not malice. It is what optimization does. Reinforcement learning research has documented this so thoroughly that the field has its own name for it — reward hacking — and the same effect operates in human teams running prompt experiments.

The reason it is hard to resist is that proxy metrics are genuinely useful as secondary signals. Token counts tell you something real about cost. Latency tells you something real about user experience. The mistake is not measuring them; it is letting them carry the weight of a primary metric they were never designed to bear. A team that measures token reduction as a guardrail to make sure cost does not blow up is doing it right. A team that ships changes because token reduction moved is doing it wrong, and the dashboard cannot tell the difference.

Why LLM Features Are Especially Vulnerable

Three properties of LLM-powered features make this failure mode worse than in classical product experimentation.

The first is that the output is high-dimensional. A traditional A/B test measures whether a button got clicked, and the click is the outcome. An LLM A/B test measures whether a response was generated, and the response is a thousand tokens with internal structure that maps to user behavior in ways the experiment framework does not see. The natural primary metric — was this response good — is not directly observable. So teams reach for whatever is observable, and what is observable is the response's surface properties: token count, latency, refusal rate, format compliance. None of these is the outcome.

The second is that cost pressure is constant and visible. Token spend is a line item the CFO asks about. Task completion rate is a metric the product team has to build infrastructure to measure. The asymmetry of organizational attention means that token-related metrics get instrumented first, dashboards get built first, OKRs get tied to them first, and by the time anyone asks "but did users get what they wanted," the experiment culture has already calcified around what was easy.

The third is that the path from output quality to user outcome is mediated by a longer chain than in classical features. A shorter output might cause a user to ask a follow-up question instead of completing the task, which raises the conversation length, which raises the per-session cost, which cancels the savings from the original prompt change — but none of that shows up in a single-turn A/B test scoped to first-response token count. The cost win is real at the level the test measured, and the cost loss is real at the level the user experiences, and the experiment platform reports only the first one.
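Here is that chain as back-of-the-envelope arithmetic. This is a minimal sketch with invented numbers: the 22% reduction echoes the anecdote at the top of this post, while the token counts and turn counts are hypothetical, picked only to show how a per-response saving can disappear at the session level. It also assumes every assistant turn costs the same number of output tokens, which real conversations will not.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    tokens_per_response: float  # average output tokens per assistant turn
    turns_per_session: float    # average assistant turns before the user stops

    def tokens_per_session(self) -> float:
        # Simplification: every turn is assumed to cost the same.
        return self.tokens_per_response * self.turns_per_session


def relative_change(new: float, old: float) -> float:
    return (new - old) / old


# Hypothetical inputs: a 22% shorter response that triggers more follow-ups.
control = Variant("control", tokens_per_response=500.0, turns_per_session=1.2)
treatment = Variant("treatment", tokens_per_response=390.0, turns_per_session=1.6)

# What a single-turn test scoped to the first response sees.
first_response_delta = relative_change(
    treatment.tokens_per_response, control.tokens_per_response
)

# What the bill sees once follow-up turns are counted.
session_delta = relative_change(
    treatment.tokens_per_session(), control.tokens_per_session()
)

print(f"first-response tokens: {first_response_delta:+.1%}")
print(f"tokens per session:    {session_delta:+.1%}")
```

With these made-up inputs, the first-response metric falls 22% while tokens per session rise by about 4%. A test scoped to the first response would report only the former.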

A Forced Taxonomy of Metrics

The first concrete fix is to make the proxy/outcome distinction structural rather than cultural. Cultural rules — "remember to also look at the user impact" — do not survive contact with quarterly reviews. Structural rules — "the experiment platform will not let you mark this experiment shippable without a classified primary metric" — do.

The taxonomy that has worked well is three-tier:

  • Outcome metrics: measurements of whether the user got what they were trying to get. Task completion rate, accepted code edit rate, ticket resolved without escalation, search result click followed by a dwell time long enough to suggest the user was satisfied. These are expensive to instrument and slow to measure, and they are the only metrics on which an experiment can be declared a win.
  • Proxy metrics: cheap, fast, technical measurements that correlate with outcomes but are not outcomes. Token counts, latency, response length, format compliance, refusal rate. These are useful for debugging and for understanding the mechanism of a win, but a movement here alone is not sufficient evidence to ship.
  • Guardrail metrics: measurements of harm that the change must not cause. Cost per resolved task, p95 latency, hallucination rate on a held-out eval, user-reported negative feedback. These have looser detection thresholds because they exist to catch large regressions, not to drive ship decisions.

The taxonomy only does work if the experiment platform enforces it — if the "promote" button is greyed out when only proxy metrics have moved, and only the user-outcome layer can light it up green. Teams that classify metrics in a slide deck and then fail to enforce the classification at the tool layer end up with the same pre-taxonomy problem in different clothes. The point of the taxonomy is to remove the option of accidentally shipping on a proxy, not to remind people that they could.
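As an illustration of what enforcement at the tool layer might look like, here is a minimal sketch of a promotion gate. Every name in it (`Tier`, `MetricResult`, `can_promote`) is hypothetical and not taken from any real experiment platform, and the `improved` and `regressed` flags stand in for whatever significance test the platform already runs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Tier(Enum):
    OUTCOME = "outcome"
    PROXY = "proxy"
    GUARDRAIL = "guardrail"


@dataclass(frozen=True)
class MetricResult:
    name: str
    tier: Tier
    improved: bool   # significant move in the desired direction
    regressed: bool  # significant move in the harmful direction


def can_promote(results: List[MetricResult]) -> bool:
    """Hypothetical gate: only outcome metrics can unlock the promote button."""
    outcomes = [r for r in results if r.tier is Tier.OUTCOME]
    guardrails = [r for r in results if r.tier is Tier.GUARDRAIL]

    # No classified outcome metric means the experiment is not shippable.
    if not outcomes:
        return False
    # Any regression on an outcome or guardrail metric blocks promotion.
    if any(r.regressed for r in outcomes + guardrails):
        return False
    # Proxy metrics are deliberately never consulted here.
    return any(r.improved for r in outcomes)


# The experiment from the opening anecdote: tokens down, completions down.
anecdote = [
    MetricResult("output_tokens", Tier.PROXY, improved=True, regressed=False),
    MetricResult("task_completion_rate", Tier.OUTCOME, improved=False, regressed=True),
]
proxy_only = [
    MetricResult("output_tokens", Tier.PROXY, improved=True, regressed=False),
]
healthy = [
    MetricResult("task_completion_rate", Tier.OUTCOME, improved=True, regressed=False),
    MetricResult("cost_per_resolved_task", Tier.GUARDRAIL, improved=False, regressed=False),
]

print(can_promote(anecdote), can_promote(proxy_only), can_promote(healthy))
```

The design choice worth noticing is that the function never reads proxy results at all. A proxy can still appear on the dashboard for debugging, but it has no code path to the ship decision.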

This also forces a useful conversation about which metrics are outcomes for a given feature. A code completion product's outcome is the accepted-and-retained edit, not the edit suggestion shown. A search product's outcome is the satisfied query, not the result click. A support agent's outcome is the resolved ticket without escalation, not the response that closes the conversation. Different products have different outcomes, and the taxonomy work is mostly the work of naming them precisely.

The Slow Cohort That Anchors the Decision

A/B tests on AI features tend to run on short windows because the AI feature itself iterates fast. Two weeks is a generous experiment duration in a team shipping prompt changes weekly. The problem is that the outcome metrics you actually care about often have longer feedback loops than the experiment supports.

A user who got a slightly worse response on Tuesday might not have noticed it; they might have asked a follow-up that the dashboard counted as a separate session; they might have churned three weeks later for reasons the experiment will never connect back to that prompt. The two-week window is too short to detect the cohort effect, and the A/B platform is set up to declare a verdict at the end of the window regardless.
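The "separate session" problem in particular is a counting artifact, and it is cheap to illustrate. The sketch below stitches a user's events back into one logical session whenever the pause between them is short. The function name, the event shape, and the ten-minute gap are all assumptions made for the example, not something a specific analytics tool prescribes.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def stitch_sessions(
    events: List[Tuple[str, datetime]], gap: timedelta
) -> Dict[str, List[List[datetime]]]:
    """Group each user's events into sessions.

    A new session starts only when the pause since that user's previous
    event exceeds `gap`; shorter pauses are treated as follow-ups.
    """
    by_user: Dict[str, List[datetime]] = {}
    for user, ts in events:
        by_user.setdefault(user, []).append(ts)

    stitched: Dict[str, List[List[datetime]]] = {}
    for user, stamps in by_user.items():
        stamps.sort()
        sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - sessions[-1][-1] > gap:
                sessions.append([ts])    # long pause: a genuinely new session
            else:
                sessions[-1].append(ts)  # short pause: a follow-up
        stitched[user] = sessions
    return stitched


# Hypothetical event log: a question, a follow-up 3 minutes later,
# and an unrelated question 40 minutes after the first.
events = [
    ("u1", datetime(2024, 1, 2, 10, 0)),
    ("u1", datetime(2024, 1, 2, 10, 3)),
    ("u1", datetime(2024, 1, 2, 10, 40)),
]
sessions = stitch_sessions(events, gap=timedelta(minutes=10))
print(len(sessions["u1"]), [len(s) for s in sessions["u1"]])
```

Stitching does nothing for the three-week churn case, but it does stop a follow-up question from being scored as a fresh, successful first response.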

The pattern that closes this gap is a longitudinal outcome cohort that runs alongside the A/B but on a slower clock. The A/B determines what gets tentatively shipped; the cohort determines what stays shipped. Concretely:
