The Missing Arm: Your AI Experiment Has No 'AI-Off' Control
Look at the last six experiment readouts your team shipped on an AI feature. What were the arms? Odds are good you tested "new prompt vs. old prompt," or "GPT-5 router vs. GPT-4 fallback," or "reasoning model vs. fast model," or "with retrieval vs. without retrieval." You reported lift on engagement, task completion, or session length. You called it product impact. A quarter rolled by. Inference spend climbed. Nobody paused to ask the question the CFO eventually will: what would have happened if the feature simply weren't there?
That question is the missing arm. The lift your experiments keep measuring is "better AI vs. worse AI," but the one your business runs on is "AI vs. nothing" — or more uncomfortably, "AI vs. the three-line heuristic we never wrote down." These are different experiments with different conclusions, and most AI product programs in 2026 have only ever run the first one. The second is the one that tells you whether the feature is earning its inference bill.
Why the null arm keeps getting skipped
The mechanics are boring. A PM launches an AI feature. The feature looks promising in qualitative review. They ship it to 50% of users and call the other 50% the control. Engagement goes up 4%. The feature graduates. Next quarter, a new model comes out, and the experiment is "old prompt vs. new prompt" — the 4% baseline is already baked in, already silently assumed to be the floor. The feature-off condition has not been tested in eighteen months, during which the product, the user base, and the model have all changed.
This is how the baseline drifts out of view. Every experiment after the first one compares two variants that both include the AI. The "AI itself" becomes background radiation in the product. At some point someone mentions a holdout group, and a product manager, not unreasonably, points out that holding users back from a feature you've already validated feels like leaving revenue on the table. Statsig's guidance of keeping a long-running 1–2% holdout exists precisely to push back on this instinct, but culturally the bar for running one on an AI feature is higher than the bar for shipping a new AI feature — a revealing asymmetry.
There's also a softer reason. The null arm can produce findings nobody ordered. A team that ran a clean AI-off holdout at any point in the last two years has a non-trivial probability of discovering that the AI version underperforms the obvious rules-based fallback, or that the feature lifts engagement without lifting retention, or that the subgroup driving the aggregate win is 15% of users and the other 85% are indifferent or annoyed. Each of those findings is useful. None of them is what anyone hopes to read on Monday.
The shapes a null arm can take
"AI-off" is not one thing. Depending on the feature, the null arm can look like any of these, and the choice matters more than the mechanics:
- Feature absent. The product as it was before the AI surface existed. For a new-feature launch, this is the cleanest null: users don't see the thing at all. This is what marketing-incrementality literature calls the true holdout, and it answers the most expensive version of the question — is inference spend paying for something the product needed?
- Non-AI fallback. A rules-based, heuristic, or deterministic implementation of the same surface. For a "smart" autocomplete this is the old autocomplete. For a "smart" inbox sort this is the chronological inbox. For semantic search this is lexical BM25. This is the comparison that tells you whether the model is doing work a static rule couldn't do.
- Cheaper model. A deliberately weaker model holding the product shape constant — Haiku where you shipped Opus, an 8B open model where you shipped a frontier API. This is the arm most teams have actually run; it is useful for cost decisions, but it is not the null arm. Including it and calling the job done is the main trap.
A program that only runs the third bucket is measuring the margins of a decision it has already made. The first two arms ask whether the decision was correct at all.
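To make the distinction concrete, here is a minimal sketch of what registering all three null-arm shapes alongside the treatment could look like. The arm names, weights, and hash-based assignment are illustrative assumptions, not any particular experimentation platform's API.

```python
# Hypothetical arm layout for a new AI feature experiment. Names and weights
# are illustrative only; the point is that both true null arms exist at launch.
import hashlib

ARMS = {
    "treatment":              0.45,  # the shipped AI feature (frontier model)
    "cheaper_model":          0.45,  # same surface, deliberately weaker model
    "deterministic_fallback": 0.05,  # rules-based version of the same surface
    "feature_absent":         0.05,  # the product as it was before the feature
}

def assign_arm(user_id: str, experiment: str = "smart_inbox_v1") -> str:
    """Deterministic weighted assignment via hashing, so a given user always
    lands in the same arm for the life of the experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    point = int(digest, 16) / 16 ** len(digest)  # uniform in [0, 1)
    cumulative = 0.0
    for arm, weight in ARMS.items():
        cumulative += weight
        if point < cumulative:
            return arm
    return "treatment"  # guard against floating-point edge at 1.0
```

The two small arms at the bottom are the ones that answer whether the decision was correct; the cheaper-model arm only answers the margin question.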
What the null arm tends to show
Teams that bite the bullet and run true AI-off arms tend to surface a few patterns that recur across companies:
Engagement lifts without retention lifts. A feature that lands well and gets used enthusiastically in the first session doesn't always bring the user back. The a16z 2026 retention benchmarks make the point from the other side: "AI-powered" subscription apps churn roughly 30% faster than non-AI peers at the median. The aggregate story is that consumers sample AI features eagerly and unsubscribe from them quickly, which means a session-level engagement win can coexist with a retention loss if engagement is the only surface you measure.
The heuristic was already good enough. When the non-AI fallback is a sharpened version of what the product used to do, the AI variant sometimes wins on a narrow metric by a margin smaller than the variance between two prompt revisions. In practice this means the feature's lift is real, but most of it comes from the surface itself; the part that is specifically AI is a rounding error on top of a deterministic system nobody wanted to invest in.
The lift is concentrated in a sub-population. An aggregate 6% win hides a 15%-of-users segment where the feature does real work and an 85% segment where it's a no-op or mildly negative. This is where the null arm pays off the most: it reveals which users actually need the feature, which lets you target the rollout and stop paying inference on the population that doesn't benefit.
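A rough sketch of the readout that surfaces this, assuming event-level data with hypothetical user_segment, arm, and converted columns and arm values of "ai_on" / "ai_off":

```python
# Per-segment lift table. Column names and arm labels are assumptions about
# your logging, not a standard schema.
import pandas as pd

def lift_by_segment(df: pd.DataFrame) -> pd.DataFrame:
    """Conversion rate per segment for AI-on vs. AI-off, plus relative lift."""
    rates = (
        df.groupby(["user_segment", "arm"])["converted"]
          .mean()
          .unstack("arm")                     # columns: ai_off, ai_on
    )
    rates["relative_lift"] = rates["ai_on"] / rates["ai_off"] - 1
    rates["segment_share"] = df.groupby("user_segment").size() / len(df)
    return rates.sort_values("relative_lift", ascending=False)

# An aggregate +6% can decompose into one segment at +30% covering 15% of
# users and the rest near zero; this table is where that shows up.
```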
The capability drift was the real story. Run the same null arm a year later and the "lift" may have grown, shrunk, or inverted — not because your feature changed, but because the model underneath it did. Without a periodic re-baseline, you cannot tell the difference between your product getting better and the underlying model getting better.
None of these patterns are arguments against shipping AI features. They are arguments for knowing what you shipped.
How to hold the line on the null arm discipline
The null arm is a cultural problem dressed as a statistical one. The tactical advice is straightforward; the hard part is preventing it from getting negotiated away in launch meetings. A few practices that tend to survive the negotiation:
Make the null arm a launch gate, not a post-launch option. Every new AI feature gets a null arm — feature-absent or deterministic-fallback, chosen deliberately — as part of the initial experiment design, sized to detect a meaningful effect, and not downsized to 1% once someone notices that 10% of users aren't seeing the feature. This works because changing the experiment contract after launch is uncomfortable in a way that declining to add a null arm later is not.
Size by detectable effect, not by comfort. A 1% holdout is cheap to run and, for most products, statistically useless, because a sample that small can only detect effects far larger than the ones most AI features actually produce. If the true effect you care about is 3%, you need the sample that can detect 3%. Don't let the discomfort of "holding back" be the function that picks the sample size.
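A back-of-envelope version of that sizing, using statsmodels; the 20% baseline retention and the 3% relative lift below are placeholder numbers:

```python
# Minimum sample size per arm to detect a 3% relative lift on a 20% baseline
# retention rate, at alpha = 0.05 and 80% power. All inputs are placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20                  # assumed retention in the AI-off arm
lifted = baseline * 1.03         # the 3% relative lift you actually care about

effect = proportion_effectsize(lifted, baseline)   # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{n_per_arm:,.0f} users per arm")          # roughly 35,000 on these numbers
```

On those numbers the answer is roughly 35,000 users per arm, so a 1% holdout of a one-million-user product (about 10,000 users) cannot see the effect at all; the effect you need to detect picks the sample, not the other way around.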
Re-baseline on a schedule, not on a dare. A long-running null arm (3–6 months) with rolling cohorts rotates users through the AI-off state and catches the capability drift of the underlying model. This is boring to set up and essential to operate. Statsig and Eppo both ship tooling for this; the cost is low, the payoff is a standing answer to "is the AI still earning its keep."
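One low-ceremony way to implement the rotation is to salt a deterministic hash with the current quarter, so the held-out slice changes on a schedule without anyone having to remember to move users around. The salt, the 2% figure, and the quarterly cadence are assumptions, not a recipe:

```python
# Rolling AI-off cohort: each calendar quarter, a different ~2% slice of
# users is deterministically held out of the AI feature.
import hashlib
from datetime import date

def in_ai_off_cohort(user_id: str, today: date, holdout_pct: float = 0.02) -> bool:
    """True if this user is in the AI-off cohort for the current quarter."""
    quarter = f"{today.year}Q{(today.month - 1) // 3 + 1}"
    digest = hashlib.sha256(f"ai_off_rebaseline:{quarter}:{user_id}".encode())
    bucket = int(digest.hexdigest(), 16) % 10_000
    return bucket < holdout_pct * 10_000

# Because the salt includes the quarter, the held-out slice rotates every
# three months: no user stays AI-off for long, but the baseline stays fresh.
```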
Publish the null-arm number next to the lift. When the readout says "prompt B lifted engagement 4% vs. prompt A," it should also say "prompt A lifted engagement 2% vs. AI-off, last measured two months ago." The second number is what the business decision actually hinges on. Making it absent from the slide is how it disappears from the conversation.
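If the two lifts are relative and measured on comparable populations, they compose multiplicatively, which is exactly why the stale second number matters as much as the fresh first one:

```python
# How the two numbers on the slide compose, assuming both lifts are measured
# on the same metric and comparable populations (an assumption worth stating).
lift_b_vs_a   = 0.04   # prompt B vs. prompt A, from the current experiment
lift_a_vs_off = 0.02   # prompt A vs. AI-off, from the last re-baseline
lift_b_vs_off = (1 + lift_b_vs_a) * (1 + lift_a_vs_off) - 1
print(f"prompt B vs. AI-off ≈ {lift_b_vs_off:.1%}")   # ≈ 6.1%
```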
Pre-commit to what you'll do if the null arm shows nothing. This is the piece most programs skip. If the null arm shows the AI feature adds zero incremental retention, what happens? Sunset the feature? Cut to the cheapest model? Rescope to the sub-population that benefits? Decide before you run the experiment, because the post-hoc pressure to reinterpret a null result will be intense, and the honest answer is nearly always uncomfortable for someone who shipped the feature.
The conversation the CFO is eventually going to have
Finance teams are starting to ask a version of the question that engineering teams could have been asking themselves for the last two years: what is the incremental value of the AI feature, measured against its actual absence, net of its ongoing inference bill? That is an incrementality question, and the field has a mature vocabulary for answering it — lift, counterfactual, holdout, synthetic control. What it does not yet have is organizational muscle memory inside AI product programs.
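The shape of that calculation fits on one slide; every number below is a placeholder, and the per-user economics will differ wildly by product:

```python
# The CFO question as arithmetic: incremental value vs. ongoing inference
# bill. All inputs are made-up illustrative values.
monthly_active_users    = 1_000_000
incremental_retention   = 0.003     # +0.3 pp retained vs. the AI-off arm
value_per_retained_user = 40.0      # dollars per month
inference_cost_per_user = 0.18      # dollars per month, all-in

incremental_value = monthly_active_users * incremental_retention * value_per_retained_user
inference_bill    = monthly_active_users * inference_cost_per_user

print(f"incremental value ≈ ${incremental_value:,.0f}/mo "
      f"vs. inference bill ≈ ${inference_bill:,.0f}/mo")
# $120,000 vs. $180,000 on these made-up numbers -- the comparison the null
# arm exists to make possible.
```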
The teams that will handle this conversation well are the ones that ran their null arms quietly, on a schedule, for long enough to have a defensible incremental-value number for each AI feature in the product. The teams that will handle it badly are the ones whose last two years of experiment readouts are all "variant vs. variant" with no feature-off baseline anywhere in the record, and who will have to scramble to build one under time pressure, against a product they have no clean counterfactual for. The two groups will look identical on the engagement dashboard right up until the budget review.
The null arm is the cheapest possible insurance against that second outcome. Run it now, run it small, run it continuously, and publish what it shows. The alternative is a program that knows its inference bill to the dollar and its incremental value to within an order of magnitude — which is to say, a program that does not actually know whether it's working.
- https://docs.statsig.com/experiments/holdouts-introduction
- https://www.geteppo.com/blog/holdouts-measuring-experiment-impact-accurately
- https://posthog.com/tutorials/holdout-testing
- https://cxl.com/blog/hold-out-groups/
- https://amplitude.com/explore/experiment/incrementality-testing
- https://www.triplewhale.com/blog/incrementality-testing-methods
- https://a16z.com/ai-retention-benchmarks/
- https://techcrunch.com/2026/03/10/ai-powered-apps-struggle-with-long-term-retention-new-report-shows/
- https://economics.mit.edu/sites/default/files/inline-files/draft_copilot_experiments.pdf
- https://arxiv.org/abs/2302.06590
- https://www.tecton.ai/blog/rules-heuristics-for-production-machine-learning/
- https://www.ibm.com/think/insights/why-ai-projects-fail-science-experiment-trap
