
The Feature Flag Your Model Already Learned to Predict From the Inputs It Could See

10 min read
Tian Pan
Software Engineer

The treatment arm shipped because the dashboard said "+4% conversion, p < 0.01, n = 2.3M." Six weeks after the global rollout the lift was gone, and the team filed the post-mortem under "scale effects" because nothing else fit. The actual cause was sitting in the prompt assembler the whole time: the routing hash that decided arm assignment was derived from a user-tier attribute, and the same attribute was being interpolated into the prompt template three lines later. The model was reading the assignment in band. The "treatment" wasn't the prompt change. The treatment was the population the prompt change happened to attract.

This is a failure mode that doesn't exist in the experimentation playbooks teams inherit from the web era. A button color does not read the user's tier and decide to behave differently. A prompt does. Once your treatment is a string that the model interprets, every input that touches the routing decision and also touches the prompt becomes a back channel the experiment cannot close.

The Model Is a Participant, Not an Outcome

In a classical A/B test, the treatment is a property of the rendered page and the user is a passive recipient. You change a CTA, you measure clicks, you call it a day. The randomization unit (the user) and the treatment-application unit (the page) are separated by a network boundary, and nothing about the user's attributes leaks into the treatment itself unless you choose to leak it.

LLM experiments collapse that boundary. The treatment is a prompt, and the prompt is a function of the user's attributes. You don't render a different paragraph; you render a different instruction set that the model then conditions on along with everything else in the request. If your prompt template branches on user tier earlier in the assembler (and almost every production assembler does, because enterprise users get longer reasoning budgets and free-tier users get terser refusals), then the model sees the assignment axis directly. It doesn't need to know which arm it's in. It already knows what kind of user it's talking to, and that knowledge is collinear with arm membership by construction.

This is the LLM analog of confounding in classical causal inference: when the assignment signal correlates with a variable that also affects the outcome, the estimated effect of the assignment is contaminated by the effect of that variable. Randomization protects you against confounders you didn't measure, but only if the assignment itself is orthogonal to the inputs your treatment ingests. With an LLM, that orthogonality is almost never free. You have to design it in.

How the Lift Gets Manufactured

Walk through the failure mode end-to-end. Your routing layer hashes the user ID with a salt to assign an arm. So far, fine. But the salt is the experiment ID, and the hash bucket is keyed on a tuple that includes user tier, because last year someone wanted to "make sure each tier gets balanced exposure." The same user-tier value gets passed into the prompt assembler, which selects a system-prompt fragment ("you are a senior advisor for enterprise customers" versus "you are a helpful assistant"). The model now receives a prompt whose structure is correlated, at the sub-arm level, with the population variable.
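One way to keep that back channel visible is to make the router's inputs explicit and compare them against the prompt assembler's inputs. The following is a minimal sketch, not a description of any particular routing layer: `assign_arm`, `shared_inputs`, and the field names are hypothetical, and the assignment is assumed to be a plain salted hash of the user ID.

```python
import hashlib


def assign_arm(experiment_id: str, user_id: str, treatment_share: float = 0.5) -> str:
    """Assign an arm from the experiment ID (the salt) and the user ID only.

    No user attribute such as tier enters the hash, so the assignment
    shares no input with the prompt assembler.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    # Reduce the first 8 bytes of the digest to a bucket in 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return "treatment" if bucket < treatment_share * 10_000 else "control"


def shared_inputs(routing_fields: set[str], prompt_fields: set[str]) -> set[str]:
    """Return the fields read by both the router and the prompt assembler."""
    return routing_fields & prompt_fields


# Hypothetical field inventories for the setup described above.
leaky_routing = {"user_id", "user_tier"}
clean_routing = {"user_id"}
prompt_inputs = {"user_tier", "locale", "retrieved_docs"}

print(shared_inputs(leaky_routing, prompt_inputs))  # {'user_tier'}
print(shared_inputs(clean_routing, prompt_inputs))  # set()
```

A non-empty intersection does not prove the estimate is biased, but it marks exactly the fields worth auditing before trusting a lift.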

In the control arm, both tiers get the old prompt. In the treatment arm, both tiers get the new prompt. So far, still fine in theory: every tier sees both arms in the right proportion. But the new prompt has a different sensitivity to tier than the old one. Maybe the new prompt expands more aggressively on enterprise context because the author was tuning against enterprise traces. Maybe it truncates differently for free-tier requests because the new format hits a token limit on long inputs. The result is that the treatment effect, computed as a marginal average across both tiers, ends up dominated by whichever tier the new prompt happens to favor.

At rollout, the population mix shifts. The new prompt is now serving 100% of traffic, including the marginal users the experiment never exposed it to: the trial accounts, the deprecated tier the router silently merged into "free," the geographic cohort that doesn't go through the assignment layer at all. The "+4% conversion" was a property of the tier mix inside the assignment window, not a property of the prompt. The mix changes, the lift disappears, the post-mortem invents a scale effect that never existed.
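The arithmetic behind that disappearing lift is short enough to work through. The sketch below uses made-up per-tier conversion rates and tier mixes (none of these numbers come from the incident above) to show how a marginal lift can be driven by one tier and then vanish when the mix shifts.

```python
def marginal_rate(rates: dict[str, float], mix: dict[str, float]) -> float:
    """Population-level conversion rate: per-tier rates weighted by tier share."""
    return sum(mix[tier] * rates[tier] for tier in mix)


def relative_lift(control: dict[str, float], treatment: dict[str, float],
                  mix: dict[str, float]) -> float:
    """Relative lift of treatment over control under a given tier mix."""
    return marginal_rate(treatment, mix) / marginal_rate(control, mix) - 1.0


# Hypothetical per-tier conversion rates: the new prompt helps enterprise
# users and slightly hurts free-tier users.
control = {"enterprise": 0.100, "free": 0.050}
treatment = {"enterprise": 0.108, "free": 0.049}

# Hypothetical tier mixes inside the experiment window and after rollout.
experiment_mix = {"enterprise": 0.5, "free": 0.5}
rollout_mix = {"enterprise": 0.1, "free": 0.9}

lift_experiment = relative_lift(control, treatment, experiment_mix)
lift_rollout = relative_lift(control, treatment, rollout_mix)
print(f"experiment: {lift_experiment:+.2%}, rollout: {lift_rollout:+.2%}")
```

With these assumed numbers the experiment window shows a lift of roughly +4.7%, while the rollout mix yields a slightly negative one. The per-tier effects never changed; only the weights did.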

This is the same shape as Twyman's law: the more unusual the experimental result, the more likely it came from an instrumentation error. With LLMs, the instrumentation error is structural rather than incidental — the model is wired to read the inputs that determine its arm.

Symptoms Your Experimentation Stack Is Missing

There are signals that the assignment is leaking, and most experimentation dashboards are not configured to surface them.

The first is arm-conditional input divergence. Pull the distribution of every prompt-side input (system fragment ID, tool catalog hash, model temperature, max-tokens, user-tier string, retrieved-doc count) and compare across arms. If any input other than the variant under test shows a non-trivial KL divergence between control and treatment, you have a leak: the arms are not seeing the same input distribution. In a clean experiment, the only thing that should differ between arms is the prompt variant the experiment is testing.
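A rough version of that check fits in a few lines. This sketch assumes request logs are available as dictionaries with an `arm` key and categorical prompt-side fields; the field names, the add-one smoothing, and the sample rows are illustrative assumptions. Continuous inputs such as temperature would need to be binned first, and what counts as "non-trivial" is a threshold you would have to calibrate, for example against an A/A split.

```python
import math
from collections import Counter


def kl_divergence(p_counts: Counter, q_counts: Counter, alpha: float = 1.0) -> float:
    """KL(P || Q) in nats over the union of categories, with add-alpha smoothing."""
    categories = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(categories)
    q_total = sum(q_counts.values()) + alpha * len(categories)
    kl = 0.0
    for category in categories:
        p = (p_counts[category] + alpha) / p_total
        q = (q_counts[category] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl


def arm_input_divergence(rows: list[dict], fields: list[str],
                         arm_key: str = "arm") -> dict[str, float]:
    """KL(treatment || control) for each prompt-side input field."""
    result = {}
    for field in fields:
        treatment = Counter(r[field] for r in rows if r[arm_key] == "treatment")
        control = Counter(r[field] for r in rows if r[arm_key] == "control")
        result[field] = kl_divergence(treatment, control)
    return result


# Hypothetical logs: tier is skewed across arms, max_tokens is not.
rows = (
    [{"arm": "control", "user_tier": "enterprise", "max_tokens": 1024}] * 50
    + [{"arm": "control", "user_tier": "free", "max_tokens": 1024}] * 50
    + [{"arm": "treatment", "user_tier": "enterprise", "max_tokens": 1024}] * 80
    + [{"arm": "treatment", "user_tier": "free", "max_tokens": 1024}] * 20
)

divergence = arm_input_divergence(rows, ["user_tier", "max_tokens"])
for field, value in divergence.items():
    print(f"{field}: {value:.4f}")
```

In this toy data the `user_tier` divergence is clearly above zero while `max_tokens` is exactly zero, which is the pattern the check is meant to surface.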
