The Feature Flag Your Model Already Learned to Predict From the Inputs It Could See
The treatment arm shipped because the dashboard said "+4% conversion, p < 0.01, n = 2.3M." Six weeks after the global rollout the lift was gone, and the team filed the post-mortem under "scale effects" because nothing else fit. The actual cause was sitting in the prompt assembler the whole time: the routing hash that decided arm assignment was derived from a user-tier attribute, and the same attribute was being interpolated into the prompt template three lines later. The model was reading the assignment in band. The "treatment" wasn't the prompt change. The treatment was the population the prompt change happened to attract.
This is a failure mode that doesn't exist in the experimentation playbooks teams inherit from the web era. A button color does not read the user's tier and decide to behave differently. A prompt does. Once your treatment is a string that the model interprets, every input that touches the routing decision and also touches the prompt becomes a back channel the experiment cannot close.
