Skip to main content

2 posts tagged with "causal-inference"

View all tags

The Feature Flag Your Model Already Learned to Predict From the Inputs It Could See

· 10 min read
Tian Pan
Software Engineer

The treatment arm shipped because the dashboard said "+4% conversion, p < 0.01, n = 2.3M." Six weeks after the global rollout the lift was gone, and the team filed the post-mortem under "scale effects" because nothing else fit. The actual cause was sitting in the prompt assembler the whole time: the routing hash that decided arm assignment was derived from a user-tier attribute, and the same attribute was being interpolated into the prompt template three lines later. The model was reading the assignment in band. The "treatment" wasn't the prompt change. The treatment was the population the prompt change happened to attract.

This is a failure mode that doesn't exist in the experimentation playbooks teams inherit from the web era. A button color does not read the user's tier and decide to behave differently. A prompt does. Once your treatment is a string that the model interprets, every input that touches the routing decision and also touches the prompt becomes a back channel the experiment cannot close.

Why AI Features Break A/B Testing (and the Causal Inference Methods That Don't Lie)

· 11 min read
Tian Pan
Software Engineer

You ship an AI-powered feature, run a clean two-week A/B test, see a 4% lift in engagement, and call it a win. Six months later, the feature is fully rolled out and engagement is flat or declining. The test wasn't noisy — it was measuring the wrong thing entirely.

![](https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=Why%20AI%20Features%20Break%20A%2FB%20Testing%20(and%20the%20Causal%20Inference%20Methods%20That%20Don't%20Lie%29)

A/B tests were built for a world where users in a treatment group and users in a control group are statistically independent. AI features routinely violate that assumption. Users talk to each other, learn from each other's behavior, and share the outputs of AI tools. Treatment effects don't stabilize in two weeks when the real mechanism is long-horizon behavioral adaptation. When you ignore this, your experiment gives you a number that's internally consistent but causally meaningless.