The Model Rollout Flag That Bucketed by Session and Drifted Your A/B Cohort
The post-mortem opened with a sentence everyone in the room wanted to be true: the new model won by 4 percent on satisfaction, p less than 0.01, ship it. A month later a colder analysis found that the lift was a confound, the model was actually flat or slightly worse, and the team had spent the intervening weeks debating which prompt change had "caused" the win. Nothing about the model had caused anything. The experiment had been measuring the wrong thing because the flag service and the analysis pipeline disagreed, silently, about what a cohort was.
This is one of the most expensive failure modes in A/B testing because nothing in the system is broken. The flag service works. The experiment tracker works. The dashboard renders. The statistics are computed correctly on the data they receive. The failure lives in the seam between three components that each carry a different assumption about identity, and the seam is invisible until you go looking for it.
