The Model Rollout Flag That Bucketed by Session and Drifted Your A/B Cohort
The post-mortem opened with a sentence everyone in the room wanted to be true: the new model won by 4 percent on satisfaction, p less than 0.01, ship it. A month later a colder analysis found that the lift was a confound, the model was actually flat or slightly worse, and the team had spent the intervening weeks debating which prompt change had "caused" the win. Nothing about the model had caused anything. The experiment had been measuring the wrong thing because the flag service and the analysis pipeline disagreed, silently, about what a cohort was.
This is one of the most expensive failure modes in A/B testing because nothing in the system is broken. The flag service works. The experiment tracker works. The dashboard renders. The statistics are computed correctly on the data they receive. The failure lives in the seam between three components that each carry a different assumption about identity, and the seam is invisible until you go looking for it.
The configuration that read correctly and behaved wrongly
The flag was configured the way every onboarding guide tells you to configure it. Variation A to 50 percent of users, variation B to 50 percent, bucketing attribute set to user_id, hash seed pinned to the experiment key. A senior engineer reviewed the YAML. A data scientist reviewed the experiment design doc. The targeting predicate read out loud sounded right: take a stable hash of the user, mod 100, send the bottom half to A and the top half to B.
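The targeting predicate described above can be sketched in a few lines. This is a minimal illustration, not the flag service's actual algorithm: the hash function (SHA-256), the `seed:user_id` input layout, and the names `bucket` and `variation` are all assumptions made for the example.

```python
import hashlib


def bucket(user_id: str, seed: str, buckets: int = 100) -> int:
    """Map a user to a stable slot in [0, buckets) using a seeded hash."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce mod the slot count.
    return int.from_bytes(digest[:8], "big") % buckets


def variation(user_id: str, seed: str, a_percent: int = 50) -> str:
    """Send slots below a_percent to A and the remaining slots to B."""
    return "A" if bucket(user_id, seed) < a_percent else "B"


if __name__ == "__main__":
    seed = "model-rollout-exp"  # hypothetical experiment key used as the hash seed
    # Same user and same seed always yield the same variation.
    print(variation("user-123", seed), variation("user-123", seed))
```

With the seed and the split both fixed, the function is pure: the same user_id gets the same answer however many times it is asked.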
What the diagram on the design doc did not show was the flag SDK's internal cache. To avoid a round trip on every page load, the SDK evaluated the flag once per session and stored the resulting variation on a session-scoped object. The cache was a pure performance optimization, invisible to product code, and the invalidation rule was the default: expire when the session ends. On a long-session product like a developer tool or a help-desk app, "expire when the session ends" approximates "never." On a short-session product like a chat assistant where users open a tab, ask one question, close it, and come back twenty minutes later, "expire when the session ends" means "every twenty minutes, re-evaluate from scratch."
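The caching behavior is easier to reason about in code. The sketch below is a hypothetical stand-in for the SDK's internals, which the source does not show: the class name `SessionScopedFlags`, its methods, and the `evaluate` callback are all invented for illustration.

```python
from typing import Callable, Dict


class SessionScopedFlags:
    """Hypothetical session-scoped flag cache: evaluate once, reuse until the session ends."""

    def __init__(self, evaluate: Callable[[str], str]) -> None:
        self._evaluate = evaluate          # the real (possibly remote) flag evaluation
        self._cache: Dict[str, str] = {}   # session_id -> cached variation
        self.evaluations = 0               # how many times we actually evaluated

    def get(self, session_id: str, user_id: str) -> str:
        # First call in a session evaluates the flag; later calls hit the cache.
        if session_id not in self._cache:
            self._cache[session_id] = self._evaluate(user_id)
            self.evaluations += 1
        return self._cache[session_id]

    def end_session(self, session_id: str) -> None:
        # Default invalidation rule: the cached value expires with the session.
        self._cache.pop(session_id, None)
```

Under this model, a user who opens three short sessions in an hour triggers three independent evaluations, and each session keeps whatever the first call returned, even if the flag's configuration changes mid-session.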
Re-evaluation on its own is not the bug. A deterministic hash on a stable user_id should produce the same bucket every time, even if you re-evaluate it ten thousand times. The bug is what happens to a deterministic hash when the input space changes.
The ramp that quietly rebalanced the hash space
Halfway through the experiment, the team bumped the variation B rollout from 50 percent to 60 percent so they could collect more data on a long-tail segment. Most flag services document that a deterministic hash will keep most already-bucketed users in their original variation across a percentage change, as long as the bucketing scheme is implemented with that property in mind. LaunchDarkly's documentation, for example, describes the percentage rollout as a partition from 1 to 100,000 where each context lands in a stable slot, so that increasing one variation's allocation simply extends the contiguous range that maps to it. Only the users whose slots fall in the reallocated slice change variation; everyone else stays where they were. (LaunchDarkly's percentage rollouts)
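A small sketch makes the contiguous-range property concrete. It assumes a 100-slot partition with A occupying the low slots and B the rest; the hash construction and the names `slot`, `assign`, and `moved_by_ramp` are illustrative assumptions, not any vendor's implementation.

```python
import hashlib
from typing import Iterable, List, Tuple


def slot(user_id: str, seed: str, slots: int = 100) -> int:
    """Stable slot in [0, slots) for a user under a given seed."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % slots


def assign(user_id: str, seed: str, a_percent: int) -> str:
    """A owns the contiguous range [0, a_percent); B owns the rest."""
    return "A" if slot(user_id, seed) < a_percent else "B"


def moved_by_ramp(
    user_ids: Iterable[str], seed: str, old_a: int, new_a: int
) -> List[Tuple[str, str, str]]:
    """Return (user_id, before, after) for users whose variation changes."""
    moved = []
    for user_id in user_ids:
        before = assign(user_id, seed, old_a)
        after = assign(user_id, seed, new_a)
        if before != after:
            moved.append((user_id, before, after))
    return moved


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(5000)]
    # Ramping B from 50% to 60% shrinks A from [0, 50) to [0, 40).
    moved = moved_by_ramp(users, "model-rollout-exp", old_a=50, new_a=40)
    print(len(moved), {(before, after) for _, before, after in moved})
```

In this model, a ramp of B from 50 to 60 percent moves only the users in slots 40 through 49, and only in one direction, from A to B. Nobody who was already in B is touched, provided the seed stays fixed.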
That property holds when the implementation is careful. It does not hold by default in every flag service or every SDK, and it does not hold when other variables change underneath it. If the salt changes, if a new bucketing attribute is added, if the variation order is rearranged, or if a two-variation rollout gains a third variation, the partition can shuffle. GrowthBook documents explicitly that sticky bucketing is a separate feature you have to enable on top of the default deterministic hashing, precisely because consistent assignment across configuration changes is not free. (GrowthBook sticky bucketing) Statsig ships an analogous "persistent assignment" feature with the same framing: deterministic hashing handles the simple case, persistent assignment handles the case where the experiment configuration mutates mid-flight. (Statsig persistent assignment)
In this team's case, the ramp adjustment combined with a flag-service migration that occurred in the same window caused the underlying salt to rotate. The salt rotation was documented in the flag service's release notes under a heading nobody read. The result was that users who had been in variation A on Monday could end up in variation B on Tuesday, and the SDK's session-scoped cache faithfully captured whichever variation the user happened to draw on the first call of that session. The user_id was stable. The hash was deterministic. The cohort assignment was not.
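The effect of a salt rotation on an otherwise deterministic hash can be sketched as follows. The salt strings, the hash layout, and the names `assign` and `changed_fraction` are assumptions for illustration; the source does not describe the flag service's actual salt handling.

```python
import hashlib
from typing import Sequence


def assign(user_id: str, salt: str, a_percent: int = 50) -> str:
    """Deterministic assignment: same user_id and same salt give the same variation."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % 100
    return "A" if slot < a_percent else "B"


def changed_fraction(
    user_ids: Sequence[str], old_salt: str, new_salt: str, a_percent: int = 50
) -> float:
    """Fraction of users whose variation differs between two salts."""
    changed = sum(
        assign(u, old_salt, a_percent) != assign(u, new_salt, a_percent)
        for u in user_ids
    )
    return changed / len(user_ids)


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10000)]
    # Hypothetical salts before and after the migration.
    print(changed_fraction(users, "salt-before-migration", "salt-after-migration"))
```

Because the new salt produces slots that are effectively independent of the old ones, a 50/50 split should see roughly half of all users flip variation. Each individual evaluation is still deterministic; it is the function being evaluated that changed.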
The analysis that treated every event as its own assignment
The data scientist running the analysis did exactly what most A/B test tutorials describe. For each event, look up the flag value at the time of the event, join the metrics on that flag value, compute the lift, compute the p-value, report. The metrics pipeline had been built to make this easy. Every event carried a flag_value field, written at the moment the event fired, and the analysis simply grouped by that field.
That grouping logic is correct when the assignment is sticky and the event-time flag value equals the user's first-assigned cohort. It is not correct when the two diverge. In a sticky world, the event-time flag_value for any given user is a constant, so grouping by event-time flag_value and grouping by user-first-assigned flag_value produce the same partition. In a drifted world, the event-time flag_value is a function of two things: the user's underlying preference and the day-of-week pattern of when they happened to log in relative to the salt rotation. Grouping by event-time flag_value is then a confounding operation: it sorts events by a variable that is correlated with the outcome through paths the experiment did not control.
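The difference between the two groupings, and a cheap check for drift, can be sketched with plain Python. The event schema (`user_id`, `ts`, `flag_value`, `satisfaction`) and every function name here are assumptions for illustration; only the `flag_value` field is described in the text, and the sample events are synthetic.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, TypedDict


class Event(TypedDict):
    user_id: str
    ts: int              # event timestamp (any sortable integer)
    flag_value: str      # variation recorded when the event fired
    satisfaction: float


def first_assignment(events: List[Event]) -> Dict[str, str]:
    """Variation each user saw on their earliest event."""
    first: Dict[str, str] = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        first.setdefault(e["user_id"], e["flag_value"])
    return first


def drifted_users(events: List[Event]) -> Set[str]:
    """Users who were recorded under more than one variation."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for e in events:
        seen[e["user_id"]].add(e["flag_value"])
    return {u for u, values in seen.items() if len(values) > 1}


def mean_satisfaction(
    events: List[Event], cohort_of: Callable[[Event], str]
) -> Dict[str, float]:
    """Mean satisfaction per cohort, with the cohort chosen by cohort_of."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for e in events:
        cohort = cohort_of(e)
        sums[cohort] += e["satisfaction"]
        counts[cohort] += 1
    return {c: sums[c] / counts[c] for c in sums}


if __name__ == "__main__":
    # Synthetic events: u1 drifts from A to B, u2 stays in B.
    events: List[Event] = [
        {"user_id": "u1", "ts": 1, "flag_value": "A", "satisfaction": 4.0},
        {"user_id": "u1", "ts": 2, "flag_value": "B", "satisfaction": 2.0},
        {"user_id": "u2", "ts": 1, "flag_value": "B", "satisfaction": 3.0},
        {"user_id": "u2", "ts": 2, "flag_value": "B", "satisfaction": 3.0},
    ]
    first = first_assignment(events)
    print(drifted_users(events))
    print(mean_satisfaction(events, lambda e: e["flag_value"]))
    print(mean_satisfaction(events, lambda e: first[e["user_id"]]))
```

If `drifted_users` returns anything other than an empty set, the two groupings no longer describe the same partition, and the event-time grouping should not be read as a randomized comparison. Grouping by first assignment does not undo the contaminated exposure, but it at least restores cohorts that are not a function of when a user happened to log in.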
The concrete shape of the confound in this team's case was that the salt rotation landed on a Sunday night. Users who logged in primarily on weekdays ended up over-represented in the post-rotation bucket. Users who logged in primarily on weekends ended up over-represented in the pre-rotation bucket. The two populations had measurably different baseline satisfaction scores because they used the product for different kinds of work. The "model won by 4 percent" was substantially "weekday users have a 4 percent higher satisfaction baseline than weekend users," surfaced through the analysis as if the model had caused it. The lift was, as a statistician would say, precisely estimated, highly significant, and pointing at the wrong thing.
- https://docs.growthbook.io/app/sticky-bucketing
- https://www.statsig.com/updates/update/persistentassignment
- https://launchdarkly.com/docs/home/releases/percentage-rollouts
- https://blog.growthbook.io/changing-running-experiments-safely-and-flexibly-in-growthbook/
- https://docs.developers.optimizely.com/feature-experimentation/docs/how-bucketing-works-feature-experimentation
- https://www.statsig.com/perspectives/confounding-variables-ab-testing-solutions
- https://blog.growthbook.io/how-to-a-b-test-ai-a-practical-guide/
- https://www.statsig.com/blog/llm-optimization-online-experimentation
- https://latitude.so/blog/ab-testing-in-llm-deployment-ultimate-guide
- https://www.llama.com/docs/deployment/a-b-testing/
