The Canary Cohort Your Rollout Hashed by ID That Clustered Power Users Into One Arm
A rollout team ships a new model behind a percentage flag. The flag bucket is computed as hash(user_id) % 100, the canary is buckets 0–4, the lift on per-user engagement is large and stable for two weeks, and the team ramps to 20%, then 50%, then global. The lift evaporates somewhere between 50% and global, and the post-mortem traces it back to the canary cohort. The treatment didn't move the metric. The canary arm was a different population.
The team thought it had been sampling users. It had been sampling IDs.
The two words look interchangeable until you remember that IDs are not generated by the user. They are generated by whatever system happened to be running the day the account was created: a sequence in Postgres, a Snowflake-style time-encoded integer, a UUIDv7 with a timestamp prefix. If that generator embeds time, or simply counts upward as accounts are created, then "users with adjacent IDs" really means "users who signed up around the same time." And signup time is one of the strongest predictors of behavior most products have. The viral integration that brought in a power-user wave two years ago lives in a six-month window of ID space, and any bucketing scheme that doesn't actively scramble that window can land it whole inside one arm.
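To make the "ID carries time" point concrete, here is a minimal sketch that builds a UUIDv7-shaped value by hand and reads the signup time back out of it. It assumes the standard UUIDv7 layout (a 48-bit Unix millisecond timestamp in the top bits); the helper names make_uuid7_like and signup_time are hypothetical, and the timestamps are made up for illustration.

```python
import uuid
from datetime import datetime, timezone

def make_uuid7_like(unix_ms: int, rand_bits: int) -> uuid.UUID:
    """Hand-assemble a UUIDv7-shaped value (hypothetical helper for illustration)."""
    value = (unix_ms & ((1 << 48) - 1)) << 80      # 48-bit millisecond timestamp
    value |= 0x7 << 76                             # version nibble = 7
    value |= ((rand_bits >> 62) & 0xFFF) << 64     # 12 random bits
    value |= 0b10 << 62                            # RFC variant bits
    value |= rand_bits & ((1 << 62) - 1)           # 62 random bits
    return uuid.UUID(int=value)

def signup_time(user_id: uuid.UUID) -> datetime:
    """Recover the creation time that the ID itself encodes."""
    return datetime.fromtimestamp((user_id.int >> 80) / 1000, tz=timezone.utc)

# Two accounts created one hour apart (made-up timestamps).
early = make_uuid7_like(1_700_000_000_000, rand_bits=987_654_321)
late = make_uuid7_like(1_700_003_600_000, rand_bits=123_456_789)

print(signup_time(early), signup_time(late))
print(early < late)  # ID order follows signup order
```

Anything downstream that preserves the ordering of these IDs is, in effect, ordering users by signup date.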
The hash function is doing exactly what you told it to
The cruel part is that the hash is innocent. A good hash on a dense ID space spreads IDs evenly across buckets. If you check the marginal distribution of bucket assignments, every bucket has roughly the same count, the chi-squared test for sample ratio mismatch (SRM) comes back clean, and the experimentation platform declares the split healthy. SRM checks are the right tool for detecting a broken assignment pipeline (a missing variation script, a redirect that drops half the variant, a targeting rule that misfires), but they compare the count of users per bucket, not the composition. Equal counts with unequal compositions is a category of bias the standard health check is built to ignore.
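A small sketch of why a count-based check stays quiet here. It computes the usual two-cell chi-squared statistic for SRM by hand and applies it to a made-up 5% canary whose users spent five times more than control in the prior period. The arm sizes, spend values, and the 0.05 significance level are assumptions for illustration; real platforms often use a stricter threshold.

```python
from statistics import mean

# Chi-squared critical value for 1 degree of freedom at alpha = 0.05.
CRITICAL_VALUE_DF1 = 3.841

def srm_statistic(control_n: int, treatment_n: int, treatment_share: float) -> float:
    """Chi-squared statistic comparing observed arm counts to the planned split."""
    total = control_n + treatment_n
    expected_treatment = total * treatment_share
    expected_control = total - expected_treatment
    return ((treatment_n - expected_treatment) ** 2 / expected_treatment
            + (control_n - expected_control) ** 2 / expected_control)

# Hypothetical prior-month spend per user; counts match a 5% canary exactly.
treatment_spend = [50.0] * 500
control_spend = [10.0] * 9_500

stat = srm_statistic(len(control_spend), len(treatment_spend), treatment_share=0.05)
srm_flagged = stat > CRITICAL_VALUE_DF1
spend_ratio = mean(treatment_spend) / mean(control_spend)

print(f"SRM statistic={stat:.4f} flagged={srm_flagged} prior-spend ratio={spend_ratio:.1f}x")
```

The statistic only ever sees the two counts, so no amount of compositional skew can move it.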
An unsalted hash also produces the same digest for a given user in every experiment. Two concurrent experiments salted with the same string or run with the same partitioning policy can produce assignments that look independent in any one experiment but are correlated when you join them, a known failure mode that surfaces when the platform team starts looking for the source of phantom lifts. The fix in that case is per-experiment salts. The fix in this case is something different, because the problem isn't a collision between experiments. It's a collision between the ID space and the cohort space.
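For reference, per-experiment salting can be sketched as below. This is one common way to do it, not necessarily how any particular platform implements it; the bucket function, the salt strings, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib

def bucket(user_id: int, experiment_salt: str, buckets: int = 100) -> int:
    """Map a user to a bucket using a digest keyed by the experiment's salt."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

users = range(20_000)

# Same salt in two experiments: the 5% canaries are the same people.
canary_a = {u for u in users if bucket(u, "exp-a") < 5}
canary_same_salt = {u for u in users if bucket(u, "exp-a") < 5}

# Distinct salts: the canaries overlap only about as much as chance predicts.
canary_b = {u for u in users if bucket(u, "exp-b") < 5}

overlap_same = len(canary_a & canary_same_salt) / len(canary_a)
overlap_distinct = len(canary_a & canary_b) / len(canary_a)
print(f"same salt overlap={overlap_same:.2f}, distinct salts overlap={overlap_distinct:.2f}")
```

Salting decorrelates experiments from each other. It does not, by itself, tell you whether any one experiment's arms match on prior behavior.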
Why "uniform on IDs" is not the same as "uniform on users"
The implicit contract a hash bucketing scheme offers is: if I hash on a unique identifier, I get an exchangeable sample of users. That contract is true only when the identifier is statistically independent of the variables you care about. For a randomly assigned identifier with no information content, that holds. For an integer that monotonically increases with signup time, it doesn't — the identifier carries a time covariate, and any cohort behavior that correlates with signup time bleeds into the bucket assignment.
The most common version of this is heavy-user bias. The Microsoft Research paper "On Heavy-user Bias in A/B Testing" makes the point that a small fraction of accounts often drives a large fraction of metric movement, and the composition of heavy users inside an experiment window can deviate substantially from the long-run population. When heavy users cluster on a specific dimension — signup window, geography, plan tier, device — and your bucketing function is sensitive to that dimension, the experiment is measuring a different population than the rollout will eventually serve. The bias is not introduced by the experiment running too short. It is introduced by the assignment being non-random in a way that the count-based health check can't see.
A signup-time cohort is the textbook case. Power users sign up in concentrated bursts: a launch, a press cycle, a viral integration, a partnership. Those bursts produce dense ID ranges. A hash that maps adjacent IDs to adjacent buckets (and many practical hash functions do, especially modulo or fold-and-mix variants on small bucket counts) will route a contiguous ID range to a contiguous bucket range. A 5% canary that maps to buckets 0–4 is five contiguous buckets. There is no theoretical reason a six-month signup window with disproportionate engagement won't fall entirely inside those five buckets. There is a probabilistic reason it should be unlikely on average, but the variance of "where the power-user cohort lands" across a small number of buckets is high enough that any one rollout can hit the bad arrangement.
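The bad arrangement can be reproduced in a toy simulation. The sketch below compares a deliberately order-preserving bucketing function with a SHA-256 scramble on sequential IDs. The order-preserving function is a simplified stand-in chosen to make the effect visible, not a claim about any specific hash; how much adjacency a real function preserves depends on the function and on the ID format. The population size, the power-user ID window, and the salt are all made up.

```python
import hashlib

N_USERS = 100_000
BUCKETS = 100
CANARY = range(0, 5)              # buckets 0-4, a 5% canary
POWER_WINDOW = range(0, 5_000)    # hypothetical signup burst of power users

def order_preserving_bucket(user_id: int) -> int:
    """Stand-in for a scheme that keeps adjacent IDs in adjacent buckets."""
    return user_id * BUCKETS // N_USERS

def scrambled_bucket(user_id: int, salt: str = "rollout-v1") -> int:
    """Bucketing through a cryptographic digest, which breaks ID adjacency."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS

def power_share_in_canary(bucket_fn) -> float:
    """Fraction of canary users who come from the power-user ID window."""
    canary_users = [uid for uid in range(N_USERS) if bucket_fn(uid) in CANARY]
    power_users = sum(1 for uid in canary_users if uid in POWER_WINDOW)
    return power_users / len(canary_users)

ordered_share = power_share_in_canary(order_preserving_bucket)
scrambled_share = power_share_in_canary(scrambled_bucket)
print(f"order-preserving: {ordered_share:.2%} of canary are power users")
print(f"scrambled:        {scrambled_share:.2%} of canary are power users")
```

Both functions put about 5% of users in the canary, so a count-based check sees nothing unusual in either case. Only the composition differs.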
What the standard health checks miss
The experimentation platform's SRM detector fires when the count in the treatment arm is unusually far from the expected percentage. That check would catch a bucketing function with a hot spot, or a tracking script that fails to fire on half of variant pages, or a redirect that drops traffic. It wouldn't catch a count-balanced split where one arm's users happen to have spent 5x more on the product the previous month.
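A composition-aware check would compare a pre-experiment covariate across arms rather than the counts. One simple option is the standardized mean difference on prior-period spend, sketched below. The function name, the sample data, and the 0.1 threshold (a common rule of thumb, not a universal standard) are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(treatment: list[float], control: list[float]) -> float:
    """Difference in means scaled by the pooled standard deviation of both arms."""
    pooled_sd = sqrt((variance(treatment) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0 if mean(treatment) == mean(control) else float("inf")
    return (mean(treatment) - mean(control)) / pooled_sd

IMBALANCE_THRESHOLD = 0.1  # assumed rule-of-thumb cutoff

# Hypothetical prior-month spend: counts fit a 5% canary, composition does not.
treatment_prior_spend = [40.0, 50.0, 60.0] * 100
control_prior_spend = [8.0, 10.0, 12.0] * 1_900

smd = standardized_mean_difference(treatment_prior_spend, control_prior_spend)
imbalanced = abs(smd) > IMBALANCE_THRESHOLD
print(f"SMD on prior spend={smd:.2f} imbalanced={imbalanced}")
```

Because the covariate is measured before assignment, any gap it shows cannot have been caused by the treatment, which is what makes it usable as a pre-flight check.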
References
- https://arxiv.org/abs/1902.02021
- https://www.microsoft.com/en-us/research/publication/on-heavy-user-bias-in-a-b-testing/
- https://www.statsig.com/blog/sample-ratio-mismatch
- https://www.statsig.com/blog/stratified-sampling-in-ab-tests
- https://docs.geteppo.com/statistics/cuped/
- https://www.optimizely.com/insights/blog/introducing-optimizelys-automatic-sample-ratio-mismatch-detection/
- https://amplitude.com/docs/feature-experiment/troubleshooting/sample-ratio-mismatch
- https://medium.com/@ThinkingLoop/canary-metrics-lie-more-than-you-think-01f201a5c776
- https://techkluster.com/technology/a-b-test-bucketing-using-hashing/
- https://engineering.depop.com/a-b-test-bucketing-using-hashing-475c4ce5d07
