The Agent A/B Test Whose Variants Quietly Shared Long-Term Memory
You ran the cleanest A/B test of your career. Traffic split was 50/50, the hash function looked fine, the metric pipeline did not lie, the holdout was preserved, and after three weeks the analysis converged on a clear winner: variant B improved task completion by four points, with a p-value the stats team had no objections to. You shipped it to 100%. Two weeks after the rollout, the topline metric you launched on had drifted back toward the baseline, and nobody could explain why.
Here is the part that took a while to see. Both variants were writing to and reading from the same long-term memory store. Users in variant A wrote a memory like "this customer prefers blunt summaries" and the next day, when the same user happened to be on variant B, the variant B agent loaded that memory and read it into its prompt. The reverse happened in the other direction. The experiment was not comparing prompt A against prompt B. It was comparing "prompt A reading prompt-B-shaped memories" against "prompt B reading prompt-A-shaped memories." The result was an average over a contaminated joint distribution, and the launch was a regression to a different point on the same surface.
