
The Agent A/B Test Whose Variants Quietly Shared Long-Term Memory

Tian Pan
Software Engineer

You ran the cleanest A/B test of your career. Traffic split was 50/50, the hash function looked fine, the metric pipeline did not lie, the holdout was preserved, and after three weeks the analysis converged on a clear winner: variant B improved task completion by four points, with a p-value the stats team had no objections to. You shipped it to 100%. Two weeks after the rollout, the topline metric you launched on had drifted back toward the baseline, and nobody could explain why.

Here is the part that took a while to see. Both variants were writing to and reading from the same long-term memory store. The variant A agent wrote a memory like "this customer prefers blunt summaries," and the next day, when the same user happened to be routed to variant B, the variant B agent loaded that memory into its prompt. The same thing happened in the other direction. The experiment was not comparing prompt A against prompt B. It was comparing "prompt A reading prompt-B-shaped memories" against "prompt B reading prompt-A-shaped memories." The result was an average over a contaminated joint distribution, and the launch moved the product to a different point on the same surface, one the experiment never measured.

This failure mode is not exotic. Every team that bolted long-term memory onto an agent product after the experimentation framework was already in place is at risk of it, and most of those teams have not yet looked. The experimentation framework was designed for a world where the unit of independence was the user, the partition primitive was a user-id hash, and the only state a variant could write was its own log line. Agent memory broke all three assumptions silently.

The partition primitive no longer matches the unit of independence

Classical A/B testing rests on the Stable Unit Treatment Value Assumption — SUTVA — which says one user's outcome is not influenced by the treatment assignment of any other user. The platform team that built your experimentation system encoded this assumption into the partition primitive: hash the user id, route to variant, log the outcome, compare. The math works because the user is the unit of independence and the partition is clean.
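
As a minimal sketch of that partition primitive, assuming a SHA-256 hash salted with the experiment id (the function name `assign_variant` and the salting scheme are illustrative and not taken from any particular framework):

```python
import hashlib


def assign_variant(user_id, experiment_id, variants=("A", "B")):
    """Deterministically map a user to a variant for one experiment."""
    # Salting with the experiment id keeps assignments independent across experiments.
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and bucket it evenly.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


print(assign_variant("user-123", "exp-42"))
```

Note what the function takes as input: a user id and an experiment id, and nothing about the state the chosen variant will go on to read or write.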

Add long-term agent memory and the unit of independence is no longer the user. It is the closure over every persistent surface the agent reads from or writes to during the variant call. That closure includes the vector store of past conversations, the per-user profile summary the agent maintains, the semantic cache that returns prior answers verbatim, the per-user RAG index that gets updated when the user uploads a document, and the workflow memory that records "this agent tried X and it failed." None of those surfaces were in the platform team's mental model when they wrote the partition primitive.

Worse, the contamination is bidirectional and asymmetric. Variant A writes memories shaped like variant A's prompt. Variant B then reads those memories through variant B's prompt template. The bias is not just leakage; it is a covariance between treatment and the input distribution the other treatment sees. The statistical literature on recommender systems has a name for this, symbiosis bias, and the result holds here too: each variant's measured performance is partially a function of data generated by its competitor, training data in the recommender case and memories here.

The marketplace numbers are sobering. eBay's team measured a 100% overestimation of treatment effects when comparing two-sided marketplace algorithms that competed for the same buyers. Airbnb measured 32.6% inflation on guest-fee experiments. Those numbers come from network interference, not memory interference, but the structural mechanism is the same: shared state coupling variants that the experimentation framework treated as independent. Agent memory is just a faster, denser version of the same coupling, because every variant call both reads and writes the shared state in the same turn.

The leak surfaces nobody declared

The reason this is hard to detect is that no single person on the team holds the full picture of what the variant reads and writes. The prompt engineer knows the prompt. The memory engineer knows the memory store. The infra engineer knows the experimentation framework. The contamination lives in the seam between them, and the seam has no owner.

Common leak surfaces include:

  • Conversational memory stores: vector indices populated by past turns, often namespaced by user id but not by variant. Variant A writes an embedding; variant B retrieves it on the next turn.
  • User-profile summaries: the periodically updated "what we know about this user" sketch the agent loads as system context. A variant whose prompt encourages aggressive personalization will update this sketch with stronger claims than a control variant, and the control variant inherits those claims next session.
  • Semantic answer caches: lookup tables that return a prior answer when the new query is semantically close enough. The variant that generated the cached answer is not encoded in the cache key, so any future variant can hit it.
  • Per-user RAG indices: documents the user uploaded, summaries the agent computed over them, derived metadata fields. If the variant influenced what gets indexed or how the summary is written, that influence outlives the variant.
  • Workflow state and skills: memory of the "the agent learned the user's repo layout" kind. A variant that explores aggressively populates this faster, and a control variant later reads from a richer state than it would have built on its own.
  • Feedback signals folded back into memory: thumbs-up and thumbs-down events stored alongside the response, used by future retrieval to re-rank. A variant that elicited more positive feedback now biases retrieval for the other variant too.

The pattern across all of these is that the memory layer was designed before the experimentation requirement, with isolation primitives keyed on user or tenant rather than on the experiment cell. The framework keys on "did the user see variant A or variant B?" The memory store keys on "what is this user's id?" The two primitives do not compose, and nobody specified that they had to.
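
The cache case is the easiest of these to show in a few lines. The following is a toy sketch, not a real semantic cache: exact match on a normalized query stands in for embedding similarity, and `AnswerCache`, `scope_by_variant`, and `answer_as` are hypothetical names introduced only for illustration.

```python
class AnswerCache:
    """Toy answer cache; exact match on a normalized query stands in for semantic similarity."""

    def __init__(self, scope_by_variant):
        self.scope_by_variant = scope_by_variant
        self._entries = {}  # cache key -> (writer_variant, answer)

    def _key(self, query, variant_id):
        normalized = " ".join(query.lower().split())
        if self.scope_by_variant:
            return (normalized, variant_id)
        return (normalized,)

    def get_or_generate(self, query, variant_id, generate):
        """Return (writer_variant, answer), generating and caching on a miss."""
        key = self._key(query, variant_id)
        if key not in self._entries:
            self._entries[key] = (variant_id, generate(query))
        return self._entries[key]


def answer_as(variant_id):
    # Hypothetical stand-in for a model call under a given variant's prompt.
    return lambda query: f"[{variant_id}] answer to: {query}"


# Cache keyed on the query alone: variant B is served variant A's answer.
leaky = AnswerCache(scope_by_variant=False)
leaky.get_or_generate("Reset my password", "A", answer_as("A"))
leaky_writer, _ = leaky.get_or_generate("reset my  password", "B", answer_as("B"))

# Cache keyed on (query, variant): variant B generates its own answer.
scoped = AnswerCache(scope_by_variant=True)
scoped.get_or_generate("Reset my password", "A", answer_as("A"))
scoped_writer, _ = scoped.get_or_generate("reset my  password", "B", answer_as("B"))

print(leaky_writer, scoped_writer)
```

Storing the writer's variant next to each entry is what makes the leak observable at all; without it, a cross-variant hit looks like any other cache hit.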

Why the rollout result diverges from the experiment result

When the experiment runs at 50/50, the contamination is roughly symmetric — variant A reads memories tainted by variant B at about the same rate that variant B reads memories tainted by variant A. The measured delta still reflects a real difference between the variants, just smaller and noisier than it would be in a clean experiment. Teams that have lived through this describe the experiment phase as deceptively well-behaved.

The phase that goes wrong is the rollout. As variant B ramps from 50% to 100%, the share of variant-B-shaped memories in the store climbs toward 100% as well. The variant B agent is now reading from a memory distribution it never saw during the experiment. The behavior shifts, the metric shifts, and the post-launch numbers do not match the experiment numbers. The team's first instinct is to suspect the analysis, the holdout, or seasonality: anything but the memory layer, because the memory layer never came up at experiment design time.

In some products the shift is favorable: variant B reads memories more aligned with its prompt and gets stronger. In others it is unfavorable: variant B was secretly benefiting from the variant-A-shaped memories that taught it how to recover from its own brittleness, and without them it degrades. The direction is product-specific. The lesson is not. A measured delta on contaminated state cannot be extrapolated to the clean-state distribution that the post-rollout product actually runs on.
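
A toy calculation makes the unfavorable case concrete. Everything in it is an assumption chosen for illustration: the linear dependence of completion rate on the share of B-shaped memories, and every number in `VARIANT_A` and `VARIANT_B`. None of it comes from real data.

```python
def completion_rate(base, memory_sensitivity, share_b_memories):
    """Task completion as a linear function of the share of B-shaped memories read."""
    return base + memory_sensitivity * share_b_memories


# Hypothetical parameters: A is indifferent to memory shape, while B does well
# on A-shaped memories and degrades as the store fills with its own.
VARIANT_A = {"base": 0.60, "memory_sensitivity": 0.00}
VARIANT_B = {"base": 0.68, "memory_sensitivity": -0.08}


def rate(variant, share_b_memories):
    return completion_rate(
        variant["base"], variant["memory_sensitivity"], share_b_memories
    )


# Before the experiment, the store holds only A-shaped memories.
pre_experiment = rate(VARIANT_A, 0.0)

# During a 50/50 split on a shared store, both variants read a half-B mix.
measured_delta = rate(VARIANT_B, 0.5) - rate(VARIANT_A, 0.5)

# After rollout, B reads only B-shaped memories.
realized_lift = rate(VARIANT_B, 1.0) - pre_experiment

print(f"measured delta during experiment: {measured_delta:.3f}")
print(f"realized lift after rollout:      {realized_lift:.3f}")
```

Under these made-up parameters the experiment reports a lift of roughly four points and the rollout delivers roughly none. Flipping the sign of B's `memory_sensitivity` gives the favorable case instead, which is the sense in which the direction is product-specific.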

Partitioning memory by experiment cell, not just by user

The closest analog from outside the AI world is the recommender system A/B test, where the platform team eventually had to admit that the data pipeline coupled the variants and the standard partition primitive was no longer sufficient. The fix that worked there — physically separate state per variant — is the right starting point here too.

Practical patterns that close the gap:

  • Experiment-scoped memory namespaces. Compose the memory key as (user_id, experiment_id, variant_id) rather than just user_id. Variant A writes to one logical namespace; variant B writes to another; neither reads from the other. The cost is paid in two places: the per-user state is now smaller and noisier per variant (less history to draw on), and the post-experiment migration has to decide which variant's memories survive when the winner is rolled out.
  • A contamination-detection pass at evaluation time. For every memory read during a variant call, log the variant that wrote the memory. At analysis time, compute the fraction of reads that crossed the variant boundary. If that fraction is non-trivial, the variant boundary did not hold and the measured delta is contaminated. Treat the cross-variant read rate as a first-class experiment health metric, the same way sample ratio mismatch (SRM) is treated in classical A/B testing.
  • State-declaration at experiment design time. Make the experiment specification include a "what state does this variant read or mutate" field, enumerating every persistent store. The framework can then verify that the partition primitive covers every declared store. This sounds bureaucratic and is — but the alternative is the team discovering the leak after the rollout has moved the topline metric.
  • A holdout-purity metric. Keep a small holdout that ran on the production variant during the experiment with no exposure to the other variant's memory. After rollout, compare the rolled-out variant's behavior against the holdout's behavior on the same input distribution. Divergence between in-experiment and post-launch behavior is the signal that the partition leaked. This is the metric that catches the failure mode in retrospect even when the experiment design missed it in advance.
  • Pre-flight synthetic contamination tests. Build a small offline harness that deliberately seeds variant A's memory store with variant-B-shaped memories and vice versa, then runs both variants and measures the behavioral delta. If a small amount of cross-variant memory moves the metric noticeably, the experiment design needs stronger isolation before going live. Treat this as part of the experiment's launch readiness review, not as a research artifact.
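
The first two patterns can be sketched together. This is a minimal in-memory illustration, assuming a composite key of (user_id, experiment_id, variant_id) and a read log of (reader variant, writer variant) pairs; `MemoryKey`, `ScopedMemoryStore`, and `cross_variant_read_rate` are hypothetical names, not the API of any real memory system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryKey:
    """Composite key: the experiment cell is part of the namespace."""
    user_id: str
    experiment_id: str
    variant_id: str


class ScopedMemoryStore:
    def __init__(self):
        self._items = {}  # MemoryKey -> list of memory strings

    def write(self, key, text):
        self._items.setdefault(key, []).append(text)

    def read(self, key):
        # Only memories written under the same (user, experiment, variant) come back.
        return list(self._items.get(key, []))


def cross_variant_read_rate(read_log):
    """Fraction of memory reads whose writer variant differs from the reader variant.

    read_log is an iterable of (reader_variant, writer_variant) pairs, one per read.
    """
    reads = list(read_log)
    if not reads:
        return 0.0
    crossed = sum(1 for reader, writer in reads if reader != writer)
    return crossed / len(reads)


store = ScopedMemoryStore()
store.write(MemoryKey("u1", "exp-42", "A"), "prefers blunt summaries")
seen_by_b = store.read(MemoryKey("u1", "exp-42", "B"))  # empty: the boundary held
seen_by_a = store.read(MemoryKey("u1", "exp-42", "A"))

# A log from a store that was NOT scoped: one of four reads crossed the boundary.
leak_rate = cross_variant_read_rate([("A", "A"), ("B", "A"), ("B", "B"), ("A", "A")])
print(seen_by_b, seen_by_a, leak_rate)
```

The read-rate metric is only computable if the writer's variant is recorded at write time, so the provenance field has to exist before the experiment starts rather than being added during the post-mortem.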

The eval discipline matters here as much as the production discipline. An eval suite that grades whether your framework detects synthetic leaks is the only way to keep the contamination story honest as the memory layer evolves. Without it, every new memory surface — a new cache, a new derived index, a new feedback channel — is a new leak surface that the experimentation team will not know about until a launch goes sideways.

The architectural realization

The right way to think about an agent A/B test is that the unit of statistical independence is not the user-id. It is the closure over every persistent surface the agent reads or writes during the variant call. The user is one component of that closure; the memory layer is another; the cache is another; the derived indices are another. The framework that names only the user has shipped a primitive that does not match the system.

This reframing changes the conversation at experiment design time. Instead of "did we split traffic correctly?" the question is "did we partition every piece of state the variant touches?" That question gets answered jointly by the platform team and the agent team, which is the seam where the failure used to live. Naming the closure forces the conversation that nobody was having.
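
One way to make that question mechanical is a check over the experiment specification. The sketch below assumes a spec with a `reads_or_mutates` list and a registry mapping each store to the fields in its key; both structures, and every store name in them, are hypothetical.

```python
REQUIRED_CELL_FIELDS = {"experiment_id", "variant_id"}


def unpartitioned_stores(declared_stores, key_fields_by_store):
    """Return declared stores whose keys do not include the full experiment cell.

    A store missing from the registry is treated as unpartitioned.
    """
    problems = []
    for store in declared_stores:
        fields = set(key_fields_by_store.get(store, ()))
        if not REQUIRED_CELL_FIELDS <= fields:
            problems.append(store)
    return problems


# Hypothetical experiment spec declaring every persistent store the variant touches.
spec = {
    "experiment_id": "exp-42",
    "reads_or_mutates": ["conversation_vectors", "profile_summary", "semantic_cache"],
}

# Hypothetical registry of how each store is keyed today.
key_fields = {
    "conversation_vectors": ("user_id", "experiment_id", "variant_id"),
    "profile_summary": ("user_id",),
}

leaks = unpartitioned_stores(spec["reads_or_mutates"], key_fields)
print(leaks)  # ['profile_summary', 'semantic_cache']
```

A check like this only covers stores that someone declared. A surface left out of the spec stays invisible to it, which is why the declaration step has to be owned jointly by the platform and agent teams.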

There is a deeper point underneath all of this. The measured delta in any experiment is a function of the input distribution the variants saw. When part of that input distribution is generated by the variants themselves through persistent state, the experiment is measuring a fixed point of a coupled system, not a comparison of two independent treatments. The fixed point can be informative — it tells you something about the joint behavior — but it is not the quantity the team thought it was measuring, and it is not the quantity the post-rollout product will exhibit.

Teams that do not name that closure ship frameworks that measure noise as signal at exactly the moments the team most wants signal. The fix is not glamorous. It is naming the state, partitioning it at the experiment cell, monitoring the partition, and treating leaks as a first-class experiment health failure. Until the experimentation framework knows about the memory layer, every agent A/B test is one rollout away from teaching the team that its winning variant was never actually winning on its own.
