17 posts tagged with "experimentation"

The A/B Test Winner Whose Verbose Output Triggered Your Click Handler More Than the Better Answer

· 10 min read
Tian Pan
Software Engineer

A prompt-variant experiment runs on the production traffic of an AI-assisted search product. The success metric is a click on any suggested action in the response. Variant B ships responses that are roughly forty percent longer with more enumerated options. Its click-through rate is eleven percent higher, significant at the 99.9% confidence level. The experiment is declared a winner and shipped.

A month later, the weekly customer satisfaction survey drops two points. Nobody connects it to the launch because the experiment has already been written up as a success and the team has moved on. A quarterly review eventually traces the satisfaction drop back to the prompt change, and the diagnosis lands hard: variant B won not because it gave users better answers but because longer answers contained more clickable surfaces. The click handler fired more often per impression because there was more to click, not because what the user read was more worth acting on.
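One way to see the difference between "more clicks" and "more to click" is to normalize clicks by the number of clickable surfaces each response rendered. The sketch below is illustrative only: the `Impression` record, its field names, and the sample numbers are assumptions, not data from the experiment described above.

```python
from dataclasses import dataclass

@dataclass
class Impression:
    clickable_surfaces: int  # suggested actions rendered in the response (hypothetical field)
    clicks: int              # clicks recorded on that response (hypothetical field)

def click_through_rate(impressions):
    """Share of impressions with at least one click, the metric the experiment optimized."""
    return sum(1 for i in impressions if i.clicks > 0) / len(impressions)

def clicks_per_surface(impressions):
    """Total clicks divided by total clickable surfaces shown."""
    total_surfaces = sum(i.clickable_surfaces for i in impressions)
    return sum(i.clicks for i in impressions) / total_surfaces

# Made-up numbers: variant B shows twice as many options per response.
variant_a = [Impression(3, 1), Impression(3, 0), Impression(3, 1), Impression(3, 0)]
variant_b = [Impression(6, 1), Impression(6, 1), Impression(6, 1), Impression(6, 0)]

print(click_through_rate(variant_a), click_through_rate(variant_b))
print(clicks_per_surface(variant_a), clicks_per_surface(variant_b))
```

In this toy data variant B wins on click-through rate and loses on clicks per surface, which is the pattern the quarterly review found. A per-surface rate is not a satisfaction metric either; it only removes the mechanical advantage of a longer response.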

The Canary Cohort Your Rollout Hashed by ID That Clustered Power Users Into One Arm

· 10 min read
Tian Pan
Software Engineer

A rollout team ships a new model behind a percentage flag. The flag bucket is computed as hash(user_id) % 100, the canary is buckets 0–4, the lift on per-user engagement is large and stable for two weeks, and the team ramps to 20%, then 50%, then global. The lift evaporates somewhere between 50% and global, and the post-mortem traces it back to the canary cohort. The treatment didn't move the metric. The canary arm was a different population.

The team thought it had been sampling users. It had been sampling IDs.
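A minimal sketch of two common mitigations, assuming the problem is that raw IDs carry structure (signup order, account type) that a weak hash passes straight through to the bucket. The excerpt does not say which hash the team used; the salted SHA-256 bucketing, the `salt` parameter, and the `pre_period_means` balance check are illustrative choices, not the team's implementation.

```python
import hashlib

def bucket(user_id: str, salt: str, buckets: int = 100) -> int:
    """Map a user to a bucket using a salted cryptographic hash of the ID."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def in_canary(user_id: str, salt: str, canary_buckets: int = 5) -> bool:
    """Buckets 0-4 form the canary, matching the 5% slice in the text."""
    return bucket(user_id, salt) < canary_buckets

def pre_period_means(engagement_by_user: dict, salt: str):
    """Return (canary mean, rest mean) of a metric measured BEFORE the rollout."""
    canary = [v for u, v in engagement_by_user.items() if in_canary(u, salt)]
    rest = [v for u, v in engagement_by_user.items() if not in_canary(u, salt)]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(canary), mean(rest)
```

If the two pre-period means differ noticeably before any treatment has been applied, the canary arm is a different population and any later lift is suspect. Using a per-experiment salt also keeps the same users from landing in the canary of every rollout.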

The Model Rollout Flag That Bucketed by Session and Drifted Your A/B Cohort

· 11 min read
Tian Pan
Software Engineer

The post-mortem opened with a sentence everyone in the room wanted to be true: the new model won by 4 percent on satisfaction, p less than 0.01, ship it. A month later a colder analysis found that the lift was a confound, the model was actually flat or slightly worse, and the team had spent the intervening weeks debating which prompt change had "caused" the win. Nothing about the model had caused anything. The experiment had been measuring the wrong thing because the flag service and the analysis pipeline disagreed, silently, about what a cohort was.

This is one of the most expensive failure modes in A/B testing because nothing in the system is broken. The flag service works. The experiment tracker works. The dashboard renders. The statistics are computed correctly on the data they receive. The failure lives in the seam between three components that each carry a different assumption about identity, and the seam is invisible until you go looking for it.
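The seam can be made visible with a cheap audit over the exposure log. The sketch below assumes, as the title suggests, that the flag assigns arms per session while the analysis groups by user; the `(user_id, session_id, arm)` record shape is hypothetical.

```python
from collections import defaultdict

def users_in_multiple_arms(exposures):
    """exposures: iterable of (user_id, session_id, arm) tuples.

    Returns {user_id: set_of_arms} for every user exposed to more than one arm.
    """
    arms_by_user = defaultdict(set)
    for user_id, _session_id, arm in exposures:
        arms_by_user[user_id].add(arm)
    return {u: arms for u, arms in arms_by_user.items() if len(arms) > 1}

# Illustrative log: user u1 saw both arms across two sessions.
log = [
    ("u1", "s1", "control"),
    ("u1", "s2", "treatment"),
    ("u2", "s3", "treatment"),
]
print(users_in_multiple_arms(log))
```

A non-trivial share of users in this result means that a per-user analysis is attributing outcomes to an arm the user was only sometimes in.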

The Persona Your System Prompt Offered That the Model Picked the Same Way Every Time

· 10 min read
Tian Pan
Software Engineer

A product team I talked to recently ran a three-arm A/B test on response personas — concise, thorough, conversational — for three weeks across every cohort. The system prompt described all three and asked the model to pick the one that best matched the user. When they opened the dataset to write the readout, one number stopped them cold: the "thorough" arm had 91% of the traffic. The other two were rounding error.

Their experiment platform had not flagged anything. No alert fired. The pipeline did exactly what they had told it to do. Three weeks of supposed multi-persona testing had produced a dataset that could only tell them about thorough. The other two arms were too thin to power any inference at all.

The instinct in the room was that the prompt needed work — better instructions, sharper distinctions between personas, a more deliberate example for the conversational case. That diagnosis would have been right ten years ago in a rules-driven router. It is wrong for a model. The prompt was not the variable. The router was.
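An arm-balance alert would have surfaced the skew in the first day rather than at the readout. The sketch below is a chi-square goodness-of-fit check in the style of a sample ratio mismatch test. It assumes the team expected roughly equal traffic per persona; the counts are invented to match the 91% figure, and since the model (not a randomizer) picked the arm here, a failed check points at the router rather than at a broken hash.

```python
import math

def chi_square_stat(observed: dict, expected_share: dict) -> float:
    """Pearson chi-square statistic for observed arm counts vs expected shares."""
    total = sum(observed.values())
    return sum(
        (observed[arm] - total * expected_share[arm]) ** 2 / (total * expected_share[arm])
        for arm in observed
    )

def p_value_three_arms(stat: float) -> float:
    """With three arms there are 2 degrees of freedom, where the chi-square
    survival function has the closed form exp(-x / 2)."""
    return math.exp(-stat / 2)

# Hypothetical counts consistent with "thorough had 91% of the traffic".
observed = {"concise": 500, "thorough": 9100, "conversational": 400}
expected = {arm: 1 / 3 for arm in observed}

stat = chi_square_stat(observed, expected)
if p_value_three_arms(stat) < 0.001:
    print("arm imbalance: traffic does not match the expected split")
```

For a different number of arms the closed form no longer applies and a general chi-square survival function from a statistics library would be needed.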

The A/B Test Powered by Token Counts Instead of Outcomes

· 13 min read
Tian Pan
Software Engineer

A team I worked with shipped a prompt change that reduced output tokens by 22%. The experiment dashboard lit up green — variance was tight, the p-value was clean, and the cost savings extrapolated to six figures a year. Two weeks later, a product analyst poking at conversion funnels flagged that the downstream task completion rate had dropped 11% in the same window. The shorter outputs were leaving out a clarifying step that users had been quietly relying on to know what to click next.

The experiment platform had not lied. It had reported the exact metric the team configured as primary, and that metric had moved in the right direction. The problem was that the metric measured something the team did not actually care about. Tokens were cheap to count, the experiment infra had a turnkey integration for them, and outcomes were hard to instrument — so the team picked what the platform made easy. The result was a clean win on the dashboard and a regression in the product.
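One common guard against this is to pair the easy primary metric with an outcome guardrail and refuse to ship when the guardrail regresses. The sketch below is a simplified decision rule, not the team's platform logic: the `MetricResult` type and the 1% tolerance are assumptions, and it compares point estimates only, ignoring statistical significance.

```python
from dataclasses import dataclass

@dataclass
class MetricResult:
    name: str
    relative_change: float   # e.g. -0.22 for a 22% drop
    higher_is_better: bool

def signed_improvement(m: MetricResult) -> float:
    """Positive when the metric moved in its good direction."""
    return m.relative_change if m.higher_is_better else -m.relative_change

def ship_decision(primary: MetricResult, guardrails, tolerance: float = 0.01) -> bool:
    """Ship only if the primary improved and no guardrail regressed beyond tolerance."""
    if signed_improvement(primary) <= 0:
        return False
    return all(signed_improvement(g) >= -tolerance for g in guardrails)

# The numbers from the story: tokens down 22%, task completion down 11%.
tokens = MetricResult("output_tokens", -0.22, higher_is_better=False)
completion = MetricResult("task_completion_rate", -0.11, higher_is_better=True)

print(ship_decision(tokens, [completion]))  # False: the guardrail regressed
```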

The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter

· 10 min read
Tian Pan
Software Engineer

You ran the experiment cleanly. Two arms, one feature flag, a clear metric, the stats team blessed the design. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning your readout meeting happened. Your A/B test was real. The substrate it ran on was not.

This is the failure mode every team running retrieval-augmented generation eventually walks into and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed against. The experiment you thought was measuring a feature change was measuring a retrieval regime change with the feature flag noise on top.
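A drift canary is one way to notice a silent model rotation: embed a fixed set of probe queries on a schedule and compare against vectors captured at experiment start. The sketch below assumes the vectors have already been fetched from whatever embedding endpoint is in use; the function names and the 0.999 similarity threshold are illustrative, not a recommended value.

```python
import math

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def drifted_probes(baseline: dict, current: dict, min_similarity: float = 0.999) -> dict:
    """baseline/current map probe query -> embedding vector.

    Returns {query: similarity} for probes whose vector moved past the threshold.
    """
    drifted = {}
    for query, reference in baseline.items():
        similarity = cosine_similarity(reference, current[query])
        if similarity < min_similarity:
            drifted[query] = similarity
    return drifted

# Illustrative vectors: the second probe has shifted since the baseline snapshot.
baseline = {"reset password": [0.1, 0.2, 0.3], "cancel plan": [0.5, 0.1, 0.0]}
current = {"reset password": [0.1, 0.2, 0.3], "cancel plan": [0.3, 0.4, 0.1]}
print(drifted_probes(baseline, current))
```

Logging the result alongside the experiment timeline makes it possible to tell whether a metric shift coincided with a change in the retrieval substrate rather than with the feature flag.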

The Feature Flag Your Model Already Learned to Predict From the Inputs It Could See

· 10 min read
Tian Pan
Software Engineer

The treatment arm shipped because the dashboard said "+4% conversion, p < 0.01, n = 2.3M." Six weeks after the global rollout the lift was gone, and the team filed the post-mortem under "scale effects" because nothing else fit. The actual cause was sitting in the prompt assembler the whole time: the routing hash that decided arm assignment was derived from a user-tier attribute, and the same attribute was being interpolated into the prompt template three lines later. The model was reading the assignment in band. The "treatment" wasn't the prompt change. The treatment was the population the prompt change happened to attract.

This is a failure mode that doesn't exist in the experimentation playbooks teams inherit from the web era. A button color does not read the user's tier and decide to behave differently. A prompt does. Once your treatment is a string that the model interprets, every input that touches the routing decision and also touches the prompt becomes a back channel the experiment cannot close.
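A static check can catch the most direct version of this back channel: any attribute that feeds the routing decision and also appears as a placeholder in the prompt template. The sketch below assumes templates use Python `str.format` style placeholders; the attribute and template names are hypothetical.

```python
from string import Formatter

def template_fields(template: str) -> set:
    """Top-level placeholder names used in a str.format style template."""
    fields = set()
    for _literal, field_name, _spec, _conversion in Formatter().parse(template):
        if field_name:
            # Reduce "user.tier" or "user[tier]" to the root name "user".
            fields.add(field_name.split(".")[0].split("[")[0])
    return fields

def leaked_routing_inputs(routing_attributes, template: str) -> set:
    """Attributes that both decide arm assignment and get interpolated into the prompt."""
    return set(routing_attributes) & template_fields(template)

template = "You are helping a {user_tier} customer. Question: {question}"
print(leaked_routing_inputs({"user_tier"}, template))  # {'user_tier'}
```

This only finds direct overlap. An attribute that is correlated with the routing input (for example, a plan name that implies the tier) passes the check and still leaks the assignment, so the safer design is to route on something the prompt never sees.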

The Agent A/B Test Whose Variants Quietly Shared Long-Term Memory

· 11 min read
Tian Pan
Software Engineer

You ran the cleanest A/B test of your career. Traffic split was 50/50, the hash function looked fine, the metric pipeline did not lie, the holdout was preserved, and after three weeks the analysis converged on a clear winner: variant B improved task completion by four points, with a p-value the stats team had no objections to. You shipped it to 100%. Two weeks after the rollout, the topline metric you launched on had drifted back toward the baseline, and nobody could explain why.

Here is the part that took a while to see. Both variants were writing to and reading from the same long-term memory store. The variant A agent wrote a memory like "this customer prefers blunt summaries" and the next day, when the same user happened to be on variant B, the variant B agent loaded that memory and read it into its prompt. The same contamination ran from B to A. The experiment was not comparing prompt A against prompt B. It was comparing "prompt A reading prompt-B-shaped memories" against "prompt B reading prompt-A-shaped memories." The result was an average over a contaminated joint distribution, and the launch was a regression to a different point on the same surface.
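One way to cut this channel is to scope memory reads and writes by variant for the duration of the experiment. The sketch below is a toy in-memory store to show the keying scheme; the class and method names are hypothetical, and a real deployment would apply the same namespace to whatever memory backend the agent uses.

```python
from collections import defaultdict

class VariantScopedMemory:
    """Long-term memory keyed by (variant, user_id) so arms cannot read each other's writes."""

    def __init__(self):
        self._store = defaultdict(list)

    def write(self, variant: str, user_id: str, memory: str) -> None:
        self._store[(variant, user_id)].append(memory)

    def read(self, variant: str, user_id: str) -> list:
        # .get avoids creating empty entries for keys that were never written.
        return list(self._store.get((variant, user_id), []))

memory = VariantScopedMemory()
memory.write("A", "user-1", "this customer prefers blunt summaries")

print(memory.read("A", "user-1"))  # the variant A agent sees its own memory
print(memory.read("B", "user-1"))  # the variant B agent sees nothing from A
```

Namespacing trades contamination for a cold start in each arm, and it does not help if users themselves move between variants from day to day, so sticky per-user assignment is a separate requirement.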

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was right. The p-value was comfortably under 0.05. Six weeks later, the metric has flat-lined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.

Why AI Features Break A/B Testing (and the Causal Inference Methods That Don't Lie)

· 11 min read
Tian Pan
Software Engineer

You ship an AI-powered feature, run a clean two-week A/B test, see a 4% lift in engagement, and call it a win. Six months later, the feature is fully rolled out and engagement is flat or declining. The test wasn't noisy — it was measuring the wrong thing entirely.

A/B tests were built for a world where users in a treatment group and users in a control group are statistically independent. AI features routinely violate that assumption. Users talk to each other, learn from each other's behavior, and share the outputs of AI tools. Treatment effects don't stabilize in two weeks when the real mechanism is long-horizon behavioral adaptation. When you ignore this, your experiment gives you a number that's internally consistent but causally meaningless.

The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features

· 8 min read
Tian Pan
Software Engineer

A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.

Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.