Skip to main content

10 posts tagged with "ab-testing"

View all tags

The Embedding Model Rotation That Shadowed Your A/B Test for a Quarter

· 10 min read
Tian Pan
Software Engineer

You ran the experiment cleanly. Two arms, one feature flag, a clear metric, the stats team blessed the design. Twelve weeks later you ship the winner, and the lift quietly evaporates within a sprint. The post-mortem turns up nothing in the code, nothing in the flag rollout, nothing on the analytics side. The thing that moved was something nobody on your experimentation list owned: the hosted embedding model behind your retrieval call returned a slightly different vector for the same query in week three, in week seven, and again on the morning your readout meeting happened. Your A/B test was real. The substrate it ran on was not.

This is the failure mode every team running retrieval-augmented generation eventually walks into and the one almost nobody designs against. The embedding endpoint is treated as a stable substrate the way Postgres is treated as a stable substrate. It is not. It is a model with a release cadence the vendor controls, a changelog you do not read, and a behavior surface that can shift without changing the dimension count, the SLA, or the API contract you signed against. The experiment you thought was measuring a feature change was measuring a retrieval regime change with the feature flag noise on top.

The Feature Flag Your Model Already Learned to Predict From the Inputs It Could See

· 10 min read
Tian Pan
Software Engineer

The treatment arm shipped because the dashboard said "+4% conversion, p < 0.01, n = 2.3M." Six weeks after the global rollout the lift was gone, and the team filed the post-mortem under "scale effects" because nothing else fit. The actual cause was sitting in the prompt assembler the whole time: the routing hash that decided arm assignment was derived from a user-tier attribute, and the same attribute was being interpolated into the prompt template three lines later. The model was reading the assignment in band. The "treatment" wasn't the prompt change. The treatment was the population the prompt change happened to attract.

This is a failure mode that doesn't exist in the experimentation playbooks teams inherit from the web era. A button color does not read the user's tier and decide to behave differently. A prompt does. Once your treatment is a string that the model interprets, every input that touches the routing decision and also touches the prompt becomes a back channel the experiment cannot close.

The Agent A/B Test Whose Variants Quietly Shared Long-Term Memory

· 11 min read
Tian Pan
Software Engineer

You ran the cleanest A/B test of your career. Traffic split was 50/50, the hash function looked fine, the metric pipeline did not lie, the holdout was preserved, and after three weeks the analysis converged on a clear winner: variant B improved task completion by four points, with a p-value the stats team had no objections to. You shipped it to 100%. Two weeks after the rollout, the topline metric you launched on had drifted back toward the baseline, and nobody could explain why.

Here is the part that took a while to see. Both variants were writing to and reading from the same long-term memory store. Users in variant A wrote a memory like "this customer prefers blunt summaries" and the next day, when the same user happened to be on variant B, the variant B agent loaded that memory and read it into its prompt. The reverse happened in the other direction. The experiment was not comparing prompt A against prompt B. It was comparing "prompt A reading prompt-B-shaped memories" against "prompt B reading prompt-A-shaped memories." The result was an average over a contaminated joint distribution, and the launch was a regression to a different point on the same surface.

When Two AI Features Compete for the Same Click

· 9 min read
Tian Pan
Software Engineer

A user lands on a search results page. Team A's smart summary fires in the top banner: "Here's the gist — skip the list." Team B's inline assistant pulses on the side: "Stay here, I'll keep reading with you." Both prompts compete for the same 800ms of attention, and the user — annoyed — closes the tab. The next morning, Team A reports a 6% lift in summary clicks; Team B reports a 4% lift in assistant opens; nobody in the room is wrong, and the product is worse than it was a quarter ago.

This is the failure mode that the standard playbook of independent feature teams and per-feature A/B tests cannot see. Each team locally optimized against its own metric. The user — who only has one attention budget, one mental model, and one click to give — paid the bill for the integration both teams declined to do.

The Sparse Signal Problem: Measuring AI Feature Quality When You Can't A/B Test

· 11 min read
Tian Pan
Software Engineer

You've shipped an AI writing assistant to your enterprise customers. Twenty-three people use it every day. Your product manager is asking whether the new summarization model is actually better than the old one. You have two weeks before the next sprint, and you need a decision.

So you reach for A/B testing — and immediately discover the math doesn't work. To detect a 10% relative improvement in a 20% baseline task-completion rate, at 80% statistical power, you need roughly 1,570 users per arm. At 23 daily users, you'd need 136 days to accumulate enough data. The feature will be deprecated before the test concludes.

This is the sparse signal problem. It isn't a B2B startup edge case. Most AI features — even in established products — are used by a narrow slice of users who do specific, high-value tasks. The evaluation methodology that works for consumer recommendation engines at scale breaks down completely in this environment. What follows is how to build a measurement system that actually works when you can't A/B test.

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was right. The p-value was comfortably under 0.05. Six weeks later, the metric has flat-lined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.

Variance Eats the Experiment: Why A/B Power Math Breaks for LLM Features

· 11 min read
Tian Pan
Software Engineer

The model team can demo the new feature and show ten convincing wins side by side. The growth team runs it as a two-week A/B test, gets p = 0.31, and the readout says "no significant effect." Both teams are right. The experiment is wrong.

This pattern repeats across every org that has bolted an LLM onto a product without rebuilding its experimentation stack. The math the growth team is using was designed for button colors, ranking changes, and pricing pages — features whose outputs are deterministic given a user and a context. LLM features break the two assumptions that math leans on, and the standard 80%-power, 5%-significance, two-week-ramp template ships systematically wrong calls in both directions: real wins read as null results, and noise reads as confident wins.

When Your Evals Disagree: A Signal Hierarchy for the Week the Numbers Contradict Each Other

· 12 min read
Tian Pan
Software Engineer

It's Tuesday morning, the week after a prompt change shipped to half your traffic. You open four dashboards. The held-out golden set scored by the LLM judge says +8%. The human-rater panel that samples production weekly says no change. The A/B test on downstream conversion says −2%. The thumbs-up rate is flat. Four signals, four verdicts, and a standup in fifteen minutes where someone is going to ask whether you ship the prompt or roll it back.

The temptation is to pick the number that confirms what you already wanted to do — and the team will, because nobody on the call has a written rule for which signal wins. The disagreement isn't a measurement bug. It's the predictable output of a system that bolted four evaluators together without a hierarchy, and the cost of not having one is that every release week becomes a debate about whose number to trust.

A/B Testing Non-Deterministic AI Features: Why Your Experimentation Framework Assumes the Wrong Null Hypothesis

· 10 min read
Tian Pan
Software Engineer

Your A/B testing framework was built for a world where the same input produces the same output. Change a button color, measure click-through rate, compute a p-value. The variance comes from user behavior, not from the feature itself. But when you ship an AI feature — a chatbot, a summarizer, a code assistant — the treatment arm has its own built-in randomness. Run the same prompt twice, get two different answers. Your experimentation infrastructure was never designed for this, and the consequences are worse than you think.

Most teams discover the problem the hard way: experiments that never reach significance, or worse, experiments that reach significance on noise. The standard A/B testing playbook doesn't just underperform with non-deterministic features — it actively misleads.