The Demo Was a Single Seed: Why Your AI Rollout Is a Variance Problem, Not a Polish Problem

· 11 min read
Tian Pan
Software Engineer

The exec demo went perfectly. The model answered the curated question, the agent completed the workflow, the screen recording is saved on the company drive, and the launch date is now in the calendar. Six weeks later the rollout craters and the post-mortem narrative writes itself: the model needed more polish, the prompt needed more iteration, the team underestimated the work between prototype and production.

That narrative is wrong, and it's expensive, because it sends the team back to do more of the work that already failed. The demo wasn't an under-polished version of production. It was a single sample from a distribution the team never measured. The wow moment was one realization out of thousands the model would generate against the same input, and the team shipped the best one as if it were the typical one. The gap between demo and prod isn't quality drift. It's variance the team hadn't yet seen.

This reframing matters because the fix for a variance problem looks nothing like the fix for a polish problem. Polish says "iterate the prompt, tune the model, hire a better PM." Variance says "you don't know what you have until you sample it n times across the input distribution." The two diagnoses produce different roadmaps, different budgets, and different incident patterns. The teams that ship reliably in 2026 know which problem they have.

What "single seed" actually means

Modern LLM serving is non-deterministic even at temperature zero. Floating-point non-associativity, GPU parallelism, and tensor-sharding routing mean that the same prompt run twice on the same model can produce different tokens. Set temperature=0, seed=42, and top_p=1 and you'll still see variance — providers document it, vLLM has open issues about it, and recent stability analyses report meaningful day-to-day metric drift on the same eval set.
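
The floating-point part of that claim is easy to see in isolation. The sketch below is illustrative only: it shows ordinary double-precision addition in Python, not any provider's serving stack. Summing the same three numbers in a different grouping gives a different result, which is the kind of discrepancy a changed reduction order on parallel hardware can introduce.

```python
# Floating-point addition is not associative: the grouping changes the result.
a, b, c = 0.1, 0.2, 0.3

left_grouped = (a + b) + c   # add a and b first
right_grouped = a + (b + c)  # add b and c first

print(left_grouped == right_grouped)  # False
print(left_grouped, right_grouped)
```

The difference is tiny, but when two candidate tokens have nearly tied scores, a discrepancy of this size can plausibly be enough to change which one a greedy decoder picks.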

A 2026 study on agent behavioral consistency found a coefficient of variation of 15% on the most consistent frontier model and 47% on a popular open-weights model. Put differently, the standard deviation of the measured outcome across repeated runs of the same task was between 15% and 47% of its mean, so results swing substantially from one run to the next. Scale AI's reliability work reports that identical evals can move 10–15% day-over-day on the same model, before any version change.
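
For readers who want to compute the same statistic on their own evals, here is a minimal sketch. The coefficient of variation is the standard deviation divided by the mean; the success rates below are made-up numbers for illustration, not figures from either study, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(values) / m


# Hypothetical success rates from five repeated runs of the same eval set.
success_rates = [0.70, 0.80, 0.60, 0.90, 0.75]

cv = coefficient_of_variation(success_rates)
print(f"mean={mean(success_rates):.2f}  cv={cv:.1%}")
```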

This is the math the demo silently ignores. When the founder showed the agent completing a multi-step workflow on stage, the audience saw one trajectory. The model had thousands of plausible trajectories for the same input, and the demo was the lucky one because the team picked the lucky one or, more charitably, because they only ran it three times and the third worked. The metrics the team is optimizing measure whether a single run succeeds on average (pass@1) or whether at least one of k runs succeeds (pass@k). The reliability metric production needs is whether every run succeeds, or at least whether the failure rate is bounded.

The result is that a pass@1 of 80% routinely collapses to a consistency-across-k-attempts of 25%. The agent that "works" is right on every attempt for only one question in four. A user who gets a failure on a question that worked yesterday stops trusting the feature, and the pass@1 metric never moved.
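
One way to see how a headline figure near 80% and a consistency figure near 25% can coexist is a back-of-envelope model. The sketch below assumes every run is independent and succeeds with the same probability p, which is a simplification (real tasks vary in difficulty); the function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent runs succeeds."""
    return 1 - (1 - p) ** k


def all_of_k(p: float, k: int) -> float:
    """Probability that all k independent runs succeed."""
    return p ** k


p = 0.80  # assumed per-run success probability
for k in (1, 3, 6):
    print(f"k={k}  any-of-k={pass_at_k(p, k):.3f}  all-of-k={all_of_k(p, k):.3f}")
```

Under these assumptions the "any run succeeds" number climbs toward 100% as k grows, while the "every run succeeds" number falls to about 26% by k=6. The same model looks better or worse depending on which question the metric asks.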

The two-axis input distribution your demo didn't sample

Variance has two independent axes and both are usually under-sampled at demo time.

The first is stochastic variance: same input, different runs. This is what the prior section described — the model rolls dice, and the demo only saw one roll. The fix is n-of-k sampling at demo time: run every demo input 8 to 32 times and report the failure rate, the worst output, and the disagreement among runs, alongside the headline "best run." If the worst-of-32 is unacceptable, you don't have a feature, you have a coin flip dressed as a product.
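
A repeated-sampling report along these lines can be sketched in a few lines of Python. Everything here is hypothetical scaffolding: `run_model` stands in for whatever calls your model, `score` for whatever grader you trust, and the 0.5 pass threshold is an arbitrary assumption. The stub model at the bottom exists only to make the example runnable.

```python
from collections import Counter
from typing import Callable


def sample_report(
    run_model: Callable[[str, int], str],  # hypothetical: (prompt, run index) -> output
    score: Callable[[str], float],         # hypothetical grader: higher is better
    prompt: str,
    k: int = 32,
    threshold: float = 0.5,                # assumed pass/fail cutoff
) -> dict:
    """Run one prompt k times and summarize the spread, not just the best run."""
    if k < 1:
        raise ValueError("k must be at least 1")
    outputs = [run_model(prompt, run_index) for run_index in range(k)]
    scores = [score(output) for output in outputs]
    failures = sum(1 for s in scores if s < threshold)
    top_count = Counter(outputs).most_common(1)[0][1]
    return {
        "runs": k,
        "failure_rate": failures / k,
        "best_output": outputs[scores.index(max(scores))],
        "worst_output": outputs[scores.index(min(scores))],
        # Fraction of runs that differ from the most common output.
        "disagreement": 1 - top_count / k,
    }


# Stand-in model and grader so the sketch runs: every fourth run fails.
def stub_model(prompt: str, run_index: int) -> str:
    return "bad" if run_index % 4 == 0 else "ok"


def stub_score(output: str) -> float:
    return 1.0 if output == "ok" else 0.0


report = sample_report(stub_model, stub_score, "demo prompt", k=8)
print(report)
```

Exact-match disagreement is a crude measure for free-text outputs; a semantic comparison would be a reasonable substitute, at the cost of another component to validate.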

The second is input variance: different inputs, drawn from the production distribution rather than from the curated demo set. The demo input was chosen because it works. Production inputs are not chosen — they arrive. They contain typos, missing fields, ambiguous intent, conflicting constraints, and locale-specific structure the demo dataset under-represents. Recent surveys of production AI failures report a recurring pattern: 98% success on the typical case and 50% failure on the 5% of cases that diverge from the curated set. The mean is fine. The tail is catastrophic. And users do not visit the mean — they visit themselves, and the long-tail user is the one who churns loudly.
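
The arithmetic behind "the mean is fine, the tail is catastrophic" is worth doing once. The sketch below takes the figures quoted above and assumes, for illustration, that the typical slice is exactly the remaining 95% of traffic.

```python
# Figures from the pattern above; the 95% / 5% traffic split is an assumption.
slices = {
    "typical": {"traffic_share": 0.95, "success_rate": 0.98},
    "divergent": {"traffic_share": 0.05, "success_rate": 0.50},
}

# Blended success rate: what a single headline number would report.
overall_success = sum(
    s["traffic_share"] * s["success_rate"] for s in slices.values()
)

# How much of the total failure volume each slice contributes.
failure_mass = {
    name: s["traffic_share"] * (1 - s["success_rate"])
    for name, s in slices.items()
}
tail_share_of_failures = failure_mass["divergent"] / sum(failure_mass.values())

print(f"overall success: {overall_success:.1%}")
print(f"failures from the divergent slice: {tail_share_of_failures:.1%}")
```

Under these assumptions the blended number is 95.6%, which looks shippable, while the 5% slice accounts for more than half of all failures.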

Both axes need to be sampled before launch. A demo that varies inputs but pins the seed measures input robustness without measuring stochasticity. A demo that varies the seed but pins the input measures stochasticity without measuring distribution coverage. Production hits both simultaneously. The pre-launch eval has to as well.
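
One way to keep the two axes separate in a pre-launch eval is to run a grid of inputs by repeated runs and bucket each input by how its runs came out. This is a sketch with made-up data; the bucket names and input IDs are illustrative, not an established taxonomy.

```python
def classify_inputs(results: dict[str, list[bool]]) -> dict[str, list[str]]:
    """Bucket each input by the pass/fail outcomes of its repeated runs."""
    buckets: dict[str, list[str]] = {"stable_pass": [], "flaky": [], "stable_fail": []}
    for input_id, runs in results.items():
        if not runs:
            raise ValueError(f"no runs recorded for {input_id}")
        if all(runs):
            buckets["stable_pass"].append(input_id)
        elif not any(runs):
            buckets["stable_fail"].append(input_id)   # fails on every run
        else:
            buckets["flaky"].append(input_id)         # outcome depends on the run
    return buckets


# Hypothetical grid: four inputs, four runs each (True means the run passed).
grid = {
    "curated_demo_case": [True, True, True, True],
    "typo_in_address": [True, False, True, True],
    "missing_field": [False, False, False, False],
    "ambiguous_intent": [False, True, False, True],
}

buckets = classify_inputs(grid)
print(buckets)
```

Inputs that fail on every run point at the input axis (the case is not covered), while flaky inputs point at the stochastic axis. An eval that pins either axis cannot tell the two apart.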

The variance-first discipline: three artifacts every launch needs

A team that has internalized "variance, not polish" produces three artifacts before any launch milestone is checked. None of them require new tooling — they require commitment to running the experiment before shipping the feature.
