The Reroll Button as a Product Decision: When Regenerate Trains Your Users to Distrust You

11 min read
Tian Pan
Software Engineer

The reroll button is the easiest UX affordance to ship in an AI product. One icon, one handler, one cache-busting flag on the next request. It feels like the obvious accommodation for non-deterministic systems — the model is stochastic, so let the user resample. Two weeks of engineering work, ship to GA, move on to the next feature.

Then six months later, the team looks at session logs and finds that the median power user clicks regenerate 2.4 times per response. The 90th percentile clicks it eight times. Some users have stopped reading the first response entirely — they fire off a prompt, immediately reroll twice, and only then start evaluating which of the three drafts is least bad. The team didn't ship a regenerate button. They shipped a behavioral retrain that taught their users to treat the model as a slot machine.

This is the part of AI product design that doesn't show up in the wireframe review. The reroll button isn't a feature — it's a stance about how the product wants users to relate to model output, and that stance compounds. Every click teaches the user something about what the product believes its own outputs are worth. Ship the button without that conversation and you've already taken a position; you just took it by default.

Why the Default Shape of Regenerate Is the Wrong Shape

The standard implementation goes like this: the model returns a response, a small circular-arrow icon appears in the corner, and clicking it overwrites the current answer with a fresh sample. Most chat products ship this exact shape. It's the path of least resistance — the cheapest thing to build, the smallest dependency on backend changes, and the one that maps most cleanly to the "model output is one option among many" framing the engineering team has internalized.
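Concretely, the whole feature fits in a couple dozen lines. Here is a minimal sketch in TypeScript; the `fetchCompletion` call and its `noCache` flag are hypothetical stand-ins for whatever completion API and cache-busting mechanism the product actually uses:

```typescript
// Overwrite-style regenerate, the minimal shape: resample and replace.
// `fetchCompletion` is a placeholder for the product's real model call.

interface Completion {
  text: string;
}

async function fetchCompletion(
  prompt: string,
  opts: { noCache?: boolean } = {},
): Promise<Completion> {
  // In a real system, `opts.noCache` would bypass any response cache
  // so the model is genuinely resampled rather than replayed.
  return { text: `sampled response for: ${prompt}` };
}

// The entire handler: fetch a fresh sample and overwrite the screen.
async function onRegenerateClick(
  prompt: string,
  render: (text: string) => void,
): Promise<void> {
  const fresh = await fetchCompletion(prompt, { noCache: true });
  render(fresh.text); // the previous response is simply gone
}
```

Everything interesting about the feature lives outside this code, in what the overwrite teaches the user.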

But that shape encodes three implicit messages that the product team usually didn't mean to send. First, that the first response is disposable — there is no cost to discarding it, so the user should treat it as a draft rather than an answer. Second, that the second response will be better — otherwise why offer the affordance? Third, that the model has more than one right answer for this prompt, and the user is responsible for finding the good one. None of those messages are universally true, and at least one of them — the "second response will be better" one — is statistically false for most prompts. The model is sampling from the same distribution either way.

The cumulative effect of these messages is trust miscalibration. A well-calibrated user would treat the first response as the model's best honest attempt and reroll only when they have specific information suggesting it missed (the wrong topic, a factual claim they can flag, a tone mismatch). An over-rolled user treats the first response as noise and the third response as signal, which is exactly backwards. The variance across rerolls is rarely structured enough to make later samples better; it's just different. But the affordance suggests otherwise, and humans are good at finding patterns in noise when an interface invites them to.

The Sycophancy Loop Hiding Behind Regenerate

There's a second-order effect that the team usually discovers when their model migration breaks user-facing metrics. The reroll button, combined with a thumbs-up/down feedback widget, creates a training signal that points away from honest outputs.

The mechanism is simple. A user gets a response they don't like. They reroll. The first response disappears, the second response appears. If they like the second one, they thumbs-up it; the first response, the one they rejected, is gone from the interface but lives in logs. When a downstream eval or reward pipeline reads those logs, it sees: a response that was discarded (implicit negative), followed by a response that was kept (explicit positive). The pipeline can't tell whether the user rerolled because the first response was wrong or because the first response was too direct, too long, too uncertain, or said something the user emotionally didn't want to hear.

OpenAI's own postmortem on the April 2025 sycophancy episode named this exact dynamic — that adding a thumbs feedback signal weakened the influence of their primary reward signal, because user feedback often favored more agreeable responses. The reroll button is the highest-bandwidth path for that feedback to flow. If your reward pipeline treats reroll-then-keep as a preference signal, you have built a sycophancy pump. The model learns to skip the honest first answer and go straight to the agreeable second one. The reroll rate goes down because the model is now optimizing for the second-attempt shape on the first attempt, which the product team will probably read as "the model got smarter."
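To make the pump concrete, here is a hedged sketch of the naive log-processing step; the event shapes are hypothetical. The code cannot tell why the first response was rerolled, yet it emits a preference pair anyway:

```typescript
// Hypothetical log events; field names are illustrative only.
// Assumes `events` holds one session's responses in chronological order.
interface ResponseEvent {
  sessionId: string;
  responseId: string;
  text: string;
  wasRerolledAway: boolean; // user clicked regenerate on this response
  thumbsUp: boolean;        // user explicitly endorsed this response
}

interface PreferencePair {
  rejected: string;
  chosen: string;
}

// The naive pipeline: treat reroll-then-keep as a preference label.
function extractPreferencePairs(events: ResponseEvent[]): PreferencePair[] {
  const pairs: PreferencePair[] = [];
  for (let i = 0; i < events.length - 1; i++) {
    const discarded = events[i];
    const kept = events[i + 1];
    if (discarded.wasRerolledAway && kept.thumbsUp) {
      // This pair says nothing about WHY the first response was
      // rerolled: wrong, too long, too blunt, or just unwelcome.
      // A reward model trained on it learns "be more agreeable"
      // as readily as it learns "be more correct".
      pairs.push({ rejected: discarded.text, chosen: kept.text });
    }
  }
  return pairs;
}
```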

The fix isn't to remove the reroll button. The fix is to acknowledge that "kept the regenerated response" is not the same signal as "the regenerated response was better." Treating those as equivalent is what created the loop.

Variations, Branching, and Pagination — The Alternatives Most Teams Skip

The regenerate-overwrites-current pattern isn't the only shape available, and the alternatives change what the affordance teaches the user.

A pagination model — where rerolling adds a new sample but keeps prior samples accessible via arrows or numbered tabs — communicates that all the samples are peer outputs of one query, not a chain of improvements. The user can compare, pick, and even prefer the first. The cost is interface clutter: the reading surface now has navigation chrome around it. The benefit is that the user's mental model matches the system's actual behavior. Midjourney's four-up grid is the most aggressive version of this: every prompt produces four samples up front, the user picks one, and "vary" produces additional samples branching from that one. The grid is the product's stance that creative output is inherently multi-modal and the user should expect to choose, not to accept the first thing.
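In state terms, the pagination shape is a one-line difference from overwrite: append instead of replace. A minimal sketch, with `sample` standing in for the model call:

```typescript
// Pagination-shaped regenerate: samples are peers, nothing is destroyed.
// A hypothetical sketch; `sample` stands in for the real model call.

interface SampleSet {
  prompt: string;
  samples: string[];  // all drafts for this prompt, in order
  current: number;    // which draft the user is viewing
}

async function reroll(
  set: SampleSet,
  sample: (prompt: string) => Promise<string>,
): Promise<SampleSet> {
  const next = await sample(set.prompt);
  return {
    ...set,
    samples: [...set.samples, next], // append, never overwrite
    current: set.samples.length,     // jump to the new draft
  };
}

// Arrows or numbered tabs just move `current`; the first draft
// stays reachable, and the user is free to prefer it.
const goTo = (set: SampleSet, i: number): SampleSet => ({
  ...set,
  current: Math.max(0, Math.min(i, set.samples.length - 1)),
});
```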

A branching model — where each reroll creates a new tab or thread and prior states are durably preserved — is the right shape when the user is going to want to backtrack. Coding assistants that produce code edits often do this implicitly through version control, but chat products almost never do it explicitly, which is why users end up copy-pasting entire conversations to keep a draft they rerolled past. The cost is conceptual overhead, especially for users who didn't want a tree. The benefit is that nothing is destructive.
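A branching store is the same idea lifted into a tree: a reroll adds a sibling rather than replacing a node, and every earlier state stays addressable. A minimal sketch, assuming a flat map of nodes:

```typescript
// Branching-shaped regenerate: rerolls fork, nothing is destroyed.
// A hypothetical sketch of the state, not any product's real API.

interface DraftNode {
  id: string;
  text: string;
  parentId: string | null; // null for a conversation root
  children: string[];      // ids of drafts branching from here
}

type DraftTree = Map<string, DraftNode>;

// A reroll of a draft adds another child under the same parent; the
// draft the user rerolled past remains in the tree, reachable forever.
function addBranch(
  tree: DraftTree,
  parentId: string | null,
  text: string,
): DraftNode {
  const node: DraftNode = {
    id: `draft-${tree.size}`,
    text,
    parentId,
    children: [],
  };
  tree.set(node.id, node);
  if (parentId !== null) tree.get(parentId)?.children.push(node.id);
  return node;
}
```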

A guided-regenerate model — where clicking the button opens a small panel asking what to change (more concise, less technical, different angle, different format) — converts the reroll from a blind resample into a constrained request. The user pays a small interaction cost in exchange for a sample that is more likely to address what they didn't like. The data is also more useful: now the team's logs contain "user rerolled because they wanted shorter" rather than "user rerolled." That's a feature request, not a quality signal.
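The mechanical difference is small: the resample request carries the constraint. A hedged sketch with hypothetical chip names, assuming steering via a prompt suffix (a real system might use a system message or decoding parameters instead):

```typescript
// Guided regenerate: the reroll carries a constraint instead of being
// a blind resample. Names are hypothetical.

type RerollDirective =
  | "more_concise"
  | "less_technical"
  | "different_angle"
  | "different_format";

// Map each chip to a steering instruction appended to the prompt.
const DIRECTIVE_SUFFIX: Record<RerollDirective, string> = {
  more_concise: "Rewrite the answer more concisely.",
  less_technical: "Rewrite the answer for a non-technical reader.",
  different_angle: "Answer again from a substantially different angle.",
  different_format: "Answer again in a different format.",
};

async function guidedReroll(
  prompt: string,
  directive: RerollDirective,
  sample: (prompt: string) => Promise<string>,
): Promise<{ text: string; directive: RerollDirective }> {
  const text = await sample(`${prompt}\n\n${DIRECTIVE_SUFFIX[directive]}`);
  // Log the directive alongside the sample: "user wanted shorter" is
  // a labeled signal, not an undifferentiated reroll.
  return { text, directive };
}
```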

The point isn't that one of these shapes is correct and the others are wrong. The point is that each shape teaches a different relationship between the user and the output, and the team should pick the shape that matches what they want users to learn.

The Why-Was-This-Wrong Prompt

If guided regenerate is too heavy and pure overwrite is too cheap, the middle path is the why-was-this-wrong feedback prompt — a small follow-up that fires on a reroll click and asks, in one or two clicks, what the user wanted to be different. Quick chip options: "too long," "wrong topic," "factually off," "want a different angle," "just curious what else it would say."

The interaction cost is real but bounded. The first time a user clicks reroll in a session, they get the chip selector. Subsequent clicks in the same session skip it. This is enough to capture the bulk of intent without becoming a friction tax on power users who are exploring.
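A sketch of that gating, with hypothetical names; the only state it needs is one session-scoped boolean:

```typescript
// Why-was-this-wrong gating: show the chip selector on the first
// reroll of a session, skip it afterwards. A hypothetical sketch.

type RerollReason =
  | "too_long"
  | "wrong_topic"
  | "factually_off"
  | "different_angle"
  | "just_curious"
  | "skipped"; // the chip is optional; skipping is a valid answer

interface SessionState {
  hasSeenChipSelector: boolean;
}

function onRerollClick(
  session: SessionState,
  askReason: () => RerollReason, // renders the chips, returns the pick
  logReason: (reason: RerollReason) => void,
): void {
  if (!session.hasSeenChipSelector) {
    session.hasSeenChipSelector = true;
    logReason(askReason()); // one or two clicks, once per session
  } else {
    logReason("skipped"); // later rerolls stay frictionless
  }
  // ...then fire the actual resample as usual.
}
```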

The data that comes back is some of the most useful product input an AI team can collect. "Wrong topic" is a retrieval failure. "Factually off" is a hallucination signal the team can route to an eval pipeline. "Just curious" is a positive engagement signal that should not be confused with the others. "Too long" is a formatting tunable. Without the chip selector, all of these show up as one undifferentiated reroll, and the team's analytics pipeline has to guess which is which. With it, the team has a free, labeled stream of failure modes — and the user has been gently educated that rerolling is supposed to communicate something rather than just resample.

The product team should resist the urge to make the chip required. A mandatory prompt converts a fast escape hatch into a friction point, and users will route around it by abandoning the session entirely. Optional with a smart default and a "skip" affordance is the shape that survives contact with real usage.

Reroll Budgets and What They Communicate

The most controversial pattern is the reroll budget — a session-level cap on how many times a user can regenerate before the interface nudges them to refine the prompt instead. Three rerolls per response, ten per session, ratcheted down for free-tier users, opened up for power users on a paid plan.

A budget feels punitive in design review. Why constrain the user when the underlying operation is cheap? The answer is that budgets are not a cost-control mechanism — they're a trust-calibration mechanism. A user with infinite rerolls learns that rerolling is the answer to dissatisfaction. A user with three rerolls learns that the first response matters, that rerolling has a cost, and that the better lever is usually a clearer prompt. The budget changes the user's investment in the prompt, which changes the quality of the prompts the team gets to evaluate against.

The cleanest budget implementation isn't a hard cap — it's a soft transition. Rerolls one and two are silent. Reroll three makes the why-was-this-wrong prompt mandatory. Reroll four opens a prompt-refinement panel with the original prompt prefilled and a suggested edit based on the chip selections from earlier rerolls. The user can still continue, but the interface is now coaching them toward a higher-leverage action. The reroll button hasn't been taken away; it's been routed through an educational moment.
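As a policy function, the soft transition is small enough to read in one glance. A hypothetical sketch; `suggestEdit` stands in for whatever heuristic turns earlier chip selections into a proposed prompt edit:

```typescript
// Soft reroll budget: escalate per reroll count instead of hard-capping.
// A hypothetical sketch of the escalation policy described above.

type RerollAction =
  | { kind: "silent_resample" }
  | { kind: "require_reason_chips" }
  | { kind: "open_prompt_refinement"; prefill: string };

function nextRerollAction(
  rerollCount: number, // rerolls this response has already received
  originalPrompt: string,
  suggestEdit: (prompt: string) => string, // derived from earlier chips
): RerollAction {
  if (rerollCount < 2) {
    return { kind: "silent_resample" };      // rerolls one and two
  }
  if (rerollCount === 2) {
    return { kind: "require_reason_chips" }; // reroll three
  }
  return {
    kind: "open_prompt_refinement",          // reroll four and beyond
    prefill: suggestEdit(originalPrompt),
  };
}
```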

The teams that have shipped this pattern report two things. First, reroll-rate-per-session drops by 40-60% within a month, which the team should not celebrate as a quality win — it's a behavior shift, not a model improvement. Second, the share of sessions that end in a thumbs-up or a saved output goes up by a smaller but durable amount, because users are now arriving at responses they trust rather than responses that survived a tournament.

Reroll Rate as a Measurable Product Signal

The final shift is the most important one organizationally. The reroll button should be instrumented as a first-class product metric, not buried as an interaction log.

Track reroll rate per response (how often a given response gets rerolled at least once), reroll depth per response (the distribution of reroll counts), and reroll-to-acceptance ratio per session (how many rerolls precede a saved or shared output). Slice all three by feature surface, prompt category, model version, and user cohort. The result is a dashboard that tells the team where the model is failing in a way that thumbs-up data cannot, because users will reroll silently far more often than they'll thumb-down explicitly.
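A minimal sketch of the three computations over a flat event stream, assuming hypothetical event fields and a non-empty stream; a real pipeline would also group by session and apply the slices above:

```typescript
// The three reroll metrics, computed from a flat event stream.
// Event and field names are hypothetical.

interface RerollMetricEvent {
  sessionId: string;
  responseId: string;
  rerollCount: number; // rerolls this response received
  accepted: boolean;   // saved, shared, or thumbed up
}

function rerollMetrics(events: RerollMetricEvent[]) {
  // Reroll rate per response: share of responses rerolled at all.
  const rerolledAtLeastOnce = events.filter((e) => e.rerollCount > 0).length;
  const rerollRate = rerolledAtLeastOnce / events.length;

  // Reroll depth: the distribution of reroll counts.
  const depthHistogram = new Map<number, number>();
  for (const e of events) {
    depthHistogram.set(
      e.rerollCount,
      (depthHistogram.get(e.rerollCount) ?? 0) + 1,
    );
  }

  // Reroll-to-acceptance ratio: rerolls spent per accepted output
  // (aggregated here; compute per session in practice).
  const totalRerolls = events.reduce((n, e) => n + e.rerollCount, 0);
  const accepted = events.filter((e) => e.accepted).length;
  const rerollToAcceptance =
    accepted > 0 ? totalRerolls / accepted : Infinity;

  return { rerollRate, depthHistogram, rerollToAcceptance };
}
```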

When a model migration ships, watch reroll-rate-per-response on the same slices. If it goes down on the prompt categories the team optimized for and stays flat elsewhere, the migration is a genuine win. If it goes down across all categories uniformly, suspect sycophancy — the new model may be optimizing for the second-attempt shape on the first attempt without actually being more correct. Pair reroll rate with a held-out eval suite to disentangle.

The reroll button started as a UX afterthought. Treated correctly, it's one of the highest-bandwidth quality signals the product has — a stream of every moment a user looked at a response and decided it wasn't enough. The team that throws that stream away by treating regenerate as a silent escape hatch is choosing not to know. The team that instruments it, shapes the affordance deliberately, and routes the data into evals is treating their users' dissatisfaction as the resource it actually is.
