
The Thumbs-Up Button That Poisoned Your Eval Set Through the Back Door

· 11 min read
Tian Pan
Software Engineer

A thumbs-up button is the cheapest signal you will ever instrument. It is also one of the most dangerous, because nothing about it announces that it is reshaping the distribution your eval set is supposed to represent. The click is logged as a positive, the curation pipeline reads it as quality, and six months later the eval is dominated by examples chosen by a cohort that does not include the customers most likely to churn.

The failure rarely shows up as a regression. It shows up as a divergence: weekly eval trends up, the enterprise tier's NPS slides, and the team only diagnoses the gap when a churned account names the specific kind of question their team kept getting wrong. The eval set has no examples shaped like it. The signal you were optimizing was real. It was just measuring the wrong distribution.

This post is about implicit feedback as a sample-selection mechanism, the mechanics by which it leaks into eval sets, and what an eval-set governance policy has to look like to keep the cohort that pays you from being silently overwritten by the cohort that clicks.

The Quiet Mechanics of the Leak

The leak rarely happens in a single step. It happens through a pipeline that looked reasonable at every joint.

Start with the affordance. A thumbs-up button is binary, low-friction, and emitted disproportionately by users in a casual interaction: users who are not blocked, not on a deadline, and not running the model against a hard task. In production chatbots, fewer than five percent of responses receive any user feedback at all, and that small fraction is not a uniform random sample. It skews toward short interactions, friendly tones, and users who treat the button as a way to be polite rather than as a measurement instrument.

Now layer the curation pipeline. The data team samples the highest-rated responses as a training signal — defensible — and as a candidate pool for new eval examples — also defensible in isolation. The curation rubric screens for clarity, topical coverage, language, and toxicity. It does not screen for selection bias, because selection bias is not a property of an individual example. It is a property of how the example was admitted to the pool. By the time a single row reaches eval, the chain of custody that would let you reconstruct its provenance has been compacted into "candidate from feedback pool, reviewer approved."

Finally, layer the eval team. The eval team accepts thumbs-up-derived examples on the assumption that "users liked the answer" is a usable proxy for correctness. It is a proxy for satisfaction in the moment of clicking, which is a different quantity. A user can like a confidently wrong answer to a question they could not verify. A user can withhold a thumbs-up from a correct answer that took too long to render. The button measures the meeting point of expectation and rendering — not the meeting point of question and ground truth.

Six months in, the eval is dominated by casual queries with short, friendly answers that score high on engagement and low on the technical accuracy that determines whether your hardest customers renew. Every step in the pipeline did what it was designed to do. The eval set drifted anyway.
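To make the arithmetic of the leak concrete, here is a minimal sketch of how unequal click rates reshape a candidate pool. The cohort names, traffic shares, and thumbs-up rates are all hypothetical numbers chosen for illustration, not measurements from any real product.

```python
# Hypothetical inputs: share of total traffic and thumbs-up rate per cohort.
traffic_share = {"casual": 0.60, "power_user": 0.30, "enterprise": 0.10}
thumbs_up_rate = {"casual": 0.06, "power_user": 0.02, "enterprise": 0.005}


def feedback_pool_share(traffic, click_rate):
    """Expected share of each cohort among rows that received a thumbs-up."""
    clicks = {cohort: traffic[cohort] * click_rate[cohort] for cohort in traffic}
    total = sum(clicks.values())
    return {cohort: clicks[cohort] / total for cohort in clicks}


pool_share = feedback_pool_share(traffic_share, thumbs_up_rate)

for cohort in traffic_share:
    print(
        f"{cohort:12s} traffic={traffic_share[cohort]:.1%} "
        f"feedback_pool={pool_share[cohort]:.1%}"
    )
```

Under these assumed rates the enterprise cohort is a tenth of traffic but only about one percent of the thumbs-up pool, before any reviewer has looked at a single row. No curation rubric applied to individual examples can see that.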

Why "Users Gave Us This Signal" Is Not "This Signal Represents Our Users"

There is a sentence that gets said in product reviews when the eval graph and the NPS graph diverge: "but users gave us this signal." It is the moment where the team that owns the model and the team that owns the customer relationship discover they are running on different definitions of the word "user."

The eval set's user is a sample-weighted average of who clicked the button. The NPS panel's user is a survey-weighted average of who replied to the survey, with explicit cohort breakdowns. The churn analysis's user is a cohort-weighted average of who canceled, with the cohorts that canceled most overrepresented relative to traffic. These three distributions are not the same. They are not even close. And the eval set's distribution is the most invisibly distorted of the three, because the sampling mechanism — a button — does not produce a frame the way a survey or a churn list does.

This is survivorship bias dressed in product analytics. Decisions are made on the data of the users who survived long enough to engage; the data of the users who did not is not in the pool at all. You will overfit your eval to a population that, by construction, excludes the population that matters most for the business. "Users gave us this signal" is not the same statement as "this signal represents our users." A team that conflates the two ships eval improvements that read like product wins, measured against a set increasingly tuned to the cohort least likely to leave.

The hardest part of catching this in flight is that the local optimization is real. The model genuinely is getting better at the prompts in the eval set. The eval set is genuinely accumulating examples that more than one reviewer agreed were quality. The pipeline is functioning. It is the population the pipeline is sampling from that has drifted underneath everyone's feet.

Patterns That Close the Gap

The corrective is not to throw out implicit feedback. Implicit feedback is the largest and cheapest source of signal you have, and refusing to use it because it is biased is a luxury that only products with budget for fully hand-labeled eval sets can afford. The corrective is to treat the bias as a measurable property and to govern around it.

Cohort coverage as an acceptance criterion. Define the cohorts your product needs to be good for — enterprise admins, power users, first-week users, users running adversarial prompts, users on the surfaces with the highest revenue exposure — and require every eval set to publish its coverage of those cohorts as a first-class statistic. New examples are accepted on whether they improve coverage on under-represented cohorts, not only on whether a reviewer scored them high. An example that arrives via the feedback pool and increases coverage on a cohort that needs it is admissible; an example that arrives via the feedback pool and worsens the cohort balance is rejected even if it is a beautiful prompt. The eval set is a sample of the world you want to measure, and the acceptance gate is the place to enforce that the sample looks like the world.
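One way to sketch such an acceptance gate is below. The cohort names, the target shares in `TARGET_SHARE`, and the choice of "total shortfall against targets" as the coverage metric are all assumptions made for illustration; a real policy would pick its own targets and may prefer a different metric.

```python
from collections import Counter

# Hypothetical target share of the eval set per cohort.
TARGET_SHARE = {
    "enterprise_admin": 0.30,
    "power_user": 0.30,
    "first_week": 0.20,
    "adversarial": 0.20,
}


def coverage_gap(cohort_labels, targets=TARGET_SHARE):
    """Sum over cohorts of max(0, target share - actual share)."""
    counts = Counter(cohort_labels)
    n = len(cohort_labels)
    gap = 0.0
    for cohort, target in targets.items():
        actual = counts[cohort] / n if n else 0.0
        gap += max(0.0, target - actual)
    return gap


def admit(eval_cohorts, candidate_cohort, reviewer_approved):
    """Admit only if a reviewer approved the row AND it shrinks the coverage gap."""
    if not reviewer_approved:
        return False
    before = coverage_gap(eval_cohorts)
    after = coverage_gap(eval_cohorts + [candidate_cohort])
    return after < before


# Example: an eval set that is heavy on power users and has no adversarial rows.
current = ["power_user"] * 6 + ["first_week"] * 3 + ["enterprise_admin"]
print(admit(current, "enterprise_admin", reviewer_approved=True))  # True
print(admit(current, "power_user", reviewer_approved=True))        # False
```

The second candidate is rejected regardless of how well the reviewer scored it, which is the point of the gate: reviewer approval is necessary but not sufficient.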

Strict separation between training signal and eval inclusion. Thumbs-up is a fine signal to feed into reward modeling for fine-tuning. It is a dangerous signal to admit, untransformed, into the eval set, because the eval set is the instrument you use to check whether the training signal made the model actually better. If both sides of the loop are sampled from the same biased population, you have built a closed system that ratifies its own preferences. The fix is a chain of custody on every eval row that names the human reviewer who admitted it, the source pool it came from, and a tag that lets you slice the eval by source. The moment the eval can be sliced by "rows from feedback pool" vs. "rows from cohort coverage targets," the conversation about whether the eval still represents the customers shifts from rhetorical to checkable.
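A chain of custody like this can be as small as a few required fields on each row. The sketch below is one possible shape; the field names, the `SourcePool` values, and the `slice_by_source` helper are hypothetical, not a schema from any particular eval framework.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class SourcePool(Enum):
    FEEDBACK = "feedback_pool"
    COVERAGE_TARGET = "cohort_coverage_target"
    CHURN_INTERVIEW = "churn_interview"
    SUPPORT_TICKET = "high_severity_ticket"


@dataclass(frozen=True)
class EvalRow:
    prompt: str
    reference_answer: str
    source: SourcePool   # which pool the row was admitted from
    cohort: str          # cohort label used for coverage accounting
    admitted_by: str     # the human reviewer who admitted the row
    admitted_on: date    # when it entered the eval set


def slice_by_source(rows):
    """Group eval rows by the pool they were admitted from."""
    slices = {}
    for row in rows:
        slices.setdefault(row.source, []).append(row)
    return slices


rows = [
    EvalRow("How do I reset my password?", "...", SourcePool.FEEDBACK,
            "first_week", "reviewer_a", date(2025, 1, 10)),
    EvalRow("Why does SSO provisioning skip nested groups?", "...",
            SourcePool.CHURN_INTERVIEW, "enterprise_admin", "reviewer_b",
            date(2025, 2, 3)),
]
for source, members in slice_by_source(rows).items():
    print(source.value, len(members))
```

Making the dataclass frozen is a deliberate choice in this sketch: provenance that can be edited after admission is not much of a chain of custody.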

Adversarial cohorts sampled from churn interviews and high-severity support tickets. The hardest examples your product faces are not coming from the users who hit thumbs-up. They are coming from the users who opened a P0 ticket and the accounts that canceled. Make a habit of converting the prompts from those interactions into eval rows, with the cohort label attached, and run them as a separate slice on every model change. This is the slice that catches the divergence early, because it samples the population whose feedback never reaches the button.

A feedback-source label on every eval row. Source provenance is the single highest-leverage piece of metadata you can attach to an eval. It lets you ask: of the rows accepted in the last quarter, how many came from the feedback pool, how many from explicit reviewer-curated additions, and how many from churn? You can monitor that distribution and alert when it drifts. You can decompose the eval score by source and discover whether the trend is moving on the feedback-sourced slice (suspicious, and likely overfit to engaged users) or on the adversarial slice (the slice you actually care about).
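Both checks, decomposing the score by source and alerting on source-mix drift, fit in a few lines once the label exists. This is a sketch under assumptions: the source labels, the example mixes, and `ALERT_THRESHOLD` are hypothetical, and total variation distance is just one reasonable drift measure among several.

```python
from collections import defaultdict

ALERT_THRESHOLD = 0.15  # hypothetical; tune against your own history


def score_by_source(results):
    """results: iterable of (source_label, passed) pairs -> pass rate per source."""
    totals = defaultdict(lambda: [0, 0])  # source -> [passed, seen]
    for source, passed in results:
        totals[source][0] += int(passed)
        totals[source][1] += 1
    return {source: p / n for source, (p, n) in totals.items()}


def mix_drift(baseline, current):
    """Total variation distance between two source mixes (0 identical, 1 disjoint)."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)


# Hypothetical source mix of rows admitted last year vs. last quarter.
baseline_mix = {"feedback": 0.4, "curated": 0.4, "churn": 0.2}
current_mix = {"feedback": 0.7, "curated": 0.2, "churn": 0.1}

drift = mix_drift(baseline_mix, current_mix)
if drift > ALERT_THRESHOLD:
    print(f"source mix drifted by {drift:.2f}; review admissions")

results = [("feedback", True), ("feedback", True), ("churn", False), ("churn", True)]
print(score_by_source(results))
```

A headline pass rate over those four example results would read 75 percent; the decomposition shows the feedback slice at 100 percent and the churn slice at 50 percent, which is the gap the headline hides.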

Eval as a continuously governed artifact, not a one-time corpus. The eval set is not built once and locked. It is a living artifact that drifts with the product's traffic, the model's behavior, and the cohorts the business depends on. Continuous eval governance means owning a quarterly review of the cohort coverage, a quarterly recalibration of the source mix, and a quarterly retirement of rows that no longer represent live traffic. Without that discipline the eval becomes a museum of last year's customers, and the model's "improvements" are measured against a population that has moved on.
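The retirement step is the easiest of the three to automate. The sketch below assumes each eval row carries an `intent` tag and that you can look up when that intent was last seen in live traffic; both the tag and the 180-day cutoff are hypothetical choices, and "no longer represents live traffic" could reasonably be defined in other ways.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=180)  # hypothetical cutoff


def partition_for_retirement(rows, last_seen_in_traffic, today):
    """Split rows into (keep, retire).

    rows: list of dicts with 'id' and 'intent' keys.
    last_seen_in_traffic: dict mapping intent -> date it last appeared in traffic.
    A row is retired if its intent was never seen or was last seen too long ago.
    """
    keep, retire = [], []
    for row in rows:
        seen = last_seen_in_traffic.get(row["intent"])
        stale = seen is None or (today - seen) > STALE_AFTER
        (retire if stale else keep).append(row)
    return keep, retire


rows = [
    {"id": 1, "intent": "sso_provisioning"},
    {"id": 2, "intent": "legacy_export_v1"},
    {"id": 3, "intent": "unknown_feature"},
]
last_seen = {
    "sso_provisioning": date(2025, 5, 20),
    "legacy_export_v1": date(2024, 6, 1),
}
keep, retire = partition_for_retirement(rows, last_seen, today=date(2025, 6, 1))
print([r["id"] for r in keep], [r["id"] for r in retire])
```

In practice the retire list is better treated as a proposal for the quarterly review than as an automatic deletion, since a rare intent from a high-revenue cohort may be exactly the row worth keeping.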

What the Dashboards Should Say That They Probably Don't

A useful diagnostic, before any of this becomes policy, is to ask whether the dashboards your team uses to talk about eval quality answer the questions cohort drift would force them to answer.

Does the weekly eval dashboard publish a cohort breakdown alongside the headline number, or is the headline the only thing the leadership review sees? If only the headline is shown, the cohort drift will move underneath it for at least the length of the lag between eval movement and NPS movement, which is often a full quarter for enterprise products.

Does the eval set's row inventory report the source pool each row was admitted from, and the date it was admitted? If it does not, you cannot reconstruct how the eval drifted. You can only describe its current state.

Does the model-change review process require an adversarial-slice score in addition to a headline score? If not, a change that moves the headline up two points while leaving the adversarial slice flat is indistinguishable from a change that moves both. The two changes have very different product consequences.
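A review gate that distinguishes those two cases can be sketched as follows. The score dictionary shape, the `min_adversarial_gain` parameter, and the verdict strings are hypothetical, and the rule shown is only one possible policy.

```python
def review_change(before, after, min_adversarial_gain=0.0):
    """Compare two eval runs, each a dict with 'headline' and 'adversarial' scores."""
    headline_delta = after["headline"] - before["headline"]
    adversarial_delta = after["adversarial"] - before["adversarial"]

    if adversarial_delta < 0:
        return "block: adversarial slice regressed"
    if headline_delta > 0 and adversarial_delta <= min_adversarial_gain:
        return "flag: headline moved, adversarial slice flat"
    return "pass"


baseline = {"headline": 0.80, "adversarial": 0.55}

# Headline up two points, adversarial slice unchanged.
print(review_change(baseline, {"headline": 0.82, "adversarial": 0.55}))
# Headline and adversarial slice both up.
print(review_change(baseline, {"headline": 0.82, "adversarial": 0.58}))
```

A flag here does not mean the change is bad. It means the headline gain came from somewhere other than the slice that samples churned and escalated customers, and the review should say so explicitly.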

Does the team that owns the eval set have any contact with the team that runs churn interviews? In many organizations these two teams have no shared artifacts, no shared review cycle, and no shared metrics. The churn interviews produce qualitative artifacts that are not in any eval format. The eval team produces quantitative artifacts that have no link back to the qualitative reasons customers left. Bridging that gap is one of the highest-leverage moves a product organization can make, because it is the bridge across which the hardest examples flow into the eval set.

The Architectural Realization

The thumbs-up button is a measurement device whose statistical properties were never specified, attached to a system whose evaluation depends on those properties being something they are not. Every implicit feedback channel has this property to some degree, and the answer is not to remove the channels. It is to acknowledge that the act of collecting feedback is itself a sampling mechanism with its own bias profile, and that the eval set is the artifact most exposed to those biases because it is closest to the loop.

A practitioner-level takeaway: the eval set is a sample, and every sample has a frame. The frame for an implicit-feedback-derived eval is "users who clicked a button after a model response," which is a frame defined by engagement, not by representativeness. The frame for a churn-derived eval is "users who left," which is closer to the population the business cares about but skewed in the opposite direction. The frame for a balanced eval is the one you build deliberately, with explicit cohort targets, source-mix discipline, and a review cadence that treats the frame as the artifact under management.

The team that ships the next round of eval improvements without owning the frame is shipping numbers that move while the customers who matter most quietly leave. The team that owns the frame can argue for a model change with a confidence that maps to product outcomes, because the eval it ran the change against is a measurement device whose biases are catalogued, monitored, and explicitly compensated for. That is the difference between an eval that ratifies the cohort that clicks and an eval that represents the cohort that pays.
