The Thumbs-Up Button That Poisoned Your Eval Set Through the Back Door
A thumbs-up button is the cheapest signal you will ever instrument. It is also one of the most dangerous, because nothing about it announces that it is reshaping the distribution your eval set is supposed to represent. Each click is logged as a positive, the curation pipeline reads it as a quality label, and six months later the eval set is dominated by examples chosen by a cohort that does not include the customers most likely to churn.
The failure rarely shows up as a regression. It shows up as a divergence: the weekly eval score trends up, the enterprise tier's NPS slides, and the team only diagnoses the gap when a churned account names the specific kind of question the product kept getting wrong for their team. The eval set has no examples shaped like that question. The signal you were optimizing was real. It was just measuring the wrong distribution.
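One way to catch this kind of drift before a churned account points it out is to compare each cohort's share of the eval set against its share of real traffic. The sketch below is a minimal illustration under stated assumptions, not a description of any particular pipeline: the `tier` field, the record shape, and the function names `cohort_shares` and `representation_gaps` are all hypothetical, and the sample data is made up purely to show the arithmetic.

```python
from collections import Counter


def cohort_shares(records, key="tier"):
    """Return each cohort's fraction of the given records.

    `key` is a hypothetical field naming the cohort (e.g. customer tier).
    """
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {cohort: n / total for cohort, n in counts.items()}


def representation_gaps(traffic, eval_set, key="tier"):
    """Eval-set share minus traffic share, per cohort seen in traffic.

    Negative values mean the cohort is under-represented in the eval set.
    """
    traffic_share = cohort_shares(traffic, key)
    eval_share = cohort_shares(eval_set, key)
    return {
        cohort: eval_share.get(cohort, 0.0) - share
        for cohort, share in traffic_share.items()
    }


# Illustrative data only: enterprise is 30% of traffic but 5% of the eval set.
traffic = [{"tier": "enterprise"}] * 30 + [{"tier": "free"}] * 70
eval_set = [{"tier": "enterprise"}] * 5 + [{"tier": "free"}] * 95

gaps = representation_gaps(traffic, eval_set)
for cohort, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{cohort}: {gap:+.2f}")
```

A strongly negative gap for a cohort is a prompt to look at how its examples reach the eval set, not proof of a problem on its own. Which cohort key matters (tier, use case, question type) depends on where you suspect the thumbs-up signal is skewed.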
