Training Data Self-Poisoning: When Your AI Feature Corrupts Its Own Ground Truth

10 min read
Tian Pan
Software Engineer

Your recommendation model launched three months ago. Click-through rates are up 18%. Watch time is climbing. The dashboard is green. Leadership is happy.

And your model is quietly destroying the data it will use to train its next version.

This is training data self-poisoning: a feedback loop where a deployed AI feature shifts user behavior in ways that corrupt the interaction data the model was originally trained to learn from. The worst part is that your standard engagement metrics will tell you everything is fine — right up until they don't.

The phenomenon has an analogue in physics: the observer effect, popularly associated with Heisenberg. The act of measuring a system changes the system itself. In recommendation and personalization, your model isn't passively observing what users like; it's actively deciding what they see, which shapes what they click, which becomes the training signal for the next model. The measurement changes the thing being measured. You can never observe what users would have done without your model's intervention.

How the Loop Closes

The mechanism is straightforward once you see it. A model is trained on historical user interactions. It gets deployed and begins surfacing recommendations. Users interact with what they're shown — because they can only interact with what they're shown. That interaction data is collected and fed back into retraining. The next model learns from a dataset that reflects the first model's choices, not true user preferences.

Consider what this means concretely. Your model recommends "Taylor Swift" to User A and "jazz" to User B. It never shows User A any jazz. User A never clicks on jazz. The retraining dataset records this as evidence that User A doesn't like jazz. The next model becomes more confident in this "preference" it manufactured through the original recommendation decision.

This is exposure bias, and it's the foundational constraint that makes feedback loops hard to escape. The model only ever sees feedback signals for items it chose to expose. Items never shown accumulate zero signal. The training data is therefore a heavily biased sample of all possible user-item interactions — biased precisely toward whatever the previous model already believed.
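
To make that concrete in code, here's a minimal sketch of how an implicit-feedback training set gets assembled from an exposure log. The names (`Impression`, `build_training_rows`) are illustrative, not from any particular system, but the structural point holds: only exposed items produce rows at all.

```python
# Minimal sketch: how an implicit-feedback training set is assembled from an
# exposure log. Names are illustrative; the point is what never gets logged.
from dataclasses import dataclass

@dataclass
class Impression:
    user: str
    item: str      # the item the previous model chose to show
    clicked: bool  # only observable because the item was shown

# The log contains only items the previous model exposed.
impressions = [
    Impression("user_a", "taylor_swift", clicked=True),
    Impression("user_a", "taylor_swift", clicked=False),
    Impression("user_b", "jazz", clicked=True),
    # No row for ("user_a", "jazz") exists at all: user_a was never shown jazz,
    # so there is zero exposure and therefore zero signal either way.
]

def build_training_rows(log):
    """Shown + clicked -> positive, shown + not clicked -> negative.
    Never-shown pairs simply do not appear, so the next model can't learn from them."""
    return [(imp.user, imp.item, 1 if imp.clicked else 0) for imp in log]

print(build_training_rows(impressions))
# [('user_a', 'taylor_swift', 1), ('user_a', 'taylor_swift', 0), ('user_b', 'jazz', 1)]
```

Notice that the pair ("user_a", "jazz") never appears anywhere in the output. Depending on the pipeline, that absence is either treated as missing data or, worse, as an implicit negative. Either way, the next model has nothing with which to correct its predecessor's guess.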

The dynamic compounds through popularity. Popular items get shown more, accumulating more clicks, which trains the model to show them even more. New or niche items never get enough exposure to accumulate signal. The result is a "rich get richer" dynamic where the training distribution collapses toward a narrower and narrower slice of the content catalog over successive retraining cycles. Glovo's engineering team audited their ML systems and found over half of their 120+ production models had become "outcome starved" — collecting feedback almost exclusively on outcomes they had already decided to optimize for, without anyone realizing it was happening.
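
You can watch this dynamic play out in a toy simulation. The sketch below is illustrative only: the exposure policy (impressions served in proportion to the current model's scores), the click model, and the naive retrain rule are simplifying assumptions, not a description of any production system. Run it and you'll typically see exposure concentrate while catalog coverage shrinks from one retraining cycle to the next.

```python
# Toy simulation of the exposure -> click -> retrain loop (illustrative only).
import random

random.seed(7)

N_ITEMS = 50
SLOTS_PER_CYCLE = 300                                   # impressions served between retrains
true_appeal = [random.uniform(0.2, 0.8) for _ in range(N_ITEMS)]  # what users would actually click
scores = [1.0] * N_ITEMS                                # the first model is indifferent

for cycle in range(6):
    # Serve impressions in proportion to the current model's scores.
    served = random.choices(range(N_ITEMS), weights=scores, k=SLOTS_PER_CYCLE)
    shows = [0] * N_ITEMS
    clicks = [0] * N_ITEMS
    for item in served:
        shows[item] += 1
        if random.random() < true_appeal[item]:
            clicks[item] += 1

    # Naive retrain: the next model's score is the observed CTR in this cycle's log.
    # Items that were never shown accumulate zero signal, score zero, and
    # effectively drop out of the catalog going forward.
    scores = [clicks[i] / shows[i] if shows[i] else 0.0 for i in range(N_ITEMS)]

    coverage = sum(1 for s in shows if s > 0) / N_ITEMS
    top5_share = sum(sorted(shows, reverse=True)[:5]) / SLOTS_PER_CYCLE
    print(f"cycle {cycle}: coverage {coverage:.0%} of catalog shown, "
          f"top-5 items take {top5_share:.0%} of impressions")
```

The retrain rule here is deliberately crude, but the shape of the problem is the same with a real model: whatever the previous policy starved of exposure, the next policy has no evidence to rehabilitate.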

The Silent Failure Pattern

Here is why teams don't catch this early: engagement metrics are structurally incapable of detecting it.

Click-through rates measure whether users clicked on what the model showed them. They don't measure whether those items were actually the best choices, or whether users would have been more satisfied with something else. Watch time measures whether users watched what was recommended. It doesn't measure whether users explored things they genuinely wanted to discover. Conversion rates measure whether users completed a transaction. They don't measure whether those transactions represent true preference or a consequence of constrained choice.

When the model's feedback loop is running, these metrics can actually improve as quality degrades. YouTube's early algorithm, optimizing for raw clicks, created an ecosystem where clickbait thumbnails outperformed higher-quality content: not because users genuinely preferred clickbait, but because the model's own exposure decisions had manufactured and then reinforced that preference. Clicks went up while content diversity and long-term user satisfaction eroded.

The most striking evidence of this pattern comes from medical AI research published in early 2026. Models retrained on AI-generated medical notes showed apparent improvement on standard evaluation metrics, because they were being evaluated on data from the same contaminated distribution they were trained on. When tested on authentic clinical notes from outside the feedback loop, comprehension degraded 45-fold as measured by perplexity. The dashboards said the model was getting better. The model was getting worse at the actual task.

The general pattern: your offline evaluation metrics are computed on the same biased dataset your model created. They will confirm your model is improving. The offline-online gap — the divergence between offline metric improvement and real-world A/B test results — is often the first measurable signal that something is wrong, and by then the feedback loop has been running for months.
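
If you're going to rely on that signal, it's worth tracking deliberately rather than noticing it in a postmortem. Here's a hedged sketch of one way to do that; the `Release` record, the threshold, and the streak length are illustrative choices, not a prescribed methodology.

```python
# Illustrative check: compare predicted (offline) lift with measured (online) lift
# per model release and flag sustained divergence. Thresholds are example values.
from dataclasses import dataclass

@dataclass
class Release:
    name: str
    offline_lift_pct: float  # e.g. ranking-metric improvement on the logged dataset
    online_lift_pct: float   # e.g. CTR or retention lift measured in the A/B test

def offline_online_gap(releases, gap_threshold_pct=3.0, streak=3):
    """True if the last `streak` releases all show offline gains that the A/B
    tests failed to reproduce by more than `gap_threshold_pct` points."""
    recent = releases[-streak:]
    if len(recent) < streak:
        return False
    return all(r.offline_lift_pct - r.online_lift_pct > gap_threshold_pct for r in recent)

# Example data, purely illustrative.
history = [
    Release("v12", offline_lift_pct=2.1, online_lift_pct=1.8),
    Release("v13", offline_lift_pct=4.0, online_lift_pct=0.4),
    Release("v14", offline_lift_pct=3.7, online_lift_pct=-0.2),
    Release("v15", offline_lift_pct=5.2, online_lift_pct=0.1),
]

if offline_online_gap(history):
    print("Offline metrics keep improving but online results do not: "
          "investigate feedback-loop contamination in the training data.")
```

A few releases in a row where offline gains fail to show up online isn't proof of self-poisoning, but it's a strong prompt to audit how the training data is being collected.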

Detection Patterns

Several warning signals can surface feedback loop contamination before it becomes a crisis.

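One signal you can compute directly from serving logs, in line with the catalog-collapse dynamic above, is drift in the exposure distribution itself: what share of impressions the top items absorb, and how much entropy the distribution has lost relative to a baseline period. The sketch below is illustrative; the metric choices and thresholds are assumptions, not recommendations from any particular system.

```python
# Illustrative detector: is recommendation exposure concentrating over time?
# Metric choices and thresholds are examples, not a prescribed methodology.
import math
from collections import Counter

def exposure_stats(impressions, top_k=10):
    """impressions: iterable of served item ids. Returns (top-k share, entropy in bits)."""
    counts = Counter(impressions)
    total = sum(counts.values())
    top_share = sum(c for _, c in counts.most_common(top_k)) / total
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return top_share, entropy

# Example: compare a baseline week of serving logs against the current week.
baseline_week = ["item_%d" % (i % 40) for i in range(4000)]  # spread across 40 items
current_week = ["item_%d" % (i % 8) for i in range(4000)]    # concentrated on 8 items

base_share, base_entropy = exposure_stats(baseline_week)
cur_share, cur_entropy = exposure_stats(current_week)

print(f"top-10 impression share: {base_share:.0%} -> {cur_share:.0%}")
print(f"exposure entropy (bits): {base_entropy:.2f} -> {cur_entropy:.2f}")

# A sustained rise in top-k share or drop in entropy across retraining cycles
# is consistent with the catalog-collapse dynamic described above.
if cur_share - base_share > 0.15 or base_entropy - cur_entropy > 1.0:
    print("Exposure is concentrating: audit the training data for feedback-loop bias.")
```

A sustained drop in exposure entropy or rise in top-k impression share doesn't prove the model is poisoning its own training data, but it is exactly the shape of drift the feedback loop produces, and it is visible long before the offline-online gap shows up in an A/B test.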