2 posts tagged with "training-data"

The AI Code Feedback Loop: How Today's Generated Code Trains Tomorrow's Models

· 9 min read
Tian Pan
Software Engineer

About 41% of all new code merged globally in 2025 was AI-generated. Most of that code flows into production repositories that are publicly indexed, scraped, and eventually fed back into the next round of training data for AI coding tools. The implication is straightforward, but its consequences are still unfolding: AI models are increasingly trained on the outputs of prior AI models, with no structured record of which code came from where.

This is the context pollution problem. It is not hypothetical. The feedback loop is already operating at scale, the quality effects are measurable, and the failure mode is unusual enough that it can look like improvement in the short term while the underlying distribution quietly degrades.

Feedback Surfaces That Actually Train Your Model

· 10 min read
Tian Pan
Software Engineer

Most AI products ship with a thumbs-up/thumbs-down widget and call it feedback infrastructure. It isn't. What it is, in practice, is a survey that only dissatisfied or unusually conscientious users bother completing — and a survey that tells you nothing about what the correct output would have looked like.

The result is a dataset shaped not by what your users want, but by which users felt like clicking a button. That selection bias propagates into fine-tuning runs, reward models, and DPO pipelines, quietly steering your model toward the preferences of a tiny and unrepresentative minority. Implicit signals such as edit rate, retry rate, and session abandonment avoid that bias because they cover every user who touches the product. They don't require a click. They're generated by the act of using the software.
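As a rough illustration of the implicit signals named above, here is a minimal sketch that derives edit rate, retry rate, and session abandonment from a product event log. The `Event` type, the event kinds (`response`, `edit`, `retry`, `accept`), and the definition of an abandoned session are all assumptions made for this example, not part of any particular product or library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical log record; field names are assumptions for this sketch."""
    session_id: str
    kind: str  # assumed kinds: "response", "edit", "retry", "accept"


def implicit_signal_rates(events):
    """Compute edit, retry, and abandonment rates from a list of events."""
    responses = sum(1 for e in events if e.kind == "response")
    edits = sum(1 for e in events if e.kind == "edit")
    retries = sum(1 for e in events if e.kind == "retry")

    # Group event kinds by session so abandonment can be judged per session.
    by_session = defaultdict(list)
    for e in events:
        by_session[e.session_id].append(e.kind)

    # Assumption: a session is "abandoned" if the user saw at least one
    # response but never accepted any output.
    abandoned = sum(
        1
        for kinds in by_session.values()
        if "response" in kinds and "accept" not in kinds
    )
    sessions = len(by_session)

    return {
        "edit_rate": edits / responses if responses else 0.0,
        "retry_rate": retries / responses if responses else 0.0,
        "abandonment_rate": abandoned / sessions if sessions else 0.0,
    }


events = [
    Event("s1", "response"), Event("s1", "edit"), Event("s1", "accept"),
    Event("s2", "response"), Event("s2", "retry"), Event("s2", "response"),
    Event("s3", "response"), Event("s3", "accept"),
]
rates = implicit_signal_rates(events)
```

Every session contributes to these rates whether or not the user ever clicks a rating widget, which is the property the paragraph above relies on. The exact definitions would need tuning for a real product; for example, what counts as an edit or how long to wait before calling a session abandoned.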

Here's how to design feedback surfaces that produce high-fidelity training signal as a natural side effect of product use, and how to route those signals into your training pipeline.