2 posts tagged with "llm-training"

Your Synthetic Training Data Is Collapsing Toward the Mean

· 8 min read
Tian Pan
Software Engineer

You needed more training data, so you generated it. A model wrote a few thousand examples to fill the gaps in your dataset — edge cases, underrepresented intents, the long tail your real logs never covered. You spot-checked a sample. Each example looked fine: grammatical, on-topic, correctly labeled. You shipped the batch into your fine-tuning set and moved on.

Three rounds later, your model is worse at exactly the cases you generated data to cover. Not catastrophically worse, just quietly, uniformly mediocre. Rare intents that used to work some of the time now never work. The phrasing your users actually type gets misread. And nothing in your quality checks ever flagged it, because every individual example you generated really was fine.

The failure is not in any single example. It is in the distribution. Synthetic data, generated and re-generated without a reality anchor, contracts toward the mean — and the tails, which are the entire reason you reached for synthetic data, are the first thing to go.
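To make the contraction concrete, here is a minimal sketch rather than a model of any real pipeline. It assumes a hypothetical four-intent distribution and stands in for the generator with a simple low-temperature re-weighting (`sharpen`), which is an illustrative assumption about how a model over-samples its most likely outputs. The names `sharpen`, `regenerate`, `anchor_weight`, and the numbers in `real_intents` are all invented for the example.

```python
def sharpen(probs, temperature=0.8):
    """Assumed stand-in for a generator: favor already-likely outcomes.

    Raising each probability to 1/temperature (with temperature < 1)
    and renormalizing shifts mass from rare outcomes to common ones.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def regenerate(real, rounds, anchor_weight=0.0, temperature=0.8):
    """Re-generate the dataset for `rounds` rounds.

    anchor_weight is the fraction of each round's data drawn from the
    real distribution; 0.0 means purely synthetic, no reality anchor.
    """
    current = list(real)
    for _ in range(rounds):
        synthetic = sharpen(current, temperature)
        current = [
            anchor_weight * r + (1.0 - anchor_weight) * s
            for r, s in zip(real, synthetic)
        ]
    return current


def tail_mass(dist, head_size=2):
    """Share of probability held by everything outside the top intents."""
    return sum(dist[head_size:])


# Hypothetical intent frequencies: two common intents, two rare ones.
real_intents = [0.70, 0.20, 0.07, 0.03]

unanchored = regenerate(real_intents, rounds=3)
anchored = regenerate(real_intents, rounds=3, anchor_weight=0.5)

print(f"real tail mass:       {tail_mass(real_intents):.3f}")
print(f"unanchored, 3 rounds: {tail_mass(unanchored):.3f}")
print(f"anchored, 3 rounds:   {tail_mass(anchored):.3f}")
```

Under these assumptions, every round of the unanchored loop hands the rare intents a smaller share than the round before, even though each step still outputs a valid, plausible-looking distribution. Mixing real data back in each round keeps the tail from draining away, though in this toy setup it does not fully restore it.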

The Synthetic Preference Trap: How AI-Ranked RLHF Quietly Drifts Your Model Into the Teacher's Voice

· 12 min read
Tian Pan
Software Engineer

The first sign is almost always the same: your internal eval dashboard is green, reward-model scores are climbing, DPO loss is trending in the right direction, and then a customer on a Zoom call shrugs and says "it sounds like ChatGPT now." No one on the training team wants to hear that. The evals say the model is better. The annotators who shipped the last batch of preferences say the model is better. But the user is telling you the truth, and the dashboard is lying. What broke is not any single label. What broke is that your preference data is no longer yours.

This is the synthetic preference trap. Label budgets get squeezed, someone proposes using a stronger model to rank a second model's completions, the experiment ships, and for a while it looks like a free lunch. The student model learns to sound more like the teacher on every turn, and because your reward model was trained on data the teacher also influenced, your reward model cheerfully agrees. The user sees a product that reads exactly like every other product built on top of the same frontier API. The differentiation you thought you were buying with fine-tuning has been quietly distilled away.