Your Synthetic Training Data Is Collapsing Toward the Mean
You needed more training data, so you generated it. A model wrote a few thousand examples to fill the gaps in your dataset — edge cases, underrepresented intents, the long tail your real logs never covered. You spot-checked a sample. Each example looked fine: grammatical, on-topic, correctly labeled. You shipped the batch into your fine-tuning set and moved on.
Three rounds later, your model is worse at exactly the cases you generated data to cover. Not catastrophically worse, just quietly, uniformly mediocre. The rare intents that used to work some of the time now never work. The phrasing your users actually type gets misread. And nothing in your quality checks ever flagged it, because every individual example you generated really was fine.
The failure is not in any single example. It is in the distribution. Synthetic data, generated and re-generated without a reality anchor, contracts toward the mean — and the tails, which are the entire reason you reached for synthetic data, are the first thing to go.
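To make the distribution-level claim concrete, here is a toy simulation, not a model of any real generator. It assumes a hypothetical set of 200 intents with Zipf-like frequencies, and it stands in for "the model" with nothing more than the empirical frequencies of the previous round's sample. Every name and number in it is an illustrative assumption. Each round draws a finite dataset and refits on that dataset alone, with no real data mixed back in.

```python
import random
from collections import Counter


def zipf_weights(num_intents: int) -> list[float]:
    """Toy long-tailed distribution: intent k gets weight proportional to 1/(k+1)."""
    raw = [1.0 / (k + 1) for k in range(num_intents)]
    total = sum(raw)
    return [w / total for w in raw]


def resample_generations(weights, sample_size, generations, seed=0):
    """Sample a finite dataset, refit to that sample alone, and repeat.

    Returns the number of intents with nonzero probability at the start
    and after each generation.
    """
    rng = random.Random(seed)
    intents = list(range(len(weights)))
    surviving = [sum(1 for w in weights if w > 0)]
    for _ in range(generations):
        sample = rng.choices(intents, weights=weights, k=sample_size)
        counts = Counter(sample)
        # Assumption: the next "generator" knows only what appeared in this
        # sample. An intent that drew zero examples gets probability zero
        # and can never be sampled again.
        weights = [counts[i] / sample_size for i in intents]
        surviving.append(sum(1 for w in weights if w > 0))
    return surviving


if __name__ == "__main__":
    history = resample_generations(zipf_weights(200), sample_size=500, generations=10)
    for gen, n in enumerate(history):
        print(f"generation {gen}: {n} intents still represented")
```

In this sketch the count of represented intents can only stay flat or fall, because nothing ever reintroduces an intent once it has dropped out. The intents that drop out are overwhelmingly the rare ones, while every individual sampled example remains a perfectly valid intent. Real generators are far more complex than a frequency table, so treat this as an illustration of the mechanism rather than a prediction of how fast it happens.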
