The Synthetic Training Examples Whose Input Distribution Did Not Match What Your Users Actually Typed
A team fine-tunes a customer-support model on 80,000 synthetic examples. The teacher prompt was tasteful: "Generate realistic customer questions about returns, refunds, and shipping." The teacher complied. It produced clean, full-sentence, well-spelled queries with one intent per message, polite framing, and a consistent register. The offline eval against the held-out synthetic split lands at 94%. The team ships.
The production slice underperforms the offline eval by twenty points. The team spends a sprint debating whether the model is "bad at customer support." It isn't. The model is fine at customer support. It is bad at the language a stressed customer actually types at 11pm on a phone keyboard: "hi i returnd the thing last week but where's my refund also do u ship to canada now." The model never saw an input shaped like that during training, because the teacher was busy generating the queries it imagined, not the queries users send.
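One way to make this gap visible before shipping is to compare crude surface statistics of the synthetic inputs against a sample of real traffic. The sketch below is a minimal illustration, not a complete audit: the three synthetic queries are invented for the example, the production query is the one quoted above, and the helper names (`fraction`, `surface_stats`) and the choice of features are assumptions.

```python
import re

# Hypothetical samples: the synthetic queries are invented for illustration;
# the production query is the example quoted in the text.
synthetic = [
    "How do I return an item I purchased last week?",
    "When will my refund be processed?",
    "Do you offer shipping to Canada?",
]
production = [
    "hi i returnd the thing last week but where's my refund also do u ship to canada now",
]


def fraction(predicate, queries):
    """Share of queries for which predicate(query) is true."""
    return sum(1 for q in queries if predicate(q)) / len(queries)


def surface_stats(queries):
    """Crude surface features of a batch; a real audit would track many more."""
    return {
        "mean_tokens": sum(len(q.split()) for q in queries) / len(queries),
        "starts_capitalized": fraction(lambda q: q[:1].isupper(), queries),
        "ends_with_punctuation": fraction(
            lambda q: q.rstrip().endswith((".", "?", "!")), queries
        ),
        # "also" is a rough proxy for a second intent packed into one message.
        "mentions_also": fraction(
            lambda q: re.search(r"\balso\b", q.lower()) is not None, queries
        ),
    }


synthetic_stats = surface_stats(synthetic)
production_stats = surface_stats(production)
for name in synthetic_stats:
    print(
        f"{name}: synthetic={synthetic_stats[name]:.2f} "
        f"production={production_stats[name]:.2f}"
    )
```

If features like these diverge sharply between the two sets, a high score on the held-out synthetic split says little about production, because that split shares the teacher's register rather than the users'.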
