Knowledge Distillation Economics: When Compressing a Frontier Model Actually Pays Off
Most teams that reach for knowledge distillation do it for the wrong reasons and at the wrong time. They see a 70B model blowing their inference budget, read that distillation can produce a 7B student that's "just as good," and dive in immediately. Six weeks later they have a distilled model that scores well on their validation set, ships to production, and begins producing confident nonsense at scale. The validation set was drawn from the same distribution as the synthetic training data the teacher generated. Real traffic was not.
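One cheap way to catch this particular failure before launch is to score the student on a held-out slice of real production traffic alongside the synthetic validation set and compare the two numbers. A minimal sketch, assuming a hypothetical student inference function and exact-match scoring (the names, metric, and threshold below are illustrative, not from this article):

```python
# Sketch: compare student quality on teacher-generated validation data
# versus a held-out sample of real production traffic.
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (prompt, reference answer)


def eval_gap(
    predict: Callable[[str], str],     # hypothetical student inference function
    synthetic_val: Sequence[Pair],     # pairs drawn from the teacher's synthetic data
    real_traffic: Sequence[Pair],      # pairs sampled and labeled from production logs
) -> float:
    """Return the score drop when moving from synthetic validation data to real traffic."""

    def score(pairs: Sequence[Pair]) -> float:
        # Exact-match accuracy as a stand-in metric; swap in whatever
        # task metric the project actually uses.
        hits = sum(predict(prompt) == reference for prompt, reference in pairs)
        return hits / max(len(pairs), 1)

    synthetic_score = score(synthetic_val)
    real_score = score(real_traffic)
    gap = synthetic_score - real_score
    if gap > 0.05:  # illustrative threshold for "distribution mismatch worth investigating"
        print(
            f"Warning: student drops {gap:.1%} on real traffic "
            f"({synthetic_score:.1%} synthetic vs. {real_score:.1%} real)"
        )
    return gap
```

If the gap is large, the validation set is telling you about the teacher's output distribution, not about your users.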
Distillation is an optimization tool, not a capability upgrade. The economics only work under specific conditions — and the failure modes are subtle enough that teams often don't detect them until users do.
