Knowledge Distillation for Production: Teaching Small Models to Do Big Model Tasks
A healthcare company ran GPT-4 on 10,000 documents per day, at an annual bill of roughly $50,000. After distilling the task into a small fine-tuned model, the bill fell to $5,000 a year, a 90% reduction. The smaller model also outperformed the frontier model by 60% on their specific task, because it had been shown thousands of examples of exactly the right behavior.
This is knowledge distillation in its modern form: you pay the frontier model API costs once to generate training data, then run a small specialized model forever. The math works because inference is cheap when you own the weights, and task-specific models beat general-purpose models on narrow tasks given enough examples.
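To make the cost asymmetry concrete, here is a back-of-the-envelope model. All prices, token counts, and throughput figures below are illustrative assumptions (not real vendor rates or numbers from the example above, other than the 10,000 documents per day); plug in your own.

```python
# Hypothetical cost comparison: per-call frontier-API inference vs. a
# one-time distillation run plus self-hosted inference. Every constant
# here is an assumption for illustration, not a quoted price.

DOCS_PER_DAY = 10_000          # workload size from the example above
TOKENS_PER_DOC = 2_000         # assumed input + output tokens per document
API_COST_PER_1K_TOKENS = 0.03  # assumed frontier-model API price
GPU_COST_PER_HOUR = 1.50       # assumed self-hosted GPU rate
DOCS_PER_GPU_HOUR = 5_000      # assumed small-model serving throughput

def annual_api_cost() -> float:
    """Cost of calling the frontier API on every document, forever."""
    daily = DOCS_PER_DAY * TOKENS_PER_DOC / 1_000 * API_COST_PER_1K_TOKENS
    return daily * 365

def annual_distilled_cost(training_examples: int = 50_000) -> float:
    """One-time synthetic-data generation cost plus a year of GPU serving."""
    one_time = training_examples * TOKENS_PER_DOC / 1_000 * API_COST_PER_1K_TOKENS
    gpu_hours_per_day = DOCS_PER_DAY / DOCS_PER_GPU_HOUR
    serving = gpu_hours_per_day * GPU_COST_PER_HOUR * 365
    return one_time + serving

print(f"Frontier API: ${annual_api_cost():,.0f}/yr")
print(f"Distilled:    ${annual_distilled_cost():,.0f}/yr")
```

The key structural point survives any reasonable choice of constants: the API cost scales with every document forever, while the distillation cost is a one-time data-generation fee plus cheap owned-weights serving.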
But "collect outputs, fine-tune, ship" is not a complete recipe. Most teams that attempt distillation hit one of three invisible walls: bad synthetic data that teaches the student wrong behaviors, no reliable signal for when the student is actually ready, or silent quality collapse in production that doesn't surface until users complain. This post covers the pipeline decisions that determine whether distillation works.
