Knowledge Distillation Economics: When Compressing a Frontier Model Actually Pays Off
Most teams burning money on GPT-4o try the same thing first: swap to a cheaper model. GPT-4o mini is 16.7× cheaper per token; Llama 3.1 8B is self-hostable for pennies. But quality drops in ways that break production — the classification task that scored 94% on the frontier model crashes to 71% on the smaller one, or the extraction pipeline starts hallucinating fields that simply don't exist in the source document. So teams either stay on the expensive model and keep paying, or they accept degraded quality.
Knowledge distillation offers a third path: train a small model specifically to replicate the behavior of a large one on your task, not on general language understanding. Done right, you get small-model speed and cost with near-frontier accuracy. Done wrong, you inherit the teacher's confident mistakes at 10× the production volume. Understanding which outcome you get — and when the economics actually work — is what this post covers.
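The core economic question can be framed as a break-even calculation: distillation has a one-time cost (running the teacher to generate training labels, plus fine-tuning the student), which is recouped through per-token savings in production. A minimal sketch, using entirely hypothetical numbers — real prices and training costs vary by provider, task, and dataset size:

```python
# Break-even sketch for distillation economics. All figures below are
# illustrative placeholders, not actual vendor pricing.

FRONTIER_COST_PER_M_TOKENS = 2.50     # hypothetical $/1M tokens, large model
DISTILLED_COST_PER_M_TOKENS = 0.15    # hypothetical $/1M tokens, small model
ONE_TIME_DISTILLATION_COST = 5_000.0  # hypothetical: teacher labeling + fine-tuning

# Each million production tokens served by the student instead of the
# teacher saves the price difference between the two models.
savings_per_m = FRONTIER_COST_PER_M_TOKENS - DISTILLED_COST_PER_M_TOKENS

# Volume (in millions of tokens) at which cumulative savings
# repay the one-time distillation cost.
break_even_m_tokens = ONE_TIME_DISTILLATION_COST / savings_per_m

print(f"Savings per 1M tokens: ${savings_per_m:.2f}")
print(f"Break-even volume: {break_even_m_tokens:,.0f}M tokens")
```

Under these made-up assumptions, distillation pays for itself after roughly two billion production tokens; below that volume, staying on the frontier model (or a cheaper off-the-shelf model) is the rational choice.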
