Knowledge Distillation for Production: Teaching Small Models to Do Big Model Tasks

· 9 min read
Tian Pan
Software Engineer

A healthcare company ran GPT-4 on 10,000 documents per day. Annual bill: $50,000. After fine-tuning a 27B open-source model on frontier outputs, the same workload cost $5,000, a 90% reduction. The smaller model also outperformed the frontier model by 60% on their specific task, because it had been shown thousands of examples of exactly the right behavior.

This is knowledge distillation in its modern form: you pay the frontier model API costs once to generate training data, then run a small specialized model forever. The math works because inference is cheap when you own the weights, and task-specific models beat general-purpose models on narrow tasks given enough examples.

But "collect outputs, fine-tune, ship" is not a complete recipe. Most teams that attempt distillation hit one of three invisible walls: bad synthetic data that teaches the student wrong behaviors, no reliable signal for when the student is actually ready, or silent quality collapse in production that doesn't surface until users complain. This post covers the pipeline decisions that determine whether distillation works.

How Distillation Actually Transfers Knowledge

The original framing—train a small model to mimic a big model—undersells what's happening. The key mechanism is soft label transfer.

When a teacher model classifies a document as "billing inquiry," it doesn't output a clean 1.0 probability. It might output 0.72 for billing, 0.18 for refund request, and 0.10 for account inquiry. Those secondary probabilities encode semantic relationships the model learned: billing and refund requests are more similar to each other than either is to account management. A student trained only on hard labels (0 or 1) loses this signal entirely.

Temperature scaling controls how much of this distributional information transfers. At temperature T=1, the teacher's outputs are relatively peaked. At T=3 or T=5, the probability mass spreads across classes, making the inter-class relationships more visible to the student. Empirically, optimal temperatures fall between 2 and 5 for most classification tasks; pushing higher flattens the targets toward uniform, which weakens the supervision signal and typically requires longer training to pay off.

The training loss combines two components: KL divergence between student and teacher soft outputs (weighted by T²) plus standard cross-entropy on hard labels. The T² scaling preserves gradient magnitude as temperature increases. In practice, start with a 50/50 weight split and tune from there—if your model has access to ground truth labels, don't abandon them for soft targets alone.
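As a concrete sketch of this combined loss (pure NumPy, with illustrative logit values; a real training loop would use your framework's autograd versions of these ops):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label, T=3.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy on the hard label."""
    p_t = softmax(teacher_logits, T)  # softened teacher distribution
    p_s = softmax(student_logits, T)  # softened student distribution
    kl = float(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))))
    # Hard-label term uses the student's unsoftened (T=1) distribution
    ce = -float(np.log(softmax(student_logits)[hard_label] + 1e-12))
    return alpha * T**2 * kl + (1 - alpha) * ce
```

The `alpha=0.5` default is the 50/50 starting split from above; the T² factor is what keeps the KL term's gradients from vanishing as the temperature rises.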

For LLM-based distillation, the equivalent is token-level probability matching, but most teams skip this in favor of a simpler approach: treat frontier model outputs as ground truth labels, and fine-tune the student with standard supervised training. You lose some of the theoretical benefits but gain a much simpler pipeline, and the empirical results are strong enough that this is the dominant approach.
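In the simpler supervised approach, each training example is just a (prompt, frontier output) pair packed into the chat-messages JSONL format most fine-tuning stacks accept. A minimal sketch (field names follow the common convention, but check your provider's expected schema):

```python
import json

def to_sft_record(prompt, teacher_output):
    """Pack one (prompt, frontier output) pair into a chat-format training record."""
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": teacher_output},
    ]}

def write_jsonl(records, path):
    """Write one JSON record per line, the usual fine-tuning file format."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```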

Building the Distillation Dataset

The single most important lesson from recent distillation research: quality of examples dominates quantity. The LIMA paper demonstrated that 1,000 carefully curated instruction-following examples matched performance that required 52,000 uncurated examples in earlier work. A web agent distillation study used 2,322 filtered teacher trajectories to produce a 9B model that outperformed GPT-4o and Claude 3.5 Sonnet on agent tasks.

This means your data pipeline is as important as your model architecture. A three-stage filter is the minimum:

Stage 1: Generation

Design your prompts to elicit the behavior you want to transfer, not just any behavior. For reasoning tasks, use chain-of-thought prompts and keep the reasoning traces—they're part of what you're distilling. For classification, vary the examples to cover your real input distribution. Don't generate from a narrow prompt template; the student will overfit to the template's artifacts.

Stage 2: Automated filtering

Run the generated outputs through:

  • A secondary AI grader (a different model, or the same model with a grading prompt) to flag incorrect answers
  • Heuristic rules to remove self-contradictions, truncated outputs, and refusals
  • Entropy-based filtering to remove outputs where the teacher was highly uncertain—these are unreliable supervision signals
  • Deduplication to avoid the student memorizing repeated phrasing
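A minimal sketch of Stage 2, assuming each example carries the teacher's output text and class probabilities (the field names, refusal markers, and entropy threshold are all illustrative and should be tuned to your task):

```python
import hashlib
import math

REFUSAL_MARKERS = ("i can't", "i cannot", "as an ai")  # illustrative, not exhaustive

def passes_heuristics(text):
    """Drop empty, likely-truncated, and refusal outputs."""
    t = text.strip().lower()
    if not t or not t.endswith((".", "!", "?", '"', "```")):
        return False  # no terminal punctuation suggests truncation
    return not any(m in t for m in REFUSAL_MARKERS)

def teacher_entropy(probs):
    """Shannon entropy of the teacher's class distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def dedup_key(text):
    """Hash of whitespace-normalized text, for exact-duplicate removal."""
    return hashlib.sha1(" ".join(text.lower().split()).encode()).hexdigest()

def filter_examples(examples, max_entropy=1.0):
    seen, kept = set(), []
    for ex in examples:
        if not passes_heuristics(ex["output"]):
            continue
        if teacher_entropy(ex["probs"]) > max_entropy:
            continue  # teacher too uncertain to be a reliable label
        key = dedup_key(ex["output"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept
```

The secondary AI-grader step sits in front of this: it needs a model call, so it is omitted here, but its verdict would simply be another boolean check in the loop.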

Stage 3: Manual spot-check

Sample 5–10% of your filtered dataset and review it. The automated filters will miss systematic errors—for example, a teacher model that's confidently wrong on a specific class of inputs. If you find a pattern of errors in your spot-check, you need a targeted filter for that pattern before training.

The final dataset should be a mix of synthetic frontier outputs (roughly 70%) and real examples from your task distribution (30%). The real examples anchor the student to actual user inputs; the synthetic examples provide coverage. Training on synthetic data alone produces models that perform well on frontier-generated test sets and degrade on real-world inputs.
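Assembling that mix can be as simple as stratified sampling (a sketch; the 70/30 split is the ratio above, and `seed` just makes the draw reproducible):

```python
import random

def build_training_mix(synthetic, real, n_total, synth_frac=0.7, seed=0):
    """Sample a fixed-size training set: ~70% synthetic frontier outputs, ~30% real examples."""
    rng = random.Random(seed)
    n_synth = min(int(n_total * synth_frac), len(synthetic))
    n_real = min(n_total - n_synth, len(real))
    mix = rng.sample(synthetic, n_synth) + rng.sample(real, n_real)
    rng.shuffle(mix)  # interleave so batches aren't all-synthetic or all-real
    return mix
```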

Detecting Quality Collapse Before Users Do

Knowledge distillation introduces failure modes that don't exist in standard supervised training. Three are common enough to warrant dedicated monitoring:

Capacity gap failure: The student is too small to represent what the teacher learned. This manifests as training loss that plateaus early while validation accuracy stays 10–15 points below the teacher. The fix is usually a larger student or progressive distillation—training an intermediate-size model first, then distilling that to your target size.

Confidence miscalibration: The student produces outputs that match the teacher's top-1 predictions but has miscalibrated confidence. This is dangerous in production because downstream systems often use confidence scores for routing decisions (e.g., "if confidence < 0.8, escalate to human review"). A student that outputs 0.95 confidence on cases where the teacher was uncertain will route everything to the automated path. Measure Expected Calibration Error (ECE) explicitly during evaluation—accuracy alone won't catch this.
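ECE can be computed with a simple binning sketch: bucket predictions by confidence, then take the size-weighted gap between each bin's accuracy and its average confidence (10 equal-width bins here; other binning schemes exist):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average |accuracy - mean confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return ece
```

A student that says 0.95 while being right half the time shows up immediately as a large ECE, even if its top-1 accuracy looks fine.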

Distribution drift: The student was trained on frontier outputs from your data distribution at time T. When your users' inputs shift, the student's performance degrades while the teacher (which you can still query on-demand) would adapt through the next distillation cycle. Track accuracy on a reference holdout set continuously. If your reference set accuracy drops more than 5 percentage points from baseline, trigger a retraining run.

Entropy collapse is a more subtle failure documented in recent research: the model loses the ability to produce diverse outputs across a range of inputs, not just accuracy degradation on specific cases. You can detect it by tracking the entropy of your model's output distribution across a held-out evaluation set over time. If entropy is monotonically decreasing across multiple evaluation cycles without a corresponding accuracy improvement, the model is converging to a narrow behavioral repertoire.
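The entropy-tracking check described above can be sketched as follows (the three-cycle window and accuracy-gain threshold are illustrative choices, not values from the research):

```python
import math

def mean_output_entropy(prob_rows):
    """Average Shannon entropy (nats) of the model's output distributions on an eval set."""
    def h(p):
        return -sum(x * math.log(x) for x in p if x > 0)
    return sum(h(p) for p in prob_rows) / len(prob_rows)

def entropy_collapsing(entropy_history, accuracy_history, min_cycles=3, acc_gain=0.01):
    """Flag monotone entropy decline over recent eval cycles with no matching accuracy gain."""
    if len(entropy_history) < min_cycles:
        return False
    recent_e = entropy_history[-min_cycles:]
    recent_a = accuracy_history[-min_cycles:]
    decreasing = all(b < a for a, b in zip(recent_e, recent_e[1:]))
    improving = recent_a[-1] - recent_a[0] >= acc_gain
    return decreasing and not improving
```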

When Is the Distilled Model Ready for Production?

The question teams get wrong: "Does the student match the teacher's accuracy?" The right question: "Does the student meet the production bar across the dimensions that matter for this task?"

Those dimensions vary, but a minimum viable evaluation covers:

Accuracy on in-distribution data: Target 95%+ of teacher performance on your standard benchmark. If your task has no existing benchmark, you need to build one from labeled real-world examples before starting distillation—not from synthetic data.

Accuracy on tail cases: Pull 200–500 examples representing your hardest inputs—rare entities, unusual formatting, edge-case semantics. Evaluate teacher and student on this set separately. Students frequently match teacher performance on typical inputs while degrading 15–20 points on the tail. If your tail cases are high-stakes, this gap matters more than aggregate accuracy.

Calibration: ECE below 5% is a reasonable production target. Miscalibrated students erode trust in the confidence signals that downstream systems rely on.

Latency and cost: These are often why you're distilling in the first place. Measure P50 and P95 latency under realistic load before declaring success—batch inference benchmarks are misleading for production request patterns.

Robustness: Test on a small set of adversarial rephrases and out-of-domain inputs. You don't need exhaustive adversarial testing, but you do need to know whether your student is brittle to input variations the teacher handles gracefully.

The final deployment protocol should be shadow mode first: route all production traffic to both models, serve only the champion (teacher or previous student), and log both outputs for comparison. After two weeks of shadow data, compare outputs on the disagreements—cases where student and teacher differ. If the student is wrong on the disagreements more than 30–40% of the time, it's not ready. If it's wrong less than that, run a canary deployment at 10% traffic and monitor business metrics.
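The shadow-mode readiness check above reduces to a small computation over the logged disagreements (a sketch; `labels` here stands for adjudicated ground truth on the disagreement cases, and the 0.35 threshold is just the midpoint of the 30–40% range):

```python
def disagreement_error_rate(student_outputs, teacher_outputs, labels):
    """Among cases where student and teacher disagree, how often is the student wrong?"""
    wrong = total = 0
    for s, t, y in zip(student_outputs, teacher_outputs, labels):
        if s != t:            # only disagreements are informative
            total += 1
            wrong += int(s != y)
    return wrong / total if total else 0.0

def student_ready(rate, threshold=0.35):
    """Readiness gate on the disagreement error rate."""
    return rate < threshold
```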

The Distillation Cycle

Distillation is not a one-time project. The pipeline that makes it work in production is a cycle:

  1. Identify a task where you're spending disproportionate inference cost or where a small specialized model could outperform a general one
  2. Generate and filter frontier model examples for that task
  3. Train and evaluate the student against your production criteria
  4. Shadow-deploy and validate on real traffic before promoting
  5. Monitor calibration, accuracy, and entropy drift in production
  6. Retrigger distillation when your distribution shifts enough that the student's performance degrades

The teams that get the most leverage from distillation treat it as infrastructure, not a research project. The dataset curation tooling, the evaluation harnesses, and the deployment machinery are reusable across tasks. Once those are in place, distilling a new task looks like a few days of data generation and a fine-tuning run, not a multi-month engineering effort.

The cost math only improves with scale. If you're paying frontier model prices on any task where you have stable, definable success criteria and enough volume to justify a distillation run, you're leaving significant savings on the table.
