Knowledge Distillation Without Fine-Tuning: Extracting Frontier Model Capabilities Into Cheaper Inference Paths
A 770-million-parameter model beating a 540-billion-parameter model at its own task sounds impossible. But that is exactly what distilled T5 models achieved against few-shot PaLM—using only 80% of the training examples, a 700x size reduction, and inference that costs a fraction of a cent per call instead of dollars. The trick wasn't a better architecture or a cleverer training recipe. It was generating labeled data from the big model and training the small one on it.
This is knowledge distillation. And you do not need to fine-tune the teacher to make it work.
Most engineers reach for fine-tuning when they need a cheaper model that still performs well on their specific task. Fine-tuning is the obvious choice: take a capable base model, update its weights on your domain data, and ship it. But fine-tuning requires two things that many teams don't have: (1) high-quality human-labeled examples at scale, and (2) a decision about which base model to fine-tune—locking you into that model's size, cost, and capability ceiling. Distillation sidesteps both problems. You use the frontier model itself to generate the training data, and you train a smaller student model on that data without ever touching the teacher's weights.
What Distillation Actually Does
The core idea is simple: run your expensive frontier model (the teacher) on a large set of inputs, collect its outputs, then train a smaller model (the student) to replicate that behavior on your specific task distribution.
This is fundamentally different from fine-tuning. Fine-tuning updates an existing model's weights using gradient descent on new data. Distillation trains a separate, smaller model from scratch (or from a pretrained checkpoint) using the teacher's outputs as supervision labels. You never backpropagate through the teacher. You never need access to its internals. If the teacher is a closed-source API—GPT-4, Claude, Gemini—you can still distill from it by treating API responses as your labeled data.
The training objective combines two signals:
- Task loss: Is the student producing the right answers?
- Knowledge transfer loss: Is the student's output distribution close to the teacher's?
The second signal is where the magic happens. Instead of training the student to simply match the correct answer, you train it to match the teacher's full probability distribution over possible outputs. This softer supervision gives the student more information per example, because the teacher's distribution encodes its uncertainty and its implicit knowledge about which answers are plausible.
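When you do have teacher logits, the combined objective is usually written in the classic Hinton-style form. A minimal sketch, assuming conventional hyperparameters (temperature `T` and mixing weight `alpha` are standard choices, not values from this article):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    # Task loss: cross-entropy against the hard label.
    task_loss = -np.log(softmax(student_logits)[label] + 1e-12)
    # Knowledge-transfer loss: KL divergence from the teacher's softened
    # distribution to the student's. Temperature > 1 exposes the teacher's
    # uncertainty over wrong-but-plausible answers.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    kl = np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)))
    # The T^2 factor keeps the soft-target term on the same gradient
    # scale as the hard-label term as temperature grows.
    return alpha * task_loss + (1 - alpha) * temperature**2 * kl
```

When student and teacher logits agree, the KL term vanishes and only the weighted task loss remains; any mismatch with the teacher's distribution adds loss even if the hard label is correct.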
In practice, for black-box API-based teachers, you typically have access only to the final output, not the full distribution, so you fall back to pure task loss. This still works, but chain-of-thought extraction dramatically improves it.
Chain-of-Thought Extraction: The Multiplier
Training a student on bare teacher answers produces mediocre results for reasoning tasks. Training a student on teacher reasoning traces—the full chain of thought leading to each answer—produces dramatically better students, often with 3-5x fewer examples needed.
The procedure: instead of prompting your teacher model with "What is the answer to X?", you prompt it with "Think step by step. What is the answer to X?" You collect the entire reasoning trace, not just the final answer. Then you train the student to produce both the reasoning trace and the final answer.
This works because reasoning tasks involve implicit intermediate steps that the teacher compresses into its output probability distribution. A student trained only on final answers has to reconstruct those steps from scratch. A student trained on reasoning traces receives the teacher's problem-solving strategy directly.
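The trace-collection step can be sketched in a few lines. The prompt template and the `Answer:` delimiter below are illustrative conventions of my own, not prescribed by any particular API:

```python
# Sketch: turning a teacher's chain-of-thought response into a student
# training pair. The "Answer:" delimiter is an assumed convention that the
# teacher prompt would need to enforce.
COT_TEMPLATE = (
    "Think step by step, then give the final answer after 'Answer:'.\n\n"
    "Question: {question}"
)

def make_training_pair(question, teacher_response):
    # Drop traces with no extractable final answer.
    if "Answer:" not in teacher_response:
        return None
    return {
        "input": question,                   # cheap prompt at inference time
        "target": teacher_response.strip(),  # full reasoning trace + answer
    }
```

The key design choice is that `target` contains the whole trace, so the student learns to emit the teacher's problem-solving strategy, not just its conclusion.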
The DeepSeek-R1 distillation suite made this concrete at scale. Using approximately 800,000 chain-of-thought samples from their frontier reasoning model, they trained a series of smaller student models ranging from 1.5B to 70B parameters. The 70B student achieved near-parity with the teacher across math, coding, and science benchmarks. The 32B student achieved results competitive with models three to four times its size.
For engineers building production systems, the practical implication is: if your task involves multi-step reasoning, do not skip the reasoning traces. Collecting bare answers and wondering why your distilled model underperforms is a common and avoidable mistake.
The Synthetic Data Pipeline
Generating the training data is the most expensive and error-prone part of distillation. The pipeline looks like this:
- Design elaborate teacher prompts that elicit high-quality outputs, including chain-of-thought instructions for reasoning tasks
- Generate 100K to 1M examples by running your teacher API on diverse, task-representative inputs
- Filter for quality using automated correctness criteria specific to your domain
- Train the student on the filtered dataset using simple, cost-optimized prompts
- Validate on held-out benchmarks before deploying
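The five steps reduce to one loop. In this sketch, `generate` and `is_correct` are placeholders for your teacher API client and your domain-specific correctness filter:

```python
import json

def build_distillation_set(inputs, generate, is_correct, out_path):
    """Generate outputs with the teacher, keep only those that pass the
    correctness filter, and write JSONL training data for the student."""
    kept = 0
    with open(out_path, "w") as f:
        for x in inputs:
            y = generate(x)           # expensive teacher call
            if not is_correct(x, y):  # automated, domain-specific check
                continue
            f.write(json.dumps({"input": x, "target": y}) + "\n")
            kept += 1
    return kept
```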
The filtering step is more important than most teams realize. Research consistently shows that quality beats quantity: a dataset of 200,000 carefully validated examples outperforms 1,000,000 unfiltered ones. The student learns from whatever patterns exist in its training data—if those patterns include teacher errors, ambiguities, or low-confidence outputs, the student faithfully learns those too.
Two common mistakes in data generation:
- Underspecifying diversity: If you generate all your training examples from the same narrow prompt distribution, the student overfits to that distribution and fails badly when inputs look slightly different at inference time.
- Using LLM-based quality filters: Filtering your training data by asking another LLM to rate quality introduces that LLM's biases. Domain-specific correctness checks—running generated SQL against a real database, checking code against unit tests, verifying math with a symbolic solver—are far more reliable.
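For code-generation data, for example, a domain-specific filter can simply execute each candidate against known test cases. A sketch, assuming the generation prompt enforces a function named `solve`:

```python
def passes_unit_tests(code, tests):
    """Run teacher-generated code in a scratch namespace and check it
    against known (input, output) pairs -- a correctness signal that no
    LLM-based judge can provide."""
    ns = {}
    try:
        exec(code, ns)  # assumes the prompt asks for a function named `solve`
        return all(ns["solve"](x) == y for x, y in tests)
    except Exception:
        return False  # syntax errors, crashes, missing function: all rejected
```

In production you would run this in a sandboxed subprocess with a timeout rather than a bare `exec`, but the filtering logic is the same.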
One finding that regularly surprises engineers: scaling from 25M to 150M training tokens sometimes shows "essentially no improvement" on reasoning tasks. Once you've covered the task distribution with high-quality examples, adding more low-quality examples doesn't help. The bottleneck shifts from data quantity to data quality.
The Decision Framework: When to Distill, When to Pay for the Big Model
Distillation makes economic sense under specific conditions. Getting those conditions wrong is expensive.
Distill when:
- You have more than 100,000 monthly API calls on a narrow, repetitive task category (routing, classification, structured extraction, simple QA)
- Your task distribution is stable—the inputs you'll see in production look like the inputs you generated training data from
- You can tolerate 1–5% accuracy loss in exchange for 5–10x cost reduction
- Latency matters: self-hosted 7B models achieve sub-100ms P99 latency that API calls cannot match
Don't distill when:
- Monthly API spend is under $5,000 (the engineering cost of building the distillation pipeline exceeds the savings)
- Your use cases are constantly novel—users asking questions the teacher hasn't been trained on, adversarial inputs, rapidly evolving domains
- Maximum accuracy is non-negotiable (medical, legal, safety-critical applications)
- Your task distribution is heterogeneous and hard to characterize
The economics are stark when the conditions are right. At 10 million monthly API calls on a narrow classification task, the cost difference between a frontier API at $15/million tokens and a self-hosted distilled model at $0.40/million tokens amounts to roughly $1.1M annually. A distillation pipeline that costs $50,000 to build pays back in about two weeks.
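The arithmetic checks out under an assumption of roughly 600 tokens per call; the article does not specify a call size, so that figure is mine:

```python
# Back-of-envelope check of the payback claim.
calls_per_month = 10_000_000
tokens_per_call = 600            # assumption; not stated in the article
frontier_rate = 15.00            # $ per million tokens
distilled_rate = 0.40            # $ per million tokens (self-hosted)

monthly_million_tokens = calls_per_month * tokens_per_call / 1_000_000
monthly_savings = monthly_million_tokens * (frontier_rate - distilled_rate)
annual_savings = 12 * monthly_savings                 # ~ $1.05M / year
payback_weeks = 50_000 / monthly_savings * (52 / 12)  # ~ 2.5 weeks
```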
When the conditions are wrong, distillation is a trap. The 7B student model that achieves 97% of teacher accuracy on your training distribution achieves 60–80% accuracy on the out-of-distribution inputs that real users inevitably generate.
The OOD Problem: What Distillation Cannot Fix
The most consistently underappreciated failure mode in distillation is out-of-distribution degradation. A distilled model doesn't learn to reason—it learns to replicate the teacher's behavior on the training distribution. When test inputs fall outside that distribution, the student's performance degrades sharply, sometimes by more than 20 percentage points.
This creates a dangerous trap in production: the distilled model looks great on your evaluation benchmark (which samples from the same distribution as your training data), ships to production, and immediately encounters user inputs that don't match those patterns.
The fix is explicit OOD evaluation before shipping. Collect examples that are deliberately different from your training data—different phrasings, different domains, edge cases, adversarial inputs—and measure how much accuracy drops. If the drop is more than your team can tolerate, distillation is the wrong tool for this task.
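The pre-ship check itself is trivial to implement; the work is in collecting a genuinely different OOD set. A minimal sketch:

```python
def ood_degradation(model, in_dist, out_dist):
    """Accuracy gap between held-out in-distribution examples and a
    deliberately different OOD set. A large gap is the signal that
    distillation is the wrong tool for this task."""
    def accuracy(examples):
        return sum(model(x) == y for x, y in examples) / len(examples)
    return accuracy(in_dist) - accuracy(out_dist)
```

Here `model` is any callable from input to prediction; the toy classifier in the example below illustrates exactly the failure mode the article describes, scoring perfectly on phrasings it has seen and collapsing on paraphrases.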
For tasks where OOD robustness matters, three approaches work better than pure distillation:
- Hybrid routing: Route simple, in-distribution queries to the distilled model and everything else to the frontier model. This cuts 40–60% of API costs without sacrificing OOD coverage.
- Distill to a larger student: A 70B student generalizes far better than a 7B student. The cost savings shrink but remain substantial (roughly 18x for a 70B student, versus 37x for the smallest students).
- Combine distillation with fine-tuning: Distill from the frontier model, then fine-tune the resulting student on domain-specific data. The distilled baseline gives the fine-tuned model a better starting point than a generic pretrained checkpoint.
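The first option, hybrid routing, reduces to a few lines once you have a confidence signal you trust; max softmax probability and OOD-detector scores are common choices, and the threshold below is purely illustrative:

```python
def route(query, student, frontier, threshold=0.9):
    """Send confident, in-distribution queries to the cheap distilled
    model; escalate everything else to the frontier API."""
    answer, confidence = student(query)  # student returns (answer, score)
    if confidence >= threshold:
        return answer, "student"         # cheap path
    return frontier(query), "frontier"   # expensive fallback
```

The threshold becomes the single tuning knob trading cost against accuracy: raise it and more traffic escalates, lower it and more of the OOD tail lands on the student.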
Three Misconceptions That Cause Production Failures
Bigger teacher, better student. This is false. Teacher accuracy rises with model size; student accuracy does not rise proportionally. A task-aligned 13B teacher that excels at your specific domain often produces better students than a generic 540B model. Evaluate multiple teachers on your actual task distribution before committing to a data generation budget.
Distillation transfers alignment. Safety properties from the teacher don't reliably transfer to the student. If your teacher model has been trained to refuse certain requests or to maintain specific tonal properties, do not assume the student inherits those properties. Audit the student explicitly.
Sequential cascading compounds quality. Training a large student from a frontier model, then training a smaller student from the large student, produces worse results than training the small student directly from the frontier model. Each transfer step loses information. Go direct.
What's Changed in 2025
Three recent developments are worth watching.
In-context distillation: Instead of training a new model, you distill the teacher's capabilities into in-context examples—carefully crafted demonstrations that, when included in a prompt, elicit behavior from a smaller base model that approximates the teacher. No training required. The tradeoff is that every inference call incurs the context length cost of those demonstrations, but for teams without ML infrastructure, this unlocks distillation-like cost savings with minimal engineering overhead.
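Mechanically, this is just prompt construction: teacher-generated demonstrations are prepended to every query, so the smaller model pays the context cost but skips training entirely. A sketch with an illustrative Q/A template of my own:

```python
def build_icl_prompt(demonstrations, query):
    """Prepend teacher-generated (question, answer) demonstrations so a
    smaller base model imitates the teacher in-context, with no training.
    Every call pays for the demonstration tokens."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demonstrations)
    return f"{shots}\n\nQ: {query}\nA:"
```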
Multi-teacher frameworks: Training students against a single teacher is increasingly giving way to committee approaches where multiple specialized teachers contribute to the training signal, weighted by the student's demonstrated proficiency with each teacher's domain. On heterogeneous task suites, multi-teacher students outperform single-teacher students by 3–5 percentage points.
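One way to realize the weighting, sketched under the assumption that per-teacher proficiency is a bookkeeping signal re-estimated as training proceeds (the specific blending rule here is my illustration, not a published recipe):

```python
import numpy as np

def committee_distribution(teacher_probs, proficiency):
    """Blend several teachers' output distributions into one training
    target, weighting each teacher by the student's demonstrated
    proficiency in that teacher's domain."""
    w = np.asarray(proficiency, dtype=float)
    w = w / w.sum()                       # normalize to a convex combination
    return w @ np.asarray(teacher_probs)  # (T,) @ (T, V) -> (V,)
```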
Reasoning model distillation: Chain-of-thought distillation from reasoning-optimized teachers (models explicitly trained to produce long, step-by-step reasoning traces) transfers reasoning capability more effectively than distillation from standard instruction-following models. The 800K reasoning traces that produced DeepSeek-R1's distilled series would have produced substantially worse students if those traces had come from a standard chat model rather than a reasoning model.
The Specialist Tradeoff
The core thing to understand about distillation is what you are building and what you are not. You are not building a general-purpose model. You are training a specialist to replicate a generalist's behavior on your specific workload.
When that's the right problem—stable, high-volume, narrow task distribution—distillation is one of the most cost-effective tools in an AI engineer's toolkit. A 770M model beating a 540B model at its intended task, at a fraction of the cost, is not magic. It's what happens when you give a smaller model exactly the right training signal for exactly the right distribution.
When you need genuine generalization—novel problems, rapidly evolving domains, adversarial users—distillation fails in predictable ways. Recognizing which situation you're in before building the pipeline is the actual engineering work.
