
Distillation Is a Product Decision, Not a Research Artifact

10 min read
Tian Pan
Software Engineer

A frontier-model chat feature is roughly a thirty-cents-per-conversation product. The distilled variant of the same feature is roughly a third-of-a-cent-per-conversation product. These are not two implementations of one product. They are two products, with different free-tier economics, different acquisition costs, different markets, and different competitive moats. The team that ships the distilled version as "the same feature, cheaper" wastes the move.

Most engineering organizations still treat distillation as a research-team optimization that gets applied after a feature is "done" — a tail-end pass to wring inference cost out of something already spec'd against the frontier model. That framing is wrong by an order of magnitude. The choice of teacher, the choice of student, the eval suite the student is graded against, and the product surface the student is deployed to are product decisions. They determine which capabilities you are consenting to lose, which traffic shape you are designing for, and which price floor you are unlocking. Hand them to a research team to optimize against MMLU and you will ship a model that wins benchmarks the product does not care about.

The economics make the stakes legible. A small distilled model serving ten thousand daily queries runs in the $500–$2,000 per month range, versus $5,000–$50,000 for the equivalent frontier-API workload — roughly a 10–25x compression of the inference bill, and as much as 30x in well-tuned pipelines. Per-token pricing tells the same story from the other side: GPT-4o-mini at $0.15/$0.60 per million tokens versus Claude Opus at $15/$75 is a hundred-fold spread between the cheapest and most capable tiers. That is not an optimization. It is a different price column on the menu.
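
To make that spread concrete, a back-of-envelope sketch. The per-million-token prices are the ones quoted above; the token counts per conversation are illustrative assumptions, not measurements.

```python
# Back-of-envelope cost per conversation. Prices are the public
# per-million-token rates quoted above; token counts are assumptions.
PRICES = {
    # model: (input $/M tokens, output $/M tokens)
    "gpt-4o-mini": (0.15, 0.60),
    "claude-opus": (15.00, 75.00),
}

def cost_per_conversation(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one conversation at the given token volumes."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Assume a chat conversation averages ~2,000 input and ~1,000 output tokens.
for model in PRICES:
    print(f"{model}: ${cost_per_conversation(model, 2_000, 1_000):.4f}")
# gpt-4o-mini: $0.0009   claude-opus: $0.1050   -> a ~117x spread
```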

The Capability-Loss Eval Is a Product Spec

The most common distillation failure is silent, and it looks like this: the team picks a teacher (some frontier model), picks a student (some smaller checkpoint), trains against a generic distillation set, watches the student hit 95% of the teacher's MMLU score, ships it, and discovers two months later that the long-tail use cases that drove free-to-paid conversion are exactly the ones the student cannot do. Nobody named what was being given up, so nobody noticed it was gone.

The discipline that closes this gap is a capability-loss eval that is explicit about which behaviors you are willing to lose. Long reasoning chains. Rare-domain knowledge. Low-frequency tool selection. Multi-step planning across more than three or four hops. Code generation in languages outside your top three. Adversarial robustness against jailbreak prompts that the teacher resisted and the student does not. Each of these is a checkbox on a product spec, not a number to optimize. The honest answer for a distilled customer-support bot might be "we are losing reasoning over arbitrary documents but keeping intent classification and sentiment" — which is fine, because the product does not need the former. But you cannot make that trade unless you have written it down.
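
One way to make the trade enforceable is to encode the spec as data the eval harness reads, so every sacrificed capability is a named, signed-off decision rather than a silent one. A minimal sketch; the capability names and retention thresholds here are illustrative, not a standard schema.

```python
# A capability-loss spec as data the eval harness can enforce.
# Capability names and thresholds are illustrative, not a standard schema.
CAPABILITY_SPEC = {
    # capability:            (policy,      min retention vs. teacher)
    "intent_classification": ("protect",   0.98),
    "sentiment":             ("protect",   0.97),
    "jailbreak_robustness":  ("protect",   0.95),
    "multi_step_planning":   ("sacrifice", None),  # named loss, signed off
    "rare_domain_knowledge": ("sacrifice", None),  # named loss, signed off
}

def failed_gates(retention: dict[str, float]) -> list[str]:
    """Return protected capabilities the student failed to retain.

    `retention` maps capability -> (student score / teacher score).
    """
    return [
        cap
        for cap, (policy, floor) in CAPABILITY_SPEC.items()
        if policy == "protect" and retention.get(cap, 0.0) < floor
    ]
```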

Public benchmarks are not the right tool for this. A distilled model retaining 95% of the teacher's GLUE or SuperGLUE score is the rule, not the exception. What that retention number does not tell you is whether the 5% your model lost was concentrated in the slice of traffic that drives revenue. Application-specific eval — built from your own production logs, weighted by traffic frequency and business value — is the only way to see the loss at the resolution that matters. If your eval suite cannot distinguish between "fails on the 0.5% of queries the model also failed on before distillation" and "fails on the 0.5% of queries the model used to handle and now does not," it is not a distillation eval. It is a vibe check.
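
In code, that distinction is a join between teacher and student results on the same logged queries. A sketch, assuming each eval record carries pass/fail for both models plus a revenue weight derived from your own traffic; the record shape is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    query_id: str
    teacher_passed: bool   # did the frontier model handle this query?
    student_passed: bool   # does the distilled model handle it?
    revenue_weight: float  # business value of this slice, from your logs

def regression_report(records: list[EvalRecord]) -> dict[str, float]:
    """Separate new regressions from failures the teacher already had."""
    regressed = [r for r in records if r.teacher_passed and not r.student_passed]
    inherited = [r for r in records if not r.teacher_passed and not r.student_passed]
    total_value = sum(r.revenue_weight for r in records) or 1.0
    return {
        "regressed_queries": float(len(regressed)),
        "inherited_failures": float(len(inherited)),
        # The number that matters: the share of revenue-weighted traffic
        # the teacher handled and the student now drops.
        "regressed_value_share": sum(r.revenue_weight for r in regressed) / total_value,
    }
```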

The Distill Ladder: Variants Mapped to Product Surfaces by Economics

Once distillation is treated as a product variant, it becomes natural to keep more than one variant. A "distill ladder" is a set of progressively compressed models — frontier teacher, mid-tier student, small student, edge-deployable student — each assigned to product surfaces by economics rather than by what happens to be available.

The frontier teacher serves the workflows where a wrong answer is a paid-tier churn event: the legal-research query, the contract-redline summary, the multi-step debugging session. The mid-tier student serves the bulk of authenticated traffic, where users are signed in and a slightly weaker reasoning chain is acceptable. The small student serves anonymous free-tier traffic and the autocomplete-style features where latency dominates quality perception. The edge variant runs on the user's device for the offline-mode story or the privacy-sensitive surface where data cannot leave the perimeter.

Routing between these is its own engineering problem — a small classifier or rule-based router, not another LLM call — but the product question is which surface gets which rung. Treating that question as "the cheapest model that clears our quality floor" yields a coherent ladder. Treating it as "everything goes to the frontier model and we'll optimize later" yields a single rung and a margin problem.
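
For what that router can look like, here is a minimal rule-based sketch. The surface and rung names follow the ladder above; the three-hop escalation threshold is an illustrative assumption, not a recommendation.

```python
# A rule-based rung router. Surfaces map to rungs by economics, not by
# which model happens to be available. Names follow the ladder above;
# the escalation threshold is an illustrative assumption.
RUNG_BY_SURFACE = {
    "legal_research":     "frontier_teacher",  # wrong answer = paid-tier churn
    "authenticated_chat": "mid_student",
    "anonymous_free":     "small_student",
    "autocomplete":       "small_student",     # latency dominates quality
    "offline_mode":       "edge_student",      # data cannot leave the device
}

def route(surface: str, estimated_hops: int = 1) -> str:
    """Pick a rung for a request, escalating long multi-step work upward."""
    rung = RUNG_BY_SURFACE.get(surface, "mid_student")
    # Multi-step planning beyond a few hops is a capability the small
    # student gave up, so send those requests up a rung.
    if estimated_hops > 3 and rung == "small_student":
        rung = "mid_student"
    return rung
```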

The strategic value of the ladder is that it gives the product team optionality across price points. A free tier that runs on the small student is a customer-acquisition channel; a paid tier that runs on the frontier teacher is the upgrade. The economics that allow you to give away the bottom rung are exactly the economics that make the top rung valuable. Companies that only have the top rung are running a high-margin business with no acquisition funnel.

The Failure Mode: Winning the Easy 80% and Losing the Long Tail

The seductive failure mode is distilling against the easy majority of traffic. The team builds a teacher-student pipeline, samples a corpus of "representative" production queries, trains the student, and watches it match the teacher on most of them. They ship.

What they shipped is a model optimized for the queries that were never the problem. The 80% of traffic that any reasonable model handles correctly is not the part that drove product-market fit. The other 20% — the long-tail edge cases, the multi-step reasoning, the adversarial inputs, the domain-specific terminology, the multi-turn clarifications — is where users either become advocates or become churn. A distilled model that wins the easy queries and loses the long tail is a feature that demos beautifully and degrades quietly in production.

Two practices guard against this. The first is stratified training-data sampling — building the student's training corpus so that long-tail traffic is overrepresented relative to its production frequency, biasing the student's training signal toward the cases it would otherwise underfit. The second is stratified evaluation — slicing the eval set the same way, so that a 2% improvement on overall accuracy that came at the cost of a 15% regression on the long tail does not look like a win. Most distillation pipelines do neither by default. The team that adds them will catch a class of regressions that the team that does not will ship.
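
Both practices reduce to the same mechanism: per-stratum weights, applied once when the training corpus is built and again when results are reported. A sketch under the assumption that your log pipeline already tags each query with a stratum label such as "head" or "long_tail"; the weights are illustrative.

```python
import random

def stratified_sample(queries: list[dict], n: int,
                      shares: dict[str, float]) -> list[dict]:
    """Build a training corpus that overrepresents chosen strata.

    `queries` are log records tagged with a "stratum" key; `shares` says
    what fraction of the corpus each stratum gets, e.g. the long tail at
    40% despite being ~20% of production traffic.
    """
    corpus = []
    for stratum, share in shares.items():
        pool = [q for q in queries if q["stratum"] == stratum]
        corpus.extend(random.sample(pool, min(int(n * share), len(pool))))
    return corpus

def sliced_accuracy(results: list[dict]) -> dict[str, float]:
    """Report accuracy per stratum, so a long-tail regression cannot
    hide inside an overall-accuracy win."""
    by_stratum: dict[str, list[bool]] = {}
    for r in results:
        by_stratum.setdefault(r["stratum"], []).append(r["passed"])
    return {s: sum(v) / len(v) for s, v in by_stratum.items()}

# Overweight the long tail 2x relative to its traffic share:
# train_set = stratified_sample(logged_queries, n=50_000,
#                               shares={"head": 0.6, "long_tail": 0.4})
```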

Ownership Belongs to Product, Not Research

The teacher-student pipeline is a product spec. The teacher choice is a quality ceiling, the student choice is a cost floor, the training corpus is a definition of what the product is for, and the eval suite is a definition of what good looks like. These are decisions a product owner needs to make and own. Letting the research team make them in isolation produces a model optimized for an academic objective that does not match the product's economics.

This does not mean product managers are running gradient descent. It means the cross-functional ownership of the pipeline lives at the product level: which capabilities are protected, which can be sacrificed, which surfaces use which rung, what the price ceiling is, what the latency floor is. Research executes the training; product specifies the trade. The reverse — research specifying the trade and product accepting whatever drops out — is how you get a distilled model that looks great on the team's internal benchmarks and fails the surfaces that pay the bills.

A practical signal that ownership is in the wrong place: ask the team why they picked the student size they did. If the answer is "it was the next checkpoint down from the teacher" or "it fit on one GPU," ownership is with infrastructure. If the answer is "we needed the cost per conversation under one cent and the p95 latency under 800 milliseconds because the chat surface loses session retention above that," ownership is with product, and the team is making the right kind of decision.
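
That second answer can be checked in code. A minimal sketch of product-owned release gates using the example numbers above; the thresholds belong to this hypothetical chat surface, not to any universal standard.

```python
# Product-owned release gates, expressed as the constraints from the
# example above. The numbers are this post's illustration, not universal
# thresholds; each surface sets its own.
MAX_COST_PER_CONVERSATION = 0.01  # dollars: "under one cent"
MAX_P95_LATENCY_MS = 800          # "chat loses session retention above that"

def passes_product_gates(cost_per_conversation: float,
                         p95_latency_ms: float) -> bool:
    """A student checkpoint ships only if it clears both constraints."""
    return (cost_per_conversation <= MAX_COST_PER_CONVERSATION
            and p95_latency_ms <= MAX_P95_LATENCY_MS)
```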

Don't Price It Like the Frontier

The last and most underappreciated point: a distilled feature that is priced and positioned identically to its frontier sibling has wasted most of its leverage. The whole reason to absorb the engineering complexity of a teacher-student pipeline, capability-loss eval, distill ladder, and stratified evaluation is to unlock product surfaces that did not exist at the frontier price point. A free tier. A bulk-processing tier. A high-volume API tier with a different SLA. An on-device deployment that the frontier model could never run on.

If the team ships a distilled model and routes the same traffic through it at the same price, they have converted product-strategy leverage into a margin improvement on the existing business. That is not nothing — but it is the smallest possible payoff, and it leaves the bigger move on the table. The bigger move is using the cost floor to define new products. The 30x compression of inference cost is product-strategy capital. Spending it on margin is the equivalent of paying down a credit card with a windfall instead of investing it. Defensible, but rarely the best use of the money.

What This Looks Like in Practice

A team taking distillation seriously as a product decision treats the rollout the way they would treat any other product launch. There is a product owner accountable for the variant. There is a written capability-loss spec naming the behaviors being given up. There is a stratified eval set that overweights the traffic that drives revenue. There is a routing layer that maps surfaces to rungs of the ladder by economics, not by happenstance. There is a pricing decision that reflects the new cost floor as a new product offering, not as a quiet margin upgrade on the old one.

The team that does none of this will still ship a distilled model. It will be cheaper to run, it will perform respectably on standard benchmarks, and it will quietly underperform on the slice of traffic that mattered. The team that does all of it will end up with a portfolio of variants, each priced to a different market, each evaluated against a different bar, and each capturing a different segment of demand. The first team has a research artifact. The second has a product strategy.

Distillation is the rare engineering technique where the capability you gain is mostly economic. Treat it that way. Hand the spec to a product owner, hand the trade to a written eval, and hand the rung-to-surface mapping to a router. The team that ships distillation as a product variant is playing a different game than the team that ships it as a research afterthought, and the difference shows up on the income statement long before it shows up on the model card.
