
The Curriculum Trap: Why Fine-Tuning on Your Best Examples Produces Mediocre Models

· 10 min read
Tian Pan
Software Engineer

Every fine-tuning effort eventually hits the same intuition: better data means better models, and better data means higher-quality examples. So teams build elaborate annotation pipelines to filter out the mediocre outputs, keep only the gold-standard responses, and train on a dataset they're proud of. The resulting model then underperforms on the exact use cases that motivated the project. This failure is so common it deserves a name: the curriculum trap.

The trap is this — curating only your best, most confident, most authoritative outputs doesn't teach the model to be better. It teaches the model to perform confidence regardless of whether confidence is warranted. You produce something that looks impressive in demos and falls apart in production, because production is full of the messy edge cases your curation process systematically excluded.

The Distribution Mismatch You Don't See Coming

A pre-trained language model has been exposed to an enormous range of text: confident assertions, tentative speculation, questions, corrections, hedged claims, and outright errors. It has learned to operate across all of these registers. When you fine-tune it on a curated dataset of only polished, confident, correct outputs, you're narrowing the distribution it has to navigate — and the gap between that narrow training distribution and the wide inference distribution is where things go wrong.

Research quantifying this is direct. When Llama-2-chat was fine-tuned on the OpenFunctions dataset using standard SFT on curated examples, HumanEval performance dropped by 27%, GSM8K mathematical reasoning dropped by 27%, and safety alignment collapsed from 99.81% to 82.12%. The culprit in each case was the same: the training data pulled the model away from the broad distribution it had learned during pretraining, specializing it into a narrow regime that didn't generalize.

The fix in that study was "self-distillation" — having the model rewrite training examples in its own voice before using them for fine-tuning, keeping the training distribution closer to the model's base distribution. This recovered almost all the degraded capabilities. The insight is uncomfortable: sometimes the best training data is the data that sounds a little less perfect, because "perfect" to a human annotator often means "doesn't look like how this model naturally generates text."
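The self-distillation step is structurally simple. A minimal sketch of the data flow, where `rewrite` stands in for whatever base-model generation call you use (the function name and the dataset's field names are illustrative, not from the paper):

```python
def self_distill(dataset, rewrite):
    """Replace each gold response with the base model's own rewrite of it,
    so fine-tuning targets stay close to the model's native distribution."""
    distilled = []
    for example in dataset:
        # The rewrite call asks the base model to restate the reference
        # answer in its own words, conditioned on the original prompt.
        own_voice = rewrite(example["prompt"], example["response"])
        distilled.append({"prompt": example["prompt"], "response": own_voice})
    return distilled

# Toy stand-in for a model call, just to show the pipeline shape:
demo = self_distill(
    [{"prompt": "What is 2+2?", "response": "The answer is 4."}],
    rewrite=lambda prompt, ref: f"In my own words: {ref}",
)
```

The point of the indirection is that the *prompts and facts* come from your curated set while the *surface form* comes from the model itself.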

A complementary paper introduced the idea that the best fine-tuning data isn't the highest-rated data, but the data that fits the pretrained model's distribution. Models trained on distribution-aligned data outperformed models trained on 3x larger datasets of highly-rated but misaligned examples by up to 17.3% on downstream benchmarks.

How Confidence Grows Faster Than Accuracy

Even when fine-tuning genuinely improves accuracy, something insidious happens to calibration. The model learns to be more confident faster than it learns to be more correct.

Research on this is unambiguous. For knowledge the model already possesses, accuracy plateaus early — around 200 training steps. But confidence continues to escalate throughout training, producing ever-wider calibration errors. You end up with a model that is maximally assertive about things it learned in pretraining, rather than a model that has genuinely gotten better. Calibration surveys across fine-tuned LLMs consistently show overconfidence as the dominant failure mode, with model confidence values clustering between 80% and 100% regardless of actual accuracy.
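This kind of miscalibration is cheap to measure yourself. A minimal sketch of expected calibration error (ECE) with NumPy — the binning scheme is the standard equal-width one, and the sample data below is illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: per-bin |accuracy - mean confidence|, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return float(ece)

# Overconfident model: always ~95% sure, right only half the time.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
# ece == 0.45: a large gap between asserted and actual reliability
```

Tracking this number across checkpoints is what reveals confidence outrunning accuracy: accuracy flattens while ECE keeps climbing.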

This matters because the behaviors most useful in production — hedging, expressing uncertainty, flagging edge cases, declining to answer when the question is outside the model's reliable knowledge — are the exact behaviors that get squeezed out by fine-tuning on curated confident outputs. The model that was uncertain and useful becomes the model that is confident and dangerous.

The connection to sycophancy is direct. In landmark research from Anthropic, a preference model trained on human feedback preferred sycophantic responses over truthful ones 95% of the time on questions involving common misconceptions. The Claude 1.3 model, when users challenged correct answers, wrongly admitted mistakes 98% of the time. These aren't model bugs — they're consequences of training against human preference signals where humans, nearly half the time, prefer confident validation over accurate pushback.

Mode Collapse: How Fine-Tuning Shrinks the Possible Outputs

The mathematical mechanism is worth understanding. Standard supervised fine-tuning uses cross-entropy loss, which optimizes the model to assign maximum probability to the target token at each position. When your training set contains only one "correct" response per prompt, the gradient dynamics systematically funnel probability mass from all alternative tokens — including semantically similar, equally valid alternatives — onto the single target token.
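The funneling is visible directly in the gradient. For softmax cross-entropy, the gradient with respect to the logits is `p - onehot(target)`, so every non-target token is pushed down in proportion to its current probability — including a near-synonym the model rightly rates almost as highly as the target. The numbers here are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Token 0 is the single "gold" target; token 1 is an equally valid
# near-synonym; token 2 is an implausible filler.
logits = np.array([2.0, 1.9, -1.0])
probs = softmax(logits)

# dL/dlogits for cross-entropy with a one-hot target:
target = np.array([1.0, 0.0, 0.0])
grad = probs - target

# grad[1] > 0: gradient descent pushes the valid alternative DOWN,
# transferring its probability mass onto the lone target token.
```

Repeated over many steps and many prompts with one gold answer each, this is exactly the dynamics that drain probability from legitimate alternatives.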

The result is output diversity collapse. Research measuring this directly found that CE loss produced output entropy of 0.42, while diversity-preserving training methods achieved entropy of 0.76 — an 81% improvement. CE training also caused 1.5-point average drops in pre-training benchmark performance, compared to 0.3-point drops for the diversity-preserving approach.
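The entropy figures above are Shannon entropy over the model's output distribution. A quick sketch of the measurement — the two distributions below are illustrative, not the paper's:

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log(0) contributes 0
    return float(-(p * np.log(p)).sum())

# A base model spreads mass over several valid phrasings...
diverse = shannon_entropy([0.4, 0.3, 0.2, 0.1])    # ~1.28 nats
# ...a CE-collapsed model concentrates almost everything on one.
collapsed = shannon_entropy([0.97, 0.01, 0.01, 0.01])  # ~0.17 nats
```

In practice you would average this over sampled continuations per prompt, but the collapse shows up the same way: the post-fine-tuning distribution is sharply peakier than the base model's.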

The collapse shows up in ways that are easy to mistake for improvement. The fine-tuned model is more decisive, more consistent in format, more predictable in style. These look like virtues in demos. In production, they mean the model has lost the ability to give a different, better answer when the first approach fails, to express graduated confidence across different types of claims, or to recognize when a question requires a different register entirely.

There is also a dataset-level problem here. Human preference annotators systematically favor typical, familiar text — which creates training signal that rewards convergence to predictable output patterns. The jokes an instruction-tuned chat model generates are fewer and more repetitive than a base model's jokes. The explanations are smoother but shallower. You have optimized for what looks good to human raters, which is correlated with but not identical to what is actually good.

Fine-Tuning Destroys "I Don't Know"

Perhaps the most damaging consequence of curation-biased fine-tuning is its effect on the model's ability to decline — to say "I don't know," "I'm not sure," or "this is outside what I can reliably answer."

A controlled experiment with Llama3-8B fine-tuned on a standard dataset produced stark results. Before fine-tuning, the model correctly refused to answer questions about unknown entities, producing calibrated responses like "I couldn't find any information on this person." After fine-tuning, its correct-refusal score dropped from 1.0 to 0.0 — it never correctly declined. Its hallucination-avoidance score likewise fell to 0.0: human evaluators judged every one of its responses a hallucination. The fine-tuned model fabricated specific details (ages, dates, names) where the base model had hedged appropriately.

The mechanism is what researchers call "activation displacement": fine-tuning shifts the neural representations in the residual stream in ways that blur the epistemic boundary between seen and unseen data. The activation patterns that encoded "I don't have a reliable basis for answering this" get overwritten during training. LoRA, commonly cited as a way to preserve base model capabilities while adapting, didn't prevent this degradation.

There's a parallel finding from research on fine-tuning with new knowledge. New knowledge examples (facts not in pretraining) are learned significantly slower than familiar knowledge. But as those examples are eventually memorized, they linearly increase the model's tendency to hallucinate on related queries. The optimal data mix for minimizing hallucination includes mostly familiar knowledge that reinforces existing understanding, not novel facts that force the model to form new associations under training pressure.

The LIMA Misreading and What Quality Actually Means

The LIMA result — that 1,000 carefully curated examples could rival models trained on vastly more data — is often cited as justification for aggressive quality filtering. This is a misread of the paper's findings.

LIMA's 1,000 examples were diverse in topic, format, and instruction type. The "quality" the paper emphasized was stylistic coherence and consistency in how the fine-tuner expected the model to respond — not a constraint that every response had to be maximally confident and correct. The paper explicitly found that scaling input diversity had measurable positive effects, while scaling quantity alone did not. Diversity and quality are not the same filter, and applying only the quality filter while ignoring diversity reproduces the curriculum trap.

Research from Princeton reinforces this. The LESS method — selecting fine-tuning data by gradient similarity to a target task distribution — showed that 5% of data, selected on gradient criteria, outperforms full-dataset training. But the selection criterion is "which examples most shift the model toward the target task behavior," not "which examples are rated highest." Highest-rated examples and most-useful examples have only a loose correlation.
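The shape of a gradient-similarity criterion, stripped to its essentials: score each candidate example by the cosine similarity between its training gradient and a gradient computed on target-task examples, then keep the top slice. Everything below is a schematic with toy 3-d vectors, not the LESS implementation (which projects high-dimensional gradients first):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_by_gradient(candidate_grads, target_grad, keep_fraction=0.05):
    """Keep the candidates whose gradients point most nearly in the
    same direction as the target-task gradient."""
    scores = [cosine(g, target_grad) for g in candidate_grads]
    k = max(1, int(len(candidate_grads) * keep_fraction))
    return sorted(np.argsort(scores)[-k:].tolist())

# Toy "gradients": only example 2 aligns with the target direction.
target = np.array([1.0, 0.0, 0.0])
grads = [np.array([0.0, 1.0, 0.0]),
         np.array([-1.0, 0.1, 0.0]),
         np.array([0.9, 0.1, 0.0])]
chosen = select_by_gradient(grads, target, keep_fraction=0.34)
# chosen == [2]
```

Notice that nothing in the criterion asks whether an example is highly rated — only whether training on it moves the model toward the target behavior.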

What Works Instead

The corrective strategies that emerge from the research share a common structure: maintain training-time diversity to preserve inference-time flexibility.

Include hedged and uncertain outputs. Fine-tuning that incorporates examples where the model expresses calibrated uncertainty — "this is likely but I'm not certain," "there are two common interpretations here" — produces models that can express uncertainty at inference time. Fine-tuning on only confident outputs trains that capacity away. Uncertainty-augmented fine-tuning has shown ECE (expected calibration error) reductions of more than 25% with no accuracy loss.

Include explicit refusals and near-boundary cases. Models trained without refusal examples learn overly broad or overly narrow refusal patterns. Including contrast data — examples that are answerable but superficially resemble refusal cases — dramatically improves calibration at the decision boundary. The effective ratio is roughly 1:1 contrast-to-refusal examples, not the 1:10 ratio that emerges from natural data.
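Assembling that mix is mostly bookkeeping. A sketch that downsamples whichever pool is overrepresented until refusals and near-boundary contrast examples sit at the roughly 1:1 ratio the research points to (the dataset shape and field names here are illustrative):

```python
import random

def balance_refusal_mix(refusals, contrasts, seed=0):
    """Return a 1:1 mix of refusal and near-boundary contrast examples,
    downsampling the larger pool rather than duplicating the smaller one."""
    rng = random.Random(seed)
    n = min(len(refusals), len(contrasts))
    mix = rng.sample(refusals, n) + rng.sample(contrasts, n)
    rng.shuffle(mix)
    return mix

refusals = [{"prompt": f"r{i}", "label": "refuse"} for i in range(20)]
# Natural data skews toward answerable prompts, roughly 1:10 here:
contrasts = [{"prompt": f"c{i}", "label": "answer"} for i in range(200)]
mix = balance_refusal_mix(refusals, contrasts)  # 40 examples, 20 of each
```

Downsampling rather than duplicating matters: repeating the same few refusals teaches the surface form of those specific examples, not the decision boundary.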

Include near-misses and error-recovery examples. This principle comes from the robotics imitation learning literature. Training on expert demonstrations alone produced models that couldn't recover when they deviated from the expert trajectory. Adding small amounts of near-miss data — examples just outside the optimal behavior, with corrections — yielded 62% performance improvements. LLM fine-tuning has the same structural problem: a model that has only ever seen perfect outputs doesn't know what to do when it starts to go wrong.

Don't treat data selection as purely a quality filter. The useful selection criteria are distributional fit (does this example stay close to the pretrained model's base distribution?), task coverage (does this example represent a part of the inference distribution that others don't?), and behavioral diversity (does this example include the hedges, refusals, and self-corrections that production requires?). High quality scores that ignore these criteria produce narrow models.

The GPT-4o Lesson

In April 2025, a rollout of GPT-4o was reversed after the model began endorsing harmful statements, praising obviously bad ideas, and validating users' stated decisions to stop taking medication. The cause was a new reward signal based on short-term user feedback that overrode existing safeguards. Offline evaluations had looked fine. A/B tests showed users liked the model. But because sycophancy wasn't explicitly tracked, the reward signal had quietly optimized the model toward agreeable responses at the cost of accurate ones.

This is the curriculum trap in its RLHF form. The training signal measured what users preferred in the moment, which correlates with but diverges from what was actually useful. Teams that fine-tune on only their "best" examples without explicitly tracking calibration, refusal quality, and uncertainty expression face the same dynamic in smaller-scale SFT. The model learns what looked good to annotators, not what works in production.

The safeguard is not more filtering, but broader filtering: define quality to include the full range of behaviors you need at inference time, not just the confident correct answers that are easiest to identify.
