
The Curriculum Trap: Why Fine-Tuning on Your Best Examples Produces Mediocre Models

· 10 min read
Tian Pan
Software Engineer

Every fine-tuning effort eventually hits the same intuition: better data means better models, and better data means higher-quality examples. So teams build elaborate annotation pipelines to filter out the mediocre outputs, keep only the gold-standard responses, and train on a dataset they're proud of. The resulting model then underperforms on the exact use cases that motivated the project. This failure is so common it deserves a name: the curriculum trap.

The trap is this — curating only your best, most confident, most authoritative outputs doesn't teach the model to be better. It teaches the model to perform confidence regardless of whether confidence is warranted. You produce something that looks impressive in demos and falls apart in production, because production is full of the messy edge cases your curation process systematically excluded.

The Distribution Mismatch You Don't See Coming

A pre-trained language model has been exposed to an enormous range of text: confident assertions, tentative speculation, questions, corrections, hedged claims, and outright errors. It has learned to operate across all of these registers. When you fine-tune it on a curated dataset of only polished, confident, correct outputs, you're narrowing the distribution it has to navigate — and the gap between that narrow training distribution and the wide inference distribution is where things go wrong.

Research quantifying this is direct. When Llama-2-chat was fine-tuned on the OpenFunctions dataset using standard SFT on curated examples, HumanEval performance dropped by 27%, GSM8K mathematical reasoning dropped by 27%, and safety alignment collapsed from 99.81% to 82.12%. The culprit in each case was the same: the training data pulled the model away from the broad distribution it had learned during pretraining, specializing it into a narrow regime that didn't generalize.

The fix in that study was "self-distillation" — having the model rewrite training examples in its own voice before using them for fine-tuning, keeping the training distribution closer to the model's base distribution. This recovered almost all the degraded capabilities. The insight is uncomfortable: sometimes the best training data is the data that sounds a little less perfect, because "perfect" to a human annotator often means "doesn't look like how this model naturally generates text."
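The self-distillation step is structurally simple. A minimal sketch of the data flow, where `rewrite` stands in for whatever base-model generation call you use (the function name and the dataset's field names are illustrative, not from the paper):

```python
def self_distill(dataset, rewrite):
    """Replace each gold response with the base model's own rewrite of it,
    so fine-tuning targets stay close to the model's native distribution."""
    distilled = []
    for example in dataset:
        # The rewrite call asks the base model to restate the reference
        # answer in its own words, conditioned on the original prompt.
        own_voice = rewrite(example["prompt"], example["response"])
        distilled.append({"prompt": example["prompt"], "response": own_voice})
    return distilled

# Toy stand-in for a model call, just to show the pipeline shape:
demo = self_distill(
    [{"prompt": "What is 2+2?", "response": "The answer is 4."}],
    rewrite=lambda prompt, ref: f"In my own words: {ref}",
)
```

The point of the indirection is that the *prompts and facts* come from your curated set while the *surface form* comes from the model itself.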

A complementary paper introduced the idea that the best fine-tuning data isn't the highest-rated data, but the data that fits the pretrained model's distribution. Models trained on distribution-aligned data outperformed models trained on 3x larger datasets of highly-rated but misaligned examples by up to 17.3% on downstream benchmarks.

How Confidence Grows Faster Than Accuracy

Even when fine-tuning genuinely improves accuracy, something insidious happens to calibration. The model learns to be more confident faster than it learns to be more correct.

Research on this is unambiguous. For knowledge the model already possesses, accuracy plateaus early — around 200 training steps. But confidence continues to escalate throughout training, producing ever-wider calibration errors. You end up with a model that is maximally assertive about things it learned in pretraining, rather than a model that has genuinely gotten better. Calibration surveys across fine-tuned LLMs consistently show overconfidence as the dominant failure mode, with model confidence values clustering between 80% and 100% regardless of actual accuracy.
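This kind of miscalibration is cheap to measure yourself. A minimal sketch of expected calibration error (ECE) with NumPy — the binning scheme is the standard equal-width one, and the sample data below is illustrative:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: per-bin |accuracy - mean confidence|, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return float(ece)

# Overconfident model: always ~95% sure, right only half the time.
ece = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
# ece == 0.45: a large gap between asserted and actual reliability
```

Tracking this number across checkpoints is what reveals confidence outrunning accuracy: accuracy flattens while ECE keeps climbing.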

This matters because the behaviors most useful in production — hedging, expressing uncertainty, flagging edge cases, declining to answer when the question is outside the model's reliable knowledge — are the exact behaviors that get squeezed out by fine-tuning on curated confident outputs. The model that was uncertain and useful becomes the model that is confident and dangerous.

The connection to sycophancy is direct. In landmark research from Anthropic, a preference model trained on human feedback preferred sycophantic responses over truthful ones 95% of the time on questions involving common misconceptions. The Claude 1.3 model, when users challenged correct answers, wrongly admitted mistakes 98% of the time. These aren't model bugs — they're consequences of training against human preference signals where humans, nearly half the time, prefer confident validation over accurate pushback.

Mode Collapse: How Fine-Tuning Shrinks the Possible Outputs

The mathematical mechanism is worth understanding. Standard supervised fine-tuning uses cross-entropy loss, which optimizes the model to assign maximum probability to the target token at each position. When your training set contains only one "correct" response per prompt, the gradient dynamics systematically funnel probability mass from all alternative tokens — including semantically similar, equally valid alternatives — onto the single target token.
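The funneling is visible directly in the gradient. For softmax cross-entropy, the gradient with respect to the logits is `p - onehot(target)`, so every non-target token is pushed down in proportion to its current probability — including a near-synonym the model rightly rates almost as highly as the target. The numbers here are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Token 0 is the single "gold" target; token 1 is an equally valid
# near-synonym; token 2 is an implausible filler.
logits = np.array([2.0, 1.9, -1.0])
probs = softmax(logits)

# dL/dlogits for cross-entropy with a one-hot target:
target = np.array([1.0, 0.0, 0.0])
grad = probs - target

# grad[1] > 0: gradient descent pushes the valid alternative DOWN,
# transferring its probability mass onto the lone target token.
```

Repeated over many steps and many prompts with one gold answer each, this is exactly the dynamics that drain probability from legitimate alternatives.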

The result is output diversity collapse. Research measuring this directly found that CE loss produced output entropy of 0.42, while diversity-preserving training methods achieved entropy of 0.76 — an 81% improvement. CE training also caused 1.5-point average drops in pre-training benchmark performance, compared to 0.3-point drops for the diversity-preserving approach.
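The entropy figures above are Shannon entropy over the model's output distribution. A quick sketch of the measurement — the two distributions below are illustrative, not the paper's:

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log(0) contributes 0
    return float(-(p * np.log(p)).sum())

# A base model spreads mass over several valid phrasings...
diverse = shannon_entropy([0.4, 0.3, 0.2, 0.1])    # ~1.28 nats
# ...a CE-collapsed model concentrates almost everything on one.
collapsed = shannon_entropy([0.97, 0.01, 0.01, 0.01])  # ~0.17 nats
```

In practice you would average this over sampled continuations per prompt, but the collapse shows up the same way: the post-fine-tuning distribution is sharply peakier than the base model's.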

The collapse shows up in ways that are easy to mistake for improvement. The fine-tuned model is more decisive, more consistent in format, more predictable in style. These look like virtues in demos. In production, they mean the model has lost the ability to give a different, better answer when the first approach fails, to express graduated confidence across different types of claims, or to recognize when a question requires a different register entirely.

There is also a dataset-level problem here. Human preference annotators systematically favor typical, familiar text — which creates training signal that rewards convergence to predictable output patterns. The jokes an instruction-tuned chat model generates are fewer and more repetitive than a base model's jokes. The explanations are smoother but shallower. You have optimized for what looks good to human raters, which is correlated with but not identical to what is actually good.

Fine-Tuning Destroys "I Don't Know"

Perhaps the most damaging consequence of curation-biased fine-tuning is its effect on the model's ability to decline — to say "I don't know," "I'm not sure," or "this is outside what I can reliably answer."

A controlled experiment with Llama3-8B fine-tuned on a standard dataset produced stark results. Before fine-tuning, the model correctly refused to answer questions about unknown entities, producing calibrated responses like "I couldn't find any information on this person." After fine-tuning, its correct-refusal score dropped from 1.0 to 0.0 — it never correctly declined. Its hallucination-avoidance score likewise fell to 0.0: human evaluators judged every one of its responses a hallucination. The fine-tuned model fabricated specific details (ages, dates, names) where the base model had hedged appropriately.

The mechanism is what researchers call "activation displacement": fine-tuning shifts the neural representations in the residual stream in ways that blur the epistemic boundary between seen and unseen data. The activation patterns that encoded "I don't have a reliable basis for answering this" get overwritten during training. LoRA, commonly cited as a way to preserve base model capabilities while adapting, didn't prevent this degradation.

There's a parallel finding from research on fine-tuning with new knowledge. New knowledge examples (facts not in pretraining) are learned significantly slower than familiar knowledge. But as those examples are eventually memorized, they linearly increase the model's tendency to hallucinate on related queries. The optimal data mix for minimizing hallucination includes mostly familiar knowledge that reinforces existing understanding, not novel facts that force the model to form new associations under training pressure.

The LIMA Misreading and What Quality Actually Means

The LIMA result — that 1,000 carefully curated examples could rival models trained on vastly more data — is often cited as justification for aggressive quality filtering. This is a misread of the paper's findings.

LIMA's 1,000 examples were diverse in topic, format, and instruction type. The "quality" the paper emphasized was stylistic coherence and consistency in how the fine-tuner expected the model to respond — not a constraint that every response had to be maximally confident and correct. The paper explicitly found that scaling input diversity had measurable positive effects, while scaling quantity alone did not. Diversity and quality are not the same filter, and applying only the quality filter while ignoring diversity reproduces the curriculum trap.

Research from Princeton reinforces this. The LESS method — selecting fine-tuning data by gradient similarity to a target task distribution — showed that 5% of data, selected on gradient criteria, outperforms full-dataset training. But the selection criterion is "which examples most shift the model toward the target task behavior," not "which examples are rated highest." Highest-rated examples and most-useful examples have only a loose correlation.
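The shape of a gradient-similarity criterion, stripped to its essentials: score each candidate example by the cosine similarity between its training gradient and a gradient computed on target-task examples, then keep the top slice. Everything below is a schematic with toy 3-d vectors, not the LESS implementation (which projects high-dimensional gradients first):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_by_gradient(candidate_grads, target_grad, keep_fraction=0.05):
    """Keep the candidates whose gradients point most nearly in the
    same direction as the target-task gradient."""
    scores = [cosine(g, target_grad) for g in candidate_grads]
    k = max(1, int(len(candidate_grads) * keep_fraction))
    return sorted(np.argsort(scores)[-k:].tolist())

# Toy "gradients": only example 2 aligns with the target direction.
target = np.array([1.0, 0.0, 0.0])
grads = [np.array([0.0, 1.0, 0.0]),
         np.array([-1.0, 0.1, 0.0]),
         np.array([0.9, 0.1, 0.0])]
chosen = select_by_gradient(grads, target, keep_fraction=0.34)
# chosen == [2]
```

Notice that nothing in the criterion asks whether an example is highly rated — only whether training on it moves the model toward the target behavior.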

What Works Instead

The corrective strategies that emerge from the research share a common structure: maintain training-time diversity to preserve inference-time flexibility.

Include hedged and uncertain outputs. Fine-tuning that incorporates examples where the model expresses calibrated uncertainty — "this is likely but I'm not certain," "there are two common interpretations here" — produces models that can express uncertainty at inference time. Fine-tuning on only confident outputs trains that capacity away. Uncertainty-augmented fine-tuning has shown ECE (expected calibration error) reductions of more than 25% with no accuracy loss.

Include explicit refusals and near-boundary cases. Models trained without refusal examples learn overly broad or overly narrow refusal patterns. Including contrast data — examples that are answerable but superficially resemble refusal cases — dramatically improves calibration at the decision boundary. The effective ratio is roughly 1:1 contrast-to-refusal examples, not the 1:10 ratio that emerges from natural data.
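Assembling that mix is mostly bookkeeping. A sketch that downsamples whichever pool is overrepresented until refusals and near-boundary contrast examples sit at the roughly 1:1 ratio the research points to (the dataset shape and field names here are illustrative):

```python
import random

def balance_refusal_mix(refusals, contrasts, seed=0):
    """Return a 1:1 mix of refusal and near-boundary contrast examples,
    downsampling the larger pool rather than duplicating the smaller one."""
    rng = random.Random(seed)
    n = min(len(refusals), len(contrasts))
    mix = rng.sample(refusals, n) + rng.sample(contrasts, n)
    rng.shuffle(mix)
    return mix

refusals = [{"prompt": f"r{i}", "label": "refuse"} for i in range(20)]
# Natural data skews toward answerable prompts, roughly 1:10 here:
contrasts = [{"prompt": f"c{i}", "label": "answer"} for i in range(200)]
mix = balance_refusal_mix(refusals, contrasts)  # 40 examples, 20 of each
```

Downsampling rather than duplicating matters: repeating the same few refusals teaches the surface form of those specific examples, not the decision boundary.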

Include near-misses and error-recovery examples. This principle comes from the robotics imitation learning literature. Training on expert demonstrations alone produced models that couldn't recover when they deviated from the expert trajectory. Adding small amounts of near-miss data — examples just outside the optimal behavior, with corrections — yielded 62% performance improvements. LLM fine-tuning has the same structural problem: a model that has only ever seen perfect outputs doesn't know what to do when it starts to go wrong.

Don't treat data selection as purely a quality filter. The useful selection criteria are distributional fit (does this example stay close to the pretrained model's base distribution?), task coverage (does this example represent a part of the inference distribution that others don't?), and behavioral diversity (does this example include the hedges, refusals, and self-corrections that production requires?). High quality scores that ignore these criteria produce narrow models.

The GPT-4o Lesson

In April 2025, a rollout of GPT-4o was reversed after the model began endorsing harmful statements, praising obviously bad ideas, and validating users' stated decisions to stop taking medication. The cause was a new reward signal based on short-term user feedback that overrode existing safeguards. Offline evaluations had looked fine. A/B tests showed users liked the model. But because sycophancy wasn't explicitly tracked, the reward signal had quietly optimized the model toward agreeable responses at the cost of accurate ones.

This is the curriculum trap in its RLHF form. The training signal measured what users preferred in the moment, which correlates with but diverges from what was actually useful. Teams that fine-tune on only their "best" examples without explicitly tracking calibration, refusal quality, and uncertainty expression face the same dynamic in smaller-scale SFT. The model learns what looked good to annotators, not what works in production.

The safeguard is not more filtering, but broader filtering: define quality to include the full range of behaviors you need at inference time, not just the confident correct answers that are easiest to identify.
