The Pretraining Shadow: The Hidden Constraint Your Fine-Tuning Plan Ignores
Your team spent three sprints labeling 50,000 domain-specific examples. You ran LoRA fine-tuning on a frontier model. The eval numbers improved. Then a colleague changed the phrasing of a prompt slightly, and the model reverted to the behavior you thought you'd suppressed. That's not a dataset problem. That's the pretraining shadow.
The core insight that practitioners keep rediscovering: fine-tuning teaches a model how to talk in a new context, but it cannot rewrite what the model fundamentally knows or is inclined to do. The behaviors, biases, and factual priors encoded during pretraining are a gravitational field that fine-tuning orbits but rarely escapes.
What the Superficial Alignment Hypothesis Gets Right (and Wrong)
The Superficial Alignment Hypothesis, proposed alongside the LIMA work and since popularized in the alignment research community, states that a model's knowledge and capabilities are determined almost entirely during pretraining, while alignment and fine-tuning only teach the model which output format to use. In this framing, fine-tuning is a filter, not a transformer.
Empirical work has complicated this picture. Reasoning capability does continue to improve during post-training, following a power-law scaling relationship with the number of fine-tuning examples, so the SAH isn't strictly true. But it captures something important: style and output-format improvements saturate very quickly (around 100 examples), and there is a substantial gap between what fine-tuning appears to improve, as measured by LLM-judge win rates, and what it actually improves in factual accuracy and reasoning depth.
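To make the power-law claim concrete, the sketch below fits gain = a · n^b to eval results via ordinary least squares in log-log space. The numbers are invented for illustration; only the fitting technique is the point.

```python
import numpy as np

# Hypothetical eval results: accuracy gain over the base model as a
# function of fine-tuning set size (illustrative numbers, not from any paper).
n_examples = np.array([100, 500, 1_000, 5_000, 10_000, 50_000])
gain = np.array([0.020, 0.041, 0.052, 0.085, 0.101, 0.158])

# A power law gain = a * n^b is linear in log-log space:
# log(gain) = log(a) + b * log(n), so a least-squares fit
# on the logs recovers the exponent b.
b, log_a = np.polyfit(np.log(n_examples), np.log(gain), deg=1)
a = np.exp(log_a)

print(f"fitted exponent b = {b:.3f}")  # sublinear: 0 < b < 1
print(f"extrapolated gain at 100k examples: {a * 100_000**b:.3f}")
```

An exponent well below 1 is exactly the "diminishing returns" regime: each doubling of labeled data buys a smaller absolute improvement, which is why style gains saturate long before knowledge gaps close.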
The misconception turns dangerous when teams use win-rate metrics to justify their fine-tuning investment. Win rates measure stylistic preference; they can rise even while factual accuracy falls.
Pretraining Biases Persist Through Every Layer of Post-Training
When researchers analyze what separates the behavior of different fine-tuned models, the dominant factor isn't the instruction dataset — it's the pretraining backbone.
One study measured how much model behavior clusters by pretraining base versus instruction dataset: the silhouette score for pretraining was 0.104, versus 0.028 for instruction data. When researchers swapped instruction datasets across different base models, each model retained its pretraining-specific bias patterns. Fine-tuning amplifies or suppresses surface expression of those patterns, but doesn't rewrite the underlying distribution.
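The clustering comparison can be reproduced in miniature. The sketch below builds synthetic "behavior vectors" for four models (two pretraining bases x two instruction datasets), where the base shifts behavior more strongly than the instruction data, then computes silhouette scores under both groupings. All data here is synthetic; it illustrates the measurement, not the study's numbers.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical setup: 4 fine-tuned models = 2 pretraining bases x 2
# instruction datasets. Each model contributes behavior vectors
# (e.g. per-benchmark bias scores). The base shifts the mean by 1.0,
# the instruction data by only 0.15 -- the asymmetry under test.
base_offset = {"base_A": 0.0, "base_B": 1.0}       # strong effect
instr_offset = {"instr_X": 0.0, "instr_Y": 0.15}   # weak effect

vectors, base_labels, instr_labels = [], [], []
for base, b_off in base_offset.items():
    for instr, i_off in instr_offset.items():
        pts = rng.normal(loc=b_off + i_off, scale=0.2, size=(50, 8))
        vectors.append(pts)
        base_labels += [base] * 50
        instr_labels += [instr] * 50
X = np.vstack(vectors)

# Silhouette score in [-1, 1]: higher means behavior clusters
# more cleanly under that grouping.
s_base = silhouette_score(X, np.array(base_labels))
s_instr = silhouette_score(X, np.array(instr_labels))
print(f"grouped by pretraining base:     {s_base:.3f}")
print(f"grouped by instruction dataset:  {s_instr:.3f}")
```

Run on real behavior measurements, the same comparison is what yields the 0.104-vs-0.028 asymmetry reported above.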
The implication for practitioners is uncomfortable: if your model has problematic defaults — an over-confident tone in medical contexts, biased associations between demographic categories, a tendency to confabulate citations — fine-tuning alone won't fix that. The root cause is in the pretraining corpus. The most effective lever is data curation upstream, not RLHF downstream.
Instruction Tuning Doesn't Actually Add Knowledge
One of the most persistent misconceptions in enterprise AI adoption is that fine-tuning on domain documents teaches the model domain knowledge. The evidence says otherwise.
When researchers tested this directly, LoRA models trained on just 1,000 samples outperformed, on factuality metrics, full fine-tuning runs on datasets 326 times larger. Full fine-tuning caused knowledge degradation rather than enhancement. The mechanism is subtle but important: approximately 81% of tokens shifted by full fine-tuning originate from the training dataset itself, and 87% of hallucinated phrases are causally linked to that same training data. The model isn't learning facts; it's learning to mimic the statistical patterns of your documents, including all their errors and omissions.
This also explains why fine-tuning on rigid prompt formats creates brittleness. The model learns to respond to your training format rather than the underlying intent, which means slight reformulations at inference time produce unexpected outputs.
The practical consequence: if you need a model to reliably answer questions about your proprietary knowledge base, retrieval-augmented generation will outperform fine-tuning — not because RAG is more sophisticated, but because it bypasses the knowledge ceiling entirely by keeping facts external.
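Here is a minimal sketch of that design, with TF-IDF standing in for a learned embedding model and an invented three-document policy corpus: the facts live in an external store, and retrieval injects them into the prompt at inference time rather than baking them into weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy knowledge base standing in for a proprietary document store.
docs = [
    "Policy 12.4: refunds are issued within 14 days of a return request.",
    "Policy 9.1: enterprise contracts renew annually unless cancelled.",
    "Policy 3.7: support tickets are triaged within one business day.",
]

# Embed the corpus once; TF-IDF vectors are L2-normalized, so a dot
# product is cosine similarity. A production system would swap in a
# learned embedding model, but the architecture is identical.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

def retrieve(query: str, k: int = 1) -> list[str]:
    q = vectorizer.transform([query])
    scores = (doc_matrix @ q.T).toarray().ravel()
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

context = retrieve("how long do refunds take?")[0]
prompt = f"Answer using only this context:\n{context}\n\nQ: How long do refunds take?"
print(prompt)  # the model never needs the policy encoded in its weights
```

When Policy 12.4 changes, you update one row in the document store; no retraining, no knowledge ceiling, no stale weights.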
LoRA's Hidden Trade-off: Intruder Dimensions
LoRA is the dominant parameter-efficient fine-tuning method for good reasons: it's cheap, fast, and often matches full fine-tuning on target tasks. But a critical 2024 finding revealed a structural problem: LoRA models develop "intruder dimensions", new high-ranking singular vectors that are nearly orthogonal to the singular vectors of the original pretrained weight matrices.
These intruder dimensions emerge because the low-rank parameterization (the BA matrix product) forces gradient updates into constrained subspaces, creating new directions that escape rather than adjust pretraining representations. The consequence is a model that performs well on the adaptation task while becoming a worse model of the pretraining distribution — with reduced robustness outside the adaptation context, increased catastrophic forgetting in continual learning settings, and worse generalization to multiple simultaneous tasks.
Setting the LoRA scaling factor α=2r (rather than the common default of α=r) reduces intruder dimensions. Moderate ranks (r=64) appear to better balance task adaptation with pretraining retention than very low or very high ranks. But no hyperparameter setting eliminates the fundamental tension.
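Intruder dimensions can be detected with a straightforward SVD comparison: take the singular vectors of the fine-tuned weight matrix and check each against all pretrained singular vectors; a fine-tuned singular vector with low cosine similarity to every pretrained one is an intruder. The sketch below does this on synthetic matrices, with the LoRA update deliberately scaled up so the effect is visible; the 0.6 threshold and all matrix scales are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic pretrained weight and a rank-8 LoRA update, scaled up
# to make intruder dimensions obvious in a toy setting.
d, r = 256, 8
W = rng.normal(size=(d, d)) / np.sqrt(d)
B = rng.normal(size=(d, r)) * 0.5
A = rng.normal(size=(r, d)) * 0.5
alpha = 2 * r                       # the alpha = 2r setting discussed above
W_ft = W + (alpha / r) * (B @ A)    # effective fine-tuned weight

U_pre, _, _ = np.linalg.svd(W)
U_ft, S_ft, _ = np.linalg.svd(W_ft)

# A singular vector of the fine-tuned matrix is an "intruder" if it
# has low cosine similarity with every pretrained singular vector.
sims = np.abs(U_ft.T @ U_pre)   # |cos| between all vector pairs
max_sim = sims.max(axis=1)      # best pretrained match per new vector
top_k = 32                      # inspect the high-singular-value end
intruders = int(np.sum(max_sim[:top_k] < 0.6))
print(f"intruder dimensions among top {top_k}: {intruders}")
```

The rank-r update injects up to r new high-singular-value directions that no pretrained singular vector explains, which is precisely the "escape rather than adjust" behavior described above.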
The practical read: LoRA is highly effective for format and style adaptation. It's a poor tool for trying to fundamentally shift model behavior or inject domain expertise.
The Alignment Tax: Safety and Capability Compete for the Same Parameters
If you've ever noticed that a safety-aligned model seems less capable on complex reasoning tasks than its base counterpart, you've encountered the alignment tax. This isn't perception — it's measurable and mechanistically understood.
Reasoning training on large reasoning models improves task accuracy by an average of 22.64% across benchmarks. But it simultaneously increases harmful output scores by 43.7%. When you then apply safety alignment to bring those scores back down, DirectRefusal-style approaches reduce harmful scores by 59.6% — but decrease reasoning accuracy by 30.91%. The SafeChain approach reduces the penalty to 7.09% with a less aggressive safety gain of 29.1%.
The root cause is parameter interference: safety objectives modify the same parameter subspaces that encode pre-trained competencies. They can't both win. This is not a solvable problem with better training techniques so much as a structural tension in the current paradigm. Null-space projection methods (confining safety updates to orthogonal spaces of pretraining features) can reduce interference, but not eliminate it.
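A toy version of null-space projection makes the idea tangible: given a matrix X whose rows span pretraining feature directions, project a safety update dW onto the null space of X, so the update provably leaves those directions untouched. Everything below is synthetic and illustrative, not a specific published method's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_feat = 64, 16

# Synthetic pretraining feature directions: activations the safety
# update must not disturb (rows of X span a 16-dim subspace).
X = rng.normal(size=(n_feat, d))

# Orthonormal basis of the feature subspace via SVD, then the
# projector onto its orthogonal complement (the null space of X).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P_null = np.eye(d) - Vt.T @ Vt

# A raw safety update, and its null-space-projected version.
dW = rng.normal(size=(d, d)) * 0.01
dW_safe = dW @ P_null

x = X[0]                              # one pretraining feature direction
print(np.linalg.norm(dW @ x))         # nonzero: raw update disturbs it
print(np.linalg.norm(dW_safe @ x))    # ~0: projected update leaves it alone
```

The catch is visible in the geometry: the projection discards every component of the update that overlaps the feature subspace, so whatever safety signal lived in those directions is lost too. Reduced interference, not eliminated tension.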
For teams building applications where both safety and reasoning quality matter — medical, legal, financial contexts — this means accepting a capability-safety frontier and choosing where on it to operate, rather than expecting fine-tuning to give you both for free.
Jailbreaking as Diagnostic: What It Tells You About Fine-Tuning's Depth
One of the most clarifying ways to understand fine-tuning's limitations is through what jailbreaking research reveals. Safety alignment is one of the most carefully executed post-training efforts across the industry. If even that can't override pretraining, more casual fine-tuning efforts have a steeper hill to climb.
The evidence is stark. Researchers demonstrated that just 10 benign question-answer pairs are sufficient to jailbreak safety-aligned LLMs through fine-tuning — achieving a 94.84% average attack success rate across 10 different models. The mechanism is overfitting: benign fine-tuning makes models sensitive to parameter changes in ways that compromise safety boundaries. The loss landscape around safety-aligned parameters is shallow enough that a small push dislodges it.
This finding isn't primarily about adversarial attacks. It's a diagnostic about depth. If safety alignment — which is applied at scale, with enormous resources — can be undone by 10 examples, then a company's custom fine-tuning job on 5,000 FAQ pairs is not building a fundamentally new model. It's applying a surface coating that will chip under pressure.
The Enterprise Ceiling: What the Data Says
The industry has now accumulated enough fine-tuning deployments to measure the patterns.
One financial services team reported needing 15,000 labeled SEC filings to achieve acceptable results — against vendor claims of 500. The saturation threshold for meaningful performance gains appears to be around 100,000 domain-specific samples, at which point models can exhibit instability and performance regression rather than further improvement. Forrester puts the average enterprise fine-tuning implementation cost at $387,000 per domain, with data preparation (cleaning, labeling, validation) consuming more calendar time than actual model training.
In medical settings, 91.8% of clinicians across 15 specialties had encountered model hallucinations in fine-tuned medical AI systems, and 84.7% considered them capable of causing patient harm. The critical finding: medical hallucination in fine-tuned models is driven by reasoning failures, not knowledge deficits. You can't fix it by adding more medical training data.
The common pattern across domains: fine-tuning delivers genuine gains on in-distribution tasks (typically 5–15%), meaningful style and format improvements, and reduced inference costs compared to prompting-heavy approaches. What it doesn't deliver is genuine domain expertise, suppression of hallucination in high-stakes contexts, or immunity to adversarial inputs.
What This Means for Your Domain Adaptation Strategy
Given the evidence, here is a practical frame for where fine-tuning fits in a production AI system:
Fine-tuning excels at:
- Output format and response style
- Tone calibration for your brand or audience
- In-distribution task improvement within well-covered pretraining domains
- Cost reduction by replacing verbose system prompts
Fine-tuning is the wrong tool for:
- Teaching the model new facts or domain knowledge (use RAG)
- Suppressing deeply encoded pretraining behaviors (address at the system level)
- Critical safety contexts where alignment must be robust (add external guardrails)
- Domains with rapidly changing ground truth — financial regulations, clinical guidelines
The broader strategic implication: if your business use case depends on a model reasoning fundamentally differently about a domain than its pretraining data suggests, fine-tuning cannot close that gap. Your engineering investment is better directed toward retrieval infrastructure that keeps facts external, evaluation pipelines that measure what actually matters rather than win rates, and multi-layer safeguards that don't rely on the model's internal priors.
The Honest Assessment
The fine-tuning ecosystem has democratized capability adaptation in ways that genuinely matter. LoRA and QLoRA allow teams to customize frontier models for specific formats and contexts at modest cost. That's a real advance.
But the framing that has accompanied this — that fine-tuning enables "domain expertise," that you can teach a model to "think like a doctor" or "reason like a financial analyst" with a few thousand labeled examples — has set up predictable disappointments. The pretraining shadow isn't a bug to be patched. It's the structural reality of how these models encode what they know.
The teams that are doing well with domain adaptation in production have largely stopped fighting this constraint. They build systems where the model's pretraining is a foundation to work with, not a substrate to override. Retrieval handles what's specific. Evals measure what's real. And fine-tuning does what it's actually good at: formatting the output you want.
- https://arxiv.org/html/2410.10862v1
- https://arxiv.org/html/2410.03717
- https://arxiv.org/html/2408.06663v1
- https://arxiv.org/html/2602.20062v1
- https://arxiv.org/html/2402.05119v3
- https://arxiv.org/html/2601.13244
- https://arxiv.org/html/2501.08716
- https://arxiv.org/html/2410.21228v1
- https://arxiv.org/html/2503.00555
- https://arxiv.org/html/2510.02833v2
- https://arxiv.org/html/2507.07186
- https://arxiv.org/html/2404.10779v1
- https://arxiv.org/html/2503.05777v2
- https://arxiv.org/abs/2310.06452
