The Ghost in the Weights: How Pretraining Residue Breaks Your Fine-Tuned Model in Production
Your fine-tuned model passes your eval suite with 93% accuracy. You ship it. Three weeks later, a customer sends a screenshot: the model answered a question it had never seen in training with complete confidence, and it was flatly wrong. The answer wasn't a hallucination in the usual sense. It was a memory. A pattern baked in during pretraining, resurfacing on a distribution the fine-tune never covered. This is pretraining residue, and it's one of the most underdiagnosed failure modes in production fine-tuning.
Fine-tuning adjusts weights. It does not retrain the model from scratch. The patterns developed during pretraining at trillion-token scale (the calibration mechanisms, the confidence signals, the world-model priors) remain in the weights. Your fine-tuning dataset, no matter how carefully curated, is a thin layer on top of a much deeper prior. When inputs arrive that fall outside your fine-tuning distribution, the model doesn't say "I don't know." It reaches back to pretraining and answers as if it does.
What Pretraining Residue Actually Is
The term "pretraining residue" describes the behavioral fingerprints that persist in a fine-tuned model because fine-tuning never fully overwrites the signals encoded during pretraining. To understand why this happens, you need to understand the asymmetry between the two phases.
Pretraining trains on hundreds of billions to trillions of tokens across a vast, diverse corpus. The model develops general representations: grammar, world facts, reasoning patterns, and — crucially — a learned sense of what it "knows" versus what it should hedge on. This confidence calibration is implicit. It's not a separate module; it's baked into the weight distributions of every layer.
Fine-tuning trains on thousands to millions of task-specific examples. It shifts the output distribution toward the target task. But the underlying representations and their calibration properties change much less than engineers expect. Research measuring accuracy and confidence separately during fine-tuning reveals a telling pattern: on data that overlaps with pretraining knowledge, confidence escalates continuously while accuracy plateaus early. After just 200 steps on familiar data, accuracy stops improving — but the model keeps getting more confident.
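You can watch this divergence in your own runs by logging accuracy and confidence as separate curves at eval time. Below is a minimal sketch, assuming a Hugging Face-style causal LM whose forward pass returns `.logits` and standard next-token labels with `-100` masking; the batch layout is illustrative, not prescriptive.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def accuracy_and_confidence(model, batch):
    """Token-level accuracy and mean top-1 confidence for one eval batch."""
    logits = model(input_ids=batch["input_ids"]).logits[:, :-1]  # predict token t+1
    labels = batch["labels"][:, 1:]
    mask = labels != -100                      # score only answer positions

    probs = F.softmax(logits.float(), dim=-1)
    conf, pred = probs.max(dim=-1)             # top-1 probability and its token

    correct = (pred == labels.clamp(min=0)) & mask
    accuracy = correct.sum() / mask.sum()
    mean_confidence = conf[mask].mean()
    return accuracy.item(), mean_confidence.item()
```

Plotted over training steps, the two curves should separate exactly where the pattern above predicts: accuracy flattens early on familiar data while mean confidence keeps climbing. The divergence, not the raw accuracy, is the residue signal.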
This is the calibration failure that defines pretraining residue. The fine-tuned model looks sharp on evaluation data, which inevitably resembles the training distribution. But point it at novel inputs — customer queries that deviate from the fine-tuning template, edge cases, domain shifts — and the model confidently extrapolates from pretraining, not from what it was actually taught.
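One way to surface this before shipping is to score calibration on two eval slices: one drawn from the fine-tuning distribution and one deliberately outside it. The function below is a standard binned expected calibration error (ECE) in plain NumPy; the two slices are something you have to construct yourself, and the variable names are illustrative.

```python
import numpy as np

def expected_calibration_error(confidence, is_correct, n_bins=10):
    """Binned ECE: bucket predictions by confidence, then take the
    bin-mass-weighted mean gap between accuracy and confidence."""
    confidence = np.asarray(confidence, dtype=float)
    is_correct = np.asarray(is_correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(is_correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# ece_id  = expected_calibration_error(conf_id,  correct_id)   # fine-tune slice
# ece_ood = expected_calibration_error(conf_ood, correct_ood)  # deliberately OOD slice
```

A small `ece_id` next to a large `ece_ood` is the residue signature: the model is calibrated where you looked and miscalibrated where you didn't.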
The Production Failure Mode: Confident and Wrong
The "ghost in the weights" failure has a specific shape that makes it particularly hard to catch before shipping:
The model answers fluently. Pretraining instills a strong prior toward producing coherent, confident text. When a fine-tuned model encounters an out-of-distribution (OOD) input, it doesn't output garbage; it outputs a plausible-sounding answer in the fine-tuned style. The failure is semantic, not syntactic.
The model doesn't hedge. One of the more counterintuitive findings from recent research: when models perform chain-of-thought reasoning before answering, their confidence increases, and the increase is larger when the final answer is incorrect than when it's correct. Reasoning makes the model more convinced, not less, on the wrong answers. If you added CoT to your fine-tune because it improved benchmark scores, you may have amplified calibration failure on OOD inputs; the sketch after this list shows one way to measure it on your own evals.
The error surface is invisible to your eval suite. Your evaluation dataset was drawn from the same distribution as your training data, or close to it. The model performs well there. The OOD failure only surfaces in production, where users bring inputs you didn't anticipate.
The failure compounds with prompt complexity. Simple, direct prompts on familiar inputs are where fine-tuning works best. Complex, multi-part queries — or inputs that partially match the fine-tuning domain but deviate in specific ways — are where pretraining fills in the gaps. The model answers the question it was trained to answer, not the one it was asked.
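The CoT effect in particular is cheap to check: run each eval question twice, once with a direct answer and once with chain-of-thought, record the model's confidence in the final answer both times, and split the confidence shift by whether that answer was right. The record fields below are hypothetical names from your own eval harness.

```python
from collections import defaultdict
from statistics import mean

def cot_confidence_shift(records):
    """records: iterable of dicts with hypothetical keys 'correct' (bool),
    'conf_direct' and 'conf_cot' (floats in [0, 1]).
    Returns the mean CoT-induced confidence shift, split by correctness."""
    shifts = defaultdict(list)
    for r in records:
        key = "correct" if r["correct"] else "incorrect"
        shifts[key].append(r["conf_cot"] - r["conf_direct"])
    return {key: mean(vals) for key, vals in shifts.items()}

# If the "incorrect" mean exceeds the "correct" mean, CoT is inflating
# confidence precisely on the answers you most need the model to hedge on.
```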
Why Your Fine-Tuning Data Is Part of the Problem
Most teams underestimate how much their fine-tuning data overlaps with pretraining. The model has seen versions of your domain before. Medical records, legal documents, customer service transcripts, code — all of these exist at scale on the internet, in papers, in public databases. When you fine-tune on domain-specific data, you're often reinforcing patterns the model already partially learned, not teaching it something genuinely new.
The practical consequence: the model quickly "remembers" the familiar parts of your domain and assimilates them at high confidence, while the genuinely novel parts of your task — the edge cases, the domain-specific terminology, the unusual patterns — don't get enough weight updates to displace pretraining priors. You end up with a model that's excellent at the easy cases (which were already somewhat covered by pretraining) and confidently wrong on the hard cases (which weren't).
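This suggests a cheap diagnostic: score every fine-tuning example with the base model before training starts. Examples the base model already predicts well are the familiar slice that will saturate quickly; the high-loss tail is where the genuinely novel signal lives and where you may want to oversample. A minimal sketch, again assuming a Hugging Face-style causal LM with `-100` label masking:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def base_model_novelty(model, batch):
    """Per-example token-averaged loss under the *base* model.
    Low loss: pretraining already covers the example.
    High loss: genuinely novel for this model."""
    logits = model(input_ids=batch["input_ids"]).logits[:, :-1]
    labels = batch["labels"][:, 1:]
    mask = (labels != -100).float()

    token_loss = F.cross_entropy(
        logits.transpose(1, 2),    # cross_entropy wants (B, V, T)
        labels.clamp(min=0),       # clamp the -100 fill; masked out below
        reduction="none",
    )                              # (B, T)
    return (token_loss * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```

Sorting the dataset by this score gives you a concrete split between reinforcing what the model already knows and teaching it something genuinely new.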
There's also the opposite failure mode: catastrophic forgetting. When fine-tuning on sequential tasks, models can overwrite capabilities they developed earlier. Recent mechanistic analysis identifies three causes that operate in parallel during fine-tuning: attention head reorganization (15-23% of heads undergo severe entropy shifts), representational drift in intermediate layers, and flattening of the loss landscape around previously learned tasks. The model that was great at task A before fine-tuning on task B may be significantly worse at A afterward — not because it learned something incompatible, but because the weight updates displaced the representations that supported A.
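You can get a rough read on the first of those three causes without special tooling by comparing per-head attention entropy between the base and fine-tuned checkpoints on identical probe inputs. A sketch, assuming a Hugging Face-style model that can return attention maps (padding handling omitted for brevity):

```python
import torch

@torch.no_grad()
def per_head_attention_entropy(model, input_ids):
    """Mean attention entropy per head, returned as (layers, heads)."""
    out = model(input_ids=input_ids, output_attentions=True)
    per_layer = []
    for attn in out.attentions:               # each: (B, heads, T, T), rows sum to 1
        p = attn.clamp_min(1e-12)
        entropy = -(p * p.log()).sum(dim=-1)  # (B, heads, T)
        per_layer.append(entropy.mean(dim=(0, 2)))
    return torch.stack(per_layer)

# shift = (per_head_attention_entropy(finetuned, probe_ids)
#          - per_head_attention_entropy(base, probe_ids)).abs()
# Heads in the upper tail of `shift` are the ones that reorganized.
```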
These two failure modes — not forgetting enough (residue bleeding through) and forgetting too much (catastrophic overwrite) — are both downstream of the same root cause: fine-tuning is shallow compared to pretraining, and the balance between residue and forgetting is not something you can reliably tune without instrumentation.
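In practice, the minimum viable instrumentation is a ship/no-ship check around every fine-tuning run that watches both directions at once. The metric names and thresholds below are hypothetical placeholders for whatever your eval harness produces:

```python
def finetune_gate(before, after, max_regression=0.02, max_ood_conf_gain=0.05):
    """before/after: metric dicts for the base and fine-tuned checkpoints,
    e.g. {'prior_task_acc': ..., 'ood_acc': ..., 'ood_conf': ...}.
    Flags catastrophic forgetting (a prior capability regressed) and
    residue (OOD confidence rose without OOD accuracy following)."""
    problems = []

    forgetting = before["prior_task_acc"] - after["prior_task_acc"]
    if forgetting > max_regression:
        problems.append(f"prior task regressed by {forgetting:.3f}")

    conf_gain = after["ood_conf"] - before["ood_conf"]
    if conf_gain > max_ood_conf_gain and after["ood_acc"] <= before["ood_acc"]:
        problems.append(f"OOD confidence rose {conf_gain:.3f} without accuracy gains")

    return len(problems) == 0, problems
```

Neither check is sophisticated, but together they make the residue/forgetting trade-off visible per run instead of discovering it in a customer screenshot.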
References
- https://arxiv.org/html/2505.20903v1
- https://arxiv.org/html/2501.09775
- https://arxiv.org/html/2601.18699v1
- https://arxiv.org/abs/2308.08747
- https://arxiv.org/html/2410.10880v1
- https://arxiv.org/html/2410.10739v1
- https://arxiv.org/abs/2306.01708
- https://arxiv.org/html/2504.12585
- https://arxiv.org/abs/2305.18290
