The Fine-Tune Artifact Your Departing Engineer Took With Them
A fine-tune is not a file. It is the closure of a pipeline over a training set, and the team that ships the file without the closure has built a production dependency whose source code is in someone else's head. The day that person leaves with two weeks of notice and a clean handoff document is the day your bus factor on a revenue feature drops to zero and nobody notices, because the weights are still in the registry and the registry tag is still stable and the model still serves traffic. The reckoning shows up later, in a routine base-model migration that should have taken a sprint and takes a quarter instead.
The pattern is consistent across teams I have watched run into it. An ML engineer spends six months iterating on a fine-tune: data curation, hyperparameter sweeps, behavioral patches evaluated by feel against a held-out set. The final adapter weights get pushed to the model registry with a tag. The training pipeline that produced those weights is a notebook on the engineer's laptop, with hard-coded paths and floating dependencies that resolved to whatever was the latest version on the day each cell was last executed. The team accepts the handoff at face value because the weights work, the eval scores are good, and the registry tag is stable. Eighteen months later, the engineer departs. Six months after that, a base-model migration requires regenerating the adapter against an updated base. The notebook runs, but the weights it produces score three points lower and regress visibly on the hardest customer segment. The team then spends four months trying and failing to reproduce the original artifact.
This post is about why that failure is structural rather than personal, and about the artifacts a team needs to own — not the weights, but the closure around them — to keep a fine-tuned model maintainable across the years of operational life it will actually have.
The Weights Are Not the Artifact
The mental model that gets teams in trouble treats a fine-tuned model the way a backend team treats a compiled binary: an opaque output that you build once, deploy, and only rebuild when you change the source. That model is wrong for fine-tunes for two reasons.
The first is that the inputs to a fine-tune are not a tractably small piece of source code. They are a training set, a base model, a tokenizer, a data preprocessing pipeline, a randomization scheme, a hyperparameter configuration, a hardware environment, and a software stack whose dependency graph includes hundreds of pinned-or-floating Python packages and several layers of CUDA. If any one of those inputs drifts between the original run and a regeneration run, the resulting weights can move in ways the eval set may or may not catch. To pick up where someone else left off with a fine-tune, you need much more than a Git repository. You need to know what data the code is using and how that data came to be.
The second is that the artifact decays through no fault of its own. Base models get deprecated. Tokenizers get upgraded. Frameworks issue breaking changes. CUDA drivers move. The fine-tuned weights you trained against one base model are not portable to a successor base model in any general sense — at best, recent research on cross-model adapter transfer suggests that under certain conditions and specific architectures you can avoid a full retrain, but in the common case a base-model migration requires regenerating the adapter from the original training data, which means running the original pipeline again. If that pipeline is not deterministically reproducible, the regeneration is a guess.
These two facts compound. The artifact decays on a vendor schedule you do not control, and reproducing the artifact requires inputs that no one wrote down. The team that does not budget for the second of those will discover the first as a surprise.
What Reproducibility Actually Requires
Most teams underestimate what it takes to reproduce a fine-tune because the failures are quiet. A reproduction run will usually complete and produce weights — they just will not be the same weights, and the difference will not always be visible on the canonical eval set.
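One quiet source of such differences is floating-point arithmetic itself: addition is not associative, so accumulating the same values in a different order (as can happen when the worker count or accelerator layout changes) can give a different result. The toy sketch below uses plain Python floats and deliberately extreme values. It illustrates the effect only and is not a model of any particular training framework.

```python
def accumulate(values):
    """Sum left to right, the way a simple reduction loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same four numbers in two accumulation orders.
order_a = [1e16, 1.0, -1e16, 1.0]
order_b = [1e16, -1e16, 1.0, 1.0]

print(sorted(order_a) == sorted(order_b))  # True: identical inputs
print(accumulate(order_a))  # 1.0 (the first +1.0 is lost next to 1e16)
print(accumulate(order_b))  # 2.0
```

Real training runs differ by far smaller amounts per operation, but the differences compound over millions of updates, which is why a rerun can finish cleanly and still land on different weights.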
A reproducible fine-tune pipeline has to pin every input that can move:
- The training data, by content hash, with the exact filter steps that produced it from raw sources captured as code rather than as a manual cell run once and forgotten.
- The base model, by exact revision in the upstream registry, including the tokenizer version, since tokenizer drift will quietly change how many tokens each example occupies and therefore which examples get truncated at your context length during training.
- The dependency closure — Python packages, CUDA, driver versions — as a container image with an immutable digest, not a requirements file that resolves to different versions tomorrow.
- The randomization, by explicit seeds at every source of randomness: weight initialization, data shuffling, dropout, augmentation. A common pitfall is to set a global seed and assume it covers everything; in distributed training, the per-worker shuffle order is its own random source, and getting reproducibility across worker counts is its own problem.
- The hyperparameter configuration, as a versioned artifact in the registry alongside the weights, not as values typed into a notebook cell.
- The hardware shape — number and type of accelerators — because in mixed-precision training the order of accumulations can shift results in ways that look like noise on the eval set and like a regression on the production tail.
Each of these items has a literature behind it of teams discovering, the hard way, that the obvious approach is not quite enough. Setting random.seed(42) does not cover the data loader's shuffle. Pinning torch==2.3 does not pin the CUDA kernel it dispatches to. A dataset stored as a single Parquet file and labeled "v1" does not capture the filter step that dropped 200,000 examples whose language tag was wrong.
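The seeding pitfall can be shown with nothing but the standard library. In the sketch below, `run` and `loader_rng` are hypothetical stand-ins: the module-level `random` functions play the role of weight initialization, and a separate `random.Random` instance plays the role of a data loader that owns its own generator. Real frameworks differ in the details, so treat this as an illustration of the principle rather than a description of any specific loader.

```python
import random

def shuffled_order(n, generator):
    """Return example indices 0..n-1 shuffled with the given generator."""
    order = list(range(n))
    generator.shuffle(order)
    return order

def run(global_seed, loader_seed=None):
    # The "global seed" that many notebooks stop at.
    random.seed(global_seed)
    # Hypothetical loader with its own generator. With loader_seed=None it is
    # seeded from OS entropy or the clock, not from the global seed above.
    loader_rng = random.Random(loader_seed)
    init_noise = [random.random() for _ in range(3)]  # stands in for weight init
    batch_order = shuffled_order(1000, loader_rng)    # stands in for data shuffling
    return init_noise, batch_order

a_init, a_order = run(42)
b_init, b_order = run(42)
print(a_init == b_init)    # True: covered by the global seed
print(a_order == b_order)  # almost certainly False: the loader was never seeded

c_init, c_order = run(42, loader_seed=1234)
d_init, d_order = run(42, loader_seed=1234)
print(c_order == d_order)  # True once the loader's generator is seeded explicitly
```

The design point is that every generator in the pipeline needs its own recorded seed. A single call at the top of the notebook only covers the generators that happen to share that global state.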
The discipline that closes the gap is not a tool. It is a contract: every artifact in the model registry ships with a reproduction command. Run that command, and you get the same artifact, byte-for-byte or within a tolerance you have explicitly defined and tested. If you cannot write that command for an artifact, the artifact is not maintainable, and you should treat its presence in your production stack as a piece of technical debt with an unknown maturity date.
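A minimal sketch of what "byte-for-byte or within a tolerance" might look like as a check. It assumes weights are available as a mapping from tensor name to a flat list of floats. The names `lora_A` and `lora_B`, the float32 encoding, and the tolerance value are all illustrative assumptions, not a recommendation for any specific model.

```python
import hashlib
import struct

def weights_digest(weights):
    """SHA-256 over the float32 encoding of every tensor, in sorted name order."""
    h = hashlib.sha256()
    for name in sorted(weights):
        values = weights[name]
        h.update(name.encode("utf-8"))
        h.update(struct.pack(f"<{len(values)}f", *values))
    return h.hexdigest()

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two weight mappings."""
    if a.keys() != b.keys():
        raise ValueError("tensor names differ")
    worst = 0.0
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch in {name}")
        for x, y in zip(a[name], b[name]):
            worst = max(worst, abs(x - y))
    return worst

def check_reproduction(original, reproduced, atol):
    """Classify a reproduction as identical, within tolerance, or a mismatch."""
    if weights_digest(original) == weights_digest(reproduced):
        return "identical"
    if max_abs_diff(original, reproduced) <= atol:
        return "within_tolerance"
    return "mismatch"

original = {"lora_A": [0.125, -0.5], "lora_B": [0.25, 0.75]}
close = {"lora_A": [0.125, -0.5], "lora_B": [0.25, 0.75 + 1e-6]}
far = {"lora_A": [0.125, -0.5], "lora_B": [0.25, 0.80]}

print(check_reproduction(original, dict(original), atol=1e-5))  # identical
print(check_reproduction(original, close, atol=1e-5))           # within_tolerance
print(check_reproduction(original, far, atol=1e-5))             # mismatch
```

A weight-space tolerance is only one possible definition. A team may reasonably prefer a behavioral tolerance measured on an eval set, as long as the definition is written down and tested.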
The Notebook-to-Pipeline Tax
The honest reason most fine-tune artifacts are not reproducible is that the work that produced them was research, and the work that maintains them is engineering, and the two are usually done by the same person under the same job title with no transition between them.
A research notebook is the right tool for iteration. You explore the space, try things, throw away what does not work, and keep what does. The cells you keep are the ones that contributed to the final result, but you do not annotate them as such — you remember which ones mattered, because you wrote them yesterday and you will run them again tomorrow. The notebook accumulates floating cells, dead branches, and assumptions encoded as ambient state. It runs top-to-bottom only if you happen to execute it that way; the kernel will quietly hold variables from an earlier run that the cells below depend on without naming.
The engineering work to make that notebook a maintainable training job is real, and it is not glamorous. It includes:
- Extracting the cells into a script with a versioned entry point and a configuration file.
- Replacing the manual filter steps — "I dropped these rows because they looked weird" — with code, comments that explain the criterion, and a unit test on a small fixture.
- Pinning the dependency graph to a container.
- Documenting every external resource the pipeline touches and verifying the pipeline still runs when those resources are accessed read-only by a different account.
- Running the script end-to-end on a clean environment and asserting the resulting weights match the original artifact within a defined tolerance.
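As a sketch of the second item in the list above, here is what turning a manual filter step into code with a fixture test might look like. The criterion (keep English rows with non-empty text), the field names, and the fixture rows are all hypothetical. The point is that the criterion is written down, tested, and tied to a content hash of the filtered data.

```python
import hashlib
import json

# Hypothetical criterion: rows whose language tag is not in this set are dropped.
ALLOWED_LANGUAGES = frozenset({"en"})

def keep_example(example):
    """Keep a row only if its language tag is allowed and its text is non-blank."""
    has_text = bool(example.get("text", "").strip())
    return example.get("lang") in ALLOWED_LANGUAGES and has_text

def filter_dataset(examples):
    return [ex for ex in examples if keep_example(ex)]

def content_hash(examples):
    """SHA-256 over a canonical JSON encoding of each row, in order."""
    h = hashlib.sha256()
    for ex in examples:
        line = json.dumps(ex, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()

FIXTURE = [
    {"id": 1, "lang": "en", "text": "How do I reset my password?"},
    {"id": 2, "lang": "xx", "text": "row with a wrong language tag"},
    {"id": 3, "lang": "en", "text": "   "},
]

def test_filter_on_fixture():
    kept = filter_dataset(FIXTURE)
    assert [ex["id"] for ex in kept] == [1]
    # Filtering is deterministic, so the hash of the output is stable.
    assert content_hash(kept) == content_hash(filter_dataset(list(FIXTURE)))
    # The filtered set is a different dataset from the raw one.
    assert content_hash(kept) != content_hash(FIXTURE)

test_filter_on_fixture()
```

Recording the hash of the filtered output next to the filter code means a later rerun can tell immediately whether it is training on the same rows.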
This work is often deferred. The reason is structural: the team that ships the model gets credit for the eval improvement, not for the engineering hygiene around it. Promotion cases and quarterly reviews reward shipped features and improved metrics. They rarely reward a clean pipeline that nobody else has had to touch yet. So the pipeline stays a notebook, and the institutional memory of how it works stays in the head of the person who built it, and the implicit bet the team is making is that this person will be available the next time the artifact has to be regenerated.
That bet pays off until it doesn't.
The Patterns That Close the Gap
A fine-tune team that treats this failure mode as foreseeable can close the gap with a few specific practices. None of them is novel. All of them are routinely skipped because their value is most visible in the runs you do not have to do.
A reproduction command in the registry. Every artifact in the model registry should ship with the exact command that produces it, plus the container image digest, the dataset hash, and the hardware shape needed to run it. If a successor cannot produce the artifact from this command, the artifact is not really in the registry; only its weights are. The discipline of writing this command at the time of training, rather than after, surfaces the missing pieces — the manual cell run once, the data path that points to a personal home directory — at the moment they are still fixable.
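One way to make this concrete is to store a small manifest next to the weights and lint it at registration time. The sketch below is an assumption about what such a record could look like: `ReproductionManifest`, its field names, the example command, and the lint rules are hypothetical and not tied to any particular registry product.

```python
import re
from dataclasses import dataclass

SHA256_HEX = re.compile(r"[0-9a-f]{64}")
PERSONAL_PATH = re.compile(r"~/|/home/|/Users/")

@dataclass(frozen=True)
class ReproductionManifest:
    command: str          # exact command that regenerates the artifact
    image_digest: str     # container image, pinned as "sha256:<64 hex chars>"
    dataset_sha256: str   # content hash of the training data
    accelerators: str     # hardware shape, e.g. "8xA100-80GB"

def manifest_problems(manifest):
    """Return a list of reasons the manifest is not yet a usable contract."""
    problems = []
    prefix = "sha256:"
    digest = manifest.image_digest
    if not (digest.startswith(prefix) and SHA256_HEX.fullmatch(digest[len(prefix):])):
        problems.append("image must be pinned by sha256 digest, not by tag")
    if not SHA256_HEX.fullmatch(manifest.dataset_sha256):
        problems.append("dataset must be identified by a sha256 content hash")
    if PERSONAL_PATH.search(manifest.command):
        problems.append("command references a personal home directory")
    if not manifest.accelerators.strip():
        problems.append("hardware shape is missing")
    return problems

good = ReproductionManifest(
    command="python train.py --config configs/adapter.yaml",
    image_digest="sha256:" + "a" * 64,
    dataset_sha256="b" * 64,
    accelerators="8xA100-80GB",
)
bad = ReproductionManifest(
    command="python ~/notebooks/train.py --data /home/alice/data.parquet",
    image_digest="latest",
    dataset_sha256="v1",
    accelerators="",
)

print(manifest_problems(good))  # []
for problem in manifest_problems(bad):
    print(problem)
```

A lint like this cannot prove the command works, only that the obvious gaps are closed. The proof is still running the command on a clean environment.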
A periodic re-training drill. Once a quarter, the on-call engineer regenerates a sample of production artifacts against the current pipeline and asserts behavioral parity on a stable eval set. This catches dependency drift, accidental changes to upstream resources, and the slow rot of fine-tune pipelines that nobody is running because the production artifact is fine. The drill also forces the team to keep the reproduction commands current; without it, a command that fails today may have been broken for quarters with nobody noticing.
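A parity assertion for the drill might be sketched as follows. It compares per-segment accuracy rather than a single aggregate, since a regression concentrated in one customer segment can vanish in the overall number. The segment names, the result format of `(segment, correct)` pairs, and the `max_drop` threshold are assumptions made for illustration.

```python
from collections import defaultdict

def accuracy_by_segment(results):
    """results: iterable of (segment, correct) pairs from one eval run."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for segment, correct in results:
        totals[segment] += 1
        hits[segment] += int(correct)
    return {segment: hits[segment] / totals[segment] for segment in totals}

def parity_regressions(production, regenerated, max_drop):
    """Return segments where the regenerated artifact drops by more than max_drop."""
    prod = accuracy_by_segment(production)
    regen = accuracy_by_segment(regenerated)
    if prod.keys() != regen.keys():
        raise ValueError("eval runs cover different segments")
    return {
        segment: prod[segment] - regen[segment]
        for segment in prod
        if prod[segment] - regen[segment] > max_drop
    }

# Hypothetical eval results for the production and regenerated artifacts.
production = (
    [("smb", True)] * 90 + [("smb", False)] * 10
    + [("enterprise", True)] * 8 + [("enterprise", False)] * 2
)
regenerated = (
    [("smb", True)] * 91 + [("smb", False)] * 9
    + [("enterprise", True)] * 6 + [("enterprise", False)] * 4
)

overall_prod = sum(ok for _, ok in production) / len(production)
overall_regen = sum(ok for _, ok in regenerated) / len(regenerated)
print(round(overall_prod - overall_regen, 3))  # aggregate drop is under one point
print(parity_regressions(production, regenerated, max_drop=0.05))
```

In this made-up data the aggregate moves by less than a point while the smaller segment loses roughly twenty, which is the shape of regression the drill exists to catch.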
A fine-tune review board. Before a fine-tuned artifact is allowed to serve production traffic, a small group with engineering ownership reviews the pipeline that produced it and signs off that the artifact meets the registry's reproducibility contract. The board is not a research review — the eval scores and behavioral parity are still the researcher's call. It is an engineering review whose job is to convert a notebook into a maintainable training job before it crosses the line into production. The board is more sustainable than a hand-off process because it is not coupled to a specific person's departure.
A hiring scope that names the deliverable. When an ML engineering role is opened, the job description should name "leave a reproducible pipeline" as a deliverable equal in weight to "ship the model." This sounds bureaucratic, and it is. But engineers do the work the team rewards: if only the shipped model goes on the promotion case, the pipeline is the deliverable that gets deferred until it can no longer be done.
These practices do not eliminate the dependency on the researcher's judgment. The choice of what data to include, what hyperparameters to sweep, and how to evaluate a fine-tune is still real and still tacit. What they do is separate the parts of the work that can be made institutional — the pipeline, the reproduction command, the eval contract — from the parts that cannot, so that the institutional layer survives the departure of any one person.
A Fine-Tune Is a Closure, Not a File
The architectural realization that closes the loop is that a fine-tune is the closure of a pipeline over a training set, and the team that ships only the file has shipped only the value of the closure on one particular day. If the pipeline changes the next day (because a dependency moved, because the data source rotated, because the base model was deprecated), the value of the closure changes, and the team that did not keep the closure executable cannot recompute it.
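One way to make the closure visible is to give it an identity of its own: a fingerprint computed over every pinned input, stored alongside the weights. The sketch below is illustrative only. The input names and values are hypothetical placeholders, and a real pipeline would likely pin more than this.

```python
import hashlib
import json

def closure_fingerprint(inputs):
    """SHA-256 over a canonical JSON encoding of all pinned pipeline inputs."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical pinned inputs for one training run.
pinned = {
    "dataset_sha256": "b" * 64,
    "base_model_revision": "rev-A",
    "image_digest": "sha256:" + "a" * 64,
    "hyperparameters": {"learning_rate": 2e-4, "rank": 16, "seed": 42},
    "accelerators": "8xA100-80GB",
}

# Same inputs in a different key order: the fingerprint should not change.
reordered = dict(reversed(list(pinned.items())))

# One input drifts: the fingerprint must change.
drifted = {**pinned, "base_model_revision": "rev-B"}

print(closure_fingerprint(pinned) == closure_fingerprint(reordered))  # True
print(closure_fingerprint(pinned) == closure_fingerprint(drifted))    # False
```

Two artifacts with the same fingerprint were meant to be the same artifact. Two with different fingerprints were not, even if the registry tag never moved.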
The same logic applies to many other ML artifacts. A retrieval index is the closure of an embedding model over a corpus. An eval set is the closure of a labeling protocol over a sample. A prompt template is the closure of a model version over an instruction. In every case, treating the file as the artifact and the closure as ambient is a way of shipping a production dependency whose source code is somewhere your team does not own.
The cost of that shortcut is invisible until the bill comes due, and the bill always comes due. The base model is deprecated, the engineer leaves, the dataset rotates. The team that has not maintained the closure discovers that the only way to recover the original artifact is to remember it, and memory is not a maintainable substrate. The team that has maintained the closure runs the pipeline, ships the new artifact, and treats the migration as a sprint, which is how it should have been all along.
- https://valohai.com/blog/the-bus-factor-in-machine-learning-development/
- https://introl.com/blog/model-versioning-infrastructure-mlops-artifact-management-guide-2025
- https://atlan.com/know/llm-training-data-versioning-strategies/
- https://aws.amazon.com/blogs/machine-learning/end-to-end-lineage-with-dvc-and-amazon-sagemaker-ai-mlflow-apps/
- https://speytech.com/ai-architecture/deterministic-ml-pipeline/
- https://medium.com/the-constellar-digital-technology-blog/exploring-lora-on-google-colab-the-challenges-of-base-model-upgrades-91fd9809511c
- https://arxiv.org/pdf/2501.16559
- https://www.databricks.com/blog/llm-fine-tuning
