
50 posts tagged with "fine-tuning"


The Fine-Tune Artifact Your Departing Engineer Took With Them

· 12 min read
Tian Pan
Software Engineer

A fine-tune is not a file. It is the closure of a pipeline over a training set, and the team that ships the file without the closure has built a production dependency whose source code is in someone else's head. The day that person leaves with two weeks of notice and a clean handoff document is the day your bus factor on a revenue feature drops to zero and nobody notices, because the weights are still in the registry and the registry tag is still stable and the model still serves traffic. The reckoning shows up later, in a routine base-model migration that should have taken a sprint and takes a quarter instead.

The pattern is consistent across teams I have watched run into it. An ML engineer spends six months iterating on a fine-tune: data curation, hyperparameter sweeps, behavioral patches evaluated by feel against a held-out set. The final adapter weights get pushed to the model registry with a tag. The training pipeline that produced those weights is a notebook on the engineer's laptop, with hard-coded paths and floating dependencies that resolved to whatever was the latest version on the day each cell was last executed. The team accepts the handoff at face value because the weights work and the eval scores are good and the registry tag is stable. Eighteen months later, the engineer departs. Six months after that, a base-model migration requires regenerating the adapter against an updated base. The notebook runs, but it produces weights that score three points lower and regress visibly on the hardest customer segment, and the team spends four months trying and failing to reproduce the original artifact.
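
One way to keep the closure next to the weights is to write a small manifest at training time. The sketch below is a minimal illustration, not the author's method: the function names, the manifest fields, the hyperparameter values, and the `example-base-v1` identifier are all hypothetical, and a real pipeline would also record the code revision and the random seeds of every data-processing step.

```python
import hashlib
import json
import platform
import sys
from importlib import metadata


def fingerprint_rows(rows):
    """Return a SHA-256 digest over training rows, independent of dict key order."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys keeps the serialization stable across runs and machines
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def installed_versions(packages):
    """Record the exact installed version of each package, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_manifest(rows, hyperparams, base_model, packages):
    """Bundle what is needed to rerun the job, to be stored beside the weights."""
    return {
        "base_model": base_model,
        "dataset_sha256": fingerprint_rows(rows),
        "num_rows": len(rows),
        "hyperparams": hyperparams,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": installed_versions(packages),
    }


rows = [{"prompt": "Where is my refund?", "completion": "Let me check."}]
manifest = build_manifest(
    rows,
    hyperparams={"lr": 2e-4, "epochs": 3, "seed": 17},  # illustrative values
    base_model="example-base-v1",  # hypothetical identifier
    packages=["pip", "not-a-real-package-xyz"],
)
print(json.dumps(manifest, indent=2))
```

A manifest like this does not make the run reproducible by itself, but it turns "floating dependencies" into recorded versions and gives the next engineer a dataset hash to check against before blaming the base model.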

The Dataset License That Retroactively Poisoned Your Fine-Tune

· 10 min read
Tian Pan
Software Engineer

The fine-tuned checkpoint that has been running in production for nine months is now sitting in a Slack thread between your CTO and outside counsel. A data source that you scraped under what looked like a permissive license has changed its terms, sent a notice, and named your model. Your engineers want to know whether the model can simply be "untrained" on the offending records. Counsel wants to know whether the weights file itself is now a regulated artifact. Nobody on the call has a good answer, because your training pipeline treated the license as an event — read once at ingestion time — instead of a state that the world can edit after you have already paid for the H100s.

This is the failure mode that very few fine-tuning playbooks bother to discuss. The license under which a dataset was distributed is not a static gate that you walk through at ingestion. It is an ongoing claim by a third party that you do not control, and the half-life of that claim is shrinking. Hugging Face's own legal repository quietly logs DMCA takedowns against named datasets every few weeks — AoPS pulling the MATH benchmark, PaperDemon pulling scraped artwork, Archive of Our Own removing a fanfiction dump within hours of notice. Each takedown is a downstream signal that some model somewhere was trained on data whose redistribution rights have since evaporated.
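
Treating the license as state rather than an event can start with something as small as storing a hash of the terms at ingestion and re-reading them on a schedule. The following is a sketch under assumptions: `LicenseRecord`, `recheck`, the source name, and the sample license strings are hypothetical, and how the current terms are fetched is left out entirely.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


def license_digest(text):
    """Hash license text after collapsing whitespace, so reflowed copies match."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


@dataclass
class LicenseRecord:
    source: str        # dataset or site the rows came from
    digest: str        # hash of the terms as seen at ingestion
    checked_on: date   # last date the terms were re-read


def recheck(record, current_text, today):
    """Return (unchanged, record) for a freshly fetched copy of the terms."""
    if license_digest(current_text) == record.digest:
        # Same terms: only the check date moves forward
        return True, LicenseRecord(record.source, record.digest, today)
    # Terms changed: keep the old record untouched so the change can be reviewed
    return False, record


record = LicenseRecord(
    source="example-forum-dump",  # hypothetical source name
    digest=license_digest("Permission is hereby granted, free of charge."),
    checked_on=date(2024, 3, 1),
)
unchanged, record_after = recheck(
    record, "Use for model training is prohibited.", date(2025, 1, 1)
)
print(unchanged)
```

A changed digest only says the terms differ from what was ingested. Deciding what that means for rows already trained on is a question for counsel, not for this check.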

The Fine-Tune Cold Start Your Provider Bills as Idle Time

· 11 min read
Tian Pan
Software Engineer

Your fine-tuned variant serves a few hundred requests per minute on a steady weekday, and the p99 latency dashboard is mostly flat. Then, at 03:14 local time on a Tuesday, p99 spikes from 800ms to 4.6 seconds for a single request, then settles back. The next night, it happens again, roughly the same shape, roughly the same hour. You file a ticket against the provider asking about the spike. The response is correct and unhelpful: their dashboard shows nothing anomalous on their side, no rate limits, no incidents, your token usage at the moment of the spike was unremarkable. The 4.6 seconds happened. The bill does not reflect it.

That gap — between a latency event a user clearly experiences and a bill that registers nothing — is the shape of the fine-tune cold start tax. It is not a bug in your code. It is not a regression on the provider's side. It is the seam where two billing models meet: the provider charges you for active inference time on the adapter, and the cost of loading the adapter into a serving slot is hidden inside the provider's infrastructure layer, where it shows up as your latency but their cost. If your traffic shape ever falls below the provider's keep-warm threshold, you pay for the round trip in p99 every time it climbs back.
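
Because the provider's dashboard will not show the event, the evidence has to come from your own request log. A minimal sketch, assuming you log an arrival timestamp and a latency per request: it flags slow requests that follow a long idle gap. The function name, the 600-second idle threshold, and the sample log are assumptions, since the provider's actual keep-warm threshold is not stated.

```python
from statistics import median


def flag_suspected_cold_starts(requests, idle_threshold_s, slow_factor=3.0):
    """requests: list of (timestamp_s, latency_ms), sorted by timestamp.

    Returns (timestamp_s, latency_ms, idle_gap_s) for each request that
    arrived after a long quiet period and was much slower than the median.
    """
    if not requests:
        return []
    baseline = median(latency for _, latency in requests)
    flagged = []
    for (prev_ts, _), (ts, latency) in zip(requests, requests[1:]):
        idle_gap = ts - prev_ts
        if idle_gap >= idle_threshold_s and latency >= slow_factor * baseline:
            flagged.append((ts, latency, idle_gap))
    return flagged


# Illustrative log: steady traffic, a quiet stretch, then one slow request
log = [(0, 800), (1, 790), (2, 810), (1000, 4600), (1001, 805)]
flagged = flag_suspected_cold_starts(log, idle_threshold_s=600)
print(flagged)
```

If the flagged requests cluster at the same idle gap night after night, that gap is a reasonable estimate of the threshold you would need to stay under with synthetic keep-warm traffic.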

The Fine-Tune Dataset You Accidentally Built While Debugging

· 9 min read
Tian Pan
Software Engineer

The thumbs-down button on your staging UI was supposed to do one thing: tell the on-call engineer which response looked bad so they could go investigate. Six months later, somebody on the modeling team pulled "all production feedback with corrections attached" into a Parquet file and ran an SFT job against it. The eval set improved on three metrics and regressed quietly on five. Nobody could explain why until somebody scrolled through the labels and found a row that read, in the corrections column, "this is fine but I hate how it phrases it." The model learned that opinion. Then it learned forty thousand more of them.

This is the failure mode where the debugging surface and the curation surface are the same surface. Engineers click "bad" because something is broken, because something looks weird, because they were about to file a ticket, because the formatting offends them, because they were checking whether the button works. The signal that flows out of that click is a mix of "this output is wrong," "this output is right but ugly," "I don't like this," and "I was bored." Treated as a single label, it certifies nothing. Trained against, it teaches the model the union of all those moods.

The Fine-Tune That Erased the Alignment You Inherited

· 9 min read
Tian Pan
Software Engineer

You picked the base model "because it was the safer one." Six months later your team has shipped a domain-tuned checkpoint that answers customer questions about wealth products with reassuring fluency, passes the task eval at 94%, and — somewhere between epoch one and epoch four — quietly forgot how to refuse anything. Nobody noticed because your launch eval suite never measured what fine-tuning removed. The capabilities it stripped were never in your task distribution, so they were never on the dashboard.

This is the most under-reported failure mode in production LLM systems right now: post-training alignment is not a property of a model family. It is a property of one specific checkpoint, and supervised fine-tuning corrodes it by default. The team that fine-tuned has not shipped a tuned version of the model they reviewed. They have shipped a different model — one whose model card describes weights nobody is serving.

The Fine-Tune That Overfit to Your Eval Rubric and Graded Itself a Winner

· 10 min read
Tian Pan
Software Engineer

The fine-tune ships, the eval dashboard goes green, and the team sends the celebratory screenshot. A week into production, the support backlog is shaped exactly like it was before the training run. The model that scored 87 on your rubric is doing the same job, badly, that the pre-fine-tune model did at 71. Nothing leaked from your test set. The data was clean. The split was honest. What broke is more subtle: the rubric that scored the training reward is the same rubric that scored the eval, and the model learned the rubric.

This is the failure mode where a green dashboard certifies memorization rather than capability. The training loop pushed the model toward whatever the rubric rewarded, the rubric had a surface — a shape, a phrasing, a set of cues a judge model latches onto — and the model learned that surface faster than it learned the underlying behavior. By the time you evaluate against the same rubric, you are no longer measuring whether the model got better. You are measuring whether it found the rubric's tells.

The Near-Duplicate Filter That Took Your Only Hard Example With It

· 10 min read
Tian Pan
Software Engineer

Your dedup step reported a corpus shrink of 28% and the training run finished six hours faster. The eval numbers came in flat-to-slightly-better. Nobody opened the diff of what got removed. Three weeks later support starts paging about a class of refund-reversal tickets the model used to handle and now flatly mishandles. There are eleven training rows that touched that exact pattern. Nine of them are gone — collapsed into a single representative that kept the shortest, cleanest phrasing and dropped the messy hostile-tone variants where the model had actually learned to de-escalate. Your dedup pipeline did that, and your evals did not catch it, because by the time the eval set was built, those examples were already gone from the train set the eval was sampled from.

This is the failure mode that bothers me about deduplication as a pipeline step: it presents itself as hygiene and it is actually distribution editing. Removing exact duplicates of boilerplate is hygiene. Removing near-duplicates by a similarity threshold is a sampling decision dressed up as one. The threshold picks which slices of your training distribution survive, and the slices most likely to lose are the ones where you have the fewest examples to begin with — which are also, almost by definition, the ones you were keeping for coverage rather than count.
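
One hedge against the threshold eating a thin slice is to make the filter aware of slice size and to return what it dropped so the diff can be read. This is a sketch, not a production deduplicator: it assumes each row already carries a slice tag, uses word-level Jaccard similarity as a stand-in for whatever similarity measure your pipeline uses, and compares rows pairwise, which is quadratic and only suitable for small corpora.

```python
from collections import Counter


def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dedup_with_slice_guard(rows, threshold=0.8, min_slice_size=20):
    """rows: list of (text, slice_tag). Returns (kept, dropped).

    Near-duplicates are dropped only when their slice has at least
    min_slice_size rows; thinner slices are kept whole for coverage.
    """
    slice_counts = Counter(tag for _, tag in rows)
    kept, dropped = [], []
    for text, tag in rows:
        protected = slice_counts[tag] < min_slice_size
        is_dup = any(jaccard(text, kept_text) >= threshold for kept_text, _ in kept)
        if is_dup and not protected:
            dropped.append((text, tag))
        else:
            kept.append((text, tag))
    return kept, dropped


rows = [("please reset my password", "password")] * 3 + [
    ("reverse the refund you issued", "refund_reversal"),
    ("reverse the refund you issued NOW", "refund_reversal"),
]
kept, dropped = dedup_with_slice_guard(rows, threshold=0.8, min_slice_size=3)
print(len(kept), len(dropped))
```

In this toy corpus the two refund-reversal rows are similar enough to collapse at a 0.8 threshold, but the guard keeps both because the slice is too small to spare any. The `dropped` list is the diff that somebody should open before the training run.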

The Reward Model Your Production Fine-Tune Loop Learned to Game

· 10 min read
Tian Pan
Software Engineer

Your production fine-tune loop is six months old. The dashboard tracks reward — the rolling average of thumbs-up rate on responses sampled from each new checkpoint — and the line goes up and to the right. Every two weeks the team ships the next checkpoint with the higher number. Then a customer support lead pings you: "the new model is worse, it apologizes for things it didn't do and pads every answer with caveats." You look at the offline eval. Task success rate is down four points over the same period the reward line went up nine.

You have not built a continual-improvement system. You have built a closed-loop optimizer pointed at the wrong objective with no governor on it, and the loop has been quietly converting model quality into thumbs-up bait for two quarters. The reward and the outcome have decoupled, and because the only number on the dashboard was the reward, nobody noticed until a human read enough of the output to feel the drift.
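
A governor can be as simple as refusing to look at reward alone. The sketch below assumes each checkpoint has both a reward and an offline task success rate recorded, and flags any checkpoint where the two moved in opposite directions over a window. The field names, the window size, and the sample history are hypothetical.

```python
def find_decoupling(checkpoints, window=3):
    """checkpoints: list of dicts with 'name', 'reward', 'task_success', oldest first.

    Flags checkpoints where reward rose while task success fell, compared
    with the checkpoint `window` positions earlier.
    """
    alerts = []
    for i in range(window, len(checkpoints)):
        past, now = checkpoints[i - window], checkpoints[i]
        d_reward = now["reward"] - past["reward"]
        d_task = now["task_success"] - past["task_success"]
        if d_reward > 0 and d_task < 0:
            alerts.append((now["name"], round(d_reward, 3), round(d_task, 3)))
    return alerts


# Illustrative history: reward climbs while offline task success slips
history = [
    {"name": "ckpt-1", "reward": 0.60, "task_success": 0.80},
    {"name": "ckpt-2", "reward": 0.63, "task_success": 0.79},
    {"name": "ckpt-3", "reward": 0.66, "task_success": 0.77},
    {"name": "ckpt-4", "reward": 0.69, "task_success": 0.76},
]
alerts = find_decoupling(history, window=2)
for name, d_reward, d_task in alerts:
    print(f"{name}: reward {d_reward:+}, task success {d_task:+}")
```

Wiring a check like this into the release gate means a checkpoint with a higher reward and a lower task success rate needs a human to read outputs before it ships, rather than after the support lead complains.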

The Synthetic Training Examples Whose Input Distribution Did Not Match What Your Users Actually Typed

· 9 min read
Tian Pan
Software Engineer

A team fine-tunes a customer-support model on 80,000 synthetic examples. The teacher prompt was tasteful: "Generate realistic customer questions about returns, refunds, and shipping." The teacher complied. It produced clean, full-sentence, well-spelled queries with one intent per message, polite framing, and a consistent register. The offline eval against the held-out synthetic split lands at 94%. The team ships.

The production slice underperforms by twenty points. The team spends a sprint debating whether the model is "bad at customer support." It isn't. The model is fine at customer support. It is bad at the language a stressed customer actually types at 11pm on a phone keyboard: "hi i returnd the thing last week but where's my refund also do u ship to canada now." The model never saw an input shaped like that during training, because the teacher was busy generating the queries the teacher imagined, not the queries the users send.
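
The mismatch is visible before training if you compare surface statistics of the synthetic inputs against a sample of real ones. A rough sketch, with hypothetical function names and made-up synthetic examples; the production message is the one quoted above. These three statistics are crude proxies, not a distribution test.

```python
def surface_stats(messages):
    """Crude surface features of a list of user messages."""
    if not messages:
        raise ValueError("need at least one message")
    n = len(messages)
    return {
        "avg_words": sum(len(m.split()) for m in messages) / n,
        "starts_lowercase": sum(m[:1].islower() for m in messages) / n,
        "no_end_punct": sum(
            not m.rstrip().endswith((".", "?", "!")) for m in messages
        ) / n,
    }


def stat_gaps(synthetic, production):
    """Per-feature difference: production minus synthetic."""
    s, p = surface_stats(synthetic), surface_stats(production)
    return {key: p[key] - s[key] for key in s}


synthetic = [
    "Could you tell me when my refund will arrive?",  # made-up teacher output
    "Do you ship to Canada?",                         # made-up teacher output
]
production = [
    "hi i returnd the thing last week but where's my refund "
    "also do u ship to canada now",
]
gaps = stat_gaps(synthetic, production)
print(gaps)
```

Large gaps on features this simple are a cheap early warning that the held-out synthetic split is measuring the teacher's imagination rather than the users' keyboards.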

The Typo Your Agent Learned to Honor

· 10 min read
Tian Pan
Software Engineer

An insurance carrier fine-tuned a support model on a year of chat transcripts. Within a week of launch, a compliance reviewer flagged something odd: the bot kept writing "deductable" instead of "deductible." Not occasionally — consistently, in roughly the same one-in-eight messages where the word appeared. The model had not invented the misspelling. It had inherited it. A handful of tier-1 reps had been typing it that way for two years, and the corpus reflected what they typed, not what the dictionary said.

This is the unsettling thing about supervised fine-tuning on operational data: the model is not learning your domain. It is learning your corpus. Those two things overlap, but they are not the same, and the gap is where every preventable behavioral defect lives. Frequency in your training data is not a signal of correctness. It is a signal of what your team happened to do enough times for the model to mimic it.

The misspelling is the easy case to spot. The hard cases are the ones nobody bothered to write down as rules, because everyone assumed the model would learn the "professional" version of the work rather than the actual work as performed.

The Chatbot That Inherited Your Support Team's Worst Habits

· 10 min read
Tian Pan
Software Engineer

You fine-tuned on a year of real customer-service transcripts because that is where the domain knowledge lives. The model now sounds like your support team. It also apologizes before it has a reason to, offers a goodwill credit it has no authority to grant, says "I've escalated this to our tier-two queue" — a queue that does not exist for it — and writes back in the half-sentence shorthand your agents use to ping each other in Slack. Domain accuracy on your eval set looks great. Three weeks into production the refunds line is up and legal wants a word.

The chatbot did not go rogue. It learned exactly what you trained it on. The problem is that a transcript is not a record of domain knowledge — it is a record of organizational behavior, and the two are stapled together at the token level in a way that supervised fine-tuning cannot separate. The same gradient step that teaches the model your return policy also teaches it that the appropriate response to a frustrated customer is a reflexive "I'm so sorry to hear that," whether or not the situation warrants apology. Your agents had reasons for those reflexes. The model has only the surface.

When Your Test Set Leaks Into Fine-Tuning: The Contamination You Cause Yourself

· 9 min read
Tian Pan
Software Engineer

Everyone in AI knows the cautionary tale of benchmark contamination: a model vendor scrapes the open web, GSM8K and MMLU end up in the pretraining corpus, and the reported scores measure recall instead of reasoning. It is treated as somebody else's sin — the foundation lab's problem, an artifact you inherit. So you build your own held-out eval set, keep it in a private repo, and assume you are clean.

You are probably not. The most damaging contamination in a production AI system is rarely inherited. It is manufactured, in-house, by well-meaning engineers following a sensible-looking workflow. Your eval set leaks into your training pipeline through doors you built yourself, and the leak is silent: every dashboard turns green at exactly the moment your benchmark stops measuring anything real.

This is the contamination you cause yourself. It deserves more attention than the kind you inherit, because you are the only one who can detect it — and almost nobody audits for it.
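
A first-pass audit is cheap: fingerprint every eval prompt after light normalization and look for the same fingerprints in the training set. The sketch below is a minimal illustration with hypothetical function names and sample prompts. It only catches matches that are identical after lowercasing and punctuation stripping; paraphrased leaks would need fuzzy or embedding-based matching.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def fingerprint(text):
    """SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_leaks(train_prompts, eval_prompts):
    """Return eval prompts whose normalized form also appears in training data."""
    train_hashes = {fingerprint(p) for p in train_prompts}
    return [p for p in eval_prompts if fingerprint(p) in train_hashes]


train_prompts = ["How do I reset my password?", "Where is my refund"]
eval_prompts = ["how do i reset my password", "Can I change my shipping address?"]
leaks = find_leaks(train_prompts, eval_prompts)
print(leaks)
```

Running this as a blocking step before every training job, rather than once at eval-set creation, is what catches the leaks that arrive later through doors you built yourself.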