
27 posts tagged with "fine-tuning"


Your Fine-Tuning Corpus Is a GDPR Data Artifact, Not Just an ML Asset

· 11 min read
Tian Pan
Software Engineer

The moment your first fine-tune lands in production, your weights become a new kind of record your privacy program has never cataloged. A customer support transcript that made it into your training mix is no longer just a row in a database you can DELETE — it is now encoded, redundantly and inseparably, into the parameters your API serves. The original record can be scrubbed from S3, erased from your warehouse, and removed from your RAG index, while the model continues to complete prompts with fragments of that customer's name, account ID, or medical history. The Data Processing Agreement your sales team signed promised you'd honor erasure requests. Nobody asked the ML team whether that was technically possible.

Research on PII extraction shows this is not hypothetical. The PII-Scope benchmark reports that adversarial extraction rates can increase up to fivefold against pretrained models under realistic query budgets, and membership inference attacks using self-prompt calibration have pushed AUC from 0.7 to 0.9 on fine-tuned models. Llama 3.2 1B, a small and widely copied base, has been demonstrated to memorize sensitive records present in its training set. The takeaway for anyone shipping fine-tunes on production traces is blunt: you cannot assume your weights forgot.

This matters because most fine-tuning pipelines were designed by ML engineers optimizing for loss, not by data stewards optimizing for Article 17. The result is an artifact whose legal status is ambiguous, whose lineage is rarely documented, and whose "delete user X" workflow doesn't exist.
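A minimal sketch of the missing piece, assuming you assign per-subject record IDs at ingestion time (the names here are illustrative, not a prescribed schema): a lineage manifest that can at least answer which fine-tunes an erasure request touches.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FinetuneLineage:
    """Maps a fine-tune artifact back to the records in its training mix."""
    model_version: str        # e.g. "support-router-v3"
    base_model: str           # e.g. "llama-3.2-1b"
    dataset_snapshot: str     # content hash of the frozen training mix
    record_ids: set[str] = field(default_factory=set)  # per-subject source record IDs
    trained_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def affected_models(lineages: list[FinetuneLineage],
                    subject_record_ids: set[str]) -> list[str]:
    """Given the record IDs tied to an erasure request, return every
    fine-tune version whose training mix contained any of them."""
    return [l.model_version for l in lineages if l.record_ids & subject_record_ids]
```

This doesn't make erasure possible on its own; it makes the blast radius of a request computable, which is the precondition for any retrain-or-retire policy.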

The Orphan Adapter Problem: When Your Fine-Tune Outlives Its Base Model

· 12 min read
Tian Pan
Software Engineer

A senior engineer left six months ago. She owned the classifier adapter that routes customer support tickets — a rank-32 LoRA trained on 847 hand-labeled examples, pinned to a base model that hits end-of-life in 43 days. Nobody remembers why those 847 examples were chosen over the 2,000 they started with. The training data sits in an S3 bucket whose lifecycle policy purges objects older than one year. Her laptop was wiped. The fine-tuning notebook has a cell that calls a preprocessing function she imported from her personal dotfiles repo, now private.

This is the orphan adapter — a fine-tune that outlived its maintainers, outlived its data, and is about to outlive the base model it was trained on. It sits in your production stack, routing real user traffic, and nobody left on the team can rebuild it. The deprecation email didn't create this crisis. It just exposed it.
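The fix is boring and procedural: provenance has to be written at training time, next to the weights. A minimal sketch, assuming the adapter directory is the unit of deployment; function and field names are illustrative:

```python
import hashlib
import json
import subprocess
from pathlib import Path

def write_adapter_manifest(adapter_dir: str, base_model: str, dataset_path: str) -> None:
    """Pin everything needed to rebuild the adapter next to its weights."""
    data = Path(dataset_path).read_bytes()
    manifest = {
        "base_model": base_model,                            # exact base, not "latest"
        "dataset_sha256": hashlib.sha256(data).hexdigest(),  # frozen training set hash
        "git_commit": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip(),  # training code version
        "owner": "team-support-ml",                          # a team, not a person
    }
    Path(adapter_dir, "MANIFEST.json").write_text(json.dumps(manifest, indent=2))
```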

The Synthetic Preference Trap: How AI-Ranked RLHF Quietly Drifts Your Model Into the Teacher's Voice

· 12 min read
Tian Pan
Software Engineer

The first sign is almost always the same: your internal eval dashboard is green, reward-model scores are climbing, DPO loss is trending in the right direction — and a customer on a Zoom call shrugs and says "it sounds like ChatGPT now." No one on the training team wants to hear that. The evals say the model is better. The annotators who shipped the last batch of preferences say the model is better. But the user is telling you the truth, and the dashboard is lying. What broke is not any single label. What broke is that your preference data is no longer yours.

This is the synthetic preference trap. Label budgets get squeezed, someone proposes using a stronger model to rank a second model's completions, the experiment ships, and for a while it looks like a free lunch. The student model learns to sound more like the teacher on every turn, and because your reward model was trained on data the teacher also influenced, your reward model cheerfully agrees. The user sees a product that reads exactly like every other product built on top of the same frontier API. The differentiation you thought you were buying with fine-tuning has been quietly distilled away.
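You can catch this drift before a customer does by tracking one extra number per checkpoint. A sketch, assuming a fixed held-out prompt set and any off-the-shelf sentence encoder (the encoder name below is an arbitrary choice):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any fixed sentence encoder works

def teacher_similarity(student_outputs: list[str], teacher_outputs: list[str]) -> float:
    """Mean cosine similarity between student and teacher completions on the
    same held-out prompts. Track this per checkpoint: a steady climb alongside
    flat task metrics is the distillation-drift signature."""
    s = encoder.encode(student_outputs, normalize_embeddings=True)
    t = encoder.encode(teacher_outputs, normalize_embeddings=True)
    return float(np.mean(np.sum(s * t, axis=1)))
```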

Knowledge Distillation for Production: Teaching Small Models to Do Big Model Tasks

· 9 min read
Tian Pan
Software Engineer

A healthcare company ran GPT-4 on 10,000 documents per day. Annual bill: $50,000. After fine-tuning a 27B open-source model on frontier outputs, the same workload cost $5,000 — a 90% reduction. The smaller model also outperformed the frontier model by 60% on their specific task, because it had been shown thousands of examples of exactly the right behavior.

This is knowledge distillation in its modern form: you pay the frontier model API costs once to generate training data, then run a small specialized model forever. The math works because inference is cheap when you own the weights, and task-specific models beat general-purpose models on narrow tasks given enough examples.
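As a back-of-envelope check on that math (the one-time distillation cost below is an assumption for illustration; the other numbers come from the anecdote above):

```python
# Break-even arithmetic for the distillation trade.
frontier_annual = 50_000        # frontier API spend per year (from the example)
distilled_annual = 5_000        # self-hosted 27B inference per year (from the example)
one_time_distillation = 8_000   # assumed: synthetic data generation + fine-tuning compute

savings_per_year = frontier_annual - distilled_annual             # 45,000
breakeven_months = 12 * one_time_distillation / savings_per_year  # ~2.1 months
```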

But "collect outputs, fine-tune, ship" is not a complete recipe. Most teams that attempt distillation hit one of three invisible walls: bad synthetic data that teaches the student wrong behaviors, no reliable signal for when the student is actually ready, or silent quality collapse in production that doesn't surface until users complain. This post covers the pipeline decisions that determine whether distillation works.

The Latent Capability Ceiling: When a Bigger Model Won't Fix Your Problem

· 10 min read
Tian Pan
Software Engineer

There is a pattern that plays out on almost every AI project that runs long enough. The team builds a prototype, the demo looks good, but in production the outputs aren't consistent enough. Someone suggests switching to the latest frontier model — GPT-4o instead of GPT-3.5, Claude Opus instead of Sonnet, Gemini Ultra instead of Pro. Sometimes it helps. Eventually it stops helping. The team ends up paying 5–10x more per inference, latency has doubled, and task accuracy is still 78% instead of the 90% they need.

This is the latent capability ceiling: the point at which the raw scale of the language model you're using is no longer the limiting factor. It's a real phenomenon backed by empirical data, and most teams hit it without recognizing it — because the reflex to "use a bigger model" is cheap, fast, and often works early in a project.

LoRA Adapter Composition in Production: Running Multiple Fine-Tuned Skills Without Model Wars

· 9 min read
Tian Pan
Software Engineer

The promise sounds clean: fine-tune lightweight LoRA adapters for each specialized skill — one for professional tone, one for JSON formatting, one for medical terminology, one for safety guardrails — then combine them at serving time. Teams ship this design, it works fine in development, and then it falls apart in production when two adapters start fighting over the same weight regions and the output quality collapses to something indistinguishable from the untrained base model. Not slightly worse. Completely untuned.

This post is about what happens when you compose adapters in practice, why naive merging fails so reliably, and what strategies actually work at production scale.
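For a taste of the non-naive path: the peft library ships merge strategies designed for exactly this conflict. A sketch, assuming two independently trained LoRA adapters (paths and weights are illustrative, and TIES hyperparameters need tuning per adapter pair):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load the base model and attach two independently trained adapters.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
model = PeftModel.from_pretrained(base, "adapters/tone", adapter_name="tone")
model.load_adapter("adapters/json_format", adapter_name="json")

# TIES-style merging trims low-magnitude deltas and resolves sign conflicts
# between adapters, instead of letting them silently cancel each other out.
model.add_weighted_adapter(
    adapters=["tone", "json"],
    weights=[1.0, 1.0],
    adapter_name="tone_json",
    combination_type="ties",
    density=0.5,  # fraction of each adapter's weights kept before merging
)
model.set_adapter("tone_json")
```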

Synthetic Seed Data: Bootstrapping Fine-Tuning Before Your First Thousand Users

· 9 min read
Tian Pan
Software Engineer

Fine-tuning a model is easy when you have data. The brutal part is the moment before your product exists: you need personalization to attract users, but you need users to have personalization data. Most teams either skip fine-tuning entirely ("we'll add it later") or spend weeks collecting labeled examples by hand. Neither works well. The first produces a generic model users immediately recognize as generic. The second is slow enough that by the time you have data, the task has evolved.

Synthetic seed data solves this — but only when you understand exactly where it breaks.
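To make the idea concrete, here is a minimal sketch of persona-conditioned seed generation, the cheapest defense against synthetic data that all sounds like one median user. The model choice and prompt are illustrative, not a recipe:

```python
from openai import OpenAI

client = OpenAI()
PERSONAS = ["new user on mobile", "power user migrating data", "admin auditing access"]

def seed_examples(task: str, n_per_persona: int = 5) -> list[str]:
    """Generate seed training inputs, forcing variety via explicit personas."""
    examples: list[str] = []
    for persona in PERSONAS:
        resp = client.chat.completions.create(
            model="gpt-4o",  # any strong generator; an illustrative choice
            messages=[{
                "role": "user",
                "content": (
                    f"Write {n_per_persona} distinct user requests for: {task}. "
                    f"Voice: {persona}. One request per line, no numbering."
                ),
            }],
        )
        lines = resp.choices[0].message.content.splitlines()
        examples += [line.strip() for line in lines if line.strip()]
    return examples
```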

Preference Data on a Budget: Capturing RLHF Signal Without a Research Team

· 11 min read
Tian Pan
Software Engineer

Most teams that try to fine-tune a language model with RLHF give up before they start. The canonical story involves OpenAI's InstructGPT: 33,000 preference pairs, 13,000 supervised demonstrations, a team of specialized contractors, and a reinforcement learning pipeline that takes weeks to stabilize. If that's the bar, most product teams aren't playing this game.

They're wrong. The bar is not that high anymore. The research consensus in 2024–2025 has quietly shifted: data quality beats data volume, DPO eliminates the RL infrastructure entirely, and the most valuable preference signal is already flowing through your product unlogged. What looks like a research-team problem is actually an instrumentation problem.
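One concrete example of that instrumentation: a regenerate click is a free pairwise comparison, since the user saw two completions for the same prompt and implicitly ranked them. A minimal sketch of the logging shape (names are illustrative), which maps directly onto the prompt/chosen/rejected format DPO expects:

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the completion the user kept
    rejected: str  # the completion the user moved away from

def pair_from_regeneration(prompt: str, first: str, regenerated: str,
                           user_kept_regeneration: bool) -> PreferencePair:
    """Turn a regenerate event into a preference pair for DPO training."""
    if user_kept_regeneration:
        return PreferencePair(prompt, chosen=regenerated, rejected=first)
    return PreferencePair(prompt, chosen=first, rejected=regenerated)
```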

Why Your AI Model Is Always 6 Months Behind: Closing the Feedback Loop

· 10 min read
Tian Pan
Software Engineer

Your model was trained on data from last year. It was evaluated internally two months ago. It shipped a month after that. By the time a user hits a failure and you learn about it, you're already six months behind the world your model needs to operate in. This gap is not a deployment problem — it's a feedback loop problem. And most teams aren't measuring it, let alone closing it.

The instinct when a model underperforms is to blame the model architecture or the training data. But the deeper issue is usually the latency of your feedback system. How long does it take from the moment a user experiences a failure to the moment that failure influences your model? Most teams, if they're honest, have no idea. Industry analysis suggests that models left without targeted updates for six months or more see error rates climb 35% on new distributions. The cause isn't decay in the model — it's the world moving while the model stays still.
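If you want to start measuring, the metric is simple to define, assuming your logs can join a failure event to the training run that first included it (field names below are illustrative):

```python
from datetime import datetime
from statistics import median

def feedback_loop_days(failures: list[dict]) -> float:
    """Median days from a user-visible failure to the first model version
    whose training data included it. Unjoined failures mean the loop is open."""
    deltas = [
        (datetime.fromisoformat(f["included_in_training_at"])
         - datetime.fromisoformat(f["observed_at"])).days
        for f in failures
        if f.get("included_in_training_at")
    ]
    return median(deltas) if deltas else float("inf")
```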

LLM-as-Annotator Quality Control: When the Labeler and Student Share Training Data

· 10 min read
Tian Pan
Software Engineer

The pipeline looks sensible on paper: you have a target task, no human-labeled examples, and a capable large model available. So you use that model to generate labels, then fine-tune a smaller model on those labels. Ship it, repeat.

The problem nobody talks about enough is what happens when your annotator model and your target model trained on the same internet. Which, increasingly, they all have.
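The cheapest guardrail doesn't require solving contamination, only detecting when it bites: audit the LLM annotator against a small human-labeled set, because annotator/student agreement alone proves nothing when both models share pretraining data. A sketch:

```python
def audit_annotator(llm_labels: dict[str, str],
                    human_labels: dict[str, str]) -> float:
    """Agreement of the LLM annotator with a human-labeled audit set.
    This, not annotator/student agreement, is the number to watch."""
    keys = llm_labels.keys() & human_labels.keys()
    return sum(llm_labels[k] == human_labels[k] for k in keys) / len(keys)
```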

Post-Training Alignment for Product Engineers: What RLHF, DPO, and RLAIF Actually Mean for You

· 11 min read
Tian Pan
Software Engineer

Most teams building AI features assume that once they ship, user feedback becomes a resource they can tap later. Log the thumbs-up and thumbs-down signals, accumulate enough volume, and eventually fine-tune. The reality is more treacherous: a year of logged reactions is not the same as a year of alignment-quality training data. The gap between the two is where alignment techniques — RLHF, DPO, RLAIF — either save you or surprise you.

This post is not a survey of alignment research. It's a decision guide for engineers who need to understand what these techniques require from a data-collection perspective, so that what you instrument today actually enables the fine-tuning you're planning for six months from now.
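A preview of the single highest-leverage takeaway: a thumbs signal is only alignment-usable if the full generation context is captured at click time. A sketch of the minimum viable log row (field names are illustrative):

```python
import json
import time

def log_feedback(store: list[str], *, prompt: str, completion: str, rating: str,
                 model_version: str, sampling: dict) -> None:
    """Capture everything needed to reconstruct a preference pair later.
    A bare (completion_id, rating) row cannot be upgraded after the fact."""
    store.append(json.dumps({
        "ts": time.time(),
        "prompt": prompt,              # the exact rendered prompt, not a template ID
        "completion": completion,
        "rating": rating,              # "up" | "down"
        "model_version": model_version,
        "sampling": sampling,          # temperature, top_p, etc.
    }))
```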

The Pretraining Shadow: The Hidden Constraint Your Fine-Tuning Plan Ignores

· 9 min read
Tian Pan
Software Engineer

Your team spent three sprints labeling 50,000 domain-specific examples. You ran LoRA fine-tuning on a frontier model. The eval numbers improved. Then a colleague changed the phrasing of a prompt slightly, and the model reverted to the behavior you thought you'd suppressed. That's not a dataset problem. That's the pretraining shadow.

The core insight that practitioners keep rediscovering: fine-tuning teaches a model how to talk in a new context, but it cannot rewrite what the model fundamentally knows or is inclined to do. The behaviors, biases, and factual priors encoded during pretraining are a gravitational field that fine-tuning orbits but rarely escapes.
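One practical consequence: an eval that tests a suppressed behavior under a single canonical phrasing will miss the shadow entirely. A sketch of the paraphrase probe implied by the anecdote above (both callables are placeholders for your own generation and checking logic):

```python
def shadow_leak_rate(generate, is_suppressed, prompt_variants: list[str]) -> float:
    """Probe a behavior fine-tuning was supposed to remove under many
    paraphrases of the same intent. A model that passes the canonical
    phrasing but fails its paraphrases is shadowed, not retrained."""
    leaks = sum(not is_suppressed(generate(p)) for p in prompt_variants)
    return leaks / len(prompt_variants)  # fraction of phrasings that leak
```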