3 posts tagged with "ml-engineering"

Synthetic Seed Data: Bootstrapping Fine-Tuning Before Your First Thousand Users

· 9 min read
Tian Pan
Software Engineer

Fine-tuning a model is easy when you have data. The brutal part is the moment before your product exists: you need personalization to attract users, but you need users to have personalization data. Most teams either skip fine-tuning entirely ("we'll add it later") or spend weeks collecting labeled examples by hand. Neither works well. The first produces a generic model users immediately recognize as generic. The second is slow enough that by the time you have data, the task has evolved.

Synthetic seed data solves this — but only when you understand exactly where it breaks.
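The core move is small: take a handful of hand-written examples and have a capable teacher model expand them into a seed fine-tuning set, deduplicating near-identical generations before they pollute the data. Here is a minimal sketch of that loop; `teacher_generate` is a hypothetical stand-in for a real model call (in practice a hosted LLM endpoint), and the record shapes are illustrative assumptions, not the post's actual pipeline.

```python
import hashlib

# Hypothetical teacher call: in a real pipeline this would prompt a capable
# model for variations of the seed; stubbed here so the sketch is runnable.
def teacher_generate(seed_example: dict, n: int) -> list[dict]:
    return [
        {"prompt": f"{seed_example['prompt']} (variant {i})",
         "completion": seed_example["completion"]}
        for i in range(n)
    ]

def bootstrap_seed_set(hand_written: list[dict], per_seed: int = 3) -> list[dict]:
    """Expand a few hand-written examples into a seed fine-tuning set,
    deduplicating near-identical generations by normalized prompt text."""
    seen, out = set(), []
    for ex in hand_written:
        for cand in teacher_generate(ex, per_seed):
            key = hashlib.sha1(cand["prompt"].lower().strip().encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                out.append(cand)
    return out

seeds = [{"prompt": "Summarize this support ticket", "completion": "..."}]
dataset = bootstrap_seed_set(seeds, per_seed=5)
```

The deduplication step matters more than it looks: teacher models asked for variations tend to converge on a few phrasings, and a seed set full of near-duplicates teaches the student a narrower distribution than you intended.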

Staffing AI Engineering Teams: Who Owns What When Every Feature Has an AI Component

· 11 min read
Tian Pan
Software Engineer

Three years ago, "AI team" meant a group of specialists tucked into a corner of the org chart, mostly invisible to product engineers. Today, a senior software engineer at a fintech company ships a fraud-scoring feature using a fine-tuned model on Monday, wires up a RAG pipeline for customer support on Wednesday, and debugs LLM latency on Friday. The specialists didn't go away—but the boundary between "AI work" and "product engineering" dissolved faster than almost anyone planned for.

Most teams responded by bolting new titles onto existing job descriptions and calling it done. That's the wrong answer, and the dysfunction shows up quickly: unclear ownership, duplicated tooling, and an ML platform team that spends half its time explaining why product teams can't just call the OpenAI API directly.

This post is about getting the structure right—not in the abstract, but for the actual stages of AI adoption most engineering organizations go through.

Synthetic Data Pipelines for Domain-Specific LLM Fine-Tuning

· 9 min read
Tian Pan
Software Engineer

Your model fine-tuned on synthetic data scores 95% on your internal evals. Then you deploy it, and it confidently invents drug interactions that don't exist, cites legal precedents with wrong case numbers, and hallucinates API endpoints with plausible-sounding names. The model hasn't regressed on fluency — it's gotten worse in a way that fluency metrics completely miss. Researchers call this knowledge collapse: factual accuracy degrades while surface coherence stays intact. It's one of the more insidious failure modes in synthetic data training, and it happens most often when engineers build pipelines without accounting for it.

Synthetic data generation has become unavoidable for teams fine-tuning LLMs on specialized domains. Human annotation at scale is expensive, slow, and impossible for tasks that require expertise. Synthetic data generated by a capable teacher model can fill that gap cheaply. But the pipeline is not as simple as "prompt GPT-4 for examples, train your model." The details determine whether you get a specialized system that outperforms a general model on your domain, or a fluent but factually broken one.
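One concrete defense against knowledge collapse is to gate every teacher-generated example through a grounding check before it reaches training. The sketch below shows the idea for the hallucinated-API-endpoint case: reject any example whose completion cites an endpoint not present in a known ground-truth set. `KNOWN_ENDPOINTS`, the regex, and the example records are illustrative assumptions, not a prescription for any particular API.

```python
import re

# Illustrative ground truth: the endpoints that actually exist.
KNOWN_ENDPOINTS = {"/v1/users", "/v1/orders", "/v1/invoices"}

def cited_endpoints(text: str) -> set[str]:
    # Pull anything that looks like a versioned API path out of the completion.
    return set(re.findall(r"/v\d+/[a-z_]+", text))

def grounded(example: dict) -> bool:
    """Reject examples whose completions cite endpoints that don't exist --
    exactly the failure mode that fluency metrics miss."""
    return cited_endpoints(example["completion"]) <= KNOWN_ENDPOINTS

candidates = [
    {"prompt": "How do I list users?", "completion": "Call GET /v1/users."},
    {"prompt": "How do I refund?", "completion": "Call POST /v1/refunds."},  # invented endpoint
]
training_set = [ex for ex in candidates if grounded(ex)]
```

The same pattern generalizes: for drug interactions the ground truth is a clinical database, for legal citations a case-law index. The point is that the check runs on the data, before training, where a bad example costs one lookup instead of a production incident.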