
Synthetic Seed Data: Bootstrapping Fine-Tuning Before Your First Thousand Users

9 min read
Tian Pan
Software Engineer

Fine-tuning a model is easy when you have data. The brutal part is the moment before your product exists: you need personalization to attract users, but you need users to have personalization data. Most teams either skip fine-tuning entirely ("we'll add it later") or spend weeks collecting labeled examples by hand. Neither works well. The first produces a generic model users immediately recognize as generic. The second is slow enough that by the time you have data, the task has evolved.

Synthetic seed data solves this — but only when you understand exactly where it breaks.

The Cold-Start Problem Is Worse Than It Looks

The obvious problem is that you have zero training examples. The less obvious problem is that even ten or twenty examples you might scrape together from a demo, an internal pilot, or a stakeholder walkthrough will likely suffer from coverage collapse: they reflect the happy path, the use cases you already anticipated. Real user behavior is messier, more varied, and frequently bizarre in ways you will not predict.

When you fine-tune on a narrow seed set, the model learns to perform well on a narrow slice of the distribution. It handles the cases you demonstrated and fails on everything adjacent. You discover this six weeks after launch when users start complaining that the model "always gives the same kind of answer."

This is why the goal of synthetic seed data is not just to manufacture more examples of what you already have. It is to generate a diverse set of examples that covers the space you expect users to occupy — before you know what that space looks like.

Generating Your First 500–1,000 Examples

The foundational technique is few-shot prompting with explicit diversity pressure. Take ten to fifteen seed demonstrations you have hand-crafted or collected from early testers. Prompt a capable frontier model to generate new examples in the same format, but instruct it to vary systematically: different user intents, different phrasings, different levels of specificity, different error conditions.

The key phrase is "systematically vary." Without explicit diversity pressure, most models will cluster around the prototypical phrasing they associate with the task. You will end up with five hundred examples that are essentially the same example with minor word substitutions — and fine-tuning on those teaches the model to be confidently mediocre.

Practically, this means structuring your generation prompts around axes of variation, as in the sketch after this list:

  • Intent variation: For a customer-support task, generate examples that express frustration, confusion, urgency, and casual inquiry separately.
  • Complexity variation: Pair simple one-sentence queries with multi-clause requests that require the model to track several constraints simultaneously.
  • Domain sub-coverage: For a coding assistant, cover different languages, debugging scenarios, architecture questions, and documentation requests explicitly.
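
Here is a minimal sketch of that structure for a customer-support task. The axis names, the seed format (a list of example dicts), and the `call_llm` helper are assumptions for illustration, not a fixed recipe; substitute whatever completion client and schema you actually use.

```python
import json
import random

# Hypothetical axes of variation for a customer-support task.
AXES = {
    "intent": ["frustrated", "confused", "urgent", "casual inquiry"],
    "complexity": ["single-sentence question", "multi-constraint request"],
    "topic": ["billing", "account access", "feature question", "bug report"],
}

def build_generation_prompt(seeds, n_examples=5):
    """Assemble a few-shot prompt that pins each batch to a sampled
    combination of axis values, so the model cannot drift back to its
    prototypical phrasing."""
    combo = {axis: random.choice(values) for axis, values in AXES.items()}
    shots = "\n\n".join(json.dumps(s, indent=2) for s in random.sample(seeds, k=3))
    return (
        "You generate training examples for a support assistant.\n"
        f"Here are existing examples:\n\n{shots}\n\n"
        f"Generate {n_examples} NEW examples as a JSON list, in the same format, "
        f"with these constraints: intent is {combo['intent']}, "
        f"complexity is {combo['complexity']}, topic is {combo['topic']}. "
        "Do not reuse wording from the examples above."
    )

# `call_llm` is a stand-in for whatever completion API you use.
def generate_batch(seeds, call_llm, rounds=100):
    examples = []
    for _ in range(rounds):
        raw = call_llm(build_generation_prompt(seeds))
        try:
            examples.extend(json.loads(raw))
        except json.JSONDecodeError:
            continue  # drop malformed generations rather than repairing them
    return examples
```

The useful property is that diversity pressure lives in the prompt constraints, not in a post-hoc hope: every batch is forced onto a different point in the axis grid.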

The Self-Instruct methodology — which underpins models like Stanford's Alpaca — used 175 seed examples to generate 52,000 instruction-response pairs. The cost was under $500. The quality was uneven, but even a noisy dataset of 52,000 diverse examples significantly outperformed a model fine-tuned on 175 clean ones. Scale and diversity matter more than perfection at the generation stage. Perfection comes from filtering.

Diversity Sampling: Don't Measure What You Think You're Measuring

The naive quality metric is coverage: did your generated examples span the intended input space? The naive diversity metric is something like unique n-gram ratio or vocabulary size. Both measures will mislead you.

The problem is that LLMs systematically over-generate common, prototypical phrasing while under-generating edge cases and low-frequency patterns — the exact patterns that separate useful fine-tuning from noise. A dataset can have high n-gram diversity while still clustering around a narrow band of the semantic space.
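
To make the gap concrete, here is a rough sketch of the two measurements side by side. The `embed` function is a placeholder for whatever sentence-embedding model you have available; the only point is that the two numbers can disagree badly.

```python
from itertools import combinations
import numpy as np

def distinct_ngram_ratio(texts, n=2):
    """Surface-level diversity: unique n-grams divided by total n-grams."""
    total, unique = 0, set()
    for t in texts:
        tokens = t.lower().split()
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

def mean_pairwise_cosine(texts, embed):
    """Semantic clustering: average cosine similarity across all pairs.
    `embed` is assumed to map a string to a vector."""
    vecs = [np.asarray(embed(t), dtype=float) for t in texts]
    vecs = [v / np.linalg.norm(v) for v in vecs]
    sims = [float(a @ b) for a, b in combinations(vecs, 2)]
    return sum(sims) / len(sims)

# A dataset can score high on the first metric (lots of distinct wording)
# and high on the second (every example is semantically the same request),
# which is exactly the failure mode described above.
```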

Adaptive Coverage Sampling (ACS), developed in recent research, addresses this with a graph-based approach: build a similarity graph over your generated examples, then use greedy maximum coverage to select a subset. The counterintuitive finding is that targeting coverage below 1.0 — typically 0.7–0.9 — produces better-performing models than using all generated examples. The reason is that full coverage forces you to include low-quality outliers alongside the good examples. Selective coverage lets you discard the noise.
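
A rough sketch of the selection step, assuming you already have a pairwise similarity matrix (for example, embedding cosine similarities). This is a plain greedy maximum-coverage pass with a coverage target, not the published ACS implementation, and the threshold and target values are illustrative.

```python
import numpy as np

def greedy_coverage_select(sim_matrix, sim_threshold=0.8, target_coverage=0.8):
    """Greedily pick examples until `target_coverage` of the pool is covered,
    where an example is covered if it is within `sim_threshold` of at least
    one selected example. Returns indices of the selected subset.
    """
    n = sim_matrix.shape[0]
    covers = sim_matrix >= sim_threshold          # covers[i, j]: i covers j
    covered = np.zeros(n, dtype=bool)
    selected = []
    while covered.mean() < target_coverage:
        gains = (covers & ~covered).sum(axis=1)   # new items each candidate would cover
        best = int(np.argmax(gains))
        if gains[best] == 0:                      # nothing left to gain
            break
        selected.append(best)
        covered |= covers[best]
    return selected
```

Stopping at 0.7–0.9 coverage is what lets the loop leave isolated, low-quality outliers behind instead of spending selections on them.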

A simpler heuristic that works in practice: use ROUGE-L similarity to deduplicate. Discard any generated example where the highest ROUGE-L score against existing examples exceeds 0.7. This removes near-duplicates while keeping semantic variation. It is not sophisticated, but it is fast, deterministic, and catches the worst homogenization failures.
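
A self-contained version of that filter is sketched below. It computes ROUGE-L as an LCS-based F1 over whitespace tokens rather than pulling in a ROUGE library, which is an approximation but close enough for deduplication.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l(candidate, reference):
    """Approximate ROUGE-L F1 over whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

def dedupe(generated, kept=None, threshold=0.7):
    """Keep a generated example only if its best ROUGE-L score against
    everything already kept stays below the threshold."""
    kept = list(kept or [])
    for ex in generated:
        if all(rouge_l(ex, prev) < threshold for prev in kept):
            kept.append(ex)
    return kept
```

Seed the `kept` list with your hand-written demonstrations so that generations which merely paraphrase the seeds are discarded along with near-duplicates of each other.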

The Three-Layer Quality Filter

Raw generation output is not training data. It requires filtering before you fine-tune on it.
