
The Synthetic Training Examples Whose Input Distribution Did Not Match What Your Users Actually Typed

Tian Pan
Software Engineer

A team fine-tunes a customer-support model on 80,000 synthetic examples. The teacher prompt was tasteful: "Generate realistic customer questions about returns, refunds, and shipping." The teacher complied. It produced clean, full-sentence, well-spelled queries with one intent per message, polite framing, and a consistent register. The offline eval against the held-out synthetic split lands at 94%. The team ships.

The production slice underperforms by twenty points. The team spends a sprint debating whether the model is "bad at customer support." It isn't. The model is fine at customer support. It is bad at the language a stressed customer actually types at 11pm on a phone keyboard: "hi i returnd the thing last week but where's my refund also do u ship to canada now." The model never saw an input shaped like that during training, because the teacher was busy generating the queries the teacher imagined, not the queries the users send.

This is one of the most common silent failure modes in 2026 fine-tuning workflows. The team iterates obsessively on output quality — better reasoning chains, more topic coverage, richer policy adherence — while leaving the input distribution exactly where the teacher's first sample landed it. The benchmark goes up. The customer experience goes down. The two numbers are measuring different populations.

Synthetic Data Is Two Distributions, Not One

The mental model most teams carry into a fine-tune is that the dataset is a thing you grade by sampling rows and asking "does the answer look good." That grades the output distribution. It says nothing about the input distribution. And the input distribution is where the production gap lives.

A synthetic dataset has two distributions baked in:

  • The inputs: what queries does the model see during training? What are their lengths, intent counts, punctuation patterns, typo rates, code-switching behaviors, emoji density, fragment structure?
  • The outputs: given an input, what reasoning and response does the model produce?

When you fine-tune, the model learns a mapping from the input space it observed to the output space it observed. If the input space during training is "clean, full-sentence, well-spelled, single-intent, well-punctuated requests in the language the teacher was asked to write," then the model has a sharp, confident mapping over that subspace and an undefined mapping over everything outside it. Production inputs land outside it constantly. The model improvises, and in a fine-tuned model improvisation often looks like falling back to what the base model could already do before training.

Industry write-ups frame this as "distribution drift" or "covariate shift." Formally it is a case of covariate shift, but both labels suggest something that changed over time, and that understates the issue. Nothing drifted between training and production. The training distribution was never the production distribution to begin with. The teacher was given a topic and a task, not a measured shape, and the teacher filled in shape from its own priors.

How the Teacher's Priors Sneak In

Every teacher model has a writing style. If you ask GPT-5 or Claude Opus to produce a thousand customer-support queries, you do not get a sample from the empirical distribution of customer-support queries. You get a sample from the teacher's distribution of "what a customer-support query looks like according to my pretraining and instruction tuning."

That distribution is recognizably teacherly. The recent literature on single-teacher bias documents how a single generator imprints stylistic invariants across the entire synthetic corpus: verbosity if the teacher is verbose, hedging if the teacher hedges, formal register if the teacher leans formal. A 2026 finding on teacher-student divergence frames this as a stylistic gap between the teacher's generated tokens and the distribution the student needs to model — a gap large enough that some training recipes now interleave teacher and student generations to bridge it.

The same effect shows up at the input side and is harder to notice. The team reviews a hundred sampled inputs, decides "these look realistic," and ships. "Realistic" was graded against the reviewer's intuition of a customer query, not against a measured fingerprint of real traffic. Both the teacher and the reviewer share the same prior. The fine-tune learns that prior. Production users do not match that prior.

A useful diagnostic: pull a thousand real production queries and a thousand synthetic ones, then compute trivial summary statistics for each set: mean tokens per message, fraction with at least one typo, fraction with multiple intents per message, fraction with emoji, fraction below 20 characters, code-switching rate. If the two sets differ meaningfully on any axis, the model is going to feel that difference at inference.
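A minimal sketch of that diagnostic, using only the standard library. The function names, the whitespace word count standing in for tokens, and the narrow emoji range are all assumptions made for illustration; typo rate, intent count, and code-switching need a spell-checker or classifier and are left out here.

```python
import re
from statistics import mean

# Assumption: a rough emoji matcher covering common blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def fingerprint(queries):
    """Return crude per-axis summary statistics for a list of query strings."""
    n = len(queries)
    return {
        # Whitespace-separated words as a cheap proxy for tokens.
        "mean_words": mean(len(q.split()) for q in queries),
        "frac_under_20_chars": sum(len(q) < 20 for q in queries) / n,
        "frac_all_lowercase": sum(q == q.lower() for q in queries) / n,
        "frac_no_end_punct": sum(
            not q.rstrip().endswith((".", "?", "!")) for q in queries
        ) / n,
        "frac_with_emoji": sum(bool(EMOJI_RE.search(q)) for q in queries) / n,
    }

def compare(prod, synth):
    """Per axis: (production value, synthetic value, absolute gap)."""
    fp, fs = fingerprint(prod), fingerprint(synth)
    return {axis: (fp[axis], fs[axis], abs(fp[axis] - fs[axis])) for axis in fp}

# Tiny hand-written samples; in practice these would be ~1,000 rows each.
production = [
    "hi i returnd the thing last week but where's my refund also do u ship to canada now",
    "refund??",
    "wheres my order",
]
synthetic = [
    "Could you please tell me when my refund will be processed?",
    "I would like to know whether you ship to Canada.",
    "How do I return an item that I purchased last week?",
]

for axis, (p, s, gap) in compare(production, synthetic).items():
    print(f"{axis:22s} prod={p:.2f} synth={s:.2f} gap={gap:.2f}")
```

Even on six rows the casing and end-punctuation axes separate completely, which is the kind of gap a reviewer skimming "realistic-looking" samples tends to miss.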

The Eval Numbers That Confirm the Wrong Thing

The synthetic-test-split eval is the easiest number to game and the easiest number to fool yourself with. You held out 10% of the synthetic dataset, ran the fine-tune against it, got 94%, and called it a lift. What you actually measured is the model's ability to generalize within the input distribution the teacher already generated.

That is not the question you needed answered. The question was "does the fine-tune do better on production inputs?" The synthetic test split cannot answer that, because it shares the teacher's input prior. A high synthetic-eval score paired with a stagnant production metric is the canonical signature of input-distribution mismatch, and it is often misread as "the offline-online gap" — a phrase that has come to absorb every methodological failure that does not have a better name.

Two specific patterns make this worse in 2026 stacks:

  • Templated eval prompts. Recent evaluation research shows that swapping a single separator character between few-shot examples can move benchmark accuracy enough to reorder a leaderboard. If your eval prompt is templated tightly and your production prompts vary by user, you are measuring something narrower than you think even before the input-distribution issue is considered.
  • Teacher-graded eval. If a teacher generates the data and a sibling model grades the eval, both the generator and the judge share the same prior. The judge confidently certifies that the student matches the teacher. Neither knows what the user looks like.

The fix is not "a better judge." The fix is to measure offline performance on a slice that is shaped like production.
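One way to make that concrete is to score the same model on two slices and report the gap alongside the headline number. The sketch below assumes an intent-classification framing with exact-match labels; `toy_model` is a hypothetical stand-in for the fine-tuned model, and the example rows are invented.

```python
def accuracy(model, examples):
    """Fraction of (query, expected_label) pairs the model labels correctly."""
    return sum(model(query) == label for query, label in examples) / len(examples)

def eval_report(model, clean_set, production_set):
    """Score one model on a teacher-style slice and a production-shaped slice."""
    clean = accuracy(model, clean_set)
    prod = accuracy(model, production_set)
    return {"clean": clean, "production": prod, "gap": clean - prod}

# Hypothetical stand-in for a fine-tuned model: it only recognizes
# well-spelled keywords, mimicking a model trained on clean inputs.
def toy_model(query):
    q = query.lower()
    if "refund" in q:
        return "refund"
    if "ship" in q:
        return "shipping"
    return "other"

clean_set = [
    ("When will my refund arrive?", "refund"),
    ("Do you ship to Canada?", "shipping"),
]
production_set = [
    ("wheres my refnd", "refund"),
    ("do u ship 2 canada", "shipping"),
    ("money back when??", "refund"),
    ("refund pls", "refund"),
]

report = eval_report(toy_model, clean_set, production_set)
print(report)
```

A large positive `gap` points at the input distribution rather than at the judge or the task.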

What a Production-Shaped Fine-Tune Pipeline Looks Like

The teams that close the eval-to-production gap on fine-tunes do roughly the same handful of things. They all amount to forcing the input distribution to be a property the pipeline measures and conditions on, not a property the teacher inherits from its prior.

Five practices show up consistently:

  • Compute a production-input fingerprint. Sample 5,000 to 20,000 real user inputs and compute the per-axis distribution: length, typo rate, intent count, code-switching rate, abbreviation density, emoji prevalence, fragment ratio, punctuation pattern. This fingerprint is a deliverable, versioned next to the dataset, and refreshed on a cadence because user behavior drifts.
  • Seed generation from real production examples, not from topic lists. The "150 to 200 human-written examples covering the task types and edge cases" pattern that recurs in 2026 synthetic-data guides only works if those seeds are sampled from production traffic, not authored by the data team in a meeting. The teacher generalizes from the seed; the seed therefore has to look like the user.
  • Run an adversarial messiness pass on synthetic inputs. After the teacher generates clean inputs, a second pass injects production-grade noise to match the fingerprint: dropped punctuation, lowercase normalization, common typos, multi-intent compounding, abbreviations, and platform-specific artifacts (SMS truncation, voice-to-text errors, mobile autocorrect quirks). The goal is not to corrupt the data — it is to bring its shape into alignment with the inputs the model will actually see.
  • Build the eval set from production replays. The offline eval has to be sampled from real traffic with labels generated post-hoc, not held out from the synthetic set. Two eval sets are useful: a "clean" set for the teacher-style upper bound, and a "production" set for the number you actually report to leadership. The production number is the one that predicts the rollout.
  • Mix teachers and condition on style. Single-teacher bias is mitigated by using two or three teachers at different capability levels, and by explicitly conditioning the generation prompt on style attributes from the fingerprint ("write this query as a user who types in fragments and skips punctuation"). The teacher then has to generate to the style, not from its default.
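The messiness pass in the third practice can be sketched as a small, seeded transformation over clean teacher output. Everything here is illustrative: the abbreviation table, the adjacent-letter swap as a typo model, and the default probabilities are assumptions. In a real pipeline the probabilities would be set from the production fingerprint, and the substitutions mined from logs.

```python
import random
import re

# Illustrative substitutions only; a real table would come from production logs.
ABBREVIATIONS = {"you": "u", "please": "pls", "are": "r", "to": "2"}

def swap_adjacent(word, rng):
    """Simulate a keyboard typo by swapping two adjacent interior letters."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def messify(text, rng, p_lower=0.8, p_strip_punct=0.7, p_abbrev=0.5, p_typo=0.1):
    """Apply production-style noise to one clean synthetic input."""
    if rng.random() < p_lower:
        text = text.lower()
    if rng.random() < p_strip_punct:
        text = re.sub(r"[.,!?;:]", "", text)
    words = []
    for w in text.split():
        if w.lower() in ABBREVIATIONS and rng.random() < p_abbrev:
            w = ABBREVIATIONS[w.lower()]
        elif w.isalpha() and rng.random() < p_typo:
            w = swap_adjacent(w, rng)
        words.append(w)
    return " ".join(words)

rng = random.Random(7)  # seeded so a dataset build is reproducible
clean = "Could you please tell me whether you ship to Canada?"
for _ in range(3):
    print(messify(clean, rng))
```

Passing an explicit `random.Random` instance keeps the pass reproducible, so the noised dataset can be versioned next to the fingerprint it was tuned against.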

None of these are expensive. The fingerprint is a one-day notebook. The adversarial pass is a small script. The production eval set is the cheapest data your team will ever assemble, because it already exists in your logs. What is expensive is shipping a fine-tune three times before noticing the gap, and the rework cycle that follows.

The Architectural Realization

A synthetic dataset is not a corpus. It is a generator output conditioned on a prompt, and that prompt is the most important hyperparameter in your fine-tune. If the prompt says "generate realistic customer questions," the teacher generates the inputs the teacher believes are realistic. If the prompt says "generate customer questions that match this fingerprint of length, typo rate, intent count, and register," the teacher generates inputs that look like the users.
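As a rough illustration of the second kind of prompt, the sketch below samples style attributes per example from a fingerprint and renders them as explicit instructions, so the corpus matches the measured rates in aggregate. The fingerprint keys (`std_words`, `frac_multi_intent`, and so on), the numbers, and the prompt wording are all hypothetical.

```python
import random

def sample_style(fp, rng):
    """Draw one example's style attributes from fingerprint rates."""
    return {
        "lowercase": rng.random() < fp["frac_all_lowercase"],
        "no_end_punct": rng.random() < fp["frac_no_end_punct"],
        "multi_intent": rng.random() < fp["frac_multi_intent"],
        # Length drawn around the measured mean; never below one word.
        "target_words": max(1, round(rng.gauss(fp["mean_words"], fp["std_words"]))),
    }

def render_prompt(topic, style):
    """Turn sampled attributes into explicit instructions for the teacher."""
    rules = [f"Use about {style['target_words']} words."]
    if style["lowercase"]:
        rules.append("Type entirely in lowercase.")
    if style["no_end_punct"]:
        rules.append("Do not end with punctuation.")
    if style["multi_intent"]:
        rules.append("Ask about two unrelated things in the same message.")
    else:
        rules.append("Ask about exactly one thing.")
    header = f"Write one customer message about {topic}."
    return header + "\n" + "\n".join(f"- {r}" for r in rules)

# Hypothetical fingerprint values, standing in for measured production rates.
fingerprint = {
    "mean_words": 9.0,
    "std_words": 5.0,
    "frac_all_lowercase": 0.7,
    "frac_no_end_punct": 0.6,
    "frac_multi_intent": 0.25,
}

rng = random.Random(42)
print(render_prompt("refunds", sample_style(fingerprint, rng)))
```

Sampling attributes per example, rather than pasting the whole fingerprint into every prompt, avoids asking the teacher to reproduce population-level rates inside a single message.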

The teams that conflate "more data" with "better data" keep scaling the synthetic corpus on the assumption that volume bridges the gap. It does not. A million synthetic inputs drawn from the same teacher prior are, in effect, one synthetic input resampled a million times along axes the model had already covered. The gap to production is set by the prior, not the count.

The actionable reframe: treat the input distribution as a contract between data and product. Product defines what real users send. Data has to generate to that contract, measure adherence, and gate on it before training. The team that fine-tunes against a population that does not exist is not training the wrong model — it is training a perfectly correct model for a different product than the one shipping. The model is fine. The dataset was the bug.
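A gate of that kind could be as simple as a per-axis tolerance check run before any training job starts. This is a sketch under stated assumptions: the axis names, fingerprint values, and tolerances below are invented, and absolute difference is only one of several reasonable adherence measures.

```python
def check_contract(prod_fp, synth_fp, tolerances):
    """Return the axes where synthetic inputs stray from production beyond tolerance."""
    violations = {}
    for axis, tol in tolerances.items():
        gap = abs(prod_fp[axis] - synth_fp[axis])
        if gap > tol:
            violations[axis] = round(gap, 3)
    return violations

# Hypothetical fingerprints and tolerances for illustration.
prod_fp = {"mean_words": 9.0, "frac_all_lowercase": 0.70, "frac_multi_intent": 0.25}
synth_fp = {"mean_words": 14.0, "frac_all_lowercase": 0.05, "frac_multi_intent": 0.22}
tolerances = {"mean_words": 3.0, "frac_all_lowercase": 0.10, "frac_multi_intent": 0.10}

violations = check_contract(prod_fp, synth_fp, tolerances)
if violations:
    print("Blocking training run; axes out of contract:", violations)
else:
    print("Synthetic inputs within contract; safe to train.")
```

Here length and casing fail while intent count passes, so the dataset would go back through generation or the messiness pass instead of into training.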
