The Synthetic Eval Your Real Users Never Resemble
There is a class of eval failure that no dashboard catches because it shows up as success. The score climbs week over week. The judge agrees with the answer. The regression tests stay green. Meanwhile, the support team is logging a slow drift in user-reported quality, sales is hearing "it doesn't quite get what I meant," and nobody in engineering can reproduce the complaint because every example anyone tries on the eval set passes. The eval and the users live in different distributions, and the eval is the more polished of the two.
The mechanism is simple, and it hides in plain sight: the model that wrote your eval prompts and the model under test are siblings, and siblings share priors. They smooth the same edges, prefer the same phrasings, and leave out the same kinds of malformed input. The eval certifies behavior in the world the generator imagines users inhabit. Your actual users live somewhere else.
The 2025 paper "Deconstructing Self-Bias in LLM-generated Translation Benchmarks" measured the effect directly and named it: when an LLM generates a benchmark and a sibling LLM is scored on it, the sibling wins. The bias has two compounding sources, the test data itself (LLM-as-testset) and the evaluator that grades it (LLM-as-evaluator), and the combination amplifies the inflation rather than cancelling it. The follow-up SILENCER work showed the bias is large enough to flip leaderboard order on translation tasks, then proposed mitigation by mixing generators from different model families. The implication for production teams is the uncomfortable one: if your eval was generated by one model and your prod model is another from the same family, your score is partly a measure of family resemblance.
Generator-Discriminator Collapse, Spelled Out
The standard mental model for synthetic eval data is "I needed 500 examples, the model generated 500 examples, the examples are fine." What that mental model misses is that the model did not sample uniformly from the space of inputs your users will produce. It sampled from the space of inputs the generator considers natural. Those are not the same distribution, and the gap between them grows in exactly the places that matter:
- Surface form. Real users type "u" and "ur" and "pls." Generators write "you" and "your" and "please." A model that has only seen the polished form will degrade quietly on the abbreviated form, and the eval will not notice because the eval does not contain the abbreviated form.
- Code-switching and mixed languages. A real Spanish-speaking user might write three sentences in English, drop a Spanish clause, then return to English. Generators rarely produce this. Models handle it less well than English-only inputs. The eval shows the model handles Spanish fine. Production shows it handles monolingual Spanish fine and code-switched Spanish poorly.
- Half-finished thoughts. Real users send "wait actually never mind — what about the other one?" Generators produce well-formed, single-intent prompts. The eval does not test whether the model can disambiguate a self-correcting input.
- Domain jargon. A practitioner asking about "the OOM on the worker pool after we bumped the JVM heap" will say it that way. A generator asked for "ten realistic SRE questions" will produce smoother phrasings that are not quite how a practitioner would ask.
- Wrong labels and false premises. Real users ask "why does the migration fail when I set the default to NULL on a NOT NULL column" — a question that contains a contradiction the user hasn't noticed. Generators tend to produce prompts whose premises hold. The eval does not measure the model's behavior under premise failure.
The aggregate effect is that the synthetic eval shaves off the tails of the input distribution. This is the same mechanism that drives model collapse in training data — the canonical Shumailov result that successive generations of training on synthetic data lose the tails of the original distribution — but applied to evaluation rather than training. The training-collapse version is widely discussed and there are accepted mitigations. The eval-collapse version is less discussed because the symptom is more flattering: your score goes up.
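One rough way to see the shaved tails is to count how often a few surface-form markers from the list above appear in each prompt set. The sketch below is illustrative only: the marker patterns, the `missing_tails` helper, and the 5 percent threshold are assumptions chosen for the example, not a validated taxonomy of real-user input.

```python
import re

# Hypothetical markers; a real audit would derive these from production logs.
MARKERS = {
    "abbreviation": re.compile(r"\b(u|ur|pls|thx)\b", re.IGNORECASE),
    "self_correction": re.compile(r"\b(wait|never mind|actually)\b", re.IGNORECASE),
}

def marker_rates(prompts):
    """Return the fraction of prompts that match each marker."""
    total = len(prompts)
    if total == 0:
        return {name: 0.0 for name in MARKERS}
    return {
        name: sum(1 for p in prompts if pattern.search(p)) / total
        for name, pattern in MARKERS.items()
    }

def missing_tails(synthetic, production, min_prod_rate=0.05):
    """Markers that are common in production but absent from the synthetic set."""
    syn = marker_rates(synthetic)
    prod = marker_rates(production)
    return sorted(
        name for name in MARKERS
        if prod[name] >= min_prod_rate and syn[name] == 0.0
    )

synthetic = [
    "Could you please summarize your refund policy?",
    "What is the status of my order?",
]
production = [
    "pls fix ur refund thing",
    "wait actually never mind, what about the other one?",
]
print(missing_tails(synthetic, production))
```

Regex markers only catch the crudest surface differences. Code-switching and false premises need richer detectors, but even a crude count makes the gap visible.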
Why the Score Goes Up While the Users Notice
Three pieces have to line up for the eval to be useful: the input distribution has to match production, the labels have to be correct, and the scoring rubric has to reflect what users actually care about. Synthetic-everything eval pipelines compromise all three at once, in correlated ways.
The input compromise is what we just discussed. The label compromise comes next: when the eval generator also produces the "correct answer" used as ground truth, the eval rewards the answer the generator would have given, not the answer that would have helped the user. Putting an LLM judge on top does not break the loop. The 2024 "Finding Blind Spots in Evaluator LLMs with Interpretable Checklists" work found that evaluator LLMs failed to identify quality drops in over 50 percent of cases on average, which means the judge component of an LLM-as-judge eval is also part of the family-resemblance loop. The 2025 "Self-Correction Blind Spot" paper sharpened the point further: across 14 models, the average rate at which a model failed to correct its own errors while catching identical errors in others' work was 64.5 percent. Your judge is not a neutral arbiter; it is a participant.
The scoring rubric compromise is the most subtle. When the eval set is generated by a sibling model, the rubric ends up encoding what good responses look like in the generator's idiom. Conciseness preferences. Formatting preferences. Hedging-vs-asserting tradeoffs. Tuning against that eval teaches the model under test to satisfy the rubric. Real users, who have idiomatic preferences of their own, may want something different. You see the divergence in CSAT before you see it in the eval score, because the eval score is no longer measuring CSAT; it is measuring intra-family agreement.
How the Gap Closes in Practice
The remediation pattern that survives contact with production has a few elements, and they stack. None of them fixes the problem on its own; the goal is to drag the eval distribution toward the prod distribution by enough that the score becomes a usable signal again.
Sample real prompts as ground truth. The single highest-leverage move is to make a fraction of the eval set come from anonymized production traffic rather than from a generator. The 2026 industry pattern is to run automatic regression metrics on 100 percent of outputs, use an LLM judge to continuously score 10 to 20 percent of sampled traffic, and reserve human review for the high-stakes 5 to 10 percent slice. The same sampling can feed a curated "wild eval" subset — a few hundred to a few thousand real prompts with PII scrubbed — that runs alongside the synthetic suite and is allowed to disagree with it. When the synthetic suite says the new model is better and the wild eval says it is worse, you trust the wild eval.
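A minimal sketch of that tiered routing, assuming each request carries a stable id and that something upstream has already flagged it as high-stakes or not. The function names, tier labels, and the 15 percent default are hypothetical; the rate only needs to sit in the 10 to 20 percent band described above.

```python
import hashlib

def bucket(request_id):
    """Map a request id to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32

def review_tiers(request_id, high_stakes, judge_rate=0.15):
    """Decide which evaluation tiers a single production request goes through."""
    tiers = ["regression_metrics"]      # automatic metrics run on every output
    if bucket(request_id) < judge_rate:
        tiers.append("llm_judge")       # sampled slice scored by an LLM judge
    if high_stakes:
        tiers.append("human_review")    # high-stakes slice goes to people
    return tiers

print(review_tiers("req-000123", high_stakes=False))
```

Hashing the request id rather than calling a random number generator keeps the sample reproducible: the same request lands in the same tier on a rerun, which makes disagreements between runs easier to debug.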
Cross-family generation when synthetic is unavoidable. When you do need to generate eval cases — coverage of a new feature, adversarial prompts for a known failure mode, scenarios production hasn't seen yet — generate them with a model from a different family than the model under test. If you ship Claude in prod, have GPT generate the eval cases, or vice versa. The SILENCER paper showed that mixing generators across families measurably reduces the self-bias inflation. The bias is not zero, but the family-resemblance gradient flattens.
Distribution-shift metrics as a first-class signal. Treat the gap between your eval inputs and your prod inputs as a metric you watch. Compute a distance between the two sets, for example KL divergence on token distributions or MMD on embeddings, whatever your stack supports, and alarm when the gap widens. A widening gap is a signal that production has drifted somewhere your eval no longer covers. Most teams discover the gap by accident, in a postmortem. You can discover it on purpose, in advance, by measuring it.
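As one concrete option, here is a sketch of the KL-divergence variant over whitespace-tokenized prompts, using only the standard library. The tokenizer, the add-one smoothing, and the function names are assumptions made for the example; an embedding-based MMD would follow the same shape with a different distance.

```python
import math
from collections import Counter

def token_distribution(prompts, vocab, alpha=1.0):
    """Smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tok for p in prompts for tok in p.lower().split())
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def eval_prod_gap(eval_prompts, prod_prompts):
    """KL(prod || eval): large when prod uses tokens the eval barely contains."""
    vocab = {tok for p in eval_prompts + prod_prompts for tok in p.lower().split()}
    prod = token_distribution(prod_prompts, vocab)
    evl = token_distribution(eval_prompts, vocab)
    return kl_divergence(prod, evl)

eval_prompts = ["please reset my password", "please cancel my order"]
prod_prompts = ["pls reset ur pw", "cancel order thx"]
print(eval_prod_gap(eval_prompts, prod_prompts))
```

The direction matters. Taking production as the first argument penalizes production mass that the eval set lacks, which is the failure described here. The smoothing keeps the logarithm finite when a production token never appears in the eval set. The absolute number means little on its own; the useful signal is its trend across refreshes.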
A "wild eval" run that is allowed to be ugly. The temptation when you stand up a real-traffic eval set is to clean it. Don't. The whole point of the wild eval is to preserve the typos and the half-thoughts and the code-switching and the wrong premises. The PII scrubbing should remove personally identifying spans without normalizing the prose around them. If the wild eval looks like the synthetic suite after a week, somebody has been "fixing" it; revert.
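A sketch of span-level scrubbing that leaves the surrounding prose alone. The two patterns and the placeholder tokens are illustrative assumptions; a production scrubber needs far broader coverage (names, addresses, account numbers) and usually a dedicated PII detection tool rather than a pair of regexes.

```python
import re

# Illustrative patterns only; they will miss many real PII formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(prompt):
    """Replace identifying spans with placeholders and touch nothing else."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    prompt = PHONE.sub("[PHONE]", prompt)
    return prompt

print(scrub("pls email me at jo@example.com asap, ur form is broken"))
```

Note what the function does not do: no lowercasing, no spell correction, no whitespace cleanup. The typos and abbreviations around the placeholder are the data.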
Judges that aren't relatives. When you run LLM-as-judge, do not let the judge come from the same family as either the generator or the model under test. Where budget allows, use a panel of judges from different families (the "jury of judges" pattern) and accept the roughly threefold judging cost as the price of breaking the family-resemblance loop. The cost-benefit math is straightforward: a false pass in eval, where you ship a regression the eval did not catch, is almost always more expensive than the marginal token cost of two extra judges.
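The jury pattern reduces to two small steps: filter out judges related to the generator or the model under test, then aggregate the remaining verdicts. The sketch below assumes you maintain a mapping from judge name to model family. The family labels, the pass/fail vocabulary, and the tie-breaking rule are hypothetical choices for the example.

```python
from collections import Counter

def eligible_judges(judges, generator_family, candidate_family):
    """judges maps judge name -> model family; drop any relative."""
    excluded = {generator_family, candidate_family}
    return sorted(name for name, fam in judges.items() if fam not in excluded)

def jury_verdict(votes):
    """votes maps judge name -> 'pass' or 'fail'; ties count as 'fail'."""
    counts = Counter(votes.values())
    return "pass" if counts["pass"] > counts["fail"] else "fail"

judges = {
    "judge_a": "family_a",
    "judge_b": "family_b",
    "judge_c": "family_c",
    "judge_d": "family_d",
}
panel = eligible_judges(judges, generator_family="family_a", candidate_family="family_b")
print(panel)
print(jury_verdict({"judge_c": "pass", "judge_d": "fail"}))
```

Counting a tie as a failure is the conservative choice here, since the expensive error is shipping a regression the eval waved through.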
What the Architecture Looks Like When You Take This Seriously
A team that has internalized the problem ends up with an eval system that has three distinct layers, and trusts them in proportion to what they cost to maintain: the cheapest layer least, the most expensive layer most.
The cheapest layer is the synthetic regression suite. It runs on every commit. It catches obvious breakage. Everyone agrees it does not say much about quality; it says the model still does the basic things it used to do.
The middle layer is the cross-family generated suite, used for new features and known failure modes. It catches conceptual regressions on territory the synthetic regression suite was never asked about. It is more expensive per case but is generated rarely.
The most expensive and most trusted layer is the wild eval — real production prompts, PII scrubbed, scored by a judge from a different family than the prod model, refreshed continuously as production traffic evolves. When this layer disagrees with the other two, this is the one you believe. It is the only layer that knows what your users actually do.
The shift in how teams talk about evals when this architecture is in place is small but meaningful. "The eval score went up" stops being a publishable signal on its own. "The eval score went up on the wild slice" is the signal. "The eval score went up on the synthetic slice but stayed flat on the wild slice" is a red flag: it means the model got better at the eval and not better for the users. That distinction is invisible to teams running only synthetic suites, and it is the single most important quality signal in the lifecycle of a model in production.
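That reading of the two slices can be written down as a small decision rule. The sketch assumes each slice reports a score delta between the old and new model; the `tolerance` value and the label strings are placeholders to be set from your own run-to-run noise.

```python
def interpret(synthetic_delta, wild_delta, tolerance=0.01):
    """Classify a release candidate from score deltas on the two slices.

    The wild slice decides; the synthetic slice only matters when the
    wild slice is flat.
    """
    if wild_delta > tolerance:
        return "improved"
    if wild_delta < -tolerance:
        return "regressed"
    if synthetic_delta > tolerance:
        return "red_flag"   # better at the eval, not better for users
    return "no_change"

print(interpret(synthetic_delta=0.05, wild_delta=0.0))
```

The ordering of the checks encodes the trust hierarchy: a synthetic gain can never override a wild-slice regression.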
The Practical Takeaway
If your eval set was generated by a model and graded by a model and you are about to use the result to decide whether to ship, pause and answer two questions. What fraction of the eval cases came from real users? And what fraction of the judges share a family with either the generator or the model under test? If the answers are "none" and "all," the score is a measure of how well the family agrees with itself, and the user-reported quality drift you are starting to see is the only honest data point you have. Add the wild eval before you add another feature. The cheapest version — a hundred PII-scrubbed real prompts, scored by a judge from a different family — already breaks the loop. The score will probably drop. That is the signal working for the first time.
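The two questions can be answered mechanically if each eval case records where it came from and each judge records its family. A minimal sketch under those assumptions; the `"production"` source label and the family names are hypothetical.

```python
def loop_audit(case_sources, judge_families, generator_family, candidate_family):
    """Return (fraction of real cases, fraction of related judges, closed_loop)."""
    related = {generator_family, candidate_family}
    real_fraction = (
        sum(1 for s in case_sources if s == "production") / len(case_sources)
        if case_sources else 0.0
    )
    related_fraction = (
        sum(1 for f in judge_families if f in related) / len(judge_families)
        if judge_families else 0.0
    )
    # "None" real cases and "all" related judges: the family is grading itself.
    closed_loop = real_fraction == 0.0 and related_fraction == 1.0
    return real_fraction, related_fraction, closed_loop

print(loop_audit(["synthetic"] * 4, ["family_a"], "family_a", "family_a"))
```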
- https://arxiv.org/pdf/2509.26600
- https://arxiv.org/pdf/2505.20738
- https://arxiv.org/pdf/2507.02778
- https://arxiv.org/pdf/2509.04013
- https://arxiv.org/pdf/2406.13439
- https://arxiv.org/pdf/2408.08808
- https://arxiv.org/pdf/2412.10056
- https://arxiv.org/pdf/2404.05090
- https://arxiv.org/abs/2410.12954
- https://www.semanticscholar.org/paper/AI-models-collapse-when-trained-on-recursively-data-Shumailov-Shumaylov/603d3f90fc40f79ff51258f0295de3ec5107f73e
- https://earezki.com/ai-news/2026-03-21-llm-evals-on-real-traffic-not-just-test-suites/
- https://futureagi.com/blog/llm-evaluation-frameworks-metrics-best-practices/
- https://galtea.ai/blog/llm-evaluation-complete-guide
- https://labelyourdata.com/articles/llm-as-a-judge
- https://www.braintrust.dev/articles/llm-evaluation-metrics-guide
