Synthetic Eval Bootstrapping: How to Build Ground-Truth Datasets When You Have No Labeled Data
The common failure mode isn't building AI features that don't work. It's shipping AI features without any way to know whether they work. And the reason teams skip evaluation infrastructure isn't laziness — it's that building evals requires labeled data, and on day one you have none.
This is the cold start problem for evals. To get useful signal, you need your system running in production. To deploy with confidence, you need evaluation infrastructure first. The circular dependency is real, and it causes teams to do one of three things: ship without evals and discover failures in production, delay shipping while hand-labeling data for months, or use synthetic evals — with all the risks that entails.
This post is about the third path done correctly. Synthetic eval bootstrapping works, but only if you understand what it cannot detect and build around those blind spots from the start.
The Circularity Problem Is Formally Quantified
The obvious objection to using LLMs to generate test cases for LLMs: if the generator and the system under test share the same training distribution, the synthetic tests will miss exactly the failures that matter.
This isn't intuition — it's been formally measured. A 2025 paper accepted at ICML (preference leakage, arXiv 2502.01534) showed that when the LLM generating test cases and the LLM acting as judge belong to the same model family — or when one is fine-tuned from the other — the judge exhibits measurable bias favoring outputs from the related model. Worse, the paper notes that "preference leakage is harder to detect compared to previously identified biases." The failure is silent.
Three modes of relatedness amplify this: identical models, inheritance relationships (one model fine-tuned from the other), and membership in the same architectural family. Running GPT-4o to generate synthetic evals and GPT-4o-mini to judge them is not two independent opinions — it's one perspective with a hat change.
This is the core failure mode you are designing against.
Four Techniques That Break the Circularity
1. Human-Validated Anchors
The consistent finding across practitioners who have built production eval systems is that 20–30 carefully human-labeled examples are worth more than 30,000 synthetic ones.
The mechanism: a domain expert makes binary pass/fail judgments on a diverse example set and writes a detailed critique for each decision. These critiques become few-shot examples in the judge prompt. You iterate until the automated judge reaches >90% agreement with the expert on held-out cases. This technique, called critique shadowing, was documented by Hamel Husain and validated at scale by LinkedIn, which improved judge accuracy from 68% to 94% through iterative refinement against a human-labeled validation set.
The 20–30 number isn't arbitrary. Shankar et al.'s EvalGen work (UIST 2024) showed empirically that agreement degrades if you try to fit criteria to more than 30 examples due to prompt length constraints. The anchor set needs to be small enough to fit into context and diverse enough to cover the evaluation dimensions that matter.
The anchor set is what makes everything else trustworthy. Build it first, before generating any synthetic data.
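The agreement loop is small enough to sketch directly. This is a minimal illustration, assuming verdicts are simple pass/fail strings; the function names and the 0.90 threshold (from the >90% target above) are illustrative, and the judge itself can be any LLM call:

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of held-out anchor examples where the judge's
    pass/fail verdict matches the domain expert's."""
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(human_verdicts)

def judge_is_trustworthy(judge_verdicts, human_verdicts, threshold=0.90):
    """Iterate on the judge prompt (adding critiques as few-shot
    examples) until this returns True on held-out anchors."""
    return agreement_rate(judge_verdicts, human_verdicts) >= threshold
```

The point of the numeric check is that "the judge seems fine" is never the stopping criterion; a measured agreement rate against held-out human labels is.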
2. Cross-Model Disagreement Signals
Where two models from different training lineages disagree on an evaluation judgment, that disagreement is a high-confidence signal that the case is genuinely hard. These are the cases worth human review.
The operational pattern: run two judges with meaningfully different training histories (not GPT-4o and GPT-4o-mini — those are in the same family). Cases where they agree are likely reliable. Cases where they disagree go to a human reviewer. This gives you a principled way to allocate scarce labeling budget: instead of randomly sampling cases for review, concentrate review on maximum-uncertainty cases where automated evaluation is least trustworthy.
Research on judge robustness confirms why model family diversity matters. GPT-4o achieves 97.7% robustness on fact-related tasks but only 56.6–83.2% on fallacy oversight tasks. If both your judges are from the same family, they will share the same systematic blind spots.
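The routing logic itself is trivial to sketch; the function names here are illustrative, and the verdicts are assumed to be comparable labels from two judges in different model families:

```python
def triage(verdict_a, verdict_b):
    """Route one case given verdicts from two judges with different
    training lineages: agreement is auto-accepted, disagreement is
    queued for human review."""
    if verdict_a == verdict_b:
        return ("auto", verdict_a)
    return ("human_review", None)

def split_queue(cases):
    """Partition (case_id, verdict_a, verdict_b) triples into an
    auto-accepted list and a human-review list."""
    auto, review = [], []
    for case_id, a, b in cases:
        route, verdict = triage(a, b)
        if route == "auto":
            auto.append((case_id, verdict))
        else:
            review.append(case_id)
    return auto, review
```

The value is not in the code but in the policy it encodes: labeling budget flows to maximum-uncertainty cases instead of random samples.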
3. Metamorphic Testing
Metamorphic testing sidesteps the oracle problem entirely. Instead of asking "is this output correct?" — which requires ground truth — you ask "does the output satisfy a known relationship with a transformed version of the input?"
You define metamorphic relations (MRs): semantic invariance (paraphrasing an input should not change the classification), demographic swap fairness (changing a name from male to female should not change a recommendation), negation consistency (adding "don't" to an instruction should produce the opposite action). Then you generate test pairs mechanically and check whether the relationship holds.
The yield is significant. A 2025 empirical study (LLMORPH, arXiv 2511.02108) ran roughly 560,000 metamorphic tests across a representative subset of NLP tasks and found an average failure rate of 18%: nearly 1 in 5 tests exposed a genuine behavioral flaw. For a technique that requires zero ground-truth labeling, this is a compelling return.
The practical constraint: metamorphic relations must be defined by humans who understand the task. The technique does not generate test cases from nothing — it amplifies a small set of human-specified invariants into a large number of concrete test instances. That human specification is your anchor.
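A metamorphic check reduces to comparing the system's outputs on an input pair. This is a minimal sketch with a toy classifier standing in for the system under test; all names are illustrative, and real transforms (paraphrases, negations) would come from a paraphrase model or templates:

```python
def mr_invariance(system, original, transformed):
    """Semantic-invariance relation: the transform (e.g. a paraphrase)
    should not change the system's output."""
    return system(original) == system(transformed)

def mr_negation(system, original, negated):
    """Negation-consistency relation: negating the input should flip
    the system's label."""
    return system(original) != system(negated)

# Toy stand-in for any classifier under test:
def toy_sentiment(text):
    return "neg" if "not" in text.split() else "pos"
```

Note that neither relation ever asks whether an output is *correct*; both only check a relationship between two outputs, which is exactly how the oracle problem is sidestepped.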
4. Behavioral Invariants
Testing properties instead of outputs eliminates the need for correct answers entirely. Rather than evaluating whether an output is correct, you evaluate whether it satisfies properties you can verify mechanically.
Properties that work in practice:
- Reference integrity: Does the output cite sources that actually exist in the provided context?
- Entailment consistency: Is a generated summary entailed by the source document? (Checkable with an NLI model, no labeling needed.)
- PII boundary: Does the output contain PII from context that wasn't present in the user's question?
- Self-consistency: Does the system produce outputs that don't contradict each other across runs on the same input?
- Monotonicity: Given more specific context, does the system produce more specific answers rather than vaguer ones?
The practical appeal of invariant-based evaluation is that the same checks work simultaneously as evals and as production guardrails. You write them once and get value in two places.
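Two of the properties above can be sketched as plain functions. The helpers and their signatures are illustrative; a real pipeline would extract citation IDs and PII spans upstream of these checks:

```python
def reference_integrity(cited_ids, context_ids):
    """Every source the output cites must exist in the provided
    context; citing anything else is a hallucinated reference."""
    return set(cited_ids) <= set(context_ids)

def pii_boundary(output, question, context_pii):
    """PII from the retrieved context may appear in the output only
    if the user's own question already contained it."""
    leaked = [p for p in context_pii if p in output and p not in question]
    return not leaked
```

Because both checks are deterministic and cheap, the same functions can run as offline eval assertions and as inline production guardrails, which is the dual use described above.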
How Synthetic Generation Fails
Even with the above mitigations, synthetic evals have systematic failure modes you need to understand.
Distribution compression is the most pervasive. When LLMs generate synthetic test cases at scale, they cluster around the central tendency of the training distribution. The output looks diverse but isn't — rare, messy, out-of-distribution queries that are most likely to break your system are chronically underrepresented. One analysis found that with 6,000 LLM-generated samples, a classifier achieved 76% accuracy, compared to 88% with just 3,000 human-labeled samples. The synthetic data simply didn't cover the hard cases.
Logical error blindness is specifically an LLM-as-judge problem. Frontier models achieve only 56.6–83.2% consistency when evaluating logical reasoning tasks. LLM judges can reliably detect factual inconsistency, citation errors, and formatting violations — but they are systematically weak at detecting fallacious reasoning in outputs that read fluently. If your application involves reasoning chains, don't rely on LLM judges to catch reasoning errors.
Criteria drift without anchoring is a subtle failure that accumulates when synthetic evals are used to iteratively refine evaluation criteria without returning to human judgment. The criteria get progressively crisper and more internally consistent, while drifting from what you actually care about. You end up optimizing for a well-specified proxy that doesn't represent real quality. Shankar et al.'s UIST 2024 paper found this empirically: "it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs." Criteria and graded outputs co-evolve; synthetic loops freeze that evolution prematurely.
Benchmark contamination matters if you're using public evaluation benchmarks as part of your synthetic data pipeline. If the LLM generating your test cases was trained on those benchmarks, inflated agreement scores tell you nothing about real capability. Analysis of contamination across popular benchmarks found GPT-4 showing a 57% exact match rate on MMLU test option guessing — evidence of memorization rather than understanding.
Practical Bootstrapping Sequence
Given the constraints above, here is a sequence that works:
Week 1 — Invariants and anchors. Before generating any synthetic data, define 5–10 behavioral invariants that your system must satisfy regardless of output content. Build your 20–30 example anchor set through human labeling. Write detailed critiques for each. This is the minimum viable eval infrastructure.
Week 2 — Synthetic expansion. Use the anchor set as few-shot examples to guide synthetic test generation. Generate tests across a dimensional grid: different user intents, complexity levels, edge cases, domain subdomains. For RAG systems, use a document-grounded approach (RAGAS-style) where ground truth answers are derivable from your actual knowledge base — this eliminates circularity for retrieval evaluation.
Week 3 — Adversarial coverage. Run metamorphic test generation using your defined invariant relations. Apply adversarial mutation (paraphrase attacks, negation injections, persona swaps) against existing test cases to expand coverage into the distribution tails.
Ongoing — Cross-model disagreement triage. Run two judges from different model families on all new cases. Route disagreements to human review. Update the anchor set quarterly as real production queries reveal new failure categories the anchor set didn't cover.
What Self-Taught Evaluators Change
Meta's Self-Taught Evaluators (arXiv 2408.02666, 2024) demonstrated something important: with zero human annotations, bootstrapping via synthetic contrastive pairs and rejection sampling improved an open-source judge from 75.4 to 88.3 on RewardBench — matching GPT-4 judge performance.
The mechanism: generate modified instructions from original prompts (creating implicit preference pairs), generate strong responses to modified versions (which are weak responses to the originals), then use rejection sampling to accept only judgments that align with the synthetic preference signal. The iterative retraining on accepted examples breaks the circularity.
This doesn't mean human anchors are unnecessary — it means that with the right sampling strategy, the number of human-labeled examples required to get a reliable judge is lower than practitioners assume. The Prometheus framework operationalized a similar insight: 50 seed rubrics (human-written) expanded to 1,000 by GPT-4, generating 100K training examples, produced an evaluator matching GPT-4 judge performance at a fraction of the cost. The 50 human seeds are what give the synthetic expansion direction.
The Eval Set Is Never Done
The most costly misconception about synthetic eval bootstrapping is treating it as a one-time build. Synthetic evals have a half-life. Distribution shift, silent base model updates, and upstream data quality changes all cause an eval set built at launch to produce increasingly misleading signals over time.
The 2025 ZenML analysis of 1,200 production deployments found that teams maintaining continuous human-anchored recalibration — not just static eval sets — were the ones that caught behavioral regression before users reported it. The operational pattern that works: monthly review of a random sample of production outputs against your anchor set, with quarterly anchor set refresh to incorporate new failure categories discovered in production.
Synthetic evals solve the day-one problem. The real work is the pipeline that keeps them honest as the system evolves around them.
Building evals without labeled data is tractable — but only if you're explicit about what the technique can and cannot detect. The anchor set of 20–30 human-labeled examples with written critiques is not a shortcut: it is the foundation that determines whether everything built on top of it is measuring something real. Get that right first, then let synthetic generation do what it's actually good at: scaling coverage across the distribution once you know what the distribution is.
References
- https://arxiv.org/abs/2502.01534
- https://arxiv.org/abs/2408.02666
- https://arxiv.org/abs/2404.12272
- https://arxiv.org/html/2412.05579v2
- https://llm-judge-bias.github.io/
- https://alignment.anthropic.com/2025/bloom-auto-evals/
- https://arxiv.org/abs/2511.02108
- https://www.anthropic.com/research/discovering-language-model-behaviors-with-model-written-evaluations
- https://github.com/prometheus-eval/prometheus-eval
- https://docs.ragas.io/en/stable/getstarted/rag_testset_generation/
- https://eugeneyan.com/writing/llm-evaluators/
- https://hamel.dev/blog/posts/llm-judge/
- https://applied-llms.org/
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
