
Building LLM Evals from Sparse Annotations: You Don't Need 10,000 Examples

· 12 min read
Tian Pan
Software Engineer

Teams building LLM applications consistently make the same mistake: they wait for enough labeled data before investing in evaluation infrastructure. They tell themselves they need 5,000 examples. Or 10,000. The eval system stays on the backlog while "vibe checks" substitute for measurement. A ZenML analysis of 1,200 production deployments found that informal vibe checks remain common even in mature deployments — and many teams never graduate to systematic evals at all.

The data-size intuition is borrowed from classical ML, where more labeled examples reliably improved model performance. For LLM evaluation, it is largely wrong. Research on sparse benchmarks demonstrates that 20–40 carefully selected items reliably estimate full-benchmark rankings, and 100 items produce mean absolute error below 1% compared to thousands. The problem is not data volume. The problem is that most teams skip the structured process that makes small evaluation sets trustworthy.

This post covers what that process actually looks like: how to select the right examples through active learning, how to generate noisy labels at scale with weak supervision, how to bootstrap with LLM judges, and how to know when your small eval set is ready to use.

Why Teams Wait (and Why They Should Not)

The delay is not laziness. It is a structural trap. Evaluation requires knowing what failure looks like, but failure patterns only become visible once you start labeling. Teams stall waiting for stable criteria that never materialize until the labeling work begins.

There is also a chicken-and-egg timing problem. Data accumulates meaningfully once the system is in production. The eval system feels premature before you have real traffic. So teams ship without evals, discover quality problems from user complaints, and then try to build retroactive eval infrastructure on top of undocumented failure modes.

The fix is to start earlier than feels justified — with 30 examples if that is all you have — and treat the eval set as a living artifact rather than a prerequisite to be completed before use.

Hamel Husain and Shreya Shankar's research identifies what they call criteria drift: annotation criteria are not stable. Humans refine what they mean by "good" as they review more examples, making early labels inconsistent with later ones. This is not a problem to solve before labeling. It is the labeling process itself. The implication: small eval sets built iteratively through active error analysis are more consistent, not less, than large sets built through one-shot annotation batches.

Active Learning: Picking Examples That Teach You Something

The conventional approach to building an eval set is random sampling. Pick 500 examples from your production logs. Send them to annotators. This is efficient only if your production traffic is uniformly informative, which it almost never is. The vast majority of queries in most applications are routine successes that tell you nothing about where the system breaks.

Active learning inverts this. Instead of sampling randomly, you select the examples where the model (or the evaluator) is most uncertain, most inconsistent, or most novel.

The three core strategies, in order of practical usefulness for LLM eval construction:

Uncertainty sampling targets examples where your LLM judge cannot confidently categorize the output. High uncertainty — measured by sampling multiple completions from the judge and observing disagreement — indicates a case you actually need a human to label. These are the examples that will stress-test your rubric.

Diversity sampling ensures coverage across the input space. Uncertainty sampling alone tends to cluster on a narrow slice of edge cases, producing eval sets that over-represent one failure mode. Combining uncertainty and diversity (selecting uncertain examples that are also semantically distant from already-selected examples) builds comprehensive coverage faster.

Query-by-committee runs multiple judges — from different model families, or with different prompts — and surfaces examples where they disagree. Disagreement between judges is a strong signal that you have found a hard case worth labeling.

In practice, the process looks like this: take a sample of production queries, run them through your system, compute disagreement across multiple LLM judge runs, rank by uncertainty, and manually label the top 30–50 examples. This seed set will be dramatically more informative than random selection. An active learning study on low-resource annotation found that using a frontier model as annotator in an active learning loop reached baseline performance with only 20% of ground-truth labeled data — 42 to 53 times cheaper than exhaustive human annotation.
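The selection loop above can be sketched in a few lines. This is a minimal illustration, not a library: the `judge_votes` lists stand in for verdicts collected from repeated judge runs, and the embeddings are placeholders for whatever vector representation you already compute (in practice, distances should be normalized to the same scale as the uncertainty scores).

```python
import math
from collections import Counter

def judge_disagreement(votes):
    """Normalized entropy of repeated judge verdicts on one example:
    0.0 = unanimous, 1.0 = maximally split."""
    counts = Counter(votes)
    if len(counts) < 2:
        return 0.0
    n = len(votes)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))

def select_for_labeling(judge_votes, embeddings, k, alpha=0.5):
    """Greedy hybrid selection: prefer examples the judge is uncertain
    about AND that are far from already-selected examples."""
    uncertainty = [judge_disagreement(v) for v in judge_votes]
    picked, remaining = [], list(range(len(judge_votes)))
    while remaining and len(picked) < k:
        def score(i):
            if not picked:
                return uncertainty[i]
            # Distance to the nearest already-picked example rewards novelty.
            nearest = min(math.dist(embeddings[i], embeddings[j]) for j in picked)
            return alpha * uncertainty[i] + (1 - alpha) * nearest
        best = max(remaining, key=score)
        picked.append(best)
        remaining.remove(best)
    return picked  # indices of the examples to send for human labeling
```

The returned indices are the 30–50 examples to label by hand; `alpha` trades off uncertainty against diversity.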

Weak Supervision: Generating Labels Without Labeling

Once you have a seed set of 30–50 human-labeled examples, you can extend coverage through weak supervision — using multiple noisy labeling functions to annotate unlabeled examples, then aggregating the noisy labels into a cleaner signal.

A labeling function is any heuristic that produces a label. It could be a regex (response mentions "I cannot" → likely a refusal), a keyword check (response contains a number when none is expected → possible hallucination), or an LLM prompt (does this response answer the question? yes/no). Individual labeling functions are unreliable. A generative label model that combines them — weighting each function based on its agreement and disagreement patterns with other functions — produces labels substantially more accurate than any individual function alone.

The key empirical finding: off-the-shelf LLM-as-judge approaches agree with domain experts only 70–75% of the time. Custom evaluators built using programmatic labeling — multiple prompts as labeling functions, aggregated and calibrated against a small expert-labeled seed set — achieve 88–90% accuracy on the same enterprise datasets. The seed set of expert labels is what calibrates the label model, which is why active learning (building the most informative seed set) and weak supervision (extending coverage cheaply) are complementary rather than alternative strategies.
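A stripped-down sketch of the pattern, with toy labeling functions for a refusal-detection label. Seed-accuracy weighting here stands in for a real generative label model (Snorkel-style label models estimate weights from the functions' agreement patterns instead); the heuristics are illustrative, not production rules.

```python
import re

# Toy labeling functions: each returns "refusal", "answer", or None (abstain).
def lf_cannot(response):
    return "refusal" if re.search(r"\bI (cannot|can't)\b", response) else None

def lf_sorry(response):
    return "refusal" if response.lower().startswith("sorry") else "answer"

def lf_weights(lfs, seed):
    """Score each labeling function by its accuracy on the human-labeled
    seed set, counting only examples where it does not abstain."""
    weights = []
    for lf in lfs:
        hits = [lf(x) == y for x, y in seed if lf(x) is not None]
        weights.append(sum(hits) / len(hits) if hits else 0.5)
    return weights

def weak_label(lfs, weights, response):
    """Accuracy-weighted vote over non-abstaining functions;
    None if every function abstains."""
    tally = {}
    for lf, w in zip(lfs, weights):
        label = lf(response)
        if label is not None:
            tally[label] = tally.get(label, 0.0) + w
    return max(tally, key=tally.get) if tally else None
```

Run `weak_label` over unlabeled production logs to produce the extended noisy label set; the seed set is only used to fit the weights.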

One important constraint: weak supervision still requires a ground-truth test set that is never optimized against. The label model learns which labeling functions are reliable from their agreement patterns, not by fitting labeled examples directly. But you still need human labels to validate that the resulting evaluator actually measures what you think it measures.

LLM-Bootstrapped Labeling: What LLM Judges Can and Cannot Do

LLM-as-judge has become the default approach for scaling evaluation without human annotation. GPT-4 achieves 85% agreement with human experts on pairwise comparisons — slightly higher than the 81% agreement between two human experts on the same task. For many applications, that is good enough to build useful signal.

The failure modes are well-documented and need to be explicitly controlled for:

Position bias: Models systematically prefer the response listed first. GPT-3.5 does this roughly half the time; Claude has shown up to 70% first-position preference. The fix is to run each pairwise comparison twice with order swapped and only count the result when both orderings agree.
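The order-swap fix is simple to implement. A sketch, assuming a `judge(first, second)` callable that returns `"first"` or `"second"` (the interface is hypothetical, not a specific library's):

```python
def consistent_winner(judge, a, b):
    """Run the pairwise comparison in both orders; return the winner
    only when both orderings agree, else None (discard as inconsistent)."""
    winner_ab = a if judge(a, b) == "first" else b
    winner_ba = b if judge(b, a) == "first" else a
    return winner_ab if winner_ab == winner_ba else None
```

A judge that always prefers the first position returns `None` for every pair, so the bias shows up as a high discard rate rather than a silently skewed win rate.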

Verbosity bias: Both Claude and GPT-3.5 prefer longer responses over 90% of the time regardless of quality. Mitigation: use reference-anchored evaluation rather than pairwise comparison, or explicitly instruct the judge to evaluate content not length.

Self-enhancement bias: GPT-4 rates its own outputs roughly 10% higher; Claude rates its own outputs roughly 25% higher. Do not use the same model family for both generation and evaluation.

Grading scale effects: Using a 0–5 scale produces significantly stronger human-judge alignment (ICC=0.853) than a 0–10 scale (ICC=0.805). Binary pass/fail produces less annotator disagreement and requires smaller sample sizes to detect differences. For most practical eval sets, binary classification is the right choice.

The practical architecture: one evaluator prompt per dimension you care about, not a single "God Evaluator" that tries to measure everything at once. Multi-dimensional single prompts are poorly calibrated and hard to debug. If you need to evaluate factual correctness, helpfulness, and format compliance, write three separate evaluator prompts.
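Structurally this is just a dictionary of narrow prompts rather than one broad one. In the sketch below, `llm_call` is a placeholder for whatever inference wrapper the team already has, and the prompt wording is hypothetical:

```python
# One narrowly scoped prompt per dimension (hypothetical wording).
EVALUATOR_PROMPTS = {
    "factuality": "Is every claim in the response supported by the provided context? Reply PASS or FAIL.",
    "helpfulness": "Does the response directly address the user's question? Reply PASS or FAIL.",
    "format": "Does the response follow the required output format? Reply PASS or FAIL.",
}

def evaluate(llm_call, response, context):
    """Run each dimension as its own judge call; results stay debuggable
    because a failure points at exactly one rubric."""
    return {
        dim: llm_call(prompt, response, context)
        for dim, prompt in EVALUATOR_PROMPTS.items()
    }
```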

For synthetic example generation, LLMs can expand a seed set by generating variations, edge cases, and adversarial examples. The risk is that synthetic examples tend to be cleaner and better-formed than real production inputs. Synthetic defects are often out of distribution — real failures from actual production traffic better represent what you need to catch. Generate synthetic examples to increase coverage across scenarios you have identified as important, not to substitute for real data collection.

Statistical Trustworthiness: When Is the Eval Set Ready to Use

The most common question about small eval sets is whether they have enough statistical power to detect real differences. The answer depends on what difference you are trying to detect.

The fundamental relationship: detecting a performance gap half the size requires four times as many samples (standard error decreases as 1/√n). For a binary eval (pass/fail):

  • 200 examples with a 3% defect rate: 95% confidence interval of ±2.4%
  • 400 examples: ±1.7%
  • The effect size you care about determines the sample size you need
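The bullet figures fall out of the normal-approximation interval for a proportion, which is a one-liner to check:

```python
import math

def ci_half_width(p, n, z=1.96):
    """95% normal-approximation confidence-interval half-width for a
    binary pass/fail rate: z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1 - p) / n)

# At a 3% defect rate: n=200 gives ~±2.4%, n=400 gives ~±1.7%,
# matching the figures above.
```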

Below roughly 100 examples, confidence intervals based on the Central Limit Theorem (CLT) become unreliable; use a Bayesian alternative such as a Beta-Binomial posterior. Above 100 examples, CLT-based intervals work well.

More practically: an eval set with 200 examples is almost always enough to detect regressions that would affect user experience. An eval set with 50 examples can detect large regressions but not subtle ones. An eval set with 30 examples is sufficient for error analysis — learning what categories of failure exist — but not for rigorous A/B decisions between prompt versions.

Paired comparison dramatically increases statistical efficiency. If you evaluate two system versions on the same set of inputs, you can analyze per-example differences rather than aggregate rates. Because model performance correlates across prompts (difficult inputs are hard for both versions), paired comparison exploits that correlation to reduce the effective variance by 3–5×. The same 100 examples give you the statistical power of 300–500 unpaired examples.
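A sketch of why pairing helps, using 0/1 pass outcomes for two versions on the same inputs (the data in the test is synthetic, constructed so the two versions fail on overlapping inputs):

```python
import math
import statistics

def unpaired_se(a, b):
    """Standard error of the pass-rate difference, treating the two
    runs as independent samples."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    return math.sqrt(pa * (1 - pa) / n + pb * (1 - pb) / n)

def paired_se(a, b):
    """Standard error of the mean per-example difference on the same
    inputs; difficulty shared by both versions cancels out."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.stdev(diffs) / math.sqrt(len(diffs))
```

When the two versions fail on largely the same hard inputs, `paired_se` comes out several times smaller than `unpaired_se`, which is where the 3–5× effective-variance reduction comes from.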

Three calibration checks that tell you the eval set is ready to use:

Theoretical saturation: You stop discovering new failure patterns when you review new examples. If every new example fits into categories you have already seen, you have sampled enough to represent the failure space.

Hold-out parity: Split labeled examples into 75% development and 25% holdout. Calibrate your LLM judge on the development set. Run on the holdout. If accuracy drops substantially, you have overfit the evaluator to the development labels. The holdout set should never be used to adjust the judge prompt.
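A deterministic split keeps the holdout stable across runs; a minimal sketch:

```python
import random

def dev_holdout_split(examples, holdout_frac=0.25, seed=0):
    """Shuffle deterministically, then cut 75/25. The holdout portion
    is only ever scored, never used to adjust the judge prompt."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]
```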

False negative rate check: A 100% pass rate is a warning sign, not a success. If every example passes, the eval set is not challenging enough to detect future regressions. Target a pass rate in the 70–85% range for a meaningful regression test.

Building the Process, Not the Dataset

The practical sequence for teams starting from scratch:

  1. Open-coded error analysis on 30 examples: Review 30 production outputs with no rubric. Take notes on every failure you see. Group notes into categories. This is your failure taxonomy. Do not skip this step — evaluators built without it measure things that do not matter.

  2. Appoint a ground-truth arbiter: For small teams, one domain expert serves as the benevolent dictator on label disputes. This eliminates the annotation disagreement problem that plagues larger annotation efforts. If you need multiple annotators, run an inter-rater reliability check with Cohen's Kappa before proceeding; 0.6 is the minimum threshold worth building on.

  3. Active selection of 50–100 hard examples: Using uncertainty sampling and diversity sampling, select the 50–100 most informative examples from your production log. Label them manually.

  4. Extend with weak supervision: Write 3–5 labeling functions (LLM prompts, heuristics, or both) targeting the failure categories from step 1. Aggregate them with a label model calibrated against your seed labels. This gives you 500–2,000 noisy labels for the cost of the inference.

  5. Calibrate the LLM judge: Use your labeled examples (development split) to align an LLM judge. Measure true positive rate and true negative rate separately. A high TPR (catching real failures) matters more than raw accuracy. Validate on the holdout split.

  6. Run the eval set in CI: At this point the eval set is a regression test, not just a measurement tool. Any prompt change or model update runs against it. The signal from 100–200 well-chosen examples, with a calibrated evaluator, is more actionable than informal vibe checks on 10,000 random examples.
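The calibration check in step 5 is a few lines once judge verdicts and human labels sit side by side (1 = flagged as a failure):

```python
def judge_rates(judge_flags, human_flags):
    """True positive rate (real failures the judge catches) and true
    negative rate (real passes it leaves alone), measured separately."""
    pairs = list(zip(judge_flags, human_flags))
    pos = [j for j, h in pairs if h]       # judge verdicts on true failures
    neg = [j for j, h in pairs if not h]   # judge verdicts on true passes
    tpr = sum(pos) / len(pos) if pos else None
    tnr = sum(1 - j for j in neg) / len(neg) if neg else None
    return tpr, tnr
```

Reporting the two rates separately keeps a judge that rubber-stamps everything (high accuracy on a mostly-passing set, zero TPR) from looking well calibrated.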

The investment to reach this point is roughly 4–8 hours of a senior engineer's time plus annotation cost for 50–100 examples. Teams that treat this as a prerequisite requiring months of data accumulation are making a category error. The eval system is not waiting for data. The data is waiting for the eval system to tell you what to collect.

The Goodhart Problem and How to Avoid It

Any eval set you optimize against will eventually become misleading. When prompt engineers can see the eval examples, they optimize for the visible set rather than for the underlying quality the set was meant to measure. This is not a failure of intent — it is a structural property of optimization.

The standard mitigation is the held-out set: a collection of examples that is never used to tune the prompt, judge, or system. When internal pass rates diverge from held-out pass rates, you have found Goodharting. The typical response — adding more examples to the visible set — makes the problem worse, not better.

A better structural approach: treat the eval set as read-only after creation. Changes to the eval set are a separate, deliberate decision made when the failure taxonomy genuinely changes — not when a new system version needs better scores. Regularly rotating in production examples ensures the set tracks real user behavior rather than a frozen snapshot of what the system looked like at launch.

The final discipline: use the eval set to decide whether to deploy, not to decide how to prompt. Prompt engineering uses the development split. Deployment decisions use the holdout. Keep those roles separate and the small eval set stays trustworthy.
