
Annotation-Free Evaluation: Measuring LLM Quality Before You Have Ground Truth

12 min read
Tian Pan
Software Engineer

Most teams ship an LLM feature, then spend weeks arguing about whether it's actually good. The evaluation question gets deferred because building a labeled dataset feels like a separate project. By the time you have ground truth, you've also accumulated two months of silent regressions you can never diagnose. This is backwards. You can get a meaningful quality signal in week one — before a single annotation is complete — if you know which techniques to reach for and where each one breaks.

This post is a field guide to annotation-free evaluation: the reference-free methods that work, the conditions they require, and the specific failure modes that will fool you if you're not careful.

Why "We'll Evaluate Later" Is a Bug, Not a Plan

The conventional justification for deferring evaluation is that good evaluation requires ground truth, and ground truth requires labeling, and labeling requires time and budget. This is true for some evaluation techniques. It's not true for all of them.

The real cost of deferring is that you lose the ability to make causal claims. When you discover a quality problem three months in, you don't know whether it was introduced by your latest prompt change, a model API update, a shift in your user query distribution, or a bug in your preprocessing. Without a baseline established early — even an imperfect one — your retrospective debugging is guesswork.

Annotation-free evaluation isn't a substitute for ground truth forever. It's a way to establish a signal before you have labels, so that when you do get labels, you can use them to calibrate and extend rather than to build everything from scratch.

Self-Consistency: Free Signal from Your Own Model

Self-consistency is the simplest annotation-free technique to implement. Instead of generating one response to a query, you generate several — typically using temperature sampling — and then measure how much they agree.

The core insight is that uncertainty manifests as variance. A model that truly "knows" the answer to a factual question will produce the same answer repeatedly across samples. A model that's guessing will produce a distribution of different answers. Measuring that distribution gives you a factuality proxy without any external reference.

In practice, you hash or normalize the outputs and compute majority agreement. If you ask a model a factual question ten times and it gives the same answer eight times, that's a stronger signal than if it gives five different answers. The technique was formalized for reasoning tasks — where it shows 4–18% accuracy improvements on math benchmarks — but the underlying principle applies broadly to any task where a correct answer exists.
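The normalize-and-vote step above can be sketched in a few lines. This is an illustrative sketch, not a production scorer: it assumes you already have the N temperature-sampled answers in hand, and the `normalize` heuristic is deliberately crude.

```python
# Self-consistency scoring over a list of sampled answers.
from collections import Counter
import re

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation/whitespace so trivially
    different phrasings hash to the same key."""
    return re.sub(r"[^a-z0-9]+", " ", answer.lower()).strip()

def consistency_score(samples: list[str]) -> float:
    """Fraction of samples agreeing with the majority answer (0..1)."""
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][1] / len(samples)

samples = ["Paris", "paris.", "Paris", "Lyon", "Paris"]
score = consistency_score(samples)  # 4 of 5 agree after normalization -> 0.8
```

A low score flags the query for review; tracking the score distribution over time is what turns this into a regression signal.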

The limitations are important to understand:

  • Consistent hallucination defeats it. If a model consistently invents the same false claim — "The treaty was signed in 1847" — high self-consistency gives you high confidence in a wrong answer. The method assumes that errors are random; systematic errors are invisible.
  • Open-ended tasks have many correct answers. For creative writing, summarization, or conversational responses, variance between samples isn't a bug — it's expected and even desirable. Measuring consistency on these tasks will penalize good outputs.
  • Cost scales with samples. Generating ten responses per query is ten times your inference cost. For high-volume production traffic, you need to sample strategically rather than evaluate everything.

Use self-consistency for tasks with determinate answers: factual Q&A, classification, structured extraction, arithmetic. Skip it for generative tasks where diversity is appropriate.

Constraint Satisfaction: Structural Correctness Without Labels

For many LLM use cases, correctness has a structural component that can be checked without any reference to what the "right" answer should be. This is constraint satisfaction evaluation, and it's often the highest-signal, lowest-effort technique available.

Examples of constraints you can verify programmatically:

  • Format constraints: Is the output valid JSON? Does it match a required schema? Does it contain the expected fields?
  • Length constraints: Is the summary under 150 words? Is the generated code within the expected length range?
  • Reference integrity: Does every claim cite a source that exists? Do all URLs return 200?
  • Domain-specific invariants: Does the generated SQL parse? Does the code compile? Does the regex produce valid output on the test cases?

The value here is that these checks are binary, fast, and entirely deterministic. You don't need a model to evaluate them. They catch a specific, high-impact class of failures — the model produces output that's formally broken — without any annotation burden.

The limitation is equally clear: constraint satisfaction tells you nothing about semantic quality. An output can pass every structural check and still be factually wrong, logically incoherent, or completely useless for the user's actual goal. Constraint satisfaction is necessary but not sufficient. Treat it as your first filter, not your only filter.

In practice, the right approach is to enumerate every structural property that a correct output must have, encode those as automated checks, and run them on every evaluation candidate. This takes an afternoon to set up and gives you a floor signal immediately.
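A minimal version of that checklist might look like the sketch below. The specific checks are illustrative placeholders — the required field names and the 150-word limit stand in for whatever structural properties your task actually demands.

```python
# Deterministic constraint checks: each is a pure function output -> bool.
import json

def is_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def has_required_fields(output: str, fields=("summary", "sources")) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(f in data for f in fields)

def within_word_limit(text: str, limit: int = 150) -> bool:
    return len(text.split()) <= limit

def run_checks(output: str) -> dict[str, bool]:
    """Run every structural check and return a named pass/fail map."""
    return {
        "valid_json": is_valid_json(output),
        "required_fields": has_required_fields(output),
        "length": within_word_limit(output),
    }
```

Logging the per-check pass rates, rather than a single aggregate, tells you which structural property regressed when something breaks.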

Behavioral Invariants: Testing What Shouldn't Change

Behavioral invariants are a more sophisticated technique that tests the model's internal consistency under input transformations. The intuition: if you change something about the input that shouldn't change the output, but the output changes dramatically, you've found a brittleness.

Some examples of invariants worth testing:

  • Semantic paraphrasing invariance: "What is the capital of France?" and "Can you tell me which city is France's capital?" should produce the same answer. Large divergence indicates the model is responding to surface form rather than meaning.
  • Negation consistency: If you ask "Is X true?" and "Is X false?" separately, the model's answers should be logically consistent. Many models fail this.
  • Instruction robustness: Changing formatting details in a system prompt (newlines, capitalization, label phrasing) while preserving semantics should not significantly alter the output for factual tasks.
  • Order invariance where applicable: For retrieval-augmented tasks, the answer to a question shouldn't flip based on whether supporting evidence appears at the start or end of a long context.

Behavioral invariant testing serves a different purpose than factuality checking. It's measuring robustness and reliability rather than correctness. A model that produces different answers to semantically identical questions isn't just unreliable — it's likely to degrade in production as query phrasing varies across your user base.

The practical limitation is that you need to generate the invariant pairs yourself, and it's not always obvious which invariants matter for your specific task. Invariant testing also doesn't tell you which answer is right — just that the model is inconsistent. For factual tasks, you need an additional signal to distinguish "inconsistent and wrong" from "inconsistent but sometimes right."
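A minimal harness for the paraphrase invariant might look like the following sketch. The `model` callable and the hand-written pair list are stand-ins for your real inference call and test suite, and exact-match comparison is the crudest possible equivalence check — for free-form answers you would want semantic matching instead.

```python
# Behavioral-invariant regression test over hand-written prompt pairs.
def paraphrase_invariant(model, a: str, b: str) -> bool:
    """Two semantically identical prompts should yield the same
    normalized answer."""
    return model(a).strip().lower() == model(b).strip().lower()

INVARIANT_PAIRS = [
    ("What is the capital of France?",
     "Can you tell me which city is France's capital?"),
]

def run_suite(model) -> list[tuple[str, str]]:
    """Return the invariant pairs the model fails on."""
    return [(a, b) for a, b in INVARIANT_PAIRS
            if not paraphrase_invariant(model, a, b)]
```

Run the suite before every model or prompt change; a pair moving from pass to fail is exactly the change-detection signal the section describes.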

Model-Graded Rubrics: LLM-as-Judge Without Gold Labels

The most flexible annotation-free technique is using a strong LLM to evaluate outputs from your production model against a custom rubric. This is commonly called "LLM-as-judge" and has become a standard component of evaluation pipelines.

The G-Eval framework provides a practical template:

  1. Write a rubric that specifies the criteria for a good output in your task (coherence, factual consistency with the provided context, relevance to the query, instruction following).
  2. Ask the judge LLM to generate a chain-of-thought evaluation against those criteria.
  3. Score on a numeric scale (typically 1–5), using token probability weighting for continuous scoring rather than categorical output.

When done well, LLM judges correlate strongly with human judgments on many tasks. They're especially useful for assessing qualities that don't have formal definitions — whether a response is helpful, whether a summary captures the most important points, whether the tone is appropriate.

The failure modes, however, are severe enough that you should not use this technique uncritically:

Length bias is pervasive. Longer outputs reliably score higher across most judge models, even when length adds no value. An output that's twice as long can score 20–30% higher despite being semantically equivalent or even worse. Always control for length in your rubric and in your prompt design.

Position bias affects pairwise comparisons. When asking a judge to compare two outputs, the one presented first wins more often than chance would predict. If you're doing A/B comparisons, you need to randomize order and average scores across both orderings.

Domain expertise gaps are significant. Research shows LLM judges agree with subject matter experts only 64–68% of the time in specialized domains like medicine, law, and mental health. For general tasks the agreement is higher, but the gap remains. Do not use LLM judges as your sole evaluation signal in high-stakes or specialized domains.

Factual verification is not possible without reference material. A judge model cannot reliably distinguish a confident hallucination from a correct claim. Fluent, well-structured false claims often score higher than correct but awkwardly phrased outputs. LLM judges measure presentation quality more reliably than factual accuracy.

Agreeableness bias inflates scores. Most judge models have a tendency toward approval rather than criticism. Scores drift upward over time, especially when the evaluated outputs look similar to the judge's training distribution.

To mitigate these biases in practice: include explicit rubric language that penalizes unnecessary length; always run pairwise comparisons in both orders; calibrate your judge by running it against a small set of human-labeled examples before trusting it at scale; never use it as your sole signal in domains where factuality matters.
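The both-orders mitigation for position bias can be sketched as follows, assuming a hypothetical `pairwise_judge(a, b)` that returns a 0–1 preference for its first argument:

```python
# Order-debiased pairwise comparison: judge both orders and average,
# so a constant first-position bonus cancels out.
def debiased_preference(pairwise_judge, a: str, b: str) -> float:
    """Average preference for `a` across both presentation orders."""
    forward = pairwise_judge(a, b)         # a shown first
    backward = 1.0 - pairwise_judge(b, a)  # b shown first; invert
    return (forward + backward) / 2
```

If the judge adds a fixed bonus to whichever output it sees first, the bonus appears with opposite sign in the two orderings and averages away; order-dependent biases beyond a constant offset still require randomization across many queries.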

Combining Techniques: A Week-One Evaluation Stack

No single annotation-free technique is sufficient on its own. The practical approach is to layer them, with each one catching a different class of failure:

Layer 1 — Deterministic checks (constraint satisfaction): Runs on every output. Catches format failures, schema violations, and structural errors. Zero inference cost beyond your production call. Set this up first.

Layer 2 — Consistency sampling (self-consistency): Runs on a random sample, perhaps 5–10% of traffic, for factual tasks. Generate three to five additional outputs and flag high-variance queries for human review. Use the variance distribution to track changes over time.

Layer 3 — Behavioral regression tests (invariants): A fixed test suite of 50–100 invariant pairs, run on every model or prompt change before deployment. This is your change-detection layer.

Layer 4 — LLM judge scoring (rubric evaluation): Runs on a random sample of outputs, or on all outputs for lower-volume use cases. Use a strong judge model with a calibrated rubric. Review judge outputs manually each week until you trust the calibration.

This stack gives you four independent failure detectors, each with different sensitivity and different failure modes. When multiple layers flag the same output, confidence in the quality signal increases substantially. When layers disagree, that's a signal worth investigating manually.

Where the Methods Fail Together

Even the combined stack has a shared blind spot worth naming explicitly: it cannot detect a model that consistently produces the same plausible-looking wrong answer.

Self-consistency gives you high confidence. Constraint satisfaction passes. Behavioral invariants show no variance. The LLM judge scores it highly. And the output is still factually wrong — consistently, convincingly, and at scale.

This class of failure — confident, consistent hallucination — is what ultimately requires ground truth to catch. The annotation-free methods can tell you when something is inconsistent, structurally broken, or obviously wrong by some formal criterion. They cannot tell you when the model has learned a false belief that it expresses with high confidence.

This is the honest ceiling of annotation-free evaluation: it measures reliability and formal correctness, not truth. For applications where truth matters — medical information, legal documents, financial data, scientific claims — you will eventually need human expert annotation, external knowledge base comparison, or execution-based verification. The annotation-free stack buys you time and catches a large fraction of issues. It does not eliminate the need for ground truth.

Starting Before You Have Perfect Infrastructure

The temptation when reading about evaluation frameworks is to feel like you need to build everything before you can learn anything. That's wrong. Start with one layer.

Pick the constraint satisfaction check most relevant to your task — does the output parse as expected, does it contain the required fields, is it within the expected length range — and instrument it today. Track the pass rate over time. That single number, even if imperfect, is more useful than zero evaluation, which is what most teams have in week one.

Add the other layers as you scale. The important thing is to establish the habit of measuring early, when you still have the ability to investigate what you find and correlate it with your own intuitions about quality. Ground truth doesn't arrive on day one. Your intuitions, your task specification, and your ability to run cheap automated checks do.

Annotation-free evaluation is not evaluation theater. It is a first approximation, honestly labeled, that prevents the worst class of failure: shipping changes with no idea whether quality went up or down.

Building an LLM evaluation pipeline? See also AI Product Evals: Why Your Test Suite Is Probably Lying to You and Judge Model Independence: Avoiding Eval Blind Spots.
