Why Your LLM Evaluators Are Miscalibrated — and the Data-First Fix

· 9 min read
Tian Pan
Software Engineer

Most teams build their LLM evaluators in the wrong order. They write criteria, then look at data. That inversion is the root cause of miscalibrated evals, and it's almost universal in teams shipping their first AI product. The criteria sound reasonable on paper — "the response should be accurate, helpful, and concise" — but when you apply them to real model outputs, you discover the rubric doesn't match what you actually care about. You end up with an evaluator that grades things you're not measuring and misses failures that matter.

The fix isn't a better rubric. It's a different workflow: look at the data first, define criteria second, and then validate your evaluator against human judgment before trusting it to run unsupervised.

The Criteria-First Trap

The standard eval setup goes like this: a team defines pass/fail rules ("the response must not contradict the source document"), writes a prompt for an LLM judge, and runs it across their test set. If agreement with human labels looks passable — say, 75% — they ship the evaluator.

The problem is that 75% agreement is meaningless without knowing which 25% it gets wrong. LLM evaluators tend to fail systematically on exactly the cases you care most about: ambiguous outputs, borderline hallucinations, and responses that are technically correct but practically useless. These aren't random errors; they cluster around failure modes that aren't visible until you've read hundreds of actual outputs.

Writing criteria before you've read the data means writing criteria for the model you imagined, not the model you have. Evaluation criteria defined in isolation lead to two failure modes: they're either too abstract (the evaluator can't apply them consistently) or too narrow (they miss entire categories of failure because those categories weren't visible at definition time).

There's also a subtler issue: the criteria often reflect what the team wants the model to do rather than what it actually does. When you run those criteria against real outputs, you don't learn whether the model is good — you learn whether your mental model of the model was accurate.

Data-First: What It Looks Like in Practice

The alternative workflow flips the sequence. Before writing a single line of evaluation criteria, you spend time labeling real model outputs — typically 20 to 50 samples — as pass or fail. No rubric, just gut judgment applied to actual data. This forces you to encounter the specific failure modes your model has, not the ones you anticipated.

After labeling, patterns emerge. You notice the model consistently hedges on factual claims in a way that's technically defensible but practically unhelpful. Or it formats lists correctly but uses the wrong level of formality. Or it gets the right answer 80% of the time but confidently gives wrong answers in a predictable slice of edge cases. These patterns become the foundation of your criteria — grounded in evidence rather than theory.

This approach has a meaningful side effect: it dramatically reduces wasted effort. Teams that write criteria upfront often spend hours refining a rubric, only to discover after labeling that the rubric measures something their model never actually struggles with, while missing the failure mode that actually matters. Starting from data makes the criteria a compression of real observations rather than speculation.

The labeling step also forces alignment within the team. Different people often have different intuitions about what "good" looks like. Surfacing those differences through concrete examples is far more effective than debating abstract criteria. You either agree on specific outputs or you don't — and if you don't, you've found the actual disagreement before it becomes a calibration problem in your evaluator.
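Surfacing those disagreements can be as simple as diffing two reviewers' binary labels on the same sample. A minimal sketch — the labels below are hypothetical stand-ins for two reviewers independently marking the same batch of outputs:

```python
# Surface labeler disagreement on a shared sample of outputs.
# These labels are hypothetical; in practice they come from two
# reviewers independently judging the same 20-50 model outputs.
reviewer_a = [1, 1, 0, 1, 0, 0, 1, 1]  # 1 = pass, 0 = fail
reviewer_b = [1, 0, 0, 1, 0, 1, 1, 1]

disagreements = [i for i, (a, b) in enumerate(zip(reviewer_a, reviewer_b)) if a != b]
agreement_rate = 1 - len(disagreements) / len(reviewer_a)

print(f"Raw agreement: {agreement_rate:.0%}")
print(f"Review together: samples {disagreements}")  # concrete cases to discuss
```

The disagreement indices are the agenda for the calibration discussion: you argue about those specific outputs, not about abstract wording in a rubric.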

Binary Labels Beat Complex Scales

One of the least intuitive but most empirically supported findings in eval design is that binary labels — pass or fail, good or bad — outperform multi-point scales for most production use cases.

The argument for 5-point or 7-point scales is that they're more expressive. A response that's mediocre shouldn't get the same score as one that's actively harmful. But in practice, humans applying multi-point scales spend most of their cognitive effort on the boundary between adjacent ratings. Is this a 3 or a 4? That deliberation reduces throughput and introduces inconsistency. Two reviewers often agree that something is "bad" but disagree on whether it's a 2 or a 3.

Binary labels eliminate this boundary problem. Either the output meets your standard or it doesn't. This makes labeling faster, more consistent, and easier to validate. It also makes your evaluator simpler to build and easier to interpret. An F1 score comparing binary predictions against binary ground truth is far more actionable than trying to interpret inter-rater correlation on a Likert scale.

The tradeoff is information density. Binary labels can't distinguish between "barely passing" and "excellent," which matters if you're trying to improve a model that's already above the pass threshold. In practice, most teams are nowhere near that problem — they're trying to catch clear failures, not rank acceptable responses. Binary is the right default until you've exhausted what it can tell you.

Validating Your Evaluator Before Trusting It

Building an LLM-as-judge evaluator is straightforward. Validating it is the part most teams skip, and it's where the real work lives.

The validation workflow: take your labeled dataset, split it into a development set and a held-out test set, then run your LLM evaluator against the development set. Track precision, recall, F1, and Cohen's kappa — not just overall agreement. Kappa matters because it corrects for the agreement you'd expect by chance, which is especially important for imbalanced datasets where one label dominates.
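These metrics are cheap to compute by hand for binary labels. A minimal sketch in pure Python (no library dependencies), treating 1 as pass and 0 as fail:

```python
from collections import Counter

def binary_eval_metrics(truth, pred):
    """Precision, recall, F1, and Cohen's kappa for binary labels (1 = pass)."""
    assert len(truth) == len(pred) and len(truth) > 0
    n = len(truth)
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Cohen's kappa: observed agreement corrected for the agreement
    # you'd expect by chance, given each side's label distribution.
    observed = sum(1 for t, p in zip(truth, pred) if t == p) / n
    t_counts, p_counts = Counter(truth), Counter(pred)
    expected = sum(t_counts[c] * p_counts[c] for c in (0, 1)) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return {"precision": precision, "recall": recall, "f1": f1, "kappa": kappa}
```

The imbalance point is easy to demonstrate: an evaluator that predicts "pass" for everything on a mostly-passing dataset scores 80% raw agreement but a kappa of exactly zero — it carries no information beyond the base rate.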

Then optimize. Adjust the evaluation criteria, the model temperature, the prompt structure, or the model itself. Run multiple trials on the development set. When you think you have a good evaluator, validate against the test set — once, cold. If test performance drops significantly from development performance, you've overfit to the dev set distribution.
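The split itself should be boring and reproducible, so that "validate against the test set once" is actually enforceable. A sketch with a fixed seed — the 70/30 fraction and seed value are illustrative defaults, not a recommendation:

```python
import random

def split_labeled_set(samples, test_fraction=0.3, seed=42):
    """Split labeled samples into a dev set (for iteration) and a
    held-out test set that is touched exactly once, at the end."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed: the split is reproducible
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]

dev, test = split_labeled_set(range(50))
```

Persist the split (or at least the seed) alongside the dataset; if the split is regenerated differently on each run, the test set silently stops being held out.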

This last step is where most teams learn an uncomfortable lesson. A small development set — even 20 samples — can lead to evaluators that perform well on dev but fail to generalize. The failure isn't random; it's that the dev set captured a specific subset of cases, and the test set surfaced a different subset the evaluator never saw. This is the evaluation equivalent of training set memorization.

The practical implication: treat your eval development set as a design artifact that requires the same diversity attention as your main training or fine-tuning data. Under-represented edge cases in the dev set become blind spots in your evaluator.

When to Use Ensemble Evaluators

A single LLM judge is convenient but brittle. It has a fixed perspective on what "good" looks like, it's sensitive to prompt phrasing, and it can be confidently wrong on categories of inputs it hasn't seen in context. Teams that have run evals at scale consistently find that multiple smaller evaluators, each focused on a specific criterion, outperform a single large evaluator trying to assess everything at once.

The practical design: decompose your evaluation into independent dimensions. Factual accuracy is a separate judgment from response format compliance, which is separate from tone appropriateness. Build one evaluator per dimension using a fast, cheap model — gpt-4o-mini or claude-3-haiku work well for most structured evaluation tasks. Aggregate the results downstream.
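Structurally, this is a map of dimension names to judge functions plus an aggregation rule. A minimal sketch — the judge bodies here are hypothetical placeholder checks standing in for narrow, dimension-specific LLM calls, not a real API:

```python
# Decomposed-evaluator pattern: one narrow judge per dimension.
# Each judge below is a hypothetical placeholder for an LLM call
# with a dimension-specific prompt.
def judge_accuracy(output: str) -> bool:
    return "unverified" not in output          # placeholder check

def judge_format(output: str) -> bool:
    return output.strip().startswith("-")      # placeholder: expects a bullet list

def judge_tone(output: str) -> bool:
    return "!" not in output                   # placeholder: no exclamations

JUDGES = {"accuracy": judge_accuracy, "format": judge_format, "tone": judge_tone}

def evaluate(output: str) -> dict:
    """Run every dimension-specific judge; report per-dimension results."""
    results = {name: judge(output) for name, judge in JUDGES.items()}
    results["overall"] = all(results.values())  # pass only if every dimension passes
    return results
```

The per-dimension results are the point: when `evaluate` fails an output, the dict tells you whether it was accuracy, format, or tone, and you can swap out a single judge without touching the others.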

This architecture has compounding benefits. When an evaluator fails, you know exactly which dimension it failed on. You can replace or retrain a single component without disrupting the others. And the specialized prompt for each dimension tends to be more consistent than a general prompt trying to capture everything.

The main cost is overhead. More evaluators means more API calls and more prompt maintenance. But the precision gains typically justify it for any system running evals in production at meaningful volume.

Making Eval Infrastructure a First-Class Product

The deepest mistake teams make with evals is treating them as a one-time setup task rather than ongoing infrastructure. You define criteria, build an evaluator, validate it, and declare eval done. Then the model changes, the use cases drift, and the evaluator quietly becomes irrelevant.

Eval infrastructure needs the same continuous investment as the product it measures. This means:

  • Version your labeled datasets. Ground truth labels are the most durable asset in your eval system. Store them with the metadata to reconstruct how and why they were created.
  • Add new examples when you find new failures. Every production incident is an eval case waiting to be captured. If your eval suite had caught the failure, it belongs in the suite now.
  • Re-validate evaluators when the underlying model changes. An evaluator calibrated against one model version may not transfer cleanly to the next.
  • Monitor evaluator agreement over time. If human reviewers start disagreeing with your LLM evaluator at higher rates, the criteria have drifted or the model behavior has changed.
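A versioned ground-truth record can be lightweight. A sketch of one possible schema — the field names are illustrative, not a standard — using a content-addressed id so duplicate examples are detectable across dataset versions:

```python
import hashlib
import json
from dataclasses import asdict, dataclass

@dataclass
class LabeledExample:
    """One ground-truth label plus the metadata to reconstruct its origin.
    Field names are illustrative, not a standard schema."""
    input_text: str
    output_text: str
    label: int            # 1 = pass, 0 = fail
    labeler: str          # who made the judgment
    model_version: str    # which model produced the output
    source: str           # e.g. "production-incident" or "initial-labeling"

    def record_id(self) -> str:
        # Content-addressed id: identical examples always hash the same,
        # which makes duplicates easy to detect across dataset versions.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

ex = LabeledExample("What is 2+2?", "4", 1, "reviewer-1", "model-v3", "initial-labeling")
print(ex.record_id())
```

Storing `model_version` and `source` per record is what makes the re-validation and incident-capture bullets above actionable: you can filter the dataset to "labels created against the old model" or "examples that came from production failures" without archaeology.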

The teams shipping reliable AI products in 2026 aren't the ones with the best base models. They're the ones with the most disciplined feedback loops — and that discipline starts with evals that are calibrated to reality, not to expectations.

The Bottom Line

If your LLM evaluators were defined before you'd read the data, they're probably measuring something slightly different from what you care about. The fix is uncomfortable because it requires doing things in the right order: label first, define second, validate against held-out data third.

Binary labels will feel like a step down in expressiveness. They're not — they're a step up in reliability. Multiple specialized evaluators will feel like over-engineering. They're not — they're the difference between an evaluator that catches real failures and one that gives you false confidence.

Building eval tooling is engineering work. It deserves the same rigor — iteration, validation, versioning — that you'd apply to any production system. The teams that treat it that way are the ones whose evals actually improve the product over time.
