LLM-as-a-Judge: A Practical Guide to Building Evaluators That Actually Work
Most AI teams are measuring the wrong things, in the wrong way, with the wrong people involved. The typical evaluation setup looks like this: a 1-to-5 Likert scale, a handful of examples, and a junior engineer running the numbers. Then someone builds an LLM judge to automate it—and wonders why the whole thing feels broken six months later.
LLM-as-a-judge is a powerful pattern when done right. But "done right" is doing a lot of work in that sentence. This post is a concrete guide to building evaluators that correlate with real quality, catch real regressions, and survive contact with production.
Why LLM Judges Fail in Practice
Before building an LLM judge, it's worth understanding why so many teams struggle with them. Studies estimate that over 90% of teams find LLM-as-a-judge implementation harder than expected. The failure modes cluster into a few recurring patterns.
The bias problem is real and measurable. LLM judges exhibit at least a dozen documented bias types. The most impactful:
- Position bias: Simply swapping the order of two responses being compared can shift accuracy by more than 10% on code evaluation tasks. The judge isn't evaluating quality—it's evaluating position.
- Verbosity bias: Longer responses systematically score higher, independent of whether the extra length adds value.
- Familiarity bias: Judges favor outputs that pattern-match to their training distribution. GPT-4 assigns higher scores to lower-perplexity outputs regardless of actual quality. This isn't self-preference so much as stylistic familiarity.
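Position bias in particular is cheap to test for: run each comparison in both orders and flag verdicts that don't flip with the order. A minimal sketch, where `call_judge` is a hypothetical stand-in for your actual LLM judge call, stubbed here with a deliberately biased rule so the example runs:

```python
def call_judge(response_a: str, response_b: str) -> str:
    """Stub for an LLM judge call. This one always prefers whichever
    response is shown first: pure position bias, for illustration."""
    return "A"

def position_consistent(resp_1: str, resp_2: str) -> bool:
    """An order-robust judge picks the same underlying response no
    matter which slot it is presented in."""
    first = call_judge(resp_1, resp_2)   # resp_1 in slot A
    second = call_judge(resp_2, resp_1)  # resp_2 in slot A
    # Consistent only if the winning label flips when the order flips.
    return (first == "A" and second == "B") or (first == "B" and second == "A")

flagged = not position_consistent("short answer", "much longer answer")
print(flagged)  # True: this stub judges position, not quality
```

Running every comparison twice doubles judge cost, so many teams sample this check rather than applying it to every pair.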
Agreement rates drop sharply in specialized domains. On general instruction-following, LLM judges achieve 80%+ agreement with human preferences. But in healthcare, legal, and other expert domains, that drops to 60-70%. On a binary preference task, chance agreement is 50%, so that is only modestly better than guessing.
The real problem is usually the evaluation design, not the judge. A poorly specified rubric produces poor judgments whether it's a human or an LLM doing the evaluation. Teams that cargo-cult their way to LLM judges without fixing their underlying evaluation methodology end up with a faster, cheaper version of the same broken signal.
Stop Using 1-to-5 Scales
The single most impactful change you can make to your evaluation setup is switching from multi-point scales to binary pass/fail judgments.
Here's why: when you ask someone to score an output "3 out of 5," they're doing two things simultaneously—deciding whether the output is good enough, and picking a position on an arbitrary scale. The second task introduces noise that swamps the first. What does a 3 mean vs. a 4? It's not immediately obvious, and it won't be consistent across reviewers or across time.
Binary judgments force clarity. Did this output accomplish the task, or didn't it? That's a question with a defensible answer. You can train an LLM judge on binary labels far more reliably than you can train it to replicate someone's intuition about the difference between a 3 and a 4.
The key companion to binary judgments is requiring written critiques. When a reviewer marks something as failing, they must explain specifically why. This does two things: it makes the evaluation defensible, and it produces material you can actually use to improve both your judge and your underlying system.
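One way to make the critique requirement structural rather than procedural is to enforce it in the label record itself. A sketch with illustrative field names, not taken from any particular framework:

```python
# A label record that rejects a FAIL without a written critique at
# construction time. Field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class JudgmentLabel:
    example_id: str
    passed: bool
    critique: str = ""

    def __post_init__(self):
        # Binary judgment plus mandatory explanation on failure.
        if not self.passed and not self.critique.strip():
            raise ValueError("A failing label requires a specific written critique.")

ok = JudgmentLabel("ex-1", passed=True)
bad = JudgmentLabel(
    "ex-2",
    passed=False,
    critique="Recommends a dose increase without checking renal function.",
)
```

The same constraint can live in a database schema or a review UI; the point is that "fail with no explanation" should be impossible to record.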
The Critique Shadowing Process
The most reliable path to a working LLM judge is an iterative process that starts with domain experts and ends with an automated evaluator that has earned its agreement score.
Step 1: Identify the right domain expert. This is not the most senior person on the team, and not the most convenient person. It's the person with the deepest subject matter knowledge—the one who would catch a subtly wrong medical recommendation, a legal non-sequitur, or a code snippet that technically runs but violates the invariants of your system. One expert setting consistent standards beats three well-intentioned generalists producing noise.
Step 2: Build a diverse evaluation dataset. Structure it across three dimensions:
- Features: the specific capabilities you're evaluating
- Scenarios: edge cases and failure modes (ambiguous inputs, missing context, off-topic queries)
- Personas: representative users (experts, novices, non-native speakers)
Aim for 30+ examples as a minimum. Keep going until you stop seeing new failure modes. A mix of real production examples and synthetic inputs works well—use your actual system to generate outputs from LLM-crafted inputs.
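The three dimensions above amount to a coverage grid. A minimal sketch of enumerating it, with placeholder values to swap for your own:

```python
# Enumerate the feature x scenario x persona grid so no cell of the
# evaluation space is silently empty. Values below are placeholders.
from itertools import product

features = ["summarize", "answer_question"]
scenarios = ["ambiguous_input", "missing_context", "off_topic"]
personas = ["expert", "novice", "non_native_speaker"]

dataset = [
    {"feature": f, "scenario": s, "persona": p}
    for f, s, p in product(features, scenarios, personas)
]

print(len(dataset))  # 2 * 3 * 3 = 18 cells to fill with examples
```

Each cell then gets one or more concrete examples, real or synthetic, until new failure modes stop appearing.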
Step 3: Have the expert evaluate with pass/fail + written critiques. The critiques must be specific. "This response is unhelpful" is not a critique. "This response recommends increasing the dose without checking the patient's renal function, which is contraindicated for this drug class" is a critique. The specificity is what makes few-shot learning possible.
A useful side effect: the act of writing critiques forces the expert to articulate criteria they may not have consciously held. Evaluation standards often only become clear through the process of evaluating—what practitioners call "criteria drift."
Step 4: Fix pervasive failures before building the judge. If 40% of your outputs are obviously broken in the same way, fix that first. An LLM judge built on fundamentally broken outputs will learn to discriminate between degrees of broken, which isn't useful.
Step 5: Build the judge iteratively. Embed expert examples in your judge prompt and test against the expert's ground truth labels. Track precision and recall separately—raw agreement rate is misleading when your pass/fail split is imbalanced. Target >90% alignment, which typically takes two or three rounds of refinement.
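Why raw agreement misleads on an imbalanced split is easy to see with toy numbers: a judge that passes everything agrees with the expert 90% of the time while catching zero real failures. Treating "fail" as the positive class, since failures are the signal you care about:

```python
# Toy labels: True = pass. 90% of outputs genuinely pass, and the
# judge rubber-stamps everything.
expert = [True] * 18 + [False] * 2
judge  = [True] * 20

agreement = sum(e == j for e, j in zip(expert, judge)) / len(expert)

# Count with "fail" as the positive class.
tp = sum((not e) and (not j) for e, j in zip(expert, judge))  # failures caught
fp = sum(e and (not j) for e, j in zip(expert, judge))        # false alarms
fn = sum((not e) and j for e, j in zip(expert, judge))        # failures missed

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

print(agreement)  # 0.9: looks fine
print(recall)     # 0.0: the judge never catches a real failure
```

This is why the >90% alignment target should be read alongside per-class precision and recall, not in place of them.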
Step 6: Perform error analysis. Calculate failure rates broken down by feature, scenario, and persona. Classify failure root causes manually. This is where the real insight lives—not in the aggregate accuracy number, but in understanding which inputs the judge gets wrong and why.
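A sketch of the per-dimension breakdown, with a few illustrative result rows standing in for real labeled evaluation runs:

```python
# Failure rate per value of a chosen dimension (feature, scenario,
# or persona). The rows below are illustrative toy data.
from collections import defaultdict

results = [
    {"feature": "summarize", "scenario": "ambiguous_input", "persona": "novice", "passed": False},
    {"feature": "summarize", "scenario": "missing_context", "persona": "expert", "passed": True},
    {"feature": "answer_question", "scenario": "ambiguous_input", "persona": "novice", "passed": False},
    {"feature": "answer_question", "scenario": "off_topic", "persona": "expert", "passed": True},
]

def failure_rates(rows, dimension):
    totals, failures = defaultdict(int), defaultdict(int)
    for row in rows:
        key = row[dimension]
        totals[key] += 1
        failures[key] += not row["passed"]
    return {k: failures[k] / totals[k] for k in totals}

print(failure_rates(results, "scenario"))
# ambiguous_input fails 2/2 here; that cluster is where manual
# root-cause classification should start
```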
Step 7: Create specialized judges only if needed. If your error analysis reveals that your judge systematically fails on a specific subset of inputs, build a specialized judge for that subset. Don't do this preemptively.
Production Architecture: Tiered Evaluation
The best evaluation systems use a tiered architecture that balances coverage, cost, and accuracy:
Tier 1 — Automated LLM judge: Handles 80-90% of evaluation volume. Fast, cheap, runs in CI/CD. Catches obvious regressions before they reach production.
Tier 2 — Multi-judge consensus: For uncertain cases and boundary decisions, run multiple judges and require agreement. Multi-judge consensus can push Cohen's Kappa into the 0.95 range—far more reliable than single-judge evaluation.
Tier 3 — Human review: Reserved for high-stakes decisions, novel failure modes, and periodic calibration of the automated layers. The output of human review feeds back into your expert dataset.
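The routing between the three tiers can be sketched in a few lines, under the assumption that each tier-1 judgment arrives with a confidence score and a stakes flag set upstream. Names and thresholds are illustrative:

```python
# Route each evaluation to the cheapest tier that can handle it.
# The 0.7 confidence cutoff is a placeholder to tune empirically.

def route_evaluation(confidence: float, high_stakes: bool) -> str:
    if high_stakes:
        return "tier3_human_review"   # always escalate high-stakes calls
    if confidence < 0.7:
        return "tier2_multi_judge"    # uncertain: require consensus
    return "tier1_automated"          # the 80-90% fast path

print(route_evaluation(0.95, high_stakes=False))  # tier1_automated
print(route_evaluation(0.55, high_stakes=False))  # tier2_multi_judge
print(route_evaluation(0.99, high_stakes=True))   # tier3_human_review
```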
This architecture also supports continuous monitoring. Track agreement rates between tiers over time. When Tier 1 and Tier 2 disagree more than expected, that's a signal your judge has drifted from your ground truth. When humans and your automated system consistently disagree on specific input types, you've found a gap in your evaluation criteria.
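Cohen's kappa is the right agreement statistic for this monitoring because it corrects for the agreement two raters would reach by chance on a skewed pass/fail split. A self-contained sketch for binary labels:

```python
# Cohen's kappa for two binary raters (1 = pass, 0 = fail).
# kappa = (observed agreement - expected-by-chance) / (1 - expected).

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # rate at which rater a says "pass"
    p_b = sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

tier1 = [1, 1, 1, 0, 1, 0, 1, 1]  # toy judgments from the fast tier
tier2 = [1, 1, 1, 0, 1, 1, 1, 1]  # toy consensus judgments
print(round(cohens_kappa(tier1, tier2), 2))  # 0.6
```

Tracked over time, a falling kappa between tiers is the drift signal described above, even when raw agreement still looks acceptable.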
Choosing Your Judge Model
No single model is the best judge across all tasks. The choice depends on your constraints:
- GPT-4 class models: Well-studied baseline, strong cost/speed balance for short evaluations, widely used as a reference point in research benchmarks.
- Claude: Prompt caching advantages make it cost-effective when your rubric is long and repeated across many evaluations. Particularly useful for policy-heavy or compliance-adjacent criteria.
- Gemini: Long context window reduces chunking overhead for evaluations that require comparing across long documents, transcripts, or multi-turn conversations.
The model you use in production doesn't have to match the model you use for evaluation. Many teams run a more capable model as judge even when the production system uses a cheaper model—the quality delta can justify the cost.
One caution: the ranking of which model performs best on a given task is not consistently preserved when using LLM judges. A judge built with GPT-4 may produce different rankings than a judge built with Claude, even when both are well-calibrated. This matters if you're using LLM judges to compare models against each other.
The One Rule You Can't Skip
Even if you take shortcuts everywhere else—even if you're the domain expert yourself, even if your evaluation data is already well-structured—there is one rule that has no exception: look at your data.
The business value of an evaluation pipeline is not the automated judge. It's the understanding you develop about your system's failure modes. A team that looks at 200 examples and manually classifies 50 failures learns things that no dashboard can tell them. The LLM judge is infrastructure for scaling that understanding, not a replacement for it.
Aggregate metrics drift over time. Prompts change. User behavior shifts. The inputs that were rare last quarter become common this quarter. The teams that maintain high evaluation quality are the ones that regularly re-examine their raw examples, update their criteria, and treat evaluation as an ongoing practice rather than a one-time artifact.
Getting Started
If you're building your first LLM judge, the path is simpler than it might seem:
- Pick the single most important thing your system needs to get right
- Find the person who knows best when it gets that wrong
- Have them evaluate 50 examples with written critiques
- Build a judge that reproduces their judgments at >90% agreement
- Run it in CI/CD before every deployment
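The last step can be wired into CI as a simple gate, assuming you have a stored expert label set and a fresh judge run to compare; toy lists stand in for both here:

```python
# Fail the build when judge/expert agreement drops below 90%.
# In a real pipeline the two lists come from your stored expert set
# and a judge run against the same examples.

AGREEMENT_THRESHOLD = 0.90

def passes_gate(expert_labels, judge_labels, threshold=AGREEMENT_THRESHOLD):
    agreement = sum(e == j for e, j in zip(expert_labels, judge_labels)) / len(expert_labels)
    return agreement >= threshold

expert = [True] * 40 + [False] * 10
aligned_judge = [True] * 40 + [False] * 9 + [True]  # 49/50 agreement
drifted_judge = [True] * 50                         # misses every failure

print(passes_gate(expert, aligned_judge))  # True: deploy proceeds
print(passes_gate(expert, drifted_judge))  # False: build blocked
```

In practice this runs as a test step in your CI config, with a nonzero exit code blocking the deploy.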
Everything else—specialized judges, tiered architectures, calibration pipelines—follows from having that baseline working. Get the first judge right before adding complexity.
The teams that get the most value from LLM evaluation are not the ones with the most sophisticated infrastructure. They're the ones that started with one well-calibrated judge and a habit of actually reading their failure cases.
