LLM Evals: What Actually Works and What Wastes Your Time

10 min read
Tian Pan
Software Engineer

Most teams building LLM applications fall into one of two failure modes. The first is building no evals at all and shipping features on vibes. The second is building elaborate evaluation infrastructure before they understand what they're actually trying to measure. Both are expensive mistakes.

The teams that do evals well share a common approach: they start by looking at data, not by building systems. Error analysis comes before evaluation automation. Human judgment grounds the metrics before any automated judge is trusted. And they treat evaluation not as a milestone to cross but as a continuous discipline that evolves alongside the product.

This is what evals actually look like in practice — the decisions that matter, the patterns that waste effort, and the tradeoffs that aren't obvious until you've been burned.

Error Analysis Comes Before Evaluation Infrastructure

The most common mismatch in LLM development is building evaluators before you know what failures look like. It sounds methodical to write evals early — test-driven development for AI systems. In practice, LLMs have near-infinite failure surfaces. You cannot enumerate what will go wrong in a prompt-response system ahead of time.

The right sequence is:

  1. Collect 100+ real user interactions or traces from a staging environment
  2. Have domain experts read through them and write open-ended notes about failures (open coding)
  3. Group similar failures into categories (axial coding)
  4. Iterate until the taxonomy stabilizes
  5. Then write evaluators targeting the failure categories you discovered
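The open-coding-to-taxonomy loop above can be sketched as a tiny annotation workflow. The record fields and category names here are hypothetical, not from any particular tool:

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical trace annotation record -- field names are illustrative.
@dataclass
class Annotation:
    trace_id: str
    note: str          # open-ended failure note (open coding)
    category: str = "" # assigned later when notes are grouped (axial coding)

def failure_taxonomy(annotations: list[Annotation]) -> Counter:
    """Count traces per failure category once axial coding has grouped notes."""
    return Counter(a.category for a in annotations if a.category)

# Example: three annotated traces collapsing into two failure categories
notes = [
    Annotation("t1", "ignored dietary restriction", "constraint_violation"),
    Annotation("t2", "invented a nonexistent ingredient", "hallucination"),
    Annotation("t3", "dropped the user's allergy note", "constraint_violation"),
]
print(failure_taxonomy(notes).most_common())
# the most frequent categories become the first evaluator targets
```

The point of the structure is the ordering: categories exist only after notes do, and evaluators exist only after categories stabilize.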

This approach produces evaluators that measure real problems in your application. Abstract quality frameworks produce numbers that look healthy while hiding the specific failure modes your users encounter.

One important rule during error analysis: focus on the first upstream failure in each trace. Downstream errors in a multi-step system often cascade from a single root cause. Fix the root, and several downstream problems resolve automatically without dedicated evaluators.

A practical implication: budget 60-80% of your AI development time for error analysis and evaluation, not for model selection or infrastructure. Most teams underinvest here and then can't explain why their product feels wrong even though their metrics look fine.

The Benchmark Score Trap

Top models now exceed 90% on HumanEval and GSM8K. These benchmarks are saturated and have largely stopped being useful for model selection decisions. A model that ranks near the top of any public leaderboard may still fail systematically on the specific inputs your application generates.

The reason is data contamination. Models trained on massive web corpora have often seen test questions during training. Static question sets amplify leakage risk significantly — scores reflect memorization as much as capability. Some organizations now refresh benchmark datapoints monthly specifically to counter this.

The deeper problem is that benchmark performance doesn't transfer. Production inputs are messier, more contextual, and more diverse than curated evaluation sets. A recipe assistant that achieves high scores on culinary knowledge benchmarks can still fail on ingredient parsing, calorie estimation, or handling ingredient substitutions — components that no benchmark directly targets.

The fix is domain-specific evaluation built from real traces. Generic metrics like BLEU, ROUGE, or BERTScore measure abstract properties with no diagnostic value for specific applications. They're useful as exploration signals to surface interesting traces for human review. They're not useful as quality indicators for production decisions.

Binary Pass/Fail Beats Likert Scales

A consistent finding across evaluation research: 1-5 rating scales produce worse evaluations than binary pass/fail judgments.

The failure mode is predictable. Annotators default to middle values (3/5) to avoid hard decisions. This compresses variance, obscures real quality differences, and requires larger sample sizes to extract signal. A binary judgment forces the annotator to commit: does this output meet the bar or not?

Binary evals are faster, achieve higher annotator agreement, and require smaller sample sizes for statistical significance. They also make the quality criterion more concrete. A 1-5 scale requires defining five distinguishable quality levels. A pass/fail judgment requires defining one threshold — which is a harder but more valuable exercise.

If you find yourself designing Likert scales for evaluation, that's usually a sign the quality criterion isn't well-defined yet. Binary judgments surface ambiguity earlier.

One corollary: aim for a pass rate around 70%, not 100%. A 100% pass rate signals that your evals aren't testing hard enough. A 70% rate with clear failure categories means you've built evaluators that actually discriminate.
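A binary evaluator is just a predicate plus a pass rate. A minimal sketch, where the criterion (the answer must mention every required ingredient) is a hypothetical stand-in for a real, domain-specific rubric:

```python
# Minimal binary evaluator: one pass/fail threshold, no Likert scale.
# The "required ingredients" criterion is a hypothetical example rubric.
def passes(output: str, required: list[str]) -> bool:
    return all(term.lower() in output.lower() for term in required)

def pass_rate(outputs: list[str], required: list[str]) -> float:
    results = [passes(o, required) for o in outputs]
    return sum(results) / len(results)

outputs = [
    "Use flour, eggs, and milk.",
    "Use flour and milk.",          # missing eggs -> fail
    "Flour, eggs, milk, whisked.",
]
rate = pass_rate(outputs, ["flour", "eggs", "milk"])
print(f"pass rate: {rate:.0%}")  # prints "pass rate: 67%"
```

A rate near 70% with nameable failure categories is the healthy state; a rate pinned at 100% means the threshold is too easy to clear.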

When LLM-as-Judge Works and When It Fails

LLM-as-judge has become the dominant approach for scaling evaluation to production traffic, where reviewing every trace by hand is impractical. It works well for a specific class of problems.

Use LLM-as-judge for:

  • Clearly defined criteria that can be operationalized precisely in a prompt
  • Reference-free production monitoring where no "correct answer" exists (chatbots, support agents)
  • RAG faithfulness and answer relevancy checks
  • Pairwise comparison during development for model or prompt selection

Research shows LLM judges achieve greater than 80% agreement with human evaluators in pairwise comparisons when implemented carefully. That's a reasonable signal for development decisions.

Avoid LLM-as-judge for:

  • Real-time guardrails — API latency is too high; use rules or embeddings instead
  • Tasks requiring immediate in-generation validation
  • High-stakes decisions in expert domains without labeled validation data

The biggest gap is in specialized domains. In dietetics, mental health, and medical contexts, LLM judges agree with human domain experts only 60-70% of the time. For creative writing, agreement drops to about 58%. These tasks require human review; LLM judges provide noise, not signal.

Several systematic biases degrade LLM judges further if not corrected:

  • Position bias: Favoring whichever response appears first in pairwise comparisons. Mitigate by randomizing order and averaging across both orderings.
  • Verbosity bias: Preferring longer responses regardless of quality.
  • Self-enhancement bias: Favoring outputs from models in the same family, confirmed at NeurIPS 2024.
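The position-bias mitigation above (randomize order, average both orderings) is mechanical enough to show directly. This sketch uses a toy judge with a hard-coded bias in place of a real model call; the function names are hypothetical:

```python
def debiased_preference(a: str, b: str, judge) -> float:
    """P(A better than B), averaged over both presentation orders."""
    p_ab = judge(a, b)        # A shown first
    p_ba = 1.0 - judge(b, a)  # B shown first; invert so it's still P(A better)
    return (p_ab + p_ba) / 2  # averaging cancels the position bias

# Toy judge with a hard-coded position bias: whatever sits in the first
# slot gets a +0.2 bump. A real judge would be an LLM API call.
def biased_judge(first: str, second: str) -> float:
    base = 0.5        # the two responses are actually equal in quality
    return base + 0.2 # bias toward the first slot

# One ordering alone reports 0.7; averaging both orderings recovers 0.5.
print(debiased_preference("resp_a", "resp_b", biased_judge))
```

The same double-call structure works for any order-sensitive judge; the cost is twofold inference per comparison.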

A critical operational requirement: validate your LLM judge against 50-200 labeled examples before trusting it for production decisions. An uncalibrated judge is worse than no judge — it provides confident but unreliable scores that hide real problems.
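The validation step is a straightforward agreement calculation between judge labels and human labels on the same examples. The labels below are synthetic for illustration:

```python
# Validate an LLM judge against human labels before trusting it.
# `human` and `judge` are parallel lists of binary pass/fail labels
# on the same examples (synthetic here; in practice, 50-200 of them).
def agreement(human: list[bool], judge: list[bool]) -> float:
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)

human = [True, True, False, True, False, False, True, True]
judge = [True, False, False, True, False, True, True, True]
score = agreement(human, judge)
print(f"judge-human agreement: {score:.0%}")  # prints "judge-human agreement: 75%"
# If agreement sits below your target, fix the judge prompt before
# letting its scores drive production decisions.
```

Chance-corrected measures like Cohen's kappa are a stricter alternative when one label dominates, but raw agreement is enough to catch a badly uncalibrated judge.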

Evaluation Costs Before You Build

LLM-as-judge is not free. It requires 100+ labeled examples to validate, periodic re-calibration as model versions change, and cross-functional coordination to maintain. Building an automated evaluator for a problem you'll only look at once is wasteful.

The practical filter: only build LLM-as-judge evaluators for persistent failure categories you'll iterate on repeatedly. One-off checks are better done with direct human review. Rule-based checks (format validation, keyword presence, length constraints) are cheaper and more reliable for deterministic requirements — use them first.

This leads to a useful two-tier taxonomy:

  1. Deterministic checks: Exact-match, regex, schema validation — these function like unit tests and should run on every inference
  2. Statistical evals: LLM judges, human review panels — these are sampled and monitored periodically

Mixing these in a single evaluation pipeline creates confusion. Keep them separate and track them differently.
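The deterministic tier is cheap enough to run on every inference. A sketch of two such checks, assuming a hypothetical output contract (JSON with a top-level `answer` string, no leaked email addresses):

```python
import json
import re

# Tier 1: deterministic checks, run on every inference like unit tests.
def check_json_schema(output: str) -> bool:
    """Output must be valid JSON with a top-level 'answer' string."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data.get("answer"), str)

def check_no_email_leak(output: str) -> bool:
    """No email addresses in the response (simplified pattern, illustrative)."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None

# Tier 2 (LLM judges, human panels) would be sampled and tracked
# separately, never run inline with these checks.
assert check_json_schema('{"answer": "Add 2 cups of flour."}')
assert not check_no_email_leak("contact me at a@b.com")
```

Because these checks are exact and fast, a failure here is a hard block; a tier-2 score is a trend to monitor, not a gate per request.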

A five-metric ceiling applies to most pipelines: exceeding five metrics in any single pipeline diminishes signal. Use one or two custom metrics for use-case specificity and two or three generic metrics for architecture-level concerns. More metrics don't increase coverage — they increase noise.

Evaluating Agents Is Different

Single-turn evaluation focuses on prompt-response pairs. Agent evaluation requires assessing entire trajectories — tool calls, state transitions, intermediate decisions, and multi-step reasoning across potentially dozens of steps.

The failure modes unique to agents don't show up in response quality scores:

  • Tool hallucinations (calling tools that don't exist or with wrong parameters)
  • Context loss across long trajectories
  • Multi-agent oscillation where responsibility bounces between agents without resolution
  • Memory poisoning that causes long-term behavioral drift
  • Cascading failures where a single wrong tool call triggers downstream errors

Two types of metrics are both necessary for agents: outcome metrics (was the final result correct and complete?) and process metrics (how efficiently did the agent reach that result — steps taken, tokens used, tool call accuracy, timing?). Outcome metrics alone miss systematic inefficiencies. Process metrics alone miss correctness failures.

A useful diagnostic technique is the transition failure matrix: for each trace, record the last successful state and the first failure point. Accumulate these across many traces to identify systematic failure hotspots. Early-stage failures have larger cascading impact and deserve attention first.
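The transition failure matrix amounts to counting (last successful step, first failing step) pairs across traces. A minimal sketch, with hypothetical step names standing in for real agent pipeline stages:

```python
from collections import Counter

# Each trace is a list of (step_name, succeeded) in execution order.
def failure_matrix(traces: list[list[tuple[str, bool]]]) -> Counter:
    matrix = Counter()
    for trace in traces:
        last_ok = "START"
        for step, ok in trace:
            if not ok:
                matrix[(last_ok, step)] += 1
                break  # only the FIRST failure counts; later errors cascade
            last_ok = step
    return matrix

traces = [
    [("plan", True), ("tool_call", False), ("respond", False)],
    [("plan", True), ("tool_call", False)],
    [("plan", False)],
]
print(failure_matrix(traces).most_common(1))
# the most common (last_ok, first_fail) pair is the hotspot to fix first
```

The `break` enforces the first-upstream-failure rule from error analysis: downstream errors in a failed trace are symptoms, not separate counts.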

Teams that establish baseline agent metrics before production deployment detect quality regressions roughly three times faster than teams without baselines. Hybrid automated-plus-human evaluation produces about 40% better system quality than purely automated approaches in production agent systems, based on reported Amazon findings.

Evals Are Never Done

A common mistake is treating evaluation as a one-time milestone. Build the eval suite, get passing scores, ship. This fails because the system you're evaluating doesn't stay constant.

Model providers update base models. User behavior shifts over time. Input distributions change as your product reaches new user segments. "Criteria drift" — the gradual evolution of what "good" means for your application — is expected and normal. An eval suite that was well-calibrated six months ago may be measuring stale quality criteria today.

The operational implication is that evaluation is a continuous practice, not a gate:

  • Refresh evaluation datasets as new failure modes emerge from production
  • Re-calibrate LLM judges when the underlying model changes
  • Review human-labeled examples periodically to catch criteria drift
  • Treat evaluation rubrics as living documents, versioned alongside prompts in Git

Storing prompts and evaluation criteria in version control — treated as software artifacts with review, history, and deployment — is one of the highest-leverage process improvements available to teams building on LLMs. It creates a traceable record of what quality standard was in place at any point in time, which makes debugging regressions much faster.

The Organizational Reality

Evals don't self-justify. Stakeholders want to see concrete failure modes fixed, not methodology described. The most effective path to evaluation buy-in is results-driven: show specific bugs discovered through error analysis, demonstrate improvements on high-frequency failure categories, document surprising user patterns uncovered from trace review.

Appoint a single domain expert as the definitive quality decision-maker. Annotation by committee produces inconsistent labels and slow, muddled conflict resolution. One person's calibrated judgment is more valuable than five people's averaged opinions. If a single person's judgment isn't sufficient to determine quality, that's often a signal that the product scope is too broad — not that you need more annotators.

Build custom annotation tooling before investing in evaluation infrastructure. Teams with purpose-built annotation interfaces iterate roughly ten times faster than those using generic platforms. The interface shapes how annotators think about quality; a well-designed tool embeds the evaluation criteria directly into the review workflow.

Summary

The principles that hold up across teams doing evaluation well:

  • Start with error analysis on real traces before writing any evaluators
  • Binary pass/fail judgments outperform rating scales in nearly every dimension
  • Validate LLM judges with labeled examples; don't trust them uncalibrated
  • Keep deterministic checks and statistical evals in separate tiers
  • Agents require trajectory-level evaluation — response quality scores miss agent-specific failures
  • Treat evals as a continuous discipline, not a one-time gate
  • Generic metrics create a false sense of health; domain-specific evals built from real failures create actionable signal

The teams that get real value from evaluation are the ones that treat it as a way to externalize tacit knowledge about quality — not as a checkbox or an optimization target.
