Eval Engineering for Production LLM Systems
Most teams building LLM systems start with the wrong question. They ask "how do I evaluate this?" before understanding what actually breaks. Then they spend weeks building eval infrastructure that measures the wrong things, achieve 90%+ pass rates immediately, and ship products that users hate. The evaluations weren't broken; they just weren't measuring the failures that mattered.
Effective eval engineering isn't primarily about infrastructure. It's about developing a precise, shared understanding of what "good" means for your specific system. The infrastructure is almost incidental. In mature LLM teams, 60–80% of development time goes toward error analysis and evaluation—not feature work. That ratio surprises most engineers until they've shipped a broken model to production and spent a week debugging what went wrong.
The Error Analysis Loop Comes First
Before you write a single evaluator, you need to watch your system fail. The sequence matters: error analysis precedes automation, always.
The four-step process that works looks like this. First, gather 100+ representative traces from production: real user interactions, not synthetic examples constructed to test the happy path. Second, have a domain expert review these traces and write unstructured notes about what's wrong. Don't categorize yet; just observe. Third, group the observations into a failure taxonomy and count occurrences. This tells you what to prioritize: if 40% of failures are hallucinations and 5% are tone issues, your evaluation infrastructure should reflect that ratio. Fourth, repeat the review until you stop seeing new failure modes; only then are you ready to automate.
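Step three can be sketched in a few lines; the trace IDs and failure categories below are hypothetical placeholders for whatever your expert's notes produce:

```python
# Turning free-form annotations into a ranked failure taxonomy.
# Trace IDs and category names are illustrative, not from any real system.
from collections import Counter

# After open-ended review, each trace's notes get mapped to zero or more
# failure categories.
annotations = {
    "trace-001": ["hallucination"],
    "trace-002": ["hallucination", "missing_citation"],
    "trace-003": [],  # no failure observed
    "trace-004": ["tone"],
    "trace-005": ["hallucination"],
}

counts = Counter(cat for cats in annotations.values() for cat in cats)
total_failures = sum(counts.values())

# Rank failure modes by frequency; this ordering drives which evaluators
# you build first.
for category, n in counts.most_common():
    print(f"{category}: {n} ({n / total_failures:.0%} of failures)")
```

Even this trivial counting step is worth automating, because you will rerun it every analysis cycle as the taxonomy evolves.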
Run this cycle every 2–4 weeks for major analysis, with a lighter 10–20 trace review weekly focused on outliers. The cadence keeps your evaluators aligned with how your system actually fails in the real world, not how you imagined it would fail when you designed it.
The team structure matters too. Appoint one domain expert with final quality judgment authority. Multiple annotators create annotation paralysis—endless debates about whether a response scores 3 or 4 on some Likert scale. One person with clear authority eliminates that friction. And don't outsource error analysis to external teams. They lack the unspoken product context that makes the difference between a 3 and a 4 in your specific domain.
The Eval Stack: Cost Hierarchy
Once you understand your failure modes, build your evaluation stack from cheapest to most expensive. Use each layer as liberally as its cost allows.
Simple assertions and regex are your cheapest and most reliable tools. Does the response contain a phone number when it shouldn't? Does it start with a greeting? Is the JSON parseable? Run these on every request in CI/CD without hesitation. They're deterministic, fast, and never produce false positives from prompt sensitivity.
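A minimal sketch of this layer, with an illustrative (not production-grade) phone-number pattern:

```python
# Deterministic assertion-level checks: cheap, fast, and safe to run on
# every request. The regex is a simple illustration, not a vetted PII pattern.
import json
import re

PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def contains_phone_number(text: str) -> bool:
    """Flag responses that leak a US-style phone number."""
    return PHONE_RE.search(text) is not None

def is_parseable_json(text: str) -> bool:
    """Check that a structured output is at least valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

assert contains_phone_number("Call me at 555-867-5309")
assert not contains_phone_number("No digits here")
assert is_parseable_json('{"status": "ok"}')
assert not is_parseable_json('{"status": ')
```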
Schema validators and format checks sit one layer above—still deterministic, still cheap, validating that structured outputs actually conform to their expected structure. These catch a surprising fraction of production issues.
Reference-based checks require a known-good response to compare against. They work for narrow domains where ground truth exists, but don't scale to open-ended generation.
Fine-tuned judge models provide much of the value of LLM-as-judge at a fraction of the cost. A 7B classifier fine-tuned on your labeled examples can run on every commit in your CI/CD pipeline where GPT-4 class inference would be prohibitively expensive.
LLM-as-judge with frontier models is the most expensive evaluation option. It provides the most nuanced assessment of qualities like correctness, completeness, and tone, but should be reserved for production monitoring, major release evaluation, and LLM judge training—not for every commit.
Binary Labels Win, Every Time
The single most consistently validated finding in eval engineering: binary pass/fail labels outperform Likert scales (1–5) for almost every evaluation use case.
The reason isn't obvious until you've tried both. Numeric scales introduce subjective interpretation differences between annotators. The difference between a 3 and a 4 is inherently ambiguous—one annotator's "mostly fine with minor issues" is another's "acceptable but not great." Human inter-rater reliability on 5-point scales often lands at a Cohen's kappa of 0.2–0.3, which is barely better than chance. Binary labels for the same task typically reach 0.6–0.8.
Binary labels also force conceptual clarity. To annotate pass/fail, you have to decide: what actually matters? That question, answered concretely, becomes the specification your system is actually built to satisfy. Teams that skip this step optimize for the wrong things.
When building LLM judges, target >90% agreement with your domain expert. Measure precision and recall separately rather than raw accuracy, especially when your dataset is imbalanced. A judge that correctly identifies 95% of passes but misses 40% of failures isn't useful.
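Computing these per-class numbers takes only a few lines; the expert and judge labels below are hypothetical, with "fail" treated as the positive class:

```python
# Measuring judge–expert agreement with precision and recall on the
# "fail" class, rather than raw accuracy. Labels are hypothetical.
expert = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail"]
judge  = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "fail"]

tp = sum(e == j == "fail" for e, j in zip(expert, judge))          # real failures caught
fp = sum(e == "pass" and j == "fail" for e, j in zip(expert, judge))  # good responses flagged
fn = sum(e == "fail" and j == "pass" for e, j in zip(expert, judge))  # failures missed

precision = tp / (tp + fp)  # of flagged failures, how many are real
recall = tp / (tp + fn)     # of real failures, how many were caught
agreement = sum(e == j for e, j in zip(expert, judge)) / len(expert)
```

On an imbalanced dataset, `agreement` alone hides a low `recall`; track all three.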
Building LLM Judges That Actually Work
LLM-as-judge is genuinely useful for evaluating subjective qualities at scale—correctness, completeness, tone appropriateness—that deterministic checks can't capture. But building one that aligns with expert judgment requires a specific methodology.
The critique shadowing approach works as follows. Your domain expert reviews 50–100 examples and writes a detailed critique for each, explaining why it passes or fails. Terse notes ("this is wrong") aren't enough; the critique needs to capture the reasoning. These become few-shot examples for your judge.
Build the judge iteratively, starting with 5–10 critique examples. Run it against your labeled dataset. Examine every disagreement, update the examples, and repeat. You typically reach >90% agreement with the domain expert within three iterations.
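One way to assemble such a judge prompt from expert critiques, sketched with an invented criterion and made-up few-shot examples (the actual model call is left out and depends on your stack):

```python
# Assembling a critique-shadowing judge prompt from expert-labeled examples.
# The criterion, examples, and prompt wording are illustrative.
CRITERION = "factual correctness"

few_shot = [
    {
        "response": "Our refund window is 30 days from purchase.",
        "label": "pass",
        "critique": "Matches the documented refund policy exactly.",
    },
    {
        "response": "Refunds are available any time, no questions asked.",
        "label": "fail",
        "critique": "Overstates the policy; the actual window is 30 days.",
    },
]

def build_judge_prompt(candidate: str) -> str:
    """Build a few-shot prompt that asks for a critique before the verdict."""
    parts = [
        f"You are judging responses for {CRITERION}.",
        "For each response, write a short critique, then output PASS or FAIL.",
        "",
    ]
    for ex in few_shot:
        parts += [
            f"Response: {ex['response']}",
            f"Critique: {ex['critique']}",
            f"Verdict: {ex['label'].upper()}",
            "",
        ]
    parts += [f"Response: {candidate}", "Critique:"]
    return "\n".join(parts)
```

Each iteration of the loop above amounts to editing `few_shot` and rerunning against your labeled set.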
Build separate judges per criterion rather than one "God Evaluator" that scores everything at once. Granular judges are easier to debug, easier to update when one criterion changes, and produce more actionable signal when something fails.
LLM judges have well-documented biases you need to account for. Position bias: judges tend to favor responses based on their position in the prompt. Verbosity bias: longer responses score higher regardless of quality. Bandwagon bias: judges agree with stated consensus. Mitigate position bias by running evaluations twice with swapped positions. Reduce verbosity bias by including explicit rubric language about length appropriateness. Have your judge explain its ratings—this single change significantly improves alignment by forcing the model to reason rather than pattern-match.
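The position-swap mitigation can be wrapped in a small helper; the `judge` callable below is a stand-in for your real pairwise LLM comparison:

```python
# Position-swap debiasing for pairwise judges: run the comparison twice with
# A/B order flipped and only trust verdicts that survive the swap.
# `judge(x, y)` is assumed to return "first" or "second".
def debiased_compare(judge, resp_a: str, resp_b: str) -> str:
    """Return 'a', 'b', or 'inconsistent' (position bias suspected)."""
    run1 = judge(resp_a, resp_b)   # A in the first slot
    run2 = judge(resp_b, resp_a)   # same judge, positions swapped
    if run1 == "first" and run2 == "second":
        return "a"                 # A wins in both orderings
    if run1 == "second" and run2 == "first":
        return "b"                 # B wins in both orderings
    return "inconsistent"          # verdict flipped with position: don't trust it
```

A judge that always prefers the first slot produces "inconsistent" for every pair, which is exactly the signal you want surfaced rather than silently averaged away.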
Even frontier model judges have a ceiling. The best-in-class models achieve "both-correct" accuracy values below 0.7 when compared against human judgment. This isn't a reason to avoid LLM judges—human annotators miss about 50% of defects due to fatigue—but it's a reason to treat automated eval scores as signals rather than ground truth.
Guardrails Are Not Evaluators
A critical architectural distinction that gets conflated constantly: guardrails run synchronously and block responses; evaluators run asynchronously and inform dashboards.
Guardrails are inline checks—PII detection, profanity filters, JSON schema validation, prompt injection detection. They must run in milliseconds because they're in the critical path of your user-facing response. False positives are production bugs. Favor cheap, deterministic rules over LLM inference here.
Evaluators are post-hoc. They never block a response. They sample production traffic, run expensive quality checks, and feed monitoring dashboards. An evaluator that takes five seconds to run isn't a problem. An evaluator that flags 20% of good responses as failures isn't a production incident—it's a signal to investigate.
Mixing these two abstractions leads to the worst outcomes: either LLM inference in the critical path (adding 2+ seconds of latency) or async evaluators replaced by simplistic guardrails that miss the failures you actually care about.
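The contrast can be sketched as two tiny functions; the SSN pattern, queue, and sampling rate are all illustrative stand-ins:

```python
# Guardrail vs. evaluator: the guardrail runs inline and can block; the
# evaluator only samples and records. Names and patterns are illustrative.
import random
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def guardrail(response: str) -> str:
    """Synchronous, deterministic, in the critical path of the response."""
    if SSN_RE.search(response):
        return "[redacted: response blocked by PII guardrail]"
    return response

eval_queue: list[dict] = []  # stand-in for an async job queue / message broker

def maybe_enqueue_for_eval(trace: dict, sample_rate: float = 0.1) -> None:
    """Asynchronous: sample traffic for later quality checks, never block."""
    if random.random() < sample_rate:
        eval_queue.append(trace)  # an offline worker runs the expensive checks
```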
CI/CD Integration and Production Monitoring
The two-tier eval architecture that works in practice separates development-time regression testing from production monitoring.
The CI/CD pipeline runs on every commit. It uses a curated dataset of 100–200 examples covering core features, known regression cases, and edge cases from past failures. Deterministic assertions and smaller fine-tuned judges do the heavy lifting here. The pipeline gates deployment—failures block promotion. LLM-as-judge with frontier models runs only on nightly sweeps or release candidates, where latency isn't a constraint.
Production monitoring samples live traffic asynchronously. You don't need to evaluate every request—10–20% sampling is sufficient to detect regressions in aggregate metrics. The key metrics to track continuously: faithfulness (for RAG systems), task completion rate (for agents), and user-facing quality signals. Track confidence intervals rather than point estimates; investigate only when lower bounds cross thresholds.
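A minimal sketch of the lower-bound check, using a normal-approximation interval and an illustrative alert threshold:

```python
# Alerting on the confidence interval's lower bound rather than the point
# estimate. The 0.85 threshold is illustrative.
import math

def pass_rate_ci(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation CI for a binomial pass rate."""
    p = passes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

THRESHOLD = 0.85  # illustrative quality floor

lower, upper = pass_rate_ci(passes=178, n=200)
if lower < THRESHOLD:
    print("investigate: CI lower bound crossed the threshold")
```

Note that a point estimate of 0.89 looks healthy here; it's the lower bound that triggers the investigation.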
Your eval maturity level determines which of these you have in place. Most teams shipping LLM features are at Level 1: some offline evals, manual review. Level 2 adds CI/CD integration with a regression suite. Level 3 adds continuous production sampling. Level 4—where the most reliable teams operate—runs continuous evaluation on production traffic with automated red-teaming to discover new failure modes before users do.
Agent Evaluation Requires Different Thinking
Evaluating agents introduces a structural complication that doesn't exist for single-turn generation: you need to evaluate trajectories, not just final outputs.
Three grader types apply. Code-based graders handle objective, verifiable outcomes: did the agent write valid JSON? Did it call the correct API? Did the final database state match the spec? These are fast, cheap, and should be exhausted before reaching for model-based graders. Model-based graders using rubric scoring handle outcomes that require judgment—was the research summary accurate? Were the right sources consulted? Human review remains the gold standard but can't scale past occasional spot-checks.
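Code-based graders for these outcome checks can be very small; the expected-state shape and required keys below are hypothetical:

```python
# Code-based outcome graders: checks on the final state and final output,
# not on the path taken. The state and key shapes are hypothetical.
import json

def grade_final_state(db_state: dict, expected: dict) -> bool:
    """Did the agent leave the database in the specified end state?"""
    return db_state == expected

def grade_output_json(raw_output: str, required_keys: set) -> bool:
    """Is the agent's final output valid JSON with the required fields?"""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return required_keys <= set(parsed)
```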
The most common mistake in agent evaluation is grading the path rather than the outcome. If the agent achieved the correct end state via an unconventional sequence of tool calls, that should count as success. Locking evaluators to specific execution paths means you'll regress every time you make a valid architectural change.
Two metrics matter differently depending on your use case. pass@k measures the probability of at least one correct solution in k attempts—the right metric when one success is enough. pass^k measures the probability that all k trials succeed—the right metric when you need reliability. These diverge dramatically at k=10: pass@k often approaches 100% for capable models while pass^k may approach 0%. Which you optimize for shapes your entire training and eval strategy.
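Assuming independent trials with a fixed per-attempt success probability p (a simplification of the usual estimators), the divergence is easy to compute:

```python
# pass@k vs. pass^k under the simplifying assumption of independent trials
# with fixed per-attempt success probability p.
def pass_at_k(p: float, k: int) -> float:
    """P(at least one of k attempts succeeds)."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """P(all k attempts succeed)."""
    return p ** k

p, k = 0.7, 10
print(f"pass@{k} = {pass_at_k(p, k):.6f}")   # ≈ 0.999994: one success is near-certain
print(f"pass^{k} = {pass_hat_k(p, k):.6f}")  # ≈ 0.028248: all ten rarely succeed
```

A model that looks excellent under pass@k can be unusable for workflows that require every run to succeed.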
Handle non-determinism carefully: use isolated environments per trial with no shared state between runs. A single extraction bug moved one published benchmark from 50% to 73%—validate your data pipeline before attributing changes to model quality.
The Pitfalls That Sink Eval Programs
Nine failure modes worth calling out explicitly:
Eval-driven development: Building evaluators for imagined failures before examining real data. LLM failures are counterintuitive—write evaluators only for failures you've observed.
Generic metrics: "Helpfulness," "coherence," and "relevance" scores create false confidence. They don't correlate with user satisfaction in practice; they measure what's easy to measure, not what matters.
Similarity metrics: BERTScore and ROUGE aren't useful for most LLM outputs. They're appropriate only for retrieval evaluation where you have exact reference texts.
100% pass rates: If you're acing all your evals, they're not hard enough. Target around 70% pass rate to have meaningful signal.
Criteria drift: Domain experts can't fully specify quality criteria upfront. The act of judging outputs is how criteria get defined. Build in recalibration cycles.
Single-run claims: Always compute confidence intervals. 200 samples gives ±2.4% margin of error; 400 samples gives ±1.7%. Never claim improvement from one eval run.
One God Evaluator: A single evaluator scoring everything fails to diagnose which specific criterion regressed. Build per-criterion evaluators.
Prompt optimization tools too early: These hill-climb on your current metrics while new failure modes accumulate. Maintain human-in-loop error analysis in parallel.
Flaky CI tests: LLMs are non-deterministic. A test may pass one run and fail the next. Reserve LLM judges for nightly sweeps; use deterministic assertions in commit-blocking tests.
Getting to Production-Grade Evals
The path from zero to reliable eval infrastructure has three phases.
In the first phase (20–50 examples), manually label a small dataset of representative production outputs. Use binary pass/fail. Start running error analysis cycles. Identify your top three failure modes. Build simple assertion-based tests for each.
In the second phase (50–200 examples), expand your labeled dataset. Train or configure an LLM judge with critique shadowing. Measure alignment against expert labels—target >90% agreement before trusting automated scores. Build the CI/CD eval pipeline against this dataset.
In the third phase (200+ examples), integrate production monitoring with async sampling. Establish CI/CD gates that block regressions. Run weekly error analysis cycles to catch new failure modes before they accumulate. Expand your golden dataset continuously from production traces, treating it as a living document rather than a static artifact.
The investment compounds. One team spent four weeks building evaluation infrastructure, then ran dozens of experiments in two subsequent weeks and hundreds in the following months—work that was previously blocked by manual review bottlenecks. The eval infrastructure didn't slow them down; it's what allowed them to move fast without breaking things.
The goal isn't automation for its own sake. It's a shared, precise definition of quality that's cheap enough to check continuously and reliable enough to trust when it says something regressed.
