In Defense of AI Evals, for Everyone
Every few months, a new wave of "don't bother with evals" takes hold in the AI engineering community. The argument usually goes: evals are too expensive, too brittle, too hard to define, and ultimately not worth the overhead for a fast-moving product team. Ship, iterate, and trust your instincts.
This is bad advice that produces bad software. A 2026 LangChain survey found that only 52% of organizations run offline evaluations and just 37% run online evals against live traffic — yet 32% cite quality as their number one barrier to production deployment. That is not a coincidence.
The teams that claim to skip evals are still evaluating — they just do it informally, inconsistently, and without the ability to detect regressions. Dogfooding, error triaging, output review sessions: these are all evaluation activities. The question is not whether you evaluate but whether you do it deliberately enough to learn from it.
The Case Against Evals Is Actually a Case Against Bad Evals
The most common critique isn't really about evaluation itself. It's about how teams implement evals: picking metrics that sound rigorous but reveal nothing, building eval suites that pass while production burns, and spending engineering cycles on infrastructure that never surfaces actionable signal.
These are real problems. Generic metrics — "helpfulness score," "relevance," "coherence" — produce numbers that tell you almost nothing useful. Teams optimize the score, ship, and still get user complaints about completely new failure modes. The evals pass because they test what the system says, not what it actually does.
The answer is better evals, not no evals. Throwing out systematic quality measurement because some implementations are bad is like abandoning unit tests because you've seen flaky ones.
You're Already Doing Evals
Take an honest accounting of your team's quality-assurance process. You probably:
- Review model outputs before shipping a prompt change
- Monitor error rates and complaint tickets after releases
- Periodically read through a sample of conversations or completions
- Have internal power users who stress-test edge cases
All of these are evaluation. The difference between this and "running evals" is structure: whether you define failure modes in advance, whether you sample consistently, whether you track how failure rates change over time.
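That structure can be surprisingly lightweight. The sketch below shows one way to make the informal review loop deliberate: sample the same-sized slice each cycle and log failure rates so they are comparable across time. The failure modes and file path are illustrative placeholders, not a prescription.

```python
import json
import random
from collections import Counter
from datetime import date

# Hypothetical failure modes -- replace with ones from your own error analysis.
FAILURE_MODES = ["hallucinated_citation", "misread_intent", "wrong_format"]

def sample_traces(traces, n=50, seed=0):
    """Draw a same-sized, seeded sample each review cycle so rates are comparable."""
    rng = random.Random(seed)
    return rng.sample(traces, min(n, len(traces)))

def record_review(labels, log_path="eval_log.jsonl"):
    """labels: one set of observed failure modes per reviewed trace (empty = pass).

    Appends a dated entry so failure rates can be tracked release over release."""
    counts = Counter(mode for trace_labels in labels for mode in trace_labels)
    entry = {
        "date": date.today().isoformat(),
        "n_reviewed": len(labels),
        "failure_rates": {m: counts[m] / len(labels) for m in FAILURE_MODES},
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

A review session then reduces to: sample, read, label each trace with the failure modes you saw, and record. The log is the baseline your next prompt change gets compared against.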
Teams with deep domain expertise and rigorous dogfooding can operate with lighter formal eval infrastructure — but only if they are genuinely disciplined about the sampling and analysis. This is the exception, not the rule. For most product teams, the implicit eval process has blind spots large enough to let serious regressions through undetected.
When You Can Afford Lighter Evals
There are two scenarios where less rigorous evaluation is defensible:
Tasks well-covered by foundation model post-training. Code generation, basic summarization, instruction following in well-trodden domains — foundation models see enormous volumes of these tasks during training, and public benchmarks give reasonable signal on whether a model handles them. If your application sits squarely in this space, you can lean on model-level benchmarks more heavily and spend less on custom eval infrastructure.
Small teams with tight feedback loops and genuine domain expertise. A two-person team where both founders are deep domain experts, reviewing every output themselves, and shipping incrementally to known users — they will catch most problems through proximity. This breaks down fast as the team grows or the user base diversifies.
Outside these cases, skipping evals is a debt that compounds. Every model update, every prompt change, every retrieval configuration tweak lands in production with no baseline to compare against.
Where Evals Are Non-Negotiable
Complex document processing and analysis is the clearest case. When your system extracts entities from legal documents, classifies policy violations, or summarizes medical records, the failure modes are numerous, the consequences of errors are high, and no amount of informal review catches the long tail.
Multi-step agent pipelines are similar. A single pipeline might involve retrieval, reasoning over context, tool calls, and output formatting — each a potential failure point. An agent that passes your output quality check while making subtly wrong tool calls is a ticking time bomb. You need evals at each step, not just at the final output.
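Step-level checks can be simple. This sketch assumes a hypothetical retrieve-then-call-tool pipeline and gold labels for a handful of traces; the trace and gold field names are illustrative, not a real framework's schema.

```python
def eval_retrieval(retrieved_ids, relevant_ids):
    """Did retrieval surface at least one document a reviewer marked relevant?"""
    return bool(set(retrieved_ids) & set(relevant_ids))

def eval_tool_call(actual_call, expected_call):
    """Did the agent pick the right tool with the right arguments?"""
    return (actual_call["name"] == expected_call["name"]
            and actual_call["args"] == expected_call["args"])

def eval_trace(trace, gold):
    """Score one trace step by step, so a plausible final answer
    cannot mask a failure earlier in the pipeline."""
    return {
        "retrieval_ok": eval_retrieval(trace["retrieved"], gold["relevant"]),
        "tool_call_ok": eval_tool_call(trace["tool_call"], gold["tool_call"]),
        "answer_ok": trace["answer"].strip() == gold["answer"].strip(),
    }
```

A trace that scores `{"retrieval_ok": True, "tool_call_ok": False, "answer_ok": True}` is exactly the case the prose warns about: the output looks fine while the mechanism is broken.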
The rule of thumb: ask whether your bottleneck is specification (you haven't clearly defined what good looks like) or model capability (the model can't reliably do what you need). Evals are most critical when the answer is capability — because that's when you need to measure whether changes actually move the needle.
The Highest-ROI Eval Practice: Error Analysis
Before you build eval infrastructure, do error analysis. It is the highest-return activity in AI product development, and most teams rush past it toward automation.
The process:
- Collect 100 diverse, real traces. Not hand-crafted examples — actual user interactions or representative samples from your input distribution.
- Read them. All of them. Manually identify where the system fails and why.
- Build a failure taxonomy. A small, coherent, non-overlapping set of binary failure modes: did the system hallucinate a citation? Did it misunderstand the user's intent? Did it produce a structurally correct but factually wrong answer?
- Quantify each failure mode. Now you know where to invest.
This process takes a day or two. It will tell you more about your system's real weaknesses than any automated metric you could design in a week. And critically, it gives you the definitions you need to write evals that actually measure what matters.
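The taxonomy itself can be as plain as one row of binary labels per trace. A minimal sketch, with hypothetical failure modes standing in for whatever your own read-through surfaces:

```python
from dataclasses import dataclass

@dataclass
class TraceLabel:
    """One reviewed trace: binary flags, one per failure mode, plus free notes."""
    trace_id: str
    hallucinated_citation: bool = False
    misread_intent: bool = False
    wrong_but_well_formed: bool = False
    notes: str = ""

MODES = ["hallucinated_citation", "misread_intent", "wrong_but_well_formed"]

def prioritize(labels):
    """Rank failure modes by observed rate -- this is the 'where to invest' list."""
    rates = {m: sum(getattr(l, m) for l in labels) / len(labels) for m in MODES}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

The binary, non-overlapping flags matter: they make the counts unambiguous, and each flag becomes the definition behind one automated eval later.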
LLM-as-Judge: Scaling Your Eval Budget
Once you have failure modes defined, LLM-as-judge lets you scale evaluation cheaply. The pattern is straightforward: present a judge model with the input, the output, and a scoring rubric derived from your failure taxonomy, and ask it to evaluate.
Done naively, this introduces its own reliability problems. Judge models have biases — they tend to favor longer outputs, outputs that sound authoritative, and outputs from the same model family as the judge. The techniques that reduce these biases:
- Chain-of-thought scoring: require the judge to reason before scoring, not just output a number
- Reference-based evaluation: provide examples of correct outputs for comparison rather than asking the judge to assess in isolation
- Position swapping: for pairwise comparisons, run the evaluation in both orderings and average
- Failure-mode-specific rubrics: one rubric per failure mode, not one rubric for "quality"
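A minimal sketch of how these techniques combine: a failure-mode-specific rubric, a required reasoning step before the verdict, and pairwise comparison run in both orderings. `complete(prompt)` is a stand-in for whatever LLM client you use, not a real API.

```python
JUDGE_TEMPLATE = """You are checking exactly one failure mode: {failure_mode}.
Rubric: {rubric}

Input: {task_input}
Candidate output: {output}

Reason step by step about whether this failure occurs.
Then, on the last line, write VERDICT: PASS or VERDICT: FAIL."""

def judge_one(complete, failure_mode, rubric, task_input, output):
    """Chain-of-thought judge for a single failure mode. Returns True on PASS."""
    response = complete(JUDGE_TEMPLATE.format(
        failure_mode=failure_mode, rubric=rubric,
        task_input=task_input, output=output))
    return response.strip().splitlines()[-1].endswith("PASS")

PAIRWISE_TEMPLATE = """Which answer better satisfies this rubric: {rubric}

Input: {task_input}
Answer A: {a}
Answer B: {b}

Reason step by step, then end with VERDICT: A or VERDICT: B."""

def judge_pairwise(complete, rubric, task_input, out1, out2):
    """Run both orderings so position bias cancels.
    Returns +2 if out1 wins both, 0 on a split, -2 if out2 wins both."""
    first = complete(PAIRWISE_TEMPLATE.format(
        rubric=rubric, task_input=task_input, a=out1, b=out2))
    second = complete(PAIRWISE_TEMPLATE.format(
        rubric=rubric, task_input=task_input, a=out2, b=out1))
    score = 1 if first.strip().splitlines()[-1].endswith("A") else -1
    score += 1 if second.strip().splitlines()[-1].endswith("B") else -1
    return score
```

Note what the swap buys you: a judge that blindly favors position A scores 0 across both orderings instead of handing out1 a spurious win.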
The practical payoff is significant. Teams report up to 98% cost reduction compared to human review at scale. More importantly, it lets you run eval suites on every prompt change rather than batching evaluation into release cycles.
What Separates Products That Last
The AI applications that survive model updates, competitive pressure, and user base growth share a common trait: the teams building them can measure what changed and why. They can tell you which prompt revision improved extraction accuracy on the failure modes that mattered. They can catch when a model update degrades a behavior that was previously reliable.
This is not a property of sophisticated ML teams with research budgets. It comes from discipline: defining what good looks like, sampling real data, reading outputs, and building the feedback loops that let you iterate on evidence rather than intuition.
Evals are not a tax on shipping velocity. They are the mechanism that makes shipping sustainable.
