
16 posts tagged with "llm-evaluation"


The Debug Tax: Why Debugging AI Systems Takes 10x Longer Than Building Them

10 min read
Tian Pan
Software Engineer

Building an LLM feature takes days. Debugging it in production takes weeks. This asymmetry — the debug tax — is the defining cost structure of AI engineering in 2026, and most teams don't account for it until they're already drowning.

A 2025 METR study found that experienced developers using LLM-assisted coding tools were actually 19% less productive, even as they perceived a 20% speedup. The gap between perceived and actual productivity is a microcosm of the larger problem: AI systems feel fast to build because the hard part — debugging probabilistic behavior in production — hasn't started yet.

The debug tax isn't a skill issue. It's a structural property of systems built on probabilistic inference. Traditional software fails with stack traces, error codes, and deterministic reproduction paths. LLM-based systems fail with plausible but wrong answers, intermittent quality degradation, and failures that can't be reproduced because the same input produces different outputs on consecutive runs. Debugging these systems requires fundamentally different methodology, tooling, and mental models.
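A minimal sketch of what "can't be reproduced" means in practice: replay the same input several times and see how many distinct answers come back. The `call_model` function below is a placeholder for whatever client your stack uses, not a reference to any particular API.

```python
# Sketch: characterize non-determinism by replaying the same input N times.
# `call_model` is a stand-in for an actual LLM client call.
from collections import Counter

def call_model(prompt: str) -> str:
    """Placeholder for an LLM call (hosted API, local model, etc.)."""
    raise NotImplementedError

def replay(prompt: str, runs: int = 5) -> None:
    outputs = [call_model(prompt) for _ in range(runs)]
    distinct = Counter(outputs)
    print(f"{len(distinct)} distinct outputs across {runs} runs")
    for text, count in distinct.most_common():
        print(f"  x{count}: {text[:80]!r}")

# If a bug-report prompt yields three different answers in five runs,
# a single failing example proves very little: you are debugging a
# distribution, not a repro case.
```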

The Semantic Failure Mode: When Your AI Runs Perfectly and Does the Wrong Thing

9 min read
Tian Pan
Software Engineer

Your AI agent completes the task. No errors in the logs. Latency looks normal. The output is well-formatted JSON, grammatically perfect prose, or a valid SQL query that executes without complaint. Every dashboard is green.

And the user stares at the result, sighs, and starts over from scratch.

This is the semantic failure mode — the class of production AI failures where the system runs correctly, the model responds confidently, and the output is delivered on time, but the agent didn't do what the user actually needed. Traditional error monitoring is completely blind to these failures because there is no error. The HTTP status is 200. The model didn't refuse. The output conforms to the schema. By every technical metric, the system succeeded.
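One way to make these failures visible is to layer a semantic check on top of the usual status-code and schema checks: grade whether the output actually addresses the request, not just whether it parsed. The sketch below is illustrative; `judge_addresses_request` is a hypothetical stand-in for an LLM-as-judge call or a task-specific heuristic, not part of any monitoring product.

```python
# Sketch: a semantic check alongside conventional monitoring.
# `judge_addresses_request` is hypothetical -- in practice it would be an
# LLM-as-judge prompt or a task-specific rule.
import json

def judge_addresses_request(request: str, output: str) -> bool:
    """Placeholder: return True if the output plausibly satisfies the request."""
    raise NotImplementedError

def check_response(request: str, output: str) -> dict:
    result = {"http_ok": True, "schema_ok": False, "semantic_ok": False}
    try:
        json.loads(output)          # where traditional monitoring stops
        result["schema_ok"] = True
    except ValueError:
        return result
    # The extra step: did the output do what the user asked,
    # not just return well-formed JSON?
    result["semantic_ok"] = judge_addresses_request(request, output)
    return result
```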

LLM Evals: What Actually Works and What Wastes Your Time

10 min read
Tian Pan
Software Engineer

Most teams building LLM applications fall into one of two failure modes. The first is building no evals at all and shipping features on vibes. The second is building elaborate evaluation infrastructure before they understand what they're actually trying to measure. Both are expensive mistakes.

The teams that do evals well share a common approach: they start by looking at data, not by building systems. Error analysis comes before evaluation automation. Human judgment grounds the metrics before any automated judge is trusted. And they treat evaluation not as a milestone to cross but as a continuous discipline that evolves alongside the product.

This is what evals actually look like in practice — the decisions that matter, the patterns that waste effort, and the tradeoffs that aren't obvious until you've been burned.
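As a concrete example of "look at data first", error analysis can be as simple as pulling a sample of recent traces and hand-tagging failures before any automation exists. The trace fields and tagging flow below are illustrative only, assuming traces are stored as JSON lines.

```python
# Sketch: manual error analysis over a sample of production traces.
# The trace fields and failure tags are illustrative, not a fixed schema.
import json
import random

def sample_traces(path: str, n: int = 50) -> list[dict]:
    with open(path) as f:
        traces = [json.loads(line) for line in f]
    return random.sample(traces, min(n, len(traces)))

def tag_traces(traces: list[dict]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for t in traces:
        print("INPUT: ", t["input"][:200])
        print("OUTPUT:", t["output"][:200])
        tag = input("failure tag (or 'ok'): ").strip()  # human judgment, on purpose
        counts[tag] = counts.get(tag, 0) + 1
    return counts

# The tag counts show which failure modes are frequent enough to be worth
# automating an eval for; building the eval first gets that ordering backwards.
```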

Why Your LLM Evaluators Are Miscalibrated — and the Data-First Fix

9 min read
Tian Pan
Software Engineer

Most teams build their LLM evaluators in the wrong order. They write criteria, then look at data. That inversion is the root cause of miscalibrated evals, and it's almost universal in teams shipping their first AI product. The criteria sound reasonable on paper — "the response should be accurate, helpful, and concise" — but when you apply them to real model outputs, you discover the rubric doesn't match what you actually care about. You end up with an evaluator that grades dimensions nobody asked about and misses the failures that do matter.

The fix isn't a better rubric. It's a different workflow: look at the data first, define criteria second, and then validate your evaluator against human judgment before trusting it to run unsupervised.
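A minimal sketch of the validation step: run the automated evaluator over examples a human has already labeled, then look at raw agreement and, more importantly, the disagreements. The `automated_judge` function is a placeholder for whatever LLM-as-judge prompt or scoring rule you end up with.

```python
# Sketch: calibrate an automated evaluator against human labels before
# letting it run unsupervised. `automated_judge` is a placeholder.

def automated_judge(example: dict) -> bool:
    """Placeholder: True if the evaluator marks the output as passing."""
    raise NotImplementedError

def calibrate(labeled: list[dict]) -> None:
    """Each item in `labeled` carries a human 'human_pass' boolean."""
    agree = 0
    disagreements = []
    for ex in labeled:
        verdict = automated_judge(ex)
        if verdict == ex["human_pass"]:
            agree += 1
        else:
            disagreements.append((ex, verdict))
    print(f"agreement: {agree}/{len(labeled)}")
    # The disagreements are the interesting part: they show exactly where the
    # rubric and reality diverge, which is the miscalibration described above.
    for ex, verdict in disagreements[:10]:
        print(f"judge={verdict} human={ex['human_pass']}: {ex['output'][:80]!r}")
```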