Choosing Eval Metrics Is a Product Decision, Not a Technical One
A team building an LLM-based literature screening tool celebrated 96% accuracy on their test set. Their model was, by any standard engineering metric, performing excellently. There was one problem: it found zero true positives. It had learned to classify everything as irrelevant and still scored near-perfect accuracy, because relevant papers were rare in the dataset. The failure wasn't in the model — it was in the metric.
This failure mode is not exotic. It plays out silently across AI teams every week, in codebases where engineers select evaluation metrics the way they'd select a sorting algorithm: as a technical choice with a right answer. The framing is wrong. Metric selection is a product decision. It encodes which failure modes you're willing to tolerate, which users you're optimizing for, and what "good" actually means for your specific context. Getting this wrong produces eval suites that look rigorous and measure the wrong thing.
