Spec-to-Eval: Translating Product Requirements into Falsifiable LLM Criteria
Most AI features are specified in prose and evaluated in prose. The PM writes "the assistant should respond helpfully and avoid harmful content." The engineer ships a prompt that, at demo time, produces output that seems to match. The team agrees at standup. They disagree at launch — when edge cases surface, when different engineers assess the same output differently, and when "helpful" turns out to mean seven different things depending on who's reviewing.
This isn't a tooling problem. It's a translation problem. The spec stayed abstract; the evaluation criteria were never made concrete. Spec-to-eval is the discipline of converting English requirements into falsifiable criteria before you write a single prompt — and doing it upfront changes everything about how fast you iterate.
