
Spec-to-Eval: Translating Product Requirements into Falsifiable LLM Criteria

Tian Pan · Software Engineer · 9 min read

Most AI features are specified in prose and evaluated in prose. The PM writes "the assistant should respond helpfully and avoid harmful content." The engineer ships a prompt that, at demo time, produces output that seems to match. The team agrees at standup. They disagree at launch — when edge cases surface, when different engineers assess the same output differently, and when "helpful" turns out to mean seven different things depending on who's reviewing.

This isn't a tooling problem. It's a translation problem. The spec stayed abstract; the evaluation criteria were never made concrete. Spec-to-eval is the discipline of converting English requirements into falsifiable criteria before you write a single prompt — and doing it upfront changes everything about how fast you iterate.

Why Prose Specs Are Incompatible with LLM Development

In traditional software, requirements ambiguity is painful but manageable. If a spec says "handle invalid inputs gracefully," a developer can ask for clarification, look at similar features, and make a defensible decision. The function either passes or fails a unit test — someone eventually specifies what "graceful" means in code.

LLMs collapse this feedback loop. The output space is enormous, the behavior is probabilistic, and "correctness" depends on context in ways that code can't easily capture. Two engineers reading the same spec can come away with genuinely different mental models of what success looks like — not because one of them is careless, but because the spec left room for multiple valid interpretations.

The consequence is that teams end up doing eval in their heads. You ship a change, look at some outputs, and form an intuition about whether things got better or worse. This intuition is frequently wrong. It's subject to recency bias (you over-index on the last thing you saw), to coverage blindness (you only sample cases that feel representative but aren't), and to anchoring (the current output looks fine compared to the previous bad output, but you've forgotten what "good" actually looked like).

The fix isn't better intuition. It's making the criteria explicit before you start.

The Five Classes of Spec Ambiguity

When you try to translate a prose spec into evaluation criteria, five categories of ambiguity surface consistently. Each forces a decision that the spec left unmade.

1. Success-definition ambiguity. "The response should be accurate" tells you nothing about what accuracy means here. Does it mean factually grounded? Consistent with the user's context? Free of hallucinations about specific entities? A falsifiable criterion requires a specific condition: "the response does not contradict information present in the source documents" or "the response references no persons, dates, or figures not present in the retrieved context." The original spec implies both of these — it doesn't say which.
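A criterion like "references no persons, dates, or figures not present in the retrieved context" is checkable in code. Here's a minimal sketch of that idea using crude regex-based entity extraction; a real implementation would use proper NER, and the `ungrounded_entities` helper and its pattern are illustrative assumptions, not a standard library:

```python
import re

def ungrounded_entities(response: str, context: str) -> list[str]:
    """Return dates, figures, and capitalized names in the response
    that never appear in the retrieved context."""
    # Crude entity extraction: 4-digit years, numbers, Capitalized Names.
    pattern = r"\b(?:\d{4}|\d+(?:\.\d+)?|[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\b"
    entities = set(re.findall(pattern, response))
    # Substring check against the context; anything missing is ungrounded.
    return sorted(e for e in entities if e not in context)

context = "Acme reported revenue of 12% growth in 2021."
ok = ungrounded_entities("Acme grew 12% in 2021.", context)    # []
bad = ungrounded_entities("Acme grew 15% in 2019.", context)   # ["15", "2019"]
```

The point isn't the extraction quality; it's that the criterion now has a concrete pass/fail condition instead of the word "accurate."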

2. Edge-case handling ambiguity. Specs describe the happy path. They rarely specify what happens when the user's query is underspecified, when context is missing, or when the task is technically possible but probably not what the user meant. Evals force you to enumerate these scenarios. Does the assistant ask for clarification or make a plausible assumption? Does it surface uncertainty or answer with false confidence? The spec didn't say.

3. Objective-versus-subjective confusion. Some criteria have right answers that can be checked with code: the response is under 200 words, the JSON is valid, the date is formatted correctly. Others require judgment: the tone is appropriate, the summary captures the main point, the recommendation is sensible given the user's situation. Conflating these leads to evals that are too rigid (failing valid outputs on technicalities) or too vague (accepting bad outputs that superficially match a rubric). You need to sort criteria into their correct bucket before you can evaluate them correctly.
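The objective bucket translates directly into assertions. A sketch of the three checks named above (function names are illustrative):

```python
import json
import re

def under_word_limit(text: str, limit: int = 200) -> bool:
    # Objective: countable, no judgment required.
    return len(text.split()) <= limit

def is_valid_json(text: str) -> bool:
    # Objective: the parser is the ground truth.
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def is_iso_date(text: str) -> bool:
    # Objective: format either matches or it doesn't.
    return re.fullmatch(r"\d{4}-\d{2}-\d{2}", text) is not None

checks = [
    under_word_limit("Short enough.", limit=200),
    is_valid_json('{"ok": true}'),
    is_iso_date("2024-03-01"),
]
```

None of these require an LLM judge, and they should never be delegated to one: code is cheaper, deterministic, and exactly as strict as you wrote it.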

4. Context-dependency ambiguity. "Respond in the user's language" seems unambiguous until you consider: what if the user writes in Spanish but uploads an English document? What if the user switches languages mid-conversation? What if the language can't be reliably detected? Each of these branches needs an explicit decision. Evals expose them because you have to construct test cases, and test cases require you to pick a specific scenario.

5. Grader-rigidity ambiguity. When teams write evaluations, they often accidentally encode the path rather than the outcome. They check that the model used a specific tool, followed a particular reasoning chain, or produced output in a specific format — when all they actually care about is whether the user's problem was solved. An eval that grades the approach penalizes valid alternative solutions and creates perverse incentives in prompt optimization.
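The difference between grading the path and grading the outcome is easy to see in code. A hypothetical sketch, assuming a transcript dict with the tool calls made and the final answer (the field names and tool names are invented for illustration):

```python
# Assumed transcript shape: which tools ran, and what the user finally saw.
transcript = {
    "tool_calls": ["calculator"],          # the model solved it a different way
    "final_answer": "The total is $42.",
}

def grades_the_path(t: dict) -> bool:
    # Over-specified: fails any run that didn't use the "expected" tool,
    # even when the answer is correct.
    return "spreadsheet_lookup" in t["tool_calls"]

def grades_the_outcome(t: dict) -> bool:
    # Outcome-based: did the user's problem actually get solved?
    return "$42" in t["final_answer"]

path_verdict = grades_the_path(transcript)        # False: penalizes a valid route
outcome_verdict = grades_the_outcome(transcript)  # True: the answer is right
```

The first grader would push prompt optimization toward tool-call theater; the second rewards any approach that produces the correct total.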

The Translation Methodology

Converting a prose spec into falsifiable criteria is a concrete process, not a brainstorming session.

Start with manual checks. Before you've built any eval harness, you're already doing evaluation — you just aren't writing it down. Every time you test a prompt and think "that looks right," you're applying an implicit criterion. Start by externalizing those: what would you check before shipping a prompt change? What behaviors would make you say "this is broken"? Write these down as English assertions, then make them falsifiable. "It should answer the question" becomes "the response directly addresses the specific question asked in the user's last message, without redirecting to a different but related topic."

Apply the two-expert test. A usable eval criterion is one where two domain experts, given the same input and output, would independently reach the same pass/fail verdict. If they'd disagree, the criterion isn't specific enough. This test catches ambiguity that feels resolved on paper but isn't. "The tone is professional" fails this test. "The response avoids slang, informal contractions, and first-person plural ('we think') unless the persona explicitly uses them" passes it.
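Once a criterion passes the two-expert test, it's usually specific enough to mechanize. A sketch of the tone criterion above as code; the word lists are illustrative stand-ins for whatever your style guide actually bans:

```python
import re

# Illustrative deny-lists; a real criterion would enumerate these from the spec.
SLANG = {"gonna", "wanna", "kinda", "lol", "btw"}
INFORMAL_CONTRACTIONS = {"ain't", "y'all", "gotta"}

def professional_tone(text: str, persona_uses_we: bool = False) -> bool:
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & (SLANG | INFORMAL_CONTRACTIONS):
        return False
    # First-person plural banned unless the persona explicitly uses it.
    if not persona_uses_we and re.search(r"\bwe think\b", text.lower()):
        return False
    return True

formal = professional_tone("We have reviewed your request.")   # True
casual = professional_tone("Gonna check on that, btw.")        # False
```

Two experts running this function trivially agree, because the function is the criterion.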

Separate objective from subjective, then handle each correctly. Objective criteria become code-based assertions: regex checks, word count limits, schema validation, entity matching against a known ground truth. Subjective criteria become LLM-as-judge evaluations with a binary pass/fail rubric — not a 1-to-5 scale, which introduces inter-annotator disagreement without adding signal. For subjective judges, the rubric itself must pass the two-expert test.
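For the subjective bucket, a binary LLM-as-judge can be sketched as below. The `complete` callable is a placeholder for whatever model client you use (stubbed here with a lambda); the prompt wording is an assumption, not a prescribed template:

```python
def judge(rubric: str, user_input: str, output: str, complete) -> bool:
    """Binary LLM-as-judge. `complete` is any callable that sends a prompt
    to a model and returns its text (plug in your own client)."""
    prompt = (
        "You are grading a model output against one criterion.\n"
        f"Criterion: {rubric}\n"
        f"User input: {user_input}\n"
        f"Model output: {output}\n"
        "Answer with exactly PASS or FAIL."
    )
    verdict = complete(prompt).strip().upper()
    return verdict == "PASS"

# Stubbed model call, for illustration only.
result = judge(
    "The response directly addresses the question asked.",
    "What is our refund window?",
    "Refunds are accepted within 30 days of purchase.",
    complete=lambda p: "PASS",
)
```

Note the rubric is one criterion and the output is one bit. Collapsing to pass/fail is what makes the judge calibratable against human verdicts.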

Balance positive and negative cases. A test suite that only contains "should succeed" cases will not catch a model that always answers confidently. For every behavior you want to see, include cases where that behavior should not occur. If your assistant should search the web for current information, include cases where a web search would be redundant or inappropriate. If it should ask for clarification when a query is underspecified, include clear queries where asking would be annoying and wrong.
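A balanced case encodes the expectation in both directions. A minimal sketch for the web-search example, with an assumed case schema (not from any specific framework):

```python
# Each case records whether a web search *should* happen, not just the query.
CASES = [
    {"query": "What was yesterday's closing price for AAPL?", "should_search": True},
    {"query": "What is 2 + 2?",                               "should_search": False},
]

def grade(case: dict, searched: bool) -> bool:
    # Pass only when behavior matches the expectation in BOTH directions:
    # searching when it should, and abstaining when it shouldn't.
    return searched == case["should_search"]

right_to_search = grade(CASES[0], searched=True)     # True
wrong_to_search = grade(CASES[1], searched=True)     # False: redundant search
right_to_abstain = grade(CASES[1], searched=False)   # True
```

A suite with only the first kind of case would happily reward a model that searches on every single query.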

Let failure modes inform the criteria. You won't enumerate all the right criteria upfront. Run an early version, look at failures, and treat them as specification gaps. When you see an output that's clearly bad but doesn't fail any existing criterion, that's a missing criterion. Treat this as error analysis, not just debugging: cluster the failure types, name them, and translate each cluster into a criterion. This feedback loop produces a test suite that reflects how your system actually fails, not how you imagined it might.
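The clustering step can be as simple as counting hand-assigned labels. A sketch, with invented failure records:

```python
from collections import Counter

# During error analysis, each bad output gets a hand-assigned label.
failures = [
    {"id": 1, "label": "answered instead of asking for clarification"},
    {"id": 2, "label": "hallucinated entity"},
    {"id": 3, "label": "answered instead of asking for clarification"},
]

clusters = Counter(f["label"] for f in failures)
# The biggest cluster names the next criterion to write.
top_label, top_count = clusters.most_common(1)[0]
```

The labels then become criteria: "answered instead of asking for clarification" translates into a judge rubric plus a set of underspecified-query test cases.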

What Changes When You Do This Upfront

The business case for spec-to-eval is iteration speed. Without explicit criteria, every prompt change requires subjective review by someone with context. Two engineers can look at the same output diff and disagree about whether it improved. You run another review cycle to resolve it, and by then the original context is stale.

With explicit criteria, a prompt change produces a metric: the test suite pass rate went from 74% to 81%, specifically because cases in the "failure to ask for clarification" category improved from 52% to 76%. This is falsifiable, discussable, and reproducible. You can hand it to a teammate who wasn't in the original review and get the same assessment.
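The per-category breakdown is what makes the number diagnostic rather than just a score. A sketch of the arithmetic, over an assumed results format of one row per (case, category):

```python
results = [
    {"category": "clarification", "passed": True},
    {"category": "clarification", "passed": False},
    {"category": "grounding",     "passed": True},
]

def pass_rate(rows: list[dict]) -> float:
    return sum(r["passed"] for r in rows) / len(rows)

overall = pass_rate(results)
by_category = {
    cat: pass_rate([r for r in results if r["category"] == cat])
    for cat in {r["category"] for r in results}
}
```

An overall pass rate that moved tells you something changed; the per-category rates tell you where, which is what the next iteration actually needs.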

The iteration compression comes from eliminating the subjective review step. You still need human judgment — to write and calibrate the criteria, to validate that automated judges are tracking what you care about, to catch cases your eval suite doesn't cover. But you're spending human judgment on the criteria, once, rather than on every individual output, every time.

Teams that build this discipline report 3-5× shorter iteration cycles not because the tooling is faster, but because the decisions that used to require a meeting can now be made by looking at a number. The spec-to-eval translation is where you front-load the hard thinking.

Common Failure Modes

Writing criteria that are too generic. "Faithfulness" and "helpfulness" look like criteria but aren't. They're properties that need to be operationalized for your specific product. A legal document assistant's faithfulness criterion is different from a customer support bot's. Generic criteria produce evals that tell you everything is fine until something obviously breaks.

Building a God Evaluator. A single evaluator that scores multiple dimensions simultaneously is hard to calibrate and hard to improve. When it fails, you can't tell which dimension drove the failure. Build one evaluator per criterion. They can run in parallel; they should be diagnosed separately.
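One-evaluator-per-criterion can be as lightweight as a dict of named checks reported separately. A sketch with illustrative criterion names:

```python
# One small evaluator per criterion, instead of one score for everything.
EVALUATORS = {
    "under_200_words":        lambda out: len(out.split()) <= 200,
    "no_first_person_plural": lambda out: "we think" not in out.lower(),
}

def run_all(output: str) -> dict[str, bool]:
    # Each criterion gets its own verdict, so a failure names its own cause.
    return {name: check(output) for name, check in EVALUATORS.items()}

report = run_all("We think the answer is 42.")
# {"under_200_words": True, "no_first_person_plural": False}
```

When this report fails, you know exactly which criterion failed, which is precisely what a single blended score can't tell you.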

Treating the criteria as finished. The spec-to-eval translation should happen before you write prompts, but the criteria are a living document. New failure modes will surface in production. New features will introduce new success definitions. The test suite that shipped with the feature should be a starting point, not a ceiling.

Skipping the negative cases because they're harder to construct. It takes more imagination to invent cases where the model should not do something than cases where it should. This is why eval suites tend to be one-sided, and why one-sided evals produce models that are confidently overactive. Negative cases are worth the effort.

The Starting Point

You don't need an eval harness to start this process. You need a document — a list of criteria in the form "given [input conditions], the output [verb] [specific property]." Write it before you write the first prompt. Test it with the two-expert check. Separate the objective criteria from the subjective ones and decide how each will be measured.
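The "given [input conditions], the output [verb] [specific property]" form maps naturally onto a small structured record, which later becomes the eval suite's backbone. The field names here are illustrative, not a framework:

```python
# A criteria document as data: one record per falsifiable criterion.
CRITERIA = [
    {
        "given":  "the user's query is in Spanish",
        "output": "is written in Spanish",
        "kind":   "subjective",   # needs an LLM judge with a binary rubric
    },
    {
        "given":  "the task requires structured output",
        "output": "parses as valid JSON",
        "kind":   "objective",    # checkable directly in code
    },
]

objective_criteria = [c for c in CRITERIA if c["kind"] == "objective"]
```

Sorting criteria by `kind` up front decides how each will be measured before any harness exists.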

The translation will surface questions you didn't know you had. That's the point. The questions that the spec-to-eval process forces you to answer are exactly the questions your team would have argued about at launch — after you'd already built and shipped the wrong thing.

Answer them first.
