Your Eval Suite Is the Product Spec You Refused to Write
Open the PRD for any AI feature shipping this quarter. Notice the adjectives. The assistant should be helpful. Responses should feel natural. The agent should understand the user's intent. The summary should be accurate and concise. Every one of these words is a place the team gave up. They did not decide what the feature does. They decided how they would describe the feature to each other in a meeting, then handed the actual product definition — quietly, without anyone calling it that — to whoever wrote the eval suite.
This is not a documentation problem. The eval is the spec. The PRD is a press release written before the product exists. The fuzzy adjectives in the doc become unambiguous behavioral assertions in the eval, or they become nothing — the model picks an interpretation, ships it, and the team discovers a quarter later that "concise" meant something different to the reviewer than to the user, and different again to whoever tuned the prompt last sprint. An AI feature whose eval suite is thin is a feature whose product definition is thin. The model didn't fail. The team never decided what success meant.
Adjectives Are Decisions Postponed
Read a traditional PRD as if you have to implement it from the words on the page. "Helpful" is a vector with a dozen components: does the assistant pick the user's task over the literal request; does it ask before guessing when the request is ambiguous; does it suggest related tasks the user did not ask about; does it answer a malformed question as the user meant it or as they typed it. A PM and an engineer can both nod at "helpful" while privately holding incompatible answers to all twelve. The eval, when it gets written, has to pick — every test case is an unspoken choice on each axis.
The same is true of "concise." A 200-word answer is not concise to a user who wanted a number. A 12-word answer is not concise to a user who wanted reasoning. Concise is a function of the request, the user's expertise, the channel they are reading on, and the consequence of being wrong. The PRD pretended this was one decision; the eval forces twelve of them, written down, defended in code review, ratified by the test passing or failing on real cases.
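To make that concrete, here is a minimal sketch in plain Python of what "concise" looks like once it is a written-down decision rather than an adjective; EvalCase, max_words_for, and the specific cutoffs are illustrative assumptions, not anyone's real harness.
```python
# A sketch of "concise" as a decision rather than an adjective.
# EvalCase, max_words_for, and the cutoffs are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class EvalCase:
    request: str
    request_kind: str   # e.g. "lookup" or "explanation"
    response: str


def max_words_for(request_kind: str) -> int:
    # The cutoff is the product decision; writing it down is the point.
    return {"lookup": 25, "explanation": 150}.get(request_kind, 80)


def is_concise(case: EvalCase) -> bool:
    return len(case.response.split()) <= max_words_for(case.request_kind)


# A case the PM can read, veto, or ratify in review:
case = EvalCase(
    request="What's our refund window?",
    request_kind="lookup",
    response="30 days from delivery; 14 days for sale items.",
)
assert is_concise(case)
```
The numbers are not the point. The point is that the cutoff now lives somewhere a PM can read it, argue with it, and ratify it.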
The shipping order most teams use puts these decisions in the wrong place. Engineering writes the eval after the prompt is mostly built, on data sampled from whatever traces have already accumulated. The eval encodes the prompt's existing behavior, calls that the spec, and freezes it. The PM, busy with the next launch, never reads the cases. They were not invited to. So the team optimizes a behavioral contract no product person ever ratified, against test cases written by whoever was on call when the eval got stood up.
Write the Eval First, in the PM's Voice
Inverting the workflow is harder than it sounds, because most PMs have never written an eval case and most engineers have never written a product spec. The discipline is not "the engineer writes evals instead of the PM writing the PRD." It is that the eval cases are the PRD, and writing them is a product-decision activity, not a testing activity.
Start with a feature brief that is exactly long enough to motivate the eval — the user, the surface, the success outcome. Then sit with the PM and write the cases. Cases come in three flavors, and each one pins down a different kind of decision the PRD evaded (a code sketch follows the list):
- Disambiguation cases turn each adjective into a yes-or-no. "The assistant is concise" becomes ten request/response pairs where the team agrees the response is appropriately short and ten where it is too long, with the cutoff visible in the data, not asserted in prose.
- Refusal cases decide what is out of scope, by example. The PRD often says "the assistant focuses on X" without enumerating the not-X it must decline. Refusal cases are where the team admits which adjacent tasks they are deliberately not building.
- Edge cases force the trade-offs the PRD smoothed over. What does the assistant do with a request that is partially in scope and partially refused? With a question the user phrased rudely? With a fact that is true but not helpful? Each case is a decision; collecting them is the spec.
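Here is a sketch of what those three flavors might look like as plain data, assuming cases are stored as dictionaries a reviewer can read in a pull request; the field names and the example requests are invented for illustration.
```python
# The three case flavors as plain data. Field names and the example
# requests are illustrative assumptions, not a real suite.
CASES = [
    # Disambiguation: the adjective becomes a labeled pair the team ratified.
    {
        "kind": "disambiguation",
        "request": "When does my trial end?",
        "response": "Your trial ends on March 3.",
        "label": "appropriately_short",
    },
    # Refusal: the not-X the team is deliberately not building, by example.
    {
        "kind": "refusal",
        "request": "Can you also negotiate a discount with the vendor for me?",
        "expected_behavior": "decline_and_point_to_supported_tasks",
    },
    # Edge: a trade-off the PRD smoothed over, forced into one answer.
    {
        "kind": "edge",
        "request": "Cancel my account. Also, why is this product so bad?",
        "expected_behavior": "handle_cancellation_do_not_take_the_bait",
    },
]
```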
Pre-build eval frameworks formalize this. For each scenario you specify expected intent, required extracted data, allowed tool behavior, grounding rules, response expectations, and explicit failure conditions. The framework's discipline is that vague requirements have nowhere to hide — every line is a testable claim.
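A single scenario spec along those lines might look like the sketch below; the keys mirror the list above, but they are assumptions made here for illustration, not the schema of any particular framework.
```python
# One pre-build scenario spec with the fields named above. The shape is a
# sketch; every key here is an assumption, not a framework's real schema.
SCENARIO = {
    "name": "order_status_lookup",
    "expected_intent": "check_order_status",
    "required_extracted_data": ["order_id"],
    "allowed_tool_behavior": ["orders.lookup"],  # any other tool call fails the case
    "grounding_rules": "answer only from the lookup result, never from memory",
    "response_expectations": [
        "states current status and expected delivery date",
        "under 40 words",
    ],
    "failure_conditions": [
        "invents a status when the lookup returns nothing",
        "calls a tool outside allowed_tool_behavior",
    ],
}
```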
The Hamel Counterargument, and What It Actually Says
There is real pushback in the practitioner community against this framing. The most cited version is Hamel Husain's: writing evaluators before implementing features sounds appealing but creates more problems than it solves. LLM failure surfaces are unpredictable, you cannot anticipate what will break, and the better path is error analysis on real outputs followed by evaluators that target observed problems.
This is correct, and it is not a refutation of eval-as-spec. The two arguments are about different things.
Hamel is talking about the coverage of an eval suite — the inventory of failure modes you score against. There, error analysis wins, because the long tail of weird LLM behavior cannot be enumerated from a desk. You discover it by reading thousands of traces and clustering the bad ones. Writing speculative evaluators for failure modes you imagine is wasted work; the failures you ship will not be the ones you imagined.
Eval-as-spec is talking about the definition of success: the behavioral decisions the PRD evaded, which no amount of trace reading will make for you, because traces only show what the current prompt does, not what the product should do. Write the defining cases before you build; let error analysis on real outputs grow the suite's coverage after you ship. The two practices are sequential, not rival.
Sources
- https://www.braintrust.dev/blog/evals-are-the-new-prd
- https://hamel.dev/blog/posts/evals-faq/
- https://vikasgoyal.github.io/agentic/prebuildevals.html
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://www.lennysnewsletter.com/p/beyond-vibe-checks-a-pms-complete
- https://www.productboard.com/blog/ai-evals-for-product-managers/
- https://addyosmani.com/blog/good-spec/
- https://github.com/itsderek23/awesome-eval-driven-development
