The Eval Suite That Became the Spec Nobody Agreed To
Open any mature agent codebase and ask a simple question: where is the requirements document? Not the pitch deck, not the launch doc, not the Notion page that was last touched in Q3. Where is the artifact that says, concretely and unambiguously, what this agent is supposed to do?
For most teams, the honest answer is the eval suite. There is a folder of test cases — inputs paired with expected outputs, rubrics, judge prompts — and a CI gate that says pass or fail. That folder is the only place where "correct" is defined precisely enough to be executed. Everything else is prose, and prose drifts.
This is not inherently bad. An executable spec is more honest than a PRD that nobody reads. The problem is that almost nobody treats the eval suite as a spec. It was usually assembled by one or two engineers, under deadline, to make a release gate go green. It encodes a hundred judgment calls that were never written down, never reviewed, and never agreed to. And the model is now optimized against precisely that artifact.
Whoever Writes the Evals Is Writing the PRD
In traditional software, the requirements document and the test suite are different artifacts with different authors. Product writes the spec. Engineering writes tests that check the spec. When they disagree, the disagreement is visible — a test contradicts a requirement, and someone has to reconcile them.
Agents collapse those two artifacts into one. There is no requirements doc that survives contact with production, because "the agent should be helpful and accurate" is not a specification — it is a vibe. The only place "accurate" becomes concrete is the eval set, where some engineer decided that this response to this prompt is correct and that one is not. Every one of those decisions is a product decision. The person writing evals is writing the PRD, whether they signed up for that job or not.
Consider a support agent. An eval case asks: "Can I get a refund?" The expected output the engineer wrote says the agent should explain the 30-day policy and offer to start a return. That single test case silently decides several things. It decides the agent quotes policy rather than deferring to a human. It decides the agent is proactive — it offers the next step rather than waiting to be asked. It decides the tone is informative rather than apologetic. None of those were in a spec. They were in the head of whoever wrote the test, at 6 p.m., trying to get a number above 90%.
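To make those hidden choices visible, here is a minimal sketch of the refund case with its implied decisions written down next to the expected output. The `EvalCase` shape and the `implied_decisions` and `approved_by` fields are assumptions made for illustration, not a format any particular eval framework uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One eval case plus the product decisions its expected output implies."""
    prompt: str
    expected: str
    # Hypothetical field: the judgment calls baked into `expected`.
    implied_decisions: tuple[str, ...] = ()
    # Hypothetical field: people outside engineering who agreed to them.
    approved_by: tuple[str, ...] = ()


refund_case = EvalCase(
    prompt="Can I get a refund?",
    expected="Explain the 30-day policy and offer to start a return.",
    implied_decisions=(
        "Agent quotes policy instead of deferring to a human",
        "Agent proactively offers the next step",
        "Tone is informative rather than apologetic",
    ),
)


def unreviewed_decisions(case: EvalCase) -> tuple[str, ...]:
    """Return the implied decisions that nobody has signed off on."""
    return () if case.approved_by else case.implied_decisions


for decision in unreviewed_decisions(refund_case):
    print(f"UNREVIEWED: {decision}")
```

Nothing here changes how the case is graded. It only records, in the same file, the product decisions the expected output already makes.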
Multiply that by three hundred test cases and you have a detailed product specification. It is just one that was never reviewed by product, never seen by legal, and never reconciled against what the marketing site promises. As one widely shared practitioner guide puts it, annotation is the place where opinions quietly become ground truth — and ground truth is the only thing an eval can stand on.
A Green Run Is a Statement About the Author's Assumptions
The dangerous part is what a passing eval run feels like. It feels like the product works. It is green, it is in CI, it gated the release. The emotional weight of "all tests pass" carries over from a world where tests checked an independently authored spec.
But an agent eval is not that. A green run is a statement that the model's behavior matches the assumptions of the person who wrote the eval set. If that person misunderstood the refund policy, the eval encodes the misunderstanding, the model is prompted and tuned to satisfy it, and the green checkmark certifies the wrong behavior with full confidence. The suite cannot catch an error it shares.
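A toy illustration of a shared error, under invented assumptions: the real policy window is 30 days, the eval author remembered it as 60, and `tuned_agent` is a stand-in for an agent that was iterated on until the suite passed. All names and the 60-day figure are hypothetical.

```python
# Assumed for illustration: the real policy is 30 days, but the eval
# author misremembered it as 60 when writing the expected output.
ACTUAL_POLICY_DAYS = 30
AUTHOR_BELIEF_DAYS = 60


def refund_answer(days: int) -> str:
    return f"Refunds are available within {days} days of purchase."


eval_case = {
    "prompt": "Can I get a refund?",
    "expected": refund_answer(AUTHOR_BELIEF_DAYS),
}


def tuned_agent(prompt: str) -> str:
    """Stand-in for an agent iterated on until the suite went green."""
    return refund_answer(AUTHOR_BELIEF_DAYS)


def passes(case: dict, agent) -> bool:
    """Exact-match grading against the author's expected output."""
    return agent(case["prompt"]) == case["expected"]


suite_green = passes(eval_case, tuned_agent)
matches_policy = tuned_agent(eval_case["prompt"]) == refund_answer(ACTUAL_POLICY_DAYS)
print(suite_green, matches_policy)  # True False
```

The suite is green and the behavior is wrong, and no check inside the suite can tell the difference.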
This is worse than no eval, in one specific way: it manufactures false certainty. A team with no evals knows it is flying blind and stays nervous. A team with a comprehensive green suite stops being nervous. The suite becomes an authority nobody is auditing. Offline benchmarks measure performance against fixed prompts, fixed labels, and a fixed notion of correctness — and production violates all three. The label is only as good as the afternoon the engineer spent writing it.
There is also a quieter failure. Because the model is optimized against the eval set — through prompt iteration, fine-tuning, or reward signals — it learns the eval author's blind spots as features. Whatever the author forgot to test, the model is free to get wrong, and it will, because nothing pushed back. The eval set does not just measure the product. It shapes it, into the exact form of one engineer's attention.
The Drift Between What Evals Measure and What Was Promised
Three documents are supposed to say the same thing: what the eval suite checks, what the product team intends, and what users were told. In practice they diverge, and the eval suite is the one with teeth.
The marketing page says the agent "handles your scheduling end to end." The eval suite tests eleven scheduling scenarios, none of which involve conflicting time zones, because the engineer who wrote them did not think of it — or did, and could not figure out a clean expected output, so skipped it. The product is now defined, operationally, as "scheduling minus the hard part." Nobody decided that. It is an emergent property of which test cases were easy to write.
Two forces widen this gap over time. The first is convenience: cases with a clear, checkable answer get into the suite, and ambiguous ones — exactly the ones where the product makes its money or its lawsuits — get left out because they are annoying to grade. The second is staleness. The policy changes, the onboarding flow changes, a regulation changes, and the prose docs get updated while the eval suite does not, because updating evals is unglamorous and there is no owner. The agent keeps passing. It is now certified against a world that no longer exists.
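Staleness, at least, is mechanically detectable if each case records when it was written. A minimal sketch, assuming hypothetical metadata (a policy area and an authored date per case, plus a hand-maintained table of when each policy area last changed):

```python
from datetime import date

# Hypothetical: when each policy area last changed.
policy_last_changed = {
    "refunds": date(2025, 3, 1),
    "scheduling": date(2024, 6, 15),
}

# Hypothetical eval metadata: case id, policy area, authored date.
eval_cases = [
    {"id": "refund-001", "area": "refunds", "authored": date(2024, 11, 20)},
    {"id": "refund-002", "area": "refunds", "authored": date(2025, 4, 2)},
    {"id": "sched-001", "area": "scheduling", "authored": date(2024, 9, 1)},
]


def stale_cases(cases: list[dict], changed: dict) -> list[str]:
    """Return ids of cases authored before their policy area last changed."""
    return [
        case["id"]
        for case in cases
        if case["area"] in changed and case["authored"] < changed[case["area"]]
    ]


print(stale_cases(eval_cases, policy_last_changed))  # ['refund-001']
```

A stale case is not necessarily wrong. It is one that someone needs to look at again before its pass counts for anything.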
The result is an agent whose real behavior is governed by an artifact that nobody is comparing against intent. The eval suite wins every argument by default, because it is the thing wired into CI. It does not win because it is right.
Treat the Eval Set Like an API Schema
The fix is not more evals. It is to treat the eval set as what it actually is — a contract — and give it the review rigor a contract deserves. An API schema does not get changed by one person quietly merging a commit. Neither should the artifact that defines what your agent is for.
A few concrete moves:
- Make eval changes reviewable as product decisions. A diff that adds, removes, or relabels test cases is a change to the spec. It should be reviewed by someone who owns the product surface, not just rubber-stamped by whoever can read TypeScript. The review question is "do we agree this is correct?" — not "does this run?"
- Require cross-functional sign-off on the rubric, not the runner. Product and, where stakes warrant it, legal or compliance should sign off on what "correct" means for high-risk categories: refunds, medical guidance, financial advice, and any case where the agent is expected to abstain or escalate. They do not need to read the harness code. They need to read the expected outputs.
- Version the eval set and tie it to a date. When an eval case was written matters. A case authored against last year's refund policy is a stale snapshot, and it should be visible as one. Treat the suite like a schema with a changelog, so "what does this version of the agent consider correct?" has an answer.
- Track coverage against intent, not against itself. Map eval categories back to the promises on the product page and the obligations in policy docs. The gaps — promised but untested — are your real risk register. A 98% pass rate on the cases you happened to write tells you nothing about the cases you avoided.
- Make "who reviewed the evals?" a release-gate question. Right now the gate asks "did evals pass?" Add a second question: "did anyone with product authority review the eval changes since the last release?" If the answer is no, the green checkmark is unsigned.
None of this requires new tooling. It requires admitting that the eval suite is the spec and routing it through the same approval path as anything else that defines the product.
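As an example of how little tooling is involved, the coverage-against-intent check can be sketched with two dictionaries. The promise wording, category tags, and case counts below are invented for illustration; the mapping from promises to categories is itself a product decision that someone has to write down.

```python
# Hypothetical: promises from the product page and policy docs, each
# mapped to the eval categories that would need cases to back it up.
promises = {
    "handles scheduling end to end": {"scheduling-basic", "scheduling-timezones"},
    "explains the refund policy": {"refunds"},
    "escalates medical questions": {"medical-escalation"},
}

# Hypothetical: number of cases per category actually in the suite.
suite_categories = {"scheduling-basic": 11, "refunds": 24}


def coverage_gaps(promises: dict, suite_categories: dict) -> dict:
    """Map each promise to the required categories that have no cases."""
    gaps = {}
    for promise, required in promises.items():
        missing = sorted(
            cat for cat in required if suite_categories.get(cat, 0) == 0
        )
        if missing:
            gaps[promise] = missing
    return gaps


for promise, missing in coverage_gaps(promises, suite_categories).items():
    print(f"{promise}: untested -> {', '.join(missing)}")
```

The output is the risk register described above: every line is something users were told that no eval case checks.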
The Quiet Question to Ask This Week
The next time an agent release goes green, find the eval suite and run git log on it. Look at who has touched it in the last six months. If it is one or two engineers, and no name from product or compliance appears, you have just found your real specification — and the list of people who actually wrote it.
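A small sketch of that check, assuming the suite lives under a directory named `evals` and that you keep a list of people with product or compliance authority (both are assumptions, as are the sample names). It relies only on `git log` with `--since` and `--format=%an`.

```python
import subprocess
from collections import Counter


def count_authors(log_output: str) -> Counter:
    """Count commits per author from `git log --format=%an` output."""
    return Counter(
        line.strip() for line in log_output.splitlines() if line.strip()
    )


def eval_authors(path: str = "evals", since: str = "6 months ago") -> Counter:
    """Run git log on `path` (hypothetical location) and count authors."""
    result = subprocess.run(
        ["git", "log", f"--since={since}", "--format=%an", "--", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_authors(result.stdout)


def no_product_reviewer(authors: Counter, product_owners: set[str]) -> bool:
    """True when nobody with product authority appears in the history."""
    return not (set(authors) & product_owners)


# Sample output standing in for a real repository's history.
sample_log = "Ana\nAna\nBen\nAna\n"
authors = count_authors(sample_log)
print(authors.most_common())  # [('Ana', 3), ('Ben', 1)]
print(no_product_reviewer(authors, {"Priya"}))  # True
```

Inside a real repository, `eval_authors()` replaces the sample string. Commit authorship is only a proxy, since a reviewer may have approved changes without committing, so treat an empty intersection as a prompt to go and ask rather than as proof.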
That is not a disaster. An executable spec written by engineers is still better than a PRD nobody executes. But it is a spec written without the people who are accountable for the product, encoding assumptions they have never seen, certifying behavior they have never approved. The eval suite was always going to become the spec. The only choice you get is whether it is a spec someone agreed to.
References
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://hamel.dev/blog/posts/evals-faq/
- https://developers.openai.com/blog/eval-skills
- https://developers.openai.com/api/docs/guides/evaluation-best-practices
- https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- https://towardsdatascience.com/production-ready-llm-agents-a-comprehensive-framework-for-offline-evaluation/
- https://arxiv.org/pdf/2503.16416
