Skip to main content

Writing Acceptance Criteria for Non-Deterministic AI Features

· 12 min read
Tian Pan
Software Engineer

Your engineering team has been building a document summarizer for three months. The spec says: "The summarizer should return accurate summaries." You ship it. Users complain the summaries are wrong half the time. A postmortem reveals no one could define what "accurate" meant in a way that was testable before launch.

This is the standard arc for AI feature development, and it happens because teams apply acceptance criteria patterns built for deterministic software to systems that are fundamentally probabilistic. An LLM-powered summarizer doesn't have a single "correct" output — it has a distribution of outputs, some acceptable and some not. Binary pass/fail specs don't map onto distributions.

The problem isn't just philosophical. It causes real pain: features launch with vague quality bars, regressions go undetected until users notice, and product and engineering can't agree on whether a feature is "done" because nobody specified what "done" means for a stochastic system. This post walks through the patterns that actually work.

Why "Should Return Accurate Results" Is Not a Spec

Traditional acceptance criteria come from a world where the system either does the thing or it doesn't. Given a valid HTTP request, the endpoint returns 200. Given a specific input, the function returns the expected output. These specs translate directly into automated tests that produce binary results.

LLM-powered features don't work this way. Ask the same question twice and you may get two different answers. Ask it a thousand times and you get a distribution. Some responses in that distribution are good; some are bad; most are somewhere in the middle. The question isn't whether the feature returns a correct result — it's what fraction of results are acceptable, under what conditions, and how that fraction holds up as input distribution shifts.

"Accurate" also carries a definition problem. For a summarizer, does accurate mean factually consistent with the source document? Concise? Covering the main points? Not hallucinating details not in the original? Each of these is a different criterion, each is measurable differently, and they sometimes conflict. A concise summary may omit nuance; a complete summary may include material the user didn't need. Collapsing all of this into "accurate" creates a spec that every engineer on the team will interpret differently.

The first step is decomposition: break down what "good" means for your specific feature into distinct, measurable dimensions. A summarizer spec might separate factual consistency (no hallucinated claims), coverage (key points present), concision (under 150 words for source docs under 2,000 words), and tone (neutral, professional). Each dimension needs its own criterion and its own measurement approach.

Eval Threshold Contracts

Once you have measurable dimensions, you can write eval threshold contracts — the AI analog of acceptance criteria. The structure is: this metric, measured this way, on this test set, should meet this threshold with this confidence.

A concrete example: "Factual consistency score, measured by cross-referencing extracted claims against the source document using an LLM judge, should exceed 0.85 on the 200-example golden dataset, with the lower bound of the 95% confidence interval above 0.82."

This contract specifies what you're measuring (factual consistency), how (LLM-based claim extraction and verification), on what data (a fixed golden dataset), what threshold passes (0.85), and what uncertainty is acceptable (CI lower bound above 0.82). An engineer can implement this test, run it on a PR, and get a clear signal.

The threshold level depends on context. A legal document review tool needs factual consistency above 0.95 with a tight confidence interval. An internal tool for summarizing meeting notes might accept 0.80. The important thing is that the threshold is explicit, written down before development starts, and agreed to by product and engineering before launch rather than debated after complaints arrive.

Statistical calibration matters here. A threshold of 0.90 on 20 examples means very little — the confidence interval is so wide that you can't tell whether you have a 0.85 feature or a 0.95 feature. Practical guidance: for features where failures are costly, use 300-500 examples and bootstrap the confidence interval. For lower-stakes internal tools, 50-100 examples with explicit acknowledgment that you're accepting more uncertainty. The sample size choice should be documented as a deliberate tradeoff, not left implicit.

Error budgets work well alongside thresholds. Instead of framing the spec as "must achieve 0.90," frame it as "may fail on up to 10% of cases." This framing maps naturally to operational thinking — teams that already work with error budgets for uptime find it intuitive to apply the same concept to AI feature quality.

Example-Based Behavior Specs

Threshold contracts tell you when a feature passes overall, but they don't tell you what behaviors are required or forbidden. Example-based specs fill this gap by anchoring requirements to specific instances.

A behavior spec for a customer support bot might look like this: given a message containing a cancellation request, the response must acknowledge the request, confirm the cancellation process, and not offer a retention discount to users flagged as high-churn-risk in the context. This is a concrete, testable behavior — you can run it, evaluate the output against the criteria, and get a clear answer.

Golden datasets encode these examples at scale. A well-constructed golden dataset covers the intended input distribution (common cases), edge cases (rare but important), adversarial cases (inputs designed to break the feature), and boundary conditions (cases near the decision boundaries where the feature might fail). Each example should have a clear expected behavior — not a single correct output, but a set of criteria that any acceptable output must satisfy.

Building a useful golden dataset requires discipline that most teams skip. The dataset should be constructed before you finalize the feature, not reverse-engineered from a working implementation. It should be reviewed by at least one person outside the feature team to catch assumptions the team has baked in. It should include cases drawn from your actual user distribution, not just cases the team found easy to write. And it should grow over time as new failure modes are discovered in production.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates