
Writing Acceptance Criteria for Non-Deterministic AI Features

Tian Pan · Software Engineer · 12 min read

Your engineering team has been building a document summarizer for three months. The spec says: "The summarizer should return accurate summaries." You ship it. Users complain the summaries are wrong half the time. A postmortem reveals no one could define what "accurate" meant in a way that was testable before launch.

This is the standard arc for AI feature development, and it happens because teams apply acceptance criteria patterns built for deterministic software to systems that are fundamentally probabilistic. An LLM-powered summarizer doesn't have a single "correct" output — it has a distribution of outputs, some acceptable and some not. Binary pass/fail specs don't map onto distributions.

The problem isn't just philosophical. It causes real pain: features launch with vague quality bars, regressions go undetected until users notice, and product and engineering can't agree on whether a feature is "done" because nobody specified what "done" means for a stochastic system. This post walks through the patterns that actually work.

Why "Should Return Accurate Results" Is Not a Spec

Traditional acceptance criteria come from a world where the system either does the thing or it doesn't. Given a valid HTTP request, the endpoint returns 200. Given a specific input, the function returns the expected output. These specs translate directly into automated tests that produce binary results.

LLM-powered features don't work this way. Ask the same feature the same question twice and you may get two different answers; ask a thousand times and you get a distribution. Some responses in that distribution are good; some are bad; most are somewhere in the middle. The question isn't whether the feature returns a correct result — it's what fraction of results are acceptable, under what conditions, and how that fraction holds up as the input distribution shifts.

"Accurate" also carries a definition problem. For a summarizer, does accurate mean factually consistent with the source document? Concise? Covering the main points? Not hallucinating details not in the original? Each of these is a different criterion, each is measurable differently, and they sometimes conflict. A concise summary may omit nuance; a complete summary may include material the user didn't need. Collapsing all of this into "accurate" creates a spec that every engineer on the team will interpret differently.

The first step is decomposition: break down what "good" means for your specific feature into distinct, measurable dimensions. A summarizer spec might separate factual consistency (no hallucinated claims), coverage (key points present), concision (under 150 words for source docs under 2,000 words), and tone (neutral, professional). Each dimension needs its own criterion and its own measurement approach.

Eval Threshold Contracts

Once you have measurable dimensions, you can write eval threshold contracts — the AI analog of acceptance criteria. The structure is: this metric, measured this way, on this test set, should meet this threshold with this confidence.

A concrete example: "Factual consistency score, measured by cross-referencing extracted claims against the source document using an LLM judge, should exceed 0.85 on the 200-example golden dataset, with the lower bound of the 95% confidence interval above 0.82."

This contract specifies what you're measuring (factual consistency), how (LLM-based claim extraction and verification), on what data (a fixed golden dataset), what threshold passes (0.85), and what uncertainty is acceptable (CI lower bound above 0.82). An engineer can implement this test, run it on a PR, and get a clear signal.
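A contract like this can live in code next to the eval harness. The sketch below (hypothetical names, Python) shows the shape: the contract is data, and the pass/fail check reads both the point estimate and the confidence-interval lower bound.

```python
from dataclasses import dataclass

@dataclass
class ThresholdContract:
    """One eval threshold contract: metric, threshold, and CI floor."""
    metric_name: str
    threshold: float       # mean score must exceed this
    ci_lower_floor: float  # 95% CI lower bound must exceed this

def contract_passes(contract: ThresholdContract,
                    mean_score: float, ci_lower: float) -> bool:
    """A contract passes only if both the point estimate and the
    confidence-interval lower bound clear their respective bars."""
    return mean_score > contract.threshold and ci_lower > contract.ci_lower_floor

# The example contract from the text: 0.85 mean, CI lower bound above 0.82.
contract = ThresholdContract("factual_consistency", 0.85, 0.82)
print(contract_passes(contract, mean_score=0.88, ci_lower=0.83))  # True
print(contract_passes(contract, mean_score=0.88, ci_lower=0.81))  # False
```

Keeping the contract as data (rather than burying the numbers in test code) also makes it easy to review in the spec PR before development starts.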

The threshold level depends on context. A legal document review tool needs factual consistency above 0.95 with a tight confidence interval. An internal tool for summarizing meeting notes might accept 0.80. The important thing is that the threshold is explicit, written down before development starts, and agreed to by product and engineering before launch rather than debated after complaints arrive.

Statistical calibration matters here. A threshold of 0.90 on 20 examples means very little — the confidence interval is so wide that you can't tell whether you have a 0.85 feature or a 0.95 feature. Practical guidance: for features where failures are costly, use 300-500 examples and bootstrap the confidence interval. For lower-stakes internal tools, 50-100 examples with explicit acknowledgment that you're accepting more uncertainty. The sample size choice should be documented as a deliberate tradeoff, not left implicit.
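For the interval itself, a percentile bootstrap over per-example scores is a reasonable default. A minimal sketch, standard library only, using synthetic scores to show how the interval tightens with sample size:

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of
    per-example eval scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# 20 examples: the interval is wide, so the threshold signal is weak.
small = [0.9] * 16 + [0.5] * 4
# 400 examples with the same mean: the interval tightens considerably.
large = small * 20
print(bootstrap_ci(small))
print(bootstrap_ci(large))
```

Running this shows the 20-example interval spanning roughly ±0.07 around the mean versus roughly ±0.015 at 400 examples — the concrete version of "you can't tell whether you have a 0.85 feature or a 0.95 feature."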

Error budgets work well alongside thresholds. Instead of framing the spec as "must achieve 0.90," frame it as "may fail on up to 10% of cases." This framing maps naturally to operational thinking — teams that already work with error budgets for uptime find it intuitive to apply the same concept to AI feature quality.

Example-Based Behavior Specs

Threshold contracts tell you when a feature passes overall, but they don't tell you what behaviors are required or forbidden. Example-based specs fill this gap by anchoring requirements to specific instances.

A behavior spec for a customer support bot might look like this: given a message containing a cancellation request, the response must acknowledge the request, confirm the cancellation process, and not offer a retention discount to users flagged as high-churn-risk in the context. This is a concrete, testable behavior — you can run it, evaluate the output against the criteria, and get a clear answer.
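A behavior spec like this can be encoded as a small check function. The sketch below uses naive keyword matching purely as a stand-in for whatever classifier or LLM judge actually evaluates each criterion; the function and field names are hypothetical:

```python
def check_cancellation_response(response: str, high_churn_risk: bool) -> dict:
    """Evaluate one behavior spec: acknowledge the request, confirm the
    process, and withhold retention discounts from high-churn-risk users.
    Keyword checks here are a toy stand-in for a real classifier/judge."""
    text = response.lower()
    return {
        "acknowledges_request": "cancel" in text,
        "confirms_process": "within" in text or "process" in text,
        "no_discount_when_flagged": not (high_churn_risk and "discount" in text),
    }

resp = "We've received your cancellation request; it will process within 2 days."
results = check_cancellation_response(resp, high_churn_risk=True)
print(all(results.values()))  # True
```

Returning per-criterion results, rather than a single boolean, makes eval failures diagnosable: you can see which clause of the spec was violated.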

Golden datasets encode these examples at scale. A well-constructed golden dataset covers the intended input distribution (common cases), edge cases (rare but important), adversarial cases (inputs designed to break the feature), and boundary conditions (cases near the decision boundaries where the feature might fail). Each example should have a clear expected behavior — not a single correct output, but a set of criteria that any acceptable output must satisfy.
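One way to encode such an entry, sketched in Python with hypothetical field names: each example carries the criteria any acceptable output must satisfy, rather than a single reference output.

```python
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    """One golden-dataset entry: an input plus the criteria any
    acceptable output must satisfy, not a single 'correct' output."""
    input_text: str
    category: str                # "common", "edge", "adversarial", "boundary"
    must_include: list = field(default_factory=list)  # required content
    must_exclude: list = field(default_factory=list)  # forbidden content

def satisfies(example: GoldenExample, output: str) -> bool:
    """True if the output meets every criterion on the example."""
    text = output.lower()
    return (all(tok.lower() in text for tok in example.must_include)
            and all(tok.lower() not in text for tok in example.must_exclude))

ex = GoldenExample(
    input_text="Q3 revenue grew 12% while costs held flat.",
    category="common",
    must_include=["12%"],
    must_exclude=["guaranteed"],  # a claim the source never makes
)
print(satisfies(ex, "Revenue rose 12% in Q3 with flat costs."))  # True
```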

Building a useful golden dataset requires discipline that most teams skip. The dataset should be constructed before you finalize the feature, not reverse-engineered from a working implementation. It should be reviewed by at least one person outside the feature team to catch assumptions the team has baked in. It should include cases drawn from your actual user distribution, not just cases the team found easy to write. And it should grow over time as new failure modes are discovered in production.

The most valuable examples in a golden dataset are adversarial ones. These are inputs specifically designed to expose failure modes: edge cases that trigger hallucination, inputs that game a poorly designed criterion, cases where the feature might be technically correct but practically useless. Adversarial examples are hard to generate systematically — the best approach combines human red-teaming with automated perturbation (paraphrasing inputs, substituting entities, reordering context) to find the boundary conditions where the feature breaks.
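The automated-perturbation half of that approach can be sketched simply. The two variants below (entity substitution, sentence reordering) are illustrative only; real red-teaming would add paraphrasing and domain-specific attacks:

```python
import random

def perturb(text: str, entity_map: dict, seed: int = 0) -> list:
    """Generate simple adversarial variants of an input: entity
    substitution and sentence reordering. A toy sketch of automated
    perturbation, not a full red-teaming pipeline."""
    rng = random.Random(seed)
    variants = []
    # Entity substitution: swap names/numbers to catch copy-through
    # and hallucination failures.
    for old, new in entity_map.items():
        if old in text:
            variants.append(text.replace(old, new))
    # Sentence reordering: position-sensitive features often break here.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) > 1:
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        variants.append(". ".join(shuffled) + ".")
    return variants

base = "Acme reported $5M revenue. Growth was driven by the EU launch."
print(perturb(base, {"$5M": "$7M", "Acme": "Initech"}))
```

Each variant gets the same acceptance criteria as its base example, so a feature that copies "$5M" verbatim regardless of the input is caught automatically.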

Defining "Done" Without Pretending AI Is Deterministic

The acceptance criteria patterns above give you something concrete to write on a ticket. But there's a larger process problem: teams that build AI features often don't know when to ship because they lack a shared mental model for "done."

Deterministic features have a clear definition: all tests pass, all acceptance criteria met, done. AI features need a more nuanced model because you're always trading off coverage, confidence, and cost. More examples mean better confidence but higher evaluation cost. Stricter thresholds mean fewer regressions but longer development cycles. A 0.92 feature may be good enough to ship; a 0.89 feature may warrant another iteration — but these numbers are only meaningful if you agreed on the threshold before development, not after.

A useful practice is to define three checkpoints during development. The first is the minimum viable quality bar — the floor below which you won't ship regardless of schedule pressure. This gets written into the spec. The second is the target quality bar — what you're aiming for with available time and resources. The third is an aspirational bar for future iterations. Writing down all three prevents the common failure mode where the team ships at minimum viable quality while everyone thought they were aiming for the target.
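The three bars are easy to make explicit in the spec itself. A minimal sketch, with hypothetical numbers for a factual-consistency dimension:

```python
QUALITY_BARS = {
    "factual_consistency": {
        "minimum_viable": 0.85,  # floor: do not ship below this
        "target": 0.90,          # what this iteration aims for
        "aspirational": 0.95,    # future iterations
    },
}

def ship_decision(dimension: str, measured: float) -> str:
    """Map a measured score onto the three written-down quality bars."""
    bars = QUALITY_BARS[dimension]
    if measured < bars["minimum_viable"]:
        return "block"
    if measured < bars["target"]:
        # Shipped at minimum viable quality -- the gap to target is
        # explicit, not silently forgotten.
        return "ship-with-followup"
    return "ship"

print(ship_decision("factual_consistency", 0.87))  # ship-with-followup
```

The "ship-with-followup" state is the point: it records that the team shipped below target, which is exactly the information lost in the failure mode described above.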

Post-launch quality should also be specified in advance. AI features degrade in ways traditional features don't: the input distribution shifts as users find new ways to use the product, data sources go stale, and model behavior drifts with provider updates. The spec should include a monitoring threshold — the quality level below which the feature gets pulled or degraded gracefully — and a measurement cadence (weekly eval run against the golden dataset, monthly review of production samples).
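A monitoring clause like that reduces to two checks: is the latest score above the pull threshold, and is the measurement cadence being kept? A sketch with hypothetical thresholds:

```python
from datetime import date, timedelta

def needs_action(latest_score: float, pull_threshold: float,
                 last_eval: date, cadence_days: int = 7) -> list:
    """Post-launch checks as the spec's monitoring clause might encode
    them: a quality floor and an eval cadence."""
    actions = []
    if latest_score < pull_threshold:
        actions.append("degrade-or-pull")
    if date.today() - last_eval > timedelta(days=cadence_days):
        actions.append("run-golden-eval")
    return actions

print(needs_action(0.78, 0.80, date.today()))  # ['degrade-or-pull']
```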

Measurement Approaches That Work in Practice

You have three main tools for measuring AI feature quality: automated reference-free metrics, LLM-as-judge evaluation, and human rating. Each belongs in a different part of your process.

Automated reference-free metrics are fast and cheap: length and format checks, perplexity, embedding similarity between the output and its source document. They measure output properties without needing a ground-truth reference. (ROUGE and BERTScore, often mentioned in this context, actually require reference outputs to compare against; they belong in your eval suite only where gold references exist.) Reference-free metrics are useful for catching regressions at the extremes — outputs that are dramatically too short, semantically incoherent, or wildly different from a historical baseline. But they're a weak signal for subtle quality issues and can be gamed by features that learn to score well without actually being good. Use them as a fast gate in CI, not as your primary quality signal.
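A fast CI gate along these lines can be a few lines of code. The checks below (length bounds, crude lexical overlap with the source) are deliberately weak stand-ins for real metrics, and the thresholds are hypothetical:

```python
def fast_gate(output: str, source: str) -> bool:
    """Cheap reference-free checks for CI: catch outputs that are
    dramatically too short, too long, or lexically unrelated to the
    source. A weak signal by design -- this gates extremes only."""
    out_words = output.lower().split()
    src_words = set(source.lower().split())
    if not (10 <= len(out_words) <= 150):
        return False
    # Crude lexical overlap as a stand-in for semantic similarity.
    overlap = sum(1 for w in out_words if w in src_words) / len(out_words)
    return overlap >= 0.3

src = "The board approved the merger after a six month review of antitrust risk."
good = "The board approved the merger following a six month antitrust review."
print(fast_gate(good, src))  # True
```

In practice you would swap the overlap check for an embedding-similarity call; the structure (hard bounds, then a soft similarity floor) stays the same.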

LLM-as-judge evaluation uses another model to score your feature's outputs against a rubric. Done well, it aligns with human judgment 80-85% of the time and scales to thousands of examples cheaply. Done poorly, it introduces systematic biases that undermine your quality signal — models prefer longer outputs, prefer outputs that match their own style, and perform worse outside their training distribution (medical and legal domains are common failure points). The critical practices: use a judge model that's more capable than your generation model, provide a detailed rubric with examples rather than asking for a bare numeric rating, require the judge to explain its reasoning before scoring, and validate the judge against human ratings on a calibration set before trusting it in production.
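The rubric-plus-reasoning pattern can be sketched as a prompt builder and a score parser. The call to the judge model itself is omitted (use whichever client you have), and the rubric wording and `SCORE:` convention are assumptions for illustration:

```python
RUBRIC = """Score the summary for factual consistency on a 1-5 scale.
5: every claim is supported by the source.
3: minor unsupported details; core claims supported.
1: central claims contradict or are absent from the source.
First list each claim and whether the source supports it,
then give the score on its own line as 'SCORE: <n>'."""

def build_judge_prompt(source: str, summary: str) -> str:
    """Assemble a judge prompt: an anchored rubric, with reasoning
    required before the score (per the practices above)."""
    return f"{RUBRIC}\n\nSOURCE:\n{source}\n\nSUMMARY:\n{summary}"

def parse_score(judge_output: str) -> int:
    """Extract the numeric score from the judge's response."""
    for line in judge_output.splitlines():
        if line.startswith("SCORE:"):
            return int(line.split(":")[1].strip())
    raise ValueError("judge returned no score line")

print(parse_score("claim 1: supported\nclaim 2: supported\nSCORE: 4"))  # 4
```

Forcing the reasoning before the score, and anchoring each rubric level with a description, are the two cheapest mitigations for the biases listed above; the calibration set against human labels catches the rest.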

Human rating remains the ground truth. You can't escape it entirely — your golden dataset needs human-labeled examples, your LLM judge needs a calibration set with human labels, and periodic audits of production samples need human review. The goal is to use human rating where it matters most (calibration, high-stakes edge cases, golden dataset construction) and automate everywhere else.

For each dimension in your spec, choose the right measurement tool before you write the threshold. "Factual consistency" needs a judge that can verify claims against source documents — a simple embedding similarity metric won't catch hallucinated proper nouns. "Response concision" can be measured automatically by counting words. "Tone appropriateness" needs either human raters or a well-calibrated judge. The measurement approach is part of the spec, not an implementation detail left to the team.

The Organizational Problem

Most of the friction in AI acceptance criteria isn't technical — it's organizational. Product managers write specs using the vocabulary of deterministic software because that's what they know. Engineers accept vague specs because they're used to working out quality in implementation. Neither side has a shared framework for discussing probabilistic quality, so quality decisions get made implicitly and inconsistently.

The solution is to make the spec template explicit. Teams that ship AI features reliably tend to use a structured spec format that forces the quality question to be answered upfront. A minimal version: for each user-facing behavior, specify (1) what dimension of quality matters, (2) how it will be measured, (3) the minimum acceptable threshold, and (4) the sample size for the evaluation. This takes maybe 30 extra minutes to write and saves multiple rounds of post-launch debate.
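The four-question template can be kept as structured data so that completeness is mechanically checkable. A sketch with hypothetical dimensions and numbers:

```python
SPEC_TEMPLATE = [
    # One row per user-facing behavior: the four questions the
    # structured spec forces the team to answer upfront.
    {
        "dimension": "factual consistency",
        "measurement": "LLM judge, claim extraction vs. source",
        "min_threshold": 0.85,
        "sample_size": 200,
    },
    {
        "dimension": "concision",
        "measurement": "word count, automated",
        "min_threshold": 0.95,  # fraction of outputs under 150 words
        "sample_size": 200,
    },
]

def spec_is_complete(spec: list) -> bool:
    """True only if every row answers all four questions."""
    required = {"dimension", "measurement", "min_threshold", "sample_size"}
    return all(required <= set(row) for row in spec)

print(spec_is_complete(SPEC_TEMPLATE))  # True
```

A check like this can run in CI on the spec file itself, so a feature ticket with an unanswered quality question fails review before any code is written.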

It also helps to separate the quality gate from the behavioral spec. The behavioral spec describes what the feature should do; the quality gate specifies the measurable threshold for shipping. These can be owned by different people — product owns the behavioral spec, engineering owns the quality gate — as long as both are written before development starts rather than inferred from the implementation.

The hardest pattern to break is post-hoc threshold setting: the team builds the feature, measures quality on some examples, and then writes acceptance criteria that the feature already passes. This is the AI equivalent of writing tests after the code — it validates nothing. The threshold needs to be set before you see the results, based on what level of quality is actually required for the use case, not calibrated to what the current implementation achieves.

Conclusion

Non-deterministic AI features require acceptance criteria that match their probabilistic nature. That means decomposing "good" into measurable dimensions, writing eval threshold contracts with explicit sample sizes and confidence intervals, building golden datasets that cover your real input distribution before you ship, and agreeing on measurement approaches before development rather than after launch.

The underlying principle is the same as good spec writing for deterministic software: make the criteria explicit enough that two engineers reading them independently would make the same judgment about whether the feature passes. The difference is that for AI features, "same judgment" means agreeing on what the statistical threshold is and how to measure it — not what the exact output should be.

Teams that get this right ship AI features with less debate, catch regressions earlier, and avoid the postmortems that start with "we thought it was working." The investment is front-loaded into specification, which is where it's cheapest — not discovered at launch, which is where it's most expensive.
