Behavioral Contracts: Writing AI Requirements That Engineers Can Actually Test
Most AI projects that die in the QA phase don't fail because the model is bad. They fail because nobody agreed on what "good" meant before the model was built. The acceptance criteria in the ticket said something like "the summarization feature should produce accurate, relevant summaries" — and when the engineer asked what "accurate" meant, the answer was "you know it when you see it." That is not a behavioral requirement. That is a hope.
The problem compounds because teams imported their existing requirements process from deterministic software and applied it unchanged to systems that are fundamentally stochastic. When you write `assertTrue(output.equals("Paris"))` for a database query, the test either passes or fails with complete certainty. When you write the same shape of assertion for an LLM, you get a test that fails on every valid paraphrase and passes on every confident hallucination. The unit test is lying to you, and the spec it was derived from was never designed for a system that generates distributions of outputs rather than single values.
This is the behavioral contracts problem: how do you write product requirements for AI systems in a way that is precise enough to test, flexible enough to accommodate valid variation, and honest about the failure modes you're willing to tolerate?
Why Classical Acceptance Criteria Fail on AI
Classical acceptance criteria inherit from design by contract — the idea, formalized in the Eiffel language by Bertrand Meyer in the 1980s, that every software component should specify its preconditions (what must be true before it runs), postconditions (what it guarantees when it returns), and invariants (conditions that always hold). This model works elegantly for deterministic code. A sort function accepts a list and must return a list with the same elements in non-decreasing order. You can verify that claim on every call.
The postcondition breaks on LLMs because "what the component guarantees when it returns" is not a single value — it is a distribution over a space of possible strings. Two correct answers to the same question look different. The same answer might be phrased six ways across six calls, all of which are acceptable. Requiring exact match fails all of them. Accepting anything passes the hallucinations. The binary postcondition is the wrong abstraction.
The failure cascades into project management. Product managers write requirements in terms they can hand off to QA. QA writes tests that map to those requirements. If the requirements say "the chatbot must answer factual questions correctly," QA picks five example questions, checks the answers, and ships a green build. Three weeks later, the feature fails on question six — the one nobody thought to include. The test suite is not wrong because the engineers were careless. It is wrong because "answer factual questions correctly" is a description of a class of inputs and a desired behavior property, not a specification of what correct means or how often failure is acceptable.
The Four-Field Contract Format
A behavioral contract for AI replaces the classical postcondition with four fields that together describe what a system must do, over what range of inputs, to what quality threshold, and how you verify it.
Input class: A precise description of the input category the contract governs. Not "user questions" — that's the entire product surface. Instead: "single-turn questions about product pricing, submitted from the checkout flow, where the user is authenticated and the question contains a product SKU." The input class bounds the contract. Requirements that fail to bound the input space are untestable because a system that works for 90% of questions and fails for 10% might be perfectly acceptable or catastrophically broken depending on which 10% you're talking about.
Expected behavior: A description of the desired output property, not the desired output. Not "the answer should be $29.99" but "the response should include the correct price for the referenced SKU, expressed in dollars, with no contradictory pricing information in the same response." Properties are verifiable across outputs that look different on the surface. They can be checked programmatically (does the response contain a price-shaped string that matches the catalog?) or by a model-based judge (does this response contradict itself?). The key shift is from specifying the artifact to specifying the artifact's properties.
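A property like the pricing behavior above can be checked mechanically. The sketch below is illustrative, not from the text: the catalog dict, regex, and function name are all assumptions standing in for a real pricing lookup.

```python
import re

# Hypothetical catalog; in production this would be a pricing-DB query.
CATALOG = {"SKU-1042": "29.99"}

PRICE_RE = re.compile(r"\$(\d+\.\d{2})")

def check_price_property(response: str, sku: str) -> bool:
    """Property check, not exact match: the response contains the catalog
    price for the SKU and no contradictory price in the same response."""
    prices = set(PRICE_RE.findall(response))
    return prices == {CATALOG[sku]}
```

Any phrasing of the answer passes, as long as the one price-shaped string it contains matches the catalog; a response quoting two different prices fails the no-contradiction half of the property.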
Failure budget: The fraction of inputs from the defined input class for which the expected behavior is allowed to fail, over a measurement window. This is the field that classical requirements always omit, and it is the most important field in the contract. A pricing feature that gives wrong prices 2% of the time is a wildly different product than one that gives wrong prices 0.1% of the time. Both are different from a feature that fails 30% of the time on edge-case SKUs but 0.01% on the top-100 bestsellers. The failure budget forces the team to articulate the distribution of acceptable failure rather than implying that zero failures is required — a standard no AI system can meet — or that "low error rate is fine" without defining low.
Test oracle: A concrete, executable description of how you determine whether a given output meets or violates the expected behavior property on a specific input. This is the field that transforms the other three from requirements language into code. The oracle might be a regex check, a structured extraction, a comparison against a ground-truth table, a model-graded rubric, or a human review protocol with specified rubric criteria. The oracle must be specific enough that two engineers running it independently produce the same verdict on the same output.
A Contract in Practice
Here is what a behavioral contract looks like for a document summarization feature:
- Input class: English-language documents between 500 and 5000 words, submitted through the web UI summary endpoint, in the categories "legal," "technical," and "news."
- Expected behavior: The summary must cover all major entities mentioned in the source document (precision: no fabricated entities; recall: no omission of entities mentioned more than twice), must be between 10% and 25% of source length, and must not introduce factual claims absent from the source.
- Failure budget: Fewer than 3% of requests may violate any individual property on the golden test set; fewer than 0.5% may violate the no-fabrication property specifically.
- Test oracle: Entity precision and recall evaluated by spaCy NER extraction applied to both source and summary, with entities cross-referenced against a canonicalized list; factual novelty flagged by a GPT-4-class judge prompted to identify specific claims in the summary that cannot be traced to the source document, with per-claim verdicts logged for human review.
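The four fields lend themselves to being encoded directly as a versioned artifact. The dataclass below is one possible shape, not a prescribed one, and the oracle lambda stubs only the length property; the entity and fabrication checks described above would slot in alongside it.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class BehavioralContract:
    """Four-field contract, versioned alongside the code it governs."""
    input_class: Callable[[dict], bool]       # predicate: is this input governed?
    expected_behavior: str                    # human-readable property description
    failure_budget: float                     # max violation rate on the golden set
    test_oracle: Callable[[dict, str], bool]  # (input, output) -> property holds?

summarization_contract = BehavioralContract(
    input_class=lambda doc: (doc["language"] == "en"
                             and 500 <= doc["word_count"] <= 5000
                             and doc["category"] in {"legal", "technical", "news"}),
    expected_behavior=("Covers all major source entities, 10-25% of source "
                       "length, no factual claims absent from the source"),
    failure_budget=0.03,
    # Stub: checks only the length property; entity-recall and
    # no-fabrication checks would be composed in here.
    test_oracle=lambda doc, summary:
        0.10 <= len(summary.split()) / doc["word_count"] <= 0.25,
)
```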
Notice what this contract does not say: it does not specify a model, a prompt, or an inference strategy. Behavioral contracts are implementation-agnostic. They specify what the system must do, not how. That distinction is critical for AI features because the implementation — model version, prompt, retrieval configuration, post-processing logic — changes constantly. A contract written against a specific prompt becomes obsolete the moment someone edits a sentence in the system message. A contract written against a behavioral property survives model upgrades, prompt rewrites, and architecture changes. It is the thing you test against in CI regardless of what changed.
Writing Input Classes That Actually Bound the Problem
The input class field fails in predictable ways that are worth cataloguing. The most common failure is being too broad: "questions about our product" encompasses a surface so large that any sample will fail to represent it. The second failure is being too narrow: specifying a handful of named examples rather than a category. A list of twelve example inputs is not an input class — it is a test set, and a small one. Input classes need predicates, not enumerations: the input falls in the class if and only if it satisfies conditions A and B and does not satisfy condition C.
A useful heuristic: an input class is defined correctly when you can write a function that accepts or rejects any input in constant time without running the model. If the classification requires inference, it is a behavioral property, not an input class.
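Applying that heuristic to the earlier pricing example gives a predicate like the following; the field names and SKU pattern are assumptions about the request schema, not part of the heuristic itself.

```python
import re

SKU_RE = re.compile(r"\bSKU-\d{4}\b")  # hypothetical SKU format

def in_pricing_input_class(req: dict) -> bool:
    """Constant-time accept/reject: classifies the input without running
    the model. If this needed inference, it would be a behavioral
    property, not an input class."""
    return (req["source"] == "checkout"
            and req["authenticated"]
            and req["turn_count"] == 1
            and bool(SKU_RE.search(req["text"])))
```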
Well-bounded input classes also make failure budget allocation tractable. Once you have input classes with known volume ratios — this class accounts for 60% of production traffic, that one for 5% — you can write contracts with differentiated failure budgets. High-volume, high-stakes classes get tight budgets. Low-volume, low-stakes edge cases get relaxed ones. This is how you avoid the trap of designing every requirement to the standard of the worst case.
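The arithmetic of differentiated budgets is simple enough to sanity-check in a few lines. The traffic shares and per-class budgets below are invented for illustration; the point is that the blended worst-case rate stays visible rather than implicit.

```python
# Hypothetical traffic mix with per-class failure budgets.
classes = {
    "pricing_questions":  {"traffic_share": 0.60, "budget": 0.005},
    "shipping_questions": {"traffic_share": 0.35, "budget": 0.02},
    "edge_case_requests": {"traffic_share": 0.05, "budget": 0.10},
}

# Blended failure rate if every class runs exactly at its budget:
# 0.60*0.005 + 0.35*0.02 + 0.05*0.10 = 0.015
overall = sum(c["traffic_share"] * c["budget"] for c in classes.values())
print(f"{overall:.4f}")  # -> 0.0150
```

The high-volume pricing class carries a tight 0.5% budget while the rare edge cases tolerate 10%, and the overall rate is still bounded at 1.5% rather than being designed to the worst case everywhere.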
Building Test Oracles Without Ground Truth
The oracle field is where most teams stall because they conflate "test oracle" with "labeled dataset." Building a labeled dataset with human annotations at scale is expensive and slow. For many AI features, especially open-ended generation tasks, human labeling at sufficient scale is not feasible before launch.
The practical resolution is that oracles exist on a spectrum from pure automation to pure human judgment, and most production-grade oracles are hybrid. A pricing bot can use a SQL lookup to verify factual accuracy — no human required. A summarization feature can use automated entity extraction for recall and a model-based judge for hallucination, with human spot-checks sampled at 1% to calibrate the judge against ground truth. A customer service feature can use a rubric-based model judge for tone and resolution quality, validated quarterly against human ratings on a stratified sample.
The oracle design question for any requirement is: what is the cheapest automated signal that correlates strongly enough with the behavioral property to serve as a proxy, and at what sample rate does that proxy need human calibration to stay honest? "Strong enough" is determined by measuring the agreement rate between your automated oracle and human judgment on a calibration set. An oracle that agrees with humans 95% of the time on your specific task is suitable for blocking deployments. An oracle at 70% agreement is noise.
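The agreement rate the paragraph describes is just the fraction of calibration samples where the automated oracle and the human reviewer reach the same verdict. A minimal sketch, with an invented ten-sample calibration set:

```python
def agreement_rate(oracle_verdicts: list[bool],
                   human_verdicts: list[bool]) -> float:
    """Fraction of calibration samples where the automated oracle and a
    human reviewer agree on the same output."""
    matches = sum(o == h for o, h in zip(oracle_verdicts, human_verdicts))
    return matches / len(oracle_verdicts)

# Hypothetical calibration run: 10 outputs graded by both.
oracle = [True, True, False, True, False, True, True, False, True, True]
human  = [True, True, False, False, False, True, True, False, True, True]
rate = agreement_rate(oracle, human)  # 9 of 10 verdicts agree -> 0.9
```

In practice a calibration set this small says little; the 1% spot-check sampling mentioned above exists precisely to accumulate enough paired verdicts for the rate to be meaningful. Chance-corrected measures such as Cohen's kappa are a common refinement when the verdict classes are imbalanced.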
Contracts as the Interface Between Product and Engineering
The four-field contract format resolves a structural tension that dogs most AI projects: product managers write requirements in business language; engineers implement and test in technical language; neither is translating correctly. PMs who write "users should get fast, helpful answers" are not being lazy — they are writing in the register that makes sense for their artifact, which is a feature specification, not a test plan. Engineers who implement against that spec and then write tests that amount to "the model ran and didn't throw an error" are also not being negligent — they are working with what they were given.
Behavioral contracts force the negotiation to happen explicitly rather than implicitly. The failure budget field in particular is a business decision masquerading as a technical parameter. What fraction of users can receive an incorrect price before the feature does more harm than good? That is a product question with a numerical answer, and answering it before implementation starts is qualitatively different from discovering the answer after a support escalation. The contract format makes it impossible to defer that negotiation, because the spec is literally incomplete without a number.
Teams that adopt this format report a second-order benefit: requirements are versioned alongside code, and when the model changes, the conversation about whether the behavioral properties still hold happens before deployment rather than after. The contract is the test plan. The test plan is the contract. They are the same document.
The Forward Path
Most engineering teams shipping LLM features in 2026 are testing them less rigorously than they test their login forms. The login form has a binary pass condition, a regression suite, and a deployment gate. The summarization feature has a few informal examples and a shared understanding that "usually it seems fine." Behavioral contracts do not solve all the hard problems of AI evaluation — data distribution shift, evaluator consistency, adversarial inputs — but they close the requirements gap that makes those harder problems impossible to even frame correctly.
The actionable starting point is to rewrite one existing AI feature requirement using the four-field format. Identify the input class. Turn the behavioral description from an output specification to a property specification. Pick a failure rate that the business can actually commit to. Define the oracle you will run in CI. If any of the four fields feel impossible to fill in, that feeling is information: it means the requirement was never precise enough to test in the first place.
A spec you can execute is a spec you can defend. Start there.
