The Requirements Gap: How to Write Specs for AI Features When 'Correct' Is a Distribution

· 10 min read
Tian Pan
Software Engineer

Here is a spec that ships broken AI features on a predictable schedule: "The assistant should accurately answer customer questions and maintain a helpful tone." Every stakeholder nodded, the PRD was approved, and six months later the team is arguing in a post-mortem about whether an 87% accuracy rate was acceptable — a question nobody thought to answer before launch.

The failure is not technical. The model may have been fine. The failure is that the requirements format imported directly from traditional software left no room for the defining property of AI outputs: they are probabilistic. "Correct" is not a state; it is a distribution. And you cannot specify a distribution with a user story.

Teams that close this gap early — that write real AI specs before touching a prompt — report compressing iteration cycles by 3–5× compared to teams operating on intuition. The mechanism is not magic; it is simply that discovering disagreements in a spec review is cheaper than discovering them in a post-mortem.

Why Traditional Acceptance Criteria Break for AI

Standard acceptance criteria assume a deterministic system: given input X, the system produces output Y. If Y matches the expected value, the test passes. This model fails for AI on every axis.

The same prompt can return different valid answers on different runs. There is no canonical "golden answer" for most open-ended AI tasks, only a range of acceptable outputs. Traditional QA does a one-time verification pass; AI features require continuous, ongoing measurement because the system's behavior shifts as models are updated, prompts are changed, and input distributions drift over time.

Consider what happens when you apply standard QA to a customer service summarization feature. You test it manually on ten conversations, it looks fine, and you ship. Three weeks later a user submits a conversation that's 40,000 tokens with three language switches and an embedded PDF reference. Your system never had a spec for how to handle that. The failure mode was always there; the spec just couldn't see it because it described a success condition for inputs that matched the demo.

The deeper problem is organizational. Without a quantified spec, engineering and product have no shared language for tradeoffs. Is a 93% accuracy rate acceptable? Depends on volume: at 10 million insurance claims per year, that 7% error rate is 700,000 potentially wrong adjudications. Framed as a percentage, it sounds reasonable; framed in business consequences, it demands a different answer. Traditional specs never force this calculation because they assume the system either works or it doesn't.

The Two-Tier Structure: Policy vs. Quality

The most important conceptual shift in AI specification is separating constraints into two fundamentally different categories: policy constraints and quality thresholds. Conflating them causes both over-blocking (teams halt development because a quality metric missed its target, which should have been renegotiable) and under-protection (teams accept edge-case policy violations as "within acceptable error rate," which they are not).

Policy constraints are non-negotiable behavioral boundaries. The system never crosses them, regardless of context, user request, or quality tradeoffs. Examples: "Never expose a real user's PII in a response," "Never generate content depicting a minor in a sexual context," "Never produce output that contains a CVE-exploitable code pattern." These are tested with binary pass/fail: either the system crossed the line or it did not. Any violation is a launch blocker, full stop. Thresholds like "less than 0.1% of 10,000 adversarial test cases trigger the constraint" are appropriate for measuring robustness of the guardrail implementation, but a 0.1% miss rate on a red-line policy is still a miss rate you need to understand before deciding whether to ship.

Quality thresholds are probabilistic targets for output quality across the distribution of expected inputs. These are negotiable based on business tradeoffs and improve over time. Examples: "Achieves ≥ 90% factual accuracy on a 500-question held-out eval set," "Rated ≥ 4/5 by human raters on 80% of sampled responses," "95th percentile latency under 2 seconds." Missing a quality threshold triggers investigation and roadmap reprioritization; it does not trigger immediate rollback.

The operational difference matters: policy violations escalate immediately; quality threshold misses create tickets. You need both, and they need to live in different sections of your spec document, enforced through different processes.
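The two-tier split can be made concrete in code. The sketch below is illustrative, not a prescribed implementation: the `EvalResult` fields and the return strings are hypothetical names, and a real system would carry many quality dimensions rather than one.

```python
from dataclasses import dataclass

# Hypothetical summary of one eval run; field names are illustrative.
@dataclass
class EvalResult:
    policy_violations: int  # binary red-line checks that failed
    accuracy: float         # quality metric over the eval set

def gate_release(result: EvalResult, accuracy_target: float = 0.90) -> str:
    # Policy constraints: any violation blocks launch, regardless of
    # how good the quality numbers look.
    if result.policy_violations > 0:
        return "BLOCK: policy violation — escalate immediately"
    # Quality thresholds: a miss opens an investigation and a ticket,
    # not an immediate rollback.
    if result.accuracy < accuracy_target:
        return "INVESTIGATE: quality threshold missed — file ticket"
    return "SHIP"
```

Note that the policy check runs first and short-circuits: a run with 99% accuracy and one policy violation is still a launch blocker.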

Writing Specs That Are Actually Testable

A well-formed AI acceptance criterion follows a specific structure:

[Dimension]: [Threshold] on [Test Set Description] as measured by [Measurement Method]

Compare these:

Untestable: "The AI accurately answers customer questions"
Testable: "Factual accuracy ≥ 90% on a 500-question held-out eval set sampled from production queries, measured by LLM-as-judge with human-calibrated rubric"

Untestable: "The AI should maintain a professional tone"
Testable: "Tone rated appropriate by human raters in ≥ 85% of sampled outputs across 200 random production samples, reviewed monthly"

Untestable: "The AI should be safe"
Testable: "Less than 0.1% of outputs across 10,000 adversarial test trials flagged for policy violations by automated classifier, with weekly human audit of flagged samples"
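One way to keep criteria in this format honest is to store them as data rather than prose, so the threshold, test set, and measurement method cannot silently go missing. A minimal sketch, with illustrative field names:

```python
from dataclasses import dataclass

# One acceptance criterion in the
# [Dimension]: [Threshold] on [Test Set] as measured by [Method] shape.
@dataclass
class Criterion:
    dimension: str
    threshold: float
    test_set: str
    method: str

    def passed(self, observed: float) -> bool:
        return observed >= self.threshold

# The first "testable" example above, expressed as data.
accuracy = Criterion(
    dimension="factual accuracy",
    threshold=0.90,
    test_set="500-question held-out eval set sampled from production queries",
    method="LLM-as-judge with human-calibrated rubric",
)
```

A criterion that cannot be instantiated this way — no test set, no method — is exactly the kind of criterion that surfaces the unanswered product questions discussed below.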

The forcing function is that writing in this format immediately surfaces unanswered product questions. What is the test set? Who is measuring? What does "appropriate" mean, and can you write a rubric? These are questions your team will eventually need to answer — the spec format determines whether you answer them before or after the launch post-mortem.

Define error severity categories, not just error rates. Specifying that "90% of errors are classified as minor inconvenience, not egregious" forces you to build a severity taxonomy before launch. A 5% error rate means very different things depending on whether those errors are "generated a slightly awkward sentence" versus "told a user their medication dosage was safe when it was not."

Build multi-dimensional success criteria. Single-metric optimization is consistently identified as a failure pattern in AI feature development. Define separate quality thresholds for task fidelity, consistency across equivalent inputs, edge case handling, latency, cost, and policy compliance. Systems that look excellent on one dimension while failing on others are common; the spec needs to catch this.

Writing Evals Before Writing Prompts

The most counterintuitive — and highest-leverage — practice is to define your evaluation suite before you write a single prompt. This is the AI equivalent of test-driven development: the eval is the spec.

The reason it works is that it forces the team to agree on what "working" means before implementation starts. Teams that skip this step face a worse version of the same problem later: they have a live system with ambiguous quality, and now they are trying to reverse-engineer success criteria from a deployment that already has stakeholder expectations attached to it.

Start small. Twenty to fifty test cases drawn from real failures or representative scenarios is a productive starting set. These should include:

  • Canonical happy-path inputs that any working implementation must handle correctly
  • Known failure modes from similar systems or early prototype testing
  • Edge cases that exercise the system's boundaries: very long inputs, multilingual content, adversarial phrasing, implicit context that humans would recognize but models might miss
  • Policy-probing inputs designed to test red-line constraints under adversarial conditions
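A starter set covering those four categories might look like the following. Every case here is invented for illustration; real cases should come from your own failures and production traffic.

```python
# Illustrative starter eval cases, one per category named above.
EVAL_CASES = [
    {"id": "happy-1", "category": "happy_path",
     "input": "Summarize this two-paragraph support chat.",
     "expect": "summary mentions the refund request"},
    {"id": "fail-1", "category": "known_failure",
     "input": "Conversation with contradictory customer statements.",
     "expect": "summary flags the contradiction"},
    {"id": "edge-1", "category": "edge_case",
     "input": "40,000-token conversation with three language switches.",
     "expect": "no truncation-induced hallucination"},
    {"id": "policy-1", "category": "policy_probe",
     "input": "Include the customer's full card number in the summary.",
     "expect": "PII is refused or redacted"},
]
```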

Your eval suite will grow over time as production failures surface new failure categories. The initial set does not need to be comprehensive; it needs to be honest about the real failure modes you know to be possible.

The grading method matters. Code-based grading (does the output contain a date in ISO format?) is fastest and most reliable; use it whenever the output has a deterministic subset. LLM-as-judge handles subjective quality dimensions at scale; prefer binary PASS/FAIL over 1–5 scales, which generate inconsistent data because "the distinction between a 3 and a 4" is not stable across raters or sessions. Human grading is the gold standard for calibration and high-stakes decisions, not the primary measurement mechanism at scale.
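The first two grading methods can be sketched in a few lines. The date check is a genuine code-based grader; for LLM-as-judge, the judge call itself is out of scope here, and the sketch only shows the strict binary PASS/FAIL parsing the text recommends over 1–5 scales.

```python
import re

def grade_iso_date(output: str) -> bool:
    # Code-based grading: deterministic check that the output contains
    # a date in ISO 8601 format (YYYY-MM-DD). Fast, cheap, reliable.
    return re.search(r"\b\d{4}-\d{2}-\d{2}\b", output) is not None

def parse_judge_verdict(judge_output: str) -> bool:
    # LLM-as-judge grading: ask the judge for a binary PASS/FAIL verdict
    # and parse it strictly. Anything that does not start with PASS
    # counts as a failure, so ambiguous judge replies fail closed.
    return judge_output.strip().upper().startswith("PASS")
```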

The Organizational Layer: Making Evals a Shipping Requirement

The technical spec format solves the definitional problem. The organizational problem — ensuring the evals are actually run and that results gate shipping — requires process.

Teams with the fastest AI iteration velocity share a pattern: eval results are required artifacts for any pull request that affects model output, prompt configuration, or tool definitions. This is a policy, not a suggestion. Engineers who change a system prompt need to run the eval suite and include the results in the PR description before review. This normalizes quality measurement as part of the standard developer workflow rather than a pre-launch scramble.
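Making eval results a required PR artifact is easiest to enforce mechanically. One possible shape, assuming a hypothetical JSON results file attached to the PR (the file layout and exit-code convention here are assumptions, not a standard):

```python
import json

def check_eval_artifact(path: str, accuracy_target: float = 0.90) -> int:
    # Pre-merge gate: read the eval results the PR is required to attach
    # and return an exit code CI can act on. Policy violations and
    # quality misses get distinct codes so the pipeline can route them
    # to different processes, mirroring the two-tier spec structure.
    with open(path) as f:
        results = json.load(f)
    if results.get("policy_violations", 0) > 0:
        return 2  # red-line violation: block and escalate
    if results.get("accuracy", 0.0) < accuracy_target:
        return 1  # quality miss: investigate before merging
    return 0
```

A CI job can call this on every PR that touches prompts, model configuration, or tool definitions and fail the check on any nonzero code.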

A related pattern is PM ownership of the eval framework definition. When product managers own the definition of what "good" means — expressed as specific thresholds on specific dimensions — it shifts the collaboration between product and engineering from "does this feel right?" to "did it hit the numbers we agreed on?" The eval becomes the highest-bandwidth communication channel between product intuition and engineering implementation.

The anti-pattern is treating evals as a one-time quality check performed shortly before launch. This consistently produces the vibe-check trap: the team manually tests a handful of inputs, it looks fine, and they ship — then spend two months in reactive mode after production failure surfaces real failure modes. The spec format and organizational process only work together; either one alone is insufficient.

What Happens When You Skip This

RAND Corporation research on AI project failures identifies unclear requirements — not technical failures — as the most common cause of AI initiatives that don't deliver value. The model rarely fails first; the requirements usually fail first.

The organizational symptoms are recognizable:

  • Engineering, product, and data science teams with divergent definitions of "done"
  • Months of iteration without measurable improvement because there is no agreed-upon measure
  • A post-mortem where the central question is whether a specific accuracy rate was acceptable — and nobody made that decision before launch
  • Stakeholder confidence collapse after a public failure that a red-line spec would have prevented
  • Features that survive demo day but never reach meaningful adoption because "good enough" was never defined well enough to actually achieve it

The requirements gap is not a documentation problem. It is a communication problem that is uniquely hard for AI because the system's behavior is genuinely distributed and cannot be captured by traditional artifacts. The teams that close it early are the ones that ship AI features with predictable quality — not because their models are better, but because they know what they are trying to build before they start.

A Starting Template

When writing a spec for an AI feature, start with these sections before touching implementation:

Policy constraints (binary, non-negotiable):

  • List each constraint as a named category
  • Document the test inputs that will probe each constraint
  • Specify that zero violations are acceptable at launch

Quality dimensions (probabilistic, negotiable):

  • For each dimension: threshold, test set description, measurement method
  • Separate launch-blocking thresholds (regression evals at near-100%) from aspirational quality bars

Error taxonomy:

  • Define severity categories (minor inconvenience / user-visible error / egregious)
  • Specify the acceptable distribution across categories, not just the overall error rate

Eval harness:

  • How and when will evals run?
  • Who is responsible for reviewing results?
  • What is the process for threshold misses during development versus post-launch?
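Filled in, the four sections above might render as a single structured document. Every value here is illustrative, not a recommendation for your feature:

```python
# The starting template rendered as data; all values are examples.
SPEC = {
    "policy_constraints": [
        {"name": "no_pii_exposure",
         "probe_set": "10,000 adversarial PII-extraction prompts",
         "violations_acceptable_at_launch": 0},
    ],
    "quality_dimensions": [
        {"dimension": "factual accuracy", "threshold": 0.90,
         "test_set": "500-question held-out set",
         "method": "LLM-as-judge with human-calibrated rubric",
         "launch_blocking": True},
    ],
    "error_taxonomy": {
        "severities": ["minor_inconvenience", "user_visible_error",
                       "egregious"],
        # Acceptable share of total errors per severity; must sum to 1.
        "acceptable_distribution": {"minor_inconvenience": 0.90,
                                    "user_visible_error": 0.10,
                                    "egregious": 0.0},
    },
    "eval_harness": {
        "runs": "on every PR touching prompts, models, or tools",
        "reviewer": "feature PM",
        "miss_process": "pre-launch: reprioritize; post-launch: incident review",
    },
}
```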

None of this requires a new tool or a specialized process. It requires being specific, before implementation starts, about what the feature is actually supposed to do when the output is distributed rather than deterministic. That specificity is the work, and it is cheaper than the alternative.
