The PRD for an AI Feature: Why Your Old Template Misses the Cliff
The deterministic-software PRD template has aged into a kind of muscle memory. Problem statement, user stories, acceptance criteria, edge cases, success metrics, scope cuts. Engineers know how to read it. PMs know how to fill it in. Designers know which sections to lift wireframes from. It is a well-worn artifact that has shipped a generation of CRUD apps, dashboards, and SaaS workflows.
It also has no field for "what the model gets wrong five percent of the time." No field for "what we accept as a passing eval score." No field for "what the user sees when the model refuses to answer." No field for "which prompt version this PRD locks down, and who is allowed to change it after ship." Every AI feature shipped against that template is shipping with a hidden contract that nobody wrote down. Postmortems keep finding it the hard way.
The gap is not a stylistic mismatch. The deterministic PRD assumes the system maps inputs to outputs through a function whose behavior is known up front. Engineers can enumerate edge cases because the surface is finite. QA can write test cases because the same input produces the same output. Acceptance criteria can be a checklist because pass/fail is unambiguous. None of those assumptions hold for a feature whose core logic is a stochastic model.
When a PM hands an old-template PRD to an AI engineering team, three things happen. The PM writes acceptance criteria they cannot test. The engineer ships against an unspecified bar and hopes nobody notices. The designer draws screens for the happy path and nothing else. Then the feature goes live, the model behaves badly on a class of inputs nobody named, and the team discovers — in the incident channel — that the implicit contract everybody had in their head was not the same contract.
The fix is not a longer PRD. The fix is four new sections that bake the probabilistic nature of the system into the document so the contract is explicit before anyone writes code.
The Behavior Matrix Replaces the Acceptance Checklist
A deterministic acceptance criterion looks like this: "Given a user with role admin, when they click delete, the record is removed and the audit log is updated." It works because the input space is enumerable and the output is binary.
For an AI feature, the input space is effectively infinite and the output is a distribution. You cannot write "given a user prompt, when they hit submit, the model returns the right answer," because there is no row-level definition of "right." What you can write is a behavior matrix: a table whose rows are the input classes the feature must handle, and whose columns state, for each class, the target behavior, the acceptable degradation, and the forbidden output.
A support-summarization feature might have rows like "ticket with attached screenshots," "ticket in a non-English language," "ticket about a topic outside the product domain," "ticket with mixed sentiment across messages." For each row the matrix names a target behavior, an acceptable degradation, and a forbidden output. The non-English row might say target: summary in the user's language; acceptable degradation: summary in English with a language-detected disclaimer; forbidden: silent translation that hides the original language from the agent.
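One way to keep the matrix honest is to store it as data rather than as a table in a slide. The sketch below is a minimal illustration, not a prescribed format: the `BehaviorRow` class, its field names, and the `unfilled_rows` helper are all hypothetical. Only the non-English row is filled in, because that is the only row the example above spells out; the other cells are left empty on purpose so the gaps are visible.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BehaviorRow:
    """One row of a behavior matrix (hypothetical structure for illustration)."""
    input_class: str
    target: Optional[str] = None
    acceptable_degradation: Optional[str] = None
    forbidden: Optional[str] = None

    def is_complete(self) -> bool:
        # A row is complete only when all three cells have been committed to.
        return all((self.target, self.acceptable_degradation, self.forbidden))


BEHAVIOR_MATRIX: List[BehaviorRow] = [
    BehaviorRow("ticket with attached screenshots"),
    BehaviorRow(
        "ticket in a non-English language",
        target="summary in the user's language",
        acceptable_degradation="summary in English with a language-detected disclaimer",
        forbidden="silent translation that hides the original language from the agent",
    ),
    BehaviorRow("ticket about a topic outside the product domain"),
    BehaviorRow("ticket with mixed sentiment across messages"),
]


def unfilled_rows(matrix: List[BehaviorRow]) -> List[str]:
    """Return the input classes whose cells nobody has written down yet."""
    return [row.input_class for row in matrix if not row.is_complete()]


if __name__ == "__main__":
    for name in unfilled_rows(BEHAVIOR_MATRIX):
        print(f"PRD gap: no agreed behavior for '{name}'")
```

A check like `unfilled_rows` could run in review before the PRD is signed off, so an empty cell shows up as an open question rather than as an incident.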
What the matrix forces is the conversation that PMs and engineers were avoiding by leaning on a checklist. It makes the PM commit to which input classes are in scope. It makes the engineer commit to which failure modes are tolerable. It gives the designer a map of refusal and degradation states to draw, not a single happy-path screen. And it gives QA something testable that respects the probabilistic nature of the system: the matrix is evaluated against samples, not single cases.
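Evaluating a row against samples rather than single cases can be sketched as follows. The sketch assumes each sampled output has already been graded into one of three labels that mirror the matrix columns ("target", "degraded", "forbidden"); how that grading happens (human review, an automated judge) is outside the sketch. The function name `evaluate_row` and the 0.85 target rate in the usage lines are hypothetical, and the default of zero tolerated forbidden outputs is an assumption, not a rule from this article.

```python
from collections import Counter
from typing import Dict, Sequence, Union


def evaluate_row(
    labels: Sequence[str],
    min_target_rate: float,
    max_forbidden: int = 0,
) -> Dict[str, Union[float, int, bool]]:
    """Score one matrix row from a sample of graded outputs.

    labels: one of "target", "degraded", "forbidden" per sampled output.
    min_target_rate: hypothetical minimum share of samples that must hit the target.
    max_forbidden: how many forbidden outputs the sample may contain (assumed 0).
    """
    if not labels:
        raise ValueError("need at least one graded sample")
    counts = Counter(labels)
    target_rate = counts["target"] / len(labels)
    forbidden = counts["forbidden"]
    return {
        "target_rate": target_rate,
        "forbidden": forbidden,
        "passed": target_rate >= min_target_rate and forbidden <= max_forbidden,
    }


if __name__ == "__main__":
    # 20 graded samples for one input class: 18 on target, 2 acceptably degraded.
    sample = ["target"] * 18 + ["degraded"] * 2
    print(evaluate_row(sample, min_target_rate=0.85))
```

The design point is that a row passes or fails on a rate over the sample, and a forbidden output is counted separately from an acceptable degradation instead of being averaged into one score.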
The Eval-Set Callout Names the Score Threshold for Ship
The deterministic PRD has a "success metrics" section that names business outcomes — conversion lift, retention, NPS. Those still belong. But for an AI feature the document also needs a quality bar that gates ship, and that bar lives in eval-space, not metric-space.
The callout has three parts. First, the dataset: a named, versioned eval set that the team will run the feature against. It can be a hand-curated set of representative inputs, a sampled slice from production, or a synthetic set generated for a class the team cannot collect organically. The dataset is referenced by name and version in the PRD so that "we ran the eval" has a single unambiguous meaning. Second, the metric: the function that turns model outputs on that dataset into a score. It might be exact match, BLEU, a graded LLM-as-judge rubric, a human-rated quality score, or a combination. Third, the threshold: the score the feature must hit before it ships, plus the score below which it must roll back.
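The three parts can be written down as a small record that a release script could read. This is a sketch under stated assumptions: the `EvalCallout` class and its field names are hypothetical, the dataset name, version, metric name, and both threshold values in the usage lines are invented placeholders, and the "hold" outcome for scores between the two thresholds is an assumption, since the callout itself names only a ship bar and a rollback bar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCallout:
    """Dataset, metric, and thresholds as they would appear in the PRD (hypothetical)."""
    dataset_name: str
    dataset_version: str
    metric: str
    ship_threshold: float      # score the feature must hit before it ships
    rollback_threshold: float  # score below which it must roll back

    def __post_init__(self) -> None:
        if self.rollback_threshold > self.ship_threshold:
            raise ValueError("rollback threshold cannot exceed ship threshold")

    def decide(self, score: float) -> str:
        """Map an eval score to a decision; 'hold' is an assumed middle state."""
        if score >= self.ship_threshold:
            return "ship"
        if score < self.rollback_threshold:
            return "rollback"
        return "hold"


if __name__ == "__main__":
    # All values below are placeholders, not recommendations.
    callout = EvalCallout(
        dataset_name="support-summaries-eval",
        dataset_version="v3",
        metric="judge_rubric_mean",
        ship_threshold=0.85,
        rollback_threshold=0.75,
    )
    print(callout.decide(0.88))
    print(callout.decide(0.80))
    print(callout.decide(0.70))
```

Because the dataset is pinned by name and version in the same record as the thresholds, a score cannot be reported without also saying which eval set produced it.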
The trap most teams fall into is leaving the threshold for the engineer to set after the fact, in a Slack message that nobody archives. Without a number written into the PRD before work starts, "we ran the evals" becomes a ritual whose result is socialized in a meeting and forgotten. The right pattern is the same one teams use for performance budgets: a number in the doc, owned by the PM, debated up front, and changeable only through an explicit amendment with a paper trail.
