The PRD for an AI Feature: Why Your Old Template Misses the Cliff
The deterministic-software PRD template has aged into a kind of muscle memory. Problem statement, user stories, acceptance criteria, edge cases, success metrics, scope cuts. Engineers know how to read it. PMs know how to fill it in. Designers know which sections to lift wireframes from. It is a well-worn artifact that has shipped a generation of CRUD apps, dashboards, and SaaS workflows.
It also has no field for "what the model gets wrong five percent of the time." No field for "what we accept as a passing eval score." No field for "what the user sees when the model refuses to answer." No field for "which prompt version this PRD locks down, and who is allowed to change it after ship." Every AI feature shipped against that template is shipping with a hidden contract that nobody wrote down. Postmortems keep finding it the hard way.
The gap is not a stylistic mismatch. The deterministic PRD assumes the system maps inputs to outputs through a function whose behavior is known up front. Engineers can enumerate edge cases because the surface is finite. QA can write test cases because the same input produces the same output. Acceptance criteria can be a checklist because pass/fail is unambiguous. None of those assumptions hold for a feature whose core logic is a stochastic model.
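The difference shows up in what a test can honestly assert. Here is a toy sketch, with an invented `slugify` helper standing in for deterministic logic and a random stub standing in for a model call; the names and the 0.5 bar are illustrative only:

```python
import random

# Deterministic: the same input always maps to the same output, so a
# single exact-match assertion is a complete test.
def slugify(title: str) -> str:
    return title.lower().replace(" ", "-")

assert slugify("Hello World") == "hello-world"  # passes on every run

# Probabilistic: the output is a draw from a distribution, so the only
# honest assertion is a pass rate over a sample of runs.
def fake_model(prompt: str) -> str:
    return random.choice(["right answer", "right answer", "wrong answer"])

sample = [fake_model("summarize this ticket") for _ in range(200)]
pass_rate = sum(answer == "right answer" for answer in sample) / len(sample)
assert pass_rate >= 0.5  # a threshold claim over a sample, not an exact match
```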
When a PM hands an old-template PRD to an AI engineering team, three things happen. The PM writes acceptance criteria they cannot test. The engineer ships against an unspecified bar and hopes nobody notices. The designer draws screens for the happy path and nothing else. Then the feature goes live, the model behaves badly on a class of inputs nobody named, and the team discovers — in the incident channel — that the implicit contract everybody had in their head was not the same contract.
The fix is not a longer PRD. The fix is four new sections that bake the probabilistic nature of the system into the document so the contract is explicit before anyone writes code.
The Behavior Matrix Replaces the Acceptance Checklist
A deterministic acceptance criterion looks like this: "Given a user with role admin, when they click delete, the record is removed and the audit log is updated." It works because the input space is enumerable and the output is binary.
For an AI feature, the input space is effectively infinite and the output is a distribution. You cannot write "given a user prompt, when they hit submit, the model returns the right answer" — there is no row-level definition of "right." What you can write is a behavior matrix: a table whose rows are the input classes the feature must handle, and whose columns name, for each class, the target behavior, the acceptable degradation, and the forbidden output.
A support-summarization feature might have rows like "ticket with attached screenshots," "ticket in a non-English language," "ticket about a topic outside the product domain," "ticket with mixed sentiment across messages." For each row the matrix names a target behavior, an acceptable degradation, and a forbidden output. The non-English row might say target: summary in the user's language; acceptable degradation: summary in English with a language-detected disclaimer; forbidden: silent translation that hides the original language from the agent.
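One way to keep the matrix honest is to store it as structured data next to the eval harness rather than as prose in a doc. A minimal sketch in Python: the first row paraphrases the non-English example above, while the second row's entries are hypothetical fills for the out-of-domain class, not something the feature team has actually decided.

```python
from dataclasses import dataclass

@dataclass
class MatrixRow:
    input_class: str            # the class of inputs this row governs
    target: str                 # the behavior the team is aiming for
    acceptable_degradation: str # a worse-but-tolerable outcome
    forbidden: str              # an output that fails the row outright

BEHAVIOR_MATRIX = [
    MatrixRow(
        input_class="ticket in a non-English language",
        target="summary in the user's language",
        acceptable_degradation="summary in English with a language-detected disclaimer",
        forbidden="silent translation that hides the original language from the agent",
    ),
    MatrixRow(
        input_class="ticket about a topic outside the product domain",
        target="summary plus an out-of-domain flag for routing",
        acceptable_degradation="refusal with an explicit out-of-domain message",
        forbidden="a confident summary that invents product-domain context",
    ),
]
```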
What the matrix forces is the conversation that PMs and engineers were avoiding by leaning on a checklist. It makes the PM commit to which input classes are in scope. It makes the engineer commit to which failure modes are tolerable. It gives the designer a map of refusal and degradation states to draw, not a single happy-path screen. And it gives QA something testable that respects the probabilistic nature of the system: the matrix is evaluated against samples, not single cases.
The Eval-Set Callout Names the Score Threshold for Ship
The deterministic PRD has a "success metrics" section that names business outcomes — conversion lift, retention, NPS. Those still belong. But for an AI feature the document also needs a quality bar that gates ship, and that bar lives in eval-space, not metric-space.
The callout has three parts. First, the dataset: a named, versioned eval set that the team will run the feature against. It can be a hand-curated set of representative inputs, a sampled slice from production, or a synthetic set generated for a class the team cannot collect organically. The dataset is referenced by name and version in the PRD so that "we ran the eval" has a single unambiguous meaning. Second, the metric: the function that turns model outputs on that dataset into a score. It might be exact match, BLEU, a graded LLM-as-judge rubric, a human-rated quality score, or a combination. Third, the threshold: the score the feature must hit before it ships, plus the score below which it must roll back.
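Written down as structure rather than prose, the callout is small. A sketch with hypothetical dataset names and numbers; the point is that all three parts live in one referenceable place, and the gate is a mechanical check against them:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCallout:
    dataset: str               # named eval set the PRD references
    dataset_version: str       # pinned version: "we ran the eval" means this
    metric: str                # how outputs on the dataset become a score
    ship_threshold: float      # minimum score to ship
    rollback_threshold: float  # score below which production rolls back

CALLOUT = EvalCallout(
    dataset="support-summaries-eval",
    dataset_version="v3",
    metric="LLM-as-judge rubric, 1-5, averaged",
    ship_threshold=4.2,
    rollback_threshold=3.8,
)

def gate(score: float, callout: EvalCallout) -> str:
    if score >= callout.ship_threshold:
        return "ship"
    if score < callout.rollback_threshold:
        return "roll back"
    return "hold: between rollback and ship, investigate"
```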
The trap most teams fall into is leaving the threshold for the engineer to set after the fact, in a Slack message that nobody archives. Without a number written into the PRD before work starts, "we ran the evals" becomes a ritual whose result is socialized in a meeting and forgotten. The right pattern is the same one teams use for performance budgets: a number in the doc, owned by the PM, debated up front, and changeable only through an explicit amendment with a paper trail.
The callout also sets the rule for when the eval set itself can change. Every team eventually faces a model or prompt change that scores well on the existing eval but feels worse to users. The fix is to expand the eval — but expansion has to be a deliberate event, with a freeze on threshold changes for the version of the eval the current ship is being measured against. Without that rule, the eval becomes a moving target that always agrees with whatever the team wants to ship.
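A lightweight way to enforce that rule is to key thresholds by eval-set version, so expanding the eval mints a new entry instead of editing the old one. A sketch, with hypothetical versions and numbers:

```python
# Thresholds are keyed by eval-set version. Expanding the eval set mints a
# new version with its own, explicitly debated entry; old entries are never
# edited, so the bar the current ship was measured against stays on record.
EVAL_THRESHOLDS = {
    ("support-summaries-eval", "v3"): 4.2,  # bar for the current ship
    ("support-summaries-eval", "v4"): 4.0,  # expanded set, re-debated bar
}

def threshold_for(dataset: str, version: str) -> float:
    try:
        return EVAL_THRESHOLDS[(dataset, version)]
    except KeyError:
        # No silent fallback: an unregistered eval version cannot gate a ship.
        raise ValueError(f"no threshold recorded for {dataset}@{version}")
```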
The Fallback Specification Treats Refusal, Timeout, and Tool Failure as Features
In a deterministic system, error handling is a paragraph in a tech spec. In an AI feature, refusal and degradation are part of the user-visible product, and they need to be designed.
The fallback specification has rows for the failure types the team has decided are not bugs but acceptable runtime states. Refusal: the model declines to answer because the request hit a safety boundary. Timeout: the model or a tool the agent depends on did not return in the budget. Tool failure: a structured tool call returned an error or malformed output. Confidence collapse: the model's own scoring signal is below a threshold the team set, and the team has decided it would rather not answer than answer poorly. Context overflow: the input exceeded the working window and the system had to truncate.
For each row the spec names two things: what the user sees, and what the system does next. The user-facing answer is a designer's job — copy, layout, whether there's a retry button, whether the failure is visible at all or hidden behind a quiet "I don't know." The system-side answer is an engineer's job — whether the input goes to a smaller cheaper model, whether it queues for human review, whether it surfaces a structured error to the calling code, whether it logs and moves on.
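The spec translates naturally into a dispatch table that both the design review and the runtime code can reference. A sketch covering three of the five rows above; every entry is a hypothetical choice, which is exactly the kind of choice the PRD exists to make deliberate:

```python
from dataclasses import dataclass
from enum import Enum, auto

class FailureType(Enum):
    REFUSAL = auto()              # request hit a safety boundary
    TIMEOUT = auto()              # model or tool missed the latency budget
    TOOL_FAILURE = auto()         # tool call errored or returned malformed output
    CONFIDENCE_COLLAPSE = auto()  # model's self-score fell below the floor
    CONTEXT_OVERFLOW = auto()     # input exceeded the window and was truncated

@dataclass
class FallbackRow:
    user_sees: str    # the designer's column
    system_does: str  # the engineer's column

FALLBACK_SPEC = {
    FailureType.REFUSAL: FallbackRow(
        user_sees="plain explanation that the request can't be answered; no retry",
        system_does="log the refusal reason; do not route to a fallback model",
    ),
    FailureType.TIMEOUT: FallbackRow(
        user_sees="spinner replaced by a retry button after ten seconds",
        system_does="retry once against a smaller, cheaper model",
    ),
    FailureType.CONFIDENCE_COLLAPSE: FallbackRow(
        user_sees="a quiet 'I don't know' with a path to a human agent",
        system_does="queue the input for human review",
    ),
}
```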
What this section pulls into the daylight is the fact that AI features have a much richer error space than the binary success/failure of deterministic systems, and that error space is part of the user experience. A feature that handles refusal gracefully feels like a different product from a feature that throws a generic error or, worse, hallucinates a confident answer when the model wasn't sure. The fallback specification is where you decide which of those products you are shipping.
The Prompt-Ownership Field Names Who Can Change Behavior After Ship
This is the section that has no analogue in a deterministic PRD, because deterministic features do not have a thousand-token configuration string that materially changes behavior and that any engineer with repository access can edit in five seconds.
A system prompt is a behavioral contract written in English. A small change — adding a sentence about tone, removing a refusal clause, swapping an example — can shift the feature into a different product. Treating that string the way you treat a constants file is how teams end up with prompt drift: the version in production no longer matches the version the eval set was scored against, and nobody can say when the divergence happened.
The PRD field has three slots. Owner: the role or named person who is allowed to merge changes to the system prompt. Approval path: the review the change has to go through before merge — at minimum, a re-run of the eval set with a passing score, often a second human reviewer who is not the author. Versioning: where the prompt lives such that every production request can be traced back to a specific version, and rollback is a single commit revert. None of this is novel infrastructure work — teams have been doing it for code for decades — but the PRD is where the discipline gets contracted, because the model-tuning side of the house often does not default to it.
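A minimal version of the traceability slot is a content hash over the prompt file, logged on every request. A sketch, assuming the prompt lives in the repo; the log shape and the eval-set reference are invented for illustration:

```python
import hashlib

# The system prompt lives in the repo like code, not in a dashboard.
SYSTEM_PROMPT = "You summarize support tickets. Preserve the user's language. ..."

def prompt_version(prompt_text: str) -> str:
    """Content hash tying a production request to an exact prompt text."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

def log_request(user_input: str) -> dict:
    # Recording the prompt version on every request makes drift between
    # production and the eval'd prompt visible, and rollback is a one-commit
    # revert of the file, which changes the hash back.
    return {
        "prompt_version": prompt_version(SYSTEM_PROMPT),
        "scored_against": "support-summaries-eval@v3",
        "input": user_input,
    }
```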
The same field covers tool definitions, eval sets themselves, and any other behavioral artifact whose change is functionally equivalent to a code change but is sometimes treated as content. Anything that changes what the feature does in production needs a named owner and an approval path written into the spec.
The Organizational Cost of Skipping This
The reason this matters is not document hygiene. It is that the deterministic PRD encodes a set of cross-functional contracts — what the PM commits to, what engineering signs up to build, what design draws, what QA tests against — and AI features without an updated template are shipping without those contracts in the right places.
The PM writes acceptance criteria that the engineer cannot satisfy because the criteria assume determinism. The engineer ships against an internal threshold the PM has not seen and cannot challenge. The designer draws a happy-path screen because the failure space was never enumerated. QA writes test cases against a snapshot, watches them pass for a week, and then a model update lands and every snapshot is wrong overnight. Leadership reads the PRD and concludes the feature is on track, because the document is silent on the parts that are actually load-bearing.
When the incident lands — and there is always an incident — the postmortem reconstructs the missing contract. Someone writes a doc describing the input classes the feature was supposed to handle. Someone writes a doc describing the eval threshold that should have gated the ship. Someone writes a doc describing the fallback paths that were skipped. Those docs are the AI-PRD sections that were never in the original spec. The work happens regardless; the choice is whether it happens before the feature ships, when it is cheap and theoretical, or after, when it is expensive and live.
The shortcut is to update the template once and stop accepting AI feature specs without those four sections. Not as a process burden, but as a forcing function for the conversation that has to happen anyway. The deterministic PRD got the industry through a generation of products. The next generation needs a document that knows what it is shipping.
