Skip to main content

Your PRD Is an Untested Prompt — Until You Eval It

· 9 min read
Tian Pan
Software Engineer

Open the system prompt of any AI feature that shipped in the last six months and read it side by side with the PRD that authorized it. You will find two documents arguing with each other. The PRD says "the assistant should be helpful but professional, avoid making things up, and gracefully decline if it can't answer." The system prompt says "You are an AI assistant. Be concise. If you are unsure, say 'I don't know.' Never invent facts." The PRD takes a page. The prompt takes nine lines. The gap between them is where every behavioral bug you shipped this quarter lives.

The convenient fiction is that the prompt is an "implementation detail" of the PRD. The actual relationship is the opposite. The prompt is the contract the model executes; the PRD is a draft of that contract written in a language the model does not speak, by an author who never compiled it. Every PRD for an AI feature is an untested prompt. The team that admits this and runs the PRD through an eval before sign-off ships a feature with one fewer source of post-launch surprise.

This is not an argument that PMs should write prompts. It is an argument that the artifact you sign off on for an AI feature has to be evaluated against the same standard you would evaluate the runtime behavior against, because in practice the two will end up nearly identical. The PRD is doing prompt engineering. It just doesn't know it yet.

PRD Prose and Prompt Prose Are Different Languages

A traditional PRD is optimized for human alignment. It hedges. It contains aspirational language ("delight the user"), contradictory constraints ("respond comprehensively but keep it brief"), and tone instructions that assume a reader who can resolve ambiguity by asking around the office. These are not defects of PM writing — they are features of a document whose primary job is to get five stakeholders to agree on a direction before the engineer touches the keyboard.

A system prompt is optimized for a different reader. The model has no office to ask around in. "Respond comprehensively but keep it brief" gets resolved by whichever instruction the model attended to last, weighted by how the surrounding context biases it. Aspirations like "delight the user" collapse into whatever the base model's training distribution thinks delight means, which on any given Tuesday is somewhere between "open with an emoji" and "use three exclamation points in the closing paragraph." The PRD's hedges aren't safety; they're ambiguity the model will resolve for you, and you won't get to see how until production.

The mismatch shows up most clearly in three places. Tone instructions translate badly because tone in PM prose is a vibe and tone in a system prompt is a distribution of token choices. Edge-case behavior ("if the user asks something inappropriate, handle it gracefully") translates badly because the model needs to know which inappropriate, which graceful, and what fallback string to emit. And acceptance criteria translate badly because PMs write them as scenarios ("when a user asks X, the system should Y") while prompts need them as policies the model can apply to inputs it has never seen.

The Three Failure Modes of Treating the PRD as Authoritative

When the team treats the PRD as the source of truth and the prompt as the implementation, three predictable failure modes show up.

First, the silent reinterpretation gap. The engineer translating the PRD into a prompt makes a hundred micro-decisions the PRD didn't anticipate: how to phrase the refusal, what order to list constraints, whether to include few-shot examples, which guardrail to put at the top of the prompt versus the bottom (top wins, almost always, but the PRD never said which one was top-priority). The PM signs off on the PRD; the model executes the engineer's interpretation; nobody can point to where they diverged because there's no diff.

Second, the PRD-only behavioral test. QA writes test cases against the PRD's acceptance criteria. The cases pass. The model then encounters inputs that look nothing like the test cases — because PRD test cases come from the PM's imagination of users, and real users are weirder. The behavior in production drifts from what the PRD says, but the PRD never claimed to be tested against anything except itself, so nobody can tell whether the gap is a bug or an unspecified region.

Third, the post-launch prompt creep. Production traffic surfaces failure modes the PRD didn't predict. Engineers patch the prompt. Each patch is a tiny amendment to the contract that nobody updates back into the PRD because the PRD is now a stale artifact and the prompt is where the real behavior lives. Six months later, the PRD says one thing, the prompt says another, and the team has lost the ability to articulate what the product is supposed to do without reading the prompt diff.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates