The Two-PM Problem: When Prompt Ownership and Product Ownership Drift Apart
A support ticket lands on Tuesday morning: a customer was given a confidently wrong answer about their refund window. Engineering pulls the trace and finds the model picked the wrong intent. The product PM looks at the dashboard and sees the new "express refund" affordance — shipped last sprint — surfaced an intent the prompt was never tuned to handle. The platform PM points at the eval suite, which is green. Both are technically right. The customer still got the wrong answer.
This is the two-PM problem, and most AI teams have it without naming it. The product PM owns the user-facing surface — intents, success metrics, the support escalation path. The platform or ML PM owns the prompt, the model choice, the eval suite, and the cost ceiling. The roadmaps are coordinated at the quarterly-planning level and drift at the weekly-shipping level, because the two PMs are optimizing for different metrics on different dashboards with different change-control processes.
The interesting failure mode isn't that the two PMs disagree. It's that they ship correctly relative to their own scope and still produce a regression nobody owns.
Why this split exists in the first place
Splitting prompt ownership from product ownership isn't an accident or a sign of immaturity. It usually reflects a real org constraint: the product PM came from the existing SaaS surface and knows the customer; the platform PM was hired (or grew) to own the model bill, the eval suite, and the integration with the inference provider. Neither role is wrong. The split solves the staffing problem of "we don't have one person who can do both."
It also matches how the rest of the industry talks about these roles. The market has converged on a separation between an AI product manager — who owns vision, customer fit, go-to-market — and an AI product owner or ML PM — who owns execution against models, prompts, and evals. Job ladders, hiring panels, and compensation bands have all calcified around this split.
The problem is that the artifact at the boundary — the prompt, plus its eval suite — isn't a backend implementation detail. It is the product surface. And the change-control process that the org has for backend implementation details is the wrong one for a user-experience artifact.
The four artifacts that get edited under both names
The boundary collapses around four specific artifacts. Each one has a product-shaped meaning and a platform-shaped meaning, and most teams have no convention for which side wins when the two read it differently.
The system prompt is platform code from the platform PM's view (a string in a config file, gated by an eval suite) and is brand voice from the product PM's view (the AI's tone, what it refuses, how it handles ambiguity). When the platform PM tunes "be more concise" to cut output tokens, the product PM finds out from a customer who says the AI used to be friendlier.
The eval set is a regression detector from the platform side and a definition of "what good looks like" from the product side. The product PM rarely sees the eval set as something they can edit; the platform PM rarely sees it as something they have to socialize. So the eval set encodes one team's assumptions about quality and gates the other team's releases against them.
The tool description — the docstring or schema the model reads to decide whether to call a function — is API documentation from the platform side and a behavioral contract from the product side. Editing "creates a draft" to "creates a draft (only when the user has explicitly confirmed intent)" looks like a clarification; it is actually a behavior change in how often the tool fires, as the sketch below shows.
The help-center copy and in-product onboarding are product copy that the product PM owns directly, and they set the user's mental model of what the AI can do. When that copy promises capabilities the prompt was never tuned for, the support team eats the gap.
These four artifacts get edited under both names, on different cadences, without a shared review process. That's the coordination tax.
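To make the tool-description edit concrete, here is a minimal sketch in the shape of an OpenAI-style function-calling schema. The tool name, parameters, and wording are hypothetical rather than taken from any real product; the point is that the description string is the only thing the model reads when deciding whether to call the tool, so editing it is a behavior change dressed as documentation.

```python
# Hypothetical tool schema in the OpenAI-style function-calling format.
# Names, parameters, and wording are illustrative only.

draft_tool_v1 = {
    "type": "function",
    "function": {
        "name": "create_draft_reply",
        # The model reads this sentence to decide when to call the tool.
        "description": "Creates a draft reply to the customer's message.",
        "parameters": {
            "type": "object",
            "properties": {
                "body": {"type": "string", "description": "Draft reply text."},
            },
            "required": ["body"],
        },
    },
}

# The "small clarification": one edited sentence, no code change, and a
# measurable shift in how often the tool fires.
draft_tool_v2 = {
    **draft_tool_v1,
    "function": {
        **draft_tool_v1["function"],
        "description": (
            "Creates a draft reply to the customer's message, only when the "
            "user has explicitly confirmed they want a draft."
        ),
    },
}
```

Under the RACI later in this piece, this is precisely the kind of edit that should require both signatures.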
How the regressions actually show up
The pattern that recurs in retrospectives looks roughly like this:
The product PM ships a UX change that surfaces a new intent — a button, a suggested prompt, an empty-state nudge. The eval suite has no coverage for that intent because it didn't exist last quarter. Production traffic shifts within hours. The model handles the new intent badly. The dashboard the platform PM watches is green because the eval suite is unchanged. The dashboard the product PM watches shows a CSAT dip the team attributes to a UX issue, then to a copy issue, then — three weeks later — to the prompt.
The mirror failure also happens. The platform PM swaps the model to the new cheaper variant, runs the eval suite, ships green. The model is technically more concise but loses a stylistic register the product PM had quietly trained users to expect. The product PM hears about it from a customer escalation that comes through marketing because the customer used the word "tone" and not the word "regression."
Both regressions are invisible to the side of the org that doesn't own the artifact. Both are obvious in hindsight. The question is what discipline catches them in foresight.
A shared release calendar, not a shared roadmap
The standard answer here — "align the roadmaps" — does not work, because the roadmaps are at the wrong granularity. A roadmap aligns what we're building. A regression happens between releases. The right artifact is a shared release calendar where two rules hold:
A prompt or model change cannot ship without product-owner sign-off. Sign-off doesn't mean the product PM reads the diff. It means the platform PM has stated the user-visible behavior change in product language, and the product PM has decided whether the customer-facing dashboards, support team, and help-center copy need to move with it.
A UX change that surfaces a new intent cannot ship without eval coverage. Coverage doesn't mean a perfect eval; it means the platform PM has been told about the new intent before it goes to traffic, and has added at least one slice — even a coarse one — to the eval suite. Otherwise the new intent is silently outside the gate.
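What "at least one slice, even a coarse one" can look like in practice: a handful of examples for the newly surfaced intent, graded by nothing fancier than string checks. A minimal sketch, assuming the express-refund intent from earlier; the examples, phrases, and helper names are hypothetical.

```python
# A deliberately coarse eval slice for a newly surfaced intent.
# The intent, examples, and pass criteria are hypothetical.

EXPRESS_REFUND_SLICE = [
    {
        "input": "I used the express refund button, when do I get my money?",
        "must_mention": ["refund"],
        "must_not_mention": ["store credit"],  # known failure mode to guard against
    },
    {
        "input": "Can I still cancel after requesting an express refund?",
        "must_mention": ["cancel"],
        "must_not_mention": [],
    },
]

def grade(response: str, case: dict) -> bool:
    """Pass if the response mentions required phrases and avoids banned ones."""
    text = response.lower()
    has_required = all(phrase in text for phrase in case["must_mention"])
    avoids_banned = not any(phrase in text for phrase in case["must_not_mention"])
    return has_required and avoids_banned

def run_slice(model_fn, cases: list[dict]) -> float:
    """model_fn is whatever callable wraps the production prompt; returns the pass rate."""
    results = [grade(model_fn(case["input"]), case) for case in cases]
    return sum(results) / len(results)
```

Two cases and a string check will not catch subtle failures, but they move the new intent from invisible-to-the-gate to tracked-by-the-gate, which is the rule's whole point.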
The release calendar lives in one place, and both PMs are required to look at it. This sounds like a process tax. It is, but it's a smaller tax than the post-hoc one.
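Both rules are mechanical enough to encode as a pre-ship check. A hedged sketch, assuming the team fills in a small manifest per release describing what changed and which intents the UX now surfaces; every field and name here is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ReleaseManifest:
    """Hypothetical per-release manifest both PMs fill in before shipping."""
    prompt_or_model_changed: bool = False
    product_pm_signoff: bool = False
    new_intents_surfaced: list[str] = field(default_factory=list)
    intents_with_eval_coverage: set[str] = field(default_factory=set)

def release_gate(manifest: ReleaseManifest) -> list[str]:
    """Return the reasons this release is blocked; an empty list means ship."""
    blockers = []
    # Rule 1: prompt or model changes need product-owner sign-off.
    if manifest.prompt_or_model_changed and not manifest.product_pm_signoff:
        blockers.append("Prompt/model change without product PM sign-off.")
    # Rule 2: every newly surfaced intent needs at least one eval slice.
    uncovered = [intent for intent in manifest.new_intents_surfaced
                 if intent not in manifest.intents_with_eval_coverage]
    if uncovered:
        blockers.append(f"New intents without eval coverage: {uncovered}")
    return blockers

# Example: a UX release that surfaces "express_refund" without coverage is blocked.
print(release_gate(ReleaseManifest(new_intents_surfaced=["express_refund"])))
```

A gate like this doesn't replace judgment; it just turns forgetting to ask the other PM into a build failure instead of a retrospective finding.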
The single dashboard that ends arguments
The two-PM problem produces two dashboards: a behavioral dashboard (CSAT, completion rate, escalation rate, time-to-resolution) the product PM watches, and a quality dashboard (eval scores, latency, refusal rate, cost per call) the platform PM watches. Each side rationalizes the other side's metrics as "a different concern."
The cure is one dashboard with both halves on the same page, sliced by the same cohorts, with annotations marking when prompt or model changes shipped and when UX changes shipped. Not a unified score — that flattens the signal. A literal vertical stack: behavior on top, quality on the bottom, change annotations as vertical lines through both. When CSAT drops on the same day the prompt changed, neither PM gets to argue from a different chart. When CSAT drops a week after a UX change, the cause-and-effect is visible.
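A minimal sketch of that vertical stack, using matplotlib and synthetic data; the metric names, values, and change dates are placeholders. The only structural choices that matter are the shared x-axis and the change annotations drawn through both panels.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series standing in for the real behavioral and quality feeds.
days = pd.date_range("2025-06-01", periods=30, freq="D")
rng = np.random.default_rng(0)
csat = 4.3 + rng.normal(0, 0.05, 30).cumsum() * 0.1
eval_pass_rate = 0.87 + rng.normal(0, 0.01, 30)

# Change annotations: when prompt/model changes and UX changes shipped.
changes = {"2025-06-10": "prompt v14", "2025-06-18": "express-refund UX"}

fig, (ax_behavior, ax_quality) = plt.subplots(2, 1, sharex=True, figsize=(10, 6))
ax_behavior.plot(days, csat)
ax_behavior.set_ylabel("CSAT")
ax_quality.plot(days, eval_pass_rate)
ax_quality.set_ylabel("eval pass rate")

# One dashed vertical line per change, drawn through both panels.
for date, label in changes.items():
    for ax in (ax_behavior, ax_quality):
        ax.axvline(pd.Timestamp(date), linestyle="--", linewidth=1)
    ax_behavior.annotate(label, (pd.Timestamp(date), ax_behavior.get_ylim()[1]),
                         rotation=90, va="top", fontsize=8)

plt.tight_layout()
plt.show()
```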
Industry guidance to keep the dashboard to five to eight metrics applies here. The point isn't more metrics. The point is putting them in the same frame so the two PMs can't look at non-overlapping signals.
A cross-functional incident channel from minute one
When the regression hits, the two-PM split creates a triage problem. The product PM opens an incident; the platform PM finds out forty minutes later when an engineer pings them. Forty minutes is enough for the product PM to commit to an explanation that doesn't include the prompt, which is enough for the platform PM to feel pre-emptively defensive, which is enough to add a day to the resolution.
A standing incident channel that pages both PMs from minute zero — not as a courtesy, but as required attendance — solves the smaller problem. The bigger problem it solves is institutional: it teaches the org that AI regressions are joint regressions, and that the question "is this a product issue or a model issue?" is usually answered "yes."
The four-artifact RACI
For the four boundary artifacts, write down a RACI explicitly. Not a RACI for the AI feature in general — that's been done and it didn't help. A RACI for each of: the prompt, the eval set, the tool descriptions, the help-center copy. For each, name who is Responsible (does the edit), who is Accountable (signs off), who is Consulted (must be asked), who is Informed (must be told).
The convention that works in practice: the platform PM is Responsible for the prompt and eval set, the product PM is Accountable. The product PM is Responsible for help-center copy, the platform PM is Consulted. Tool descriptions are jointly Responsible and require both signatures, because they're the artifact most often edited as a "small clarification" and most often a behavior change. Industry data on cross-functional AI governance teams suggests organizations with explicit ownership models deploy faster and incur fewer post-deployment compliance issues — the mechanism is just that fewer artifacts are orphaned at the boundary.
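One way to keep that RACI from staying a slide: write it down as data and check the invariant that no boundary artifact is orphaned. A sketch under the conventions described above; the role names and the Consulted/Informed assignments are placeholders for whatever the org actually decides.

```python
# The four boundary artifacts as data, so the RACI can be reviewed in a pull
# request like any other config. Role names and C/I assignments are placeholders.

RACI = {
    "system_prompt":     {"R": ["platform_pm"],               "A": "product_pm",
                          "C": ["support_lead"],              "I": ["eng_lead"]},
    "eval_set":          {"R": ["platform_pm"],               "A": "product_pm",
                          "C": ["support_lead"],              "I": ["eng_lead"]},
    # Jointly Responsible: tool descriptions ship only with both PMs' sign-off.
    "tool_descriptions": {"R": ["platform_pm", "product_pm"], "A": "product_pm",
                          "C": ["eng_lead"],                  "I": ["support_lead"]},
    "help_center_copy":  {"R": ["product_pm"],                "A": "product_pm",
                          "C": ["platform_pm"],               "I": ["support_lead"]},
}

def check_raci(raci: dict) -> list[str]:
    """Flag artifacts with no Responsible owner or without a single Accountable owner."""
    problems = []
    for artifact, roles in raci.items():
        if not roles.get("R"):
            problems.append(f"{artifact}: no one is Responsible")
        if not isinstance(roles.get("A"), str):
            problems.append(f"{artifact}: needs exactly one Accountable owner")
    return problems

assert check_raci(RACI) == []
```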
A RACI doesn't enforce itself. It does give the post-mortem something concrete to point at when it asks "who was supposed to catch this," which is the question that, unanswered, calcifies into the next regression.
The deeper realization
The reason this problem keeps recurring is that AI features are the first product surface where the implementation language is itself a first-class user-experience artifact. In a classical SaaS feature, the implementation language (TypeScript, Go, SQL) is invisible to the user — the UX is the surface, and the code is the substrate. A prompt isn't substrate. It's the surface, expressed in the same English the help-center is written in, edited under different change-control by a different PM with a different definition of "ship."
That's why the two-PM split is harder than the analogous classical-SaaS problem of "platform team versus product team." In classical SaaS, the platform-versus-product boundary is API contracts; the artifact at the boundary is well-typed, and the change-control rituals around API contracts are well-understood. In AI features, the artifact at the boundary is natural language, and natural-language change-control is a new ritual most orgs haven't built yet.
The teams that figure this out won't necessarily merge the two PM roles. They'll keep the split and build the rituals — shared calendar, single dashboard, joint incident channel, four-artifact RACI — that make the split survivable. The teams that don't will keep paying the coordination tax, and it will keep showing up as a churn metric the dashboard can't explain.
What to do this week
If this pattern fits your team and you want one concrete change, pick the dashboard. The shared release calendar is the highest-leverage artifact, but it requires both PMs to renegotiate their cadence, which is a quarter-long conversation. The dashboard is the cheapest artifact to build, and it changes the conversation immediately: the next time someone says "this isn't a model issue," the chart that disagrees is on the same screen as the chart they're pointing at. From there, the calendar, the incident channel, and the RACI tend to follow, because each one becomes obviously the next missing piece.
The two-PM problem is fixable. It just isn't fixable by either PM alone, which is precisely the property that makes it persistent.
References
- https://productschool.com/blog/artificial-intelligence/guide-ai-product-manager
- https://productschool.com/blog/artificial-intelligence/ai-product-owner
- https://medium.com/@sahilaggarawal/how-to-manage-prompt-engineering-in-enterprise-ai-initiatives-9799c13fcbe5
- https://www.justanotherpm.com/blog/fundamentals-of-ai-product-management-prompt-engineering-ai-agents-and-eval-frameworks
- https://www.ayoolafakoya.com/articles/prompt-engineering-scale-2025
- https://www.lakera.ai/blog/prompt-engineering-guide
- https://elevateconsult.com/insights/designing-the-ai-governance-operating-model-raci/
- https://www.yields.io/blog/raci-matrix-ai-governance/
- https://www.productboard.com/blog/ai-evals-for-product-managers/
- https://saptak.in/writing/2025/04/17/product-managers-guide-ai-evaluations
- https://medium.com/@anubhavgoyal0011/a-product-managers-guide-to-ai-evals-how-i-build-reliable-safe-and-high-quality-ai-features-cfd9ce2fecb6
- https://dev.to/kuldeep_paul/evals-and-observability-for-ai-product-managers-a-practical-end-to-end-playbook-4cch
- https://arize.com/ai-product-manager/
- https://productschool.com/blog/artificial-intelligence/evaluation-metrics
- https://www.cio.com/article/4160442/the-metric-missing-from-every-ai-dashboard
