The AI Feature RACI: Why Four Green Dashboards Add Up to a Broken Product
An AI feature regresses on a Tuesday. The eval CI is green. The guardrail dashboards are clean. The retrieval P95 is in line. The model provider had no incident. And yet the support queue is filling up with users who say the assistant "feels worse this week." The PM is the only person in the room who can name the regression, and even she cannot tell you which dashboard would have caught it. Welcome to the seam bug — the kind of failure where every individual artifact owner can prove their piece is fine, and the integrated experience is still broken.
This is the predictable result of how AI features get staffed. The owner-of-record list looks reasonable on paper: a prompt author owns the system prompt, an eval owner owns the offline test set and CI gates, a tool/retrieval owner owns the function calls and search index, a guardrail owner owns moderation and policy filters. Plus a model-selection decision that often lives outside all four — sometimes with a platform team, sometimes with whichever engineer most recently filed the procurement ticket. Five owners. Zero of them are on the hook for "does this feature work for the user."
The pattern is consistent enough across teams that it deserves a name. Call it the integration owner gap. It is the org-design equivalent of the seam bugs it produces: a clear absence right where the responsibility ought to be. Below is what that gap costs, why the usual RACI exercises miss it, and the minimum amount of structure that closes it without inventing yet another middle-manager role.
Four Artifact Owners and the Failure They Share
Pull up the on-call wiki for any production AI feature and you can usually find the four owners listed by artifact:
- Prompt author — owns the system prompt, the few-shot examples, and the prompt versions in the registry. KPI is usually some flavor of "prompt regression rate" or "win-rate vs. previous version on a frozen set."
- Eval owner — owns the offline eval set, the CI quality gate, the LLM-as-judge configs. KPI is "eval coverage of failure modes" and "false-pass rate on regression tests."
- Tool/retrieval owner — owns the function-calling schema, the retrieval index, the chunking and embedding pipeline. KPI is "tool success rate" and "retrieval precision/recall."
- Guardrail owner — owns input filters, output redaction, refusal policies, sometimes a moderation API. KPI is "policy violation rate" and "false-block rate."
Each of these is a sensible artifact to own. Each is measurable. Each has a dashboard. None of them measures whether a real user got a useful answer. That gap is where seam bugs live, and it is structural — not the result of any owner being lazy. The eval owner cannot expand their test set to cover bugs they have not yet seen; the guardrail owner cannot tighten policies on failure modes nobody has reported; the prompt author cannot ablate against problems that surface only when the retrieval index drifts. Each owner is locally optimal and globally insufficient.
The classic example: a model upgrade lands on Monday. The prompt author re-runs their golden set and the wins outnumber the losses, so they sign off. The tool owner notices function-call schemas still validate, so they sign off. The guardrail owner sees the policy filter pass-rate is unchanged, so they sign off. The eval owner sees the CI quality gate is within tolerance, so they sign off. By Friday, support tickets show the assistant has started giving subtly worse advice on a class of questions that nobody put in the eval set because nobody knew, before the upgrade, that this class existed. Every artifact owner can prove they did their job. The user-visible quality dropped anyway.
The Hidden Fifth Owner: Model Selection
The model upgrade itself usually has no clean owner. In practice the decision lives in one of three places, none of them ideal: a platform team that handles the procurement and rollout but does not own any user-visible feature; a single senior engineer who happens to be paying attention to release notes; or a "model strategy" doc that nobody has updated since the last quarterly. The result is that a change with the largest blast radius — touching prompt behavior, tool reliability, retrieval relevance, and guardrail false-positive rates simultaneously — gets the lightest review.
If you do nothing else after reading this post, make model selection an explicit Accountable role, not a decision that happens by default. The Accountable role is the most often misunderstood part of any RACI; exactly one person should hold it per decision, and that person is empowered both to approve and to block. Model upgrades for a given feature surface should land on a named human, one whom every artifact owner also Consults before shipping their own changes.
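Making that role explicit does not require much machinery. Here is a minimal sketch of a CI-side gate, assuming a hypothetical named owner, hypothetical config keys, and whatever mechanism your review tooling uses to expose approvers; it simply refuses to pass a model change without that person's sign-off:

```python
# Minimal sketch: model changes require the named Accountable owner's approval.
# The owner email, config keys, and calling convention are all hypothetical.

MODEL_ACCOUNTABLE = "dana@example.com"   # the named human for this feature surface
MODEL_KEYS = {"model.provider", "model.name", "model.version"}

def model_change_approved(changed_keys: set[str], approvers: set[str]) -> bool:
    """True if the change does not touch the model, or if the Accountable
    owner for model selection is among the approvers."""
    touches_model = bool(changed_keys & MODEL_KEYS)
    return (not touches_model) or (MODEL_ACCOUNTABLE in approvers)

# A model bump approved only by the prompt author should not pass the gate.
assert not model_change_approved({"model.version"}, {"prompt-author@example.com"})
assert model_change_approved({"model.version"},
                             {"prompt-author@example.com", MODEL_ACCOUNTABLE})
```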
Seam Bugs: What Lives Between the Pieces
Seam bugs are the failure class that none of the four artifact dashboards detect. A short field guide:
- Prompt-tool seam: the prompt politely asks the model to "use the search tool when in doubt," but the model is now confident enough on more questions, so it stops searching and starts hallucinating. Tool success rate is unchanged (it just fires less). Eval covers the cases where the model used to search and still does. Nobody owns "the model is searching less than it should."
- Tool-guardrail seam: the retrieval system starts surfacing slightly different chunks after an embedding model upgrade, and a guardrail that was tuned against the old chunk distribution now blocks responses it used to allow. Retrieval looks healthy; guardrail false-block rate is up by 0.3% — well under any alert threshold — but concentrated in one user segment.
- Eval-prompt seam: the prompt author adds a new instruction to the system prompt to address a recent failure. The eval set was frozen six months ago and does not exercise the new instruction's interaction with older instructions. The new instruction wins on the cases it targets and silently regresses everything it conflicts with.
- Guardrail-eval seam: the guardrail owner tightens a policy. Eval CI passes because the eval prompts do not trigger the new policy. In production, the policy fires on a long tail of legitimate requests that the eval set does not represent.
Notice the shape: every one of these regressions can be described as "owner A made a change that was correct under owner B's assumptions, except those assumptions had drifted." The regression is real, the artifact is fine, and the seam is broken. Seam bugs are not a kind of bug; they are an absence of an owner.
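Some of these seams can at least be watched cheaply once someone owns the watching. The prompt-tool seam above, for instance, shows up as a cross-signal: per-call tool success stays flat while the fraction of requests that invoke the tool drops. A minimal sketch of that check, with made-up thresholds and counters, of the kind an integration owner (introduced below) might run:

```python
# Sketch of a cross-signal check for the prompt-tool seam: the tool still
# "succeeds" whenever it is called, but the model has quietly stopped calling
# it. Threshold and field names are illustrative assumptions.

def prompt_tool_seam_alert(
    calls_this_week: int, requests_this_week: int,
    calls_last_week: int, requests_last_week: int,
    max_relative_drop: float = 0.25,
) -> bool:
    """Alert when the fraction of requests that invoke the tool drops sharply,
    even though per-call success metrics look unchanged."""
    rate_now = calls_this_week / max(requests_this_week, 1)
    rate_before = calls_last_week / max(requests_last_week, 1)
    if rate_before == 0:
        return False
    relative_drop = (rate_before - rate_now) / rate_before
    return relative_drop > max_relative_drop

# Example: a 40% -> 22% tool-use rate after a model upgrade trips the alert,
# even though every individual tool call still validates and succeeds.
assert prompt_tool_seam_alert(2_200, 10_000, 4_000, 10_000)
```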
The RACI Move That Actually Works
The naive fix is to convene a working group every time something changes. This does not scale and it produces a culture where everyone is "consulted" on everything, which is functionally identical to no one being accountable. The fix is to add one named role, and to be ruthless about what that role owns:
| Role | Responsibility |
|---|---|
| Integration owner (the new role) | Accountable for the integrated user experience of the feature surface. Owns the production-quality dashboard that combines user-visible metrics. Has authority to block any of the four artifact owners from shipping when seam evidence says the integrated experience will regress. |
| Prompt author | Responsible for prompt artifacts. Consulted when the integration owner wants a prompt change to address a seam bug. |
| Eval owner | Responsible for eval artifacts. Consulted when the integration owner wants an eval added that crosses artifact boundaries. |
| Tool/retrieval owner | Responsible for tool and retrieval artifacts. Consulted when changes affect the prompt's tool-use distribution. |
| Guardrail owner | Responsible for guardrail artifacts. Consulted when policy changes affect the eval set's representativeness. |
| Model selection | Accountable as a named individual (often the integration owner doubling up). All four artifact owners are Consulted. The PM is Informed before, not after. |
The integration owner role is not a manager. It is an IC role with a different success metric than any artifact owner. Their KPI is some version of "user-visible quality of the integrated experience" — task completion, complaint rate, qualitative review, whatever the product cares about — and it explicitly ranges over the seams. Their authority is that they can block a prompt, eval, tool, or guardrail change from shipping if they have evidence of a seam regression, and the artifact owner cannot route around them.
That last point is the one teams get wrong. If the integration owner can be overridden by a senior engineer on any of the four artifact teams, the role becomes advisory and the seam bugs come back. The role only works if the authority is real.
On-Call That Covers the Surface, Not the Artifact
The other place this org structure shows up is the pager. Most AI feature on-call rotations are still implicitly artifact-shaped: the prompt team handles prompt incidents, the eval team handles CI breakage, the tool team handles function-call failures, the guardrail team handles policy incidents. None of these rotations know what to do when the page reads "users on segment X say the assistant got worse since Tuesday."
The minimum fix is an integration on-call rotation — one engineer per shift whose responsibility is the integrated surface, who has read access to all four artifact dashboards plus a user-quality dashboard, and whose first job on a seam-bug page is not to triage to an artifact owner. Triaging too early is exactly how seam bugs get lost. The integration on-call's first job is to characterize the regression in user-visible terms, then decide which artifacts are involved. Often the answer is "more than one."
A useful pattern: every seam-bug postmortem produces at least one new eval case that crosses artifact boundaries — for example, an eval case that succeeds only if the retrieval surfaces the right chunk, the prompt uses it correctly, and the guardrail does not block the response. Pure single-artifact eval cases are necessary, but they cannot detect seam regressions. Cross-artifact evals belong to the integration owner, not to the eval owner, because the eval owner has no incentive to maintain them when none of their KPIs depend on them.
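To make "crosses artifact boundaries" concrete, here is a sketch of what such an eval case might look like. The `run_feature` entry point and the domain details (the refund policy, the chunk id) are hypothetical placeholders; the point is that no single artifact dashboard can make all three assertions pass on its own:

```python
# Sketch of a cross-artifact eval case: it only passes if retrieval, prompt
# behavior, and the guardrail all cooperate on the same request.
# `run_feature` is a stand-in for whatever entry point the real pipeline exposes.

from dataclasses import dataclass, field

@dataclass
class FeatureResult:
    answer: str
    retrieved_ids: list[str] = field(default_factory=list)
    blocked_by_guardrail: bool = False

def run_feature(query: str) -> FeatureResult:
    """Placeholder for the production pipeline (retrieval + prompt + guardrail)."""
    raise NotImplementedError("wire this to the real pipeline")

def test_refund_policy_cross_artifact():
    result = run_feature("Can I get a refund on an annual plan after 14 days?")
    # Retrieval seam: the right policy chunk must actually be surfaced.
    assert "policy_refunds_v3" in result.retrieved_ids
    # Prompt seam: the answer must actually use what was retrieved.
    assert "14 days" in result.answer and "annual" in result.answer
    # Guardrail seam: a legitimate policy question must not be blocked.
    assert not result.blocked_by_guardrail
```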
The Planning Artifact: Mapping Changes to Seams
The other piece of structure that pays for itself is a one-page change-impact matrix that must be filled out for every change to any of the four artifacts. The questions are short:
- Which of the four artifacts does this change touch directly?
- Which seams could this change perturb? (The matrix lists the six pairwise seams plus the "model upgrade" cross-cutting case.)
- Which artifact owners are Consulted for this change?
- Does the integration owner need to sign off? (Default yes for any change crossing two or more artifacts.)
This is bureaucratic if you implement it as a Jira template that nobody reads. It is load-bearing if you implement it as a five-line PR description block that is enforced in code review. The point is not the form; it is forcing each artifact owner to think about the seams before shipping rather than discovering them in production. Teams that have adopted some version of this report that the most common change to fail the matrix check is a model upgrade, which is exactly the case where the implicit-default decision was hurting them most.
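Enforcing the block in review can be as small as a script CI runs over the PR description. A sketch, assuming the body arrives on stdin and that the field names mirror the questions above:

```python
# Sketch of enforcing the change-impact block in CI. Field names mirror the
# questions above; how the PR body reaches the script depends on your CI.

import re
import sys

REQUIRED_FIELDS = [
    "Artifacts touched",      # which of the four artifacts
    "Seams perturbed",        # which of the six pairwise seams / model upgrade
    "Owners consulted",       # which artifact owners were Consulted
    "Integration sign-off",   # default yes for changes crossing two or more artifacts
]

def check_change_impact_block(pr_body: str) -> list[str]:
    """Return the required fields missing from the PR description."""
    return [f for f in REQUIRED_FIELDS
            if not re.search(rf"^{re.escape(f)}\s*:", pr_body, re.MULTILINE)]

if __name__ == "__main__":
    missing = check_change_impact_block(sys.stdin.read())
    if missing:
        print("Change-impact block incomplete, missing:", ", ".join(missing))
        sys.exit(1)
```

The exact wiring matters less than the fact that the check fails loudly before merge rather than after the incident.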
The Leadership Decision
None of this requires a reorg. It requires the engineering leader for the AI feature surface to make four explicit choices, on purpose, in writing:
- Who is the integration owner for this surface? Named individual, not a team.
- What are their KPIs? They must measure user-visible quality of the integrated experience and must not be reducible to any artifact-level metric.
- What authority do they have? Specifically: can they block a ship from any of the four artifact owners? If not, the role does not work.
- Who is the Accountable owner for model selection? Often the integration owner doubles up; whichever way it goes, it is named.
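Written down, those four choices fit in a record small enough to live next to the feature's config. A minimal sketch, with hypothetical names and metrics:

```python
# A minimal sketch of "in writing": an ownership record checked in next to the
# feature config. Names, metrics, and keys are hypothetical placeholders.

SURFACE_OWNERSHIP = {
    "surface": "support-assistant",
    "integration_owner": "priya@example.com",           # named individual, not a team
    "integration_kpis": ["task_completion_rate", "complaints_per_1k_sessions"],
    "can_block_artifact_ships": True,                   # if False, the role is advisory
    "model_selection_accountable": "priya@example.com", # often doubles up
}
```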
The reason this has to be explicit is that the default — no integration owner, model upgrades land by default, on-call is artifact-shaped — is the failure mode. AI features are unusual in software engineering precisely because their quality is the cross product of four moving pieces, and the cross product is nobody's natural domain. Teams that ship great AI features have figured this out the hard way, sometimes after the second or third "the eval is green and the user hates it" incident. Teams that have not figured it out yet are about to.
The next time an AI feature regresses on your team and every artifact dashboard is green, do not start by looking for the bug. Start by looking for the org chart. The seam where the regression lives is the same seam where the role is missing.
