Who Owns AI Quality? The Cross-Functional Vacuum That Breaks Production Systems
When Air Canada's support chatbot told a bereaved traveler he could claim a discounted bereavement fare retroactively, the policy it described didn't exist. A Canadian tribunal later ordered the airline to honor the hallucinated refund anyway. When a Chevrolet dealership chatbot negotiated away a 2024 Tahoe for $1, no mechanism stopped it. In both cases, the immediate question was about model quality. The real question — the one that matters operationally — was simpler: who was supposed to catch that?
The answer, in most organizations, is nobody specific. AI quality sits at the intersection of ML engineering, product management, data teams, and operations. Each function has a partial view. None claims full ownership. The result is a vacuum where things that should be caught aren't, and when something breaks, the postmortem produces a list of teams that each assumed someone else was responsible.
This isn't a technology problem. It's an organizational one. And unlike GPU cost or model latency, you can fix it without waiting for a vendor.
The Anatomy of Accountability Diffusion
In traditional software, ownership lines are usually clear. A service has an on-call rotation. A feature has a product manager. A database migration has an owning engineer. These lines are imperfect, but they exist.
AI systems don't inherit this clarity by default. Consider a typical production LLM pipeline: the ML team trained or fine-tuned the model, the platform team built the serving infrastructure, the product team wrote the prompts, the data team owns the retrieval pipeline, and the ops team monitors production metrics. Now ask: "who is responsible for ensuring this pipeline doesn't give users wrong answers?"
Each team has a reasonable claim to partial ownership and a reasonable argument for why another team bears primary responsibility. ML says they handed off an evaluated model. Platform says they delivered the infrastructure they were asked for. Product says they're not ML experts. Data says retrieval quality is a function of the corpus, which is a business problem. Ops says they monitor latency and error rates, not output semantics.
Nobody is wrong. Nobody is accountable. This is accountability diffusion — the organizational equivalent of the bystander effect, where the probability of any individual acting decreases as the number of potential actors increases.
The failure mode is predictable. Evals don't get written because product teams don't have the expertise and ML teams consider it product scope. Behavioral regressions ship because there's no owner for regression testing. Hallucination rates drift because monitoring dashboards track technical metrics (latency, error codes) but not semantic quality. When a user gets a wrong answer, each team has a retrospectively coherent reason why catching it wasn't their job.
What "Quality" Actually Means Across the Stack
Part of why ownership diffuses is that "AI quality" isn't a single thing. It breaks into at least three distinct layers, each with a natural owner — but only if your organization has named them.
Technical quality covers the metrics ML engineers typically think about: behavioral regression (does the new model version answer the same test cases the same way?), output consistency, latency, and cost per inference. This is the easiest layer to instrument and the one most likely to have some existing ownership. But it's insufficient on its own. A model can be technically identical to a prior version and still fail users if the prompts changed, the retrieval corpus drifted, or the task distribution shifted.
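This first layer is the most mechanical, which makes it the easiest to sketch. A minimal behavioral regression check can look like the following, assuming a `generate(prompt)` callable per model version; the golden cases, substring checks, and function names are all illustrative, not any particular framework's API.

```python
# Sketch of a behavioral regression check between two model versions.
# `generate_old` / `generate_new` stand in for whatever calls your models;
# the golden cases and substring checks below are illustrative placeholders.

GOLDEN_CASES = [
    {"prompt": "What is our refund window?", "must_contain": "30 days"},
    {"prompt": "Do you offer bereavement fares?", "must_contain": "not available"},
]

def run_regression(generate_old, generate_new):
    """Return prompts where the new version loses a behavior the old one had."""
    regressions = []
    for case in GOLDEN_CASES:
        old_ok = case["must_contain"] in generate_old(case["prompt"])
        new_ok = case["must_contain"] in generate_new(case["prompt"])
        if old_ok and not new_ok:
            regressions.append(case["prompt"])
    return regressions
```

Real suites use semantic scoring rather than substring matching, but the structure is the same: a fixed case set, a baseline version, and an explicit definition of "lost behavior."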
Product quality covers whether the system does what users need: task completion rates, user satisfaction scores, whether outputs are coherent in the context of the specific product experience. This layer requires product managers and domain experts to define what "good" means. It's the layer most likely to be claimed in a roadmap ("we'll improve response quality") and least likely to have measurable exit criteria or an owner responsible for measuring it.
Safety and reliability quality covers outputs that are harmful, biased, or legally dangerous — the domain of compliance, legal, and security teams. This is the layer that produces the expensive incidents when neglected. It requires someone to actively probe the system for failure modes rather than waiting for users to discover them.
Most AI quality failures in production result not from any single layer being poorly managed, but from the seams between layers. No single team has a clear view across all three, which means the gaps are where failures accumulate.
What Organizational Failure Looks Like in Practice
The proxy failure mode is the eval that never gets maintained. Evals require ongoing investment: writing new test cases as the product evolves, calibrating scoring functions, triaging new failure modes, and deciding what thresholds to hold releases against. This work is unglamorous and cross-functional by nature — you need ML expertise to design evaluation methodology, domain expertise to define correctness, product expertise to weight outcomes, and engineering work to integrate with CI/CD.
When no team owns this lifecycle, evals become stale. They were written to test the initial capabilities and never updated when the prompts changed, the use cases expanded, or the underlying models were swapped out. By the time a quality incident occurs, the eval suite may not even test the relevant behavior.
A second failure mode is the handoff gap. Many organizations treat AI quality like a relay race — ML builds and evaluates the model, product writes prompts, platform deploys, ops monitors. Each team runs their portion and hands off to the next. But AI quality degrades at handoffs. The ML team's evals were designed for the base model capability, not for the specific prompting and retrieval approach the product team used. The product team's prompts were written without visibility into model behavior under distribution shift. Ops has dashboards for infrastructure health, not semantic correctness.
The third failure mode is the pilot-to-production cliff. Accountability is reasonably well-managed during pilots because the team is small and communication is direct. When pilots scale — to more users, more use cases, more regions — the original tight team disperses across functions, and nobody explicitly reassigns ownership. This is the point where 61% of AI projects are reported to fail: not because the technology isn't ready, but because the organizational model for maintaining quality at scale was never defined.
How Teams at Scale Solve This
The organizations that have gotten this right converge on a hybrid model: a centralized platform team owns the eval infrastructure and framework, while product and domain teams own the eval cases and acceptance criteria.
The distinction matters. Infrastructure ownership means: building and maintaining the tooling for running evals, integrating eval runs into CI/CD pipelines so they block deployments, providing scoring primitives that teams can compose, and maintaining the historical baseline data that makes regression detection possible. This is genuinely a platform concern — it requires scale, consistency, and shared investment.
Eval case ownership means: defining what "correct" looks like for a specific use case, writing the test cases that reflect real user behavior, setting the thresholds for what constitutes a passing evaluation, and triaging evaluation failures when they occur. This is genuinely a product and domain concern — it requires knowledge of the business context that the platform team doesn't have.
When these responsibilities aren't split cleanly, one of two failure modes emerges. Either the platform team owns everything and becomes a bottleneck (every new product capability requires platform involvement to write new evals), or product teams own everything and end up with inconsistent tooling, stale test suites, and no shared standards.
The companies that have solved this — the ones running large-scale agentic and LLM systems in production — typically wrap their AI workflows in enough infrastructure that eval failures block releases the same way test failures block releases in traditional software engineering. Evaluations become a first-class part of the deployment pipeline, not an optional step that teams run when they remember to.
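As a sketch, a deployment gate of that kind can be as simple as comparing each suite's score against its owner-defined threshold; the `SuiteResult` shape, suite names, and numbers below are assumptions for illustration, not a real pipeline API.

```python
# Sketch of an eval gate that blocks deployment, analogous to unit tests
# blocking a merge. Suite names, scores, and thresholds are illustrative.

from dataclasses import dataclass

@dataclass
class SuiteResult:
    name: str          # e.g. "hallucination", "task_completion"
    score: float       # fraction of eval cases passing, 0.0-1.0
    threshold: float   # minimum acceptable score, set by the case owners

def gate(results):
    """Return (deployable, failing suite names)."""
    failing = [r.name for r in results if r.score < r.threshold]
    return (not failing, failing)
```

In CI, a failing gate would exit nonzero so the pipeline halts, which is exactly the property that makes eval ownership enforceable rather than aspirational.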
The Three Decisions That Reveal Whether Your Org Has Actually Solved This
Rather than asking "who owns AI quality?" in the abstract, three specific decisions reveal whether ownership is real or nominal:
Who sets acceptance criteria? Every deployed AI capability should have defined thresholds: what hallucination rate is acceptable, what task completion rate is required, what response consistency is expected. If the answer is "we haven't set explicit thresholds" or "the ML team decides," you have a gap. Acceptance criteria should be co-owned by the product manager (who knows what users need), the technical owner (who knows what's achievable), and a risk owner (who knows what failures are unacceptable).
Who approves changes that affect model behavior? Prompt changes, retrieval changes, model version bumps, and system message modifications all have the potential to change behavioral characteristics at scale. If these changes can be made without a structured review and eval run, you have a gap. The answer should be a named individual or team with the authority and obligation to run relevant evals and sign off.
Who investigates when output quality degrades? When users start reporting worse outputs, or when automated quality metrics drift, there should be a named on-call function that responds — not a cross-team discussion about whose problem it is. If the answer is "we'd figure it out," the answer is that nobody owns it.
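That third decision only works if "quality degrades" is something machinery can detect, rather than something anecdotes eventually surface. A minimal sketch of a rolling drift alert, assuming each response already receives a numeric quality score from some automated grader; the window size, baseline, and tolerance here are illustrative, not recommended values.

```python
# Sketch of a semantic-quality drift alert. Assumes a stream of per-response
# quality scores in [0, 1] from an automated grader; all parameters are
# illustrative placeholders.

from collections import deque

class QualityDriftMonitor:
    def __init__(self, window=100, baseline=0.92, tolerance=0.05):
        self.scores = deque(maxlen=window)
        self.baseline = baseline    # rate established at release time
        self.tolerance = tolerance  # drop that should page the on-call

    def record(self, score):
        """Record one score; return True once the rolling mean has drifted
        far enough below baseline to page the named quality on-call."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return (len(self.scores) == self.scores.maxlen
                and mean < self.baseline - self.tolerance)
```

The alert is only half the mechanism; the other half is that its pages route to a named rotation, not to a shared channel where they can diffuse.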
Building a Lightweight Ownership Structure
The most pragmatic approach for teams that don't yet have this solved is an eval charter: a short document, updated quarterly, that explicitly names owners for each dimension of AI quality.
The charter doesn't need to be comprehensive. It needs to answer five questions: What are we measuring? Who owns the infrastructure for measuring it? Who writes the test cases? Who approves deployment? Who is on-call when quality degrades?
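A charter that small can live as data next to the code it governs, with a check that fails loudly when a question goes unanswered. Everything below is a placeholder sketch, not a prescribed schema; substitute your own teams and metrics.

```python
# A minimal eval-charter template as data. All names are placeholders;
# the only requirement is that every question has a non-empty answer.

CHARTER = {
    "what_we_measure": "hallucination rate, task completion, regression vs. baseline",
    "infrastructure_owner": "platform-team",       # tooling, CI integration, baselines
    "test_case_owner": "product-team",             # correctness, thresholds, triage
    "deployment_approver": "eval-owner rotation",  # signs off on behavior changes
    "quality_oncall": "ai-quality rotation",       # responds when metrics drift
}

REQUIRED = ["what_we_measure", "infrastructure_owner",
            "test_case_owner", "deployment_approver", "quality_oncall"]

def charter_gaps(charter):
    """A charter is only real if every question has a named answer."""
    return [k for k in REQUIRED if not charter.get(k, "").strip()]
```

Running `charter_gaps` in CI, or simply at quarterly review, turns "did we assign this?" from a meeting topic into a checked invariant.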
A charter that fits on half a page, that everyone on the relevant teams has read, and that is referenced during incidents is worth more than an elaborate governance framework that lives in a wiki nobody checks.
The organizational structure underneath the charter matters less than the existence of explicit, named ownership. Teams have shipped reliable AI systems under every model — embedded AI engineers, centralized platform teams, hybrid center-of-excellence arrangements. What correlates with failure isn't the specific model; it's the absence of any model, and the diffused accountability that fills the vacuum.
The Uncomfortable Question to Ask This Week
Most AI quality failures in production are not primarily caused by bad models. They're caused by teams that built and deployed AI systems without a clear answer to "who owns this?" When something goes wrong — a policy gets hallucinated, a user gets harmful output, a behavioral regression ships undetected — the root cause almost always traces back to an ownership gap, not a model limitation.
The question isn't whether your organization is sophisticated enough to do this well. It's whether you've made explicit decisions about who is responsible for what, and whether those decisions are reflected in your deployment processes, your on-call rotations, and your release criteria.
If you haven't had that conversation yet, the model is not the risk. Your org chart is.
