
The Federated AI Team: Why Centralizing AI Expertise Creates the Problems It Was Supposed to Solve

10 min read
Tian Pan
Software Engineer

The central AI team was supposed to be the answer. Hire the best ML engineers into a single group, standardize the tooling, establish governance, and let product teams consume AI capabilities without needing to understand them. It's a compelling architecture — clean on an org chart, defensible in a board presentation. In practice, it reliably produces a failure mode that looks exactly like the fragmentation it was created to eliminate.

The central AI team becomes a bottleneck. Product teams queue behind it. The AI it ships feels generic to every domain that needs something specific. The ML engineers who built the platform don't know the product metrics. The product engineers who need help can't debug AI behavior without filing a ticket. A 3-month pilot succeeds; a 9-month security review buries it.

In 2025, companies reported abandoning the majority of their AI initiatives at more than twice the rate they had in 2024. Many of those failures happened at the transition from proof of concept to production, precisely where an overstretched, disconnected central team shows its seams.

Why Companies Build Central AI Teams

The case for centralization sounds solid on the first pass. AI skills are scarce. If you distribute AI engineers across twenty product teams, you get twenty inconsistent approaches to prompt engineering, twenty different observability setups, twenty separate vendor relationships, and no one with authority to enforce safety standards. Pooling expertise creates leverage. A central team can develop shared infrastructure that every product team uses, negotiate better model contracts, and own the compliance story that legal and security need.

Regulated industries feel this pull especially strongly. Healthcare, financial services, and anything touching sensitive user data need consistent enforcement of safety and compliance standards. A governance vacuum in a federated model is a real risk, not a hypothetical one.

The rationale for a central AI team is not wrong. The problem is that it captures only half the cost equation. It accounts for the cost of fragmentation. It ignores the cost of distance.

The Failure Modes That Follow

Bottleneck by design. The central team's capacity is fixed. Demand from product teams is not. As soon as AI moves from experiment to strategic priority, every product manager in the company has a use case. The central team can run two or three concurrent projects well. At eight, everything slows. At fifteen, the queue is so long that product teams stop asking — they either build without support or wait for a roadmap slot that keeps sliding.

This is not a staffing problem. Hiring into a central team adds bodies, but it doesn't give those engineers the domain context they need to make good decisions for teams whose work they don't inhabit. More engineers in a centralized structure just means a slower queue and a larger knowledge gap.

The bidirectional knowledge silo. The failure mode most organizations don't see until it's entrenched: the central team knows the models, the product teams know the domain, and neither side knows enough of the other's half to be useful without extensive handoff overhead.

When a customer support AI starts giving inconsistent responses in edge cases, the product team knows what "edge case" means in their context — what kinds of tickets, what user segments, what emotional register. The central team knows how to debug the model — temperature settings, context window issues, retrieval failures. In a centralized structure, those two bodies of knowledge have to travel to each other through meetings and tickets. The debugging cycle is slow, the fixes are often misdiagnosed, and the iteration loop that actually makes AI features good never tightens.
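
To see why the seam matters, picture the loop with both halves of that knowledge in one place. The sketch below is hypothetical (the EvalCase fields and run_support_agent are invented stand-ins for whatever serving client a team actually uses), but it shows the shape: the product team's definition of an edge case sits in the same file as the model settings under test.

```python
# A hypothetical sketch: domain knowledge (what counts as an edge case,
# what a good answer must contain) encoded next to model knowledge
# (the generation settings under test), with no ticket queue in between.
from dataclasses import dataclass

@dataclass
class EvalCase:
    ticket: str                # the product team knows what an "edge case" is
    user_segment: str
    must_mention: list[str]    # product-defined success criteria
    temperature: float         # the model-side knob under test

SUPPORT_EDGE_CASES = [
    EvalCase(
        ticket="Refund request for a subscription canceled mid-cycle",
        user_segment="annual-plan",
        must_mention=["prorated", "refund window"],
        temperature=0.2,
    ),
]

def run_support_agent(ticket: str, temperature: float) -> str:
    # Placeholder for the real model call; swap in your serving client.
    return "You are eligible for a prorated refund within the refund window."

def evaluate(cases: list[EvalCase]) -> list[str]:
    """Return one human-readable failure per case that misses its criteria."""
    failures = []
    for case in cases:
        reply = run_support_agent(case.ticket, case.temperature)
        missing = [kw for kw in case.must_mention if kw.lower() not in reply.lower()]
        if missing:
            failures.append(f"{case.user_segment}: missing {missing}")
    return failures

if __name__ == "__main__":
    print(evaluate(SUPPORT_EDGE_CASES) or "all cases pass")
```

The harness itself is trivial. The point is that the person who edits must_mention and the person who edits temperature can be the same person in the same pull request.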

Platform-product friction. Central teams optimize for reuse. Product teams optimize for outcomes. These are not the same thing.

A central team that builds a retrieval layer will build it to work across multiple use cases, with abstractions that preserve generality. A product team that needs retrieval for a specific vertical — legal document search, say, or medical code lookup — needs something tuned to the statistical properties of their data, the precision-recall tradeoffs that matter for their users, and the failure modes their users actually experience. The general-purpose layer doesn't fit. Adapting it requires either bending the central team's abstractions or building workarounds that undermine the point of having a central team in the first place.
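
To make the mismatch concrete, treat the retrieval settings as the contested ground. The parameter names below (candidate_k, score_threshold) are invented for this sketch, not any real library's API, but they capture the tradeoff: the platform default is tuned to aggregate metrics, while a legal-search team biases hard toward precision because a wrong citation costs more than a missed one.

```python
# A hedged sketch of the platform/product split in a retrieval layer:
# the interface is generic; the tuning is domain knowledge.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class RetrievalConfig:
    candidate_k: int        # how many documents the first stage returns
    score_threshold: float  # rerank cutoff: higher means precision-biased

# Platform default, optimized for aggregate performance across all tenants.
PLATFORM_DEFAULT = RetrievalConfig(candidate_k=50, score_threshold=0.30)

# Legal document search: cast a wider first-stage net, cut harder at rerank.
LEGAL_SEARCH = RetrievalConfig(candidate_k=200, score_threshold=0.75)

def retrieve(query: str,
             search: Callable[[str, int], list[tuple[str, float]]],
             config: RetrievalConfig) -> list[str]:
    # `search` stands in for any first-stage retriever returning
    # (document, relevance_score) pairs.
    candidates = search(query, config.candidate_k)
    return [doc for doc, score in candidates if score >= config.score_threshold]
```

The central team can ship PLATFORM_DEFAULT. What it cannot do from a distance is know that LEGAL_SEARCH needs to exist, or what its numbers should be.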

Conway's Law Doesn't Care About Your Intentions

Conway's Law holds that organizations design systems that mirror their communication structures. In software, this means that if your backend, data, and AI teams are separate groups with separate reporting chains, you get tightly coupled systems with brittle interfaces at exactly the organizational boundaries where communication is hardest.

Applied to AI: a centralized AI team produces generalized AI systems. A team that serves everyone serves no one optimally. Its models are tuned to aggregate performance. Its evaluation criteria reflect what's measurable across all use cases rather than what matters most in any specific one. The system it ships looks like its own organizational structure: centralized, general, one layer removed from the domain problems it's supposed to solve.

The inverse is also true. When AI capability lives inside the teams that own the domain, the systems those teams build develop domain specificity over time. The engineers who tune the prompts are the same ones who attend the sprint review where a user researcher shows why a certain phrasing confuses customers. The feedback loops close faster because they don't have to cross an organizational seam.

What Federation Actually Looks Like

The federated model isn't "every team does its own thing." That produces the fragmentation and governance gaps the central team was supposed to fix. The pattern that works is a thin shared infrastructure layer with autonomous domain ownership above it.

What stays centralized (a minimal sketch of this layer follows the list):

  • Model serving infrastructure (standardized deployment, versioning, routing, and load management)
  • Evaluation frameworks and shared testing standards
  • Safety tooling and guardrails
  • Compute allocation and quota governance
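
Concretely, the shared layer can be as thin as a gateway policy: the platform team owns enforcement, and everything built on top belongs to the domain teams. This is a sketch under assumptions, not a real API; the model IDs, guardrail names, and quota figures below are all illustrative.

```python
# An illustrative sketch of the thin centralized layer as a gateway policy.
from dataclasses import dataclass

@dataclass(frozen=True)
class TeamPolicy:
    allowed_models: tuple[str, ...]  # versioned serving endpoints
    monthly_token_quota: int         # compute allocation and governance
    guardrails: tuple[str, ...]      # shared safety tooling, centrally enforced

POLICIES = {
    "support": TeamPolicy(
        allowed_models=("chat-large-v3", "chat-small-v3"),
        monthly_token_quota=50_000_000,
        guardrails=("pii-redaction", "toxicity-filter"),
    ),
    "legal-search": TeamPolicy(
        allowed_models=("embed-v2", "rerank-v1"),
        monthly_token_quota=20_000_000,
        guardrails=("pii-redaction",),
    ),
}

def authorize(team: str, model: str) -> TeamPolicy:
    """Central enforcement without central ownership of the use case."""
    policy = POLICIES[team]
    if model not in policy.allowed_models:
        raise PermissionError(f"{team} is not provisioned for {model}")
    return policy
```

Prompts, retrieval tuning, evaluation cases, and the iteration loop all live above this line, inside the teams that own the domain.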