We're Building a Centralized AI Control Plane—Here's Why We Can't Wait Until It's a Crisis

cto_michelle · March 22, 2026, 12:53pm

I just got approval from our board to build what I’m calling our “AI Control Plane”—and I know some of you are going to think I’m creating bureaucracy where we need speed. But hear me out.

Three months ago, we had 50+ engineers using 12 different AI coding assistants, LLM APIs, and agent frameworks. Zero visibility. Zero governance. Zero idea what data was being sent where. When our security team asked “can you audit AI tool usage?” the answer was “…we’ll get back to you.”

Last week, I mandated that all AI tool usage goes through a centralized control plane managed by our platform team. The reaction was… mixed. Some engineers accused me of killing innovation. Others thanked me for finally addressing what they saw as a compliance disaster waiting to happen.

The 10× Retrofit Cost Is Real

Here’s what convinced me we couldn’t wait: I talked to three CTOs who retrofitted AI governance after an incident. One had an agent leak PII into training data. Another had a developer accidentally expose API keys through an AI chat log. The third had a compliance audit fail because they couldn’t demonstrate data lineage for AI-generated code.

In all three cases, the retrofit cost 10× what proactive governance would have cost. Not just in engineering time—in legal reviews, customer trust rebuilding, talent drain from frustrated engineers, and opportunity cost while everything was locked down.

Our Architecture: Centralized Control, Decentralized Innovation

I’m not trying to control which AI tools engineers use. I’m trying to ensure that whatever they use goes through a governed layer. Our architecture:

Centralized (Platform Team Owns):

Authentication and authorization for all AI services
Centralized logging and observability of AI interactions
Policy enforcement (data classification, PII filtering, rate limiting)
Security scanning of prompts and responses
Cost tracking and budget controls

Decentralized (Engineering Teams Choose):

Which AI coding assistant (GitHub Copilot, Cursor, etc.)
Which LLM API for specific use cases
Agent frameworks and implementation patterns
Tool selection within approved categories

We’re implementing what I’m calling the “four pillars” based on recent CNCF research:

Golden paths: Pre-approved AI tool configurations
Guardrails: Policy enforcement at runtime
Safety nets: Monitoring and alerting for anomalies
Manual review: Human-in-the-loop for high-risk operations

The Galileo Agent Control release from earlier this year is inspiring our approach: write behavioral policies once, enforce them across all agent deployments. We’re building on Kong API Gateway + Open Policy Agent + Datadog observability.
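To make the split concrete, here is a minimal sketch of the centralized layer: whichever tool sends the request, it is authenticated, checked against policy, and logged. The names (`AIRequest`, `govern`, the classification labels) are assumptions for illustration. In the stack described above, the decision would presumably live in Open Policy Agent rather than in Python.

```python
from dataclasses import dataclass, field

# Hypothetical classification labels; a real scheme would come from the data owners.
BLOCKED_CLASSIFICATIONS = {"restricted"}

@dataclass
class AIRequest:
    user: str
    tool: str                  # any assistant or API; the layer does not care which
    data_classification: str   # e.g. "public", "internal", "restricted"

@dataclass
class Decision:
    allowed: bool
    reasons: list = field(default_factory=list)

AUDIT_LOG = []  # stand-in for a real logging pipeline

def govern(request, authenticated_users):
    """Authenticate, apply policy, and log, regardless of which tool sent the request."""
    reasons = []
    if request.user not in authenticated_users:
        reasons.append("unauthenticated")
    if request.data_classification in BLOCKED_CLASSIFICATIONS:
        reasons.append("classification:" + request.data_classification)
    decision = Decision(allowed=not reasons, reasons=reasons)
    # Every request is logged, allowed or not, so the audit question has an answer.
    AUDIT_LOG.append({"user": request.user, "tool": request.tool,
                      "allowed": decision.allowed, "reasons": list(reasons)})
    return decision
```

The tool name is recorded but never used in the decision, which is the point: teams pick the tool, the platform owns the checks.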

What I’m Struggling With

The bottleneck risk: Will centralized governance slow down engineering velocity so much that we lose our competitive edge?
The staffing challenge: Who owns this? Platform team is already stretched. Security team doesn’t have AI expertise. Do we hire a dedicated AI governance team?
The tool integration problem: Not all AI tools have APIs we can intercept. How do you govern local IDE assistants that call LLMs directly?
The buy-in problem: How do I get engineers excited about governance instead of seeing it as a blocker?

For those of you who’ve built centralized AI governance: What does your architecture look like? What am I missing? What should I prioritize in the first 90 days?

And for those who think I’m making a mistake—tell me why. I’d rather hear the counterarguments now than discover them the hard way six months from now.


eng_director_luis · March 22, 2026, 12:53pm

Michelle, I appreciate the thoroughness here, but I’m going to push back on the centralized approach—not because I don’t value governance, but because I’ve seen decentralization work at scale in ways that surprised me.

We just studied the Adidas infrastructure engineering case from earlier this month. They went the opposite direction: shifted FROM centralized Infrastructure-as-Code TO a decentralized model. Result? Five autonomous teams deployed 81 new infrastructure stacks in two months. The key was layered IaC modules, automated pipelines, and shared frameworks—not central control.

Decentralized with Guardrails Can Scale

Here’s what I think you’re missing: You can have governance without centralization. We’re doing it right now in financial services, where compliance requirements are extreme.

Our approach:

Shared policy libraries that teams implement locally
Automated compliance scanning in CI/CD (every deployment checked)
Federated monitoring where each team owns their observability, but security has read access to everything
Self-service golden paths with built-in governance (but teams can opt out if they document why)

Engineers choose their AI tools from an approved list. Each tool connects through our network with built-in DLP and logging. But there’s no “platform team approval” bottleneck—they provision instantly through Infrastructure as Code with policies baked in.
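As a rough idea of what an automated compliance check in CI could look like, the sketch below scans configuration files for hardcoded direct provider endpoints that would bypass the governed network path. The host list and the `find_violations` helper are assumptions for illustration, not a description of any specific scanner.

```python
# Direct provider endpoints that should not appear in committed config.
# The list is illustrative and would need to be maintained.
DIRECT_PROVIDER_HOSTS = ("api.openai.com", "api.anthropic.com")

def find_violations(files):
    """Return (path, line_number, host) for every direct provider reference.

    `files` maps a file path to its text content.
    """
    violations = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for host in DIRECT_PROVIDER_HOSTS:
                if host in line:
                    violations.append((path, lineno, host))
    return violations

if __name__ == "__main__":
    sample = {"svc/.env": "A=1\nLLM_URL=https://api.openai.com/v1\n"}
    problems = find_violations(sample)
    for path, lineno, host in problems:
        print(f"{path}:{lineno}: direct call to {host}; use the governed endpoint")
    # A CI job would exit non-zero here to fail the deployment.
    raise SystemExit(1 if problems else 0)
```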

The Velocity Question You Should Be Asking

Your biggest risk isn’t security incidents—it’s that your best engineers leave because they feel micromanaged. The engineers who deliver 10× output are often the ones who hate process the most.

When does centralized governance become a bottleneck? In my experience: whenever the control plane is slower than the workaround. If engineers can get their work done faster by NOT using your AI control plane, some percentage will find ways around it.

Questions for you:

What’s the SLA for your platform team to approve a new AI tool integration? Days? Weeks?
What happens when an engineer needs a capability that’s not in your approved list?
How do you prevent this from becoming “security theater”—process that feels safe but doesn’t actually reduce risk?

What I’d Recommend Instead

Start with the minimum viable governance:

Visibility first: Monitor what’s being used (don’t block yet)
Education: Teach engineers why certain AI tool patterns are risky
Golden paths with escape hatches: Make compliant usage easy, but don’t forbid non-compliant usage—just make it require documentation

Then measure: Are you seeing risky behaviors? What percentage of engineers are outside the golden path and why?

You might find that 90% of engineers happily use your pre-approved tools, and the 10% who need something custom have legitimate reasons that centralized control would have blocked.
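The measurement step can start very small. The sketch below assumes usage events are available as `(user, tool)` pairs, which is a hypothetical log shape, and reports what share of engineers touched anything outside the approved list and which tools they used.

```python
from collections import Counter

def off_path_report(events, approved_tools):
    """Summarize usage outside the golden path.

    `events` is an iterable of (user, tool) pairs from whatever monitoring exists.
    Returns (share_of_users_off_path, Counter of unapproved tools).
    """
    events = list(events)
    all_users = {user for user, _ in events}
    off_users = {user for user, tool in events if tool not in approved_tools}
    off_tools = Counter(tool for _, tool in events if tool not in approved_tools)
    share = len(off_users) / len(all_users) if all_users else 0.0
    return share, off_tools
```

The tool counter is the useful part: it tells you what to ask the off-path engineers about before deciding whether anything needs blocking.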

The Adidas lesson: Trust + guardrails scales better than control + exceptions.

That said—if you’re in healthcare or finance with hard compliance requirements, centralized might be the only option. But for most SaaS companies? I think you’re solving for a problem that education and visibility would address at 1/10th the cost.

What’s the regulatory driver for centralization in your case?

vp_eng_keisha · March 22, 2026, 12:54pm

I’m going to respectfully disagree with both of you—because I think you’re debating a false dichotomy. It’s not centralized OR decentralized. It’s both.

Luis, the Adidas example is compelling, but I think you’re comparing apples to oranges. Infrastructure provisioning with IaC is a solved problem with mature tooling. AI governance in 2026 is still figuring out what “good” even looks like. The risks are different, the tooling is immature, and the regulatory landscape is actively evolving.

Michelle, I think your architecture is right but your framing might scare people. When you say “centralized control plane,” engineers hear “bottleneck.” What you’re actually building is centralized enforcement of decentralized choices.

Here’s the Hybrid Model We’re Using

Centralize this:

Authentication and authorization (identity provider integration)
Logging and audit trails (who used what AI tool when, with what data)
Policy enforcement (PII detection, rate limiting, cost controls)
Security monitoring (anomaly detection, threat hunting)

Decentralize this:

Tool selection within approved categories
Implementation patterns and frameworks
Team-specific AI use cases and workflows
Innovation and experimentation (within guardrails)

Real example: Our engineers can choose between Claude, GPT-4, Gemini, or Llama for their coding assistant. But whichever they choose, it routes through our API gateway with built-in PII filtering, prompt injection detection, and cost tracking.

The engineer doesn’t even know the control plane exists—they just authenticate once and their IDE works. But we have full visibility and control if something goes wrong.
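A minimal sketch of what the gateway does per request, assuming a regex-based email redactor and made-up per-model prices (`PRICE_PER_1K_TOKENS` is hypothetical, and the token estimate is deliberately crude). Real PII detection covers far more than email addresses.

```python
import re
from collections import defaultdict

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical prices in dollars per 1,000 tokens; not real vendor pricing.
PRICE_PER_1K_TOKENS = {"model-a": 0.01, "model-b": 0.002}

spend = defaultdict(float)  # running cost per user

def redact(prompt):
    """Mask email addresses before the prompt leaves the network."""
    return EMAIL.sub("[EMAIL]", prompt)

def route(user, model, prompt):
    """Redact, record estimated cost, and return the prompt to forward."""
    clean = redact(prompt)
    est_tokens = max(1, len(clean) // 4)  # rough heuristic, not a tokenizer
    spend[user] += est_tokens / 1000 * PRICE_PER_1K_TOKENS[model]
    return clean
```

The same `route` runs whichever model the engineer picked, so the choice stays with them while filtering and cost tracking stay uniform.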

The AWS Bedrock Guardrails Model

We’re building on AWS Bedrock’s approach: centralized policy management, but policies are enforced at the edge (where the AI interaction happens). Not in some central approval queue.

This means:

No platform team bottleneck for tool selection
No manual approval workflows
Policies are code (versioned, reviewed, tested)
Engineers get immediate feedback when a policy blocks something
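“Policies are code” can be as simple as rules kept in a repository next to their tests. The sketch below is an assumption about shape only; the rule IDs, the version string, and the `evaluate` helper are hypothetical and not part of Bedrock Guardrails.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    rule_id: str
    message: str                       # shown to the engineer immediately on a block
    violates: Callable[[dict], bool]   # True when the request breaks the rule

POLICY_VERSION = "2026.03.1"  # hypothetical; bumped through normal code review

RULES = [
    Rule("max-prompt-size", "Prompt exceeds 8000 characters.",
         lambda req: len(req["prompt"]) > 8000),
    Rule("no-secrets", "Prompt appears to contain a private key.",
         lambda req: "BEGIN PRIVATE KEY" in req["prompt"]),
]

def evaluate(request):
    """Return every rule the request violates; an empty list means allowed."""
    return [rule for rule in RULES if rule.violates(request)]

def test_private_key_is_blocked():
    hits = evaluate({"prompt": "-----BEGIN PRIVATE KEY-----"})
    assert [rule.rule_id for rule in hits] == ["no-secrets"]
```

Because each rule carries its own message, the block and the explanation arrive together at the point of interaction.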

Addressing Luis’s Concerns

“Your best engineers leave because they feel micromanaged”

This is real. But the answer isn’t “no governance”; it’s invisible governance. Make the compliant path indistinguishable from the non-compliant path, except that the compliant one is actually better (faster, more reliable, better UX).

If your control plane adds latency or friction, you’ve failed. It should add value: better models, cost optimization, automatic fallbacks, smart caching.

“Whenever the control plane is slower than the workaround”

Exactly. Which is why the control plane can’t be a human-approval system. It has to be automated, fast, and reliable.

Addressing Michelle’s Concerns

“Who owns this? Platform team? Security team?”

Both. And neither. You need a dedicated AI Platform team that reports jointly to Engineering and Security. This team’s job is making AI usage safe AND productive—not one at the expense of the other.

We staffed ours with:

2 platform engineers (API gateway, observability)
1 security engineer (policy design, threat modeling)
1 ML engineer (model evaluation, performance optimization)
0.5 product manager (internal developer experience)

“How do you govern local IDE assistants?”

Network-level controls. All outbound AI traffic routes through your gateway, enforced by network policies. If an IDE tries to call OpenAI directly, the network blocks it and redirects to your proxy.

Engineers can still use any IDE plugin they want—it just goes through your governance layer transparently.

The Question Nobody’s Asking

What’s the SLA for your AI control plane? If it goes down, do all AI-powered features stop working?

We run multi-region with automatic failover. The control plane is more reliable than any individual AI service because we can fail over between providers.
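Provider failover itself is a small piece of logic. A minimal sketch, assuming each provider is wrapped in a callable that raises a hypothetical `ProviderError` on failure:

```python
class ProviderError(Exception):
    """Raised by a provider wrapper when a call fails or times out."""

def call_with_failover(prompt, providers):
    """Try providers in order and return (provider_name, response) from the first success.

    `providers` is a list of (name, callable) pairs in priority order.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

This only covers failing over between model providers. The control plane’s own availability (the multi-region part) is a separate deployment concern that this sketch does not address.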

Michelle, you’re doing the right thing. Luis, your concerns are valid. The answer is: centralized policy enforcement with decentralized tool choice. Make governance invisible until it’s needed.

product_david · March 22, 2026, 12:55pm

This is a great technical discussion, but I’m going to ask the question that our CEO asked me when I proposed something similar: What’s the ROI?

Michelle, you mentioned the “10× retrofit cost” from your CTO conversations. But what’s the business case for the proactive investment? I need to translate this into language that non-technical stakeholders understand.

The Calculation I’m Struggling With

Cost of proactive governance:

4.5 FTEs for AI Platform team (Keisha’s staffing model)
API gateway infrastructure and observability tooling
Engineering time for integration (every team affected)
Velocity impact during transition

Cost of doing nothing:

Unknown probability of security incident
Unknown magnitude of business impact if incident occurs
Possible regulatory fines (but which regulations apply?)
Opportunity cost of engineers using suboptimal tools

How do I build a financial model around “unknown unknowns”?
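One way to get a first number without knowing the probabilities is to invert the question: how likely would an incident have to be for waiting to cost as much as acting now? The sketch below takes Michelle’s 10× retrofit figure at face value and uses a placeholder cost. It assumes governance fully prevents the incident, and it ignores discounting and partial losses, so treat it as a framing device rather than a model.

```python
def expected_cost_of_waiting(p_incident, incident_cost):
    """Expected loss from doing nothing over the chosen time horizon."""
    return p_incident * incident_cost

def break_even_probability(proactive_cost, incident_cost):
    """Incident probability at which waiting costs as much as acting now."""
    return proactive_cost / incident_cost

PROACTIVE_COST = 1_000_000   # hypothetical figure, not from this thread
RETROFIT_MULTIPLE = 10       # the 10x retrofit claim from the original post
incident_cost = PROACTIVE_COST * RETROFIT_MULTIPLE

p_star = break_even_probability(PROACTIVE_COST, incident_cost)
print(f"Break-even incident probability: {p_star:.0%}")
```

Under these assumptions the break-even is 1 / 10 = 10%, whatever the placeholder cost is. The board conversation then becomes “do we believe the chance of an incident over this horizon is above or below 10%?”, which is easier to argue than an unknown.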

The Product Velocity Concern

Here’s what keeps me up at night: We have customer-facing AI features on the roadmap. Personalized recommendations. AI-powered search. Automated customer support. These features are competitive differentiators.

If we route all AI interactions through a centralized control plane, what’s the latency impact? Can we still hit our performance SLAs? What happens when the control plane becomes the bottleneck for our product roadmap?

Luis mentioned engineers leaving over micromanagement. I’m worried about a different kind of attrition: customer churn because our AI features are slower or less capable than competitors who ship without governance overhead.

Different Governance Tiers by Risk Level?

What if we had different governance approaches based on risk profile:

Tier 1 - High Risk (Strict Governance):

Customer PII processing
Financial data analysis
Healthcare-related AI features
→ Centralized control plane, manual review for changes

Tier 2 - Medium Risk (Automated Governance):

Internal developer tools
Marketing content generation
Analytics and reporting
→ API gateway with automated policy enforcement

Tier 3 - Low Risk (Visibility Only):

Code completion in IDEs
Documentation search
Internal knowledge base queries
→ Logging and monitoring, but no blocking

This lets us focus governance investment where business risk is highest, while keeping velocity where it matters.
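The tiers above could be encoded as a lookup, sketched here with hypothetical use-case keys. Defaulting unknown use cases to the strictest tier is an added assumption, not part of the proposal.

```python
from enum import Enum

class Mode(Enum):
    MANUAL_REVIEW = "centralized control plane, manual review for changes"
    AUTOMATED = "API gateway with automated policy enforcement"
    LOG_ONLY = "logging and monitoring, no blocking"

# Hypothetical use-case keys mapped to the three tiers described above.
TIER_BY_USE_CASE = {
    "customer_pii": 1, "financial_data": 1, "healthcare": 1,
    "internal_dev_tools": 2, "marketing_content": 2, "analytics": 2,
    "ide_completion": 3, "doc_search": 3, "kb_queries": 3,
}

MODE_BY_TIER = {1: Mode.MANUAL_REVIEW, 2: Mode.AUTOMATED, 3: Mode.LOG_ONLY}

def enforcement_for(use_case):
    """Look up the enforcement mode; unclassified use cases get Tier 1 (assumption)."""
    tier = TIER_BY_USE_CASE.get(use_case, 1)
    return MODE_BY_TIER[tier]
```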

The Questions I Need Answered

What’s the business impact timeline? If we don’t implement governance and something goes wrong, when does that “wrong” happen? Months? Years? Are we solving for a 2026 problem or a 2028 problem?
What’s the competitive landscape? Are our competitors implementing AI governance? Or are they shipping faster because they’re taking on more risk?
What’s the regulatory timeline? Is there pending AI regulation that will force us to have governance in place by a certain date? Or is this insurance against a hypothetical future?
Can we stage the rollout? Start with Tier 1 high-risk use cases, prove the model works, then expand to Tiers 2 and 3?

The Business Stakeholder Communication Challenge

When I present this to our board, they’re going to ask: “Why do we need this NOW when we’ve been using AI for 18 months without incident?”

How do I make the case that the absence of visible problems doesn’t mean absence of risk—especially when I’m asking for budget that could fund 2-3 product engineers?

For those of you who’ve sold AI governance to non-technical leadership: What was your pitch? What data/examples convinced them it was worth the investment?

I want to support Michelle’s initiative. But I also need to deliver customer value and business results. Help me understand how these aren’t in conflict.

maya_builds · March 22, 2026, 12:55pm

Coming from the design and UX side, I want to warn everyone about something I’ve seen kill good governance initiatives: governance theater.

You can build the perfect technical architecture, staff the right team, get executive buy-in—and still fail if the developer experience is terrible. And from what I’m reading, there’s a real risk of that happening here.

Make the Compliant Path the Easy Path

Keisha mentioned “invisible governance” and that’s exactly right. But I want to dig deeper into what that means from a UX perspective.

Bad governance UX:

Engineer needs a new AI tool → Fills out Jira ticket → Waits 3 days → Gets approval → Still has to figure out integration themselves
AI call fails with cryptic error → No idea if it’s the policy, the network, or the model
Policy violation → Generic “Access Denied” message with no guidance on what to do

Good governance UX:

Engineer needs a new AI tool → Runs ai-tools add copilot → Automatically configured with auth and policies in 30 seconds
AI call fails → Clear error message: “Request blocked by PII filter. Detected: email address. See docs for sanitization patterns.”
Policy violation → Helpful guidance: “This operation requires manual review. Here’s how to request an exception: [link]”

The Workaround Problem

Luis is right to worry about engineers finding workarounds. But I’d reframe it: If your engineers are working around your governance, your governance UX failed.

I’ve seen this pattern:

Security team builds governance with great technical architecture
Makes engineers jump through hoops to use it
Engineers find creative workarounds (personal OpenAI accounts, shadow AI tools)
Security team adds more restrictions
Culture of compliance vs velocity emerges
Best engineers leave

The problem wasn’t the governance—it was that using compliant tools was harder than using non-compliant tools.

Three UX Principles for AI Governance

1. Self-service with safety rails

Don’t make engineers ask permission. Give them tools that enforce policy automatically. The platform team maintains the safety rails, but engineers drive within them at full speed.

Example: Our design system has built-in accessibility checks. Designers can work fast, but they can’t ship something that violates WCAG without explicitly acknowledging and documenting why.

2. Progressive disclosure

Most engineers, most of the time, don’t need to know governance exists. Only surface it when relevant:

AI call succeeds → Engineer sees nothing about governance
AI call rate-limited → Warning with current usage and limit
AI call blocked → Clear explanation and next steps
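The three cases above map onto a single function that decides what, if anything, the engineer sees. This is a sketch with hypothetical field names and message wording, loosely modeled on the “good UX” messages earlier in this post.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Outcome:
    status: str          # "ok", "rate_limited", or "blocked"
    used: int = 0        # requests used in the current window
    limit: int = 0       # request limit for the window
    policy: str = ""     # which policy blocked the call
    detected: str = ""   # what the policy found
    docs_url: str = ""   # where to read about next steps

def feedback(outcome: Outcome) -> Optional[str]:
    """Return the message to show the engineer, or None when governance stays invisible."""
    if outcome.status == "ok":
        return None
    if outcome.status == "rate_limited":
        return f"Warning: {outcome.used}/{outcome.limit} requests used in the current window."
    if outcome.status == "blocked":
        return (f"Request blocked by {outcome.policy}. Detected: {outcome.detected}. "
                f"See {outcome.docs_url} for next steps.")
    raise ValueError(f"unknown status: {outcome.status}")
```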

3. Fast feedback loops

If a policy blocks something, the engineer needs to know immediately, not in a security scan 3 hours later. Feedback should arrive in real time, at the point of interaction.

The Accessibility Parallel

Michelle, this reminds me of when we implemented design system governance. We could have had a “design review board” that approved every component. Instead:

Pre-approved components in Figma library with built-in constraints
Automated design token validation in CI/CD
Self-service exception process with documentation
Monthly review of exceptions to improve the system

Result: 95% design consistency, zero designer complaints about process, 90% reduction in accessibility bugs.

The key was making the governed path better than the ungoverned path—not just “safer” but actually easier and faster.

Questions for Michelle

What’s the onboarding experience for an engineer who joins tomorrow? Do they know how to use the AI control plane? Is there great documentation?
If an engineer hits a policy blocker, what’s the average time to resolution? Hours? Days?
Have you done user research with engineers on what frustrates them about the current AI tool landscape? You might find governance solves problems they already have (cost visibility, tool fragmentation, auth management).
Is there a feedback loop where engineers can suggest policy changes or new tool integrations?

The risk isn’t that your AI control plane is technically flawed—it’s that it’s so frustrating to use that your engineers route around it, making you less secure than before you started.

Make governance delightful, and you’ll get compliance for free.