94% View AI as Critical to Platform Engineering's Future, But 75% Are "Preparing" for AI Workloads, Not Running Them. What's the Real Timeline?

Here’s the paradox that’s keeping me up at night: 94% of organizations view AI as critical to platform engineering’s future. Yet when you dig into the CNCF Platform Engineering Survey, 75% are “preparing” for AI workloads—not running them. Only 7% deploy AI models daily. 47% deploy occasionally, meaning “a few times per year.”

That’s not a pipeline. That’s a pilot graveyard.

The Gap Between Belief and Execution

I’ve been wrestling with this at my own company. We’ve invested $2M into platform engineering over the past 18 months—Kubernetes, observability stack, the works. Our infrastructure is objectively ready. Yet our AI workloads are still in “preparation” mode.

Why? Skill gaps. The same CNCF report found 57% of organizations cite skill gaps as the primary barrier to AI integration. We can build the roads, but we don’t have drivers who know how to navigate them.

This isn’t an infrastructure problem anymore. It’s a talent and training problem masquerading as an infrastructure problem.

2026 Is the Year of Scale—But Only If You’re Ready

Deloitte’s AI Infrastructure analysis calls 2026 “the year of scale,” where the industry crosses from pilot to production. Inference workloads now rival training in compute demand. AI is doing productive work—if you can operationalize it.

But here’s the uncomfortable truth from 2025’s lesson: infrastructure readiness matters more than model capability. You can have the best model in the world, but if you can’t observe it, secure it, scale it, or explain its outputs to stakeholders, it stays in the lab.

Platform engineering is on track to hit 80% adoption by year-end (up from 55% in 2025), according to Platform Engineering maturity data. Meanwhile, The New Stack reports that AI and platform engineering are "merging into one and the same." If 80% of us have platforms but only 7% are deploying AI daily, something's broken.

What’s Actually Blocking You?

I’m curious what the real blockers are for this community:

  1. Skill gaps - Do you have engineers who understand both platform ops and AI model lifecycle?
  2. Observability - Can you actually monitor AI agent behavior in production, or are you flying blind?
  3. Organizational readiness - Is your business aligned on AI use cases, or is engineering building infrastructure for hypothetical products?
  4. Budget - Platform budgets are expected to jump from around $1M to $5-10M by year-end. Do you have that runway?

My Timeline Prediction

Based on the data and our own journey, here’s what I think happens:

  • Q2-Q3 2026: Most companies stay in “preparation” mode—building observability, upskilling teams, piloting 1-2 use cases
  • Q4 2026 - Q1 2027: Early adopters (the current 7%) scale to daily deployments; everyone else hits “production” with occasional deployments (the 47% bucket)
  • 2027: Deployment frequency normalizes as skill gaps close and tooling matures

We’re not seeing mass AI production workloads in 2026. We’re seeing infrastructure investment pay off in 2027.

But I’d love to be proven wrong. What’s your org’s timeline? Are you in the 7%, the 47%, or the “still preparing” majority? And what’s actually blocking you from moving faster?

This hits close to home. We’re definitely in the “preparing” bucket, and I’ll be honest—it feels like a euphemism for “stuck.”

My team has been asking to run AI workloads for 6 months. Our platform infrastructure is solid—Kubernetes, CI/CD, the whole stack. But when I ask “who’s going to own the AI model lifecycle?” I get blank stares. Nobody on my team has production ML experience. Nobody.

The Skill Gap Is Real—And Expensive

That 57% skill gap stat? We’re living it. I’ve tried to hire AI/ML engineers. The market is brutal. The candidates who do have the skills want equity in AI-first companies, not infrastructure roles at a financial services firm.

So the alternative is upskilling. But upskilling 40 engineers to understand model monitoring, drift detection, and inference optimization while they’re already underwater with existing work? That’s 12-18 months minimum. And that assumes we can find the right training programs and mentors.

The Budget Reality

Our platform budget is $1.2M this year—which sounds like a lot until you realize that expected budgets are $5-10M for comprehensive AI platform capabilities. We’re not even close. Adding AI observability tools alone would eat 20% of our budget.

So here’s my question back to you, Michelle: What’s the upskilling path that actually works? Are companies succeeding with:

  1. Hiring specialized AI platform engineers (and how are you competing on comp?)
  2. Sending existing platform engineers to intensive training (and which programs?)
  3. Partnering with consultants/vendors to bootstrap the capability (and avoiding vendor lock-in?)
  4. Accepting that “preparing” is the right answer for another 12 months while the ecosystem matures?

I’m asking because if we’re in the 75% majority, there must be a playbook emerging. Or are we all just stuck in parallel, waiting for the other shoe to drop?

Coming from the design/product side, I’ve been watching this infrastructure conversation with fascination—and a growing sense of dread.

Here’s what I’m seeing: We’re building AI features we can’t actually monitor or explain.

The Observability Black Hole

On a side project, I tried adding AI-powered design recommendations. The model worked locally. Looked great in demos. But the moment I thought about production, I hit a wall:

  • How do I know when the recommendations degrade?
  • How do I explain to users why the AI suggested something?
  • How do I roll back a model that’s “working” but producing weird edge cases?

My traditional monitoring stack (Datadog) tells me HTTP 200. Doesn’t tell me the response is confidently wrong. That AI observability gap is real—and it’s a $2B category nobody budgeted for.
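The fix, as far as I can tell, is to emit a quality signal alongside the HTTP status. Here's a minimal sketch of the idea; the heuristics, field names, and threshold are all invented for illustration, not any particular tool's API:

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    passed: bool
    score: float
    reason: str


def eval_recommendation(recommendation: dict, catalog: set[str]) -> EvalResult:
    """Post-hoc sanity check on a model response. An HTTP 200 that
    recommends an item we don't actually stock is 'confidently wrong',
    so we score it separately from transport-level success."""
    item = recommendation.get("item")
    confidence = recommendation.get("confidence", 0.0)
    if item not in catalog:
        return EvalResult(False, 0.0, f"hallucinated item: {item!r}")
    if confidence < 0.5:  # illustrative threshold
        return EvalResult(False, confidence, "low confidence")
    return EvalResult(True, confidence, "ok")


def record(result: EvalResult, metrics: list) -> None:
    # Emit this next to latency and status codes, so dashboards can
    # alert on eval pass-rate instead of just error rate.
    metrics.append({"metric": "ai.eval.pass", "value": int(result.passed)})
```

In production the check would be a real eval (an LLM judge, a golden set, a business rule), but the shape is the same: every AI response gets a pass/fail score that flows into the same metrics pipeline as everything else.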

We’re Building Distributed Systems Without Distributed Systems Expertise

That 35% of orgs run AI on hybrid platforms makes total sense to me. It's effectively saying "we bolted AI onto our existing platform and hoped for the best." That's what I did!

But Michelle’s point about infrastructure maturity is spot-on. MIT research shows 95% of multi-agent systems fail to reach production—not because the AI is bad, but because teams lack distributed systems expertise to orchestrate agents at scale.

Are we repeating the microservices hype cycle? Everyone adopted, few succeeded, because most teams didn’t understand the operational complexity they were taking on.

What I Wish Platform Teams Would Tell Me

As someone who wants to ship AI features but doesn’t own the platform:

  1. What observability primitives do you actually have? (Traces? Evals? Session coherence?)
  2. How do I know if I’m about to blow the inference budget?
  3. What’s the latency SLA for AI responses vs. traditional API calls?
  4. How do I handle model versioning and rollback?

If the answer is “we’re still figuring that out,” that’s fine—but let’s say it out loud instead of calling it “preparation.” We’re all learning in public here.
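On question 4 specifically, the shape of an answer is small even if production registries aren't. Here's a toy sketch of versioned model routing with instant rollback; the class and method names are mine, not any particular registry's (real tools like MLflow do this with more ceremony):

```python
class ModelRegistry:
    """Toy model registry: every deploy is a new version, and rollback
    just re-points 'current' at the previous one. No model artifacts are
    mutated, so rollback is instant and reversible by redeploying."""

    def __init__(self) -> None:
        self._versions: dict[str, object] = {}
        self._history: list[str] = []  # ordered deploys; last entry is live

    def register(self, version: str, model: object) -> None:
        self._versions[version] = model
        self._history.append(version)

    @property
    def current(self) -> object:
        return self._versions[self._history[-1]]

    def rollback(self) -> str:
        if len(self._history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._history.pop()       # drop the live version
        return self._history[-1]  # previous version is live again
```

The point for platform teams: if the answer to "how do I roll back a model?" is anything more complicated than a pointer swap like this, shipping AI features will feel risky no matter how good the models are.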

From the product side, this infrastructure-reality gap is creating serious strategic problems. Let me explain.

The Promise-Capability Mismatch

Our leadership sees competitors shipping AI features. They ask: “Why can’t we?” The answer—“our platform engineering team is in preparation mode”—doesn’t land well when the board expects AI in our Q3 roadmap.

But here’s the thing: I’ve learned to be grateful for “preparing” instead of rushing.

Last year, we shipped an AI feature without proper observability. It worked great for 2 weeks. Then model drift kicked in, recommendations got weird, and we had no instrumentation to diagnose it. We pulled the feature and spent 3 months rebuilding monitoring before re-launching.

That experience taught me: the 7% deploying AI daily aren’t necessarily winning. They might just be failing faster.
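For anyone in the same spot: the core of the drift instrumentation we eventually built is smaller than you'd think. A common, cheap signal is the Population Stability Index between a frozen baseline of model outputs and the live distribution. A self-contained sketch (bin count and alert threshold are rules of thumb, not laws):

```python
import math


def population_stability_index(baseline: list[float],
                               live: list[float],
                               bins: int = 10) -> float:
    """PSI between two samples of model outputs (scores, confidences,
    etc.). Identical distributions give ~0; a PSI above ~0.2 is commonly
    treated as 'investigate for drift'."""
    lo = min(min(baseline), min(live))
    hi = max(max(baseline), max(live))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def hist(xs: list[float]) -> list[float]:
        counts = [0] * bins
        for x in xs:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # smooth empty bins so log() below never sees zero
        return [(c + 1e-6) / (len(xs) + bins * 1e-6) for c in counts]

    b, l = hist(baseline), hist(live)
    return sum((li - bi) * math.log(li / bi) for bi, li in zip(b, l))
```

Run it on a schedule against each model's recent outputs and page when it crosses the threshold. It won't tell you *why* recommendations got weird, but it would have told us *that* they did, weeks earlier.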

The Inference Economics Shift

The shift to inference economics Michelle mentioned is critical for product strategy. Training (or fine-tuning) a model is a largely fixed, up-front cost. Running inference at scale is an ongoing cost that grows with every user.

Question: Do any of you have benchmarks for inference costs per user/session? I’m trying to model unit economics for an AI feature, and I have zero reliable data. Without that, I can’t make a build-vs-buy decision.
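For what it's worth, here's the back-of-envelope model I'd start from. Every number in the example is a made-up placeholder, not a benchmark; the structure (tokens per call × calls per session × price per token) is the part that transfers:

```python
def inference_cost_per_session(prompt_tokens: int,
                               completion_tokens: int,
                               calls_per_session: int,
                               price_in_per_1k: float,
                               price_out_per_1k: float) -> float:
    """Dollar cost of one user session, given average token counts per
    model call and the vendor's per-1K-token prices (input and output
    are usually priced differently)."""
    per_call = (prompt_tokens / 1000) * price_in_per_1k \
             + (completion_tokens / 1000) * price_out_per_1k
    return per_call * calls_per_session


# Example with invented numbers: 1.5K-token prompts, 400-token answers,
# 6 model calls per session, hypothetical $0.003 / $0.006 per 1K tokens.
cost = inference_cost_per_session(1500, 400, 6, 0.003, 0.006)
# ≈ $0.041 per session under these made-up inputs
```

Plug in your own vendor's price sheet and your observed token counts, multiply by sessions per month, and you have a first-order unit-economics line to compare against revenue per user. It's crude, but it's enough to frame the build-vs-buy conversation.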

The Vendor Lock-In Dilemma

Michelle mentioned vendor independence. This is top of mind for me. If we build on OpenAI’s API and they 10x pricing (or deprecate the model we depend on), we’re stuck. But building our own infrastructure means 12-18 months before we ship anything.
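One middle path is a thin provider abstraction, so swapping vendors is a config change rather than a rewrite. A sketch of the idea; the provider names and the single-method interface are illustrative, and real adapters would wrap the actual vendor SDKs:

```python
from typing import Protocol


class CompletionProvider(Protocol):
    """The one seam the rest of the codebase is allowed to see."""
    def complete(self, prompt: str) -> str: ...


class OpenAIProvider:
    """Thin adapter; would call the vendor SDK in real code."""
    def complete(self, prompt: str) -> str:
        raise NotImplementedError("wire up the vendor SDK here")


class LocalStubProvider:
    """Deterministic stand-in: useful for tests today, and a slot for a
    self-hosted model later."""
    def complete(self, prompt: str) -> str:
        return f"stub: {prompt[:20]}"


def get_provider(name: str) -> CompletionProvider:
    # Selection lives in config, so a vendor 10x-ing prices means
    # editing one setting, not every call site.
    providers = {"openai": OpenAIProvider, "local": LocalStubProvider}
    return providers[name]()
```

It doesn't eliminate lock-in (prompts and evals still get tuned to a specific model's behavior), but it caps the blast radius: the migration cost becomes "write one adapter and re-run the eval suite" instead of "audit every feature."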

Strategic question: Should we prioritize speed-to-market with vendor dependence, or infrastructure ownership with slower time-to-value?

The 75% “preparing” stat suggests most orgs are choosing infrastructure ownership. But are we all making the right bet, or are we ceding market position to the 7% who moved faster with less infrastructure?

I don’t have the answer. But I know our exec team won’t accept “we’re preparing” as a valid roadmap update much longer.