I’ve been building AI infrastructure for 6 years now: at Google Cloud AI, and currently scaling our startup’s LLM deployment platform. Last year we went from 15 engineers to 80, and I watched our DevOps practices, which worked beautifully at small scale, turn into an absolute coordination nightmare.
The DevOps scaling wall we hit
When we were 15 people, “shift left” was empowering. Every engineer owned their infrastructure. CI/CD configs, Kubernetes manifests, observability setup—all decentralized. We moved fast.
At 80 engineers? That same approach became “shift everywhere” chaos:
- Tool sprawl: 23 different monitoring dashboards, 17 variations of CI/CD pipelines, zero consistency
- Knowledge concentration: The 5 senior engineers who understood production became bottlenecks for every deployment question
- Duplication hell: Eight teams independently solving the same database backup problem, each with subtle bugs
- Coordination tax: More time spent in “how do we deploy this” meetings than actually deploying
The operational knowledge that used to be democratized became concentrated among a few exhausted seniors. DevOps promised to eliminate silos, but at scale it just created different ones.
Platform engineering’s promise (and my skepticism)
Now everyone’s talking about platform engineering as the solution: self-service portals, golden paths, internal developer platforms. The idea is “shift down” instead of “shift left”: embed capabilities into a platform layer rather than expecting every developer to become an infrastructure expert.
The pitch makes sense:
- Gartner predicts 80% of large software engineering orgs will have platform teams by 2026
- “Shift down” approach promises to eliminate toil, not redistribute it
- AI integration is now non-negotiable (94% view it as critical)
But here’s what makes me skeptical:
87% of leaders still cite manual processes as growth barriers despite platform engineering adoption. That stat is from the same sources evangelizing platform engineering. If it’s working so well, why are nearly 9 in 10 orgs still struggling?
And the resource reality is grim: 47.4% of platform teams operate with budgets under $1M, which experts call “systemic underfunding” that guarantees failure. Are we setting up platform teams to become the new bottleneck—just with better branding?
The questions I actually need answered
I’m not against platform engineering. I’m against cargo-culting the latest trend without understanding if it actually solves our problems or just renames them.
So here’s what I want to know from people who’ve lived this transition:
- Did you see measurable improvement? Not “developers are happier” vibes, but actual metrics: deployment frequency, lead time for changes, MTTR, production incident rates?
- What changed besides the org chart? Did you actually eliminate toil, or just move it from product engineers to a platform team that’s now underwater?
- How do you avoid the abstraction trap? When your platform obscures infrastructure, how do you debug complex issues? Are we trading operational knowledge for dependency on a platform team?
- What’s the right inflection point? At what team size does platform engineering stop being premature optimization and start being survival necessity?
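To make the first question concrete, here is a minimal Python sketch of the kind of before/after numbers I mean. Everything in it is an assumption for illustration: the `Deployment` and `Incident` record shapes and the `delivery_metrics` function are hypothetical, not from any particular tool, and real data would have to come from your own CI/CD system and incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    # Hypothetical record shape: one row per production deploy.
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it reached production


@dataclass
class Incident:
    # Hypothetical record shape: one row per production incident.
    started_at: datetime
    resolved_at: datetime


def delivery_metrics(deployments, incidents, window_days):
    """Summarize the four metrics over an observation window of `window_days` days."""
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    restore_times = [i.resolved_at - i.started_at for i in incidents]
    return {
        # Deployment frequency: deploys per day in the window.
        "deploys_per_day": len(deployments) / window_days,
        # Lead time for changes: median commit-to-production time.
        "median_lead_time": median(lead_times) if lead_times else None,
        # MTTR: mean time from incident start to resolution.
        "mttr": (sum(restore_times, timedelta()) / len(restore_times)
                 if restore_times else None),
        # Production incident rate: incidents per day in the window.
        "incidents_per_day": len(incidents) / window_days,
    }
```

Run it once over a window before the platform team existed and once over a comparable window after, and compare the two dictionaries. That is the level of evidence I’m asking for.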
I keep seeing the same pattern in our industry: a real problem (DevOps doesn’t scale), a rebranding (platform engineering), and breathless adoption before anyone asks if the new approach actually works differently.
Platform engineering might be the answer. But I need more than blog posts from platform vendors telling me it is. I need evidence from people who’ve made this work—or tried and failed—at real companies with real constraints.
What’s your experience been?