Kubernetes Cost Is Now the #1 Pain Point (88% Report Rising TCO) — Are Simpler Alternatives Like Nomad and Cloud Run Finally Winning?

I’ve been sitting on this post for a few weeks because I know it’s going to be controversial in certain circles. But after our Q4 infrastructure review, I can’t stay quiet anymore.

The Numbers Don’t Lie

The 2026 Cloud Native Computing Foundation survey landed last month and the headline finding is stark: 88% of organizations report rising Kubernetes total cost of ownership year over year, with 42% citing cost as their primary infrastructure pain point — up from 31% in 2024. This isn’t a fringe complaint from startups running hobby clusters. This is enterprise-scale feedback from organizations that bet big on Kubernetes and are now reckoning with the bill.

And when people say “cost,” they don’t just mean the EC2 or GKE compute bill. That’s actually the part you can see and optimize. The real costs are the ones that never show up on a single invoice.

The Hidden Cost Iceberg

Here’s what “running Kubernetes” actually costs beyond compute:

  • Dedicated platform engineering teams. You can’t run K8s without at least 2-3 engineers whose full-time job is keeping the platform healthy. At senior SRE salaries, that’s $500K-$800K/year before you deploy a single application.
  • Training and onboarding. Every new engineer needs weeks to become productive with your K8s setup. Helm charts, custom operators, networking policies, RBAC — the learning curve is a cliff, not a slope.
  • Security hardening. Pod security standards, network policies, secrets management, image scanning, admission controllers — each one is a project, not a task.
  • Upgrade cycles. Kubernetes ships a new minor release roughly every four months and retires old versions aggressively; each minor version gets only about 14 months of patch support. Each upgrade is a multi-week project involving testing, compatibility checks, and the occasional 3am incident.
  • Networking complexity. Service mesh, ingress controllers, DNS, load balancing — we’ve spent more engineering hours debugging Kubernetes networking than building features in some quarters.

Our K8s Journey: A Case Study in Complexity Creep

Three years ago, we migrated to EKS with 12 microservices. It felt manageable. Today we’re running 85 services on Kubernetes with a 4-person platform engineering team dedicated to keeping it all running. Our annual K8s-related costs — compute, platform team salaries, tooling licenses (Datadog, PagerDuty, Teleport, ArgoCD) — add up to roughly $1.2 million per year.
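For anyone who wants to sanity-check numbers like these against their own org, here's a rough tally. Every figure below is an illustrative assumption, not our exact budget, though the assumptions land near the same $1.2M total:

```python
# Back-of-the-envelope Kubernetes TCO tally. All inputs are
# illustrative assumptions, not actual figures from any one company.
def k8s_annual_tco(compute, platform_engineers, avg_salary, tooling):
    """Return (total, breakdown) for a rough annual K8s cost estimate."""
    salaries = platform_engineers * avg_salary
    total = compute + salaries + tooling
    return total, {"compute": compute, "salaries": salaries, "tooling": tooling}

total, breakdown = k8s_annual_tco(
    compute=400_000,       # nodes, load balancers, egress (assumed)
    platform_engineers=4,  # dedicated platform team
    avg_salary=175_000,    # assumed fully-loaded senior SRE cost
    tooling=100_000,       # Datadog, PagerDuty, Teleport, etc. (assumed)
)
per_service = total / 85   # spread across 85 services
print(f"${total:,} total, ~${per_service:,.0f} per service")
# → $1,200,000 total, ~$14,118 per service
```

Roughly $14K per service per year just to keep the platform running, and that's before you count training time or upgrade projects.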

For a mid-stage SaaS company, that’s a significant chunk of our engineering budget. And the honest question I keep asking myself is: do we need all of this?

The Alternatives Are Getting Serious

A few years ago, suggesting anything other than Kubernetes for container orchestration would get you laughed out of the room. That’s changing:

  • HashiCorp Nomad — simpler operational model, handles 80% of orchestration use cases with 20% of the complexity. I’ve talked to three CTOs this quarter who are running Nomad in production and loving it.
  • Google Cloud Run — serverless containers that scale to zero. No cluster management, no node pools, no upgrade cycles. You push a container and it runs.
  • AWS App Runner — Amazon’s answer to Cloud Run. Still maturing but the simplicity is compelling.
  • Fly.io — interesting for edge deployments and teams that want container orchestration without the Kubernetes overhead.

And here’s one that surprised me: Docker Swarm is having a quiet renaissance. I’ve heard from multiple teams going back to Swarm for straightforward web service deployments. As one engineer put it to me: “We don’t need a rocket ship to deliver pizza.”

My Contrarian Take

I’ll say it plainly: Kubernetes is the right choice for maybe 20% of companies. The other 80% adopted it because it was the “industry standard,” because it looked good on job postings, and because conference talks made it seem like the only serious option. Those companies are now paying a complexity tax that compounds every quarter.

The Migration Trap

Here’s the catch, and it’s a big one: once you’re on Kubernetes, the ecosystem lock-in makes leaving extraordinarily expensive. Helm charts, custom operators, Istio service mesh configurations, ArgoCD GitOps pipelines, Prometheus monitoring stacks — every tool in the CNCF landscape is another anchor keeping you on K8s.

We’re not moving off Kubernetes tomorrow. But I’ve started a working group to evaluate which of our services could be migrated to Cloud Run or a simpler platform without the K8s overhead.

Has anyone here successfully moved OFF Kubernetes? What did you move to, and what was the migration experience like? I’d especially love to hear from teams that went from K8s to something simpler and don’t regret it.

Michelle, I appreciate the candor but I have to push back on some of this — respectfully, as someone who manages K8s clusters for a living.

The Cost Problem Isn’t Kubernetes. It’s Discipline.

The 88% figure from the CNCF survey doesn’t surprise me, but I’d argue it says more about how organizations run Kubernetes than about Kubernetes itself. When I joined my current team two years ago, our monthly EKS bill was $47K. Today it’s $26K — a 45% reduction — while running 30% more workloads. Here’s what we did:

1. Resource limits and requests — properly configured. The number of teams running containers with no resource limits, or with requests set to 4 CPU and 8GB RAM “just in case,” is staggering. We audited every deployment and right-sized based on actual P95 usage. That alone saved us 25%.

2. Cluster autoscaler tuning. Most teams set up cluster autoscaler with default settings and never touch it again. We configured scale-down thresholds, set appropriate cooldown periods, and implemented pod disruption budgets that actually allow nodes to drain. Our average node utilization went from 35% to 72%.

3. Spot instances for stateless workloads. About 60% of our workloads can tolerate interruption. Running those on spot instances with proper fallback to on-demand saves us another 15-20%.

4. Karpenter instead of cluster autoscaler. Switching to Karpenter for node provisioning was a game-changer. It makes smarter bin-packing decisions and provisions the right instance types instead of one-size-fits-all node groups.
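To make point 1 concrete, here's a rough sketch of the right-sizing math. The nearest-rank P95 and the 20% headroom factor are illustrative choices on my part, not a prescription, and the sample numbers are invented:

```python
import math

def p95(samples):
    """95th percentile of a list of usage samples (nearest-rank method)."""
    s = sorted(samples)
    rank = math.ceil(0.95 * len(s))
    return s[rank - 1]

def right_size(cpu_millicores, mem_mib, headroom=1.2):
    """Suggest resource requests from observed usage.

    headroom=1.2 (a 20% buffer over P95) is an illustrative choice;
    pick a buffer that matches your workload's burstiness."""
    return {
        "cpu": f"{math.ceil(p95(cpu_millicores) * headroom)}m",
        "memory": f"{math.ceil(p95(mem_mib) * headroom)}Mi",
    }

# A service that was requesting "4 CPU / 8GB just in case" but whose
# observed usage (invented samples) peaks far lower:
cpu = [120, 150, 140, 600, 130, 145, 155, 160, 135, 500]
mem = [300, 320, 310, 450, 305, 315, 330, 340, 325, 400]
print(right_size(cpu, mem))  # → {'cpu': '720m', 'memory': '540Mi'}
```

Run a calculation like this against a few weeks of real metrics per deployment and the gap between requested and needed resources becomes impossible to ignore.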

On the Alternatives

I’ve evaluated Nomad, Cloud Run, and App Runner for our use cases, and here’s the reality: they solve simpler problems. If you’re running stateless web services with straightforward scaling needs, absolutely, Cloud Run is fantastic. I’ve recommended it myself for internal tools and simple APIs.

But my team runs workloads that need:

  • Custom operators for database lifecycle management (we built one for our PostgreSQL clusters)
  • Advanced networking with Cilium for microsegmentation and eBPF-based observability
  • Multi-cluster federation across three regions with consistent service discovery
  • GPU scheduling for ML inference workloads with custom topology-aware placement

Try assembling all of that on Nomad or Cloud Run. At best, you'd be rebuilding those primitives yourself. For complex, heterogeneous workloads, Kubernetes is still the only orchestrator that gives you the primitives to build what you need.

The Real Answer

The framework should be: do you need the complexity? If you’re running 12 stateless web services, K8s is overkill and I’ll be the first to say it. But the “20% of companies” figure is too aggressive. Any organization running stateful workloads, multi-region deployments, or heterogeneous compute (GPUs, ARM, mixed instance types) is going to find those simpler alternatives hitting walls fast.

The solution isn’t to abandon K8s. It’s to invest in the operational discipline to run it efficiently.

Michelle, this hits close to home. I’m coming at this from the other end of the scale spectrum — we run Kubernetes at a Fortune 500 financial services company — and my perspective is nuanced.

At Enterprise Scale, K8s Cost Is Justified (But Only Barely)

We operate 200+ Kubernetes clusters across multiple clouds and on-premises data centers. Our platform engineering organization is 30 people. That sounds enormous, and it is — but when you divide it across 2,000+ engineers deploying to those clusters, it’s actually a reasonable ratio. The alternative would be each team managing their own infrastructure, which we tried in 2019 and it was chaos.

At our scale, the CNCF survey data tracks. Our K8s costs have risen year over year, primarily driven by:

  • More teams onboarding to the platform (good problem to have)
  • Increasing security and compliance requirements (SOC 2, PCI-DSS, FedRAMP)
  • Service mesh adoption for zero-trust networking (expensive but required by our security team)

The cost is justified for us because the alternative is worse. Before K8s, we had teams deploying to bare EC2 instances with hand-rolled scripts, no standardized observability, and no consistent security posture. The platform cost is real, but the risk reduction and developer velocity gains offset it.

Where I Agree With You Completely

Small and mid-size companies should not be running Kubernetes. Full stop. If you have fewer than 50 engineers and fewer than 30 services, the operational overhead of K8s will eat you alive. I’ve seen this play out at three companies where friends work — they adopted K8s because it was trendy and ended up spending 25-30% of their engineering capacity just keeping the lights on.

Cloud Run Changed My Mind

Here’s what shifted my thinking: we started evaluating Google Cloud Run for new projects last year as an experiment. The results were eye-opening. For our internal tools, simple APIs, and event-driven services, Cloud Run worked beautifully. 70% of our new projects in the last six months launched on Cloud Run instead of K8s, and the teams are measurably happier and faster.

My Framework for 2026

After a year of running both K8s and simpler alternatives side by side, here’s the framework I give my teams:

| Workload type | Recommended platform | Why |
| --- | --- | --- |
| Stateless web services | Cloud Run / App Runner | Zero ops overhead, scales to zero, pay per request |
| Event-driven processing | Cloud Run + Pub/Sub | Native integration, no cluster management |
| Stateful services (databases, queues) | Kubernetes | Need persistent volumes, custom operators, fine-grained control |
| Complex multi-service systems | Kubernetes | Service mesh, advanced networking, custom scheduling |
| Batch / ML training | Serverless (Cloud Run Jobs, AWS Batch) | Pay for compute time only, no idle resources |
| Edge / latency-sensitive | Kubernetes or Fly.io | Need control over placement and networking |

The key insight: it’s not either/or. Most organizations should be running a mix, routing each workload to the platform that matches its complexity. The mistake is putting everything on K8s just because you already have it.
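The routing logic in that table is simple enough to write down. Here's a minimal sketch; the flag names are mine, invented for illustration, and real workloads will need more nuance than five booleans:

```python
def recommend_platform(stateful=False, complex_multi_service=False,
                       batch=False, event_driven=False, edge=False):
    """Route a workload to a platform, one rule per table row.

    Flags are illustrative simplifications of each workload type."""
    if stateful or complex_multi_service:
        return "Kubernetes"          # persistent volumes, mesh, operators
    if edge:
        return "Kubernetes or Fly.io"  # control over placement/networking
    if batch:
        return "Serverless (Cloud Run Jobs, AWS Batch)"  # no idle compute
    if event_driven:
        return "Cloud Run + Pub/Sub"   # native integration, no clusters
    return "Cloud Run / App Runner"    # stateless web service default

print(recommend_platform())                 # stateless default
print(recommend_platform(stateful=True))    # databases, queues
print(recommend_platform(batch=True))       # ML training, batch jobs
```

The order of the checks matters: statefulness and multi-service complexity trump everything else, which is exactly why those workloads stay on K8s while the rest migrate out.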

I’d love to hear if anyone else is running this kind of hybrid approach and what challenges they’ve encountered.

I’m going to come at this from a completely different angle: I don’t care about the infrastructure. I care about the developer experience, and Kubernetes is terrible at it.

The DX Tax Nobody Talks About

All this discussion about TCO and cost optimization is important, but it misses what I think is the biggest hidden cost: the cognitive load Kubernetes puts on application developers.

I’m a design systems lead. I build UI components and design tools. But at my current company, to deploy a new service, I need to:

  1. Write a Dockerfile (fine, I can handle this)
  2. Create a Helm chart with values files for dev, staging, and prod (why do I need to understand Go templating?)
  3. Configure an Ingress resource with annotations specific to our ingress controller (which one are we using again? nginx? Traefik? It changed last quarter)
  4. Set up resource requests and limits (how much CPU does my Node.js app need? I genuinely don’t know)
  5. Write network policies so my service can talk to the database (I had to learn about pod selectors and CIDR ranges for this)
  6. Debug why my pod is in CrashLoopBackOff (it was a missing environment variable that worked fine locally)

None of this helps me ship better products. None of it makes our users happier. It’s pure infrastructure overhead that’s been pushed onto application developers because the platform team is too busy keeping the clusters alive.

The Railway Revelation

At my previous startup (the one that failed, but not because of our infrastructure choices), we used Railway. The entire deployment process was: push to main, Railway builds and deploys, done. No Helm charts, no YAML files, no debugging pod scheduling.

We shipped features twice as fast because nobody spent time on infrastructure. Our tiny team of five engineers could focus entirely on the product. When something broke, Railway’s logs were clear and actionable — not a wall of Kubernetes events that require a PhD in distributed systems to interpret.

The Best Infrastructure Is Invisible

Here’s my hot take: the best infrastructure is the one developers never have to think about. If your developers are writing YAML instead of application code, your infrastructure has failed at its primary job — enabling developers to build products.

I hear the infrastructure folks saying “but you need K8s for complex workloads” and sure, maybe. But even then, the complexity should be abstracted away from the application developers. If your frontend engineers need to understand pod affinity rules, something has gone very wrong.

The platforms that are winning — Railway, Vercel, Render, Cloud Run — all understand this. They hide the complexity and let developers focus on what they’re actually paid to do: build things users love.

I’d take a “limited” platform that lets me ship fast over an “unlimited” platform that makes me an unpaid infrastructure engineer any day.