We Spent $2M Building a Unified Platform. Here's What Actually Happened

Over the past 18 months, we embarked on what seemed like a straightforward mission: unify our deployment pipelines. We had three separate systems: one for application deployments, one for ML model deployments, and one for data pipeline orchestration. The exec team saw the Gartner report predicting 80% adoption of unified platforms by 2026 and said “let’s be in that 80%.”

We allocated $2M in budget. Brought in consultants. Kicked off a six-month initiative to build what everyone was calling “the unified platform.” And honestly? We delivered exactly what we promised. A beautiful Kubernetes-based internal developer platform with a Backstage frontend, GPU node pools for ML workloads, integration with our model registry, the works.

App developers loved it. Deployment times dropped from days to hours. They were thrilled. We declared victory at the all-hands. Showed impressive metrics. Got executive buy-in for the next phase.

Then we noticed something: our ML engineering team hadn’t migrated. Six months after launch, they were still deploying models directly to SageMaker, bypassing our shiny new platform entirely.

The Brutal Truth

We built an app-dev platform with ML features bolted on. Not a unified platform. We approached it from an infrastructure perspective—“let’s make Kubernetes support all three workloads”—instead of a workflow perspective. We never asked the fundamental question: what does “deployment” actually mean to a data scientist versus a backend engineer?

For our app developers, deployment means: push code, run tests, build container, deploy to cluster, done. Linear workflow, clear success criteria.

For our ML engineers, deployment means: validate model performance, ensure feature parity with training data, configure inference endpoints, set up A/B testing infrastructure, establish monitoring for model drift, plan rollback strategy. It’s not linear—it’s iterative and experimental.

We built the first workflow and expected ML teams to adapt. They didn’t. They couldn’t. Their jobs are fundamentally different.

What We’re Doing Differently

Six months ago, we started over. This time with a co-design approach:

  1. Embedded platform engineers in the ML team for a month. Just observed. Watched how they actually work.

  2. Workflow mapping sessions where ML engineers walked us through their ideal deployment process, not the one our platform imposed.

  3. Prototype testing with real ML workloads. We validated workflows before building any infrastructure.

  4. Separate abstractions for different personas that compile down to the same underlying platform. App devs see containers and deployments. ML engineers see model versions and inference endpoints. Same Kubernetes, different interfaces.
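Item 4 is the part that is easiest to hand-wave, so here is a minimal sketch of what “separate abstractions that compile down to the same platform” can look like in code. All names here (`AppSpec`, `ModelSpec`, the registry URL) are illustrative assumptions, not details from our platform; the point is that two persona-facing inputs produce the same underlying Kubernetes Deployment shape.

```python
# Hypothetical sketch: two persona-facing specs compiling to one
# Kubernetes Deployment manifest shape.
from dataclasses import dataclass


@dataclass
class AppSpec:
    """What an app developer fills in: container image and replicas."""
    name: str
    image: str
    replicas: int = 2


@dataclass
class ModelSpec:
    """What an ML engineer fills in: model identity and hardware needs."""
    model_name: str
    model_version: str
    gpu: bool = False


def _deployment(name: str, image: str, replicas: int, resources: dict) -> dict:
    """Both personas land on the same underlying manifest structure."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "resources": resources}
                    ]
                }
            },
        },
    }


def compile_app(spec: AppSpec) -> dict:
    return _deployment(spec.name, spec.image, spec.replicas, {})


def compile_model(spec: ModelSpec) -> dict:
    # The ML abstraction resolves a model version to a serving image and
    # attaches GPU resources; the ML engineer never writes YAML.
    image = f"registry.internal/serving:{spec.model_version}"
    resources = {"limits": {"nvidia.com/gpu": 1}} if spec.gpu else {}
    return _deployment(f"model-{spec.model_name}", image, 1, resources)
```

The design choice worth noticing: the shared `_deployment` helper is the “same Kubernetes” layer, and everything persona-specific lives above it.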

It’s slower. Much slower. We’re eight months in and still not at feature parity with what we “delivered” in the first attempt. But adoption is real this time. ML teams are actually using it because we built for their workflows, not ours.

The Lessons

If I could go back and advise the team that started this journey:

  • Start with workflows, not infrastructure. Your platform is only “unified” if it serves all workflows, not just one.

  • Co-design isn’t optional. Every persona needs representation from day one, not “we’ll add ML support in v2.”

  • Measure adoption, not features. We built everything on the roadmap but failed at the only metric that matters: are people using it?

  • Different doesn’t mean worse. ML deployment workflows aren’t broken app workflows. They’re legitimately different, and that’s okay.

  • Budget for iteration. The $2M we spent taught us what NOT to build. That has value, but only if you get budget to apply those lessons.

Right now we’re at 60% adoption across all personas. Not the 100% we optimistically projected, but it’s real adoption. ML engineers are choosing the platform because it works for them, not because we mandated it.

The unified delivery pipeline is coming. But unification at the infrastructure level without workflow-level empathy is just infrastructure consolidation with better PR.

For teams starting this journey: What’s your definition of “unified”? Because if it doesn’t include the workflows of your least-represented persona, you’re building the same expensive lesson we did.

Luis, thank you for the honesty here. This is the real story behind every “80% platform adoption” statistic that gets thrown around. The gap between “we built it” and “they use it” is where most platform initiatives die quietly.

Your experience mirrors ours almost exactly. Three years ago, we went through our own “unified platform” initiative. Declared success after launch. Showed the metrics at board meetings. And then slowly realized half the engineering org had routed around it within six months. We called it “platform engineering theater”—all the artifacts of success without actual adoption.

The Technology Was Easy

Here’s what nobody talks about: the technical implementation of a unified platform is actually the straightforward part. Kubernetes can handle both app workloads and ML workloads. Service meshes work fine for both. The infrastructure unification? That’s solved technology.

The hard part—the part we completely underestimated—is workflow change. And workflow change requires culture change. And culture change is expensive, slow, and doesn’t show up on Gantt charts.

Your point about starting with workflows instead of infrastructure is exactly right. But I’d add: workflows are downstream from incentives. ML engineers weren’t avoiding your platform because it was technically insufficient. They avoided it because their job success metrics weren’t aligned with platform adoption. They’re measured on model accuracy and deployment velocity. If your platform slows down either, they’ll route around it. Full stop.

The Co-Design Bet

Your co-design approach is the right call, but here’s the challenge you’re probably already hitting: executive patience for “slower ROI.” The first attempt showed results in six months. The rebuild is eight months in with “only” 60% adoption. From a board perspective, that looks like regression.

This is where platform-as-product thinking becomes critical. You need product-style success metrics that make the 60% real adoption look better than the 100% fictional adoption. Metrics like:

  • Weekly active users by persona
  • Deployment frequency by workflow type
  • Time-to-first-deployment for new team members
  • Platform NPS by team (not just overall)

These tell the real story. But you need executive buy-in that these matter more than “we built all the features on the roadmap.”
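The per-persona slice of the first metric is the part most dashboards skip, and it’s cheap to compute. A stdlib sketch, assuming a flat event log with `user`, `persona`, and `week` fields (assumed field names, not anything from the post):

```python
# Illustrative sketch: weekly active users by persona from a flat
# platform event log. Field names are assumptions for this example.
from collections import defaultdict


def weekly_active_by_persona(events):
    """events: iterable of dicts with 'user', 'persona', 'week' keys.

    Returns {week: {persona: count_of_distinct_active_users}}.
    """
    seen = defaultdict(set)  # (week, persona) -> distinct users
    for e in events:
        seen[(e["week"], e["persona"])].add(e["user"])
    out = defaultdict(dict)
    for (week, persona), users in seen.items():
        out[week][persona] = len(users)
    return dict(out)
```

Deduplicating on distinct users matters: one enthusiastic app team triggering hundreds of deploys should not mask zero ML logins that week.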

The Question That Matters

You asked “how did you get buy-in to start over?” Here’s what worked for us:

We stopped selling the platform rebuild as a technical project and started selling it as a retention project. We showed that our ML engineers were the most expensive talent to replace and the biggest flight risk because they spent 40% of their time fighting deployment tooling. We framed the co-design rebuild as “what does it cost to lose three senior ML engineers?”

Suddenly, a $2M platform rebuild looked cheaper than recruitment and ramp-up costs. The CFO became our biggest supporter.

Where I Disagree (Slightly)

You said the unified delivery pipeline is coming. I’m less certain. I think unified infrastructure is coming—we’ll all run on Kubernetes or its successor. But I think we’re going to see persistent workflow differentiation at the abstraction layer.

Just like “full-stack engineer” turned out to mean “frontend specialist who can read backend code” or vice versa, I think “unified platform” will turn out to mean “shared infrastructure with persona-specific interfaces.”

The unification happens at the Kubernetes layer. The differentiation happens at the developer experience layer. And that’s fine. That might actually be the right architecture.

The Real Win

60% real adoption across all personas is legitimately impressive. Most “unified platforms” have 90% adoption from app teams and 10% from ML teams. You’ve actually achieved workflow parity.

The next challenge: how do you prevent drift? How do you ensure that as the platform evolves, it continues to serve all personas equally? That requires organizational structure, not just technical architecture. Do you have product owners for each persona? Do ML teams have dedicated platform liaisons?

Because the risk now is that you’ve achieved balance, but the platform team is still 80% app-dev focused. The next 10 features will accidentally optimize for the majority persona unless you have structural checks against it.

This is a masterclass in product thinking, even though you didn’t frame it that way. As someone who lives in product land, everything about this story screams “we built for power users and wondered why everyone else didn’t adopt.”

The Jobs-To-Be-Done Lens

Here’s the framework I’d apply: what job were ML engineers hiring SageMaker to do? And what job were you offering with your platform?

SageMaker’s job: “Help me get a validated model into production with minimal context switching from my training workflow.”

Your platform’s job (v1): “Give me a Kubernetes-native way to deploy containers.”

Those aren’t the same job. ML engineers don’t wake up thinking “I wish I could deploy to Kubernetes.” They think “I need this model serving predictions by Friday.”

The $2M you spent wasn’t wasted—it taught you what job you were actually solving for. And that’s valuable product discovery. Expensive, but valuable.

The Real Product Question

You mentioned you’re measuring success differently in the rebuild. I’m curious: how are you defining “done” for each persona?

For app developers, “done” might be: code deployed, health checks passing, rollback strategy configured.

For ML engineers, “done” is probably: model serving predictions, performance validated against baseline, drift monitoring configured, A/B test running, fallback to previous model working.

If your platform treats both as “deployment complete,” you haven’t actually unified the workflows—you’ve just unified the infrastructure. The actual value delivery happens after what your platform calls “done.”
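One way to make that gap concrete: encode each persona’s definition of “done” as an explicit checklist the platform tracks, rather than a single shared “deployment complete” event. The check names below are hypothetical, lifted straight from the two lists above; the point is the per-persona completion contract.

```python
# Sketch of persona-specific "done" contracts. Check names are
# illustrative, mirroring the two definitions described in the text.
DONE_CRITERIA = {
    "app": [
        "code_deployed",
        "health_checks_passing",
        "rollback_configured",
    ],
    "ml": [
        "model_serving",
        "performance_validated_vs_baseline",
        "drift_monitoring_configured",
        "ab_test_running",
        "fallback_model_working",
    ],
}


def deployment_done(persona: str, completed: set) -> bool:
    """Done only when every check for that persona's contract passes."""
    return all(check in completed for check in DONE_CRITERIA[persona])
```

With this shape, an ML deployment that is merely “serving” is visibly not done, which is exactly the distinction a single shared status would erase.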

This is why I think the co-design approach will work. You’re finally asking: what does success look like from the user’s perspective? Not from the platform team’s perspective.

Measuring What Matters

Michelle’s right that you need different success metrics. But I’d go further: you need different metrics for different personas.

App developer success metrics:

  • Time from commit to production
  • Deployment failure rate
  • Rollback time

ML engineer success metrics:

  • Time from model training to inference endpoint
  • Model performance in production vs. validation
  • Time to detect and remediate model drift

If you’re measuring both personas with the same metrics, you’re missing the point. The platform is “unified” at the infrastructure layer but should be “differentiated” at the outcome layer.

The Uncomfortable Question

Here’s what I’m wrestling with: is 60% adoption across all personas actually the goal? Or is it a sign that you’ve found the lowest common denominator?

I’m being provocative, but hear me out: what if the right answer is 100% adoption from app developers and 40% adoption from ML engineers? Because 40% of ML workloads actually fit the unified model, and the other 60% are edge cases that should use specialized tooling?

The product trap here is assuming every workflow should be served by the same platform. Sometimes the right answer is: we serve these personas, not those. That’s product strategy—choosing what not to build.

I’m not saying that’s your situation. But it’s worth asking: what workflows are you trying to support that maybe shouldn’t live on a unified platform?

What Would Product Discovery Look Like?

If I were approaching this as a product manager, here’s the discovery I’d want:

  1. User interviews: Not with “ML engineers” broadly, but with specific roles. The data scientist who builds models isn’t the same user as the ML engineer who deploys them. Different jobs to be done.

  2. Journey mapping: What’s the end-to-end path from “I have a trained model” to “model is serving production traffic”? Where are the pain points? What takes time? What requires tribal knowledge?

  3. Competitive analysis: Why do teams choose SageMaker over your platform? What does it do better? (Not what does it do differently, but what does it do better for their specific job.)

  4. Success criteria validation: How do different personas define successful deployment? Is it speed? Reliability? Observability? Cost control?

That discovery probably would have cost less than $2M and prevented a lot of the rebuild.

The Question I’d Ask Your Team

In the rebuild, how are you measuring “done”? Not platform “done,” but user “done.”

Because if your platform’s definition of success is “model deployed” but the ML engineer’s definition is “model validated in production,” you’ve still got a gap. The platform succeeded but the user didn’t.

That’s the metric that matters: what percentage of users achieve their actual goal (not just use the platform)?

Oh wow, this hits close to home. Not with platform engineering specifically, but with the design systems journey we went through. We had almost the exact same pattern: build it, declare victory, watch adoption crater, figure out what we did wrong, rebuild with actual user input.

The “Build It and They Will Come” Fallacy

Here’s what I’ve learned from design systems (which are kind of like platforms for designers): if your users don’t show up during the building process, they won’t show up during the using process either.

You mentioned that ML engineers ignored your platform for six months. I’d bet money that those same ML engineers weren’t in the room when you were designing it. Am I right?

Because that’s exactly what happened with our first design system. We (the design team) built this beautiful component library with perfect documentation. Showed it at a demo. Everyone clapped. And then… engineers kept building one-off components instead of using ours.

Why? Because we built what we thought they needed, not what they actually needed. We optimized for design consistency (our goal) instead of developer velocity (their goal). Sound familiar?

The User Research You Didn’t Do

You said you’re embedding platform engineers in the ML team now. That’s basically user research. And the fact that you’re doing it after the first build instead of before tells me everything about why v1 failed.

Here’s the question that would have changed your first attempt: Did anyone sit with an ML engineer and watch them deploy a model end-to-end before building the platform?

Not asking them what they want. Not showing them mockups. Actually watching them work. Seeing where they get frustrated. Understanding their mental model.

If you had, you probably would have discovered that “deployment” means something completely different to them. You would have seen them fight with SageMaker. You would have understood why they do things a certain way.

Instead, you built infrastructure and expected users to adapt their workflow to your tool. That never works. Never. The tool has to adapt to the workflow, not the other way around.

Champion Users Change Everything

You mentioned 60% adoption. I’m curious: who are the 60%? Are they evenly distributed across personas, or did you get one or two ML teams as champions and the rest are following?

In design systems, we learned that adoption is viral, not top-down. You need champions in each team. Power users who become advocates. They’re the ones who answer questions in Slack, help teammates get unstuck, and defend the system when someone wants to build a one-off solution.

For your platform, you need ML engineer champions. Not platform engineers explaining ML features. Actual ML engineers who use the platform daily and can vouch for it to their peers.

Did you identify champions during the co-design process? Did you give them special access, early features, a seat at the roadmap table? Because if you didn’t, you’re missing the key adoption driver.

The Documentation/Onboarding Gap

You said ML engineers ignored the platform. I’m going to make another bet: your documentation assumed too much knowledge.

I’ve seen this pattern so many times. Platforms built by engineers for engineers, with docs that assume you already understand the mental model. “To deploy, just run this kubectl command” or “Configure your model in the YAML file.”

For someone who lives in Jupyter notebooks and thinks in pandas dataframes, that’s gibberish. It’s a different language. And you’re asking them to learn your language instead of speaking theirs.

What does good onboarding look like for an ML engineer who’s never deployed anything? Not a README with commands. A step-by-step guide that starts from their context:

“You have a trained model in a .pkl file. Here’s how to get it serving predictions…”

Not: “Here’s how our platform works.” But: “Here’s how to do the thing you’re trying to do.”
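A minimal sketch of what that quickstart’s first step might look like in runnable form, starting from the user’s artifact rather than from platform concepts. The `ThresholdModel` class and file path are stand-ins I’ve invented for illustration; a real guide would start from whatever the team’s models actually pickle.

```python
# Hypothetical quickstart step: "You have a trained model in a .pkl
# file. Here's how to get it answering predictions." The model class
# here is a trivial stand-in for a real trained artifact.
import os
import pickle
import tempfile


class ThresholdModel:
    """Stand-in for a model a data scientist saved with pickle."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def predict(self, x: float) -> int:
        return 1 if x >= self.threshold else 0


# Step 0 (already done by the user): the model exists as a .pkl file.
path = os.path.join(tempfile.gettempdir(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(ThresholdModel(threshold=0.5), f)

# Step 1: load it back and wrap it in the predict function that the
# platform's inference endpoint would expose over HTTP.
with open(path, "rb") as f:
    model = pickle.load(f)


def predict(x: float) -> int:
    return model.predict(x)
```

Everything after this (containerizing the predict function, wiring it to an endpoint) is the platform’s job; the guide’s job is to meet the user at the `.pkl` file.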

That’s a UX problem, not a technical problem. And it requires actual user research to solve.

The Question I’d Ask

Before you built v1, did anyone on the platform team do user research with ML engineers? And I mean real user research:

  • Watching them work
  • Asking about pain points
  • Understanding their mental models
  • Validating assumptions with prototypes

Or did you build from requirements gathered in meetings?

Because “what do you need” (asked in a meeting) gets very different answers than “show me how you work” (observed in real time).

The $2M first attempt could have been a $200K discovery process if someone had done the research first.

That’s not a criticism—it’s just the pattern I see everywhere. We’re so eager to build that we skip the understanding phase. And then we’re surprised when nobody uses what we built.

What Would Success Look Like?

If I were designing the UX for this platform, here’s what I’d want:

For ML engineers: “I don’t need to think about Kubernetes. I just need to get my model into production.”

For app developers: “I don’t need to think about CI/CD. I just push code and it deploys.”

For data engineers: “I don’t need to think about orchestration. I define the pipeline and it runs.”

Three different interfaces. Same platform underneath. Each optimized for that persona’s mental model and workflow.

Is that what you’re building in v2? Because if you’re still trying to make everyone use the same interface, you’re going to hit the same adoption problem.

Luis, this is exactly the conversation we need to be having about platform engineering. Not the “we built a unified platform and got 80% adoption” success stories, but the honest truth about what happens when platform teams don’t include all personas from day one.

The Invisible Cost

You mentioned the $2M spent. But what about the opportunity cost of your ML team staying on the old workflow for 6+ months? What about the technical debt they accumulated working around your platform? What about the morale cost of building something nobody uses?

I’ve been tracking platform initiatives across our portfolio companies, and here’s the pattern: the technical costs are visible and bounded ($2M, 6 months, X engineers). The organizational costs are invisible and unbounded.

You lost credibility with the ML team. They watched you build something that didn’t work for them, declare victory, then come back asking for a do-over. How much trust did that cost? How many conversations start with “remember last time platform said this would work…”?

That’s the real cost. And it’s why I push teams to get it right the first time, even if it takes longer.

The Hiring Lens

Here’s the question I’d ask: did your platform team have ML expertise when you started v1?

Because I’d bet the answer is no. You had platform engineers who understood Kubernetes, CI/CD, infrastructure-as-code. Maybe some had deployed ML models before, but not as their primary job.

So you had a team of app-focused engineers trying to build for a persona they didn’t understand. And you got exactly what you’d expect: an app platform with ML features bolted on.

This is a hiring problem, not just a process problem. If you’re building a unified platform for N personas, your platform team needs deliberately hired expertise for the other N−1 personas, because one of them (usually app development) will be over-represented by default.

The Skills Question

You mentioned embedding platform engineers in the ML team. That’s great for understanding workflows. But did you also hire ML engineers onto the platform team?

Because there’s a difference between understanding ML workflows and having the expertise to build for them. I can shadow a surgeon for a month, but I still can’t perform surgery.

For your rebuild, did you:

  • Hire ML platform engineers with production ML experience?
  • Promote ML engineers into platform roles?
  • Borrow ML engineers from other teams as advisors?

If your platform team is still 100% app-focused engineers trying to serve ML workflows, you’re going to hit the same problem eventually. The expertise needs to live on the team, not just in the user research.

The Org Design Challenge

Michelle asked about preventing drift as the platform evolves. I think that’s an organizational design question, not a technical one.

Here’s the structure I’ve seen work:

  • Platform core team: Owns infrastructure, shared services, foundational capabilities
  • Platform advocates: Embedded in each major org (app teams, ML teams, data teams), 50% platform work, 50% embedded team work
  • Platform product council: Representatives from each persona, meets monthly to prioritize roadmap

The advocates are critical. They’re not full-time platform engineers. They’re senior engineers from each org who have a platform focus. They understand their team’s workflows deeply because they’re still doing the work. And they have enough platform context to translate needs into requirements.

Without that structure, your platform team becomes isolated from the users. And eventually, you build the wrong thing.

The Leadership Challenge

You asked how to get buy-in to start over. Here’s the harder question: how do you balance “build the right thing slowly” vs. “show ROI quickly”?

Because the pressure you’re feeling isn’t going away. VPs want to see results. CFOs want to justify the spend. Board members want to hear success stories.

The co-design approach is correct. But it requires executive patience for slower, messier progress. And that’s a leadership communication challenge.

What I’ve done in similar situations:

  1. Reframe metrics: Stop reporting “features shipped,” start reporting “adoption by persona” and “user satisfaction by team”

  2. Show the counterfactual: What would have happened if you kept going with v1? How many ML engineers would have left? What Shadow IT would have emerged?

  3. Celebrate small wins: 60% adoption across all personas is actually impressive. Make sure leadership understands that’s better than 90% app adoption and 10% ML adoption.

  4. Create feedback loops: Bring ML engineers to exec reviews. Let them tell leadership directly why the platform works now. That’s more powerful than metrics.

The Diversity Angle

Here’s something I haven’t seen mentioned yet: platform unification efforts often accidentally privilege the majority persona. And if you’re not careful, that has diversity implications.

If ML engineering has better gender/racial diversity than app engineering (which it often does, especially at research-focused companies), and your platform makes ML workflows harder, you’ve created a diversity problem disguised as a technical problem.

This isn’t hypothetical. I’ve seen companies lose diverse ML talent because the platform optimized for app workflows. The ML engineers felt like second-class citizens, saw limited career growth, and left.

Your co-design approach prevents this. But it’s worth being explicit: building for all personas isn’t just good product thinking, it’s good DEI strategy.

The Question I’d Ask Your Team

What skills did you add to the platform team for the rebuild? Show me the job descriptions. Show me the backgrounds of new hires.

Because if the platform team looks the same in v2 as it did in v1, you’re going to get the same result. Just slower.

The organizational structure has to change to support the technical architecture. Otherwise, you’re building a unified platform with a fragmented team. And that doesn’t work.