Is OpenTelemetry ready for production AI workloads? Our migration experience

We just finished migrating 80% of our AI startup’s observability stack from vendor-locked solutions to OpenTelemetry, and I wanted to share the real story—not the conference talk version, but what actually happened.

Why we did this

Coming from Google Cloud AI, I’m used to world-class observability tooling. When I joined this startup, we were using a mix of Datadog, New Relic, and some custom Prometheus dashboards. The bills were adding up fast, and we had zero portability. If we wanted to switch vendors or bring some observability in-house, we’d be starting from scratch.

OpenTelemetry felt like the obvious answer: vendor-neutral and community-driven, with recent surveys reporting that 89% of production users consider it important to their observability strategy. The momentum behind it is real.

The AI workload challenge

Here’s where it gets interesting: OTel is great for traditional web services, but our workloads are mostly GPU-bound LLM inference. The standard OTel semantic conventions don’t cover:

  • GPU memory utilization per model
  • Token throughput rates
  • Model loading and unloading events
  • Inference latency bucketed by prompt length
  • Batch efficiency metrics

We had to build custom exporters and define our own semantic conventions for ML-specific metrics. That’s not a knock on OTel—it’s just that the AI infrastructure use case is newer and the community hasn’t standardized these patterns yet.
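To make "define our own semantic conventions" concrete, here's the general shape of what that looks like. The attribute names and the small recorder class below are illustrative only, not official OTel conventions or the real metrics SDK; it's a stdlib sketch of the pattern of tagging inference measurements with model name and a prompt-length bucket:

```python
from dataclasses import dataclass

# Hypothetical attribute keys, modeled on OTel's dotted naming style.
# These are NOT standardized conventions; they illustrate the idea.
ATTR_MODEL_NAME = "ml.model.name"
ATTR_PROMPT_BUCKET = "ml.inference.prompt_length_bucket"

# Bucket boundaries in prompt tokens: (inclusive low, exclusive high, label).
PROMPT_BUCKETS = [(0, 256, "short"), (256, 2048, "medium"), (2048, float("inf"), "long")]

def prompt_bucket(prompt_tokens: int) -> str:
    """Map a prompt length to a bucket label so latency percentiles are comparable."""
    for lo, hi, label in PROMPT_BUCKETS:
        if lo <= prompt_tokens < hi:
            return label
    return "unknown"

@dataclass
class InferenceMeasurement:
    """One inference request's worth of telemetry, pre-aggregation."""
    model: str
    prompt_tokens: int
    output_tokens: int
    latency_s: float

    @property
    def tokens_per_second(self) -> float:
        # Token throughput: output tokens over wall-clock latency.
        return self.output_tokens / self.latency_s

    def attributes(self) -> dict:
        # The attribute set you'd attach to a metric data point or span.
        return {
            ATTR_MODEL_NAME: self.model,
            ATTR_PROMPT_BUCKET: prompt_bucket(self.prompt_tokens),
        }
```

Bucketing by prompt length matters because a single "inference latency" histogram mixes 50-token and 8,000-token prompts, which makes the p99 meaningless; once the bucket is an attribute, each bucket gets its own distribution.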

Migration approach: dual stack for 3 months

We didn’t rip and replace. We ran both legacy and OTel instrumentation side-by-side for three months:

  • Month 1: OTel in shadow mode, comparing data accuracy
  • Month 2: Started using OTel dashboards alongside legacy ones
  • Month 3: Gradually shifted alerts from legacy to OTel-based metrics

This cautious approach saved us. We caught several subtle differences in how metrics were aggregated, and having the legacy system as ground truth was crucial.

Performance impact

The OTel collector added about 50ms to our p99 latency initially. For LLM inference where we’re already at 2-3 seconds per request, 50ms is acceptable. But for low-latency services, this would be a problem.

We tuned it down to ~10ms p99 by:

  • Batching telemetry data more aggressively
  • Running collectors as sidecars instead of a central gateway
  • Being selective about what we trace (sampling 10% of requests)
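Of these, sampling was the biggest single win. For anyone curious what head-based 10% sampling looks like, here's a minimal stdlib sketch of the idea behind the SDK's trace-ID-ratio samplers (a simplified illustration, not the actual SDK code): keying the decision on the trace ID, rather than rolling a die per span, keeps the decision consistent across every service a request touches, so you never end up with half a trace.

```python
import random

def should_sample(trace_id: int, ratio: float = 0.10) -> bool:
    """Deterministic head sampling: the same trace ID always gets the
    same keep/drop decision, so all spans of one request stay together."""
    mask = (1 << 64) - 1            # use the low 64 bits of the trace ID
    bound = int(ratio * (1 << 64))  # keep IDs that fall below the bound
    return (trace_id & mask) < bound
```

In practice you'd configure this on the SDK or collector rather than hand-roll it; the sketch just shows why the decision is stable end to end.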

Cost savings vs effort

Real talk: the migration consumed about 1.5 engineer-years of effort. Two platform engineers worked on it mostly full-time for 9 months. Was it worth it?

From a pure cost perspective, we’re saving about $4K/month on vendor observability bills. That’s a 3-4 year payback period, which isn’t amazing. The real value is strategic:

  • We’re no longer locked into vendor pricing changes
  • We can route different data to different backends (logs to Loki, metrics to Mimir, traces to Tempo)
  • We own our observability roadmap

Where we are now

80% migrated, with the last 20% being the hardest. Legacy integrations, third-party services that don’t speak OTel, and some mobile SDK gaps. We’ll probably run a hybrid stack for another 6 months.

My controversial take

OpenTelemetry is becoming the industry standard, and that’s a good thing. But it’s not a panacea. For early-stage startups with limited engineering resources, vendor solutions might still make more sense. The flexibility OTel gives you isn’t free—you’re trading vendor lock-in for operational complexity.

For AI workloads specifically, we’re in a weird in-between state where OTel is powerful but the ecosystem hasn’t caught up with our needs. I’m hopeful that as more ML teams adopt it, we’ll see better tooling and standardization.

Questions for the community

Has anyone else migrated to OTel for AI/ML workloads? How did you handle model-specific metrics? And for those still on vendor solutions—what would it take for you to consider OTel?

This is a great breakdown of the real-world trade-offs. As someone who lives on the product/business side, these infrastructure decisions always come down to: what’s the ROI and what are we NOT building because resources are allocated here?

The cost-benefit question

1.5 engineer-years of effort to save $4K/month = 3-4 year payback. From a pure financial perspective, that’s… not great. At a Series B SaaS company, I’d have a hard time justifying that to our CFO, especially when we could have spent that same engineering time building revenue-generating features.

However—and this is important—your strategic points are compelling:

Avoiding vendor lock-in = negotiating power

I’ve been on the other side of vendor renewal discussions. When you’re locked in, they know it, and pricing “adjustments” start appearing. The flexibility to credibly say “we could move to another backend” is worth something, even if you never do it.

At my previous company (Airbnb), we got locked into a monitoring vendor and the renewal came back at 3x the original price. We had no leverage because migration would have taken 18 months. We paid.

The question I’d ask as VP Product

If you had those 1.5 engineer-years back, what would you have built instead? What product features or optimizations didn’t happen because of this migration?

I’m not saying it was the wrong call—strategic infrastructure decisions often have non-obvious ROI. But in a resource-constrained startup environment, the opportunity cost is real.

When does this make sense?

From where I sit, OTel migration makes sense when:

  1. You’re at scale where vendor costs are growing faster than revenue
  2. You have platform engineering capacity (not pulling from product teams)
  3. You’re past product-market fit and can afford long-term infrastructure bets
  4. Vendor lock-in is actively limiting your roadmap

For early-stage startups still finding PMF? I’d probably stick with vendor solutions and defer this complexity.

How do you explain this to non-technical stakeholders?

Genuinely curious: when you proposed this migration to leadership/investors, how did you frame it? “We’re going to spend 9 months and $200K in engineering time to eventually save $4K/month” is a hard pitch.

Did you emphasize the strategic flexibility? The risk mitigation of vendor lock-in? Or was there a technical reliability argument I’m missing?

Not challenging the decision—I think long-term infrastructure investments are important. Just trying to understand how to make the business case, since I often have to bridge the gap between engineering priorities and executive understanding.

Your experience resonates with what I’m seeing in financial services—we evaluated OpenTelemetry about a year ago and ultimately decided to stick with Datadog. Different constraints, but similar decision-making process.

Enterprise vs startup trade-offs

In a regulated environment like banking, the calculus is different. Stability and compliance trump flexibility. A few considerations that kept us on vendor solutions:

1. SLAs and support

When our observability goes down at 3AM, I need a vendor support team I can escalate to. With a self-managed open-source OTel Collector pipeline, who do you call? How do you guarantee uptime?

We pay a premium for Datadog, but we also have contractual SLAs, 24/7 support, and someone to blame if things go wrong. In financial services, that accountability matters—especially when regulators come asking questions.

2. Compliance and audit trails

Banking regulations require us to maintain complete audit trails with guaranteed retention. When I asked about OTel, my compliance team’s first question was: “Who guarantees the data won’t be lost?”

With vendor solutions, we have:

  • Contractual data retention SLAs
  • SOC 2 / ISO 27001 certifications we can point to
  • Audit logs of who accessed what data
  • Legal agreements about data sovereignty

Could we build all this on top of OTel? Probably. But that’s more engineering time, and every custom component is another thing to audit and maintain.

3. Vendor lock-in is painful, but…

I completely agree that vendor lock-in is a problem. We’re paying more than we’d like, and every renewal negotiation is tense. But here’s the thing: in enterprise, switching costs are always high.

Even with OTel, you’re “locked in” to your custom dashboards, your alerting logic, your team’s learned knowledge of the system. Migration is painful regardless.

The question for us became: is OTel’s vendor portability worth the operational complexity of managing it ourselves? For a platform team of 8 people supporting 200+ engineers, the answer was no.

Interested but cautious

That said, I’m watching the OTel ecosystem closely. A few things that would change my mind:

  1. Enterprise support vendors - If someone offers fully-managed OTel with SLAs (basically OTel-as-a-Service), that could be interesting
  2. Better compliance tooling - Pre-built solutions for audit trails, retention policies, access controls
  3. Stability guarantees - Semantic conventions locked down, no breaking changes in core protocols

A question for you

How do you handle incidents when the observability system itself is degraded? With vendor solutions, I can at least check their status page and know if it’s my problem or theirs.

With self-hosted OTel, you need observability for your observability—which adds another layer of complexity.

Strategic value of open standards

One thing I do appreciate: even though we’re on Datadog, the fact that OTel is becoming standard means we’re less locked in than we used to be. Datadog supports OTel ingestion now precisely because the community pressure is real.

So even if we don’t run OTel ourselves, the existence of the standard gives us leverage. That’s valuable.

Thanks for sharing the detailed experience—these kinds of honest discussions about trade-offs are exactly what I need when making infrastructure decisions.

Oh, this hits close to home. You mentioned “mobile SDK gaps” in your last 20%—that’s exactly where we’re stuck.

OTel mobile SDKs lag way behind backend

I’ve been trying to get OpenTelemetry working for our React Native app at Uber, and honestly, it’s been frustrating. The backend story is solid, but mobile feels like an afterthought.

Specific gaps we’ve hit:

1. Offline trace buffering

Mobile apps lose connectivity constantly—subway tunnels, airplane mode, poor network areas. We need to buffer traces locally and upload when connectivity returns.

The OTel mobile SDKs have basic support for this, but it's nowhere near as sophisticated as vendor SDKs like Firebase Performance or Datadog's mobile SDK, which handle:

  • Intelligent batching based on battery state
  • Compression before upload
  • Retry logic with exponential backoff
  • Storage limits so you don’t fill up the device

OTel’s mobile story is “well, you could build that yourself.” Not great when you’re trying to instrument hundreds of app screens.
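And "build that yourself" is a nontrivial amount of code. Here's a rough stdlib sketch of just the buffering core (the class and its interface are hypothetical, not any real OTel API): a bounded drop-oldest queue plus capped exponential backoff, before you even get to battery-aware batching or compression.

```python
import collections

class OfflineSpanBuffer:
    """Sketch of what a mobile SDK would need for offline telemetry
    (illustrative only): bounded storage and capped exponential backoff."""

    def __init__(self, max_spans: int = 1000,
                 base_delay_s: float = 1.0, max_delay_s: float = 300.0):
        # deque(maxlen=...) silently drops the oldest span when full,
        # so telemetry can never fill up the device.
        self.buffer = collections.deque(maxlen=max_spans)
        self.base_delay_s = base_delay_s
        self.max_delay_s = max_delay_s
        self.failures = 0  # consecutive failed upload attempts

    def enqueue(self, span) -> None:
        self.buffer.append(span)

    def flush(self, upload) -> float:
        """Try to upload everything buffered. On failure, keep the spans
        and return how long to wait before retrying (0.0 on success).
        A production version would also add jitter to the delay."""
        try:
            upload(list(self.buffer))
        except OSError:
            delay = min(self.base_delay_s * (2 ** self.failures), self.max_delay_s)
            self.failures += 1
            return delay
        self.buffer.clear()
        self.failures = 0
        return 0.0
```

Vendor SDKs ship all of this tuned and battle-tested; with OTel on mobile today, this layer is on you.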

2. Battery impact metrics

On mobile, observability itself has a cost: battery drain. Users will uninstall your app if it kills their battery.

Vendor SDKs report their own battery impact. OTel doesn't have standardized conventions for this, and without them, how do you know if your instrumentation is causing more problems than it solves?

3. Client-side vs backend correlation

Your post mentioned the challenge of correlating GPU metrics with standard OTel metrics. We have a similar problem: correlating client-side mobile traces with backend traces.

Did you implement any mobile instrumentation? If so, how do you handle:

  • Mapping device IDs to backend user IDs (privacy concerns)
  • Understanding which backend errors are actually impacting mobile users
  • Correlating mobile crashes with backend state at the time
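At least one piece of the client/backend correlation problem has a standard answer: if the mobile app starts the trace and sends a W3C Trace Context `traceparent` header on its API calls, backend OTel SDKs will join the same trace automatically. The header format itself is simple (sketch below; the function name is just for illustration):

```python
def make_traceparent(trace_id: int, span_id: int, sampled: bool) -> str:
    """Build a W3C Trace Context traceparent header:
    version "00", 128-bit trace ID, 64-bit parent span ID, flags byte."""
    flags = "01" if sampled else "00"  # bit 0 = sampled
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"
```

The harder parts you list (device-to-user ID mapping, crash correlation) don't have a spec to lean on, but trace propagation at least shouldn't require vendor SDKs.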

The pragmatic mobile approach

For now, we’re running a hybrid:

  • Backend services: OpenTelemetry all the way
  • Mobile apps: Stuck with vendor SDKs (Firebase for crashes, custom instrumentation for performance)

It’s not ideal—we lose the unified observability story. But the mobile OTel ecosystem just isn’t mature enough yet.

What would change my mind

I’d migrate mobile to OTel if:

  1. Offline buffering became first-class, not an afterthought
  2. Battery impact monitoring was built-in
  3. Crash reporting reached feature parity with Firebase/Crashlytics
  4. React Native/Flutter SDKs were as mature as native iOS/Android

A question about your migration

You mentioned saving $4K/month—does that include mobile observability costs, or are you paying separately for mobile instrumentation with vendor SDKs?

At our scale (millions of mobile users), the mobile observability costs are actually higher than backend. Would love to bring that into the OTel ecosystem, but not at the cost of functionality.

Appreciate you calling out that OTel isn’t a panacea. The backend story is compelling, but mobile teams are in a tough spot.

Really appreciate this detailed writeup—this is exactly the kind of real-world perspective I need as we’re scaling our EdTech platform.

Strong support for strategic infrastructure investment

From a leadership perspective, I actually think you made the right call, even with the 3-4 year payback period. Here’s why:

1. Engineering roadmap ownership

The ability to “own your observability roadmap” is huge. I’ve been in situations where vendor limitations directly blocked product features we wanted to build. The flexibility to customize and extend is worth more than the immediate cost savings.

At my previous company, we wanted to implement custom SLO tracking across our platform, but our monitoring vendor charged enterprise-tier pricing for that feature. We ended up building a hacky workaround. If we’d owned the infrastructure, it would have been straightforward.

2. Dedicated platform team approach

One thing that stands out: you had two platform engineers working on this mostly full-time. That’s the right approach. Where companies fail is trying to do infrastructure migrations with distributed 20% time from product engineers—that never works.

Question about staffing

How did you structure this? Did you:

  • Pull two engineers off product work (and communicate the tradeoff to stakeholders)?
  • Hire specifically for platform/infra roles?
  • Mix of dedicated + rotational involvement?

I’m currently making the case for a dedicated platform team, and these kinds of examples help me explain the value to our executive team.

The EdTech angle

We’re in an interesting position. Our observability needs are growing rapidly (just crossed 1M students using the platform), but we’re still earlier-stage than your typical tech company.

The vendor bills are starting to hurt, but I’m not sure we have the platform engineering capacity yet to self-manage OTel infrastructure. Your experience suggests we might want to wait another year, scale up the eng team, then migrate.

Creating a rollback plan

One thing I learned from previous migrations: always have a clear rollback plan and communicate the timeline upfront.

Did you set explicit milestones with your exec team? Like:

  • “After 3 months, we’ll evaluate if this is working”
  • “We expect full migration by month 9”
  • “If we hit blockers X/Y/Z, we’ll reconsider”

That kind of clear communication helps manage expectations and prevents the “why have you been working on infrastructure for a year?” questions from leadership who’ve forgotten the original plan.

Advice for others considering this

Based on your experience and my own migrations, I’d say:

  1. Don’t start this until you have dedicated platform engineering capacity
  2. Run dual stack longer than you think—3 months minimum
  3. Build the business case around strategic flexibility, not just cost savings
  4. Communicate timelines clearly to non-technical stakeholders
  5. Accept that hybrid state is okay—you don’t have to be 100% OTel

Thanks for sharing both the wins and the challenges. This community needs more honest discussions about the trade-offs of infrastructure decisions.