We just finished migrating 80% of our AI startup’s observability stack from vendor-locked solutions to OpenTelemetry, and I wanted to share the real story—not the conference talk version, but what actually happened.
Why we did this
Coming from Google Cloud AI, I’m used to world-class observability tooling. When I joined this startup, we were using a mix of Datadog, New Relic, and some custom Prometheus dashboards. The bills were adding up fast, and we had zero portability. If we wanted to switch vendors or bring some observability in-house, we’d be starting from scratch.
OpenTelemetry felt like the obvious answer: it’s vendor-neutral and community-driven, and recent surveys report that 89% of production users consider it critically important. The momentum behind it is real.
The AI workload challenge
Here’s where it gets interesting: OTel is great for traditional web services, but our workloads are mostly GPU-bound LLM inference. The standard OTel semantic conventions don’t cover:
- GPU memory utilization per model
- Token throughput rates
- Model loading and unloading events
- Inference latency bucketed by prompt length
- Batch efficiency metrics
We had to build custom exporters and define our own semantic conventions for ML-specific metrics. That’s not a knock on OTel—it’s just that the AI infrastructure use case is newer and the community hasn’t standardized these patterns yet.
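To make that concrete, here’s a minimal sketch of what our custom conventions look like. The attribute names (`llm.*`) and the `InferenceRecord` helper are our own invention for illustration, not a published OTel semantic convention; the point is deriving low-cardinality attributes (bucketed prompt length rather than raw token counts) and per-request token throughput before handing them to the metrics SDK.

```python
from dataclasses import dataclass

# Hypothetical attribute names -- our own convention, not an official OTel one.
ATTR_MODEL_NAME = "llm.model.name"
ATTR_GPU_MEM_USED = "llm.gpu.memory_used_bytes"
ATTR_PROMPT_BUCKET = "llm.prompt.length_bucket"

def prompt_length_bucket(n_tokens: int) -> str:
    """Bucket prompt length so latency histograms stay low-cardinality."""
    for limit, label in ((128, "0-128"), (512, "129-512"), (2048, "513-2048")):
        if n_tokens <= limit:
            return label
    return "2049+"

@dataclass
class InferenceRecord:
    """One completed inference request, ready to be turned into metrics."""
    model: str
    prompt_tokens: int
    output_tokens: int
    duration_s: float
    gpu_mem_bytes: int

    def token_throughput(self) -> float:
        """Output tokens per second for this request."""
        return self.output_tokens / self.duration_s if self.duration_s > 0 else 0.0

    def attributes(self) -> dict:
        """Attribute set attached to the latency/throughput instruments."""
        return {
            ATTR_MODEL_NAME: self.model,
            ATTR_GPU_MEM_USED: self.gpu_mem_bytes,
            ATTR_PROMPT_BUCKET: prompt_length_bucket(self.prompt_tokens),
        }
```

In practice these records feed a histogram (latency, keyed by prompt bucket) and a counter (tokens), with GPU memory reported per model via an observable gauge.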
Migration approach: dual stack for 3 months
We didn’t rip and replace. We ran both legacy and OTel instrumentation side-by-side for three months:
- Month 1: OTel in shadow mode, comparing data accuracy
- Month 2: Started using OTel dashboards alongside legacy ones
- Month 3: Gradually shifted alerts from legacy to OTel-based metrics
This cautious approach saved us. We caught several subtle differences in how metrics were aggregated, and having the legacy system as ground truth was crucial.
Performance impact
The OTel collector added about 50ms to our p99 latency initially. For LLM inference, where we’re already at 2-3 seconds per request, 50ms is acceptable. But for low-latency services, this would be a problem.
We tuned it down to ~10ms p99 by:
- Batching telemetry data more aggressively
- Running collectors as sidecars instead of a central gateway
- Being selective about what we trace (sampling 10% of requests)
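The sampling piece is worth spelling out. The key property is that the decision is deterministic per trace ID, so every span in a trace is kept or dropped together. Here’s a plain-Python sketch of that idea; it mirrors the spirit of the SDK’s trace-ID-ratio sampling rather than reproducing any SDK’s exact implementation, and the names are ours:

```python
# Head-based 10% sampling: compare the low 64 bits of the trace ID
# against a fixed threshold. Same trace ID -> same answer, always.
SAMPLE_RATIO = 0.10
MAX_TRACE_ID = 2 ** 64
THRESHOLD = int(SAMPLE_RATIO * MAX_TRACE_ID)

def should_sample(trace_id: int) -> bool:
    """Deterministic per-trace decision, so a trace is never half-sampled."""
    return (trace_id & (MAX_TRACE_ID - 1)) < THRESHOLD
```

Because trace IDs are (effectively) uniformly random, roughly 10% of traces clear the threshold. The batching side was simpler: we just tuned the collector’s batch processor to send larger, less frequent payloads.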
Cost savings vs effort
Real talk: the migration consumed about 1.5 engineer-years of effort. Two platform engineers worked on it mostly full-time for 9 months. Was it worth it?
From a pure cost perspective, we’re saving about $4K/month on vendor observability bills. That’s a 3-4 year payback period, which isn’t amazing. The real value is strategic:
- We’re no longer locked into vendor pricing changes
- We can route different data to different backends (logs to Loki, metrics to Mimir, traces to Tempo)
- We own our observability roadmap
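The backend routing in the second bullet is just collector pipeline configuration. A rough sketch of what ours looks like (endpoints are placeholders, and exporter availability varies by collector-contrib version; newer Loki releases can also ingest OTLP directly):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  loki:
    endpoint: http://loki.internal:3100/loki/api/v1/push
  prometheusremotewrite:
    endpoint: http://mimir.internal:9009/api/v1/push
  otlp/tempo:
    endpoint: tempo.internal:4317

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```

Swapping a backend means changing one exporter and one pipeline entry, with zero changes to application instrumentation. That’s the portability we were paying vendors to not have.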
Where we are now
80% migrated, and the last 20% is the hardest part: legacy integrations, third-party services that don’t speak OTel, and some gaps in the mobile SDKs. We’ll probably run a hybrid stack for another 6 months.
My controversial take
OpenTelemetry is becoming the industry standard, and that’s a good thing. But it’s not a panacea. For early-stage startups with limited engineering resources, vendor solutions might still make more sense. The flexibility OTel gives you isn’t free—you’re trading vendor lock-in for operational complexity.
For AI workloads specifically, we’re in a weird in-between state where OTel is powerful but the ecosystem hasn’t caught up with our needs. I’m hopeful that as more ML teams adopt it, we’ll see better tooling and standardization.
Questions for the community
Has anyone else migrated to OTel for AI/ML workloads? How did you handle model-specific metrics? And for those still on vendor solutions—what would it take for you to consider OTel?