Six months into our OpenTelemetry migration, I have mixed feelings. We solved the vendor lock-in problem. We created new problems I wasn’t expecting.
## What We Gained
The benefits are real:
- Vendor flexibility: We run parallel evaluations of backends without touching application code
- Unified data model: One mental model across traces, metrics, logs
- Community momentum: Stack Overflow answers, library support, tooling ecosystem
- Cost leverage: Negotiated 25% reduction on our observability contract
## What We Didn’t Expect
### 1. Configuration Complexity Explosion
Our OTel Collector config went from “simple YAML” to this:
```yaml
# collector-config.yaml - 847 lines and growing
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 8
        max_concurrent_streams: 100
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          # ... 50 more lines

processors:
  batch:
    send_batch_size: 10000
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 4096
    spike_limit_mib: 1024
  attributes:
    actions:
      # ... 30 attribute transformations
  filter:
    spans:
      exclude:
        match_type: regexp
        # ... 20 exclusion patterns
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      # ... 8 sampling policies

exporters:
  # ... 6 different exporters with unique configs

service:
  pipelines:
    # ... traces, metrics, logs each with unique routing
```
This config is now a critical piece of infrastructure that requires its own versioning, testing, and rollback procedures.
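One practice that would have saved us pain earlier: gating config changes in CI before they reach the running collector. The collector binary ships a `validate` subcommand that checks a config without starting any pipelines. A sketch of what that could look like as a GitHub Actions workflow (the workflow name, file paths, and pinned image tag are illustrative, not ours):

```yaml
# .github/workflows/collector-config.yaml (sketch)
# Fail the PR if the collector config doesn't parse/validate.
name: validate-collector-config
on:
  pull_request:
    paths: ["collector-config.yaml"]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate config without starting pipelines
        run: |
          docker run --rm -v "$PWD:/cfg" \
            otel/opentelemetry-collector-contrib:0.96.0 \
            validate --config=/cfg/collector-config.yaml
```

This catches syntax and wiring mistakes, but not semantic ones (a sampling policy that drops the wrong traces will still validate cleanly), so it complements rollback procedures rather than replacing them.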
### 2. The “OTel Expert” Bottleneck
We have 25 engineers. Two of them understand OTel deeply. Everyone else treats the collector as a black box.
When something breaks:
```
Engineer: "My traces aren't showing up"
OTel Expert: "Let me check the collector"
[2 hours later]
OTel Expert: "Your service is using W3C context, but the downstream
              service expects B3. I'll add a processor."
```
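For what it’s worth, mismatches like this can often be fixed at the SDK level rather than with a collector processor: the OpenTelemetry spec defines a standard `OTEL_PROPAGATORS` environment variable, and listing multiple propagators makes the SDK emit and accept all of them. A sketch as a Kubernetes Deployment fragment (the container name is hypothetical):

```yaml
# Fragment of a hypothetical Deployment for the upstream service.
# With both W3C tracecontext and B3 listed, the SDK injects and
# extracts both header formats, so mixed services interoperate.
spec:
  template:
    spec:
      containers:
        - name: service-x
          env:
            - name: OTEL_PROPAGATORS
              value: "tracecontext,baggage,b3multi"
```

The catch, of course, is that someone still has to know this knob exists, which is exactly the bottleneck.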
### 3. Version Drift Chaos
OTel SDKs evolve quickly. We have:
| Language | SDK Version | Last Updated |
|---|---|---|
| Java | 1.32.0 | Last week |
| Python | 1.21.0 | 3 months ago |
| Go | 1.20.0 | 6 months ago |
| Node | 1.18.0 | 4 months ago |
Result: Subtle incompatibilities, context propagation bugs, inconsistent attribute naming.
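One way to keep SDK versions moving in lockstep is to group the OTel packages in automated dependency updates so every service gets bumped together. A sketch using Dependabot’s `groups` feature (ecosystems and directories here are illustrative; adapt per repo):

```yaml
# .github/dependabot.yml (sketch) - bump all OTel packages together,
# so SDK versions move in lockstep instead of drifting per service.
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      opentelemetry:
        patterns: ["opentelemetry-*"]
  - package-ecosystem: "gomod"
    directory: "/"
    schedule:
      interval: "weekly"
    groups:
      opentelemetry:
        patterns: ["go.opentelemetry.io/*"]
```

This doesn’t guarantee identical versions across languages (the SDKs release on their own cadences), but it keeps each service current and shrinks the drift window.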
### 4. Debugging the Debugger
When observability itself fails, you’re flying blind:
```
"Why are traces missing from Service X?"
├── Is the app instrumented correctly?
├── Is the SDK exporting?
├── Is the collector receiving?
├── Is the collector processing correctly?
├── Is the exporter sending?
└── Is the backend ingesting?
```
Debugging this requires… different observability tools. Meta, right?
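The collector does ship some of this meta-observability itself: internal logs and metrics under `service.telemetry`, and the zPages extension for live pipeline inspection. A sketch of the relevant config (field names as I understand them from the collector docs; verify against your collector version, since the telemetry section has evolved):

```yaml
# Collector self-observability: verbose internal logs, internal metrics,
# and zPages for live inspection. Field names may vary by version.
extensions:
  zpages:
    endpoint: 0.0.0.0:55679   # serves /debug/tracez, /debug/pipelinez

service:
  extensions: [zpages]
  telemetry:
    logs:
      level: debug            # loud, but invaluable during an incident
    metrics:
      level: detailed
      address: 0.0.0.0:8888   # Prometheus-scrapeable internal metrics
```

It answers the middle of the tree above (receiving, processing, exporting); the SDK and backend ends still need their own checks.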
## My Honest Assessment
Would I do it again? Yes, but with different expectations.
What I’d do differently:
- Hire or designate a full-time OTel owner from day one
- Invest in collector observability (yes, meta-observability)
- Enforce SDK version consistency across all services
- Build internal training before migration, not during
- Accept that “simple” isn’t the goal—“standard” is
## The Bottom Line
OTel trades one type of complexity (vendor lock-in) for another (operational overhead). For us, that’s the right trade. But don’t let anyone tell you it’s simpler. It’s not. It’s better, but not simpler.