OpenTelemetry Saved Us From Vendor Lock-In - But Created New Operational Burden

Six months into our OpenTelemetry migration, I have mixed feelings. We solved the vendor lock-in problem. We created new problems I wasn’t expecting.

What We Gained

The benefits are real:

  • Vendor flexibility: We run parallel evaluations of backends without touching application code
  • Unified data model: One mental model across traces, metrics, logs
  • Community momentum: Stack Overflow answers, library support, tooling ecosystem
  • Cost leverage: Negotiated a 25% reduction on our observability contract

What We Didn’t Expect

1. Configuration Complexity Explosion

Our OTel Collector config went from “simple YAML” to this:

# collector-config.yaml - 847 lines and growing

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 8
        max_concurrent_streams: 100
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          # ... 50 more lines

processors:
  batch:
    send_batch_size: 10000
    timeout: 10s
  memory_limiter:
    check_interval: 1s
    limit_mib: 4096
    spike_limit_mib: 1024
  attributes:
    actions:
      # ... 30 attribute transformations
  filter:
    spans:
      exclude:
        match_type: regexp
        # ... 20 exclusion patterns
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      # ... 8 sampling policies

exporters:
  # ... 6 different exporters with unique configs

service:
  pipelines:
    # ... traces, metrics, logs each with unique routing

This config is now a critical piece of infrastructure that requires its own versioning, testing, and rollback procedures.
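Treating the config as infrastructure means testing it like code. Here's a minimal pre-deploy sanity check, sketched in Python: every component a pipeline references must actually be defined. The dict stands in for the parsed YAML, and the component names are illustrative (the collector binary also has a validate subcommand for basic checks, which this complements).

```python
# Minimal lint for a collector config: every component referenced by a
# service pipeline must be defined in its top-level section.

def undefined_components(config):
    """Return (pipeline, kind, name) triples for references with no definition."""
    problems = []
    for pipeline, spec in config.get("service", {}).get("pipelines", {}).items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind, {})
            for name in spec.get(kind, []):
                if name not in defined:
                    problems.append((pipeline, kind, name))
    return problems

# Illustrative config as a parsed dict; in CI you would load the real YAML.
config = {
    "receivers": {"otlp": {}},
    "processors": {"batch": {}, "memory_limiter": {}},
    "exporters": {"otlphttp": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlphttp", "logging"],  # "logging" is not defined
            }
        }
    },
}

print(undefined_components(config))  # → [('traces', 'exporters', 'logging')]
```

A check this small catches the most common class of rollback-inducing mistakes: renaming a component in one place but not the other.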

2. The “OTel Expert” Bottleneck

We have 25 engineers. Two of them understand OTel deeply. Everyone else treats the collector as a black box.

When something breaks:

Engineer: "My traces aren't showing up"
OTel Expert: "Let me check the collector"
[2 hours later]
OTel Expert: "Your service is using W3C context, but the downstream 
             service expects B3. I'll add a processor."

3. Version Drift Chaos

OTel SDKs evolve quickly. We have:

Language   SDK Version   Last Updated
Java       1.32.0        Last week
Python     1.21.0        3 months ago
Go         1.20.0        6 months ago
Node       1.18.0        4 months ago

Result: Subtle incompatibilities, context propagation bugs, inconsistent attribute naming.
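A drift check is cheap to automate. This sketch flags languages lagging the newest SDK by more than a chosen number of minor versions; the versions come from the table above, and the two-minor-version threshold is our own policy, not anything OTel prescribes.

```python
def parse(version):
    """'1.32.0' -> (1, 32, 0) for comparison."""
    return tuple(int(part) for part in version.split("."))

def lagging(versions, max_minor_lag=2):
    """Languages more than max_minor_lag minor versions behind the newest SDK."""
    newest = max(parse(v) for v in versions.values())
    return sorted(
        lang for lang, v in versions.items()
        if newest[1] - parse(v)[1] > max_minor_lag
    )

sdk_versions = {"java": "1.32.0", "python": "1.21.0", "go": "1.20.0", "node": "1.18.0"}
print(lagging(sdk_versions))  # → ['go', 'node', 'python']
```

Run it in CI against each service's dependency manifest and the drift table stops being a surprise.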

4. Debugging the Debugger

When observability itself fails, you’re flying blind:

"Why are traces missing from Service X?"
├── Is the app instrumented correctly?
├── Is the SDK exporting?
├── Is the collector receiving?
├── Is the collector processing correctly?
├── Is the exporter sending?
└── Is the backend ingesting?

Debugging this requires… different observability tools. Meta, right?
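The checklist above is effectively an ordered triage, and encoding it as one helps non-experts. In this sketch each stage maps to some concrete signal (SDK debug logs, the collector's own accepted/refused counters, backend ingest metrics); the stage names and health dict are our own convention, not an OTel API.

```python
# Ordered pipeline stages, upstream to downstream, mirroring the checklist.
PIPELINE_STAGES = [
    "app_instrumented",
    "sdk_exporting",
    "collector_receiving",
    "collector_processing",
    "exporter_sending",
    "backend_ingesting",
]

def first_broken_stage(health):
    """health maps stage name -> bool; return the first failing stage, or None."""
    for stage in PIPELINE_STAGES:
        if not health.get(stage, False):
            return stage
    return None

health = {stage: True for stage in PIPELINE_STAGES}
health["collector_processing"] = False
print(first_broken_stage(health))  # → 'collector_processing'
```

The value isn't the ten lines of code; it's that "traces are missing" becomes "check these six signals in this order," which anyone on the team can follow.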

My Honest Assessment

Would I do it again? Yes, but with different expectations.

What I’d do differently:

  1. Hire or designate a full-time OTel owner from day one
  2. Invest in collector observability (yes, meta-observability)
  3. Enforce SDK version consistency across all services
  4. Build internal training before migration, not during
  5. Accept that “simple” isn’t the goal—“standard” is

The Bottom Line

OTel trades one type of complexity (vendor lock-in) for another (operational overhead). For us, that’s the right trade. But don’t let anyone tell you it’s simpler. It’s not. It’s better, but not simpler.

Staffing and Training: The Real Investment

Alex, your “OTel Expert Bottleneck” point hits close to home. Here’s how we addressed it—imperfectly, but better than before.

The Staffing Model That Works

We moved from “2 experts, 23 users” to a tiered model:

Tier 1: OTel Platform Team (2 engineers)
├── Owns collector infrastructure
├── Defines standards and conventions  
├── Handles complex debugging
└── Maintains internal documentation

Tier 2: OTel Champions (6 engineers, 1 per team)
├── First responders for team issues
├── Reviews instrumentation PRs
├── Attends monthly OTel sync
└── Escalates to Platform Team when stuck

Tier 3: All Engineers (everyone else)
├── Follows instrumentation patterns
├── Uses standard libraries
└── Opens tickets for help

The Training Investment

We underestimated training. Here’s what we eventually built:

Training              Duration   Audience        Content
OTel 101              2 hours    All engineers   Concepts, basic instrumentation
OTel SDK Deep Dive    4 hours    Champions       Language-specific details
Collector Operations  8 hours    Platform Team   Config, debugging, scaling
Quarterly Refresher   1 hour     Champions       Updates, lessons learned

The Champion Rotation Challenge

Champions burn out if it’s a permanent role. We rotate every 6 months:

Pros: Spreads knowledge, prevents single points of failure
Cons: Ramp-up time, inconsistent expertise levels

Cost Reality

The staffing investment for a 25-engineer team:

Platform Team: 2 engineers × 100% = 2 FTE
Champions: 6 engineers × 10% = 0.6 FTE
Training development: 0.25 FTE for 6 months

Total Year 1: ~2.75 FTE dedicated to OTel

That’s significant. But compare it to the alternative: every engineer figuring it out independently, duplicating effort, making inconsistent decisions.

My Advice

Budget for the staffing model before you start migration. If leadership won’t fund proper staffing, reconsider whether you’re ready for OTel adoption at scale.

Data Quality and Schema Management: The Hidden Battle

Alex, your attribute explosion point resonates. From a data perspective, OTel creates as many problems as it solves if you don’t treat telemetry as a data product.

The Schema Problem

OTel provides semantic conventions, but they’re guidelines, not enforcement:

# Semantic convention says:
span.set_attribute("http.request.method", "GET")

# What teams actually do:
span.set_attribute("http.method", "GET")        # Old convention
span.set_attribute("request_method", "GET")     # Custom
span.set_attribute("HTTP_METHOD", "get")        # Case mismatch
span.set_attribute("method", "GET")             # Too generic

Now try to build a dashboard that works across all services. Good luck.
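One mitigation is to normalize names before data leaves the service (or in a collector processor). A sketch, using the aliases from the example above; the alias table is ours, not part of OTel:

```python
# Map known attribute-name aliases onto the current semantic convention.
ALIASES = {
    "http.method": "http.request.method",
    "request_method": "http.request.method",
    "HTTP_METHOD": "http.request.method",
    "method": "http.request.method",
}

def normalize(attributes):
    """Rewrite aliased keys to canonical names and fix obvious value casing."""
    out = {}
    for key, value in attributes.items():
        canonical = ALIASES.get(key, key)
        if canonical == "http.request.method" and isinstance(value, str):
            value = value.upper()  # fix case mismatches like "get"
        out[canonical] = value
    return out

print(normalize({"HTTP_METHOD": "get", "user.id": "u-123"}))
# → {'http.request.method': 'GET', 'user.id': 'u-123'}
```

The alias table becomes the one place where old conventions go to die, instead of living forever in dashboards.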

The Data Quality Pipeline

We treat OTel data like any other data product:

┌─────────────────────────────────────────────────────────┐
│                    Schema Registry                       │
│  ┌──────────────────────────────────────────────────┐  │
│  │ Approved Attributes                               │  │
│  │ - http.request.method: enum[GET,POST,PUT,DELETE] │  │
│  │ - user.id: string, max_length=64                 │  │
│  │ - cart.value: float, unit=USD                    │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│                 OTel Collector Processor                 │
│  ┌──────────────────────────────────────────────────┐  │
│  │ transform/validate:                               │  │
│  │   - Drop unknown attributes                       │  │
│  │   - Normalize names (http.method →                │  │
│  │     http.request.method)                          │  │
│  │   - Enforce cardinality limits                    │  │
│  │   - Emit schema_violation metric                  │  │
│  └──────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
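The validate step is simpler than it sounds. This sketch mirrors the diagram: drop unknown attributes, enforce the registry's constraints, count violations. The registry structure is illustrative, not a real OTel API; in production this logic lives in a collector transform processor.

```python
# Registry entries mirror the diagram's approved attributes.
REGISTRY = {
    "http.request.method": {"type": str, "enum": {"GET", "POST", "PUT", "DELETE"}},
    "user.id": {"type": str, "max_length": 64},
    "cart.value": {"type": float},
}

def validate(attributes):
    """Return (kept attributes, violation count) against the registry."""
    kept, violations = {}, 0
    for key, value in attributes.items():
        rule = REGISTRY.get(key)
        if rule is None:
            violations += 1  # unknown attribute: drop and count
            continue
        if not isinstance(value, rule["type"]):
            violations += 1
            continue
        if "enum" in rule and value not in rule["enum"]:
            violations += 1
            continue
        if "max_length" in rule and len(value) > rule["max_length"]:
            violations += 1
            continue
        kept[key] = value
    return kept, violations

kept, bad = validate({"http.request.method": "GET", "foo": 1, "cart.value": 9.99})
print(kept, bad)  # → {'http.request.method': 'GET', 'cart.value': 9.99} 1
```

The violation count feeds the schema_violation metric in the diagram, which is what makes non-compliance visible instead of silent.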

Cardinality: The Silent Killer

We learned this the hard way:

Attribute           Cardinality   Impact
user.id             10M+          Storage explosion, slow queries
request.path        Unbounded     Backend OOM
error.stack_trace   Very high     10x storage cost
http.status_code    ~50           Fine

Rule: If an attribute can have >10,000 unique values, it needs sampling or aggregation.
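That rule can be checked offline on a sample of spans before an attribute ever ships. This sketch counts exact distinct values with sets; at real volume you'd swap in a probabilistic sketch like HyperLogLog. The data is made up.

```python
CARDINALITY_LIMIT = 10_000  # the rule above

def high_cardinality_attributes(spans, limit=CARDINALITY_LIMIT):
    """Attribute keys whose distinct-value count exceeds the limit."""
    seen = {}
    for span in spans:
        for key, value in span.items():
            seen.setdefault(key, set()).add(value)
    return sorted(key for key, values in seen.items() if len(values) > limit)

# 20,001 synthetic spans: user.id is unique per span, status codes repeat.
spans = [{"user.id": f"u-{i}", "http.status_code": 200 + (i % 3)} for i in range(20_001)]
print(high_cardinality_attributes(spans))  # → ['user.id']
```

Gate instrumentation PRs on a check like this and "silent killer" attributes get caught in review instead of in the backend's memory graph.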

Data Quality Metrics

We built a meta-dashboard:

-- Schema compliance rate by service
SELECT 
  service_name,
  1.0 * COUNT(CASE WHEN schema_valid THEN 1 END) / COUNT(*) as compliance_rate
FROM otel_validation_results
GROUP BY service_name
ORDER BY compliance_rate ASC;

-- Cardinality trend by attribute
SELECT
  attribute_name,
  DATE_TRUNC('day', timestamp) as day,
  COUNT(DISTINCT attribute_value) as cardinality
FROM otel_attributes
GROUP BY 1, 2
ORDER BY cardinality DESC;

My Recommendation

Treat your OTel schema like your API schema:

  • Version it
  • Document it
  • Validate it at ingestion
  • Alert on violations

OTel gives you flexibility. That flexibility becomes chaos without governance.

Product Reliability: What I Care About as VP Product

Alex, thanks for the honest assessment. Let me share what this complexity means from a product perspective.

What Product Actually Needs from Observability

  1. “Is the feature working?” - Not “is the server up,” but “can users complete checkout?”
  2. “How fast is it?” - User-perceived latency, not p99 of individual services
  3. “What broke?” - When users complain, why?
  4. “What’s the business impact?” - Revenue at risk, users affected

The OTel Complexity Tax on Product

Your 847-line config doesn’t affect me directly. But here’s what does:

Symptom                           Product Impact
Missing traces during migration   “We don’t know why this broke”
Inconsistent attributes           Dashboards don’t work across features
OTel expert bottleneck            Weeks to add business context to traces
Version drift                     Correlation across services broken

What I Need the Team to Prioritize

From a product owner perspective:

Must Have:

  • End-to-end traces across user journeys (not just services)
  • Business metrics correlated with technical health
  • Fast time-to-insight when production issues hit

Nice to Have:

  • Vendor flexibility for cost negotiations
  • Future AI/ML observability capabilities

Don’t Care:

  • Which collector topology you use
  • What SDK version is current
  • How many lines the config is

The Trade-Off I Accept

You’re trading operational simplicity for strategic flexibility. From a business perspective, that’s the right trade if:

  1. Time-to-insight stays the same or improves - If incidents take longer to debug during/after migration, that’s a product regression
  2. Coverage doesn’t decrease - “Flying blind” periods are unacceptable
  3. Business context is preserved - I don’t just need “Service X is slow,” I need “checkout is slow for premium users”

My Ask

When you present OTel migration to leadership, don’t just show the cost savings. Show:

Before OTel:
- Time to root cause: 45 minutes
- Business impact visibility: Low

After OTel:
- Time to root cause: 30 minutes (expected improvement)
- Business impact visibility: High (user-level correlation)

That’s the story that gets product support for the investment.