Your Span Names Are an Undocumented API: Telemetry Contracts Between Agent Teams
The cost spike that paged finance at 3 a.m. was not a cost spike. It was a span rename. Someone on the agent platform team decided that llm.completion.synthesis should really be llm.generate.answer because it read more naturally, opened a small PR, ran their tests, and shipped. Three days later finance's monthly token-spend dashboard showed a 60% drop. Nobody had cut spending. The aggregation rule still grouped by the old name, and the new spans flowed past it into an "other" bucket that the dashboard didn't even render. The bill didn't move. The dashboard did.
This is a class of incident I keep watching teams rediscover. Span names and attribute keys are not labels for humans to read in a trace UI. They are the public schema of an undocumented API, with consumers that the producer team has never met — eval pipelines that filter on them, cost dashboards that group by them, SLO alerts that fire on their durations, FinOps reports that sum their token attributes. A "harmless rename" inside one team is a wire-protocol break for four other teams that never saw the PR.
The mental shift required is uncomfortable but simple: if your protobuf definitions get a schema registry, a compatibility mode, and a CI gate, your OpenTelemetry attribute set deserves the same governance. Telemetry is a contract. Treat it like one.
Why Spans Become Contracts Without Anyone Deciding
Nobody sets out to publish an API by emitting a span. The chain is incremental. Team A instruments their agent and picks a span name that sounds reasonable. The platform team's collector slurps it up and routes it to the trace store. An analyst in another org notices that filtering traces by tool.call.search produces a useful slice for tool-success-rate analysis and writes a query. A FinOps engineer sees a gen_ai.usage.input_tokens attribute and adds it to a Looker model. An SRE writes an alert on latency p99 where span.name = "llm.completion.synthesis".
None of these consumers ever talked to Team A. None of them appear in Team A's CODEOWNERS. None of their queries are referenced in Team A's repo. But each one is now a load-bearing dependency on the exact strings Team A chose. The contract was never negotiated; it was materialized — by the act of someone, somewhere, writing a query that depends on a name.
This is structurally identical to how Hyrum's Law works on HTTP APIs: with a sufficient number of consumers, every observable behavior of your system will be depended on by someone. The difference is that HTTP API teams know they have an API. Span-emitting teams often think they have an internal log line.
The OpenTelemetry community has been wrestling with this directly. Span names are supposed to be low cardinality and stable; attribute keys are supposed to be governed by semantic conventions. The GenAI semantic conventions for agents and LLM clients are still in experimental status as of early 2026, and they ship with an OTEL_SEMCONV_STABILITY_OPT_IN mechanism — opt-in dual-emission of old and new attribute names so consumers can migrate. That mechanism only exists because the maintainers know exactly what happens when you rename an attribute that the world has come to depend on.
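To make the pattern concrete, here is a minimal sketch of org-level dual emission using the OTel Python API, gated by an environment variable in the spirit of OTEL_SEMCONV_STABILITY_OPT_IN. The attribute names and the TELEMETRY_COMPAT_MODE variable are illustrative, not part of any spec:

```python
import os

from opentelemetry import trace

tracer = trace.get_tracer("agent.synthesis")

# Which generations of names to emit during the migration window. The
# variable name and its values are org-local conventions, not an OTel spec.
_mode = os.environ.get("TELEMETRY_COMPAT_MODE", "old/new").split("/")
EMIT_OLD = "old" in _mode
EMIT_NEW = "new" in _mode

def record_input_tokens(span: trace.Span, tokens: int) -> None:
    # Dual emission: consumers keyed on the old attribute keep working
    # while they migrate to the new one.
    if EMIT_OLD:
        span.set_attribute("llm.usage.prompt_tokens", tokens)
    if EMIT_NEW:
        span.set_attribute("gen_ai.usage.input_tokens", tokens)

with tracer.start_as_current_span("llm.completion.synthesis") as span:
    record_input_tokens(span, tokens=1234)
```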
The Failure Modes Are Quiet
What makes telemetry breaks especially nasty is that they tend to fail silently. A protobuf field rename in a Kafka producer will usually blow up a downstream consumer with a deserialization error — the schema registry will reject the publish, or the consumer will throw, or a dead-letter queue will start filling. The break announces itself.
A span rename rarely produces an error. It produces a query that returns fewer rows, or no rows, or the same rows but missing a critical category. The trace store is happy. The collector is happy. The producer team's tests are green. The consumer's dashboard goes flat or — worse — looks plausible but is now systematically under-counting one segment.
I've seen four characteristic shapes of this failure:
- The flat-line dashboard. A KPI that used to oscillate normally suddenly pegs to zero or to a constant. Often missed for days because flat panels look like "things are stable now."
- The silent re-bucket. A multi-series chart that previously broke down by tool type starts collapsing everything into "other" or into a single category, because the discriminator attribute changed.
- The partial migration. Half the producers ship the new name; half are still on the old name. Aggregations now double-count when both names overlap a window, then under-count when one side finishes the rollout.
- The absent alert. An SLO alert silently stops firing because its filter no longer matches anything. Nobody notices until an incident the alert was supposed to catch runs undetected for hours.
The common thread: there is no exception, no 500, no failed deploy. The damage is to the meaning of derived data, and the only way to detect it is to know what the data was supposed to mean.
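One cheap detector for the flat-line and absent-alert shapes is a watchdog that compares the span names your registry says should exist against what the trace store actually received. A sketch, with the trace-store query left as an input since that part is vendor-specific:

```python
def find_silent_breaks(observed: dict[str, int], registered: set[str]) -> list[str]:
    """Registered span names that produced zero spans in the window.

    A name going quiet while consumers' queries still reference it is
    exactly the break that never raises an exception anywhere else.
    """
    return sorted(name for name in registered if observed.get(name, 0) == 0)

# `observed` would come from a count-by-span-name query against your trace
# store (ClickHouse, Tempo, a vendor API); these values are illustrative.
observed = {"llm.generate.answer": 4_821, "tool.call.search": 1_337}
registered = {"llm.completion.synthesis", "tool.call.search"}

assert find_silent_breaks(observed, registered) == ["llm.completion.synthesis"]
```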
Borrowing the API Discipline You Already Have
If you've worked on a system with a real schema registry — Confluent's for Avro and Protobuf, Buf's for Protobuf — you already know the playbook. Schemas are versioned artifacts with owners. Compatibility modes (BACKWARD, FORWARD, FULL) are enforced in CI before a producer can publish a new version. Breaking changes require an explicit, reviewed, opt-in process with a deprecation window where old and new fields coexist. Consumers are visible to producers because they're registered.
Almost every part of this transfers directly to telemetry, and OpenTelemetry has been quietly building the tooling to make it real. OTel Weaver is the registry analogue: you describe your spans, attributes, and metrics in a structured registry, validate code against it, generate documentation and typed instrumentation helpers, and run weaver registry diff to detect breaking changes between versions. The diff can be wired into CI to fail PRs that introduce backward-incompatible changes without a deliberate version bump and migration plan. Weaver also supports custom Rego policies — useful when your house rules go beyond the OTel conventions (e.g., "no new span name without an owning team in the registry").
The pattern, mapped one-to-one:
- Schema as artifact. Span names and attribute schemas live in a registry repo, not in scattered string literals across your services. Producers pull from the registry; consumers reference the registry version they target.
- Versioning and compatibility modes. A registry version is the unit of change. Adding new optional attributes is backward-compatible; renaming or retyping existing ones is not. CI enforces the rule.
- Owned, deprecation-windowed changes. Breaking changes ship with dual-emission: producers emit both the old and new names for a defined window. The OTel OTEL_SEMCONV_STABILITY_OPT_IN env var implements exactly this pattern at the convention level — copy the pattern at your org level.
- Known consumers. When a span schema changes, the producer can see who's filtering on it. This requires that consumers register their dependencies (a Looker model, a Grafana panel, an SLO query) against the registry — annoying to set up, life-saving when something needs to change. A minimal shape for that consumer index is sketched just after this list.
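Before any tooling, the consumer index can be as simple as a mapping from span names and attribute keys to the teams and artifacts that query them. Every entry below is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Consumer:
    team: str
    artifact: str  # the dashboard panel, SLO query, or model that filters on the name

# Illustrative entries; in practice this lives in the registry repo.
CONSUMERS: dict[str, list[Consumer]] = {
    "llm.completion.synthesis": [
        Consumer("finops", "looker::token_spend_monthly"),
        Consumer("sre", "grafana::synthesis_latency_slo"),
    ],
    "gen_ai.usage.input_tokens": [
        Consumer("finops", "looker::token_spend_monthly"),
    ],
}

def blast_radius(name: str) -> list[Consumer]:
    """Who breaks if this span name or attribute key changes?"""
    return CONSUMERS.get(name, [])
```

Even this dict-in-a-repo version changes the conversation in a renaming PR: the reviewer sees the blast radius instead of guessing at it.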
The Org Seam Where This Falls Apart
Tooling solves the technical half. The cultural half is harder, and it almost always fractures along the same fault line: who owns what part of the telemetry stack.
Two failure modes recur. In the first, the platform team owns "observability" — the collector, the trace store, the dashboards. They feel responsible for the system but have no authority over the strings application teams emit. They watch a rename go through and have no policy lever to block it. In the second, every team owns its own spans, the platform team owns nothing but the pipes, and there is no shared schema. Cross-team consumers learn to live with brittle string filters and write defensive code that breaks weekly.
The arrangement that holds up is closer to how API design works in mature orgs: the platform team owns the registry and the policy, application teams own the content of their slice of the registry, and consumers register their dependencies. Breaking-change reviews are a real review with a real reviewer pool — the affected consumer teams. The producer cannot merge a renaming PR until either (a) no registered consumer depends on the old name or (b) the consumers have been notified and a deprecation window is scheduled.
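Stated as a predicate, the merge gate is small; the hard part is keeping its inputs honest. A sketch, assuming a consumer index like the one above exists in some form:

```python
def rename_may_merge(consuming_teams: set[str],
                     notified_teams: set[str],
                     deprecation_window_scheduled: bool) -> bool:
    """The review rule as code: a renaming PR merges only if
    (a) no registered consumer depends on the old name, or
    (b) every consuming team was notified and a dual-emission
        deprecation window is on the calendar."""
    if not consuming_teams:
        return True  # (a)
    return consuming_teams <= notified_teams and deprecation_window_scheduled  # (b)
```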
This sounds heavy. It is much lighter than the recurring 3 a.m. pages and the trust collapse that follows the third "the dashboard was wrong" finance escalation.
Where to Start If You're Not There Yet
Most teams reading this are nowhere near a Weaver-backed registry, and that's fine — the path in is incremental.
Start by inventorying what you already have. Pick the top three to five span names by volume in your trace store and ask: who consumes these? You will be surprised. The answer is rarely just "the team that emits them." Document the consumers. Now you have a draft contract.
Next, add a CI check that fails any PR that adds, renames, or removes a span name or attribute key without a corresponding entry in a TELEMETRY_CHANGES.md file in the repo. This is dumb and primitive and works. It forces the conversation to happen before the rename ships, not after.
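One way to write that check, assuming spans are created through the Python SDK's start_as_current_span; the regex, the base ref, and the changelog filename are the parts you would adapt:

```python
import re
import subprocess
import sys

# Matches the Python SDK's span-creation call sites; add patterns for the
# other SDKs and languages your repo uses.
SPAN_NAME = re.compile(r'start_as_current_span\(\s*["\']([\w./-]+)["\']')

def main() -> int:
    # Assumes CI has fetched origin/main; adjust the base ref to taste.
    diff = subprocess.run(
        ["git", "diff", "origin/main...HEAD", "--unified=0"],
        capture_output=True, text=True, check=True,
    ).stdout

    touched_names: set[str] = set()
    changelog_touched = False
    for line in diff.splitlines():
        if line.startswith("+++ ") and "TELEMETRY_CHANGES.md" in line:
            changelog_touched = True
        elif line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            touched_names.update(SPAN_NAME.findall(line))

    if touched_names and not changelog_touched:
        print(f"span names changed without a TELEMETRY_CHANGES.md entry: "
              f"{sorted(touched_names)}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```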
When you graduate from primitive to real, OTel Weaver is the obvious next step — adopt it for the spans your platform team controls, expand to application teams as governance matures, and use the registry diff in CI as your compatibility gate. Pair it with a normalization layer at the collector: when a rename does ship, run both names through a processor that maps the old to the new (or vice versa) for a defined window so existing dashboards keep working while consumers migrate.
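The normalization layer most naturally lives in the collector (its transform processor can rewrite names fleet-wide), but the same mapping can sit at the SDK edge for producers you control. A sketch using the Python SDK's SpanProcessor hook, with the rename table taken from the opening incident; here it maps the new name back to the old one so legacy dashboards keep matching during the window, and you flip the direction if you want to consolidate on the new name instead:

```python
from opentelemetry.context import Context
from opentelemetry.sdk.trace import Span, SpanProcessor

# New name -> old name; keep entries only for the deprecation window.
RENAMES = {"llm.generate.answer": "llm.completion.synthesis"}

class RenameNormalizer(SpanProcessor):
    """Rewrites span names as spans start, so queries keyed on the other
    side of a rename keep working while consumers migrate."""

    def on_start(self, span: Span, parent_context: Context | None = None) -> None:
        mapped = RENAMES.get(span.name)
        if mapped is not None:
            span.update_name(mapped)

# Registered once on the tracer provider:
#   provider.add_span_processor(RenameNormalizer())
```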
Finally, treat span attribute cardinality with the same paranoia you treat database index design. Putting a request ID or a user ID into a span name is the telemetry equivalent of a SELECT * against an unindexed table — it doesn't break anything immediately, and then one day it does, expensively. The OTel naming conventions are explicit about this for a reason.
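The same rule in miniature, with a hypothetical per-request value:

```python
from opentelemetry import trace

tracer = trace.get_tracer("example")
request_id = "req-8f2a"  # stand-in for a per-request value

# Don't: a request-scoped value in the span NAME makes every span unique,
# which defeats grouping and bloats anything indexed on names.
with tracer.start_as_current_span(f"tool.call.search/{request_id}"):
    pass

# Do: keep the name low-cardinality; the value goes in an attribute.
with tracer.start_as_current_span(
    "tool.call.search", attributes={"request.id": request_id}
):
    pass
```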
The Real Lesson Is About Observability of Observability
The deeper move is to stop thinking of telemetry as a write-only stream and start thinking of it as a system with its own users, its own SLAs, and its own change-management discipline. The cost dashboard, the eval pipeline, the SLO alert — these are applications built on top of your span data, and the people who depend on them are paying customers of the schema, even if no money changes hands.
The teams that get this right do one habit consistently: they log every change to the telemetry contract, and they review changes to it the same way they review changes to a public API. The teams that get it wrong keep getting paged at 3 a.m. for cost spikes that aren't cost spikes, and slowly losing trust in their own dashboards.
If your protobufs deserve a schema registry, your spans do too. The bill might not move, but the dashboard will — and that is a worse problem, because by the time you notice, you've been making decisions on bad numbers for weeks.
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- https://opentelemetry.io/blog/2025/otel-weaver/
- https://github.com/open-telemetry/weaver
- https://opentelemetry.io/docs/specs/otel/schemas/
- https://opentelemetry.io/docs/specs/semconv/general/naming/
- https://opentelemetry.io/blog/2025/how-to-name-your-span-attributes/
- https://last9.io/blog/otel-naming-best-practices/
- https://last9.io/blog/otel-weaver-consistent-observability-with-semantic-conventions/
- https://buf.build/blog/why-a-protobuf-schema-registry
- https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/serdes-protobuf.html
- https://www.dash0.com/blog/observability-cost-out-of-control
