Agent Identifiability: When Your Trace Can't Tell You Which Agent Did What
A user reports the assistant gave them a wrong answer at 9:47 a.m. You open the trace. There are three hundred and forty spans. They are almost all named agent.run, llm.invoke, or tool.call. Some have a parent. Some are siblings. Three of them retried. One of them retried and then was cancelled. None of them tells you whether the bad output came from the planner, the worker, the critic, the reflection pass, or the second retry of the worker after the critic flagged it.
You spend the next hour grepping log lines for a UUID prefix you saw in a screenshot, cross-referencing timestamps against a Slack notification, and reconstructing the agent topology in your head from the indentation pattern in the trace viewer. Eventually you guess that the third worker invocation ran with a model alias that silently flipped to a different snapshot the night before. You cannot prove it from the trace alone.
The agent worked. The trace is intact. The hairball is the bug.
The Default Span Topology Is A Lie By Omission
When you wrap an agent loop with conventional APM instrumentation, you get spans. You do not get identity. The OpenTelemetry GenAI semantic conventions define gen_ai.operation.name values like invoke_agent and create_agent, with span names of the form invoke_agent {gen_ai.agent.name} — but the conventions are still marked experimental at the time of writing, and most framework integrations either ignore them or set the agent name to a generic label like the framework's class name. The result is that a trace from a CrewAI, LangGraph, or Microsoft Agent Framework run looks structurally similar regardless of how complex the underlying topology is. Every internal call shows up as an indistinguishable agent.run.
Conventional APMs were built for stateless request-response services. The mental model assumed: a request enters at the edge, fans out to a fixed set of named downstream services, returns. Service names are stable. The topology is part of your architecture diagram. The span hierarchy is the call graph because the call graph is statically known.
Agents break that assumption in three ways simultaneously. First, the topology is dynamic — a planner decides at runtime whether to spawn a single worker, a fan-out of three parallel workers, a critic, a tool call, or a delegation to a sub-team. Second, the same code path runs as multiple logical roles depending on context — the same agent.run function call is a planner on one invocation and a worker on the next. Third, retries and reflection passes mean the same logical agent can appear N times in the same trace, and you have no way to tell whether you are looking at the same agent's second attempt or two different agents that happened to receive similar inputs. The trace structure faithfully records what the runtime did. It does not record what the runtime meant.
The on-call experience that follows is predictable. The team without an identity model spends an hour reconstructing the call graph from log timestamps, function-arg dumps, and prompt fragments visible in span events. The team with one runs a single query: agent.role = "critic" AND agent.spawn_reason = "self_consistency_check" AND outcome = "rejected" and sees the answer in thirty seconds.
The Five-Field Identity Model
The fix is not a new tool. It is a discipline that says every span produced by an agent must carry five attributes that conventional APMs do not generate by default.
Role. A stable string drawn from a closed enum your team controls — planner, worker, critic, router, reflector, tool_executor, summarizer. Not the framework's class name, not the model name, not a free-form description. The constraint matters: a closed enum makes the attribute groupable and chartable, while free-form descriptions become unsearchable noise. When the topology evolves and a new role appears, you add it to the enum deliberately.
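One way to make the closed-enum constraint concrete, assuming a Python codebase (the names here are illustrative, not from any framework):

```python
from enum import Enum

class AgentRole(str, Enum):
    """Closed enum of agent roles. Extending it is a deliberate,
    reviewed change, not a string typed at a call site."""
    PLANNER = "planner"
    WORKER = "worker"
    CRITIC = "critic"
    ROUTER = "router"
    REFLECTOR = "reflector"
    TOOL_EXECUTOR = "tool_executor"
    SUMMARIZER = "summarizer"

def role_attribute(value: str) -> str:
    """Validate a role at instrumentation time.

    A typo fails loudly here instead of becoming unsearchable noise
    in the trace backend."""
    return AgentRole(value).value  # raises ValueError on unknown roles
```

Routing every span through a validator like this is what keeps the attribute groupable: the set of distinct values in the backend can never drift beyond the enum.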
Parent agent and spawn reason. Two fields, not one. parent_agent_id reconstructs the call graph as a directed acyclic graph rather than the flat list your trace viewer renders. spawn_reason captures why this child was created: initial_decomposition, self_consistency_check, tool_failure_retry, escalation_threshold_exceeded. The combination is what lets you query for "every critic invocation that was spawned because a worker timed out," which is a question your trace cannot answer without it.
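A minimal sketch of carrying both fields together, so neither can be set without the other. The attribute names follow the article's query examples and are a team choice, not a standard:

```python
from dataclasses import dataclass

# Closed vocabulary of spawn reasons, maintained like the role enum.
SPAWN_REASONS = {
    "initial_decomposition",
    "self_consistency_check",
    "tool_failure_retry",
    "escalation_threshold_exceeded",
}

@dataclass(frozen=True)
class SpawnContext:
    """Why and by whom a child agent was created."""
    parent_agent_id: str
    spawn_reason: str

    def __post_init__(self):
        if self.spawn_reason not in SPAWN_REASONS:
            raise ValueError(f"unknown spawn_reason: {self.spawn_reason}")

    def as_attributes(self) -> dict:
        # Attached to every span the child emits, queryable
        # independently of the OpenTelemetry parent-span relationship.
        return {
            "agent.parent_id": self.parent_agent_id,
            "agent.spawn_reason": self.spawn_reason,
        }
```

Pairing the fields in one object is the point: a span with a parent but no reason, or a reason but no parent, cannot be constructed.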
Agent instance ID. A deterministic identifier scoped to one logical agent for the lifetime of one task. If the worker retries after a critic rejection, both attempts get the same instance ID. If the orchestrator spawns three parallel workers, each gets its own. This is the field that makes the trace queryable across retries — without it, you can ask "show me this span" but not "show me everything the same logical worker did across all its attempts."
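One way to make the ID deterministic is to derive it from the task ID, the role, and the position in the parent's spawn order — never from a timestamp or a random UUID. A sketch under those assumptions:

```python
import hashlib

def agent_instance_id(task_id: str, role: str, spawn_index: int) -> str:
    """Deterministic per-task identity for one logical agent.

    Retries of the same logical agent reuse the same
    (task_id, role, spawn_index) triple and therefore the same ID;
    parallel siblings differ in spawn_index and get distinct IDs."""
    key = f"{task_id}:{role}:{spawn_index}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]
```

A retry passes the same triple and lands in the same logical group; the three parallel workers in the fan-out pass spawn_index 0, 1, and 2 and get three distinct IDs.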
Model and prompt version pair. gen_ai.request.model and a prompt artifact identifier — content hash, registry version, or git SHA of the prompt source file. Both fields, not just one. If your prompt registry pushes a hot-reload at 09:31 and a regression begins at 09:33, the prompt version on the bad span is the evidence that points the postmortem at the prompt and not the model. If a model provider promotes a snapshot under the alias claude-x-latest overnight, the request-model field on the bad span is the evidence that points the postmortem at the provider and not the prompt. Without both, every regression is ambiguous.
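A content hash computed at load time is the cheapest prompt-version identifier: it needs no registry and changes exactly when the prompt text changes. A sketch, assuming prompts are loaded from files:

```python
import hashlib
from pathlib import Path

def load_prompt(path: str) -> tuple[str, str]:
    """Return the prompt text and its content-hash version identifier.

    Emitting this hash on every span alongside gen_ai.request.model
    makes a hot-reloaded prompt attributable from the trace alone,
    without consulting the deployment log."""
    text = Path(path).read_text(encoding="utf-8")
    version = "sha256:" + hashlib.sha256(text.encode()).hexdigest()[:12]
    return text, version
```

The key design choice is that the version is produced by the same function that loads the artifact, so the two can never disagree.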
Outcome. Not status code — outcome. succeeded, tool_error, validation_failed, critic_rejected, timeout, cancelled_by_parent. APM defaults give you HTTP status and exception type, neither of which captures the agent-loop semantics that determine whether a span was useful work or wasted compute. The team that adds outcome surfaces a class of regressions that error-rate alarms never see — the run completed without an exception, but the critic rejected every worker output and the loop ran out of budget.
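Outcome classification belongs in one function, not improvised per call site. A hedged sketch — the fields on the result object are placeholders for whatever your framework actually reports:

```python
OUTCOMES = (
    "succeeded", "tool_error", "validation_failed",
    "critic_rejected", "timeout", "cancelled_by_parent",
)

def classify_outcome(result) -> str:
    """Map an agent-loop result to an outcome attribute.

    Note that critic_rejected and validation_failed are successful
    executions at the HTTP/exception level -- the class of regression
    that error-rate alarms never see."""
    if result.cancelled:
        return "cancelled_by_parent"
    if result.timed_out:
        return "timeout"
    if result.tool_exception is not None:
        return "tool_error"
    if not result.schema_valid:
        return "validation_failed"
    if result.critic_verdict == "rejected":
        return "critic_rejected"
    return "succeeded"
```

The ordering is deliberate: cancellation and timeout take precedence because they truncate the loop before the semantic checks run.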
These five fields are not a complete schema. They are the floor. Add domain-specific fields above them — the document corpus a retrieval span hit, the sub-agent team a delegation went to, the user-tier the trace originated from. But if any of the five floor fields is missing, the trace is a hairball and no amount of higher-level attributes saves you.
Why OpenTelemetry's GenAI Conventions Don't Solve It Alone
The conventions are real. gen_ai.agent.id and gen_ai.agent.name exist in the registry. gen_ai.operation.name distinguishes create_agent from invoke_agent. Major vendors — Datadog, Grafana, Langfuse, Arize — have begun supporting the conventions natively. This is genuine progress, and instrumenting against the conventions is the right floor for any team starting fresh.
But the conventions stop short of the identity model above in three meaningful ways. First, gen_ai.agent.id is a vendor identifier — typically the assistant ID returned by an OpenAI Assistants API call or its analogue — not a per-task instance ID that lets you correlate retries. Second, there is no standard attribute for agent.role at the application level — the conventions describe the operation, not the role the agent is playing in your system's logical architecture. Third, the conventions do not specify a parent-agent or spawn-reason field, because the OpenTelemetry parent-span relationship is the closest analogue and OTel is reluctant to duplicate it.
The parent-span relationship is not the same thing as parent-agent. In a typical agent framework, an orchestrator spawns sub-agents that run as child spans of the orchestrator's span — but it also spawns retries that run as siblings, calls tools that run as children of the worker rather than the orchestrator, and delegates to sub-teams whose internal spans are deeply nested. The OpenTelemetry parent-span graph is a faithful record of which function called which other function. The parent-agent graph you actually want is a logical graph of which agent decided to invoke which other agent and why. They overlap but are not the same. You need both, with explicit attributes for the second.
The pragmatic position is: instrument against the OpenTelemetry GenAI conventions where they apply, and add the application-level identity attributes above them as custom fields prefixed with your own namespace. When the conventions catch up — and they will — your custom fields can be migrated. The cost of running ahead of the spec is small; the cost of waiting for the spec is the next on-call shift.
What Has To Be True The Day Before The Incident
The reason most teams discover the gap during an incident is that the identity model is invisible during normal operation. The trace looks fine when nothing is wrong. Every span has a name. The hierarchy renders. You can step through it.
The day everything works is the day to verify the identity model holds, because the day it breaks you will not have time to add fields. Three checks belong in the routine:
The first check is a topology audit. Take the last hundred production traces. Group spans by agent.role and confirm the distribution matches the architectural intent — if you have a critic in the design but agent.role = "critic" shows up on zero spans, your instrumentation is silently mislabeling. If a role you do not have shows up, someone added a code path that was not in the design. Either way, the audit catches drift before the incident does.
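The audit itself is a small script against whatever backend stores your spans. A sketch assuming spans are exported as dicts carrying an agent.role attribute (names are assumptions):

```python
from collections import Counter

# The roles the architecture diagram says exist.
EXPECTED_ROLES = {"planner", "worker", "critic", "router"}

def audit_topology(spans: list[dict]) -> dict:
    """Compare the observed role distribution against design intent."""
    observed = Counter(
        s.get("attributes", {}).get("agent.role", "<missing>")
        for s in spans
    )
    return {
        "distribution": dict(observed),
        # Designed roles that emitted zero spans: silent mislabeling.
        "missing_roles": sorted(EXPECTED_ROLES - set(observed)),
        # Roles that appear but were never designed: undocumented drift.
        "unexpected_roles": sorted(
            set(observed) - EXPECTED_ROLES - {"<missing>"}
        ),
    }
```

Either non-empty list is a finding worth a ticket before it becomes a finding in a postmortem.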
The second check is a retry resolvability check. Find a trace with retries and confirm that querying by agent.instance_id returns all attempts as a single logical group. If retries appear under different instance IDs, your ID generation is non-deterministic and you cannot correlate across attempts. This is the failure mode that converts a five-minute investigation into a five-hour one because every retry looks like an independent agent that was never explicitly invoked.
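The check reduces to grouping spans by instance ID and asserting that retries collapse into one group. A sketch over exported span dicts (field names are assumptions):

```python
from collections import defaultdict

def group_attempts(spans: list[dict]) -> dict[str, list[dict]]:
    """Group spans by logical agent instance.

    If two retries of the same worker land in different groups, the
    ID generator is non-deterministic and cross-attempt correlation
    is broken."""
    groups = defaultdict(list)
    for s in spans:
        instance = s.get("attributes", {}).get("agent.instance_id")
        if instance is not None:
            groups[instance].append(s)
    return dict(groups)

def assert_retries_resolvable(spans: list[dict], expected_agents: int) -> None:
    groups = group_attempts(spans)
    assert len(groups) == expected_agents, (
        f"expected {expected_agents} logical agents, got {len(groups)}: "
        "retries are splitting into new identities"
    )
```

Run it against a production trace known to contain retries; a passing assertion is the proof that "show me everything this worker did" is answerable.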
The third check is a regression-traceability rehearsal. Pick a trace that ran with a known prompt version and a known model snapshot. Run a synthetic incident — change the prompt, change the model, change the routing logic — and confirm that the resulting trace makes the change attributable from the spans alone, without consulting the deployment log. If the trace does not name the version, the on-call engineer will not be able to either, and the postmortem will say "we are not sure which change caused it."
All three checks fit in a weekly observability review. Skipping them does not feel costly until the incident, at which point the cost is the entire investigation window.
The Cultural Half Of The Problem
Identity attributes do not appear by accident. They are added by engineers who have been bitten by the absence and committed to never being bitten the same way twice. That commitment is what is hard to maintain.
The default trajectory of agent instrumentation is decay. The first version has rich attributes because someone just debugged a hairball trace. The second version drops two of them because the framework upgrade changed the wrapper signature and nobody noticed. The third version adds a new agent role to the codebase but does not add it to the role enum, so half the new spans show up as unknown. By the sixth version, the trace looks structurally identical to the first version but conveys 60% as much information.
The defense is treating the identity model as a contract, not a convenience. The role enum is a versioned artifact in the codebase, not a string literal sprinkled across the call sites. The instance-ID generator is a single function with a test. The model and prompt version attributes are produced by the same library that loads the prompt artifact, not by hand at every span boundary. The sixth-version drift happens when these primitives are scattered; it is mostly preventable when they are centralized.
The team that ships agents in production for a year without an incident is rare. The team that ships agents for a year and can answer "which agent did what, when, why, and with which version" thirty seconds after the incident is the one whose identity model survived contact with the codebase. The trace is not a side-effect of the runtime. It is a designed artifact, and the design is upstream of every postmortem the team will eventually write.
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/
- https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/
- https://opentelemetry.io/blog/2024/otel-generative-ai/
- https://www.langchain.com/articles/agent-observability
- https://www.langchain.com/articles/llm-observability-tools
- https://uptrace.dev/blog/opentelemetry-ai-systems
- https://learn.microsoft.com/en-us/agent-framework/agents/observability
- https://langfuse.com/blog/2024-07-ai-agent-observability-with-langfuse
- https://atlan.com/know/ai-agent-observability/
- https://www.getmaxim.ai/articles/5-essential-techniques-for-debugging-multi-agent-systems-effectively/
- https://fast.io/resources/ai-agent-distributed-tracing/
- https://www.datadoghq.com/blog/parent-child-vs-span-links-tracing/
- https://oneuptime.com/blog/post/2026-02-06-trace-ai-agent-execution-flows-opentelemetry/view
- https://www.digitalapplied.com/blog/agent-observability-platforms-langsmith-langfuse-arize-2026
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
