Halted Is Not a Status: Why Agents Need a Typed Terminal-Reason Protocol
Open the dashboard for an agent fleet and you will see a clean number: completion rate, 94%. Below it, a list of runs, each tagged with one of two states: running, or not running. The runs marked "not running" all look identical. Some of them finished the task perfectly. Some of them hit a step limit two actions short of done. Some of them caught a tool error and gave up. Some of them decided the task was impossible, and were right. And some of them simply lost the thread and stopped emitting tokens.
Your monitoring cannot tell these apart. It knows the process is no longer running. It does not know why, and "why" is the only thing that matters when you are deciding whether to page someone.
This is the quiet structural gap in most agent deployments. We instrument the agent's path obsessively — every tool call, every token, every retrieval — and then we let the agent's exit fall off a cliff into a single untyped event. A run ends. The framework records that it ended. On-call gets a number that conflates a healthy completion with a polite surrender, and the only way to tell which is which is to open the trace and read it, one run at a time, after the fact.
A process that can end in six meaningfully different ways needs to say which way it ended. That sentence sounds obvious. Almost no agent framework enforces it.
The completion rate is hiding a distribution
Start with what "completion rate" actually measures. It is the ratio of runs that reached a terminal text output to all runs started. It is borrowed, mostly unexamined, from the request/response world, where a server either returns a 200 or it doesn't, and a 500 is unambiguously bad.
Agents broke that model. An agent run does not have one failure mode and one success mode. It has a range of terminal outcomes, and more than one of them is legitimate:
- It completed the task and produced a correct result.
- It exhausted its budget — hit the max-turns ceiling or a wall-clock timeout — mid-task.
- It was blocked on a tool that errored, timed out, or returned garbage, and could not route around it.
- It judged the task infeasible and stopped deliberately, which is often the right call.
- It escalated to a human, handing off rather than guessing.
- It abandoned the run — stopped making progress, looped, drifted, and went quiet without any of the above being explicitly true.
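Written down as a type, the six outcomes above are a closed enum. What follows is a minimal sketch, not any framework's API: the name TerminalReason, its member names, and the helper parse_terminal_reason are all illustrative assumptions.

```python
from enum import Enum


class TerminalReason(Enum):
    """Closed set of ways a run can end (illustrative names, not from any framework)."""

    COMPLETED = "completed"                # finished the task
    BUDGET_EXHAUSTED = "budget_exhausted"  # hit max turns or a wall-clock timeout
    BLOCKED_ON_TOOL = "blocked_on_tool"    # a tool failed and could not be routed around
    INFEASIBLE = "infeasible"              # judged the task impossible and stopped on purpose
    ESCALATED = "escalated"                # handed off to a human
    ABANDONED = "abandoned"                # went quiet with none of the above being true


def parse_terminal_reason(label: str) -> TerminalReason:
    """Map a recorded label to the enum; an unknown label raises ValueError."""
    return TerminalReason(label)
```

The point of the closed set is that a label like "halted" fails loudly at parse time instead of sliding into the dashboard as a seventh, meaningless bucket.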
A completion rate of 94% tells you 6% landed somewhere in that bottom five. It does not tell you whether that 6% is mostly clean infeasibility judgments (a healthy system declining work it shouldn't do) or mostly abandonment (a system silently failing). Those two readings call for opposite responses. One means ship it; the other means stop the rollout. The metric that is supposed to drive that decision is structurally incapable of distinguishing them.
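To make the ambiguity concrete, here is a small sketch with two made-up fleets. The counts are invented for illustration, and the string labels and the breakdown helper are assumptions, not output from any real system. Both fleets report the same 94% completion rate.

```python
from collections import Counter
from typing import Dict, List


def breakdown(reasons: List[str]) -> Dict[str, float]:
    """Return the share of runs that ended with each terminal reason."""
    counts = Counter(reasons)
    total = len(reasons)
    return {reason: count / total for reason, count in counts.items()}


# Hypothetical data: 100 runs each, identical completion rate.
fleet_a = ["completed"] * 94 + ["infeasible"] * 5 + ["escalated"] * 1
fleet_b = ["completed"] * 94 + ["abandoned"] * 5 + ["budget_exhausted"] * 1

for name, fleet in (("fleet_a", fleet_a), ("fleet_b", fleet_b)):
    print(name, breakdown(fleet))
```

The top-line number is the same for both. Only the per-reason breakdown shows that one fleet is declining work on purpose while the other is going quiet.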
This is the same trap the multi-agent failure literature keeps surfacing. UC Berkeley's Sky Computing Lab built a taxonomy of agent failures — MAST — by hand-annotating roughly 150 traces and then expanding to over 1,600, and the headline finding is not that LLMs are dumb. It is that most failures are structural: missing termination criteria, no role constraints, breakdowns nobody designed against. Of their fourteen failure modes, several — step repetition, premature termination, ignoring other agents — are precisely the things a completion rate renders invisible. You cannot manage a failure class your top-line metric cannot see.
"Stop reason" already exists — at the wrong layer
Here is the frustrating part: the model providers already solved a version of this. Every OpenAI-style chat-completions response carries a finish_reason on each choice, and it is a closed enum: stop, length, tool_calls, content_filter. Anthropic's API has its own field, stop_reason, with values such as end_turn, max_tokens, tool_use, refusal, and a few more. The Claude Agent SDK's ResultMessage goes further, carrying a subtype that distinguishes success from error_max_turns from error_during_execution.
So the single LLM call knows exactly why it stopped. The problem is that this signal lives one or two layers below the thing you actually operate. A finish_reason: length on turn 14 of a 40-turn agent run tells you a single model call hit its token ceiling. It tells you nothing about whether the agent — the orchestration loop wrapping dozens of those calls, tool invocations, retries, and branches — succeeded, gave up, or escalated.
The terminal reason has to be lifted to the loop boundary. The agent process, the unit your SLO is written against, is the unit that has to emit a typed exit. And almost no orchestration layer does this. The framework returns control to the caller, the caller sees the loop is done, and the rich enum from the model layer has long since been flattened into "the run is over."
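What lifting the reason to the loop boundary might look like, as a sketch under stated assumptions: StepResult, RunResult, run_agent, and the step callback are hypothetical names, not part of any SDK, and the rules are deliberately simplified (a failed tool ends the run immediately, and any stop without a tool request or a declared outcome counts as abandonment).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

# Illustrative closed set of run-level outcomes (names are assumptions).
TerminalReason = Enum(
    "TerminalReason",
    "COMPLETED BUDGET_EXHAUSTED BLOCKED_ON_TOOL INFEASIBLE ESCALATED ABANDONED",
)


@dataclass(frozen=True)
class StepResult:
    """What one turn reports back to the loop (hypothetical shape)."""

    stop_reason: str                           # per-call reason, e.g. "tool_use" or "end_turn"
    declared: Optional[TerminalReason] = None  # set when the agent explicitly ends the run
    tool_failed: bool = False                  # set when a tool call could not be recovered


@dataclass(frozen=True)
class RunResult:
    """Typed exit emitted at the loop boundary."""

    reason: TerminalReason
    turns: int
    last_stop_reason: Optional[str]  # the per-call signal is kept, not discarded


def run_agent(step: Callable[[int], StepResult], max_turns: int) -> RunResult:
    """Drive the loop and always return a typed terminal reason."""
    last: Optional[str] = None
    for turn in range(1, max_turns + 1):
        result = step(turn)
        last = result.stop_reason
        if result.tool_failed:
            return RunResult(TerminalReason.BLOCKED_ON_TOOL, turn, last)
        if result.declared is not None:
            return RunResult(result.declared, turn, last)
        if result.stop_reason != "tool_use":
            # The model stopped without requesting a tool or declaring an outcome.
            return RunResult(TerminalReason.ABANDONED, turn, last)
    return RunResult(TerminalReason.BUDGET_EXHAUSTED, max_turns, last)


# Usage: a stub agent that keeps calling tools until the turn budget runs out.
print(run_agent(lambda turn: StepResult("tool_use"), max_turns=3).reason)
```

The design choice worth noticing is that run_agent has no code path that returns without a TerminalReason. Every exit, including the fall-through at the bottom of the loop, is forced to name itself.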
That flattening is the bug. The fix is not new infrastructure. It is refusing to discard a signal you already have.
A typed terminal-reason protocol
- https://opentelemetry.io/blog/2025/ai-agent-observability/
- https://arxiv.org/abs/2503.13657
- https://sky.cs.berkeley.edu/project/mast/
- https://blog.sentry.io/ai-agent-observability-developers-guide-to-agent-monitoring/
- https://www.langchain.com/articles/agent-observability
- https://futureagi.com/glossary/tool-timeout/
- https://code.claude.com/docs/en/agent-sdk/agent-loop
- https://sre.google/workbook/implementing-slos/
- https://galileo.ai/blog/why-multi-agent-systems-fail
