
Your stop_reason Is Lying: Building the Real Stop Taxonomy Production Triage Needs

12 min read
Tian Pan
Software Engineer

The on-call engineer pulls up a trace. The model returned, the span closed clean, the API call shows stop_reason: end_turn. By every signal the platform offers, this was a successful generation. Three minutes later a customer reports that the agent confidently wrote half a config file, declared the operation complete, and moved on. The trace had no warning sign because the warning sign isn't in the API contract — the provider's stop reason has four to seven buckets, and the question your incident demands an answer to lives in the gap between them.

Stop reasons are the field engineers reach for first during triage and the field that lies most cleanly when it does. The values are designed for a runtime that needs to decide what to do next: was this turn complete, did a tool get requested, did a budget get exceeded, did safety intervene. They are not designed for a human reconstructing why an answer went wrong, and the difference between those two purposes is where production teams burn entire afternoons.

The fix is not to abandon the provider's value. It is to recognize that stop_reason is the provider's abstraction over their decoder, and your product needs a second abstraction over your output: a parallel stop-taxonomy derived from final-state inspection, surfaced as its own span attribute next to (not replacing) the API value, and alerted on as a first-class signal. Without that second taxonomy, you are flying blind during the exact incidents where flying blind is most expensive.

What the API Tells You and What It Does Not

The Anthropic Messages API can return one of seven values today: end_turn, max_tokens, stop_sequence, tool_use, pause_turn, refusal, or model_context_window_exceeded. OpenAI's finish_reason exposes a smaller set: stop, length, tool_calls, content_filter, and the deprecated function_call. Other providers have their own variants. Across all of them, the surface area is small because the values are categorical labels for the decoder loop's exit condition, not a diagnosis of what happened to the answer.

That mismatch shows up the first time you triage a real incident.

end_turn is the most loaded value of the lot. It claims the model believed it was done. In practice it groups together at least four very different states: a fully-formed answer that satisfies the user, a "Sure, I can help with that" preamble where the model thought a follow-up turn was coming, an empty 2-3 token response that occurs when the message structure trains the model to expect a user reply after every tool result, and a soft refusal that hedged its way to a stopping point without ever producing a refusal flag. Anthropic's own docs call out the empty-response failure mode explicitly. Your triage process needs to distinguish all four.

max_tokens is the value most likely to be silently wrong. The model wrote until it ran out of room, and the fact that the cutoff happened mid-sentence, mid-JSON-object, or mid-tool-argument is exactly the failure that downstream code rarely checks. There is a documented agent-loop incident where a truncated write_file tool call wrote a partial file with no error, the agent assumed success, and twenty hours of subagent work piled on top of a corrupted artifact before anyone noticed. The vendor's stop_reason accurately reported max_tokens. The agent loop never branched on it. The bug lived in the gap.
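A minimal guard, as a sketch: the response shape follows the Anthropic Messages API, but TruncatedTurnError and the run_tool dispatcher are placeholders for whatever your loop actually does.

```python
class TruncatedTurnError(RuntimeError):
    """Raised when a turn ended on max_tokens and must not be trusted."""

def dispatch_tool_calls(response, run_tool):
    """Dispatch tool_use blocks only if the turn completed cleanly.

    Assumes an Anthropic-style response: a .stop_reason string and a
    .content list of typed blocks. `run_tool` is your dispatcher.
    """
    if response.stop_reason == "max_tokens":
        # A cut-off turn may carry a partially written tool argument
        # (the truncated write_file failure mode). Refuse to dispatch;
        # retry with a larger budget instead.
        raise TruncatedTurnError("turn hit max_tokens; not dispatching")
    for block in response.content:
        if block.type == "tool_use":
            run_tool(block.name, block.input)
```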

tool_use looks unambiguous until you read the bug tracker. There are documented cases where the model emits tool_calls content but finish_reason comes back as stop, where production code branched on finish_reason instead of inspecting the content blocks and silently dropped the tool invocation. Treating stop_reason as the source of truth for "did the model call a tool" makes you wrong in exactly the cases where being right matters most.
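The defensive pattern, as a hedged sketch assuming an OpenAI-style choice object; extract_tool_calls and the logging call are illustrative, not any SDK's API.

```python
import logging

def extract_tool_calls(choice):
    """Use the content, not finish_reason, as the source of truth.

    Assumes an OpenAI-style completion choice: a .finish_reason string
    and a .message that may carry .tool_calls.
    """
    tool_calls = getattr(choice.message, "tool_calls", None) or []
    if tool_calls and choice.finish_reason != "tool_calls":
        # The documented mismatch: tool calls present, finish_reason
        # says "stop". The disagreement is itself a triage signal.
        logging.warning("finish_reason=%s but %d tool call(s) present",
                        choice.finish_reason, len(tool_calls))
    return tool_calls
```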

pause_turn is recent enough that most agent loops in the field don't handle it at all. It fires when server-side tools (web search, web fetch) hit the iteration limit on Anthropic's side, returning a partially-completed turn that needs to be replayed back into a continuation request. If your loop only branches on tool_use and end_turn, a pause_turn becomes a dropped conversation.
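A minimal continuation loop, assuming the Anthropic Python SDK; model, budget, and tool parameters are left to the caller via params, and error handling is omitted.

```python
import anthropic

client = anthropic.Anthropic()

def run_turn(messages, **params):
    """Keep re-issuing the request until the turn actually finishes."""
    while True:
        response = client.messages.create(messages=messages, **params)
        if response.stop_reason != "pause_turn":
            return response
        # Replay the partially completed turn back into the conversation
        # so the server-side tool loop can resume where it stopped.
        messages = messages + [{"role": "assistant", "content": response.content}]
```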

refusal appears clean — the model declined, you handle it. Except classifier-driven refusals only fire on streaming responses on newer Claude models, and "soft" refusals (the model produces a hedged non-answer that no classifier flagged) come back as end_turn. Your refusal rate as measured by the API field is not your refusal rate as experienced by users.

The pattern is the same across every value: the API gives you a categorical label that helps the runtime decide what to do; it does not give you the eight-bit diagnostic that triage needs.

The Parallel Stop-Taxonomy

The discipline is to compute, on every response, a second classification that you derive yourself from final-state inspection. The provider's value tells you what their decoder thought. Your value tells you what the output looks like. Surface both. Diff them when they disagree.

A working starter taxonomy has five signals worth computing on every completion:

Trailing-completeness signature. Inspect the last 32 characters of the output. Does it end on a complete sentence (period, question mark, closing quote, closing brace)? Or does it end mid-word, mid-bracket, mid-quote? A response that ends mid-token is almost certainly a truncation regardless of what stop_reason says. This catches cases where max_tokens is reported correctly but downstream code didn't branch on it, and cases where the runtime claims end_turn but the output ended in a comma.

Structured-output validity. If the prompt asked for JSON, parse it. If it asked for a tool call, validate the arguments against the tool's schema. The "JSON validity rate stratified by stop reason" is one of the highest-leverage observability metrics you can ship, because the failure mode "model hit max_tokens on the closing brace and wrote 3.5 KB of garbage that fails JSON parsing" is invisible until you check.

Refusal-phrase fingerprint. Maintain a small dictionary of soft-refusal openers — "I'm not able to", "I can't help with", "As a language model", "I would recommend speaking to a professional", and the model-specific hedging language your traffic actually surfaces. Compute the score on every response. The score is noisy on its own, but the delta in soft-refusal rate after a system-prompt change or model upgrade is the alarm you want. The Refusal Tokens line of research argues for explicit refusal calibration during training; until that ships universally, your fingerprint is the production substitute.

Length-vs-distribution z-score. Maintain a per-route distribution of output lengths. Flag responses that come in below the 5th percentile, especially when they pair with end_turn. The "answered in 3 tokens" empty-response failure mode reads as a clean turn from the API's perspective and as a screaming red flag from the distribution's perspective.

Last-token entropy proxy. You usually don't have logprobs in production, but you have proxies. A response that ends with a high-frequency continuation word (and, but, because) is more likely to have been cut off than a response ending with Done. or Let me know if you have questions. Track the bigram frequency of the final token; a moving average that spikes is your early warning that something widened the model's reasoning past the budget.
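A sketch of the proxy; the word list is a starting point, and the moving average over this boolean flag lives in your metrics pipeline, not here.

```python
# Connectives that rarely end a finished thought. A spike in the moving
# average of this flag is the early warning described above.
DANGLING_FINAL_WORDS = {"and", "but", "because", "or", "so", "the", "a", "to", "with"}

def ends_on_connective(text: str) -> bool:
    words = text.rstrip().rsplit(None, 1)
    return bool(words) and words[-1].lower().strip(",;:") in DANGLING_FINAL_WORDS
```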

None of these is sufficient alone. Combined, they give you the eight bits of state the four-bit API field was always going to compress away.

Wiring It Through the Span

The implementation discipline is what determines whether this taxonomy is actually useful at 2am. Three rules.

Surface enriched stop-reason as a separate span attribute. The OpenTelemetry GenAI semantic conventions reserve a finish_reason field on the output message structure for the provider's value. Add your own attribute alongside it — call it whatever you want, but make it a peer, not a replacement. When triage queries the trace, both fields are visible at once and the diff between them is the first thing the eye sees.

Never overwrite the provider's value. It is tempting to "normalize" everything into your own taxonomy. Don't. The provider's value is a contract surface that tools, dashboards, and SDK behaviors depend on. Keep it pristine. The point of the parallel taxonomy is that the diff between provider and derived is itself a signal.

Compute the taxonomy at the span boundary, not in the agent loop. The agent loop needs to make a decision in microseconds; trailing-completeness checks and refusal-phrase scoring can spend a few milliseconds on the observability path without blocking the hot path. This also means the taxonomy is computed exactly once, in one place, with one definition — instead of drifting into three different implementations across three teams.

A useful side effect: once the enriched stop-reason exists as a span attribute, you can build dashboards on it. Refusal rate by route. Truncation rate by model version. Empty-response rate after each system-prompt deploy. These are the dashboards that catch incidents before users open tickets, and you can't build them on the four-value field.

The Slow-Burn Failure Mode Worth Alerting On

The single highest-value alert this taxonomy unlocks is on the slow-burn max_tokens regression after a system-prompt change.

The pattern is consistent enough to be a checklist item. Someone widens the system prompt: adds a new instruction, expands a few-shot block, attaches a longer tool description. Quality goes up on the eval set because the model now has more context. Nobody touched max_tokens. A week later, the bottom 5% of responses by length are coming back truncated mid-thought, the JSON-validity rate has slipped two points on one route, and customer support starts seeing "the bot just stopped" tickets. The vendor's stop_reason: max_tokens is firing more often, but it's still a single-digit percentage of total traffic, so it doesn't trip the obvious alerts. Meanwhile your enriched stop-reason shows a 3x increase in the combination of max_tokens plus invalid-JSON, and that combination is what the alert should fire on.

Other combinations worth alerting on (sketched as predicates after the list):

  • end_turn + below-p5-length + soft-refusal-fingerprint-positive: the soft-refusal regression.
  • end_turn + content blocks contain tool_use: the SDK-bug failure mode where tool calls get reported as plain stops.
  • tool_use + tool-arg JSON invalid: the model invoked a tool with malformed arguments, and the runtime dispatched anyway.
  • end_turn + 0-3 output tokens after a tool result: the empty-response anti-pattern documented in Anthropic's own guidance.
  • pause_turn rate per agent route: the canary for an agent loop that doesn't know how to continue.
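As predicates over one enriched record, hedged: the field names are this post's derived taxonomy, not any vendor's schema.

```python
def alert_labels(r: dict) -> list[str]:
    """Map one enriched completion record to the alert buckets above."""
    labels = []
    if r["stop_reason"] == "max_tokens" and r["validity"] == "invalid-json":
        labels.append("slow-burn-truncation")
    if r["stop_reason"] == "end_turn" and r["length_flag"] == "below-p5" and r["soft_refusal"]:
        labels.append("soft-refusal-regression")
    if r["stop_reason"] == "end_turn" and r["has_tool_use_blocks"]:
        labels.append("tool-call-reported-as-stop")
    if r["stop_reason"] == "tool_use" and r["validity"] == "invalid-json":
        labels.append("malformed-tool-args")
    if r["stop_reason"] == "end_turn" and r["output_tokens"] <= 3 and r["after_tool_result"]:
        labels.append("empty-response")
    return labels
```

Each label maps one-to-one onto a bullet above; the alerts fire on rate deltas per route, not on single records.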

Every one of these is invisible if you alert on the API's stop_reason directly. Every one is trivially visible once the parallel taxonomy is wired up.

Why This Is an Architecture Problem, Not an Instrumentation Problem

The temptation is to treat this as a logging upgrade — add a few fields, ship a dashboard, move on. That misses the deeper point. The provider's stop_reason is their abstraction over their decoder loop. It is correct for their purposes (telling the SDK what to do next), and it will continue to be the wrong shape for your purposes (telling an engineer why an answer went wrong) no matter how many values they add. Every new model release will introduce a new value, and every new value will compress away the diagnostic you actually wanted.

The architectural realization is that you have to own the second abstraction. Your product surface is not the decoder; it is the answer your user receives, and the question "did the answer succeed" is a product question that no decoder-level field can answer. The parallel taxonomy is the type system you build over your own outputs, and the moment you commit to maintaining it you stop being held hostage to vendor field choices.

It also reframes how you read provider release notes. When Anthropic adds pause_turn or model_context_window_exceeded, those are not new bugs to handle — they are new information, granular signals the provider is now willing to share that you can integrate into your taxonomy. When OpenAI clarifies the tool_calls-vs-stop ambiguity, that is one inconsistency you can stop computing your way around. The vendor field gets richer over time; your derived field gets simpler. That is the right direction of travel, and it only happens if you own the derived field in the first place.

Where to Start Tomorrow

If you have nothing today, ship the trailing-completeness check first. It is twenty lines of code, it catches the highest-impact failure mode (silent mid-sentence truncation), and it gives the team something concrete to point at when arguing for the rest of the investment. Add the JSON-validity check the day after, because the cost of shipping invalid structured output to a downstream system tends to dominate every other category of failure once tool use shows up.

After that, the order is determined by your traffic. Heavy chat product? The soft-refusal fingerprint pays back fastest. Heavy agent product? The pause_turn and tool-arg-validity checks. Heavy structured-extraction product? The length-distribution z-score. The taxonomy is composable; ship signals as your incidents demand them, and keep them all under one span attribute so the dashboard story stays coherent.

The provider's stop_reason is a four-bit field, and the production question it gets asked is an eight-bit question. You can either wait for the vendor to grow the field — and discover during your next 2am incident that they didn't grow it where you needed — or you can build the second abstraction now. The teams that sleep through their incidents have already built it.
