Skip to main content

The Finish Reason Your Code Never Inspects

· 10 min read
Tian Pan
Software Engineer

Your handler did everything right. The HTTP status was 200. The body parsed. The text field had characters in it. You incremented responses_succeeded, appended the message to the conversation, returned the JSON down to the client, and moved on. The user got a sentence that ended mid-clause, a redacted answer dressed up as a normal one, or a polite refusal phrased as a completion. Your dashboard does not know any of that happened. The provider told you. You did not read the field.

Every major inference API returns a stop signal alongside the text: OpenAI calls it finish_reason, Anthropic calls it stop_reason, Gemini calls it finishReason. The field is small. It is one enum value per response. It is also the only out-of-band channel the model has for telling you whether the response you just shipped is the answer or a fragment of one. Treating it as cosmetic is the same shape of bug as ignoring HTTP status codes — except your monitoring caught the HTTP one a decade ago and has no opinion about this one.

The values, and what each of them is actually saying

The values are not interchangeable, and the differences between providers are smaller than the differences between values within a provider. Anthropic publishes end_turn, max_tokens, stop_sequence, tool_use, pause_turn, refusal, and model_context_window_exceeded. OpenAI publishes stop, length, tool_calls, content_filter, and the legacy function_call. Gemini publishes STOP, MAX_TOKENS, SAFETY, RECITATION, LANGUAGE, BLOCKLIST, PROHIBITED_CONTENT, SPII, and MALFORMED_FUNCTION_CALL. A handler that treats anything non-empty as success is treating ten distinct outcomes as one.

Three of them deserve special attention because they are the ones that masquerade most convincingly as success.

Length, or max_tokens, or MAX_TOKENS. The model was generating fine and got cut off because it ran out of token budget. The text in the response is real text. It is also incomplete. If the answer was a JSON object, it is malformed. If the answer was a code block, it is missing its closing brace. If the model was emitting a tool call, the tool call is truncated mid-arguments — Anthropic's docs spell this out: when stop_reason == "max_tokens" and the last content block is tool_use, retry with a higher max_tokens. If the answer was prose, it ends mid-clause and your user gets to wonder what the rest of the sentence was. The field is the model telling you this. The field is also the only way to know. A naive parser will fail loudly on the malformed JSON cases and silently on the prose case, which is the worst possible distribution: the failures that are easy to monitor become exceptions, and the failures that are hard to monitor become the user experience.

Content_filter, or SAFETY, or PROHIBITED_CONTENT, or BLOCKLIST, or refusal. The model produced something and the provider's safety layer (or the model itself) intercepted it. The response field is often empty, or contains a generic deflection, or contains a partial generation truncated at the policy boundary. Your success counter ticks because the request did not error. Your user gets a polite non-answer. The downstream system that was supposed to act on the answer gets a string that does not contain what it expected. Anthropic returns refusal and, on newer models, a stop_details.category of "cyber" or "bio" to tell you what tripped — a structured signal that should fan out into distinct routing logic, not collapse into "non-empty response, ship it." Anthropic's refusal is also the only place where the docs explicitly suggest swapping models as a remediation: Sonnet 4.5 and Opus 4.1 refuse more aggressively than Haiku 4.5, and you cannot tell that from the text field alone.

Tool_use, or tool_calls, or MALFORMED_FUNCTION_CALL. The model wanted to call a tool. Sometimes the tool exists in your registry and your orchestration loop dispatches it. Sometimes the model invented a tool name because the prompt suggested capabilities you did not wire up. Sometimes the model's JSON for the arguments was malformed and the provider's parser caught it before it reached you. The text portion of the response, in any of these cases, is either empty or a placeholder. A handler that returns text to the user when finish_reason == "tool_calls" is returning either nothing, a half-finished thought, or a hallucinated tool call serialized as prose. None of those are the answer. All of them increment success.

The remaining values matter less individually but compound the same way. stop_sequence means a custom token boundary was hit, which is only interesting if your prompt was relying on the boundary being a real semantic edge. pause_turn is Anthropic's signal that a server-tool loop hit its iteration cap, and the contract is that you continue the conversation by sending the response back unchanged — a continuation flow that has no analog in the rest of your error handling. RECITATION is Gemini's signal that the model started reproducing copyrighted text verbatim and was cut off mid-reproduction. model_context_window_exceeded is Anthropic's way of telling you the response fit the available context but the model would have kept going if it had more room. Each is a different remediation. Each maps to a different log line, a different counter, a different decision in your retry or routing logic. The collapse to "non-empty text → success" loses all of them.

The contract field hiding inside the SDK

The reason this bug is so common is that the SDK ergonomics push you toward it. The provider SDKs return a structured response. The first thing every tutorial does is reach into response.content[0].text or response.choices[0].message.content and treat that string as the answer. The stop reason sits one field over. It is in the same object. It is documented. It is also two extra characters of code to read and ten or twenty to dispatch on, and most production handlers were written by someone who was racing to ship the happy path and never came back. The path of least resistance is the bug, and the language design rewards taking it.

The pattern shows up across SDKs the same way. Anthropic's docs ship a handle_response function whose first three branches are tool_use, max_tokens, model_context_window_exceeded — that is the prescribed shape. OpenAI's community forum is full of threads from engineers who shipped a code path that broke when finish_reason == "length" showed up in production for the first time and the JSON parser started throwing on truncated objects. LiteLLM has open issues about normalizing finish_reason across providers because the values are similar but not identical and the cross-provider mapping cannot be defaulted. The Gemini CLI has an open issue titled "Silent termination when Gemini API output token limit is reached" — a bug where the tool returned the truncated text without surfacing that it was truncated, because the wrapper around the SDK did not check the field. The pattern repeats because the SDKs make the field optional to look at, and humans default to optional.

Treat finish_reason as a predicate on success, not metadata

The fix is structural, not cosmetic. The handler at the boundary of your application code should treat the stop signal the way it treats HTTP status: as a required predicate on the success of the operation, not as one more field in the response body to log if you feel like it. Concretely:

Make the stop reason a required input to every response handler. The function that turns a model response into "the answer" should not be allowed to ignore it. In typed languages, model the response as a discriminated union on the stop reason — the type system then forbids the lazy unwrap. In untyped languages, a thin wrapper with explicit branches gets you the same enforcement at runtime.

Map each value to a distinct error class your retry and routing logic can act on. Truncation by max_tokens should trigger a continuation request, not a retry; the input is fine, the budget was wrong. Truncation by model_context_window_exceeded means the input is too large and a retry with the same parameters will produce the same outcome — what you want is a summarization step or a longer-context model, not a redrive. A refusal should not be retried at all on the same model; the cheap remediation is to route the request to a model with different safety calibration, which Anthropic's docs explicitly suggest. A tool_use or tool_calls value means the orchestration loop owes the model a tool result, not a fresh request. A content_filter or SAFETY value means the prompt or the user input crossed a policy boundary and the right action is to surface that to the user, not silently degrade.

Graph each value as a first-class metric, not a tag on the success counter. The reason silent failures stay silent is that the dashboard aggregates "the request succeeded" and the field that says "but the answer is incomplete" is in a low-cardinality log nobody alerts on. Promote it. Track responses_truncated_by_max_tokens as its own time series. Track responses_refused as its own series, broken out by stop_details.category where the provider gives you one. Alert when any of them moves. A spike in responses_truncated after a prompt change is the kind of signal that reaches you in hours instead of weeks, but only if it has a graph of its own.

Make the value visible in your traces. When you log a request, log the stop reason as a top-level attribute, not buried in a payload blob. A trace that ends with stop_reason=refusal and a downstream system that processed the empty text field as if it were an answer is a story that should be readable end-to-end without a regex on the payload column.

Why this is the same bug as ignoring HTTP status codes

There is a useful frame here. Two decades ago, a generation of HTTP handlers were written that read the response body without checking the status code, because the body was non-empty and the SDK gave you the body. Production broke. The industry learned, the linters learned, and the SDKs started making it harder to do the wrong thing. We now think of "check the status code" as table stakes.

The stop reason on a model response is the same field. It is the provider's out-of-band signal that the apparent success of the response is conditional. The body parses. The HTTP status is 200. The semantics live in the metadata. A handler that reads response.content[0].text without reading response.stop_reason is making the same category of mistake as a 2005-era client that read response.body without reading response.status — the difference is only that the failure mode is harder to see, because the body is a string of plausible English instead of an HTML error page.

The signal is in the contract. The provider is telling you. The team that writes the field-aware handler today does not get any credit for it because nothing visibly broke. The team that does not write it gets to spend the next six months explaining to support why the success rate is green and the customer experience is degraded. The cheap discipline is to treat the stop reason as load-bearing from the first commit, before the dashboards were ever drawn — and to refuse to ship a handler that does not branch on it.

The reward for reading one extra field is that the bugs that were going to bury themselves in your conversion funnel become visible in the layer where they originate. The cost is two more lines of code per call site. The ratio is unambiguous. The field is already in the response.

References:Let's stay in touch and Follow me for more thoughts and updates