"Done!" Is Not a Return Code: Why Agent Completion Needs a Structured Signal
An agent ends its turn with "All done — let me know if you want any changes!" and your orchestrator has to decide whether to mark the ticket resolved, kick off the next handoff, or retry. That sentence is not a return code. It is a polite closing line trained to sound reassuring at the end of a chat, and every line of automation downstream of it inherits the ambiguity. The teams that treat this as a parsing problem write regexes that catch \b(done|complete|finished)\b and call it a day. The teams whose agents run in production eventually learn that completion is an event, not a mood.
The failure mode is bimodal and boring. Either the agent announces done when it isn't — premature termination — and the orchestrator happily advances the workflow on a half-finished artifact. Or the agent is actually done, but phrases it in a way that doesn't match the detector ("I went ahead and landed the change, though the test for the edge case is still flaky"), and the orchestrator spins up a retry that re-does the work, duplicates the side effect, and sometimes contradicts the successful first pass. Both modes degrade silently. Neither shows up in a dashboard until someone reads a trace and notices that the agent said "I think that covers it" and the billing system treated that as a commit.
The fix is not smarter parsing. It is giving the agent a structured way to terminate — a done-tool with an enumerated status, a reason code, and a handle your pipeline can route on — and changing the orchestrator to wait for that event instead of listening to the chat stream.
The Surface Area of "Done"
Prose-based completion detection scales with the vocabulary of the model, which is effectively unbounded. A partial catalog of phrases that practitioners have had to handle, just from one quarter of production logs: "all set," "should be good to go," "that should do it," "I've wrapped that up," "that's everything," "nothing else to add," "done, pending your review," "I'll leave it here for now," "let me know if you need more," "finished — with a small caveat." The first six are unambiguous wins. The last four are all ambiguous: pending your review is a handoff, for now is a pause, let me know if you need more is a standing offer rather than a status, with a small caveat is a partial. Your regex catches them all the same.
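To see how little a pattern can distinguish here, run a toy version of the detector over that catalog. This is a minimal sketch, using the \b(done|complete|finished)\b family from earlier (plus the past tense, which the naive version misses):

```python
import re

# Toy detector: the \b(done|complete|finished)\b family from earlier.
DONE_RE = re.compile(r"\b(done|completed?|finished)\b", re.IGNORECASE)

closings = [
    "All set.",                                  # actually done, but no match
    "Done, pending your review.",                # a handoff, but it matches
    "I'll leave it here for now.",               # a pause, no match
    "Finished, with a small caveat.",            # a partial, but it matches
    "Completed initial attempt; write failed.",  # a failure summary, but it matches
]

for line in closings:
    verdict = "DONE" if DONE_RE.search(line) else "----"
    print(verdict, line)
```

The detector fires on a handoff, a partial, and a failure summary, and stays silent on a genuine completion and a pause. Every tuning pass just moves which rows are wrong.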
The problem is not that any individual phrase is hard to classify — you could tune a classifier and hit 90% on a frozen dataset. The problem is that the surface area grows every time the model version rolls, because decode-time behavior drifts with checkpoints, and the phrases that used to be rare ("I'll go ahead and...") become common overnight. Each drift is a silent regression in your completion detector. You ship a prompt update, the agent picks up a new valediction, and three weeks later someone notices retries are up 15%.
Structured completion does not have this property. A tool call named task_complete with status: "success" has exactly one surface. The agent either calls it or it doesn't. There is no phrasing drift, no hedging in tool arguments the way there is in natural language, and the parsing is a schema check rather than a language model.
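As a concrete sketch, here is that single surface expressed as a function-calling tool schema. The fields beyond status anticipate the contract section below, and the exact envelope varies by provider; treat this as a shape, not a spec:

```python
# Sketch of task_complete as an OpenAI-style function-calling JSON Schema.
# The envelope differs by provider; the point is the enum, not the wrapper.
TASK_COMPLETE_TOOL = {
    "name": "task_complete",
    "description": "Call exactly once to end the task. Never end a turn with prose alone.",
    "parameters": {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "enum": ["success", "partial", "failed", "needs_input", "aborted"],
            },
            "reason_code": {"type": "string"},
            "resumable_handle": {"type": ["string", "null"]},
            "summary": {"type": "string"},
        },
        "required": ["status", "reason_code", "resumable_handle", "summary"],
    },
}
```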
The research on why multi-agent systems fail backs this up: in the survey of 150+ traces that produced the MASFT taxonomy, premature termination — an agent declaring "done" before completing subtasks — accounts for roughly 6% of failures, and the paper traces it specifically to star-topology systems where termination conditions were never explicit. The topology made the bug latent. Structured completion makes it loud.
The Three Shapes of Prose-Done Failure
The first shape is the false positive — the orchestrator believes the agent finished when it didn't. This is the expensive one because downstream state advances. A common pattern: the agent catches an exception, writes a summary that includes the word "completed" ("completed initial attempt; write failed"), and the regex grabs it. The workflow moves on. The artifact is missing. The bug surfaces two steps later when a consumer reads an empty row and fails, but the trace points at the consumer, not at the agent that never wrote.
The second shape is the false negative — the agent actually finished, but the orchestrator retries. This is the duplicate-side-effect case. Retries in agent workflows are rarely idempotent because the agent's first-pass reasoning populates its context and changes the plan the second time around. The retry might re-send an email, re-open a PR, or re-run a SQL migration that was already applied. You end up with two tickets in Jira that claim to resolve the same issue, and neither of them is the canonical record.
The third shape is the silent quit — the agent's turn ends without any completion phrasing at all. The final assistant message is a reasoning trace, a partial plan, or a tool error that the agent didn't know how to recover from. The orchestrator has no regex hit for done, no regex hit for failure, and often just logs "agent turn ended" and waits. In production this becomes the longest-tail incident class: hours of undetected downtime where the pipeline is nominally healthy but nothing is moving. One public post-mortem from a team running internal ops agents reported six hours of silent failure before a human noticed — the agent had hit a rate limit, summarized the error, and ended its turn, and the orchestrator had no signal to retry on.
All three shapes have the same root cause: the contract between agent and orchestrator is a natural-language convention, not a typed interface. The agent is free to express "done" in any way it wants, and the orchestrator is forced to reverse-engineer intent from a lossy summary.
The Structured Completion Contract
A workable contract has three fields, and they all matter. Removing any one of them collapses the contract back toward prose:
- status — an enum. Typical values: success, partial, failed, needs_input, aborted. Not free text. The orchestrator's state machine has exactly one transition per status value, and ambiguity is a schema violation, not a judgment call.
- reason_code — a short machine-readable code scoped to the status: failed: validation_failed, failed: tool_error, needs_input: ambiguous_requirement, partial: budget_exhausted. The orchestrator uses this to route: tool_error retries with backoff, validation_failed escalates to a review queue, ambiguous_requirement surfaces a clarification in the UI, budget_exhausted checkpoints and schedules a resume.
- resumable_handle — an opaque token, usually a pointer to checkpointed state (or null if the task is terminal). The retry path does not re-send the whole conversation; it resumes from the handle. Without this, partial completions force full restarts and the cost of a retry equals the cost of a fresh run.
A complete signal looks like task_complete({status: "partial", reason_code: "budget_exhausted", resumable_handle: "ckpt_a7f2", summary: "Reviewed 8 of 12 files; remaining 4 queued."}). The summary is still prose — the agent is still a language model, and humans will read the artifact — but the fields the orchestrator acts on are all typed.
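On the receiving side, dispatch can be a literal routing table over (status, reason_code), as in this sketch. The handlers here are print stubs standing in for real queue writes, checkpoint loads, and workflow transitions; the names are illustrative, not a framework API:

```python
from dataclasses import dataclass

@dataclass
class Completion:
    status: str                 # success | partial | failed | needs_input | aborted
    reason_code: str
    resumable_handle: str | None
    summary: str

class SchemaViolation(Exception):
    pass

# Illustrative stub handlers; in production these touch real infrastructure.
def advance(c):  print("advance workflow:", c.summary)
def retry(c):    print("retry with backoff from", c.resumable_handle)
def review(c):   print("escalate to review queue:", c.summary)
def clarify(c):  print("surface clarification:", c.summary)
def resume(c):   print("checkpoint + schedule resume from", c.resumable_handle)

ROUTES = {
    ("failed", "tool_error"):                 retry,
    ("failed", "validation_failed"):          review,
    ("needs_input", "ambiguous_requirement"): clarify,
    ("partial", "budget_exhausted"):          resume,
}

def dispatch(c: Completion) -> None:
    if c.status == "success":
        return advance(c)
    handler = ROUTES.get((c.status, c.reason_code))
    if handler is None:
        # An unroutable pair is a schema violation, not a judgment call.
        raise SchemaViolation(f"unroutable completion: {c.status}/{c.reason_code}")
    handler(c)

dispatch(Completion("partial", "budget_exhausted", "ckpt_a7f2",
                    "Reviewed 8 of 12 files; remaining 4 queued."))
```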
The framework providers have been converging on this shape. OpenAI's Agents SDK terminates its run loop when the agent produces output matching a declared output_type and makes no further tool calls, which is a structural check rather than a string match. Microsoft's Agent Framework implements handoff as a dedicated tool call rather than a prose cue; the orchestrator reacts to the tool, not to a paragraph. LangGraph models termination as a graph transition to an explicit end node. In every case the move is the same: replace "the agent said it's done" with "the agent fired a typed event."
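For the OpenAI case, the structural check looks roughly like this sketch against the Agents SDK docs linked below. The TaskCompletion model is this post's contract, not a type the SDK ships, and the ticket input is invented:

```python
# Sketch against the OpenAI Agents SDK (openai-agents-python): the run loop
# ends when the model produces a final output matching output_type and makes
# no further tool calls. TaskCompletion is this post's contract, not the SDK's.
from pydantic import BaseModel
from agents import Agent, Runner

class TaskCompletion(BaseModel):
    status: str                # success | partial | failed | needs_input | aborted
    reason_code: str
    resumable_handle: str | None
    summary: str

agent = Agent(
    name="Ticket resolver",
    instructions="Resolve the ticket. End by producing a TaskCompletion, never prose.",
    output_type=TaskCompletion,
)

result = Runner.run_sync(agent, "Investigate and close ticket OPS-1432.")
completion = result.final_output   # a typed TaskCompletion, not a closing sentence
print(completion.status, completion.reason_code)
```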
Migrating a Chat-Shaped Agent
Most agents in production today were prototyped as chat. The system prompt tells them to help the user, the turn ends when they stop producing tokens, and some script downstream reads the last message. Moving to structured completion is mostly mechanical, but the migration has two sharp edges.
The first is that the agent has to actually call the tool. Adding a task_complete tool to the toolset is not enough — RLHF-trained models will default to the trained valediction ("I've done X, let me know if you need anything else") unless the system prompt makes the tool mandatory and the behavior is reinforced with examples. A common failure in the first week of migration is that half the runs call the tool correctly and the other half end with prose because the model fell back to chat habits. The fix is a parse-fail path that treats "turn ended without a completion tool call" as its own failure mode — usually a retry with a stronger system-prompt nudge — rather than pretending the turn was successful.
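A sketch of that parse-fail path, with run_agent_turn as a stub that simulates the chat-habit relapse (the first turn ends in prose, the nudged retry calls the tool):

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class Turn:
    tool_calls: list[ToolCall] = field(default_factory=list)

class MissingCompletionEvent(Exception):
    pass

NUDGE = "REMINDER: end the task by calling task_complete. Prose endings are discarded."

def run_agent_turn(task: str, extra_system: str = "") -> Turn:
    # Stub standing in for your real agent invocation. It simulates the
    # relapse: the un-nudged turn ends in prose, the nudged one calls the tool.
    if not extra_system:
        return Turn()  # turn ended with prose only: zero tool calls
    return Turn([ToolCall("task_complete", {
        "status": "success", "reason_code": "done",
        "resumable_handle": None, "summary": task,
    })])

def run_until_complete(task: str, max_attempts: int = 3) -> dict:
    extra = ""
    for attempt in range(max_attempts):
        turn = run_agent_turn(task, extra_system=extra)
        for call in turn.tool_calls:
            if call.name == "task_complete":
                return call.arguments          # the typed completion event
        # No completion call: its own failure mode. Log it, nudge, retry.
        # Do NOT fall back to parsing the prose.
        print(f"missing_completion_event attempt={attempt}")
        extra = NUDGE
    raise MissingCompletionEvent(task)

print(run_until_complete("close ticket OPS-1432"))
```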
The second is that the orchestrator has to stop reading the assistant stream for intent. Once the contract is typed, any logic that peeks at message content to decide what to do next is a regression. It is tempting to keep the old regex "just as a backstop" — don't. The backstop will mask failures in the real signal and you will never find out the agent has stopped calling the tool until the dashboard lies to you for a week. If task_complete is missing, the orchestrator should treat that as an incident, not smooth it over.
On the observability side, structured completion turns into a real event stream. Instead of "turn 47 ended with a 180-character message containing the word 'done'," you get "task_complete with status=partial, reason_code=budget_exhausted, resumable_handle=ckpt_a7f2, trace_id=..." Correlation IDs flow naturally. SLOs become definable: p95(time_to_task_complete), rate(status=failed), rate(missing_completion_event). You cannot build any of these from a prose transcript.
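Those SLOs fall out directly once the event exists. A sketch using prometheus_client, with metric names that are mine, not a standard:

```python
# The three SLOs above as Prometheus instruments, emitted at the one place
# the orchestrator receives completion events.
from prometheus_client import Counter, Histogram

TIME_TO_COMPLETE = Histogram(
    "agent_time_to_task_complete_seconds",
    "Wall-clock time from task start to the task_complete event",
)
COMPLETIONS = Counter(
    "agent_task_completions_total",
    "Completion events by status and reason code",
    ["status", "reason_code"],
)
MISSING = Counter(
    "agent_missing_completion_events_total",
    "Turns that ended without a task_complete call",
)

def record_completion(event: dict, elapsed_seconds: float) -> None:
    TIME_TO_COMPLETE.observe(elapsed_seconds)
    COMPLETIONS.labels(event["status"], event["reason_code"]).inc()
```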
What This Costs and What It Buys
The cost of structured completion is real but small: one tool schema, a handful of enum values, a system-prompt change, and a bit of orchestrator code to dispatch on status. The first week of runs will surface a class of latent failures that were previously silent — turns that ended without a completion event, reason codes that don't route cleanly, partial completions that the team hadn't modeled. That is not a regression; that was your pre-existing bug surface becoming visible for the first time.
The buy is a sharper boundary between the agent and everything around it. An agent with structured completion is a black box with a typed interface, which means you can swap models, adjust prompts, and upgrade frameworks without the orchestrator having to re-learn the agent's idiom. Observability stops being transcript archaeology. Retries become bounded and idempotent because the resumable handle replaces conversation replay. Silent failure — the failure mode that outlasts every other category in production incidents — becomes loud by default, because a missing completion event is a schema violation and your pipeline knows how to alert on schema violations.
"Done" is an intent. Your pipeline needs an event. The cheapest way to stop confusing the two is to stop letting the agent express completion in language and start making it emit a signal the system was built to read.
- https://openai.github.io/openai-agents-python/running_agents/
- https://learn.microsoft.com/en-us/agent-framework/workflows/checkpoints
- https://arxiv.org/pdf/2503.13657
- https://orq.ai/blog/why-do-multi-agent-llm-systems-fail
- https://galileo.ai/blog/agent-failure-modes-guide
- https://temporal.io/blog/of-course-you-can-build-dynamic-ai-agents-with-temporal
- https://dev.to/bobrenze/ai-agent-silent-failures-what-6-hours-of-undetected-downtime-taught-me-about-monitoring-3ja8
- https://medium.com/@deolesopan/data-contracts-for-agents-keep-tools-and-schemas-stable-as-systems-evolve-8af6f3e024ba
- https://learn.microsoft.com/en-us/agent-framework/user-guide/workflows/orchestrations/handoff
- https://www.chanl.ai/blog/memory-silent-failure-mode
