Designing for Partial Completion: When Your Agent Gets 70% Done and Stops
Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.
The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.
The Four Failure Modes That Live Between Start and Done
Before designing for partial completion, it helps to categorize exactly where mid-task interruptions come from. There are four distinct causes, and each requires a different design response.
Duration overflow is the most measurable. Frontier models complete tasks that take skilled humans under four minutes with near-perfect reliability, but success rates collapse to under 10% for tasks that take humans more than four hours. This isn't a fluke; it's a predictable degradation curve. Agents lose coherence over long sequences due to context drift, accumulated uncertainty, and compounding tool errors. Any workflow that could plausibly take hours needs to be designed as interruptible from the start.
Authorization gaps discovered mid-execution are more insidious. An agent begins a task with a reasonable permission scope, then encounters a step that requires credentials it doesn't have — a write operation to a system it only has read access to, a financial transaction above an approval threshold, an API endpoint behind an extra auth layer. Static permission grants at task initialization can't anticipate the full execution tree. The agent stalls, often silently.
Context exhaustion causes agents to degrade before they hit an outright limit. Every frontier model degrades as input length grows — often starting as early as 30–50% of the stated context window. Agents don't crash when they approach the limit; they start making poorer decisions, forgetting earlier context, and eventually producing incoherent outputs. Sixty-five percent of enterprise AI production failures have been attributed to context drift rather than hitting hard limits.
Uncertainty and confidence collapse happens when the agent reaches a decision point where the right action is genuinely unclear and the cost of a wrong guess is high. Without explicit handling for these cases, agents either guess (producing errors that are hard to trace) or halt with opaque error messages that give operators nothing to work with.
The Checkpoint-First Design Principle
The prerequisite for handling any of these gracefully is state externalization. An agent that holds all task state in its context window has no recovery surface. The moment it fails, everything it knew about what it had done disappears.
Checkpoint architecture treats each step of a task as a discrete, persistable event. Before executing a step, write the intent. After completing it, write the result. The persisted state forms the basis for resumption, rollback, and audit — independently of whether the agent's next run remembers anything.
The implementation choice matters. For small checkpoints (under 350 KB), writing serialized state directly to DynamoDB with a thread ID and timestamp is sufficient. For larger checkpoints that include tool outputs, documents, or intermediate artifacts, the pattern shifts: write the payload to S3 and store a reference pointer in DynamoDB. The checkpointer is then just a pointer store, and the actual state lives in object storage.
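The small/large split above can be sketched in a few lines. This is an illustrative sketch, not a production client: plain dicts stand in for the DynamoDB pointer table and the S3 bucket, and the function and key names are assumptions, not any library's API.

```python
import json
import time
import uuid

# Stand-ins for real storage (swap in DynamoDB / S3 clients in production).
POINTER_STORE = {}   # pointer + metadata store (the "DynamoDB table")
OBJECT_STORE = {}    # large-payload store (the "S3 bucket")

SIZE_THRESHOLD = 350 * 1024  # 350 KB cutoff, per the pattern above


def save_checkpoint(thread_id: str, step: int, state: dict) -> str:
    """Persist one checkpoint; oversized payloads go to object storage."""
    payload = json.dumps(state).encode("utf-8")
    key = f"{thread_id}#{step}"
    record = {"thread_id": thread_id, "step": step, "ts": time.time()}
    if len(payload) <= SIZE_THRESHOLD:
        record["inline_state"] = payload.decode("utf-8")
    else:
        obj_key = f"checkpoints/{thread_id}/{step}-{uuid.uuid4()}.json"
        OBJECT_STORE[obj_key] = payload
        record["state_ref"] = obj_key  # the checkpointer holds only a pointer
    POINTER_STORE[key] = record
    return key


def load_checkpoint(thread_id: str, step: int) -> dict:
    """Resolve a checkpoint, following the pointer if state was offloaded."""
    record = POINTER_STORE[f"{thread_id}#{step}"]
    if "inline_state" in record:
        return json.loads(record["inline_state"])
    return json.loads(OBJECT_STORE[record["state_ref"]])
```

The caller never needs to know which path was taken; resumption code just loads by thread ID and step.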
The critical distinction is between workflow state and session state. Workflow state is durable — what the task has accomplished, what artifacts it has produced, where it is in the execution sequence. Session state is ephemeral — the current conversation context, in-flight tool calls, the agent's working memory. Many teams conflate these and end up rebuilding workflow progress on every server restart, because they stored everything in the session.
Workflow state should survive process restarts, agent model upgrades, and handoffs between agents. Session state by definition doesn't need to. Building checkpointing into task design means choosing what belongs in each category before the first line of agent code gets written.
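One way to force that choice is to make the two categories separate types, so nothing ambient can leak into the durable snapshot. The field names below are illustrative assumptions, not a framework's schema.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class WorkflowState:
    # Durable: survives restarts, model upgrades, and agent handoffs.
    task_id: str
    completed_steps: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)  # e.g. booking confirmations
    next_step: str = ""


@dataclass
class SessionState:
    # Ephemeral: conversation context and in-flight work; rebuilt each run.
    messages: list = field(default_factory=list)
    pending_tool_calls: list = field(default_factory=list)


def persistable(workflow: WorkflowState) -> dict:
    """Only workflow state crosses a process boundary."""
    return asdict(workflow)
```

Because `persistable` accepts only a `WorkflowState`, a restart rebuilds `SessionState` from scratch while the task's actual progress comes back from the checkpoint store.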
Three Patterns for Human-in-the-Loop Resumption
Once you have persisted task state, the question becomes how humans interact with partially completed work. Three patterns emerge from production systems, and choosing between them depends on task latency tolerance, error cost, and how often intervention is actually needed.
Synchronous approval gates pause execution at predefined points and require human confirmation before proceeding. The agent executes up to the gate, serializes its state, and surfaces a review UI with full context: what it has done, what it proposes to do next, and what happens if the user declines. This is the right pattern when the next step is irreversible or expensive — submitting a purchase order, sending an external communication, executing a database write. The cost is latency; human approval cycles add minutes to hours to task execution time. For workflows where the downstream cost of a wrong step exceeds the cost of waiting, that's the correct trade-off.
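A minimal sketch of that gate, under stated assumptions: the "human review UI" is reduced to an `approve` callback, and the return shapes are illustrative rather than any framework's contract.

```python
def run_with_gate(steps, is_irreversible, approve):
    """Execute steps in order, pausing before any irreversible one.

    `approve(context)` stands in for a human review surface; it receives
    what has been done, what is proposed next, and what declining means.
    """
    completed = []
    for step in steps:
        if is_irreversible(step):
            context = {
                "completed": list(completed),
                "proposed": step,
                "on_decline": "task pauses with state persisted",
            }
            if not approve(context):
                # Serialized state, not an error: the task can resume later.
                return {"status": "paused",
                        "completed": completed,
                        "pending": step}
        completed.append(step)
    return {"status": "done", "completed": completed}
```

Usage: declining the gate before `submit_po` leaves a resumable record of what finished, rather than a failed task.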
Asynchronous audit queues let agents run with full autonomy while logging every decision with enough context to allow after-the-fact review. The agent doesn't block waiting for human input, but a queue of decisions accumulates that operators can review, flag, or use to tune the system. This pattern is appropriate when task latency is tightly constrained and individual errors are recoverable — a customer service agent triaging tickets, a content generation pipeline producing drafts. The risk is that by the time a bad decision surfaces in the queue, the effect has already propagated.
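The non-blocking variant is even simpler to sketch: execute immediately, then enqueue the decision with its outcome for later review. A `deque` stands in for whatever queue or log store a real deployment would use; the record fields are assumptions.

```python
from collections import deque

AUDIT_QUEUE = deque()  # stand-in for a durable review queue


def act_with_audit(decision, execute):
    """Execute immediately; log the decision with context for review."""
    result = execute(decision)
    AUDIT_QUEUE.append({
        "decision": decision,   # what the agent chose, with its inputs
        "result": result,       # what actually happened
        "reviewed": False,      # flipped when an operator inspects it
    })
    return result
```

The agent's latency is unchanged; the cost shows up later, as operator time spent draining the queue.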
Multi-tier oversight assigns different oversight levels to different categories of action. Low-risk, well-bounded steps run autonomously. Steps that exceed a cost threshold, touch sensitive systems, or reach a confidence floor get escalated synchronously. The agent itself makes the escalation decision based on a policy it receives at initialization. This is the pattern most production deployments converge on — it distributes human attention toward the decisions that most need it.
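The escalation policy the agent receives at initialization can be as small as a threshold check. The thresholds and action fields below are illustrative, not drawn from any specific framework.

```python
def oversight_tier(action, policy):
    """Decide how much human attention an action gets."""
    if action["cost"] > policy["cost_threshold"]:
        return "sync_approval"           # block until a human signs off
    if action["target"] in policy["sensitive_systems"]:
        return "sync_approval"
    if action["confidence"] < policy["confidence_floor"]:
        return "sync_approval"
    return "autonomous_with_audit"       # run now, log for later review


# Example policy handed to the agent at initialization (values are
# illustrative).
POLICY = {
    "cost_threshold": 500.0,
    "sensitive_systems": {"billing", "prod_db"},
    "confidence_floor": 0.7,
}
```

Tuning is then a policy edit, not an agent change: tightening the confidence floor routes more decisions to humans without touching prompts or tools.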
In all three cases, the underlying requirement is the same: the agent must be able to articulate its state clearly enough for a human to make a meaningful decision. An interruption that presents "task failed at step 7" is not an approval gate — it's an error. A real interruption surfaces: what has been completed, what intermediate artifacts exist, what the next proposed step is, and what the decision branches look like if the human chooses differently.
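Concretely, those four elements can be a single structured payload, using the travel example from the opening. Field names and values here are hypothetical illustrations.

```python
# What a useful interruption surfaces to the human, as one record:
interruption = {
    # What has been completed so far.
    "completed": ["search_flights", "book_flight"],
    # Intermediate artifacts that already exist in the real world.
    "artifacts": {"flight_confirmation": "ABC123"},
    # The next proposed step.
    "proposed_next": "book_hotel",
    # What each decision branch means for the task.
    "branches": {
        "approve": "agent books the top-ranked hotel",
        "decline": "itinerary returned as-is, flight only",
        "edit": "human adjusts hotel constraints, agent resumes",
    },
}
```

A payload like this is what separates an approval gate from "task failed at step 7": every field gives the reviewer something to act on.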
