Designing for Partial Completion: When Your Agent Gets 70% Done and Stops

· 10 min read
Tian Pan
Software Engineer

Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.

The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.

The Four Failure Modes That Live Between Start and Done

Before designing for partial completion, it helps to categorize exactly where mid-task interruptions come from. There are four distinct causes, and each requires a different design response.

Duration overflow is the most measurable. Current frontier models complete tasks lasting under four minutes with near-perfect reliability; success rates collapse to under 10% for tasks exceeding four hours. This isn't a fluke — it's a predictable degradation curve. Agents lose coherence over long sequences due to context drift, accumulated uncertainty, and compounding tool errors. Any workflow that could plausibly take hours needs to be designed as interruptible from the start.

Authorization gaps discovered mid-execution are more insidious. An agent begins a task with a reasonable permission scope, then encounters a step that requires credentials it doesn't have — a write operation to a system it only has read access to, a financial transaction above an approval threshold, an API endpoint behind an extra auth layer. Static permission grants at task initialization can't anticipate the full execution tree. The agent stalls, often silently.

Context exhaustion causes agents to degrade before they hit an outright limit. Every frontier model degrades as input length grows — often starting as early as 30–50% of the stated context window. Agents don't crash when they approach the limit; they start making poorer decisions, forgetting earlier context, and eventually producing incoherent outputs. Sixty-five percent of enterprise AI production failures have been attributed to context drift rather than hitting hard limits.

Uncertainty and confidence collapse happens when the agent reaches a decision point where the right action is genuinely unclear and the cost of a wrong guess is high. Without explicit handling for these cases, agents either guess (producing errors that are hard to trace) or halt with opaque error messages that give operators nothing to work with.

The Checkpoint-First Design Principle

The prerequisite for handling any of these gracefully is state externalization. An agent that holds all task state in its context window has no recovery surface. The moment it fails, everything it knew about what it had done disappears.

Checkpoint architecture treats each step of a task as a discrete, persistable event. Before executing a step, write the intent. After completing it, write the result. The persisted state forms the basis for resumption, rollback, and audit — independently of whether the agent's next run remembers anything.

The implementation choice matters. For small checkpoints (under 350 KB), writing serialized state directly to DynamoDB with a thread ID and timestamp is sufficient. For larger checkpoints that include tool outputs, documents, or intermediate artifacts, the pattern shifts: write the payload to S3 and store a reference pointer in DynamoDB. The checkpointer is then just a pointer store, and the actual state lives in object storage.
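
A minimal sketch of that size-based routing, with plain dicts standing in for DynamoDB and S3 (a real checkpointer would use boto3; the key layout and helper names here are illustrative):

```python
import json
import time

SIZE_THRESHOLD = 350 * 1024  # inline vs. pointer cutoff, per the pattern above

# In-memory stand-ins for DynamoDB and S3, so the sketch runs anywhere.
dynamo_table = {}
object_store = {}

def write_checkpoint(thread_id: str, step: int, state: dict) -> dict:
    """Persist one step's state, inlining small payloads and offloading large ones."""
    payload = json.dumps(state).encode()
    item = {"thread_id": thread_id, "step": step, "ts": time.time()}
    if len(payload) <= SIZE_THRESHOLD:
        item["state"] = payload.decode()      # small: store inline with the record
    else:
        key = f"checkpoints/{thread_id}/{step}.json"
        object_store[key] = payload           # large: offload to object storage
        item["state_ref"] = key               # the record keeps only a pointer
    dynamo_table[(thread_id, step)] = item
    return item

def read_checkpoint(thread_id: str, step: int) -> dict:
    """Resolve a checkpoint regardless of where its payload lives."""
    item = dynamo_table[(thread_id, step)]
    raw = item.get("state") or object_store[item["state_ref"]].decode()
    return json.loads(raw)
```

Because resumption only ever goes through `read_checkpoint`, the agent never needs to know which storage tier a given step landed in.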

The critical distinction is between workflow state and session state. Workflow state is durable — what the task has accomplished, what artifacts it has produced, where it is in the execution sequence. Session state is ephemeral — the current conversation context, in-flight tool calls, the agent's working memory. Many teams conflate these and end up rebuilding workflow progress on every server restart, because they stored everything in the session.

Workflow state should survive process restarts, agent model upgrades, and handoffs between agents. Session state by definition doesn't need to. Building checkpointing into task design means choosing what belongs in each category before the first line of agent code gets written.
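
One way to force that choice is to give the two kinds of state separate types, so nothing durable can accidentally live only in the session. A sketch, with field names invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowState:
    # Durable: survives restarts, model upgrades, and agent handoffs.
    task_id: str
    completed_steps: list = field(default_factory=list)
    artifacts: dict = field(default_factory=dict)
    next_step: str = ""

@dataclass
class SessionState:
    # Ephemeral: rebuilt on every run from the durable record.
    messages: list = field(default_factory=list)
    in_flight_tool_calls: list = field(default_factory=list)

def resume(workflow: WorkflowState) -> SessionState:
    """Rebuild a fresh session from durable state instead of replaying the task."""
    summary = (
        f"Resuming task {workflow.task_id}: "
        f"{len(workflow.completed_steps)} steps done, next is '{workflow.next_step}'."
    )
    return SessionState(messages=[{"role": "system", "content": summary}])
```

The payoff is that a server restart costs one `resume` call, not a replay of every completed step.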

Three Patterns for Human-in-the-Loop Resumption

Once you have persisted task state, the question becomes how humans interact with partially completed work. Three patterns emerge from production systems, and choosing between them depends on task latency tolerance, error cost, and how often intervention is actually needed.

Synchronous approval gates pause execution at predefined points and require human confirmation before proceeding. The agent executes up to the gate, serializes its state, and surfaces a review UI with full context: what it has done, what it proposes to do next, and what happens if the user declines. This is the right pattern when the next step is irreversible or expensive — submitting a purchase order, sending an external communication, executing a database write. The cost is latency; human approval cycles add minutes to hours to task execution time. For workflows where the downstream cost of a wrong step exceeds the cost of waiting, that's the correct trade-off.

Asynchronous audit queues let agents run with full autonomy while logging every decision with enough context to allow after-the-fact review. The agent doesn't block waiting for human input, but a queue of decisions accumulates that operators can review, flag, or use to tune the system. This pattern is appropriate when task latency is tightly constrained and individual errors are recoverable — a customer service agent triaging tickets, a content generation pipeline producing drafts. The risk is that by the time a bad decision surfaces in the queue, the effect has already propagated.

Multi-tier oversight assigns different oversight levels to different categories of action. Low-risk, well-bounded steps run autonomously. Steps that exceed a cost threshold, touch sensitive systems, or reach a confidence floor get escalated synchronously. The agent itself makes the escalation decision based on a policy it receives at initialization. This is the pattern most production deployments converge on — it distributes human attention toward the decisions that most need it.
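
A hedged sketch of such an escalation policy; the thresholds and system names are placeholders, and a real deployment would load them from configuration rather than hard-code them:

```python
from dataclasses import dataclass

@dataclass
class EscalationPolicy:
    # Illustrative thresholds — in practice these come from the policy the
    # agent receives at initialization.
    cost_threshold: float = 100.0
    confidence_floor: float = 0.8
    sensitive_systems: frozenset = frozenset({"billing", "prod_db"})

    def tier(self, action: dict) -> str:
        """Decide how much human oversight a proposed step needs."""
        if not action.get("reversible", True):
            return "sync_approval"   # irreversible: block until a human confirms
        if action.get("cost", 0.0) > self.cost_threshold:
            return "sync_approval"
        if action.get("system") in self.sensitive_systems:
            return "sync_approval"
        if action.get("confidence", 1.0) < self.confidence_floor:
            return "sync_approval"
        return "async_audit"         # low-risk: run now, log for later review
```

Note that the checks are ordered so any single trigger is enough to escalate — the autonomous path is the fall-through, not the default.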

In all three cases, the underlying requirement is the same: the agent must be able to articulate its state clearly enough for a human to make a meaningful decision. An interruption that presents "task failed at step 7" is not an approval gate — it's an error. A real interruption surfaces: what has been completed, what intermediate artifacts exist, what the next proposed step is, and what the decision branches look like if the human chooses differently.

Context Management Before It Becomes an Emergency

Most context management strategies are reactive — triggered after the agent is already deep into degradation. A proactive approach treats context as a budget that needs to be managed throughout task execution, not recovered from when it runs out.

The practical threshold for triggering active context management is around 70% of the available budget. At that point, the agent should summarize completed task history into a compact representation, discard verbose intermediate tool outputs that have already been processed, and carry forward only the facts needed to complete remaining steps. This compacted summary replaces the raw conversation history in subsequent context windows.
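
A rough sketch of that trigger. The four-characters-per-token estimate is a crude stand-in for a real tokenizer, and `summarize` is whatever model call produces the compact history:

```python
def estimate_tokens(messages: list) -> int:
    # Rough heuristic: ~4 characters per token. A real agent would use the
    # model's own tokenizer instead.
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(messages: list, budget: int, summarize, threshold: float = 0.7) -> list:
    """Above the threshold, collapse processed history into one summary message."""
    if estimate_tokens(messages) < threshold * budget:
        return messages
    head, tail = messages[:-2], messages[-2:]  # keep the most recent turns verbatim
    summary = {"role": "system", "content": summarize(head)}
    return [summary] + tail
```

Called before every model turn, this keeps compaction a routine event rather than an emergency response.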

The risk is lossy compression — summarization can drop details that turn out to matter. The mitigation is checkpoint-first: all the raw data lives in persisted workflow state. The agent doesn't need to hold everything in context because it can retrieve specific artifacts on demand. Context becomes a working index, not the authoritative store.

Planning Rollback Before Executing Irreversible Steps

Rollback in agentic systems is not transactional in the database sense. You cannot roll back a sent email, a created API resource, or a confirmed reservation. What you can do is plan compensation paths before executing any step that has lasting effects.

The pattern is to treat irreversible operations as pairs: an execution step and a compensation step. If the agent books a flight in step 3 and the subsequent hotel booking in step 5 fails, the execution tree includes a pre-defined path to cancel the step-3 booking. This compensation step lives in the workflow state from the moment the execution step is written, not discovered ad hoc when the failure happens.
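
The pairing can be sketched as a small saga-style runner. In this simplified version, a failure compensates every completed step in reverse order; the step names are illustrative:

```python
def run_with_compensation(steps):
    """Execute (name, action, compensate) triples; on failure, run the
    compensation for each completed step in reverse order."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append((name, compensate))
        except Exception:
            for done_name, undo in reversed(completed):
                if undo is not None:
                    undo()   # e.g. cancel the flight booked in an earlier step
            return {"status": "rolled_back", "failed_at": name}
    return {"status": "done"}
```

The essential property is that each compensation is registered at the moment its execution step succeeds, not looked up after the failure.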

This distinction matters for multi-agent systems. When subtasks are delegated to specialized agents, the orchestrator needs to track which agents have performed irreversible actions and ensure their compensation steps remain accessible. An orchestrator that loses this tracking — because it only maintained it in session state — cannot orchestrate meaningful rollback.

Selective rollback is almost always preferable to full task restart. If steps 1 through 6 of a 10-step task completed successfully, reverting to step 1 wastes all of that work and introduces new error surface. The goal is to preserve completed steps while rolling back only the affected branch.

Surfacing Progress to Users: What Partial Completion Actually Looks Like

The user experience of a stalled agent is often worse than a clean failure, because the user can't tell what happened or what to do next. A well-designed partial completion state makes the current position legible.

Useful progress signals include: what fraction of the task has completed, what artifacts or results are already available, what the blocking condition is, and what options the user has (provide additional authorization, abandon the task, redirect to a different path). This is different from a progress bar — a progress bar tells you how much time has passed. Useful progress signals tell you what exists and what's possible.

The blocking condition deserves particular care. "Requires authorization" is actionable. "Error: permission denied" is not. "Uncertain about intent — would you like to proceed with interpretation A or B?" is actionable. "Task failed" is not. The framing of the interruption determines whether users can respond meaningfully or have to restart from scratch.
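
That framing can be made concrete as a small report structure surfaced to the user at every stall; the field names and rendering here are illustrative:

```python
from dataclasses import dataclass

@dataclass
class PartialCompletionReport:
    fraction_complete: float
    available_artifacts: list   # what already exists and can be used
    blocking_condition: str     # phrased as something the user can act on
    options: list               # concrete next moves, not just "retry"

def render(report: PartialCompletionReport) -> str:
    """Turn the report into a user-facing summary of where the task stands."""
    return "\n".join([
        f"{int(report.fraction_complete * 100)}% complete.",
        "Available now: " + ", ".join(report.available_artifacts),
        f"Blocked: {report.blocking_condition}",
        "Your options: " + "; ".join(report.options),
    ])
```

Forcing every stall through a structure like this makes "Task failed" impossible to emit — the blocking condition and options fields have to be filled in.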

What to Build Before Your First Agent Ships

These patterns work best when designed in from the start. Retrofitting checkpoint architecture into an agent that already manages all state in-context is a significant rewrite.

Before shipping, verify that your agent design has explicit answers to five questions:

  • Where is durable workflow state stored, and how does an agent pick up from that state after a restart?
  • At which points can a human interrupt execution, and what information do they see when they do?
  • What is the compensation step for each irreversible action, and where is it recorded?
  • When context approaches 70% utilization, what gets summarized and what gets preserved verbatim?
  • What does a blocking condition look like to the user — and is it actionable?

If any of these questions doesn't have a concrete answer, the agent will eventually stall mid-task with no recovery path. The failure that follows won't look like a bug; it'll look like the model made a mistake. But the model will be fine. The missing design will be the problem.

Partial completion isn't an edge case. It's the normal operating condition for any agent running tasks that take more than a few minutes, touch multiple systems, or encounter real-world uncertainty. Treating it as a design requirement from the start is what separates agents that degrade gracefully from agents that leave users worse off than if they'd done the task themselves.
