The Async Tool Call Your Agent Fired and Forgot

June 2, 2026 · 10 min read

Software Engineer

The clearest sign that an agent's tool-call abstraction is broken is when the trace shows the step marked done and the downstream system shows nothing happened. The model called a tool, received a job ID back, treated the job ID as the answer, and moved on. Three minutes later the actual work either succeeded with nobody listening or failed with the error landing in a log nobody reads. The user sees a confident summary; the operations queue sees a stranded task.

This is the failure mode the function-calling abstraction quietly enables. JSON schemas describe parameters and return types, but they do not distinguish between "this tool returns a result" and "this tool returns a receipt for an operation whose result you will need to ask about later." The model treats both the same way, because to the planner they look the same — a successful tool call with a non-error payload.

The benchmark numbers are bleak. On Robotouille, an asynchronous planning benchmark that measures whether agents can interleave actions with operations that take real time, ReAct on GPT-4o scores 47% on synchronous variants and 11% on asynchronous ones. The architecture isn't subtly worse at async — it falls apart, because every async tool call is an opportunity for the planner to mistake acknowledgment for completion.

The Type System Has a Hole in It

A function call schema names parameters, return shape, and a one-line description. What it does not name is the temporal contract: does this call return a result, or a promise of a result?

A send_email tool that returns {"status": "sent"} looks identical, at the JSON level, to a start_video_render tool that returns {"job_id": "abc123"}. Both produce a payload. Both come back with no error. The planner has no type-level signal that one of them is done and the other is barely started. Tool authors write descriptions like "starts a render job and returns the job ID" — but that prose is one sentence in a system prompt where the model is juggling dozens of tools, and at runtime the model collapses both calls into the same mental category: "tool succeeded, advance the plan."
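One way to give the planner the missing signal is to wrap every raw payload in a type that records whether the work is finished. The sketch below is a minimal illustration, not any framework's API: the names Completed, Receipt, ASYNC_TOOLS, classify, and may_advance are all hypothetical, and it assumes the tool author declares up front which tools are asynchronous.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class Completed:
    """The tool finished; `value` is the actual result."""
    value: Any


@dataclass(frozen=True)
class Receipt:
    """The tool only acknowledged the request; the work is still pending."""
    job_id: str


ToolOutcome = Union[Completed, Receipt]

# Hypothetical registry: the tool author declares the temporal contract once,
# instead of hoping the model infers it from a one-line description.
ASYNC_TOOLS = {"start_video_render"}


def classify(tool_name: str, payload: dict) -> ToolOutcome:
    """Wrap a raw JSON payload in a type that says whether the work is done."""
    if tool_name in ASYNC_TOOLS:
        return Receipt(job_id=payload["job_id"])
    return Completed(value=payload)


def may_advance(outcome: ToolOutcome) -> bool:
    """A plan step counts as done only when the outcome is a completed result."""
    return isinstance(outcome, Completed)


email = classify("send_email", {"status": "sent"})
render = classify("start_video_render", {"job_id": "abc123"})
print(may_advance(email), may_advance(render))
```

With this in place, the two payloads from the example above stop looking alike: the email result lets the plan advance, and the render receipt does not.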

The new MCP specification (2025-11-25 revision) acknowledges this gap by adding Tasks as a separate primitive — a durable state machine with explicit states like working, input_required, completed, failed, and cancelled. The point is not the state names. The point is that async work has a kind that synchronous work doesn't, and putting it on a different runtime path stops the planner from confusing the two. Bedrock AgentCore's runtime makes the same separation with explicit add_async_task and complete_async_task calls that the SDK uses to track tasks and manage status pings independently of the model's reasoning loop.
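To make "a durable state machine" concrete, here is a small sketch that uses the five state names listed above. It is not the MCP wire format or SDK: the Task class and the ALLOWED transition table are assumptions made for illustration, and the real specification should be consulted for the actual transition rules.

```python
from enum import Enum


class TaskState(Enum):
    WORKING = "working"
    INPUT_REQUIRED = "input_required"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# States after which no further work will happen.
TERMINAL = {TaskState.COMPLETED, TaskState.FAILED, TaskState.CANCELLED}

# Assumed transition table for illustration; not copied from the spec.
ALLOWED = {
    TaskState.WORKING: {TaskState.INPUT_REQUIRED} | TERMINAL,
    TaskState.INPUT_REQUIRED: {TaskState.WORKING} | TERMINAL,
}


class Task:
    """A hypothetical async task tracked by the runtime, not by the model."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.state = TaskState.WORKING

    def transition(self, new_state: TaskState) -> None:
        # Terminal states have no entry in ALLOWED, so they accept nothing.
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(
                f"{self.state.value} -> {new_state.value} is not allowed"
            )
        self.state = new_state

    @property
    def is_done(self) -> bool:
        return self.state in TERMINAL
```

The useful property is that "done" is a question the runtime answers from recorded state, rather than something the planner infers from the shape of a payload.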

If your function-calling layer treats every tool as a synchronous returns-the-answer call, you have one type for two phenomena. The first time you ship a long-running tool, you have shipped a planner that lies about completion.

The Polling Loop Whose Budget Was Set Without Looking at the Work

Teams that recognize the async case usually patch it by giving the agent a check_status tool and trusting the planner to call it until the job is done. This works in demos and falls apart in production for one specific reason: the agent's outer loop budget — max turns, max tokens, max wall-clock — is set by people thinking about cost, not by people thinking about how long real operations take.

A typical loop budget is 20 to 30 turns and a few hundred seconds of total execution time. A typical long-running tool is a video render, a large file transcription, a multi-step provisioning job, or an ETL pipeline, and it usually takes several minutes to finish. The agent's poll-and-wait loop gives up after roughly ninety seconds because that's all the turn budget allows.
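The mismatch is easy to check with arithmetic. The numbers below are assumptions chosen to match the rough figures above: 30 turns, about 3 seconds per poll turn (model latency plus the status call), and a 5-minute job standing in for "several minutes".

```python
import math


def polling_window_seconds(max_turns: int, seconds_per_turn: float) -> float:
    """Wall-clock time the agent can spend polling before turns run out."""
    return max_turns * seconds_per_turn


def turns_needed(job_seconds: float, seconds_per_turn: float) -> int:
    """Poll turns required to still be watching when the job finishes."""
    return math.ceil(job_seconds / seconds_per_turn)


# Assumed figures, for illustration only.
MAX_TURNS = 30
SECONDS_PER_TURN = 3.0
JOB_SECONDS = 5 * 60

window = polling_window_seconds(MAX_TURNS, SECONDS_PER_TURN)
print(window)                                       # 90.0
print(JOB_SECONDS - window)                         # 210.0 seconds unobserved
print(turns_needed(JOB_SECONDS, SECONDS_PER_TURN))  # 100
```

Under these assumptions the agent would need more than three times its turn budget just to observe the job finishing, before spending any turns on the rest of the task.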

What does the agent report when the loop budget runs out before the job completes? Almost always, it synthesizes a result, in one of three ways. Phantom status reports: the model cites a job status from an earlier check_status call instead of making a fresh one. Premature collection: the model assembles a final answer from "the job is queued" because that is the most recent observation it has. ID truncation: under context pressure, the model abbreviates or reformats the job ID, and the next check_status call fails because the lookup string is mangled.

The user sees a coherent answer. The operation may still be running. The agent has produced a result indistinguishable from the case where it actually waited. This is worse than a timeout error, because at least a timeout is honest.
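One possible mitigation is to move the waiting out of the model's reasoning loop and have the runtime report honestly when it gave up. The sketch below is an assumption-laden illustration rather than a prescribed design: wait_for_job and WaitResult are hypothetical names, check_status is whatever callable your system provides, and the terminal status strings are assumed. The runtime holds the job ID verbatim, makes a fresh status call on every iteration, and returns finished=False instead of a made-up result when the deadline passes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class WaitResult:
    job_id: str      # held verbatim by the runtime, never retyped by the model
    status: str      # the last status actually observed
    finished: bool   # True only if a terminal status was observed
    checks: int      # number of fresh status calls made


def wait_for_job(
    job_id: str,
    check_status: Callable[[str], str],
    deadline_s: float,
    interval_s: float,
    terminal: Tuple[str, ...] = ("completed", "failed", "cancelled"),
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> WaitResult:
    """Poll until a terminal status is seen or the deadline would be exceeded."""
    start = clock()
    checks = 0
    while True:
        status = check_status(job_id)  # always a fresh call, never a cached one
        checks += 1
        if status in terminal:
            return WaitResult(job_id, status, True, checks)
        # Stop if another sleep would take us past the deadline.
        if clock() - start + interval_s > deadline_s:
            return WaitResult(job_id, status, False, checks)
        sleep(interval_s)
```

A caller that receives finished=False can tell the user the job is still running and surface the job ID, which is the honest timeout the paragraph above asks for.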

Sync and Async Tools Are Different Abstractions Wearing the Same JSON Schema

The trap is that the wire format lets you pretend they're the same. The two abstractions actually differ on every axis that matters:


About Tian Pan
