When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default
The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.
This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.
The reason this is hard to debug is that every layer is behaving as documented. The tool reported what it knew. The agent reasoned over the only signal it received. The harness logged what the agent emitted. The lie is in the gap between "the system accepted my command" and "the world changed in the way I asked it to" — and that gap is invisible to anyone whose mental model treats tool calls as function returns.
The Anatomy of a Tool That Lies
The vocabulary of agent tool calls is borrowed from RPC, which inherited it from local function calls, which assume that "the call returned" implies "the work is done." That assumption was already shaky for distributed systems and outright wrong for systems with asynchronous backends. An agent stack inherits all the prior failure modes and adds a probabilistic reasoner that takes the success signal at face value.
Several common patterns produce false-success responses; a detection sketch follows the list:
- Acceptance-as-success. The API accepts the request, places it on a queue, and returns 200. The queue worker is dead, deadlocked, or processing a six-hour backlog. The downstream effect never happens.
- Replica-write success. The write went to a node that hasn't propagated to the read replicas the agent will later query. The post-condition read returns the old value; the model interprets it as "the write didn't take" and writes again.
- Provider-side soft failure. The email provider returns 200 with accepted: true, but a spam filter at the recipient's domain has been black-holing your IP for a week. Your tool layer never sees the failure.
- Schema-shaped error. The tool returns 200 with a body that contains an error field the agent's tool wrapper didn't surface, because the wrapper was written when 4xx was the only error path.
- Partial mutation. The compound action, "create the invoice and email it," succeeds on step one and silently drops step two, returning a single success boolean that conflates both.
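Two of these, acceptance-as-success and the schema-shaped error, can be caught deterministically at the wrapper layer before anything reaches the model. Here is a minimal sketch in Python; the field names (error, accepted, queued, success) are assumptions about a hypothetical provider's response schema, not any specific API:

```python
# A minimal sketch, not any real provider's schema: "error", "accepted",
# "queued", and "success" are hypothetical field names.
from dataclasses import dataclass
from typing import Any


@dataclass
class ToolOutcome:
    status: str   # "failed" | "pending" | "self_reported"
    detail: str
    raw: dict[str, Any]


def classify_response(http_status: int, body: dict[str, Any]) -> ToolOutcome:
    """Map a provider response onto what we actually know about the world."""
    if http_status >= 400:
        return ToolOutcome("failed", f"HTTP {http_status}", body)
    # Schema-shaped error: 200 with an error payload in the body.
    if body.get("error"):
        return ToolOutcome("failed", str(body["error"]), body)
    # Acceptance-as-success: the provider only promises it enqueued the work.
    if body.get("accepted") or body.get("queued"):
        return ToolOutcome("pending", "accepted, effect not yet observed", body)
    if body.get("success") is True:
        # Still only the tool's self-report; a post-condition read is what
        # would upgrade this to confirmed.
        return ToolOutcome("self_reported", "tool claims success", body)
    return ToolOutcome("failed", "unrecognized response shape", body)
```

The point of the three-way outcome is that "pending" is a first-class state: the wrapper never collapses "the provider accepted the request" into "the work is done."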
Each of these patterns lives in the boundary between two systems, where ownership of "did the thing actually happen" is unclaimed. The tool layer thinks the API answered it. The API thinks the queue acknowledged. The queue thinks the worker accepted. Nobody has the post-condition.
Why the Model Cannot Be the Verification Layer
A natural reaction to this failure mode is to make the model smarter — prompt it to "verify the tool succeeded before reporting success to the user." This is the wrong layer to fix it.
The model only sees what the harness pipes into its context. If the tool returned success: true, the model has no independent evidence to challenge it. Asking the model to "be skeptical" produces theatrical skepticism — a paragraph of hedging language that doesn't change behavior on the next call. Worse, models trained on human-written tool documentation have absorbed the prior that success: true means done. Fighting that prior with prompt instructions is a steady drag on every other dimension of agent behavior.
The verification has to be deterministic, has to live below the model in the stack, and has to express itself as evidence the model can reason over rather than as a string the model can rationalize away. That is, the harness should refuse to return tool_result: success to the model until a post-condition has been observed by code that doesn't trust the tool's self-report. The model then sees the post-condition, not just the response.
This is a structural shift. It moves the agent harness from "I forwarded the tool's answer to the model" to "I verified the tool's answer against an external observation, then forwarded a verified answer." The latter is a real-time consistency layer, and most agent stacks don't have one because most agent stacks were designed for read-mostly workloads where the model just needed information.
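What that gate could look like in code: a minimal sketch of a harness-side verification layer, assuming a registry where each state-changing tool declares a deterministic post-condition check. All of the names here (VERIFIERS, verifier, run_tool, the polling parameters) are illustrative, not from any particular framework:

```python
import time
from typing import Callable

# Registry of deterministic post-condition checks, keyed by tool name.
VERIFIERS: dict[str, Callable[[dict], bool]] = {}


def verifier(tool_name: str):
    """Decorator that registers a post-condition check for a tool."""
    def register(fn: Callable[[dict], bool]) -> Callable[[dict], bool]:
        VERIFIERS[tool_name] = fn
        return fn
    return register


def run_tool(tool_name: str, call: Callable[[], dict],
             timeout_s: float = 30.0, poll_s: float = 2.0) -> dict:
    """Execute a tool, then withhold 'success' from the model until the
    post-condition has been observed by code that ignores the self-report."""
    response = call()
    check = VERIFIERS.get(tool_name)
    if check is None:
        # Read-only tools have no post-condition; pass the response through.
        return {"status": "ok", "response": response}
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check(response):
            return {"status": "verified", "response": response}
        time.sleep(poll_s)
    # This is what the model sees instead of the tool's success: true.
    return {"status": "unverified", "response": response,
            "note": f"post-condition not observed within {timeout_s}s"}
```

A verifier for a send_email tool, for instance, would query the provider's delivery or events log rather than re-reading the send response. The tool_result the model receives is the harness's verdict, "verified" or "unverified", so the model reasons over evidence rather than over the tool's self-report.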
The Post-Condition Pattern
The discipline that lands is post-condition verification: every state-changing tool call is followed by a deterministic read that confirms the world is in the state the call promised. The pattern has four elements, each with its own tradeoffs.
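Before walking through the elements, here is the shape of the pattern for a single tool. The client, the endpoints (create_invoice, get_invoice), and the response fields are hypothetical; the load-bearing part is that the verdict comes from a fresh read, not from the write's return value:

```python
def create_invoice_verified(client, customer_id: str,
                            amount_cents: int) -> dict:
    """Hypothetical state-changing tool with a built-in post-condition read."""
    resp = client.create_invoice(customer_id=customer_id,
                                 amount_cents=amount_cents)
    invoice_id = resp.get("invoice_id")
    if not invoice_id:
        return {"status": "failed", "detail": "no invoice_id in response"}
    # Post-condition: a fresh read must show the invoice in the promised
    # state. We do not trust resp["success"]; we trust this observation.
    observed = client.get_invoice(invoice_id)
    if observed and observed.get("amount_cents") == amount_cents:
        return {"status": "verified", "invoice_id": invoice_id}
    return {"status": "unverified", "invoice_id": invoice_id,
            "detail": "write accepted but read-back did not confirm"}
```

One caveat the replica-write pattern above forces: the read-back has to go through a read path with strong enough consistency guarantees, or the verifier itself becomes a false-failure generator.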
