4 posts tagged with "tooling"

When the Agent Asks Forgiveness Instead of Permission

May 23, 2026 · 11 min read

Software Engineer

Your team gave the agent a tool to refund customers, a tool to escalate to a manager, a tool to update a record in the CRM, and a system prompt that says "use your judgment." Six weeks in, the agent has shortened average resolution time by 40%, the demo to the executive team went beautifully, and the eval scores climbed every sprint. Then the apology emails start. A refund went to the wrong account because the agent didn't double-check the customer ID. An escalation pinged a director's phone at 11pm over a question a tier-one rep could have answered. A CRM update overwrote the "preferred contact channel" field that the field-sales team owns and uses to drive their territory routing. None of these are bugs in the model. They are the model doing exactly what your eval rewarded it for.

The agent learned, correctly, that taking action is scored positively and that asking the user "should I proceed?" is scored as friction. It also learned that an apology after an irreversible action is cheaper, on the metric it was being graded on, than a confirmation that delays a resolution. The act-first-apologize-later default arrived in production without any single engineer choosing it, because the eval set, the system prompt, and the tool surface together described a reward function where that policy wins.

Silent Tool Truncation: The Default Cap Your Agent Reasons Over Without Knowing

May 10, 2026 · 11 min read

Tian Pan

Software Engineer

A tool call returns a 142 KB JSON blob. Your agent framework drops everything past byte 8,192, hands the prefix to the model, and the model writes a confident answer based on a fragment it never knew was a fragment. Three weeks later a customer escalates. You scroll the trace, see "tool returned successfully," and the post-mortem turns into a hunt for which step "ignored" the evidence — except no step ignored it. The evidence was clipped before it ever reached the reasoner.

This isn't a hypothetical. Codex hardcodes tool output truncation at 10 KiB or 256 lines. Claude Code defaults to 25,000 tokens for tool results, with a separate display-layer cap that briefly clipped MCP responses at around 700 characters in 2025. OpenAI's tool-output submission caps at 512 KB. Each framework picked a number that seemed safe, and for short tool calls it is. The failure mode arrives when a single step's output crosses the line — quietly, without an exception, without a flag the model can see.

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.

The Dual-Writer Race: When Your Agent and Your User Edit the Same Calendar Event

April 26, 2026 · 12 min read

Tian Pan

Software Engineer

The agent confidently reports "I rescheduled the meeting to Thursday at 3pm." The user is staring at the original Tuesday 10am slot, because between the agent's plan and its commit they edited the event themselves. Last-write-wins overwrote a human change with an automated one, and the user's trust in the assistant collapses on a single incident. This is the dual-writer race, and it is the bug class that agent toolchains were never designed for.

Most agent platforms inherit this category by accident. The tool layer treats update_event as a single function call: take an ID, take new fields, return success. The provider API underneath has offered optimistic concurrency primitives — ETags, version tokens, If-Match preconditions — for a decade, and almost nobody plumbs them through. The model has no way to know that the world it reasoned about a minute ago is no longer current, because the abstraction it was given silently throws that information away.

About Tian Pan