320 posts tagged with "ai-agents"

"Done!" Is Not a Return Code: Why Agent Completion Needs a Structured Signal

April 23, 2026 · 10 min read

Software Engineer

An agent ends its turn with "All done — let me know if you want any changes!" and your orchestrator has to decide whether to mark the ticket resolved, kick off the next handoff, or retry. That sentence is not a return code. It is a polite closing line trained to sound reassuring at the end of a chat, and every line of automation downstream of it inherits the ambiguity. The teams that treat this as a parsing problem write regexes that catch \b(done|complete|finished)\b and call it a day. The teams whose agents run in production eventually learn that completion is an event, not a mood.

The failure mode is bimodal and boring. Either the agent announces done when it isn't — premature termination — and the orchestrator happily advances the workflow on a half-finished artifact. Or the agent is actually done, but phrases it in a way that doesn't match the detector ("I went ahead and landed the change, though the test for the edge case is still flaky"), and the orchestrator spins up a retry that re-does the work, duplicates the side effect, and sometimes contradicts the successful first pass. Both modes degrade silently. Neither shows up in a dashboard until someone reads a trace and notices that the agent said "I think that covers it" and the billing system treated that as a commit.

The fix is not smarter parsing. It is giving the agent a structured way to terminate — a done-tool with an enumerated status, a reason code, and a handle your pipeline can route on — and changing the orchestrator to wait for that event instead of listening to the chat stream.

Your Eval Harness Runs Single-User. Your Agents Don't.

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

Your agent passes 92% of your eval suite. You ship it. Within an hour of real traffic, something that never appeared in any trace is happening: agents are stalling on rate-limit retry storms, a customer sees another customer's draft email in a tool response, and your provider connection pool is sitting at 100% utilization while CPU is idle. None of these failures live in the model. They live in the gap between how you tested and how production runs.

The gap has a single shape. Your eval harness loops one agent at a time through a fixed dataset. Your production loops many agents at once through shared infrastructure. Sequential evaluation hides every bug whose precondition is "two things touching the same resource." Until you build adversarial concurrency into the harness itself, those bugs will only surface as on-call pages.

First-Touch Tool Burn: Why Your Agent Reads Twelve Files Before Doing What You Asked

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

Your agent just spent ninety seconds and a few dollars to change a three-line function. Before the edit landed, it listed two directories, opened the test file, ran a grep for callers, read the config module, checked the CI workflow, and pulled up a type definition it never used. The diff it produced was four lines. The trace that produced it was forty-three tool calls.

This is first-touch tool burn: the pattern where an agent, handed a well-scoped task, behaves as if every request is a research problem. The exploration happens first and it happens hard — sixty to eighty percent of the token budget spent on listing, grepping, and reading before a single character is written to a file. Teams discover this the first time they look at a trace and realize the agent did the equivalent of a two-hour onboarding for a two-minute task.

The behavior isn't a bug in any specific model. It's the predictable output of how these systems were trained and evaluated, colliding with a production environment that measures something training never did: whether the work was cheap enough to bother doing at all.

The Hallucinated Success Problem: When Your Agent Says Done and Means Nothing

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The most dangerous failure in agent systems is not the loud one. It is the agent that confidently declares "Task complete" and returns a polished summary of work it never did. The file was never written. The webhook never fired. The database row is still the way it was an hour ago. But the trace is green, the completion counter ticks up, and the dashboard tells leadership the new feature is working.

This is the hallucinated success problem, and it is the single hardest bug class to catch in production because it evades every cheap signal you have. The agent did not crash. It did not time out. It did not return an error. It narrated a plausible, coherent, and completely fabricated account of a successful execution. Your observability stack was built to catch noisy failures. Silent success looks identical to real success until a user notices the output is wrong.

The MCP Server Graveyard: When Your Agent's Dependencies Stop Shipping

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

The last commit to the MCP server your agent calls every five minutes was eight months ago. The upstream API it wraps rolled out a new authentication model in February. There are 47 open issues, 12 of them flagged security. The maintainer's GitHub account hasn't shown activity since October. Your agent still connects, still receives tool descriptions, still executes calls — and silently, every one of those calls flows through a piece of infrastructure that nobody is watching.

This is the shape of MCP abandonment. Not a malicious rug pull, not a compromised package, just neglect. Somebody published a useful server in 2025, got adopted, then moved on. The server kept working because nothing forced it to break. Until it does — and by then, the trust boundary your agent was crossing every five minutes has already failed.

Most teams adopted community MCP servers the way they adopted npm packages: by running install and reading the README. That mental model makes sense for libraries that sit in your dependency tree, get audited at build time, and surface their deprecations through your package manager. It does not survive contact with MCP, where the dependency is a live trust boundary that the LLM invokes in a loop, with credentials, on production data.

No Results Is Not Absence: Why Agents Treat Retrieval Failure as Proof

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

The most dangerous sentence in an agent transcript is not a hallucination. It is four calm words: "I could not find it." The agent sounds epistemically humble. It sounds like due diligence. It sounds, to any downstream reader or caller, exactly like a fact. And yet the statement carries no information about whether the thing exists. It only carries information about what happened when a specific tool, invoked with a specific query, consulted a specific index that the agent happened to have access to at that moment.

Between those two readings lies a production incident waiting to happen. A support agent tells a customer "we have no record of your order" because a replication lag delayed the write to the read replica by ninety seconds. A coding agent declares "there are no tests for this module" because it searched a directory that did not contain the test folder. A compliance agent replies "no prior violations on file" because the audit index had not ingested last week's report. In each case the agent's output is grammatically a negation, but epistemically it is a shrug that has been re-typed as a claim.

Your OAuth Tokens Expire Mid-Task: The Silent Failure Mode of Long-Running Agents

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

The first time a production agent runs for forty minutes and hits a 401 on step 27 of 40, the incident review is almost always the same. Someone in the room asks why the token wasn't refreshed. Someone else points out that the refresh logic exists, but it lives in the HTTP client the agent's tool wrapper was never wired into. A third person notices that even if the refresh had fired, two of the agent's parallel tool calls would have tried to rotate the same refresh token at the same instant and blown up the session anyway. Everyone nods. Then the team spends the next week retrofitting credential lifecycle into an architecture that assumed requests finish in 800 milliseconds.

OAuth was designed for a world where an access token outlives the request that uses it. Long-running agents inverted that assumption. The request — really, a chain of tens or hundreds of tool calls orchestrated across minutes or hours — now outlives the token. The industry spent a decade building libraries, proxies, and refresh flows around the short-request assumption, and almost none of it transplants cleanly to agent loops.

Plan-and-Execute Is Marketing, Not Contract: Plan Adherence as a First-Class SLI

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The agent printed a five-step plan. Step three said "fetch the user's billing history from the invoices service." The trace shows step three actually called the orders service, joined a stale customer table, and produced a number that looked right. The output passed the eval. The post-mortem found the regression six weeks later, when finance noticed the dashboard had quietly diverged from source-of-truth by 4%.

Nobody wrote a bug. The planner wrote a contract the executor never signed.

This is the failure mode plan-and-execute architectures bury under their own architectural elegance. The pattern was sold as a way to give agents long-horizon coherence: a strong model drafts a plan, weaker models execute steps, the plan acts as a scaffold. In practice the plan is a marketing artifact — a plausible-looking story emitted at t=0, then promptly invalidated by every interesting thing that happens at t>0. The trace shows the plan. The trace shows the actions. Almost nobody is measuring the distance between them.

Your Planner Knows About Tools Your User Can't Call

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

A free-tier user opens your support chat and asks, "Can you issue a refund for order #4821?" Your agent replies, "I'm not able to issue refunds — that's a manager-only action. You could escalate via the dashboard, or I can transfer you." The refusal is correct. The ACL at the refund tool is correct. And you have just told an anonymous user that a tool named issue_refund exists, that it is gated by a role called manager, and that your platform accepts order IDs of the shape #NNNN.

Your planner knows about tools your user can't call. That asymmetry — full catalog visible to the reasoning layer, partial catalog executable at the action layer — is where most agent authorization gets quietly wrong. ABAC at the tool boundary catches the unauthorized invocation. It doesn't catch the capability disclosure that already happened one token earlier, in the plan, the refusal, or the "helpful" suggestion of a workaround.

Rate Limit Hierarchy Collapse: When Your Agent Loop DoSes Itself

April 23, 2026 · 12 min read

Tian Pan

Software Engineer

The bug report says the service is slow. The dashboard says the service is healthy. Token-per-minute usage is at 62% of the tier cap, well inside the green band. Then you open the traces and see the shape: one user request spawned a planner step, which emitted eleven parallel tool calls, four of which were search fan-outs that each triggered sub-agents, which each called three tools in parallel — and that single "request" is now pounding your own token bucket from forty-seven different workers at the same time. The other ninety-nine users of your product are stuck behind it, getting 429s they never earned. Your agent is DoSing itself, and the rate limiter is doing exactly what you told it to.

This is rate limit hierarchy collapse. You bought a perimeter defense designed for HTTP APIs where one request equals one unit of work, then wired it in front of a system where one request means a tree of unknown depth and unbounded branching factor. The single-bucket model doesn't just fail to protect — it fails invisibly, because your aggregate numbers never breach anything. The damage happens in the tails, in correlated bursts, and in the heads-down users who happen to be adjacent in time to a heavy one.

The Reasoning-Model Tax at Tool Boundaries

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

Extended thinking wins benchmarks on novel reasoning. At a tool boundary — the moment your agent has to pick which function to call, when to call it, and what arguments to pass — that same thinking budget often makes things worse. The model weighs three equivalent tools that a fast model would have disambiguated in one token. It manufactures plausible-sounding ambiguity where none existed. It burns a thousand reasoning tokens to second-guess the obvious search call, then calls search anyway. You paid the reasoning tax on a decision that didn't need reasoning.

This is the quiet cost center of agentic systems in 2026: not the reasoning model itself, which is priced fairly for what it does well, but the reasoning model deployed at the wrong step of the loop. The anti-pattern hides in plain sight because the top-of-loop task looks hard ("answer the user's question"), so teams wrap the entire loop in high-effort thinking mode and never notice that 80% of the thinking budget is being spent deliberating on tool-choice micro-decisions the model already got right on its first instinct.

The Reflection Placebo: Why Plan-Reflect-Replan Loops Return Version One

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

Open an agent's trace during a long-horizon planning task and count the number of times the model writes "let me reconsider," "on reflection," or "a better approach would be." Now compare the plan it finally commits to with the one it drafted first. In the majority of traces I've audited, the second plan is the first plan wearing a different hat — the same decomposition, the same tool calls, the same order of operations, with some renamed step labels and a reworded rationale. The reflection ran. The model emitted tokens that looked like reconsideration. The plan did not move.

This matters because "with reflection" has quietly become a quality tier. Teams ship planners with one, two, or three reflection rounds and bill themselves for the difference. The inference spend is real and measurable. Whether anything on the plan side actually changed is a question almost nobody instruments for, and the answer is frequently: no.

About Tian Pan