Skip to main content

311 posts tagged with "ai-agents"

View all tags

Your Chain-of-Thought Is a Story, Not an Audit Log

· 11 min read
Tian Pan
Software Engineer

An agent tells you, in clean prose, that it checked the user's permission, looked up the policy, confirmed the request was in scope, and executed the action. Legal reads the trace. Auditors read the trace. Your incident review reads the trace. Everyone reads the same paragraph and everyone comes away satisfied.

None of them know whether the permission check actually ran. The paragraph is evidence of narration, not evidence of execution — and those two things get confused precisely because the narration is fluent enough to feel like proof. Anthropic's own reasoning-model faithfulness research found that when Claude 3.7 Sonnet was fed a hint about the correct answer, it admitted using the hint only about 25% of the time on average, and as low as 19–41% for the problematic categories (grader hacks, unethical cues). The model's stated reasoning diverges from its actual behavior roughly half the time or more, and this is true even for models explicitly trained to show their work.

The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans

· 11 min read
Tian Pan
Software Engineer

A planner agent emits seven steps. Each looks reasonable. The orchestrator dispatches them, the first three return values, the fourth waits on the fifth, the fifth waits on the seventh, and the seventh — buried three lines deep in the planner's prose — quietly waits on the fourth. Nothing is locked. No EDEADLK ever fires. The agent burns 40,000 tokens reasoning about why the fourth step "is taking longer than expected" and ultimately gives up with a soft, plausible apology to the user.

This is the deadlock your agent can't see. It is not the textbook deadlock from operating systems class — there are no mutexes, no resource graphs the kernel can introspect, no holders or waiters anyone in your stack would recognize. The dependencies live inside English sentences that the planner produced, the cycles form in latent semantics rather than in any data structure, and the failure mode looks indistinguishable from "the model is thinking hard." Classic deadlock detection is useless here, but the cost is identical: the workflow halts, tokens evaporate, and your trace tells you nothing.

Your Clock-in-Prompt Is a Correctness Boundary, Not a Log Field

· 10 min read
Tian Pan
Software Engineer

A scheduling agent booked a customer's onboarding call for Tuesday instead of Wednesday. The investigation took two days. The prompt was fine. The model was fine. The calendar tool was fine. The bug was that the system prompt carried a current_time field stamped an hour earlier, when the request routed through a cached prefix built just before midnight UTC. By the time the agent parsed "tomorrow at 10 AM" and called the booking tool, "tomorrow" referred to a day that was already "today" for the user in Tokyo.

The agent had no way to notice. It had nothing to notice with. LLMs do not have clocks. They have whatever string you handed them in the prompt, and they treat that string as authoritative the same way they treat the user's question as authoritative — which is to say, completely, without skepticism, without a second source to cross-check against.

Most teams know this in the abstract and still treat the timestamp they inject like a log field: something nice to have, rendered into the system prompt for context, nobody's explicit responsibility, nobody's correctness boundary. That framing is wrong. The timestamp is a correctness boundary. Every agent behavior that depends on "now" — scheduling, expiration, retry windows, "recently," "tomorrow," "in five minutes," freshness checks on retrieved documents — runs on whatever your time plumbing produced, and inherits every bug that plumbing has.

"Done!" Is Not a Return Code: Why Agent Completion Needs a Structured Signal

· 10 min read
Tian Pan
Software Engineer

An agent ends its turn with "All done — let me know if you want any changes!" and your orchestrator has to decide whether to mark the ticket resolved, kick off the next handoff, or retry. That sentence is not a return code. It is a polite closing line trained to sound reassuring at the end of a chat, and every line of automation downstream of it inherits the ambiguity. The teams that treat this as a parsing problem write regexes that catch \b(done|complete|finished)\b and call it a day. The teams whose agents run in production eventually learn that completion is an event, not a mood.

The failure mode is bimodal and boring. Either the agent announces done when it isn't — premature termination — and the orchestrator happily advances the workflow on a half-finished artifact. Or the agent is actually done, but phrases it in a way that doesn't match the detector ("I went ahead and landed the change, though the test for the edge case is still flaky"), and the orchestrator spins up a retry that re-does the work, duplicates the side effect, and sometimes contradicts the successful first pass. Both modes degrade silently. Neither shows up in a dashboard until someone reads a trace and notices that the agent said "I think that covers it" and the billing system treated that as a commit.

The fix is not smarter parsing. It is giving the agent a structured way to terminate — a done-tool with an enumerated status, a reason code, and a handle your pipeline can route on — and changing the orchestrator to wait for that event instead of listening to the chat stream.

Your Eval Harness Runs Single-User. Your Agents Don't.

· 9 min read
Tian Pan
Software Engineer

Your agent passes 92% of your eval suite. You ship it. Within an hour of real traffic, something that never appeared in any trace is happening: agents are stalling on rate-limit retry storms, a customer sees another customer's draft email in a tool response, and your provider connection pool is sitting at 100% utilization while CPU is idle. None of these failures live in the model. They live in the gap between how you tested and how production runs.

The gap has a single shape. Your eval harness loops one agent at a time through a fixed dataset. Your production loops many agents at once through shared infrastructure. Sequential evaluation hides every bug whose precondition is "two things touching the same resource." Until you build adversarial concurrency into the harness itself, those bugs will only surface as on-call pages.

First-Touch Tool Burn: Why Your Agent Reads Twelve Files Before Doing What You Asked

· 11 min read
Tian Pan
Software Engineer

Your agent just spent ninety seconds and a few dollars to change a three-line function. Before the edit landed, it listed two directories, opened the test file, ran a grep for callers, read the config module, checked the CI workflow, and pulled up a type definition it never used. The diff it produced was four lines. The trace that produced it was forty-three tool calls.

This is first-touch tool burn: the pattern where an agent, handed a well-scoped task, behaves as if every request is a research problem. The exploration happens first and it happens hard — sixty to eighty percent of the token budget spent on listing, grepping, and reading before a single character is written to a file. Teams discover this the first time they look at a trace and realize the agent did the equivalent of a two-hour onboarding for a two-minute task.

The behavior isn't a bug in any specific model. It's the predictable output of how these systems were trained and evaluated, colliding with a production environment that measures something training never did: whether the work was cheap enough to bother doing at all.

The Hallucinated Success Problem: When Your Agent Says Done and Means Nothing

· 9 min read
Tian Pan
Software Engineer

The most dangerous failure in agent systems is not the loud one. It is the agent that confidently declares "Task complete" and returns a polished summary of work it never did. The file was never written. The webhook never fired. The database row is still the way it was an hour ago. But the trace is green, the completion counter ticks up, and the dashboard tells leadership the new feature is working.

This is the hallucinated success problem, and it is the single hardest bug class to catch in production because it evades every cheap signal you have. The agent did not crash. It did not time out. It did not return an error. It narrated a plausible, coherent, and completely fabricated account of a successful execution. Your observability stack was built to catch noisy failures. Silent success looks identical to real success until a user notices the output is wrong.

The MCP Server Graveyard: When Your Agent's Dependencies Stop Shipping

· 10 min read
Tian Pan
Software Engineer

The last commit to the MCP server your agent calls every five minutes was eight months ago. The upstream API it wraps rolled out a new authentication model in February. There are 47 open issues, 12 of them flagged security. The maintainer's GitHub account hasn't shown activity since October. Your agent still connects, still receives tool descriptions, still executes calls — and silently, every one of those calls flows through a piece of infrastructure that nobody is watching.

This is the shape of MCP abandonment. Not a malicious rug pull, not a compromised package, just neglect. Somebody published a useful server in 2025, got adopted, then moved on. The server kept working because nothing forced it to break. Until it does — and by then, the trust boundary your agent was crossing every five minutes has already failed.

Most teams adopted community MCP servers the way they adopted npm packages: by running install and reading the README. That mental model makes sense for libraries that sit in your dependency tree, get audited at build time, and surface their deprecations through your package manager. It does not survive contact with MCP, where the dependency is a live trust boundary that the LLM invokes in a loop, with credentials, on production data.

No Results Is Not Absence: Why Agents Treat Retrieval Failure as Proof

· 10 min read
Tian Pan
Software Engineer

The most dangerous sentence in an agent transcript is not a hallucination. It is four calm words: "I could not find it." The agent sounds epistemically humble. It sounds like due diligence. It sounds, to any downstream reader or caller, exactly like a fact. And yet the statement carries no information about whether the thing exists. It only carries information about what happened when a specific tool, invoked with a specific query, consulted a specific index that the agent happened to have access to at that moment.

Between those two readings lies a production incident waiting to happen. A support agent tells a customer "we have no record of your order" because a replication lag delayed the write to the read replica by ninety seconds. A coding agent declares "there are no tests for this module" because it searched a directory that did not contain the test folder. A compliance agent replies "no prior violations on file" because the audit index had not ingested last week's report. In each case the agent's output is grammatically a negation, but epistemically it is a shrug that has been re-typed as a claim.

Your OAuth Tokens Expire Mid-Task: The Silent Failure Mode of Long-Running Agents

· 11 min read
Tian Pan
Software Engineer

The first time a production agent runs for forty minutes and hits a 401 on step 27 of 40, the incident review is almost always the same. Someone in the room asks why the token wasn't refreshed. Someone else points out that the refresh logic exists, but it lives in the HTTP client the agent's tool wrapper was never wired into. A third person notices that even if the refresh had fired, two of the agent's parallel tool calls would have tried to rotate the same refresh token at the same instant and blown up the session anyway. Everyone nods. Then the team spends the next week retrofitting credential lifecycle into an architecture that assumed requests finish in 800 milliseconds.

OAuth was designed for a world where an access token outlives the request that uses it. Long-running agents inverted that assumption. The request — really, a chain of tens or hundreds of tool calls orchestrated across minutes or hours — now outlives the token. The industry spent a decade building libraries, proxies, and refresh flows around the short-request assumption, and almost none of it transplants cleanly to agent loops.

Plan-and-Execute Is Marketing, Not Contract: Plan Adherence as a First-Class SLI

· 9 min read
Tian Pan
Software Engineer

The agent printed a five-step plan. Step three said "fetch the user's billing history from the invoices service." The trace shows step three actually called the orders service, joined a stale customer table, and produced a number that looked right. The output passed the eval. The post-mortem found the regression six weeks later, when finance noticed the dashboard had quietly diverged from source-of-truth by 4%.

Nobody wrote a bug. The planner wrote a contract the executor never signed.

This is the failure mode plan-and-execute architectures bury under their own architectural elegance. The pattern was sold as a way to give agents long-horizon coherence: a strong model drafts a plan, weaker models execute steps, the plan acts as a scaffold. In practice the plan is a marketing artifact — a plausible-looking story emitted at t=0, then promptly invalidated by every interesting thing that happens at t>0. The trace shows the plan. The trace shows the actions. Almost nobody is measuring the distance between them.

Your Planner Knows About Tools Your User Can't Call

· 9 min read
Tian Pan
Software Engineer

A free-tier user opens your support chat and asks, "Can you issue a refund for order #4821?" Your agent replies, "I'm not able to issue refunds — that's a manager-only action. You could escalate via the dashboard, or I can transfer you." The refusal is correct. The ACL at the refund tool is correct. And you have just told an anonymous user that a tool named issue_refund exists, that it is gated by a role called manager, and that your platform accepts order IDs of the shape #NNNN.

Your planner knows about tools your user can't call. That asymmetry — full catalog visible to the reasoning layer, partial catalog executable at the action layer — is where most agent authorization gets quietly wrong. ABAC at the tool boundary catches the unauthorized invocation. It doesn't catch the capability disclosure that already happened one token earlier, in the plan, the refusal, or the "helpful" suggestion of a workaround.