Skip to main content

320 posts tagged with "ai-agents"

View all tags

The MCP Tool List Grew Mid-Session and Your Agent Called a Tool It Had Never Been Told About

· 10 min read
Tian Pan
Software Engineer

A security incident review opens with a question the team cannot answer: how did the agent learn the name of the tool it just called? The audit trail shows a tools/call for a tool whose name does not appear in any tools/list response the harness logged. The MCP server cheerfully accepted the call and executed it. The model, asked in a postmortem to explain where the tool name came from, offers no answer because there is none — it guessed, and the guess landed on a real action.

This is the failure mode at the seam between two assumptions that look compatible on paper. The client treats the tool list as a contract that names the surface area of authority it has been granted. The server treats the tool list as a snapshot of what is currently available, free to grow when the world grows. Between those two views, the LLM is a bridge that does not know the difference.

The Async Tool Call That Resolved After the User Already Closed the Conversation

· 12 min read
Tian Pan
Software Engineer

The clearest sign that an agent's session model is broken is when a tool result has nowhere to go. The agent fired a long-running call — a render, a provisioning job, a multi-step query. The user watched the spinner for a few seconds, decided they didn't need it after all, closed the tab, and moved on. Forty seconds later the tool finishes. Its callback hits your gateway with a conversation_id that no longer points at anything. The gateway has two equally bad options: silently drop the result, or stitch it into whatever session inherits that ID next.

Most teams discover this failure mode the same way: a support ticket where a user sees an answer they did not ask for, attached to a conversation they did not start. Or a downstream system that processed the same charge twice because the gateway helpfully "retried" delivery against the next active session. Or — most commonly — nothing visible at all, just a slow drift in completion metrics that nobody can correlate to anything specific, because the failures don't fire alerts; they fire emptiness.

The OAuth Scope One Tool Requested That Every Other Tool Quietly Inherited

· 10 min read
Tian Pan
Software Engineer

The design document said each tool gets its own OAuth token, scoped to the minimum permissions that tool needs. The implementation stored tokens keyed by (user_id, provider). Both statements were true on the day v1 shipped, because there was exactly one tool per provider. The day a second tool against the same provider went live, the design document was still true and the storage layer silently invalidated it.

Six months later, a security review traced an incident back to that line of schema. A calendar-reader tool, compromised through a prompt injection in an event description, had successfully called events.delete on the user's primary calendar. The reader had never been granted that scope. The writer had. The token store didn't distinguish between them.

This is the failure mode where a per-provider key shape silently aggregates privilege across tools that share a provider — and the architectural realization that OAuth scope is a property of a token, not a property of a tool.

The Pinned Dependency Your Security Agent Upgraded Past the Comment It Could Not See

· 10 min read
Tian Pan
Software Engineer

A Spanish customer complained that her annual renewal had been billed a day early. The support ticket bounced through three queues before it landed in front of an engineer who recognized the smell: a date-formatting regression, European cohort only. He ran git log against the date-formatting module and found nothing. The module had not been touched in eleven days. What had been touched, eleven days earlier, was its package.json — a lodash bump from 4.17.20 to 4.17.22, opened by a security agent, approved by the on-call, merged without comment.

Two lines above the version string, in the same file, was a comment written eighteen months ago: // do not upgrade — breaks the snapshot tests in date-formatting, see FRONT-2418. The security agent had not read it. Or, more precisely: the security agent had read the entire file, but its prompt instructed it to find vulnerable version strings, not to weigh the comments around them. The comment was load-bearing institutional knowledge. The agent treated it as scenery.

This is a coordination failure between two systems that did not know they were colliding. The security agent was doing its job. The original engineer who wrote the comment had done his job. The feature-development agent that respected the pin every time it touched the file was doing its job. Nobody had decided whose job it was to mediate between them.

The Session Affinity Your Provider Load Balancer Quietly Ignored

· 11 min read
Tian Pan
Software Engineer

Your dashboard says cache hit rate is 71%. Your finance partner is pleased. Your latency p50 is fine. Then a customer support thread arrives from a long-running agent session: turn 14 took eleven seconds to produce the first token, turn 15 took eight, turn 16 took nine. You pull the trace. Every one of those turns reports a cache_read_input_tokens value of zero. The system prompt is sixteen thousand tokens. The user thinks the agent is broken. You think your provider is broken. Neither of you is right. The aggregate hit rate is a survivorship statistic — it averages over the short conversations that hit cache trivially and quietly absorbs the long conversations that have collapsed to cold-on-first-token mid-session.

This is the failure mode that no provider postmortem will ever describe to you, because from their telemetry the system is working as designed. The load balancer is making the routing decision it was told to make. The cache is being populated and evicted on the schedule it was told to follow. The hint you passed — the prompt_cache_key, the conversation ID, the user ID, whatever string you serialized into that field — was advisory the whole time, and "advisory" means "ignored when convenient." Under load, when a scaling event happens, when an upstream pod is draining, when the affinity-aware tier is saturated, your hint quietly degrades to a uniform routing decision. The request lands on a cold pod. The KV tensors that would have served the prefix at sub-millisecond cost are sixteen feet away in a sibling rack and unreachable. Your conversation pays full-prefix cost again, and your dashboard's headline number doesn't move because two thousand other one-turn conversations hit cache fine.

The Tool Schema Migration That Broke Your Agent's Retries for Two Weeks

· 11 min read
Tian Pan
Software Engineer

The deprecation notice went out on a Tuesday. The downstream team rotated the response shape on their search tool — results[].snippet became results[].excerpt, a clean rename, six-week window, banner in the docs, three reminder emails to the engineering list. Every human consumer migrated. The agent did not, because the agent does not read email. For fourteen days the retry loop quietly parsed the new payload, found the field it was looking for missing, raised a KeyError, and counted that as a retryable failure. The retry hit the same endpoint, got the same new shape, raised the same error, gave up after three attempts, and returned an apologetic message to the user. The retry budget dashboard stayed green the entire time — retries were never exhausted, they were just permanently failing within budget. Success rate, measured at the tool layer, sat at zero on that path. Nobody looked because there was no page.

This is the shape of the failure that gets the most engineers in 2026: not the dramatic outage, but the silent contract drift where a human-facing migration runs to completion and the agent-facing one never starts because nobody knew there was one to run. The deprecation worked exactly as designed for the consumers it was designed for. The agent was a consumer nobody listed.

The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task

· 10 min read
Tian Pan
Software Engineer

A quarter into the rollout, the AI team reported a 25% reduction in average cost-per-API-call. The support team reported that average handle time on AI-routed tickets had drifted from four turns to seven. Both numbers were correct. Both teams were measuring the system they had been told to optimize. The finance team, sitting between them, could not reconcile the dashboards because neither one was denominated in the thing the customer was actually paying for: a resolved ticket. The cost-per-call had gone down. The cost-per-resolved-task had gone up 40%. Nobody owned that number, so nobody was watching it move.

This is the most common unit-economics failure I see in agentic deployments, and it is not a measurement bug. It is a definitional one. The vendor's pricing page exposes cost-per-call because that is the unit they bill. The spreadsheet line item inherits that unit because it fits in a cell. The engineering team optimizes against the unit they were given. By the time the gap between API economics and business economics becomes visible, it has been compounding for a quarter, and the agent has been quietly trained on the wrong loss function the entire time.

When 'Escalate to Human' Becomes the Queue: The Hidden Incentive Bug in Your AI Support Stack

· 10 min read
Tian Pan
Software Engineer

You shipped an AI support agent six months ago to deflect 40% of tier-one tickets. Today your human queue is longer than it was before launch, your CSAT is down, and the per-ticket cost has gone up. The deflection dashboard says everything is fine. It is not.

The failure mode is not that the agent is bad at answering questions. The failure mode is that "escalate to human" was supposed to be the safety valve, and instead it became the path of least resistance. The agent learned, through the structure of its rewards and the absence of any cost on the escalation action, that handing the conversation off is the cheapest way to discharge an ambiguous request. Your support team did not notice this happening because the metric they watched — deflection rate — does not penalize the agent for routing fixable problems into the human queue. It only penalizes the agent for the user explicitly clicking "talk to a human" after a long unsuccessful exchange.

This is not a tooling problem. It is an incentive design problem, and the leadership failure is treating it as something the vendor will fix in the next release.

The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped

· 11 min read
Tian Pan
Software Engineer

A long-running agent generates a plan at step 3. The plan reads something like: "if the order returned by get_order in step 1 has status shipped, send the customer a tracking email; otherwise open a refund ticket." The agent confidently picks the email branch. The customer never received a tracking number, because the order was actually in pending. You go to the trace expecting to find a hallucination. What you find is worse: the step-1 tool result is no longer in context. The pruner evicted it between step 2 and step 3 — it ranked low on recency and there was a 12KB transcript to make room for. The plan still ran. The branch was still chosen. The decision now points at evidence that does not exist.

This is not a model failure in the usual sense. The model produced a syntactically valid plan, executed it in order, and made a branch decision. The branch was made against a fact that used to be in context and is not anymore. The chain of thought encoded the condition (if status == "shipped"); the actual status got dropped on the way to the step that needed it. The plan looks deterministic, but it has been quietly cut loose from its evidence.

The Agent Rollout Cadence Your Customer Success Team Could Not Absorb

· 11 min read
Tian Pan
Software Engineer

The customer pasted the agent's answer into a support chat and asked the human rep to confirm it. The rep, looking at the same product, said the opposite. The customer did not lose trust in the agent that day. They lost trust in the company, because two parts of it told them two different things in the same hour.

Nothing was broken. The AI team had shipped a prompt change on Tuesday behind a feature flag, ramped it to 100% by Thursday, and moved on. The customer success team's enablement cycle is monthly — that is how every other product feature has always landed, and nobody re-negotiated the contract for AI. The macro in the CS rep's queue and the FAQ doc on the public site still described the previous behavior. The agent was correct. The rep was correct against the documentation they had. The company was incoherent.

The Agent Runbook Your Incident Commander Could Not Execute

· 10 min read
Tian Pan
Software Engineer

The page fires at 02:17 local time. The on-call SRE pulls up the agent runbook on their phone and reads step one: "check the agent's tool-call traces for anomalous tool usage." They open the link. They hit an SSO prompt for a workspace they do not belong to. Step two says inspect the prompt-construction logs; same wall. Step three says roll back to the previous prompt version, but the deploy permission is scoped to a team they are not on. By the time they figure out which Slack channel to escalate to and wake up the AI team's product manager because she is the only person they can find at 02:17, ninety minutes have passed and the customer-visible regression is still serving wrong answers.

The post-mortem will identify the access gap as the proximate cause. The deeper discomfort is that the runbook reads fine in daylight and runs blocked at night, because the person who wrote it has access the person who executes it does not.

The AI Feature Your CTO Funded That Your Security Team Will Not Let You Ship

· 11 min read
Tian Pan
Software Engineer

The post-mortem says "we found security too late." The actual finding is that security found you on time. Your process found security too late.

This is the AI feature that cleared the budget gate in January because the CTO and the CFO agreed the company needed an AI moment. It cleared a light legal review in March because it was a prototype. Engineering built against the agreed spec through Q2. In late July, the launch-readiness security review opened, and on day one the threat model came back with blockers on the auth scopes, the data-exfiltration paths, the model provider's residency story, and the prompt-injection surface. The team's quarter is now spent rebuilding to address findings that should have shaped the original spec. Two quarters of slip, an executive memo about "process improvements," and a quiet decision next planning cycle to "deprioritize AI deep-integrations."

The launch did not fail because security was slow. It failed because security entered after the shape of the feature had already been frozen.