191 posts tagged with "agents"

Prompt Contract Testing: How Teams Building Different Agents Coordinate Without Breaking Each Other

· 10 min read
Tian Pan
Software Engineer

When two microservices diverge in their API assumptions, your integration tests catch it before production does. When two agents diverge in their prompt assumptions, you find out when a customer gets contradictory answers, or when a cascading failure takes down the entire pipeline. Reported production failure rates for multi-agent AI systems range from 41% to 87%. More than a third of those failures aren't model quality problems; they're coordination breakdowns: one agent changed how it formats output, another still expects the old schema, and nobody has a test for that.

The underlying problem is that agents communicate through implicit contracts. A research agent agrees—informally, in someone's mental model—to return results as a JSON object with a sources array. The orchestrating agent depends on that shape. Nobody writes this down. Nobody tests it. Six weeks later the research agent's prompt is refined to return a ranked list instead, and the orchestrator silently drops half its inputs.
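A minimal sketch of what writing that contract down could look like, assuming the research agent's reply arrives as a JSON string. The function name and the sample payloads are hypothetical; the only detail taken from the text is "a JSON object with a sources array".

```python
import json


def check_research_contract(raw: str) -> list[str]:
    """Return contract violations; an empty list means the payload conforms."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return [f"expected a JSON object, got {type(payload).__name__}"]
    if not isinstance(payload.get("sources"), list):
        return ["missing or non-array 'sources' field"]
    return []


# Hypothetical payloads: the shape the orchestrator expects, and the
# ranked-list shape a later prompt refinement might start producing.
old_shape = '{"summary": "ok", "sources": ["https://example.com/a"]}'
new_shape = '[{"rank": 1, "url": "https://example.com/a"}]'
```

Run against recorded outputs in CI, a check like this flags the ranked-list change at review time rather than in the orchestrator.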

Timeout-Aware Agent Design: How to Deliver Partial Results Instead of Silent Failure

· 10 min read
Tian Pan
Software Engineer

An agent successfully creates a GitHub issue, opens a Jira ticket, and updates a shared spreadsheet. Then it times out before sending the Slack announcement. The framework records the run as delivered. The user never gets notified. The side effects exist in three systems; the result that matters to the human doesn't.

This is the most common timeout failure mode in production agent systems, and it's almost never the one teams prepare for. Most agent implementations treat a timeout like any other exception: catch it, log it, return an error. The user gets nothing, even though the agent completed 90% of the work. The question isn't whether to set timeouts — every production system needs them. The question is what an agent does when the clock runs out.
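One way to make the clock running out a reportable state is sketched below, under the assumption that the agent's work can be expressed as an ordered list of named steps. `RunReport` and `run_with_deadline` are hypothetical names. The deadline is only checked between steps, so a step that is already running is not interrupted.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunReport:
    completed: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        return "complete" if not self.pending else "partial"


def run_with_deadline(
    steps: list[tuple[str, Callable[[], None]]],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> RunReport:
    """Run steps in order; stop at the deadline and report what is left."""
    deadline = clock() + budget_s
    report = RunReport()
    for index, (name, action) in enumerate(steps):
        if clock() >= deadline:
            # Out of time: record the unfinished steps instead of raising.
            report.pending = [n for n, _ in steps[index:]]
            break
        action()
        report.completed.append(name)
    return report
```

The caller gets back the names of the side effects that already happened and the ones that did not, which is enough to tell the user "three of four done, the Slack announcement is still pending" instead of returning a bare error.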

The Agentic Debugger's Trap: When Your Agent Patches Faster Than You Can Diagnose

· 10 min read
Tian Pan
Software Engineer

A staff engineer I worked with last quarter caught a bug that had already been "fixed" three times in the previous six weeks. Three different engineers. Three different files. Three green CI runs. Three accepted agent-generated patches. Each patch made the failing test pass and the user-reported error disappear. Each one moved the bug somewhere else, where it waited until a different surface area triggered it again. The fourth time it surfaced, the data corruption it caused had been silently compounding for forty days.

The bug was a single off-by-one in a pagination cursor. The agent had been right that the symptom would go away. It had been wrong about why. And the engineers — competent, senior, well-intentioned — had each accepted a passing patch before they understood the failure mechanism.

This is the agentic debugger's trap: your agent can produce a fix faster than you can build the mental model needed to evaluate whether the fix is correct. Patch velocity outruns diagnosis. The bug count drops, the CI dashboard goes green, and you ship a codebase whose failure modes you no longer understand.

The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

· 11 min read
Tian Pan
Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade; it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. One bot's "queue transferred" metric and the other bot's "task received" metric went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.
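A routing cycle becomes visible once each conversation carries a record of where it has been. The sketch below assumes every handoff passes through one function that can see that record. `record_handoff`, `RoutingLoopError`, and the hop limit are hypothetical, and rejecting every revisit is a deliberately strict choice, since some systems have legitimate return trips.

```python
class RoutingLoopError(RuntimeError):
    """Raised when a conversation is routed back to an agent it already visited."""


def record_handoff(
    trail: tuple[str, ...], agent: str, max_hops: int = 6
) -> tuple[str, ...]:
    """Append `agent` to the conversation's trail, or fail loudly on a loop."""
    if agent in trail:
        path = " -> ".join(trail + (agent,))
        raise RoutingLoopError(f"routing cycle detected: {path}")
    if len(trail) >= max_hops:
        raise RoutingLoopError(f"hop limit {max_hops} exceeded: {' -> '.join(trail)}")
    return trail + (agent,)
```

Raising here turns a loop that would otherwise look healthy on both dashboards into an error that an alarm can fire on.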

The IDE Plugin Is the Product Now: When Your Coding Agent Outgrows the Editor's Plugin API

· 11 min read
Tian Pan
Software Engineer

The default mental model for an AI coding tool is a panel inside VS Code. A chat box, a few inline suggestions, maybe an "apply diff" button. That framing is two years out of date. The leading products in the category are not VS Code extensions; they are full editors that happen to look like VS Code on launch. Cursor is a fork. Windsurf is a fork. Zed is a from-scratch native editor. The pattern is not coincidence — it is what happens when an agent's surface area finally exceeds what the host editor's plugin API was designed to support.

If you are building a coding agent and still treating "ship a plugin" as the obvious distribution choice, you are about to hit the same wall the leaders walked into around 2024 and chose to climb. The wall has a name: the plugin API was built to add features to an editor controlled by humans, not to host an autonomous agent that wants to control the editor.

The Parallelism Trap in Agentic Pipelines: When Fan-Out Makes Latency Worse

· 8 min read
Tian Pan
Software Engineer

Your agent pipeline is slow, so you split the work across five parallel sub-agents. The p50 drops. You ship it. Three days later, an on-call page fires: a batch of user requests is timing out. You dig in and find that p99 has climbed from 4 seconds to 22 seconds. Nothing in the individual agents changed. The timeouts come from the orchestration layer waiting for the slowest of the five. Each path runs into a retrieval hiccup only 1% of the time, but a request that touches all five paths now has five chances to hit it.
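The arithmetic behind that shift is short. Assuming the branches hit the slow path independently (an assumption; correlated hiccups would change the number), the chance that at least one of n branches is slow is 1 - (1 - p)^n. The function name is hypothetical.

```python
def p_any_branch_slow(p_slow: float, n_branches: int) -> float:
    """Probability that at least one of n independent branches hits the slow path."""
    return 1.0 - (1.0 - p_slow) ** n_branches


# A 1% per-branch hiccup across five parallel branches:
print(f"{p_any_branch_slow(0.01, 5):.4f}")  # prints 0.0490
```

A hiccup that sat at the 99th percentile of a single agent now touches roughly 4.9% of fan-out requests, which is why p99 moves even though no individual agent changed.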

This is the parallelism trap: a pattern that looks like an obvious speedup but restructures your latency distribution in ways that hurt real users more than the p50 improvement helps them. Across production benchmarks, single agents match or outperform multi-agent pipelines on 64% of evaluated tasks. When parallel fan-out wins, it wins cleanly — but only for a specific class of problems. The mistake is treating fan-out as the default.

Persona Drift in Long-Running Agent Sessions: Why Your Agent Forgets Who It Is

· 10 min read
Tian Pan
Software Engineer

Most production agent failures look like model errors. The agent starts a session responding correctly to the system prompt — maintaining the right tone, respecting tool constraints, following the defined workflow. Then somewhere around turn 30 or 40, things subtly shift. The agent starts hedging where it should be direct. It calls tools it was told to avoid. It contradicts a decision it made 15 turns earlier. The system prompt hasn't changed, but the agent's behavior has.

This is persona drift: the progressive divergence between an agent's actual behavior and its original system instructions, caused by how transformers attend to increasingly buried context. Research quantifies it precisely — after 8–12 dialogue turns, persona self-consistency metrics degrade by more than 30%. Single-turn agents achieve roughly 90% task accuracy; multi-turn agents running the same tasks fall to around 65%. That 25-point drop isn't a model quality problem you can prompt your way around. It's an architectural property of how attention works over long sequences, and most teams discover it only after they've shipped a feature that degrades silently for hours before a user finally notices.

The Reply-All That Wasn't: Agent Outbound Fan-Out Hazards

· 9 min read
Tian Pan
Software Engineer

The user asked the agent to "let Karen know we're done." The agent called send_email with the recipient field set to karen-team@, the most plausible address its contact-lookup tool returned. The message — three paragraphs of internal-only project status, including a candid line about a customer's renewal risk — landed in forty inboxes. One of those inboxes belonged to the customer in question. The postmortem ran for two weeks.

There was no prompt injection. There was no model jailbreak. The tool worked exactly as specified. The contract the team wrote for send_email was "send a message to a recipient." The contract the world enforces is "broadcast to a group whose composition the sender did not audit." That gap — between what the tool is named and what the tool can actually do — is where most outbound agent incidents live.

Email is the obvious example, but the same hazard hides in every messaging tool an agent ever touches. The thirty years of muscle memory humans built for these channels did not transfer to the planner pattern-matching its way through a contact list.
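One way to narrow the gap between the tool's name and its reach is to expand the recipient before sending and refuse anything that fans out further than the caller asked for. The sketch below assumes a directory that maps group addresses to members. `review_recipient`, `SendDecision`, and the addresses in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SendDecision:
    allowed: bool
    recipients: tuple[str, ...]
    reason: str


def review_recipient(
    address: str, groups: dict[str, list[str]], max_fanout: int = 1
) -> SendDecision:
    """Expand `address` to its real audience and block unexpected fan-out."""
    # An address that is not a known group is treated as a single mailbox.
    members = tuple(groups.get(address, [address]))
    if len(members) > max_fanout:
        reason = f"{address} expands to {len(members)} recipients; needs confirmation"
        return SendDecision(False, members, reason)
    return SendDecision(True, members, "within fan-out limit")
```

Called with a hypothetical list such as `team-list@example.com`, the function returns `allowed=False` together with the expanded membership, so the agent can show the user who would actually receive the message before anything is sent.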

Retiring an Agent Tool the Planner Learned to Depend On

· 10 min read
Tian Pan
Software Engineer

You unregister lookup_account_v1 from the tool catalog, swap in lookup_account_v2, and edit one paragraph of the system prompt to point at the new name. Tests pass. Three days later, support tickets start mentioning that the assistant "keeps trying to call something that doesn't exist," or — more disturbingly — that it answers customer questions with confident, plausible numbers and never hits the database at all. The deprecation didn't fail at the wire. It failed in the planner.

This is the gap between treating a tool deprecation as a syntactic change and treating it as a behavioral migration. The agent didn't just have your function in a registry; it had months of plans, multi-step recipes, and few-shot examples that routed through that function as a checkpoint. Pulling it out is closer to retiring an internal API your downstream services have informally hardcoded — except the downstream service is a model whose habits you cannot grep, and whose fallback when its preferred tool disappears is to invent one.
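One possible shape for a behavioral migration, offered as a sketch rather than a prescribed fix: keep the old name resolvable for a while, forward it to the replacement, and count how often the planner still reaches for it. `ToolCatalog` and its methods are hypothetical, and the sketch assumes lookup_account_v2 accepts the same arguments as lookup_account_v1.

```python
from collections import Counter
from typing import Any, Callable


class ToolCatalog:
    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}
        self._aliases: dict[str, str] = {}
        self.deprecated_calls: Counter[str] = Counter()

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def deprecate(self, old_name: str, new_name: str) -> None:
        """Retire `old_name` but keep it callable as an alias of `new_name`."""
        if new_name not in self._tools:
            raise KeyError(f"replacement tool {new_name!r} is not registered")
        self._tools.pop(old_name, None)
        self._aliases[old_name] = new_name

    def call(self, name: str, **kwargs: Any) -> Any:
        if name in self._aliases:
            # Count the habit instead of failing the call.
            self.deprecated_calls[name] += 1
            name = self._aliases[name]
        return self._tools[name](**kwargs)
```

The counter gives a concrete signal for when the old name is safe to remove: when the planner has stopped calling it, not when the registry edit merged.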

Skills as Modules: When Your Agent Stack Needs an Import System

· 10 min read
Tian Pan
Software Engineer

A team I talked to last month hit a bug that any seasoned package-manager user would recognize on sight. Two skills in their agent shipped the same search_orders capability — one came from a billing toolpack, one came from a CRM toolpack. Whichever had been added to the manifest most recently won. The agent silently called the wrong one for three weeks. Refunds went to the wrong customer IDs. Their fix, they told me, was a meeting with the CRM and billing engineers to "agree on naming." A meeting. To resolve a name conflict between two installable modules.

That's the moment I realized what's happening in agent runtimes right now. The runtime-loadable capability pattern — skills, tool packs, prompt fragments, retrieval providers, MCP servers — is converging on the same problem languages solved with import systems decades ago. Name resolution. Version pinning. Dependency graphs. Conflict detection. Lazy loading. And most agent runtimes are reinventing each one badly, or not at all, and shipping the bill to their users in the form of meetings.
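Conflict detection, the piece that would have replaced the meeting, fits in a few lines. The sketch below assumes a registry keyed by capability name that rejects a second owner instead of letting the most recent manifest entry win. `SkillRegistry` and `CapabilityConflict` are hypothetical names.

```python
class CapabilityConflict(ValueError):
    """Raised when two packs try to provide the same capability name."""


class SkillRegistry:
    def __init__(self) -> None:
        self._owners: dict[str, str] = {}

    def register(self, pack: str, capability: str) -> str:
        """Claim `capability` for `pack` and return its qualified name."""
        owner = self._owners.get(capability)
        if owner is not None and owner != pack:
            raise CapabilityConflict(
                f"{capability!r} is already provided by {owner!r}; "
                f"{pack!r} must be imported under a qualified name"
            )
        self._owners[capability] = pack
        return f"{pack}.{capability}"
```

Failing at load time turns three weeks of silently calling the wrong tool into an error on the day the second toolpack is installed.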

Tool Schema Design Is Your Blast Radius: When Function Definitions Become Security Boundaries

· 10 min read
Tian Pan
Software Engineer

The most dangerous file in your agent codebase is the one you've been writing as if it were API documentation. The tool registry — that JSON or Pydantic schema that tells the model what functions exist and what arguments they take — is no longer a docstring. It is your authorization layer. And if you designed it the way most teams do, you handed the LLM a master key and called it good engineering.

Consider the canonical first cut at a tool: query_database(sql: string). The intent is reasonable — let the model formulate the right SQL for the user's question. The reality is that the model is now an untrusted client with unlimited DDL and DML rights to whatever database the connection string points at. The system prompt that says "only run SELECTs on the orders table" is a suggestion, not a control. When a prompt-injected tool result — an email body, a webpage, a PDF — tells the model to run DROP TABLE users, your authorization model is the model's instruction-following discipline. That is not authorization. That is hope.
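For contrast, here is a sketch of a narrower tool that moves the "only SELECTs on the orders table" rule out of the prompt and into the function itself. It uses the standard-library `sqlite3` module. The function name, the column names, and the row cap are assumptions made for illustration.

```python
import sqlite3

MAX_ROWS = 50  # hypothetical cap on how much one call can return


def list_orders_for_customer(
    conn: sqlite3.Connection, customer_id: int, limit: int = 10
) -> list[tuple[int, str]]:
    """Read-only lookup: the model supplies values, never SQL text."""
    if not isinstance(customer_id, int):
        raise TypeError("customer_id must be an integer")
    limit = max(1, min(limit, MAX_ROWS))
    cursor = conn.execute(
        # Fixed statement with bound parameters; the table and the verb
        # are chosen here, not by the model.
        "SELECT id, status FROM orders WHERE customer_id = ? ORDER BY id LIMIT ?",
        (customer_id, limit),
    )
    return cursor.fetchall()
```

With this schema, an injected instruction to drop a table has nothing to bind to: the worst a hostile argument can do is fail type validation or return another customer's rows, and the second risk still needs a separate check that the caller is entitled to that customer_id.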

Agent IAM Is Not Service IAM: Why OAuth Breaks When Intent Is Constructed at Runtime

· 12 min read
Tian Pan
Software Engineer

The bearer token model has one assumption that agents quietly violate: the caller knows what they want when they ask. OAuth scopes, IAM roles, and API keys are all designed around a principal whose intent is fixed before authentication begins. Your CI runner has stable intent. Your microservice has stable intent. An agent does not. An agent's intent is assembled at request time out of a user prompt, a system prompt, retrieved documents, and the outputs of tools that may themselves have been written by an attacker. By the time the agent reaches for a token, the policy decision that the IAM layer has to make has already been made — by inputs the IAM layer never saw.

This is why the same auth pattern that has worked for fifteen years of service-to-service traffic is now producing a class of incidents nobody has good language for. A prompt injection lifts a long-lived bearer token. An agent "remembers" a permission across sessions because the token outlived the user's intent. A multi-step task that legitimately needs three scopes holds all of them for the entire session instead of acquiring and releasing them per step. None of these are OAuth bugs in the strict sense. They are consequences of stretching a model that assumes static intent to cover a caller whose intent is reconstructed every turn.
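The "acquire and release per step" alternative can be sketched with a context manager. Nothing here talks to a real identity provider: `ScopeLedger` is a hypothetical stand-in that only tracks which scopes are held at each moment, and the scope names in the usage note are made up.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


class ScopeLedger:
    """Tracks which scopes are currently held and the most held at once."""

    def __init__(self) -> None:
        self.held: set[str] = set()
        self.peak = 0

    @contextmanager
    def grant(self, scope: str) -> Iterator[None]:
        self.held.add(scope)
        self.peak = max(self.peak, len(self.held))
        try:
            yield
        finally:
            # Released even if the step raises, so nothing outlives its step.
            self.held.discard(scope)


def run_task(
    ledger: ScopeLedger, steps: list[tuple[str, Callable[[], None]]]
) -> None:
    """Run each step while holding only the scope that step needs."""
    for scope, action in steps:
        with ledger.grant(scope):
            action()
```

For a three-step task needing, say, `calendar:read`, `mail:send`, and `files:write`, the ledger's peak stays at one scope instead of three, so a prompt injection that lands during one step can only misuse the scope that step was holding.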