Skip to main content

55 posts tagged with "distributed-systems"

View all tags

The Async Tool Call Your Agent Fired and Forgot

· 10 min read
Tian Pan
Software Engineer

The clearest sign that an agent's tool-call abstraction is broken is when the trace shows the step marked done and the downstream system shows nothing happened. The model called a tool, received a job ID back, treated the job ID as the answer, and moved on. Three minutes later the actual work either succeeded with nobody listening or failed with the error landing in a log nobody reads. The user sees a confident summary; the operations queue sees a stranded task.

This is the failure mode the function-calling abstraction quietly enables. JSON schemas describe parameters and return types, but they do not distinguish between "this tool returns a result" and "this tool returns a receipt for an operation whose result you will need to ask about later." The model treats both the same way, because to the planner they look the same — a successful tool call with a non-error payload.

The Multi-Agent Deadlock That Hangs on Two Calendars

· 10 min read
Tian Pan
Software Engineer

Agent A asks Agent B for a piece of data it needs to finish its task. Agent B, before answering, asks Agent A for a piece of context it needs to produce that data. Both requests cross a "human review required" boundary on the way out. The first request lands in a Slack approval channel watched by Priya. The second lands in a Jira queue watched by Marcus. Priya is at lunch. Marcus is in a customer call. Neither knows the other exists. The workflow hangs for nineteen hours, and nobody notices until a customer escalation forces somebody to ask why the rollup never landed.

This is not a novel failure. It is the oldest failure in distributed systems, wearing a new costume. The Coffman conditions — mutual exclusion, hold and wait, no preemption, circular wait — were named in 1971, and a multi-agent system with human-in-the-loop approval queues satisfies all four by default. The new wrinkle is that one of the "resources" in the deadlock is a person's attention, which means your liveness guarantee is now bound by how quickly two humans who don't know they're paired can independently context-switch.

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.

Your Agent Endpoint Is a Distributed System Pretending to Be a Function Call

· 9 min read
Tian Pan
Software Engineer

The most dangerous line of code in a modern AI application looks completely innocent:

result = await agent.run(user_query)

It reads like a function call. It has a name, it takes an argument, it returns a value. Your IDE autocompletes it. Your type checker is satisfied. And that single await is hiding a remote, multi-hop, partially-failing distributed system behind the syntax of a local procedure. The gap between what the code looks like and what it actually does is where most production agent incidents live.

Your Retry Logic Is Teaching the Agent the Wrong Lesson

· 10 min read
Tian Pan
Software Engineer

A tool call fails. Your agent framework retries it three times with exponential backoff. The third attempt goes through. The trace shows a green checkmark. Nobody gets paged, no error counter increments, the user gets their answer. By every dashboard you have, the system worked.

It didn't. The tool failed because the agent passed a malformed argument, and the only reason the third try succeeded is that the agent — sampling differently each time — happened to phrase the call correctly on attempt three. You didn't recover from a transient fault. You ran a slot machine until it paid out, then logged the payout and threw away the two pulls that told you the agent was broken.

This is the quiet way retry logic rots an agent system. Retries were designed for a world where the caller is correct and the network is flaky. Agents invert that assumption: the network is mostly fine, and the caller is the unreliable part. When you point a retry policy built for the first world at the second one, it stops being a recovery mechanism and becomes a way to launder bugs into green checkmarks.

The Idempotency Key Your Agent Never Sent

· 11 min read
Tian Pan
Software Engineer

A customer once got refunded three times for a single return. Not because the model hallucinated a policy, not because a human fat-fingered a form — because the refund tool timed out twice, the agent retried both times, and every retry carried a fresh request with no way for the payment processor to know it had seen this work before. Three clean HTTP 200s. Three real movements of money. The agent did exactly what it was told: when a call fails, try again.

The bug was not in the model. The bug was in a header that was never sent.

Retrying is the single most natural thing an agent does. A tool call returns an error, or worse, returns nothing at all, and the loop's instinct — encoded in the framework, the prompt, or the model's own training — is to try the action again. That instinct is correct for reads and catastrophic for writes. The difference between a resilient agent and one that double-charges customers is not intelligence. It is whether every state-changing tool call carries an idempotency key, and whether the system on the other end actually honors it.

When Two Agents Share a Tool: Concurrency Bugs in Multi-Agent Systems

· 9 min read
Tian Pan
Software Engineer

The moment you typed "spin up another agent to handle that in parallel," you became a distributed systems engineer. You probably didn't notice. The framework made it a one-line change, the demo worked, and the latency dropped. But under the hood you just introduced two processes that read and write shared state with no coordination — and every race condition, lost update, and dirty read that has haunted databases for fifty years is now sitting in your agent stack, waiting.

The reason this bites so hard is that the failure doesn't look like a concurrency bug. It looks like one agent being wrong. The output is syntactically valid, the pipeline is green, no exception is thrown — and yet a customer got charged twice, or a file is missing half its expected content, or an agent confidently acted on a number that another agent had already overwritten. You go debug "the dumb agent" and find nothing wrong with its prompt, because the prompt was never the problem.

Tool Latency Tail: Why p99 Reshapes Agent Architecture and p50 Hides the Problem

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter launched a seven-step agent and built its latency budget the obvious way: search returns in 200ms, the SQL lookup takes 80ms, the email send is 150ms, and so on down the chain. Add the medians, sprinkle in some buffer, and the math says the agent fits comfortably inside its two-second SLA. The dashboards confirmed it for weeks. Median latency was beautiful. Then customers started complaining the feature was unusably slow, and the dashboards still looked green.

The story they were telling each other was wrong because they had built the architecture around sum(p50) while users were experiencing sum(p99). After three or four hops, the probability that any link in the chain has fallen into its own tail is no longer negligible. After seven hops, it approaches a coin flip. None of the per-tool dashboards ever turned red because none of the per-tool services were misbehaving — the problem was that nobody owned the multiplicative composition.

This is not a new lesson. Distributed-systems researchers have been writing about it for forty years. What's new is that every team building agents is rediscovering it, badly, on a deadline.

Reading the Agent Stack Trace: Triangulating Failures Across Model, Tool, and Harness

· 10 min read
Tian Pan
Software Engineer

A user reports that the agent gave a wrong answer. You open the trace. The model's reasoning looks fine. The tool calls all returned 200 OK. The harness logs show no retries, no truncation, no anomalies. And yet the answer is wrong. So you spend the next two hours stitching together three separate log streams in three different formats with three different clocks, and you eventually discover that a tool quietly returned {"result": null} for one specific query shape, the model rationalized the null into a plausible-sounding fact, and the harness happily forwarded the hallucination to the user. None of the three layers logged anything alarming on its own. The failure lived in the joints.

This is the dominant failure pattern in production agent systems, and most teams are debugging it with single-layer tools. The model team blames the tool. The tool team blames the model. The platform team blames the harness. Everyone is partially right, because an agent failure is almost never a single-component bug — it is a misalignment between three components that each operate on different mental models of "a step." Until your tracing infrastructure reflects that reality, you will keep paying for the same incident in different costumes.

The Write Side of the Agent: Designing for Reversibility at the Action Layer

· 11 min read
Tian Pan
Software Engineer

A Cursor agent running an AI coding assistant encountered a credential mismatch while working on a production database. It resolved the problem by deleting everything it couldn't access — the production database, its backups, and the ancillary records. The operation took nine seconds. Customers lost reservations. The company spent days reconstructing records from payment processor emails.

The agent had not been told to preserve data. It had also not been told not to delete it. There was no write journal, no staging step, no confirmation gate on destructive operations, and no separation between the agent's API token scope and full database access. The agent found the most direct path to satisfying its immediate objective and took it.

Agentic Systems Are Distributed Systems: Apply Microservices Lessons Before You Learn Them the Hard Way

· 12 min read
Tian Pan
Software Engineer

The failure rates for multi-agent AI systems in production are embarrassing. A landmark study analyzing over 1,600 execution traces across seven popular frameworks found failure rates ranging from 41% to 87%. Carnegie Mellon researchers put leading agent systems at 30–35% task completion on multi-step benchmarks. Gartner is predicting 40% of agentic AI projects will be cancelled by the end of 2027.

Here is the uncomfortable truth: these aren't AI problems. They're distributed systems problems that engineers already solved between 2010 and 2018, documented exhaustively in blog posts, conference talks, and eventually in Martin Kleppmann's Designing Data-Intensive Applications. The teams that are shipping reliable agent systems today aren't doing anything magical — they're applying circuit breakers, bulkheads, event sourcing, and idempotency keys. The teams that are failing are treating agents as a new paradigm when they're a new deployment target for old patterns.

Dead Letters for Agents: What to Do When No Agent Can Complete a Task

· 10 min read
Tian Pan
Software Engineer

A team building a multi-agent research tool discovered, on day eleven of a runaway job, that two of their agents had been cross-referencing each other's outputs in a loop the entire time. The bill: $47,000. No human had seen the results. No alarm had fired. The system simply kept running, confident it was making progress, because nothing in the architecture asked the question: what happens when a task genuinely cannot be completed?

Message queues solved this problem decades ago with the dead-letter queue (DLQ). A message that exceeds its delivery retry limit gets routed to a holding area where operators can inspect it, fix the root cause, and replay it when the system is ready. The pattern is simple, battle-tested, and almost entirely missing from production agent systems today.