Skip to main content

58 posts tagged with "distributed-systems"

View all tags

The Agent Wall-Clock Budget That Raced Your Tool's Own Timeout

· 11 min read
Tian Pan
Software Engineer

There is a class of agent bug that does not appear in any single component when you look at it in isolation. The model is fine. The tool is fine. The retry policy is fine. The timeout values are even, on paper, generous. And yet a tool that consistently completes in eight seconds keeps landing against an agent that has already declared it a failure at seven point nine, replanned around an "error" that never happened, and started a second call that the first call's result is about to collide with.

The bug is not in any of the boxes. It is in the gap between two clocks that nobody agreed should be the same clock.

The Async Tool Call That Resolved After the User Already Closed the Conversation

· 12 min read
Tian Pan
Software Engineer

The clearest sign that an agent's session model is broken is when a tool result has nowhere to go. The agent fired a long-running call — a render, a provisioning job, a multi-step query. The user watched the spinner for a few seconds, decided they didn't need it after all, closed the tab, and moved on. Forty seconds later the tool finishes. Its callback hits your gateway with a conversation_id that no longer points at anything. The gateway has two equally bad options: silently drop the result, or stitch it into whatever session inherits that ID next.

Most teams discover this failure mode the same way: a support ticket where a user sees an answer they did not ask for, attached to a conversation they did not start. Or a downstream system that processed the same charge twice because the gateway helpfully "retried" delivery against the next active session. Or — most commonly — nothing visible at all, just a slow drift in completion metrics that nobody can correlate to anything specific, because the failures don't fire alerts; they fire emptiness.

The Async Tool Call Your Agent Fired and Forgot

· 10 min read
Tian Pan
Software Engineer

The clearest sign that an agent's tool-call abstraction is broken is when the trace shows the step marked done and the downstream system shows nothing happened. The model called a tool, received a job ID back, treated the job ID as the answer, and moved on. Three minutes later the actual work either succeeded with nobody listening or failed with the error landing in a log nobody reads. The user sees a confident summary; the operations queue sees a stranded task.

This is the failure mode the function-calling abstraction quietly enables. JSON schemas describe parameters and return types, but they do not distinguish between "this tool returns a result" and "this tool returns a receipt for an operation whose result you will need to ask about later." The model treats both the same way, because to the planner they look the same — a successful tool call with a non-error payload.

The Cache Stampede That Hit Your Model Provider Instead of Your Database

· 10 min read
Tian Pan
Software Engineer

The pager went off at 14:02 UTC. Not for latency, not for errors — for spend. The cost dashboard showed a vertical line: three minutes of input-token billing at roughly nine times the trailing hourly average, then back to normal. No regression had shipped. No tenant had onboarded. Traffic was flat to the minute. The only thing that changed is that a single prompt prefix — the 14K-token system message that every agent in the fleet shared — had quietly expired on the provider side, and a thousand workers had all decided, within the same 200ms window, that they were the ones who needed to write it back.

This is a cache stampede. It is the same bug operators have been writing post-mortems about since memcached shipped in 2003. What is new in 2026 is that the cache it stampedes is no longer yours. It lives inside your model provider, you cannot inspect its state, and every miss costs real money instead of a few extra database queries. The synchronization bug that database engineers learned to jitter away two decades ago has quietly reappeared on a bill line item nobody thought to defend.

The Multi-Agent Deadlock That Hangs on Two Calendars

· 10 min read
Tian Pan
Software Engineer

Agent A asks Agent B for a piece of data it needs to finish its task. Agent B, before answering, asks Agent A for a piece of context it needs to produce that data. Both requests cross a "human review required" boundary on the way out. The first request lands in a Slack approval channel watched by Priya. The second lands in a Jira queue watched by Marcus. Priya is at lunch. Marcus is in a customer call. Neither knows the other exists. The workflow hangs for nineteen hours, and nobody notices until a customer escalation forces somebody to ask why the rollup never landed.

This is not a novel failure. It is the oldest failure in distributed systems, wearing a new costume. The Coffman conditions — mutual exclusion, hold and wait, no preemption, circular wait — were named in 1971, and a multi-agent system with human-in-the-loop approval queues satisfies all four by default. The new wrinkle is that one of the "resources" in the deadlock is a person's attention, which means your liveness guarantee is now bound by how quickly two humans who don't know they're paired can independently context-switch.

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.

Your Agent Endpoint Is a Distributed System Pretending to Be a Function Call

· 9 min read
Tian Pan
Software Engineer

The most dangerous line of code in a modern AI application looks completely innocent:

result = await agent.run(user_query)

It reads like a function call. It has a name, it takes an argument, it returns a value. Your IDE autocompletes it. Your type checker is satisfied. And that single await is hiding a remote, multi-hop, partially-failing distributed system behind the syntax of a local procedure. The gap between what the code looks like and what it actually does is where most production agent incidents live.

Your Retry Logic Is Teaching the Agent the Wrong Lesson

· 10 min read
Tian Pan
Software Engineer

A tool call fails. Your agent framework retries it three times with exponential backoff. The third attempt goes through. The trace shows a green checkmark. Nobody gets paged, no error counter increments, the user gets their answer. By every dashboard you have, the system worked.

It didn't. The tool failed because the agent passed a malformed argument, and the only reason the third try succeeded is that the agent — sampling differently each time — happened to phrase the call correctly on attempt three. You didn't recover from a transient fault. You ran a slot machine until it paid out, then logged the payout and threw away the two pulls that told you the agent was broken.

This is the quiet way retry logic rots an agent system. Retries were designed for a world where the caller is correct and the network is flaky. Agents invert that assumption: the network is mostly fine, and the caller is the unreliable part. When you point a retry policy built for the first world at the second one, it stops being a recovery mechanism and becomes a way to launder bugs into green checkmarks.

The Idempotency Key Your Agent Never Sent

· 11 min read
Tian Pan
Software Engineer

A customer once got refunded three times for a single return. Not because the model hallucinated a policy, not because a human fat-fingered a form — because the refund tool timed out twice, the agent retried both times, and every retry carried a fresh request with no way for the payment processor to know it had seen this work before. Three clean HTTP 200s. Three real movements of money. The agent did exactly what it was told: when a call fails, try again.

The bug was not in the model. The bug was in a header that was never sent.

Retrying is the single most natural thing an agent does. A tool call returns an error, or worse, returns nothing at all, and the loop's instinct — encoded in the framework, the prompt, or the model's own training — is to try the action again. That instinct is correct for reads and catastrophic for writes. The difference between a resilient agent and one that double-charges customers is not intelligence. It is whether every state-changing tool call carries an idempotency key, and whether the system on the other end actually honors it.

When Two Agents Share a Tool: Concurrency Bugs in Multi-Agent Systems

· 9 min read
Tian Pan
Software Engineer

The moment you typed "spin up another agent to handle that in parallel," you became a distributed systems engineer. You probably didn't notice. The framework made it a one-line change, the demo worked, and the latency dropped. But under the hood you just introduced two processes that read and write shared state with no coordination — and every race condition, lost update, and dirty read that has haunted databases for fifty years is now sitting in your agent stack, waiting.

The reason this bites so hard is that the failure doesn't look like a concurrency bug. It looks like one agent being wrong. The output is syntactically valid, the pipeline is green, no exception is thrown — and yet a customer got charged twice, or a file is missing half its expected content, or an agent confidently acted on a number that another agent had already overwritten. You go debug "the dumb agent" and find nothing wrong with its prompt, because the prompt was never the problem.

Tool Latency Tail: Why p99 Reshapes Agent Architecture and p50 Hides the Problem

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter launched a seven-step agent and built its latency budget the obvious way: search returns in 200ms, the SQL lookup takes 80ms, the email send is 150ms, and so on down the chain. Add the medians, sprinkle in some buffer, and the math says the agent fits comfortably inside its two-second SLA. The dashboards confirmed it for weeks. Median latency was beautiful. Then customers started complaining the feature was unusably slow, and the dashboards still looked green.

The story they were telling each other was wrong because they had built the architecture around sum(p50) while users were experiencing sum(p99). After three or four hops, the probability that any link in the chain has fallen into its own tail is no longer negligible. After seven hops, it approaches a coin flip. None of the per-tool dashboards ever turned red because none of the per-tool services were misbehaving — the problem was that nobody owned the multiplicative composition.

This is not a new lesson. Distributed-systems researchers have been writing about it for forty years. What's new is that every team building agents is rediscovering it, badly, on a deadline.

Reading the Agent Stack Trace: Triangulating Failures Across Model, Tool, and Harness

· 10 min read
Tian Pan
Software Engineer

A user reports that the agent gave a wrong answer. You open the trace. The model's reasoning looks fine. The tool calls all returned 200 OK. The harness logs show no retries, no truncation, no anomalies. And yet the answer is wrong. So you spend the next two hours stitching together three separate log streams in three different formats with three different clocks, and you eventually discover that a tool quietly returned {"result": null} for one specific query shape, the model rationalized the null into a plausible-sounding fact, and the harness happily forwarded the hallucination to the user. None of the three layers logged anything alarming on its own. The failure lived in the joints.

This is the dominant failure pattern in production agent systems, and most teams are debugging it with single-layer tools. The model team blames the tool. The tool team blames the model. The platform team blames the harness. Everyone is partially right, because an agent failure is almost never a single-component bug — it is a misalignment between three components that each operate on different mental models of "a step." Until your tracing infrastructure reflects that reality, you will keep paying for the same incident in different costumes.