45 posts tagged with "distributed-systems"

Write-Ahead Logging for AI Agents: Borrowing Database Recovery Patterns for Crash-Safe Execution

· 10 min read
Tian Pan
Software Engineer

Your agent is on step 7 of a 12-step workflow — it has already queried three APIs, written two files, and sent a Slack notification — when the process crashes. What happens next? If your answer is "restart from step 1," you're about to re-send that Slack message, re-write those files, and burn through your LLM token budget a second time. Databases solved this exact problem decades ago with write-ahead logging. The pattern translates to agent architectures with surprising fidelity.

The core insight is simple: before an agent executes any step, it records what it intends to do. Before it moves on, it records what happened. This append-only log becomes the single source of truth for recovery — not the agent's in-memory state, not a snapshot of the world, but a sequential record of intentions and outcomes that can be replayed deterministically.
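The intent/outcome log described above can be sketched in a few lines. This is a minimal illustration, not the post's implementation; the class and method names (`AgentWAL`, `log_intent`, `last_completed_step`) are assumptions:

```python
import json
from pathlib import Path

class AgentWAL:
    """Append-only write-ahead log for agent steps, one JSON record per line."""

    def __init__(self, path):
        self.path = Path(path)

    def _append(self, record):
        # Append and flush before returning, so the record survives a crash.
        with self.path.open("a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()

    def log_intent(self, step, tool, args):
        # Written BEFORE the step executes: "here is what I am about to do."
        self._append({"type": "intent", "step": step, "tool": tool, "args": args})

    def log_outcome(self, step, result):
        # Written AFTER the step succeeds: "here is what actually happened."
        self._append({"type": "outcome", "step": step, "result": result})

    def last_completed_step(self):
        """Replay the log; a step counts as done only if its outcome was recorded."""
        done = -1
        if not self.path.exists():
            return done
        for line in self.path.read_text().splitlines():
            rec = json.loads(line)
            if rec["type"] == "outcome":
                done = max(done, rec["step"])
        return done
```

On restart, the agent resumes from `last_completed_step() + 1`: a step with an intent but no outcome is exactly the step that was in flight when the crash happened.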

Agent Idempotency: Why Your AI Agent Sends That Email Twice

· 9 min read
Tian Pan
Software Engineer

Your agent processed a refund, but the response timed out. The framework retried. The customer got refunded twice. Your agent sent a follow-up email, hit a rate limit, retried after backoff, and the customer received two identical messages. These aren't hypothetical scenarios — they're the most common class of production failures in agentic systems, and almost every agent framework ships with retry logic that makes them inevitable.

The root problem is deceptively simple: agent frameworks treat every tool call the same way, regardless of whether it reads data or changes the world. A get_user_profile() call is safe to retry a hundred times. A send_payment() call is not. Yet most frameworks wrap both in the same retry-with-exponential-backoff logic and call it "reliability."
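The read/write distinction can be made explicit at the dispatch layer. A hypothetical sketch, in which the tool names, the in-memory dedupe store, and `_execute` are all illustrative assumptions:

```python
READ_ONLY = {"get_user_profile"}   # safe to retry any number of times
SIDE_EFFECT = {"send_payment"}     # changes the world; must be deduplicated

_executed = {}          # idempotency_key -> cached result (use durable storage in production)
CALLS = {"count": 0}    # counts real executions, to show retries are absorbed

def _execute(name, args):
    # Placeholder for real tool dispatch.
    CALLS["count"] += 1
    return {"tool": name, "args": args, "status": "ok"}

def call_tool(name, args, idempotency_key=None):
    if name in READ_ONLY:
        return _execute(name, args)           # retrying a read is harmless
    if idempotency_key is None:
        raise ValueError(f"{name} has side effects; idempotency key required")
    if idempotency_key in _executed:
        # A retry replays the cached result instead of charging the card again.
        return _executed[idempotency_key]
    result = _execute(name, args)
    _executed[idempotency_key] = result
    return result
```

A retried `send_payment` with the same key returns the first call's result without executing a second payment; a side-effecting call with no key fails loudly instead of silently becoming retry-unsafe.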

The Retry Storm Problem in Agentic Systems: Why Naive Retries Burn 200x the Tokens

· 10 min read
Tian Pan
Software Engineer

Your agent calls a tool. The tool times out. The agent retries. Each retry sends the full conversation context back to the LLM, burning tokens on a request that will never succeed. Meanwhile, the retry triggers a second tool call that depends on the first, which also fails and retries. Within seconds, a single flaky API has amplified into dozens of redundant requests, each one consuming compute, tokens, and time — and each one making the underlying problem worse.

This is the retry storm. It's not a new concept — distributed systems engineers have battled retry amplification for decades. But agentic AI systems make it dramatically worse in ways that microservice-era patterns don't fully address.

The Retry Storm Problem in Agentic Systems: Why Every Failed Tool Call Burns Your Token Budget

· 10 min read
Tian Pan
Software Engineer

Every backend engineer knows that retries are essential. Every distributed systems engineer knows that retries are dangerous. When you put an LLM agent in charge of retrying tool calls, you get both problems at once — plus a new one: every retry burns tokens. A single flaky API endpoint can turn a $0.01 agent task into a $2 meltdown in under a minute.

The retry storm problem isn't new. Distributed systems have dealt with thundering herds and cascading failures for decades. But agentic systems amplify the problem in ways that microservice patterns don't fully address, because the retry logic lives inside a probabilistic reasoning engine that doesn't understand backpressure.
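One backpressure primitive the reasoning engine won't give you is a circuit breaker around the tool layer. A minimal sketch, with thresholds and names chosen for illustration:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures, so no tokens are spent
    on calls that cannot currently succeed."""

    def __init__(self, failure_threshold=3, cooldown_s=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                # Open: reject immediately, without invoking the tool or the LLM.
                raise RuntimeError("circuit open; skipping call")
            self.opened_at = None  # cooldown elapsed: half-open, allow one probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Wrapping each tool in a breaker caps the damage: once the endpoint has failed a few times, subsequent attempts are rejected locally at zero token cost instead of replaying the full conversation context.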

Agent-to-Agent Communication Protocols: The Interface Contracts That Make Multi-Agent Systems Debuggable

· 11 min read
Tian Pan
Software Engineer

When a multi-agent pipeline starts producing garbage outputs, the instinct is to blame the model. Bad reasoning, wrong context, hallucination. But in practice, a large fraction of multi-agent failures trace back to something far more boring: agents that can't reliably communicate with each other. Malformed JSON that passes syntax validation but fails semantic parsing. An orchestrator that sends a task with status "partial" that the downstream agent interprets as completion. A retry that fires an operation twice because there's no idempotency key.

These aren't model failures. They're interface failures. And they're harder to debug than model failures because nothing in your logs will tell you the serialization contract broke.
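A contract that catches these failures has to validate semantics, not just syntax. A sketch of one possible message envelope; the field names and status values are assumptions, not a standard protocol:

```python
from dataclasses import dataclass, field
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    PARTIAL = "partial"     # explicitly distinct from COMPLETE
    COMPLETE = "complete"
    FAILED = "failed"

@dataclass(frozen=True)
class TaskMessage:
    task_id: str
    idempotency_key: str    # lets the receiver dedupe retried deliveries
    status: TaskStatus
    payload: dict = field(default_factory=dict)

def parse_message(raw: dict) -> TaskMessage:
    """Semantic validation: an unknown or missing status fails loudly here,
    instead of being silently misread as completion downstream."""
    try:
        status = TaskStatus(raw["status"])
        return TaskMessage(raw["task_id"], raw["idempotency_key"],
                           status, raw.get("payload", {}))
    except (KeyError, ValueError) as e:
        raise ValueError(f"malformed message: {raw!r}") from e
```

The enum forces every agent to agree on what "partial" means, and the mandatory idempotency key makes duplicate delivery a no-op instead of a double execution.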

Why Your Agent Harness Should Be Stateless: Decoupling Brain from Hands in Production

· 9 min read
Tian Pan
Software Engineer

Most teams building AI agents treat the harness — the scaffolding that handles tool routing, context management, and the inference loop — as a long-lived, stateful process tied to a single container. When the container fails, the session dies. When you need to swap in a better model, you have to restart everything. When you want to scale horizontally, you hit a wall: each harness instance knows too much about its own state to be interchangeable.

The fix isn't a smarter harness. It's a stateless one.
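What a stateless turn looks like, roughly: load everything from a shared store, do the work, persist before replying. The `SessionStore` and `handle_turn` names are illustrative, and the in-memory dict stands in for Redis or a database:

```python
class SessionStore:
    """Stand-in for a shared external store; here an in-memory dict."""
    def __init__(self):
        self._data = {}
    def load(self, session_id):
        return self._data.get(session_id, {"history": []})
    def save(self, session_id, state):
        self._data[session_id] = state

def handle_turn(store, session_id, user_message, model):
    # 1. Load all state from outside the process: the harness holds nothing.
    state = store.load(session_id)
    state["history"].append({"role": "user", "content": user_message})
    # 2. Run inference. Any harness instance, with any model, can serve this turn.
    reply = model(state["history"])
    state["history"].append({"role": "assistant", "content": reply})
    # 3. Persist before returning, so a crash after this point loses nothing.
    store.save(session_id, state)
    return reply
```

Because each turn begins and ends at the store, a container can die between turns, a better model can be swapped in mid-session, and any instance behind the load balancer is interchangeable.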

Compensating Transactions and Failure Recovery for Agentic Systems

· 10 min read
Tian Pan
Software Engineer

In July 2025, a developer used an AI coding agent to work on their SaaS product. Partway through the session they issued a "code freeze" instruction. The agent ignored it, executed destructive SQL operations against the production database, deleted data for over 1,200 accounts, and then — apparently to cover its tracks — fabricated roughly 4,000 synthetic records. The AI platform's CEO issued a public apology.

The root cause was not a hallucination or a misunderstood instruction. It was a missing engineering primitive: the agent had unrestricted write and delete permissions on production state, and no mechanism existed to undo what it had done.

This is the central problem with agentic systems that operate in the real world. LLMs are non-deterministic, tool calls fail 3–15% of the time in production deployments, and many actions — sending an email, charging a card, deleting a record, booking a flight — cannot be taken back by simply retrying with different parameters. The question is not whether your agent will fail mid-workflow. It will. The question is whether your system can recover.
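The saga pattern gives the recovery mechanism that incident lacked: pair every forward action with a compensating one, and unwind on failure. A minimal sketch under that assumption, with the function names chosen for illustration:

```python
def run_with_compensation(steps):
    """steps: list of (action, compensate) pairs.
    On any failure, undo completed actions in reverse order, then re-raise."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()   # best-effort undo; a real system logs compensation failures
        raise
```

A refund compensates a charge, a correction email compensates a sent email, a restore-from-snapshot compensates a delete. Note the asymmetry: compensation is a new forward action that semantically reverses the original, not a rollback, because the original side effect already happened.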

Async Agent Workflows: Designing for Long-Running Tasks

· 10 min read
Tian Pan
Software Engineer

Most AI agent demos run inside a single HTTP request. The user sends a message, the agent reasons for a few seconds, the response comes back. Clean, simple, comprehensible. Then someone asks the agent to do something that takes eight minutes — run a test suite, draft a report from twenty web pages, process a batch of documents — and the whole architecture silently falls apart.

The 30-second wall is real. Cloud functions time out. Load balancers kill idle connections. Mobile clients go to sleep. None of the standard agent frameworks document what to do when your task outlives the transport layer. Most of them quietly fail.
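The escape from the 30-second wall is the same one batch systems have always used: acknowledge immediately with a job id, do the work out of band, let the client poll. A minimal in-process sketch; in production the thread would be a durable queue worker, and `JOBS` a database:

```python
import threading
import uuid

JOBS = {}  # job_id -> {"status": ..., "result": ...}

def submit(task_fn, *args):
    """Return a job id immediately; the work outlives the submitting request."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "result": None}

    def worker():
        try:
            JOBS[job_id] = {"status": "done", "result": task_fn(*args)}
        except Exception as e:
            JOBS[job_id] = {"status": "failed", "result": str(e)}

    threading.Thread(target=worker, daemon=True).start()
    return job_id

def status(job_id):
    # The client polls this (or subscribes to a webhook) on its own schedule.
    return JOBS[job_id]
```

The HTTP request now only carries the submission and the polls, each well under any timeout; the eight-minute task runs wherever the worker lives, and a sleeping mobile client can check back whenever it wakes.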

Why Multi-Agent LLM Systems Fail (and How to Build Ones That Don't)

· 8 min read
Tian Pan
Software Engineer

Most multi-agent LLM systems deployed in production fail within weeks — not from infrastructure outages or model regressions, but from coordination problems that were baked in from the start. A comprehensive analysis of 1,642 execution traces across seven open-source frameworks found failure rates ranging from 41% to 86.7% on standard benchmarks. That's not a model quality problem. That's a systems engineering problem.

The uncomfortable finding: roughly 79% of those failures trace back to specification and coordination issues, not compute limits or model capability. You can swap in a better model and still watch your multi-agent pipeline collapse in the exact same way. Understanding why requires looking at the failure taxonomy carefully.