Skip to main content

14 posts tagged with "distributed-systems"

View all tags

Async Agent Workflows: Designing for Long-Running Tasks

· 10 min read
Tian Pan
Software Engineer

Most AI agent demos run inside a single HTTP request. The user sends a message, the agent reasons for a few seconds, the response comes back. Clean, simple, comprehensible. Then someone asks the agent to do something that takes eight minutes — run a test suite, draft a report from twenty web pages, process a batch of documents — and the whole architecture silently falls apart.

The 30-second wall is real. Cloud functions time out. Load balancers kill idle connections. Mobile clients go to sleep. None of the standard agent frameworks document what to do when your task outlives the transport layer. Most of them quietly fail.

Why Multi-Agent LLM Systems Fail (and How to Build Ones That Don't)

· 8 min read
Tian Pan
Software Engineer

Most multi-agent LLM systems deployed in production fail within weeks — not from infrastructure outages or model regressions, but from coordination problems that were baked in from the start. A comprehensive analysis of 1,642 execution traces across seven open-source frameworks found failure rates ranging from 41% to 86.7% on standard benchmarks. That's not a model quality problem. That's a systems engineering problem.

The uncomfortable finding: roughly 79% of those failures trace back to specification and coordination issues, not compute limits or model capability. You can swap in a better model and still watch your multi-agent pipeline collapse in the exact same way. Understanding why requires looking at the failure taxonomy carefully.