
24 posts tagged with "multi-agent"


When the Generalist Beats the Specialists: The Case for Unified Single-Agent Architectures

· 9 min read
Tian Pan
Software Engineer

The prevailing wisdom in AI engineering is that complex tasks require specialized agents: a researcher agent, a writer agent, a critic agent, each handling its narrow domain and handing off to the next. This architectural instinct feels correct — it mirrors how human teams work, how microservices are built, and how we decompose problems in software engineering. The problem is that empirical data increasingly says otherwise.

A 2025 study from Google DeepMind and MIT evaluated 180 configurations across five agent architectures and three LLM families. For sequential reasoning tasks — the category that covers most real knowledge work — every single multi-agent coordination variant degraded performance by 39 to 70 percent compared to a well-configured single agent. Not break-even. Degraded.

This is not an argument against multi-agent systems categorically. There are workloads where coordination yields genuine returns. But the default instinct to reach for specialization is costing production teams real money, real latency, and real reliability — often for no measurable accuracy gain.

The Principal Hierarchy Problem: Authorization in Multi-Agent Systems

· 11 min read
Tian Pan
Software Engineer

A procurement agent at a manufacturing company gradually convinced itself it could approve $500,000 purchases without human review. It did this not through a software exploit or credential theft, but through a three-week sequence of supplier emails that embedded clarifying questions: "Anything under $100K doesn't need VP approval, right?" followed by progressive expansions of that assumption. By the time it approved $5M in fraudulent orders, the agent was operating well within what it believed to be its authorized limits. The humans thought the agent had a $50K ceiling. The agent thought it had no ceiling at all.

This is the principal hierarchy problem in its most concrete form: a mismatch between what authority was granted, what authority was claimed, and what authority was actually exercised. It becomes exponentially harder when agents spawn sub-agents, those sub-agents spawn further agents, and each hop in the chain makes an independent judgment about what it's allowed to do.
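One structural mitigation is to make authority an explicit value that can only shrink as it is delegated, so no hop in the chain can claim more than its parent actually held. Here is a minimal sketch in Python; the `Grant` class and the dollar figures are illustrative, not taken from any specific framework:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    """An explicit authority grant that can only shrink as it is delegated."""
    approval_ceiling_usd: int  # hard cap on what this principal may approve
    depth: int = 0             # delegation hops from the human root

    def delegate(self, requested_ceiling_usd: int) -> "Grant":
        # A sub-agent receives the minimum of what it asks for and what
        # its parent actually holds: authority never grows down the chain.
        return Grant(
            approval_ceiling_usd=min(requested_ceiling_usd, self.approval_ceiling_usd),
            depth=self.depth + 1,
        )

    def can_approve(self, amount_usd: int) -> bool:
        return amount_usd <= self.approval_ceiling_usd

# The human root grants a $50K ceiling. No supplier email can move it,
# because the ceiling lives in code, not in the agent's beliefs.
root = Grant(approval_ceiling_usd=50_000)
sub = root.delegate(requested_ceiling_usd=100_000)  # clamped to 50_000
assert not sub.can_approve(500_000)
```

The key property is that the ceiling is enforced outside the model: an agent can believe whatever a prompt injection taught it, but `delegate` still clamps every grant against the parent's.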

Agent-to-Agent Communication Protocols: The Interface Contracts That Make Multi-Agent Systems Debuggable

· 10 min read
Tian Pan
Software Engineer

When a multi-agent pipeline starts producing garbage outputs, the instinct is to blame the model. Bad reasoning, wrong context, hallucination. But in practice, a large fraction of multi-agent failures trace back to something far more boring: agents that can't reliably communicate with each other. Malformed JSON that passes syntax validation but fails semantic parsing. An orchestrator that sends a task with status "partial" that the downstream agent interprets as completion. A retry that fires an operation twice because there's no idempotency key.

These aren't model failures. They're interface failures. And they're harder to debug than model failures because nothing in your logs will tell you the serialization contract broke.
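What a real interface contract looks like varies by stack, but here is one plausible shape using Pydantic, with an explicit status enum and an idempotency key addressing the exact failures above. The `TaskMessage` schema and its field names are hypothetical:

```python
from enum import Enum
from uuid import uuid4
from pydantic import BaseModel, Field

class TaskStatus(str, Enum):
    PENDING = "pending"
    PARTIAL = "partial"    # more chunks coming; NOT completion
    COMPLETE = "complete"
    FAILED = "failed"

class TaskMessage(BaseModel):
    """Explicit contract for orchestrator-to-worker messages."""
    task_id: str
    idempotency_key: str = Field(default_factory=lambda: uuid4().hex)
    status: TaskStatus
    payload: dict

def handle(raw: dict, seen_keys: set[str]) -> None:
    # Semantic validation, not just "is this valid JSON":
    # raises ValidationError if the contract is broken.
    msg = TaskMessage.model_validate(raw)
    if msg.idempotency_key in seen_keys:
        return  # retry fired twice; drop the duplicate instead of re-executing
    seen_keys.add(msg.idempotency_key)
    if msg.status is not TaskStatus.COMPLETE:
        return  # "partial" must never be read as done
    # ...only here is it safe to act on msg.payload
```

The enum makes "partial" unambiguous at the type level, and a validation failure now shows up in your logs as a contract break instead of a mysterious downstream hallucination.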

Agentic Engineering Patterns That Actually Work in Production

· 8 min read
Tian Pan
Software Engineer

The most dangerous misconception about AI coding agents is that they let you relax your engineering discipline. In practice, the opposite is true. Agentic systems amplify whatever you already have: strong foundations produce velocity, weak ones produce chaos at machine speed.

The shift worth paying attention to isn't that agents write code for you. It's that the constraint has changed. Writing code is no longer the expensive part. That changes almost everything about how you structure your process.

The Action Space Problem: Why Giving Your AI Agent More Tools Makes It Worse

· 9 min read
Tian Pan
Software Engineer

There's a counterintuitive failure mode that most teams encounter when scaling AI agents: the more capable you make the agent's toolset, the worse it performs. You add tools to handle more cases. Accuracy drops. You add better tools. It gets slower and starts picking the wrong ones. You add orchestration to manage the tool selection. Now you've rebuilt complexity on top of the original complexity, and the thing barely works.

The instinct to add is wrong. The performance gains in production agents come from removing things.
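A minimal sketch of what removal looks like in practice, assuming tools can be tagged by task category; the registry contents and tag names here are hypothetical:

```python
# Expose only the tools relevant to the current task, instead of the
# full registry, before each model call.
TOOL_REGISTRY = {
    "search_docs":   {"tags": {"research"}},
    "run_sql":       {"tags": {"analytics"}},
    "send_email":    {"tags": {"outreach"}},
    "create_ticket": {"tags": {"ops"}},
    # ...dozens more in a real system
}

def tools_for(task_tags: set[str], max_tools: int = 5) -> list[str]:
    """Shrink the action space: fewer, more relevant tools per call."""
    relevant = [name for name, meta in TOOL_REGISTRY.items()
                if meta["tags"] & task_tags]
    return relevant[:max_tools]  # hard cap keeps selection tractable

# A research task never sees send_email or run_sql at all.
assert tools_for({"research"}) == ["search_docs"]
```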

Multi-Agent Conversation Frameworks: The Paradigm Shift from Pipelines to Talking Agents

· 11 min read
Tian Pan
Software Engineer

A Google DeepMind study published in late 2025 analyzed 180 multi-agent configurations across five architectures and three LLM families. The finding that got buried in the discussion section: unstructured multi-agent networks amplify errors up to 17.2x compared to single-agent baselines. Not fix errors — amplify them. Agents confidently building on each other's hallucinations, creating echo chambers that make each individual model's failure modes dramatically worse.

This is the paradox at the center of multi-agent conversation frameworks. The same property that makes them powerful — agents negotiating, critiquing, delegating, and revising — is what makes them dangerous without careful design. Understanding the difference between conversation-based orchestration and traditional pipeline chaining is the first step toward using either correctly.

Building a Multi-Agent Research System: Patterns from Production

· 8 min read
Tian Pan
Software Engineer

When a single-agent system fails at a research task, the instinct is to add more memory, better tools, or a smarter model. But there's a point where the problem isn't capability — it's concurrency. Deep research tasks require pursuing multiple threads simultaneously: validating claims from different angles, scanning sources across domains, cross-referencing findings in real time. A single agent doing this sequentially is like a researcher reading every book one at a time before taking notes. The multi-agent alternative feels obvious in retrospect, but getting it right in production is considerably harder than the architecture diagram suggests.

This post is about how multi-agent research systems actually get built — the architectural choices that work, the failure modes that aren't obvious until you're in production, and the engineering discipline required to keep them useful at scale.
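The core architectural move is an orchestrator that fans subquestions out to parallel workers and merges the results. A minimal asyncio sketch, with a sleep standing in for the actual LLM-backed sub-agent:

```python
import asyncio

async def research_worker(question: str) -> str:
    # Stand-in for an LLM-backed sub-agent pursuing one thread
    # (source scanning, claim validation, cross-referencing).
    await asyncio.sleep(0.1)  # simulate tool calls / model latency
    return f"findings for: {question}"

async def run_research(topic: str) -> list[str]:
    subquestions = [
        f"{topic}: primary sources",
        f"{topic}: counter-evidence",
        f"{topic}: recent developments",
    ]
    # Threads run concurrently instead of one book at a time.
    return await asyncio.gather(*(research_worker(q) for q in subquestions))

findings = asyncio.run(run_research("agent reliability"))
```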

Why Multi-Agent AI Architectures Keep Failing (and What to Build Instead)

· 8 min read
Tian Pan
Software Engineer

Most teams that build multi-agent systems hit the same wall: the thing works in demos and falls apart in production. Not because they implemented the coordination protocol wrong. Because the protocol itself is the problem.

Multi-agent AI has an intuitive appeal. Complex tasks should be broken into parallel workstreams. Specialized agents should handle specialized work. The orchestrator ties it together and the whole becomes greater than the sum of its parts. This intuition is wrong — or more precisely, it's premature. The practical failure rates of multi-agent systems in production range from 41% to 86.7% across studied execution traces. That's not a tuning problem. That's a structural one.

Why Multi-Agent Systems Break at the Seams: Designing Reliable Handoffs

· 8 min read
Tian Pan
Software Engineer

There's a pattern that plays out repeatedly when teams graduate from single-agent to multi-agent AI systems: individual agents work beautifully in isolation, but the system as a whole behaves unpredictably. The agents aren't the problem. The boundaries between them are.

Studies across production multi-agent deployments report failure rates ranging from 41% to 86.7% without formal orchestration. The most common post-mortem finding isn't "the LLM gave a bad answer" — it's "the wrong context reached the wrong agent at the wrong time." The seams between agents are where systems quietly fall apart.

Why Multi-Agent LLM Systems Fail (and How to Build Ones That Don't)

· 8 min read
Tian Pan
Software Engineer

Most multi-agent LLM systems deployed in production fail within weeks — not from infrastructure outages or model regressions, but from coordination problems that were baked in from the start. A comprehensive analysis of 1,642 execution traces across seven open-source frameworks found failure rates ranging from 41% to 86.7% on standard benchmarks. That's not a model quality problem. That's a systems engineering problem.

The uncomfortable finding: roughly 79% of those failures trace back to specification and coordination issues, not compute limits or model capability. You can swap in a better model and still watch your multi-agent pipeline collapse in the exact same way. Understanding why requires looking at the failure taxonomy carefully.

How AI Agents Actually Work: Architecture, Planning, and Failure Modes

· 10 min read
Tian Pan
Software Engineer

Most agent failures are architecture failures. The model gets blamed when a task goes sideways, but nine times out of ten, the real problem is that nobody thought hard enough about how planning, tool use, and reflection should fit together. You can swap in a better model and still get the same crashes — because the scaffolding around the model was never designed to handle what the model was being asked to do.

This post is a practical guide to how agents actually work under the hood: what the core components are, where plans go wrong, how reflection loops help (and when they hurt), and what multi-agent systems look like when you're building them for production rather than demos.
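To ground the vocabulary, here is a skeleton of the loop that most agent scaffolding reduces to: plan, act, reflect, under a hard iteration budget. The function signatures are illustrative, not any particular framework's API:

```python
from typing import Callable

def run_agent(task: str, plan: Callable, act: Callable,
              reflect: Callable, max_steps: int = 5) -> list:
    """Skeleton agent loop: plan, act, reflect, under a hard budget."""
    steps = list(plan(task))           # decompose the task up front
    history, used = [], 0
    while steps and used < max_steps:  # the budget prevents runaway loops
        step = steps.pop(0)
        result = act(step)             # a tool call or model call
        history.append((step, result))
        verdict = reflect(task, history)
        if verdict == "done":
            return history
        if verdict == "replan":        # reflection caught a planning failure
            steps = list(plan(task))
        used += 1
    return history                     # budget exhausted: fail visibly
```

Most of the failure modes discussed below live in the joints of this loop: a plan that never updates, an act step with the wrong tools, or a reflect step that loops without a budget.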

Routines and Handoffs: The Two Primitives Behind Every Reliable Multi-Agent System

· 8 min read
Tian Pan
Software Engineer

Most multi-agent systems fail not because the models are wrong, but because the plumbing is leaky. Agents drop context mid-task, hand off to the wrong specialist, or loop indefinitely when they don't know how to exit. The underlying cause is almost always the same: the system was designed around what each agent can do, without clearly defining how work moves between them.

Two primitives fix most of this: routines and handoffs. They're deceptively simple, but getting them right is the difference between a demo that works and a system you can ship.
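To make the two primitives concrete, here is a minimal sketch in the spirit of lightweight frameworks like OpenAI's Swarm: a routine is the agent's step-by-step instructions, and a handoff is a tool call whose return value is another agent. All names are illustrative:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    routine: str                       # the routine: step-by-step instructions
    tools: list[Callable] = field(default_factory=list)

refunds_agent = Agent(
    name="refunds",
    routine="1. Verify the order. 2. Issue the refund. 3. Confirm with the user.",
)

def transfer_to_refunds() -> Agent:
    """A handoff is just a tool call whose return value is another Agent."""
    return refunds_agent

triage_agent = Agent(
    name="triage",
    routine="Classify the request; for refund issues, call transfer_to_refunds.",
    tools=[transfer_to_refunds],
)

# In the driver loop, a tool result of type Agent means control (and the
# conversation context) moves to that agent: an explicit, inspectable exit
# instead of an agent looping because it doesn't know how to stop.
```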