4 Months Running Multi-Agent Coding Workflows: What Actually Works and What Does Not

I have been running multi-agent coding workflows for the past four months, starting with Cursor’s multi-file editing and recently experimenting with Claude Code running parallel tasks. I want to share what I have actually experienced versus what the headlines claim, because the gap is significant.

What 8,000 Lines Per Day Actually Means (And Does Not Mean)

First, yes, it is possible to generate enormous quantities of code with parallel agents. On my best day, I generated roughly 8,000 lines across multiple agents working on a new React + Node.js application. But here is the context that matters:

  • About 3,000 of those lines were boilerplate: type definitions, component scaffolding, API route handlers following established patterns
  • About 2,500 were test files that the agents generated to match the implementations
  • About 1,500 were actual business logic implementations
  • About 1,000 were configuration, imports, and plumbing

Of the 1,500 lines of actual business logic, approximately 400 needed significant revision after review. The agents got the general approach right but missed edge cases, made incorrect assumptions about business rules, or chose suboptimal data structures.

So the “true” output for that day was roughly 1,100 lines of production-quality business logic plus 6,500 lines of correct-but-mundane supporting code. That is genuinely impressive – it would take me 3-4 days to produce the same output manually. But it is not the 10x multiplier the marketing implies.

What Actually Works Well

Scaffolding and boilerplate: Multi-agent systems excel at generating repetitive code that follows patterns. CRUD endpoints, database models, form components, test setups – anything where the pattern is well-established and the variation is minimal.

Test generation: When you have a clear function signature and documented behavior, agents generate excellent test suites. I had an agent produce 400 lines of property-based tests for a data validation module that caught two bugs I had missed in my specification.
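
To make "property-based tests" concrete, here is a minimal hand-rolled sketch of the idea, with no framework. The validator `normalizeWhitespace`, the generator, and both properties are hypothetical stand-ins, not code from the actual module:

```typescript
// Hypothetical validation helper an agent might be asked to test.
function normalizeWhitespace(input: string): string {
  return input.trim().replace(/\s+/g, " ");
}

// Deterministic pseudo-random generator so failures are reproducible.
function makeRng(seed: number): () => number {
  let s = seed;
  return () => {
    s = (s * 1664525 + 1013904223) >>> 0;
    return s / 2 ** 32;
  };
}

// Generate short strings mixing letters and whitespace characters.
function randomString(rng: () => number): string {
  const chars = "ab \t\n  cd";
  let out = "";
  const len = Math.floor(rng() * 20);
  for (let i = 0; i < len; i++) out += chars[Math.floor(rng() * chars.length)];
  return out;
}

// Check invariants over many random inputs instead of hand-picked cases.
function checkProperties(runs = 500): boolean {
  const rng = makeRng(42);
  for (let i = 0; i < runs; i++) {
    const s = randomString(rng);
    const once = normalizeWhitespace(s);
    // Property 1: idempotence. Normalizing twice changes nothing.
    if (normalizeWhitespace(once) !== once) return false;
    // Property 2: no leading/trailing whitespace or double spaces remain.
    if (/^\s|\s$|\s{2}/.test(once)) return false;
  }
  return true;
}
```

The value is that the properties encode the specification directly, so the random inputs probe edge cases (empty strings, all-whitespace strings) that a human might forget to enumerate.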

Code migration: Moving code between frameworks or updating API versions across many files. This is tedious human work that agents handle well because the transformation rules are consistent.

Documentation: Generating JSDoc comments, README files, and API documentation from code. Agents are remarkably good at reading code and describing what it does.

What Does Not Work Well

Complex state management: Anything involving distributed state, race conditions, or complex lifecycle management. Agents consistently generate code that works in happy-path scenarios but fails under concurrent access or error conditions.
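
A minimal sketch of the happy-path concurrency bug this describes, using a hypothetical balance update (all names invented for illustration): the await between read and write lets concurrent calls interleave, producing a classic lost update.

```typescript
// Hypothetical in-memory "database" standing in for any async store.
let balance = 100;
const fakeDb = {
  read: async (): Promise<number> => balance,
  write: async (value: number): Promise<void> => { balance = value; },
};

async function withdraw(amount: number): Promise<void> {
  const current = await fakeDb.read();  // both callers can read 100 here...
  await fakeDb.write(current - amount); // ...and each writes back a stale value
}

async function demo(): Promise<number> {
  balance = 100;
  await Promise.all([withdraw(60), withdraw(60)]);
  return balance; // 40, not -20: one withdrawal was silently lost
}

demo().then((b) => console.log("final balance:", b));
```

The code is "correct" for any single call and passes sequential tests, which is exactly why this class of bug survives review.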

Cross-module integration: When a change requires understanding how multiple modules interact, agents working independently often produce code that is internally consistent but fails at integration points. This is the architectural coherence problem that others have discussed.

Implicit business rules: Requirements that are not explicitly documented – the kind of knowledge that lives in senior developers’ heads. Agents cannot infer what they have not been told, and in complex business domains, the undocumented rules are often the most important ones.

Performance-sensitive code: Agents rarely optimize for performance unless specifically instructed. The generated code is correct but may use O(n^2) algorithms where O(n log n) solutions exist, or make unnecessary database queries that are fine at small scale but fail under load.
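
The O(n^2) pattern in concrete form: both of these hypothetical functions return the same duplicate list, but the nested-loop version is the kind agents tend to emit unless told otherwise.

```typescript
// Quadratic: compares every pair, plus a linear includes() check.
function duplicatesQuadratic(items: string[]): string[] {
  const dupes: string[] = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = 0; j < i; j++) {
      if (items[i] === items[j] && !dupes.includes(items[i])) {
        dupes.push(items[i]);
      }
    }
  }
  return dupes;
}

// Linear: a Set makes each membership check O(1) on average.
function duplicatesLinear(items: string[]): string[] {
  const seen = new Set<string>();
  const dupes = new Set<string>();
  for (const item of items) {
    if (seen.has(item)) dupes.add(item);
    seen.add(item);
  }
  return [...dupes];
}
```

Both are fine at 100 items; only one is fine at a million. The same shape shows up with N+1 database queries: correct output, invisible cost until load arrives.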

My Practical Workflow

Here is the workflow I have settled on after four months of iteration:

  1. I write the specification and tests first (30-60 minutes): Clear function signatures, expected behavior, edge cases. This is the highest-leverage human work.

  2. Agents implement against my specs (10-20 minutes): I run 3-5 parallel agents, each working on a different module or component. More than 5 creates diminishing returns due to merge conflicts and context management overhead.

  3. I review and integrate (45-90 minutes): Read each agent’s output, verify against specs, resolve conflicts, fix integration issues.

  4. Agents handle follow-up work (15-30 minutes): Documentation, additional test cases, code style cleanup.
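
To show what step 1 looks like in practice, here is a sketch of the kind of spec I hand to agents: a typed signature plus table-driven cases covering the edge cases I care about. `applyDiscount` and its rules are hypothetical stand-ins, not from a real project.

```typescript
// The spec: each case pins down one behavior, including the edge cases
// (zero price, 100% discount) that agents otherwise guess at.
interface DiscountCase {
  price: number;    // cents
  percent: number;  // 0-100
  expected: number; // cents, rounded down, never negative
}

const cases: DiscountCase[] = [
  { price: 1000, percent: 10, expected: 900 },
  { price: 999, percent: 33, expected: 669 }, // 999 * 0.67 = 669.33, floored
  { price: 0, percent: 50, expected: 0 },
  { price: 1000, percent: 100, expected: 0 },
  { price: 1000, percent: 0, expected: 1000 },
];

// The agent implements against the table; this reference version passes it.
function applyDiscount(price: number, percent: number): number {
  return Math.max(0, Math.floor((price * (100 - percent)) / 100));
}

function runSpec(): boolean {
  return cases.every((c) => applyDiscount(c.price, c.percent) === c.expected);
}
```

The table is the highest-leverage artifact: it is short, reviewable, and it is the contract that step 3's review checks against.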

Total cycle: 2-3 hours for what would previously take 1-2 days. That is a 3-4x productivity multiplier – real and valuable, but not the 10x that gets quoted in articles.

The Honest Assessment

Multi-agent coding is a genuine productivity improvement for experienced developers who know what they want to build and can write clear specifications. It is not a replacement for engineering judgment, architectural thinking, or domain expertise.

The DORA finding of 9% more bugs is consistent with my experience: agents introduce subtle errors that pass tests but manifest in production. The bugs are different from human bugs – they tend to be logic errors from incorrect assumptions rather than typos or off-by-one mistakes.

The question for every developer and team is: does the productivity gain outweigh the quality cost and the investment in reviewing agent output? For me, the answer is yes for certain types of work and no for others.

What are other practitioners actually experiencing? I would love to hear from people who have been doing this for more than a month, not just weekend projects.

Alex, thank you for the honest breakdown. This is exactly the kind of practitioner report we need instead of the breathless “I built X in 20 minutes” posts.

Your finding that 400 out of 1,500 business logic lines needed significant revision is telling. That is a revision rate of roughly 27% on the code that actually matters. When you factor in the time to find and fix those issues, the net productivity gain on complex work shrinks considerably.

I want to highlight something from a leadership perspective: your workflow requires a senior developer who can write good specifications, identify edge cases, and effectively review AI output. This is not a tool that makes junior developers more productive – it is a tool that makes senior developers faster at their existing skill level.

This has implications for team composition. If multi-agent coding primarily amplifies senior developers, and the DORA data shows that weak teams get worse with AI, then the talent gap between strong and weak engineering organizations is about to widen dramatically.

At my company, we are seeing this play out. Our strongest team (2 staff engineers, 4 seniors) adopted multi-agent workflows and their throughput increased roughly 2.5x with stable quality. Our weakest team (1 senior, 5 mid-level) tried the same approach and their defect rate increased 35% in the first month. We had to walk it back.

The uncomfortable truth: multi-agent coding may accelerate the divergence between high-performing and struggling engineering organizations.

The part about “implicit business rules” resonates deeply. As someone who works at the intersection of product and engineering, I have seen this exact failure mode repeatedly.

Product requirements documents are never complete. They capture the explicit requirements but miss the implicit ones: “Oh, we also need to handle the case where the user has a legacy account from before the migration.” “The discount should not stack with the loyalty program.” “European users need different data retention policies.”

These rules live in Slack threads, in the heads of senior developers, in customer support ticket patterns. No AI agent has access to this institutional knowledge, and no specification document captures it completely.

In a multi-agent world, this becomes dangerous at scale. Twenty agents producing code based on incomplete specifications will consistently produce code that handles the documented cases but fails the undocumented ones. And these failures are the hardest to detect because they often look correct in testing but fail with real user data.

My recommendation: before deploying multi-agent systems, invest in capturing institutional knowledge in a structured format that agents can consume. Decision logs, business rule catalogs, domain glossaries. This is boring work that nobody wants to do, but it is the difference between agents that produce useful code and agents that produce confidently wrong code.
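
One way to make that institutional knowledge machine-consumable is a typed rule catalog that can be dropped into every agent's context. This is only a sketch; the schema is invented, and the example rules are the hypothetical ones from this thread:

```typescript
// A structured entry for one piece of institutional knowledge.
interface BusinessRule {
  id: string;
  domain: string;       // which modules the rule applies to
  rule: string;         // plain-language statement an agent can apply
  source: string;       // where the knowledge came from
  lastReviewed: string; // ISO date, so stale rules are visible
}

const catalog: BusinessRule[] = [
  {
    id: "BR-017",
    domain: "accounts",
    rule: "Users with legacy accounts created before the migration must go through the legacy auth path.",
    source: "Slack thread, migration retro",
    lastReviewed: "2025-01-10", // placeholder date
  },
  {
    id: "BR-022",
    domain: "pricing",
    rule: "Promotional discounts must not stack with the loyalty program.",
    source: "Customer support ticket pattern",
    lastReviewed: "2025-01-10", // placeholder date
  },
];

// Select the rules relevant to the module an agent is about to touch.
function rulesFor(domain: string): BusinessRule[] {
  return catalog.filter((r) => r.domain === domain);
}
```

The point is not the schema but the habit: every "oh, we also need to handle..." moment becomes an entry, and the relevant entries get prepended to the spec before agents run.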

Alex’s 3-4x productivity multiplier probably has room to grow if the specification quality improves. The bottleneck is not the agents – it is the completeness of the information we give them.

I appreciate Alex’s honest assessment, but I want to flag something that is not getting enough attention: the experience gap this creates for the next generation of developers.

Alex’s workflow is: write specs, let agents implement, review and fix. This works because Alex has 7 years of experience and can identify when an agent’s output is subtly wrong. He knows what good code looks like because he spent years writing it.

But what about developers in their first two years? If they adopt multi-agent workflows early, they skip the implementation phase where deep learning happens. They never develop the intuition that allows Alex to spot a suboptimal data structure choice or a missing edge case.

I have been mentoring junior developers at Confluence Design Co. and I see this pattern already with single-agent AI assistants. Juniors who rely heavily on Copilot can produce code that works but cannot explain why it works or how to fix it when it breaks. They have output without understanding.

Multi-agent systems amplify this problem dramatically. If a junior developer can orchestrate 5 agents to produce 3,000 lines of code in an afternoon, they look productive on paper. But they have not built the mental models needed to review that code effectively, debug it when it fails, or evolve it when requirements change.

The 3-4x productivity multiplier Alex describes requires a foundation of expertise that takes years to build. We need to be careful that our enthusiasm for multi-agent coding does not shortcut the learning process that creates the expert reviewers the system depends on.

Otherwise, in five years we will have lots of code and nobody who understands it.