The Second System Effect in AI: Why Your Agent v2 Rewrite Will Probably Fail

· 8 min read
Tian Pan
Software Engineer

Your agent v1 works. It's ugly, it's held together with prompt duct tape, and the code makes you wince every time you open it. But it handles 90% of cases, your users are happy, and it ships value every day. So naturally, you decide to rewrite it from scratch.

Six months later, the rewrite is still not in production. You've migrated frameworks twice, built a multi-agent orchestration layer for a problem that didn't require one, and your eval suite tests everything except the things that actually break. Meanwhile, v1 is still running — still ugly, still working.

This is the second system effect, and it has been destroying software projects since before most of us were born.

A Pattern as Old as Software Itself

In 1975, Fred Brooks described a phenomenon he'd observed at IBM: when an architect designs their second system, it becomes the most dangerous system they will ever design. The first system is "spare and clean" because the designer is cautious, still learning, and constrained by uncertainty. The second system absorbs every feature that was deferred from the first, every optimization that seemed premature, and every generalization that "would be nice to have." The result is bloated, late, and often worse than what it replaced.

Brooks was describing IBM's transition from relatively simple operating systems for the 700/7000 series to the catastrophically ambitious OS/360. The pattern has repeated across every era of computing since. And it is repeating right now, at scale, across the AI agent ecosystem.

The conditions are almost perfectly set for it. Agent v1 is always a prototype that surprised everyone by working. The team accumulated a long list of compromises they wanted to fix. A new framework promises to solve the problems they hit. And someone — usually the most senior engineer — says the magic words: "We should just rewrite it properly this time."

The Three Over-Engineering Traps

Agent rewrites don't fail because the team is incompetent. They fail because of predictable over-engineering patterns that feel like good engineering decisions in the moment.

Trap 1: Premature Multi-Agent Architecture

The most expensive mistake in the current agent ecosystem is reaching for multi-agent orchestration before you've exhausted what a single agent can do. Analysis of 47 production AI deployments found that 68% would have achieved equivalent or better outcomes with well-architected single-agent systems.

The cost difference is not subtle. For a customer service system processing 2.9 million queries per month, a multi-agent architecture cost $47,000/month compared to $22,700 for a single-agent alternative — with only a 2.1 percentage point accuracy improvement. Token amplification in multi-agent systems runs 4.6x higher, with pure coordination overhead accounting for much of the cost.

The debugging penalty is even worse. Mean time to resolution jumps from 18 minutes for single-agent failures to 67 minutes for multi-agent failures — a 3.7x increase. When a single agent fails, you debug one component. When a multi-agent workflow fails, you trace interactions across agents, coordination logic, and shared state.

Teams reach for multi-agent because it feels architecturally sophisticated. But the decision threshold is straightforward: for monthly volumes under 10,000 queries with linear workflows, single-agent systems win 90% of the time. Multi-agent economics only start to materialize above 50,000 queries per month with genuinely parallelizable workflows.
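The threshold above can be captured as a simple decision rule. This is an illustrative sketch of the heuristic, not a validated cost model; the `WorkloadProfile` type and `prefer_multi_agent` function are hypothetical names:

```python
from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    monthly_queries: int
    parallelizable: bool  # can sub-tasks genuinely run independently?

def prefer_multi_agent(profile: WorkloadProfile) -> bool:
    # Rough rule of thumb from the thresholds above: multi-agent economics
    # only start to materialize above ~50,000 queries/month with genuinely
    # parallelizable workflows. Everything else defaults to single-agent.
    return profile.monthly_queries > 50_000 and profile.parallelizable

# Low-volume linear workflow: stay single-agent.
assert not prefer_multi_agent(WorkloadProfile(8_000, parallelizable=False))
# High-volume parallel workload: multi-agent may start to pay off.
assert prefer_multi_agent(WorkloadProfile(120_000, parallelizable=True))
```

The point of writing the rule down is that it forces the conversation onto measurable workload properties rather than architectural taste.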

Trap 2: Framework Lock-In Through Migration

The agent framework landscape is a graveyard of expensive migrations. Teams regularly spend three to six months building on one framework, hit its limitations, and face a 50–80% rewrite to migrate to another. One documented case involved a three-week migration in which 60% of the codebase changed and two production bugs were introduced.

The second-system instinct makes this worse. When you rewrite, you don't just port your logic — you "fix" everything. You adopt the new framework's idioms completely. You restructure your tool definitions, your state management, your error handling. Each of these changes is individually reasonable, but collectively they mean you're testing an entirely new system while expecting the reliability of the old one.

The counterintuitive advice from practitioners who've survived multiple migrations: start with high-ceiling frameworks even if they feel like overkill. Growing into a framework is far easier than migrating out. The learning curve for LangGraph is 40–60 hours; for CrewAI, 20–30 hours. But the year-one framework overhead for an eight-engineer team doing a migration runs approximately $11,600 — and that's just the learning cost, not the bugs.

Trap 3: Eval-Driven Development Before You Understand the Task

Evals are critical for production agents. But building a comprehensive eval suite during a rewrite — before the rewrite has proven its basic architecture — is a form of premature optimization that Brooks would recognize immediately.

Teams delay shipping their rewrite because evals keep revealing failures. But the failures aren't in the new system's logic; they're in the eval suite's assumptions about what the system should do. The team ends up spending more time maintaining evaluation tooling than acting on what it tells them.

The practical approach: start with 20–50 simple test cases drawn from real production failures. At that scale, each change to an early agent system has a clear, noticeable impact on the results. You don't need hundreds of tasks to validate that your rewrite handles the cases that actually matter. Build the comprehensive eval suite after you've validated the architecture, not as a prerequisite for it.
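A small eval harness of this kind fits in a few dozen lines. The sketch below is a minimal, hypothetical example: `run_agent` is a stand-in for your actual agent call, and the cases are placeholders for real production failures:

```python
# Minimal eval harness sketch: a handful of cases drawn from real
# production failures, each checked with a plain predicate.
CASES = [
    # (user input, predicate the response must satisfy)
    ("refund order #1234", lambda r: "refund" in r.lower()),
    ("cancel my subscription", lambda r: "cancel" in r.lower()),
]

def run_agent(prompt: str) -> str:
    # Placeholder for the real agent call; returns a canned response here
    # so the harness itself is runnable.
    return f"Sure, I can help you {prompt.split()[0]} that."

def run_evals() -> tuple[int, int]:
    # Returns (passed, total). With 20-50 cases this runs in seconds,
    # so it can gate every change during the rewrite.
    passed = sum(1 for prompt, ok in CASES if ok(run_agent(prompt)))
    return passed, len(CASES)
```

Nothing here requires an eval framework; the discipline is in curating the cases from production, not in the tooling.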

Why Incremental Refactoring Beats Clean-Room Rewrites

The evidence strongly favors incremental improvement over wholesale replacement. Production AI systems that follow a phased refactoring strategy maintain business continuity while reducing risk at each step. Consistent iteration transforms technical debt from an exponential problem into a declining curve.

The math behind multi-step agent workflows makes this concrete. With 85% accuracy per step, a five-step workflow achieves 44% end-to-end success. A ten-step workflow drops to 20%. A rewrite that adds orchestration complexity — more steps, more coordination, more state management — can actually reduce reliability even if each individual component is better.
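The numbers above follow from simple compounding, assuming roughly independent steps. A one-line sketch makes the drop-off concrete:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    # Assuming roughly independent steps, end-to-end success compounds
    # multiplicatively: one weak link degrades the whole workflow.
    return per_step_accuracy ** steps

print(round(end_to_end_success(0.85, 5), 2))   # 0.44
print(round(end_to_end_success(0.85, 10), 2))  # 0.2
```

Every orchestration step a rewrite adds pushes the exponent up, which is why "more sophisticated" architectures can be less reliable in aggregate.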

Incremental refactoring works because it preserves the one thing a rewrite throws away: production-validated behavior. Your v1 agent has been shaped by thousands of real interactions. Every ugly conditional and every special-case prompt exists because a real user hit a real edge case. When you rewrite from scratch, you lose all of that institutional knowledge and have to rediscover it through production failures.

The strangler fig pattern — gradually replacing components of a running system rather than swapping it wholesale — maps naturally to agent architectures. Replace the retrieval layer. Then improve the tool definitions. Then refactor the prompt structure. Each change can be validated independently against production traffic.
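In practice the strangler fig pattern often reduces to a small routing shim. The sketch below is illustrative, with hypothetical `retrieve_v1`/`retrieve_v2` stubs standing in for the old and new retrieval layers:

```python
import random

def retrieve_v1(query: str) -> list[str]:
    # Battle-tested production path (stubbed here for illustration).
    return [f"v1 result for {query}"]

def retrieve_v2(query: str) -> list[str]:
    # New component under validation (stubbed here for illustration).
    return [f"v2 result for {query}"]

def retrieve(query: str, v2_fraction: float = 0.05) -> list[str]:
    # Route a small, adjustable slice of traffic to the new component.
    # Ramp v2_fraction up as confidence grows; set it to 0 to roll back
    # instantly, with v1 still serving everyone else the whole time.
    impl = retrieve_v2 if random.random() < v2_fraction else retrieve_v1
    return impl(query)
```

The same shim works for tool definitions or prompt variants: each replacement gets validated against production traffic before the next one starts.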

When a Rewrite Is Actually Warranted

Not every rewrite is a second-system trap. Sometimes the foundation genuinely cannot support what you need to build. The decision framework comes down to three questions.

Has the problem fundamentally changed? If your agent started as a Q&A bot and now needs to execute multi-step workflows with side effects, the original architecture may not have the right primitives. This is a legitimate reason to rebuild — but rebuild the architecture, not the features. Port your existing prompts, tools, and edge-case handling into the new structure.

Are you hitting framework-level limitations, not application-level limitations? If your framework doesn't support persistence, streaming, or the state management you need, migration may be justified. But if your problems are in your prompts, your tool definitions, or your retrieval quality, a new framework won't fix them. Teams regularly blame the framework for what are actually application-level problems.

Can you articulate what specifically will be different? "Better architecture" and "cleaner code" are not specific enough. If you can't list the concrete capabilities the rewrite enables that incremental refactoring cannot achieve, you're probably motivated by frustration rather than necessity. Frustration is a legitimate feeling but a terrible engineering specification.

The Discipline of Restraint

Brooks's prevention strategy for the second-system effect was simple: resist functional ornamentation, make resource costs visible for small features, and ensure experienced architectural leadership. Translated to agent development:

  • Resist functional ornamentation. Your rewrite should do exactly what v1 does, but better. New capabilities come after parity, not during the rewrite. Every "while we're at it" feature is a schedule risk and a debugging surface.

  • Make costs visible. Track token usage, latency, and error rates for every architectural decision. Multi-agent coordination that adds 4.8 seconds of latency per query is a measurable cost, not an abstraction.

  • Require experienced leadership. Brooks noted that the safest designers are those who have built three or more systems, not two. In agent development, this means someone who has shipped multiple agents to production — not someone who has read about agent architectures.
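The cost-visibility discipline above is easiest to enforce when every agent call reports into one place. A minimal sketch, with hypothetical `CallMetrics` and `CostLedger` names:

```python
from dataclasses import dataclass, field

@dataclass
class CallMetrics:
    # The per-call costs worth making visible: tokens, latency, errors.
    tokens_in: int
    tokens_out: int
    latency_s: float
    error: bool = False

@dataclass
class CostLedger:
    calls: list[CallMetrics] = field(default_factory=list)

    def record(self, m: CallMetrics) -> None:
        self.calls.append(m)

    def summary(self) -> dict[str, float]:
        # Aggregates that turn an architectural debate into numbers.
        n = len(self.calls)
        return {
            "calls": n,
            "tokens": sum(c.tokens_in + c.tokens_out for c in self.calls),
            "avg_latency_s": sum(c.latency_s for c in self.calls) / n,
            "error_rate": sum(c.error for c in self.calls) / n,
        }
```

Comparing two such summaries, one per architecture, is how a claim like "coordination adds 4.8 seconds per query" becomes checkable rather than anecdotal.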

The hardest part of engineering is knowing when not to build. Your v1 agent is ugly because it works in the real world, and the real world is ugly. A rewrite that makes the code beautiful but loses the battle scars is not an improvement. It's a regression wearing a clean-room suit.

Before you start that rewrite, ask yourself: am I solving a real architectural limitation, or am I just uncomfortable with code that looks like it was written under pressure? Because it was written under pressure. That's what production code looks like. And the next version will look the same way, six months later, after reality has had its say.
