The Monolith Scaling Inflection Point: Rewrite, Refactor, or Extract Services?

We just crossed 100 employees and $20M ARR at our financial services platform. The monolithic Rails application that got us here is now our biggest bottleneck. Build times hit 35 minutes. Three teams constantly merge-conflict over shared code. Our database queries are slowing to a crawl. Every “quick fix” takes a week.

Sound familiar?

My engineering leadership team is debating three paths forward, and I’m curious what this community thinks:

Option 1: The Big Bang Rewrite

Some of our senior engineers want to rewrite the entire system as microservices from scratch. They argue we know the domain now, we can build it right this time, and the monolith’s technical debt is too deep to refactor.

I’m skeptical. I’ve seen too many “six-month rewrites” turn into 18-month death marches while competitors ship features. The data backs up my concern—most Big Bang rewrites fail or deliver drastically reduced scope.

Option 2: Refactor to Modular Monolith

Our architect suggests keeping a single deployable but enforcing strict module boundaries. Think bounded contexts with clear APIs, separate databases per module, but one deployment unit. Faster to ship, keeps operational complexity manageable, preserves team velocity.

The appeal: we get architectural discipline without operational overhead. The downside: we’re still deploying everything together, which means blast radius stays large.

Option 3: Strangler Fig Pattern

Extract services incrementally using the strangler pattern. Start with our payment processing (PCI compliance isolation makes sense), then gradually migrate high-load endpoints. Run monolith and services side-by-side for 12-18 months.

Lower risk, but also slower transformation. And let’s be honest—we might end up with the worst of both worlds: a distributed monolith.

The 2026 Context That Changes Everything

Here’s what’s making this decision harder: 42% of organizations are consolidating from microservices back to modular monoliths in 2026. The operational overhead of distributed systems is overwhelming teams that don’t have Google-scale problems.

The cost data is eye-opening. Industry estimates put microservices infrastructure at 3.75x to 6x the cost of an equivalent monolith; our own modeling says $40k-65k per month versus the roughly $15k we spend today. That’s real money when you’re pre-Series B.

What I’m Really Asking

The technical decision framework is clear enough. What I’m struggling with is distinguishing between:

  • Actual scaling problems we have today (slow builds, database bottlenecks, team conflicts)
  • Future problems we think we’ll have (Netflix-scale traffic, hundreds of engineers, global latency requirements)

Most startups over-architect for problems they’ll never face. But some hit exponential growth and wish they’d invested in scalability earlier.

For those who’ve been through this inflection point:

  • What signals told you it was time to move beyond the monolith?
  • Did you choose evolutionary architecture or revolutionary change?
  • What would you do differently knowing what you know now?
  • How did you balance engineering investment versus feature delivery?

At my previous role at Adobe, we had the luxury of dedicated platform teams. Here, every hour spent on architecture is an hour not shipping customer value. But technical debt compounds, and I’ve seen too many startups choke on their own success.

What would you do?


Related context: Our team is 40+ engineers across 3 squads, we’re processing ~500k transactions/day, database is 2TB and growing 30% quarterly, deploy frequency dropped from 3x/day to 1x/day due to fear and coordination overhead.

Luis, this resonates deeply. I faced almost the exact same decision when I joined my current SaaS company as CTO two years ago.

Here’s what we did: Modular monolith core + 3 strategically extracted services. Not microservices-first, not Big Bang—targeted extractions where the ROI was crystal clear.

Our Three Extracted Services

  1. Payment processing - PCI compliance isolation (like you mentioned). This was non-negotiable for regulatory reasons anyway.
  2. Email/notification service - Completely different scaling characteristics. We send 50M emails/month but only 500k API transactions/day. Scaling the whole monolith for email traffic made no sense.
  3. Analytics pipeline - Different tech stack requirements (needed Kafka + ClickHouse for event streaming). The monolith’s Rails stack wasn’t the right tool.

Everything else? Still in the modular monolith. And I do mean modular—strict bounded contexts, clear APIs between modules, even separate logical databases within the same Postgres instance.

The Cost Reality Nobody Talks About

Your $40k-65k microservices estimate is conservative. We modeled a full migration and hit $80k/month once you factor in:

  • Service mesh (Istio isn’t free to run or operate)
  • Distributed tracing infrastructure
  • Additional monitoring and alerting
  • The human cost of on-call for 15 services vs 1 monolith

Most teams don’t have Google-scale problems but pay Google-scale operational costs. This is the trap.

My Decision Framework

We only extract a service when at least two of these are true:

  1. Different scaling characteristics - Component needs to scale independently
  2. Team autonomy requirements - Clear ownership boundary that enables parallel work
  3. Regulatory isolation needed - Compliance demands separation
  4. Technology mismatch - Monolith’s stack is genuinely wrong tool for the job

Your payment processing meets #1 and #3. What about your other bottlenecks?
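The two-of-four rule above can be sketched as a simple check. This is a hedged illustration; the criterion names are mine, not from any real framework or tool:

```ruby
# Illustrative encoding of the "extract only when at least two of
# these are true" rule. Criterion names are hypothetical.
EXTRACTION_CRITERIA = %i[
  independent_scaling
  team_autonomy
  regulatory_isolation
  technology_mismatch
].freeze

# Returns true when a candidate component meets at least two criteria.
def extract_service?(met_criteria)
  unknown = met_criteria - EXTRACTION_CRITERIA
  raise ArgumentError, "unknown criteria: #{unknown}" unless unknown.empty?
  (met_criteria & EXTRACTION_CRITERIA).size >= 2
end

# Payment processing meets independent scaling and regulatory isolation:
extract_service?(%i[independent_scaling regulatory_isolation]) # => true
# Unclear ownership alone does not justify an extraction:
extract_service?(%i[team_autonomy]) # => false
```

The point of writing it down this sharply is that "it would be nice as a service" never clears the bar on its own.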

The Warning About “Distributed Monoliths”

You’re right to worry about ending up with the worst of both worlds. I’ve seen it happen. The signs:

  • Services that can’t be deployed independently (require coordinated releases)
  • Synchronous calls everywhere creating cascade failures
  • Shared database undermining service boundaries

If you go the Strangler Fig route, set hard rules: Each extracted service gets its own database. No synchronous calls in the critical path. Deploy independently or don’t extract.
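Those hard rules pair naturally with routing at the edge. A minimal sketch, assuming a Rack-style layer sitting in front of both systems; the path prefixes and upstream names are illustrative:

```ruby
# Strangler-fig routing sketch: extracted routes go to their new
# service, everything else falls through to the monolith, so you
# can migrate one route prefix at a time.
EXTRACTED_PREFIXES = {
  "/payments" => :payments_service
}.freeze

def upstream_for(path)
  EXTRACTED_PREFIXES.each do |prefix, upstream|
    return upstream if path.start_with?(prefix)
  end
  :monolith # default: not yet extracted
end

upstream_for("/payments/charges") # => :payments_service
upstream_for("/users/42")         # => :monolith
```

Because the default is the monolith, a failed or paused extraction is one deleted hash entry away from a full rollback.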

What I’d Do With Your Context

40+ engineers, $20M ARR, 500k transactions/day, 2TB database—you’re at an interesting inflection point but not crisis mode.

My recommendation: Modular monolith + payment service extraction

  1. Immediate (next 2 months): Enforce module boundaries in the monolith. Use something like Packwerk for Rails to create strict boundaries. This buys you team autonomy without operational complexity.

  2. Short-term (3-6 months): Extract payment processing as a service. You get compliance benefits, learn microservices patterns with bounded scope, and de-risk your highest-value workflow.

  3. Medium-term (6-12 months): Assess what’s still painful. Database bottleneck? Maybe it’s read replicas and better indexing, not architecture. Build times? Could be CI/CD optimization, not microservices.

The architectural answer depends on solving your actual problems, not theoretical scale. And honestly? Many “monolith problems” are actually process, tooling, or team structure problems wearing an architecture costume.

What are your database bottlenecks specifically? That might reveal whether you need architectural change or query optimization.

Oh Luis, this hits close to home. Too close.

My failed startup’s fatal mistake? We adopted microservices at 3 engineers because “that’s what Netflix does.”

Spoiler: We were not Netflix.

The Mistake That Killed 6 Months

Our tech lead (brilliant engineer, terrible judgment) convinced us that microservices would help us “scale” and “move fast.” Here’s what actually happened:

  • Infrastructure > features: We spent 60% of our time on service communication, deployment pipelines, and debugging distributed systems instead of building what customers needed
  • Debugging became a nightmare: Local development required running 12 services. Production bugs required distributed tracing we barely understood. A simple “user can’t log in” ticket took 5 hours to trace through 4 services
  • Onboarding was brutal: New engineers needed 2 weeks just to understand the service topology before they could write their first line of code
  • We shipped slower, not faster: Coordinating changes across services meant everything took longer

The Pivot That Came Too Late

Month 18, we were out of runway and consolidated everything back into a monolith. Build time dropped from 45 minutes to 8 minutes. Deployment complexity vanished. Developers were productive again.

But we’d burned 6 months and $400k in cloud costs we couldn’t afford. The company died 3 months later—not from technical failure, but from running out of time and money while our competitors shipped features we couldn’t.

“Premature distribution is the root of all evil” (paraphrasing Knuth’s famous quote about optimization)

What I Wish Someone Had Told Me

Modularity is about boundaries, not deployment units.

You can have terrible microservices (tightly coupled services that deploy together) or excellent monoliths (clear module boundaries with enforceable contracts). The packaging matters way less than the discipline.

At my current design systems role, we treat our Figma component library like a modular monolith:

  • Clear boundaries between design tokens, primitives, and composed components
  • Strict contracts for how components communicate
  • One “deployment” (shared library) but multiple autonomous teams

It works because we enforce boundaries, not because of how it’s deployed.

Questions for You, Luis

How do you know this is an architecture problem and not a process problem?

Your symptoms:

  • Slow builds → Could be CI/CD optimization, not microservices
  • Merge conflicts → Might be feature branching strategy or module boundaries
  • Database bottleneck → Might need better indexing, read replicas, caching
  • “Quick fixes” take a week → Could be testing strategy, code ownership, or coupling

Before you invest 12-18 months in architectural transformation, have you ruled out simpler fixes?

Second question: If you refactored to a modular monolith with strict boundaries (think Rails Engines with enforced dependencies), would that solve 80% of your team coordination problems for 20% of the effort?

Sometimes the right answer is not “which architecture” but “do we have the discipline to enforce boundaries regardless of architecture?”

I learned this the hard way. Don’t make my mistake—solve for your actual problem, not the problem you think you’ll have at Netflix scale.

Luis, Michelle and Maya both dropped wisdom here. I want to add the organizational dimension that often gets overlooked in architecture discussions.

This is as much a people problem as a technical problem.

Conway’s Law is Real

Your architecture will mirror your org structure whether you plan for it or not. The question is: are you designing for the org structure you have, or the one you need?

At our EdTech startup, we faced a similar inflection at ~50 engineers. Here’s what we learned:

Organizational Signals You Need Architectural Change

The technical symptoms (slow builds, database bottlenecks) are important, but watch these team indicators:

  1. Teams can’t deploy independently - If team A has to coordinate with teams B and C for every release, you have an ownership problem that might need architectural boundaries

  2. Blast radius is too large - Engineers are afraid to deploy because changes have unpredictable ripple effects across the entire system

  3. Onboarding takes >2 weeks - If new engineers can’t be productive quickly, the cognitive load of your system is too high

  4. “Who owns this?” is asked daily - Unclear ownership boundaries create decision paralysis

Does any of this sound familiar across your 3 squads?

What We Did: Modular Monolith with Team Ownership

We didn’t extract microservices. We created clear ownership boundaries within the monolith mapped to team structure:

  • Growth Squad: User onboarding, authentication, billing
  • Content Squad: Course catalog, curriculum, student progress
  • Platform Squad: Infrastructure, shared services, data platform

Each squad “owns” their modules. Clear APIs between modules. Automated tests enforce dependency rules (we use dependency graphs in CI to block unauthorized imports).
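That CI check can be sketched roughly like this. The module names and graph shape are illustrative assumptions, not our actual setup:

```ruby
# Allowed dependency graph: each squad-owned module lists the modules
# it may reference. Names are hypothetical.
ALLOWED_DEPENDENCIES = {
  "growth"   => ["platform"],
  "content"  => ["platform"],
  "platform" => []
}.freeze

# Given observed (from, to) reference pairs scraped from the codebase,
# return the ones the graph does not permit. CI fails the build when
# this list is non-empty.
def violations(observed_edges)
  observed_edges.reject do |from, to|
    from == to || ALLOWED_DEPENDENCIES.fetch(from, []).include?(to)
  end
end

edges = [["growth", "platform"], ["content", "growth"]]
violations(edges) # => [["content", "growth"]]
```

The useful property: the graph is a reviewable artifact, so adding a new dependency between squads becomes an explicit pull-request conversation instead of an accidental import.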

Result: Teams deploy independently 90% of the time. Merge conflicts dropped 70%. Onboarding down from 3 weeks to 5 days.

The Microservices Team Maturity Tax

Here’s what nobody tells you: Microservices require senior engineers who deeply understand distributed systems.

Your 40-person team—how many have production experience with:

  • Distributed tracing and debugging
  • Service mesh operations
  • Distributed transactions and eventual consistency
  • Circuit breakers and resilience patterns
  • Service discovery and health checking

If the answer is “2-3 engineers,” you’re setting yourself up for pain. Microservices don’t just distribute your code—they distribute your operational complexity across engineers who may not be ready for it.
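For concreteness, a circuit breaker, one of the patterns in that list, reduces to a few lines. This sketch deliberately omits the half-open/recovery timeout a production breaker needs:

```ruby
# Minimal circuit breaker sketch. After `failure_threshold` consecutive
# failures the circuit opens and further calls fail fast instead of
# hammering a struggling downstream service.
class CircuitBreaker
  def initialize(failure_threshold: 3)
    @failure_threshold = failure_threshold
    @failures = 0
  end

  def open?
    @failures >= @failure_threshold
  end

  def call
    raise "circuit open" if open?
    begin
      result = yield
      @failures = 0 # a success resets the consecutive-failure count
      result
    rescue
      @failures += 1
      raise
    end
  end
end

breaker = CircuitBreaker.new(failure_threshold: 2)
2.times { (breaker.call { raise "downstream timeout" }) rescue nil }
breaker.open? # => true
```

If a pattern this small feels unfamiliar to most of the team, that is useful data about readiness before taking on fifteen services’ worth of them.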

My Recommendation Based on Team Structure

Phase 1 (Now - 3 months): Organizational clarity before architectural change

  1. Map your 3 squads to clear domain boundaries
  2. Document ownership: Which team owns which parts of the codebase?
  3. Enforce module boundaries using tools (Packwerk for Rails, or equivalent boundary-enforcement tooling for other stacks)
  4. Measure: deployment frequency per squad, mean time to resolution, onboarding time

Phase 2 (3-6 months): Extract payment service with one dedicated team

Give ONE squad full ownership of the payment service extraction. Don’t split the work across teams. This accomplishes multiple goals:

  • De-risks payment processing (compliance, reliability)
  • Creates a microservices “center of excellence” that can train other teams
  • Proves (or disproves) that your team can operate distributed systems

Phase 3 (6-12 months): Assess based on data

Did deployment frequency improve? Did the team that owns payments learn and grow? Are the operational costs justified by the benefits?

Only then decide on further extractions.

The Question You’re Really Asking

“How do we scale our engineering organization while maintaining velocity?”

Sometimes the answer is architecture. But often it’s:

  • Better ownership models
  • Clearer team boundaries
  • Investment in shared platforms and tooling
  • Improved deployment and testing automation

Do you have the team maturity and organizational structure to operate microservices successfully? That’s often the real constraint, not the technical architecture.

What’s your current team structure and how does it map to your domain boundaries? That context might reveal more than the database metrics.

Coming from the product side, I have a very different lens on this question.

The best architecture is the one that doesn’t slow down customer value delivery.

At my fintech startup, we faced this exact decision during our Series B planning. Our VP Eng wanted to “do microservices right” before scaling. I pushed back hard, and here’s why:

The Business Context That Matters

Architecture decisions aren’t made in a vacuum. What’s the strategic context driving this?

If it’s competitive pressure:

  • Slower feature delivery → competitors eat your lunch
  • You need to maintain or increase velocity NOW
  • Big Bang rewrite is organizational suicide

If it’s reliability/uptime concerns:

  • Customer churn from outages
  • SLA commitments to enterprise customers
  • This justifies investment in resilience (which might include service extraction)

If it’s cost/burn rate:

  • The 3.75x-6x cost multiple you cited (and Michelle’s $80k/month modeling) is REAL
  • Every dollar spent on infrastructure is a dollar not spent on growth
  • For pre-Series B companies, this can be existential

What’s your board asking about? Revenue growth or infrastructure costs? That should heavily influence the decision.

The Framework We Used

When evaluating architectural changes, we measured against three criteria:

1. Time to Value

How long until this architectural change makes us faster?

  • Big Bang rewrite: 12-18 months of slower velocity before benefits
  • Modular monolith: 1-2 months to enforce boundaries, immediate team autonomy gains
  • Strangler Fig: 3-6 months for first extraction, benefits proportional to extraction

2. Risk Profile

What’s the blast radius if we’re wrong?

  • Big Bang rewrite: Catastrophic if delayed or fails (see: healthcare.gov, numerous startup failures)
  • Modular monolith: Low risk, easily reversible
  • Strangler Fig: Bounded risk per service, can halt if learning proves it wrong

3. Team Capacity

Can we do this while shipping features customers need?

This is the killer question. Your competitors aren’t waiting for you to finish your architecture transformation.

What We Actually Did

Strangler Fig extraction of the billing service, but with product constraints:

  1. No disruption to core product development - Platform team (2 engineers) owned the extraction, product teams kept shipping
  2. 3-month time-box - If we couldn’t extract billing service in 3 months, we’d halt and reassess
  3. Clear success metrics - Not “microservices are good” but “billing service reliability >99.9% and payment processing latency <200ms”

We hit the goals. Learned a ton. Then made a data-driven decision about whether to continue.

The Questions I’d Ask Your Team

What’s driving this discussion—actual pain or anticipated future scale?

Your metrics (500k transactions/day, 40 engineers, $20M ARR) suggest you’re past early startup chaos but not at “we need Netflix architecture” scale.

Are you solving:

  • Today’s problem (slow builds, team conflicts, database bottlenecks)
  • Or tomorrow’s problem (10M transactions/day, 200 engineers, global scale)

Because the answer changes everything.

What customer value are you NOT delivering because of architectural constraints?

If the answer is “we’re losing deals because our system can’t handle enterprise scale” → invest in architecture

If the answer is “we’re losing deals because competitors shipped feature X and we’re still refactoring” → maybe the problem isn’t architecture

My Product Leader Take

Given your context:

  1. Short-term (now): Fix the build times and deployment fear. This might be CI/CD optimization, not architecture. Shipping should feel safe, not scary.

  2. Medium-term (3-6 months): Modular monolith + payment service extraction. You get regulatory isolation, learn microservices patterns, keep shipping features.

  3. Measure obsessively: Track deployment frequency, feature delivery time, customer impact. Let data drive the next decision.

The worst outcome is spending 12 months on architectural transformation while your competitors capture the market you could’ve owned.

What’s the opportunity cost of this decision? That’s the question your board will ask, and you should have an answer.