18 months of CQRS in production at a fintech - what actually worked vs what was hype

We’re 18 months into running CQRS (Command Query Responsibility Segregation) in production for our core banking platform. Time for an honest retrospective on what actually delivered value vs what was just architectural complexity.

TL;DR: CQRS was worth it for specific services, but we backed off from full event sourcing and use it much more strategically than we originally planned.

The Context: Why We Went Event-Driven

Background: Fortune 500 financial services company, migrating from a legacy monolith to microservices. 40+ engineers, millions of transactions per day, strict regulatory requirements.

The promise of event-driven architecture and CQRS:

  • Audit trail for every state change (regulatory gold)
  • Independent scaling of reads and writes
  • Eventual consistency for analytics without impacting transactional performance
  • Temporal queries (“what was this account balance on July 15th?”)
  • Event replay for debugging and testing

We started with our account transaction service as the pilot.

What Actually Worked Brilliantly

1. Audit Trail / Regulatory Compliance

This was the unexpected killer feature. Financial regulators want to see:

  • Who made each change
  • When it happened
  • What the previous state was
  • Why the change was made (command context)

With CQRS and event sourcing, our event log IS the audit trail. No separate audit table to keep in sync. No worrying about missing audit records.

During our last regulatory audit, we could replay the entire state of an account from creation to present day, showing every transaction, every balance change, every authorized user modification. The auditors loved it.

This alone justified the complexity.
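The temporal-query part is easy to demonstrate: at its core, it’s a fold over the event log up to a cutoff timestamp. A minimal sketch in Python (our services are Spring Boot, but the mechanics are language-agnostic; the event shape here is simplified and hypothetical, not our production schema):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class AccountEvent:
    occurred_at: datetime   # when the change happened
    amount_cents: int       # positive = credit, negative = debit
    actor: str              # who made the change

def balance_as_of(log, as_of):
    """Reconstruct the balance at a point in time by folding every
    event that happened at or before it."""
    return sum(e.amount_cents for e in log if e.occurred_at <= as_of)

log = [
    AccountEvent(datetime(2024, 7, 1), 10_000, "alice"),
    AccountEvent(datetime(2024, 7, 10), -2_500, "bob"),
    AccountEvent(datetime(2024, 7, 20), 5_000, "alice"),
]
print(balance_as_of(log, datetime(2024, 7, 15)))  # → 7500
```

Run with no cutoff, the same fold yields the present-day balance, which is why one log can serve both the auditors and the application.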

2. Read Model Optimization

Separating read and write models let us optimize each independently:

Write side (PostgreSQL):

  • Normalized tables
  • ACID transactions
  • Optimized for consistency and durability

Read side (Elasticsearch):

  • Denormalized documents
  • Optimized for complex queries
  • Fast full-text search and aggregations
  • No impact on write performance

Customer support can search across millions of transactions in milliseconds without slowing down transaction processing.

3. Analytics Without Impacting Production

Our data analytics team consumes the event stream (Kafka) to build real-time dashboards and ML models. They’re reading from a separate infrastructure that doesn’t touch production databases.

Before: Analytics queries would occasionally spike database CPU and slow down customer transactions.
After: Complete isolation. Analytics failures don’t impact customer-facing services.

What Was Harder Than Expected

1. Eventual Consistency Is Hard to Reason About

Example: a customer deposits a check. The write side processes the deposit immediately and publishes an event; the read side processes that event ~100ms later.

If the customer refreshes their account page 50ms after the deposit, the balance hasn’t updated yet.

We ended up with patterns like:

  • Optimistic UI updates
  • “Processing” states in the read model
  • Synthetic “read-your-own-writes” consistency for critical flows
  • Clear UX messaging about eventual consistency
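The “read-your-own-writes” item deserves a sketch. The shape is: each write bumps a version, and a read that follows the caller’s own write is served from the projection only if it has caught up; otherwise it falls back to the write model. A simplified Python illustration (dicts standing in for PostgreSQL and Elasticsearch; not our production code):

```python
class ReadYourWritesFacade:
    """Serve a read from the projection only if it has caught up to the
    version the caller just wrote; otherwise fall back to the write model."""

    def __init__(self, write_model, read_model):
        self.write_model = write_model   # authoritative, strongly consistent
        self.read_model = read_model     # projection, eventually consistent

    def get_account(self, account_id, min_version=0):
        doc = self.read_model.get(account_id)
        if doc is not None and doc["version"] >= min_version:
            return doc
        return self.write_model.get(account_id)

# Simulate a projection lagging one event behind the write model.
write_model = {"acct-1": {"balance": 80_00, "version": 5}}
read_model  = {"acct-1": {"balance": 75_00, "version": 4}}

facade = ReadYourWritesFacade(write_model, read_model)
# The caller just produced version 5, so the stale projection is bypassed:
print(facade.get_account("acct-1", min_version=5)["balance"])  # → 8000
```

The cost is that critical read paths need the write model reachable, which is exactly why we reserve this for the flows where staleness is unacceptable.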

This was a cultural shift. Developers and product managers both had to think differently.

2. Debugging Distributed Systems Is Complex

When something goes wrong:

  1. Check command processing logs
  2. Check event publication to Kafka
  3. Check Kafka consumer logs
  4. Check read model projection logs
  5. Check read model database state

A bug that would be one query in a monolith requires tracing through 5+ systems.

We invested heavily in:

  • Distributed tracing (Datadog APM with correlation IDs)
  • Event versioning and schema registry
  • Debugging tools to replay events locally
  • Clear runbooks for common failure modes
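The correlation-ID piece pays for itself fastest. The idea: stamp the incoming command once and copy the ID onto every event it produces, so a single identifier traces a flow through all five steps above. A stripped-down sketch (event names hypothetical; in practice the ID rides in Kafka record headers and the Datadog trace context):

```python
import uuid

def handle_command(command, publish):
    """Stamp the command with a correlation id (reusing one if the caller
    already set it) and copy it onto every event produced, so one id can be
    traced through command logs, Kafka, consumers, and projections."""
    cid = command.get("correlation_id") or str(uuid.uuid4())
    event = {
        "type": "DepositRecorded",            # hypothetical event name
        "amount_cents": command["amount_cents"],
        "correlation_id": cid,                # a Kafka record header in practice
    }
    publish(event)
    return cid

published = []
cid = handle_command({"type": "RecordDeposit", "amount_cents": 500}, published.append)
print(published[0]["correlation_id"] == cid)  # → True
```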

3. Event Schema Evolution Is Tricky

Events are immutable, but requirements change. We had to handle:

  • Adding new fields (easy with optional fields)
  • Removing fields (versioning and graceful degradation)
  • Changing field semantics (new event type + migration)

We now use Avro schemas with a schema registry, and we’re much more careful about event design upfront.
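Beyond the Avro/schema-registry machinery, the consumer-side half of “adding new fields” is an upcaster: normalize older event versions to the current shape before projecting them. A toy Python version (field names and versions hypothetical):

```python
def upcast(event: dict) -> dict:
    """Bring an older event version up to the current shape (v2 here).

    v1 events predate the optional 'channel' field, so we fill the
    documented default instead of breaking every consumer."""
    if event.get("schema_version", 1) == 1:
        event = {**event, "channel": "unknown", "schema_version": 2}
    return event

old = {"type": "DepositRecorded", "amount_cents": 500, "schema_version": 1}
print(upcast(old)["channel"])  # → unknown
```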

Where We Backed Off: Full Event Sourcing

Originally, we planned pure event sourcing: no current-state storage, only events, with state rebuilt by replaying the full log.

In practice: Too slow and complex for most use cases.

We moved to a hybrid:

  • Write side stores current state in PostgreSQL (normalized tables)
  • Events are published to Kafka for audit and read model projection
  • Event log is retained for compliance and replay, but not the primary source of truth

This “event streaming” approach (rather than pure event sourcing) gave us most of the benefits with much less operational complexity.
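One detail worth spelling out: the state update and the event publication have to agree. A common shape for this is a transactional outbox — write the state change and the event in one database transaction, and let a relay publish outbox rows to Kafka. A minimal sketch with in-memory SQLite standing in for PostgreSQL (table and field names illustrative, not our actual schema):

```python
import json
import sqlite3

# In-memory SQLite standing in for the write-side PostgreSQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance_cents INTEGER)")
db.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
db.execute("INSERT INTO accounts VALUES ('acct-1', 0)")

def record_deposit(account_id, amount_cents):
    """Update current state and enqueue the event in ONE transaction,
    so neither can exist without the other."""
    with db:  # commits both statements atomically; rolls both back on error
        db.execute(
            "UPDATE accounts SET balance_cents = balance_cents + ? WHERE id = ?",
            (amount_cents, account_id))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "DepositRecorded",
                         "account_id": account_id,
                         "amount_cents": amount_cents}),))

record_deposit("acct-1", 2_500)
# A separate relay process polls the outbox and publishes each row to Kafka.
pending = [json.loads(p) for (p,) in db.execute("SELECT payload FROM outbox ORDER BY seq")]
print(pending[0]["amount_cents"])  # → 2500
```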

The Infrastructure Stack

What we actually run:

Command Side:

  • Spring Boot microservices
  • PostgreSQL for current state
  • Kafka for event publication

Event Bus:

  • Kafka with multiple topics (account-events, transaction-events, etc.)
  • Confluent Schema Registry for Avro schemas
  • Retention: 90 days hot, archived to S3 for compliance

Query Side:

  • Kafka consumers projecting events into read models
  • Elasticsearch for search and complex queries
  • Redis for frequently accessed data
  • Separate PostgreSQL databases for some read models
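A projection consumer is conceptually just a fold from events into a denormalized document, made idempotent so Kafka’s at-least-once delivery and event replay don’t double-count. A minimal Python sketch (document shape hypothetical; the real consumers write to Elasticsearch):

```python
def project(doc, event):
    """Fold one account event into a denormalized read-model document.

    Idempotent: events at or below the last applied sequence number are
    skipped, so at-least-once delivery and replays don't double-count."""
    if event["seq"] <= doc["last_seq"]:
        return doc
    return {
        **doc,
        "balance_cents": doc["balance_cents"] + event["amount_cents"],
        "transactions": doc["transactions"] + [event["seq"]],
        "last_seq": event["seq"],
    }

doc = {"account_id": "acct-1", "balance_cents": 0, "transactions": [], "last_seq": 0}
events = [{"seq": 1, "amount_cents": 10_000}, {"seq": 2, "amount_cents": -2_500}]
for e in events + [events[1]]:          # note the duplicate delivery of seq 2
    doc = project(doc, e)
print(doc["balance_cents"])  # → 7500, duplicate ignored
```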

Observability:

  • Datadog for distributed tracing and metrics
  • Custom dashboards for event lag monitoring
  • Alerts for consumer lag and schema violations

When to Use CQRS (Our Criteria)

After 18 months, we’re much more selective. We use CQRS when:

✅ Strong audit requirements: Financial transactions, compliance-critical data
✅ Read/write patterns very different: High read volume with complex queries
✅ Temporal queries needed: “Show me state as of date X”
✅ Multiple read models from same events: Mobile app, web app, analytics all consuming same events

We DON’T use CQRS for:

❌ Simple CRUD services
❌ Low-traffic internal tools
❌ Services where eventual consistency is problematic
❌ Teams without strong event-driven expertise

Lessons Learned

  1. Start small: Pilot with one service, learn, then expand. Don’t rewrite everything at once.

  2. Invest in tooling: Event replay tools, debugging dashboards, schema management. These aren’t optional.

  3. Cultural shift matters: Train the team, document patterns, pair program across services.

  4. Eventual consistency requires UX consideration: Work with product/design to handle async operations gracefully.

  5. Event design is critical: Spend time upfront designing events. Changing them later is painful.

  6. Don’t cargo-cult patterns: Use CQRS where it adds value, not everywhere.

Would We Do It Again?

Yes, but more strategically.

For our account transaction service: Absolutely worth it. The audit trail, read model optimization, and analytics integration are delivering real business value.

For simpler services: No. We’ve built several services since with traditional CRUD patterns, and they’re easier to maintain.

CQRS is a powerful pattern with real costs. Use it where the benefits justify the complexity.

What’s your experience with event-driven architecture? Anyone else running CQRS in production? What surprised you most?

Luis, this is exactly the kind of real-world experience I was hoping to hear. The audit trail benefit is fascinating - that’s an angle I hadn’t fully considered.

Event Streams for ML Features

You mentioned your data analytics team consuming the event stream for ML models. This is something we’re actively exploring at Anthropic.

The appeal: Real-time feature engineering from event streams. Instead of batch jobs pulling from databases, we process events as they flow through Kafka to generate ML features on the fly.

Questions about your setup:

  1. Schema evolution for ML: When you change event schemas, how do you handle ML models trained on older schema versions? Do you retrain? Version your features?

  2. Event replay for training data: Can you replay historical events to generate training datasets? Or do you store processed features separately?

  3. Data quality in event logs: How do you handle malformed events, duplicate events, out-of-order events? These can wreck ML model quality if not caught.

Testing Event-Driven Systems

You mentioned debugging complexity. How about testing? I’m curious about:

  • Integration testing: How do you test event flows end-to-end without full production infrastructure?
  • Event replay testing: Can you replay production events in a test environment to reproduce bugs?
  • Schema validation testing: How do you catch schema compatibility issues before production?

Our experimentation platform tried using events for experiment assignment and it was a nightmare to test properly. We ended up with a lot of flaky tests due to timing issues and eventual consistency.

The Eventual Consistency Challenge

Your check deposit example resonates. In ML systems, eventual consistency creates interesting problems:

  • Feature staleness: ML model gets stale features because read model hasn’t updated
  • Training/serving skew: Training data uses eventually consistent views, but serving needs real-time consistency
  • A/B test assignment: User gets assigned to experiment, but read model doesn’t reflect assignment yet

How do you reason about these trade-offs? Are there patterns beyond “optimistic UI updates” that work well?

Kafka at Scale

Quick operational questions about Kafka:

  • What’s your Kafka cluster size and typical event throughput?
  • How do you handle consumer lag monitoring and alerting?
  • What’s your strategy for disaster recovery and event replay after major incidents?

We’re evaluating Kafka vs alternatives (Pulsar, Kinesis) for our event bus, and real-world operational experience is invaluable.

The “use it strategically” lesson is huge. Not every service needs CQRS. Not every data problem needs ML. Match the pattern to the problem.

This is incredibly helpful, Luis. I’m working on a much smaller scale (12-person team), but we’re considering event-driven architecture for parts of our platform.

Local Development Setup Questions

This is my biggest concern: How do you develop and debug locally?

With a traditional monolith or simple microservices:

  • Run the app locally
  • Use a local database
  • Debug with breakpoints
  • See immediate results

With CQRS and event sourcing:

  • Need Kafka running locally (Docker?)
  • Multiple databases (write model, read model)
  • Event consumers running
  • Distributed tracing infrastructure?

Do your engineers run the full stack locally? Or do you have shared dev environments? What about CI/CD - how long do your pipelines take with all this infrastructure?

Observability and Developer Tooling

You mentioned investing in debugging tools. Can you be specific about what that looks like?

  • Custom admin dashboards to view event streams?
  • Tools to replay events in local dev?
  • Event search/filtering tools?
  • Correlation ID propagation through the stack?

We use Datadog too, but I’m curious what custom tooling you built.

GraphQL Subscriptions vs Event Streams

Here’s something I’m wrestling with: When do you use GraphQL subscriptions vs consuming event streams directly?

For our real-time features (notifications, live updates), we could either:

  1. Expose GraphQL subscriptions that internally consume events
  2. Have clients consume events directly via WebSocket
  3. Use Server-Sent Events from a REST endpoint

What’s your pattern for pushing real-time updates to clients?

The Complexity Question

Honest question: With 40+ engineers, CQRS makes sense. But at what team size does this become viable?

We’re 12 engineers total. If we went down this path:

  • Junior engineers would struggle with the complexity
  • Onboarding time would increase significantly
  • We’d need dedicated platform/infra engineers
  • Feature velocity might actually decrease

Your “don’t use CQRS for simple CRUD” advice resonates. But how do you prevent the pattern from spreading to services where it doesn’t belong? Does it become an “architecture standard” that gets cargo-culted?

Hybrid Approach Appreciation

I really appreciate the honesty about backing off from full event sourcing. The hybrid approach (current state in Postgres + events in Kafka) sounds much more pragmatic.

This actually sounds similar to database change data capture (CDC) patterns like Debezium. Did you consider CDC instead of explicitly publishing events? Or is the explicit event publishing important for your domain model?

Question: Would you recommend CQRS for a team our size, or should we wait until we hit specific scaling problems?

Luis, thank you for the transparency about what worked vs what didn’t. As someone scaling an EdTech platform, this is exactly the conversation I need to have with my team.

The Cultural Shift Is Underestimated

Your point about developers AND product managers needing to think differently - this is huge and often overlooked.

Organizational Change Management: When you shift to eventual consistency and event-driven thinking, it’s not just a technical change:

  • Product managers need to understand async operations and UX implications
  • Designers need to design for loading states and optimistic updates
  • QA engineers need new testing strategies for distributed systems
  • Customer support needs to understand why data might not be immediately consistent
  • Sales/legal might need to explain eventual consistency to enterprise customers

How did you handle this cultural transition? Was it:

  • Top-down mandate from leadership?
  • Gradual adoption with pilot teams?
  • Training programs and documentation?
  • Pair programming across teams?

We’re considering event-driven architecture for student learning interactions (millions of events per day), and the organizational change feels as challenging as the technical implementation.

The Junior Engineer Challenge

Alex mentioned this, and it’s a real concern for me. We’re scaling from 25 to 80+ engineers over the next year. Many will be junior or mid-level engineers.

Questions:

  1. How do junior engineers ramp up on CQRS patterns? What’s the learning curve?
  2. Do you have service templates or frameworks that abstract the complexity?
  3. Who owns the event-driven infrastructure - a platform team? Or do all teams need to understand it?

I worry about creating a two-tier system: senior engineers who understand events, and junior engineers who are blocked on distributed debugging.

Incident Response Complexity

Your 5-step debugging process when something goes wrong - this keeps me up at night.

In a monolith or simple services:

  • One log file to check
  • One database to query
  • Clear stack traces
  • Junior engineers can debug effectively

With distributed events:

  • Multiple systems to investigate
  • Correlation IDs to trace across services
  • Timing issues and race conditions
  • Need senior engineers for most incidents

How does this affect your on-call rotation? Do only senior engineers take on-call? Or do you have tiered escalation?

For EdTech, we have strict uptime requirements (schools depend on us daily). Incident response time is critical. Does CQRS slow down incident resolution?

The Student Actions Use Case

You mentioned this makes sense for EdTech, and I agree conceptually:

Student learning events:

  • Video watched (progress, duration, engagement)
  • Problem attempted (answer, time taken, hints used)
  • Quiz completed (score, time, question breakdown)
  • Discussion post created (content, interactions)

These are natural events. But here’s my concern: Does eventual consistency affect the learning experience?

Example: Student completes a lesson. Course progress shows 70% instead of 80% because read model hasn’t updated. Student sees outdated progress and gets confused.

Or: Student submits answer. Needs immediate feedback. Can’t wait 100ms for eventual consistency.

How do you handle use cases where users expect immediate feedback?

Staffing and Cost Implications

You mentioned investing in:

  • Distributed tracing infrastructure
  • Event replay tools
  • Schema management
  • Custom debugging dashboards
  • Clear runbooks

What does this cost in terms of engineering time and headcount?

If we go down this path, I need to justify to our CEO:

  • X additional platform engineers
  • Y months of reduced feature velocity during migration
  • Z% increase in infrastructure costs

What’s the realistic total cost of ownership?

Appreciating the Strategic Approach

Your decision criteria for when to use CQRS are exactly what I needed. Not a binary “use it everywhere” or “never use it,” but a thoughtful framework.

For EdTech:

  • ✅ Student learning events (high volume, analytics-heavy, audit trail for pedagogy research)
  • ❌ Course catalog (simple CRUD, low update frequency)
  • ❌ User profile management (consistency critical, simple access patterns)

Follow-up question: How do you prevent architectural inconsistency across services? Do you have standards that say “all high-volume services must use CQRS” or is it case-by-case?

Coming from the design side, the eventual consistency challenges you describe are exactly the kind of thing that can make or break user experience.

UX Patterns for Eventual Consistency

I love that you explicitly called out working with product/design on async operations. Too often, engineering makes architectural decisions and throws them over the wall to design to “figure out the UX.”

From my experience, here are patterns that work:

1. Optimistic Updates with Clear Rollback

  • Show the change immediately in the UI
  • Display a subtle “saving…” indicator
  • If the event fails, clearly communicate the error and revert the UI
  • Examples: Gmail’s “undo send”, Google Docs auto-save

2. Progressive Disclosure of Loading States

  • Don’t show a spinner for < 300ms (feels snappy)
  • Show skeleton screens for 300ms - 2s (feels responsive)
  • Show explicit “processing” messages for > 2s (sets expectations)

3. Status Indicators for Async Operations

  • “Processing” → “Completed” state transitions
  • Clear visual feedback when operations succeed or fail
  • Examples: Payment processing, file uploads, background jobs

But here’s what worries me about CQRS for product teams:

When Technical Constraints Become UX Debt

In my failed startup, we made architectural decisions that created permanent UX constraints:

We chose eventual consistency for our content publishing system. This meant:

  • Authors would publish a post, but it wouldn’t appear immediately
  • We had to show “Your post is being published…” messages
  • If you refreshed too quickly, your post wasn’t there yet
  • Customer support got confused: “I published it, where is it?”

The problem: This wasn’t a temporary loading state. It was a permanent characteristic of the system. No amount of good UX design could make it feel instant.

Eventually, we had to rebuild parts of the system with stronger consistency guarantees because the UX was hurting adoption.

Questions About User-Facing Impact

Luis, I’m curious:

1. Customer-facing consistency: For user-facing features (account balance display, transaction history), how do you handle the eventual consistency UX? Do customers notice? Complain?

2. Error states: When event processing fails, how does the user know? Do you show technical errors or user-friendly messages?

3. Design system implications: Do you have reusable components/patterns for async operations? Or is it custom per feature?

Debugging From a Design Perspective

When something goes wrong in production and users report issues, the debugging process you described sounds complex. From a customer support / UX perspective:

  • How do CS reps understand whether an issue is “eventual consistency delay” vs “actual bug”?
  • Can CS see the event processing status for a specific user action?
  • What tools exist for non-engineers to debug user-reported issues?

My startup didn’t build these tools, and our CS team was constantly frustrated.

Accessibility and Performance Considerations

Two angles I don’t see discussed enough:

Accessibility: Screen readers and assistive technology don’t handle dynamic, eventually-consistent UIs well. How do you ensure:

  • Loading states are announced to screen readers
  • Focus management when content updates asynchronously
  • Clear status updates for vision-impaired users

Performance on slow connections: 100ms event processing delay is one thing on fast WiFi. What about:

  • Mobile users on 3G
  • Rural schools with poor internet (EdTech context!)
  • International users with high latency

Does eventual consistency make the experience worse for these users?

The Design Lesson

My takeaway from your post: Architecture decisions are product decisions.

The choice to use CQRS isn’t just about technical elegance or performance. It fundamentally shapes:

  • What UX patterns are possible
  • How users perceive system responsiveness
  • Whether the product feels polished or janky
  • Support burden and customer satisfaction

Suggestion for engineering teams: Involve designers early when considering CQRS. Don’t let technical decisions create UX debt that designers have to clean up later.

Great post, Luis. Would love to see more honest retrospectives like this!