We’re 18 months into running CQRS (Command Query Responsibility Segregation) in production for our core banking platform. Time for an honest retrospective on what actually delivered value versus what was just architectural complexity.
TL;DR: CQRS was worth it for specific services, but we backed off from full event sourcing and use it much more strategically than we originally planned.
The Context: Why We Went Event-Driven
Background: Fortune 500 financial services company, migrating from a legacy monolith to microservices. 40+ engineers, millions of transactions per day, strict regulatory requirements.
The promise of event-driven architecture and CQRS:
- Audit trail for every state change (regulatory gold)
- Independent scaling of reads and writes
- Eventual consistency for analytics without impacting transactional performance
- Temporal queries (“what was this account balance on July 15th?”)
- Event replay for debugging and testing
We started with our account transaction service as the pilot.
What Actually Worked Brilliantly
1. Audit Trail / Regulatory Compliance
This was the unexpected killer feature. Financial regulators want to see:
- Who made each change
- When it happened
- What the previous state was
- Why the change was made (command context)
With CQRS and event sourcing, our event log IS the audit trail. No separate audit table to keep in sync. No worrying about missing audit records.
During our last regulatory audit, we could replay the entire state of an account from creation to present day, showing every transaction, every balance change, every authorized user modification. The auditors loved it.
This alone justified the complexity.
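The reason replay works is that current state is just a fold over the event log, so any historical balance falls out of the same function with a timestamp cutoff. A minimal sketch of that idea; the event types and fields here are illustrative, not our production schema:

```java
import java.util.List;

// Minimal sketch: rebuild an account's state at any point in time by
// folding over its (ordered) event log. Illustrative event shape only.
class AccountReplay {
    // A simplified immutable event: type + amount + timestamp (epoch millis).
    record AccountEvent(String type, long amountCents, long timestamp) {}

    // Replay all events up to (and including) a cutoff timestamp to answer
    // temporal queries like "what was the balance on July 15th?".
    static long balanceAsOf(List<AccountEvent> events, long cutoff) {
        long balance = 0;
        for (AccountEvent e : events) {
            if (e.timestamp() > cutoff) break; // events are stored in order
            switch (e.type()) {
                case "DEPOSITED" -> balance += e.amountCents();
                case "WITHDRAWN" -> balance -= e.amountCents();
                default -> { /* non-monetary events don't affect balance */ }
            }
        }
        return balance;
    }
}
```

Replaying with a cutoff of "now" gives present state; replaying with any past timestamp gives the auditor's "state as of date X" for free.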
2. Read Model Optimization
Separating read and write models let us optimize each independently:
Write side (PostgreSQL):
- Normalized tables
- ACID transactions
- Optimized for consistency and durability
Read side (Elasticsearch):
- Denormalized documents
- Optimized for complex queries
- Fast full-text search and aggregations
- No impact on write performance
Customer support can search across millions of transactions in milliseconds without slowing down transaction processing.
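What a projection does can be sketched in a few lines, assuming hypothetical event and document shapes: the customer name is joined in at projection time, so the search index needs no joins at query time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a read-model projection: a Kafka consumer would take each
// transaction event and write a denormalized document to Elasticsearch.
// The field names and the customer lookup are illustrative assumptions.
class TransactionProjection {
    record TransactionEvent(String accountId, long amountCents, String merchant) {}

    // Denormalize: embed customer-facing fields directly in the document so
    // support searches hit one flat document instead of joined tables.
    static Map<String, Object> toDocument(TransactionEvent e, String customerName) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("accountId", e.accountId());
        doc.put("customerName", customerName); // joined at projection time, not query time
        doc.put("amountCents", e.amountCents());
        doc.put("merchant", e.merchant());
        doc.put("searchText", customerName + " " + e.merchant()); // precomputed full-text field
        return doc;
    }
}
```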
3. Analytics Without Impacting Production
Our data analytics team consumes the event stream (Kafka) to build real-time dashboards and ML models. They read from separate infrastructure that never touches production databases.
Before: Analytics queries would occasionally spike database CPU and slow down customer transactions.
After: Complete isolation. Analytics failures don’t impact customer-facing services.
What Was Harder Than Expected
1. Eventual Consistency Is Hard to Reason About
Example: Customer deposits a check. Write side processes the deposit immediately, publishes event. Read side processes event ~100ms later.
Customer refreshes their account page 50ms later: balance hasn’t updated yet.
We ended up with patterns like:
- Optimistic UI updates
- “Processing” states in the read model
- Synthetic “read-your-own-writes” consistency for critical flows
- Clear UX messaging about eventual consistency
This was a cultural shift. Developers and product managers both had to think differently.
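The read-your-own-writes pattern boils down to a version check: the write side returns the version it committed, and the read API flags the view as "processing" until the projection catches up to that version. A minimal sketch; the names and shapes are illustrative:

```java
// Minimal sketch of "read-your-own-writes" for critical flows. After a
// command, the client holds the version the write side committed; the read
// API compares it with the version the projection has applied so far.
class ReadYourWrites {
    record BalanceView(long balanceCents, long version, boolean processing) {}

    // If the read model hasn't caught up to the client's last write,
    // surface a "processing" state instead of a silently stale balance.
    static BalanceView read(long projectedBalanceCents, long projectedVersion, long clientVersion) {
        boolean stale = projectedVersion < clientVersion;
        return new BalanceView(projectedBalanceCents, projectedVersion, stale);
    }
}
```

The "processing" flag is what drives the UX messaging above: the UI can show the deposit as pending rather than showing an unchanged balance.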
2. Debugging Distributed Systems Is Complex
When something goes wrong:
- Check command processing logs
- Check event publication to Kafka
- Check Kafka consumer logs
- Check read model projection logs
- Check read model database state
A bug that would be one query in a monolith requires tracing through 5+ systems.
We invested heavily in:
- Distributed tracing (Datadog APM with correlation IDs)
- Event versioning and schema registry
- Debugging tools to replay events locally
- Clear runbooks for common failure modes
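The single most useful convention was making every event carry the correlation ID of the command that caused it, so one trace can be followed across the command logs, Kafka, and the projection logs. A sketch of that envelope idea; the shape is an illustrative assumption, not Datadog's API:

```java
import java.util.Map;
import java.util.UUID;

// Minimal sketch of correlation-ID propagation: each published event is
// wrapped in an envelope carrying the originating command's correlation ID.
class CorrelatedEvent {
    record Envelope(String correlationId, String eventType, Map<String, Object> payload) {}

    // Reuse the incoming command's correlation ID; mint a fresh one only at
    // the system edge, where no upstream ID exists yet.
    static Envelope wrap(String incomingCorrelationId, String eventType, Map<String, Object> payload) {
        String id = (incomingCorrelationId != null)
                ? incomingCorrelationId
                : UUID.randomUUID().toString();
        return new Envelope(id, eventType, payload);
    }
}
```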
3. Event Schema Evolution Is Tricky
Events are immutable, but requirements change. We had to handle:
- Adding new fields (easy with optional fields)
- Removing fields (versioning and graceful degradation)
- Changing field semantics (new event type + migration)
We now use Avro schemas with a schema registry, and we’re much more careful about event design upfront.
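As a concrete example of the easy case: adding a field stays backward compatible in Avro as long as it carries a default, so old events deserialize cleanly against the new schema. A sketch of what such a schema looks like; the record and field names are illustrative:

```json
{
  "type": "record",
  "name": "TransactionRecorded",
  "namespace": "com.example.events.v1",
  "fields": [
    {"name": "accountId", "type": "string"},
    {"name": "amountCents", "type": "long"},
    {"name": "merchant", "type": ["null", "string"], "default": null}
  ]
}
```

The `merchant` field is the newly added one: the `["null", "string"]` union plus `"default": null` is what lets readers of the new schema handle events written before the field existed.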
Where We Backed Off: Full Event Sourcing
Originally, we planned pure event sourcing - no current state storage, only events. Rebuild state by replaying all events.
In practice, replaying full event histories to rebuild state was too slow, and the operational overhead was too high for most use cases.
We moved to a hybrid:
- Write side stores current state in PostgreSQL (normalized tables)
- Events are published to Kafka for audit and read model projection
- Event log is retained for compliance and replay, but not the primary source of truth
This “event streaming” approach (rather than pure event sourcing) gave us most of the benefits with much less operational complexity.
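A minimal in-memory sketch of that hybrid write side: the state update and the event append happen in the same unit of work, and a separate relay publishes the pending events to Kafka. In production the two writes share one PostgreSQL transaction (the transactional-outbox pattern); the in-memory stand-ins and names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the hybrid approach: current state is the source of
// truth, and each mutation also appends an event for audit and projections.
class HybridWriteSide {
    record Event(String type, String accountId, long amountCents) {}

    long currentBalanceCents = 0;                 // "current state" (PostgreSQL in production)
    final List<Event> outbox = new ArrayList<>(); // events pending publication to Kafka

    // State update and event append succeed or fail together; in production
    // this is one database transaction, so the log never drifts from state.
    void deposit(String accountId, long amountCents) {
        currentBalanceCents += amountCents;
        outbox.add(new Event("DEPOSITED", accountId, amountCents));
    }
}
```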
The Infrastructure Stack
What we actually run:
Command Side:
- Spring Boot microservices
- PostgreSQL for current state
- Kafka for event publication
Event Bus:
- Kafka with multiple topics (account-events, transaction-events, etc.)
- Confluent Schema Registry for Avro schemas
- Retention: 90 days hot, archived to S3 for compliance
Query Side:
- Kafka consumers projecting events into read models
- Elasticsearch for search and complex queries
- Redis for frequently accessed data
- Separate PostgreSQL databases for some read models
Observability:
- Datadog for distributed tracing and metrics
- Custom dashboards for event lag monitoring
- Alerts for consumer lag and schema violations
When to Use CQRS (Our Criteria)
After 18 months, we’re much more selective. We use CQRS when:
- Strong audit requirements: financial transactions, compliance-critical data
- Very different read/write patterns: high read volume with complex queries
- Temporal queries needed: “show me state as of date X”
- Multiple read models from the same events: mobile app, web app, and analytics all consuming one stream
We DON’T use CQRS for:
- Simple CRUD services
- Low-traffic internal tools
- Services where eventual consistency is problematic
- Teams without strong event-driven expertise
Lessons Learned
- Start small: Pilot with one service, learn, then expand. Don’t rewrite everything at once.
- Invest in tooling: Event replay tools, debugging dashboards, schema management. These aren’t optional.
- Cultural shift matters: Train the team, document patterns, pair program across services.
- Eventual consistency requires UX consideration: Work with product/design to handle async operations gracefully.
- Event design is critical: Spend time upfront designing events. Changing them later is painful.
- Don’t cargo-cult patterns: Use CQRS where it adds value, not everywhere.
Would We Do It Again?
Yes, but more strategically.
For our account transaction service: Absolutely worth it. The audit trail, read model optimization, and analytics integration are delivering real business value.
For simpler services: No. We’ve built several services since with traditional CRUD patterns, and they’re easier to maintain.
CQRS is a powerful pattern with real costs. Use it where the benefits justify the complexity.
What’s your experience with event-driven architecture? Anyone else running CQRS in production? What surprised you most?