From COBOL to Java: Our 18-Month Journey with AI-Assisted Modernization

We completed our COBOL-to-Java modernization project six months ago. Here’s the honest retrospective on what worked, what didn’t, and what we’d do differently.

The Starting Point

  • 2.3 million lines of COBOL across 2,400 modules
  • Core banking system processing $4.2 billion in daily transactions
  • Average age of codebase: 28 years
  • Documentation: essentially none
  • Number of engineers who understood the full system: 3 (average age: 57)

The Tools We Evaluated

We did a formal evaluation of five approaches:

  1. IBM watsonx Code Assistant for Z - Best for large-scale COBOL, strong mainframe integration
  2. Custom GPT-4 pipeline - Most flexible, required significant prompt engineering
  3. Microsoft Semantic Kernel approach - Best modular architecture, multiple specialized agents
  4. Traditional rule-based transpiler - Most predictable, least intelligent
  5. Manual translation - Baseline for comparison

Our choice: Hybrid of IBM watsonx for batch processing modules and custom GPT-4 for interactive components. Different tools excel at different patterns.

Phase 1: Discovery and Documentation (Months 1-4)

What AI did well:

  • Generated documentation for all 2,400 modules in 6 weeks (would have taken 8+ months manually)
  • Identified 47 instances of duplicated business logic we didn’t know existed
  • Created dependency maps showing which modules were tightly coupled
  • Extracted data flow diagrams from code

What required human judgment:

  • Validating that AI-generated documentation was accurate (about 85% was correct)
  • Understanding WHY code was written certain ways (regulatory requirements, historical bugs)
  • Identifying which ‘bugs’ were actually ‘features’ relied upon by downstream systems

Phase 2: Test Generation (Months 3-6)

The game-changer. AI-generated test suites gave us the confidence to modernize.

  • Test coverage went from 12% to 67%
  • AI generated 15,000+ test cases across the codebase
  • Human review identified 2,300 additional edge cases the AI missed
  • This test suite became our safety net for the entire project
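To make the "safety net" concrete, here is a minimal sketch of the kind of characterization (golden-master) test the AI generated: run the translated Java routine against outputs captured from the legacy COBOL system and fail on any mismatch. The routine, figures, and fixture values are hypothetical, not from Luis's project.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Map;

// Illustrative golden-master test: translated Java output must match
// output captured from the legacy COBOL system for the same inputs.
public class InterestCalcCharacterizationTest {

    // Stand-in for a translated module. COBOL used COMP-3 packed
    // decimals; BigDecimal preserves that fixed-point behavior,
    // including the legacy rounding mode.
    static BigDecimal dailyInterest(BigDecimal balance, BigDecimal annualRate) {
        return balance.multiply(annualRate)
                      .divide(new BigDecimal("365"), 2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // Golden values captured from the legacy system (hypothetical).
        Map<String, BigDecimal> golden = Map.of(
            "1000.00",   new BigDecimal("0.14"),
            "250000.00", new BigDecimal("34.25"));
        for (var e : golden.entrySet()) {
            BigDecimal actual = dailyInterest(new BigDecimal(e.getKey()),
                                              new BigDecimal("0.05"));
            if (actual.compareTo(e.getValue()) != 0) {
                throw new AssertionError(e.getKey() + ": expected "
                        + e.getValue() + ", got " + actual);
            }
        }
        System.out.println("All golden cases match");
    }
}
```

Tests of this shape are cheap to generate in bulk, which is why coverage could jump from 12% to 67% so quickly; the human-added edge cases target the inputs the capture run never exercised.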

Phase 3: Translation (Months 5-12)

Batch processing modules (1,800 modules): IBM watsonx handled these with 91% success rate on first pass. Remaining 9% required human intervention for complex state management and REDEFINES constructs.

Interactive modules (600 modules): Custom GPT-4 pipeline with extensive prompt engineering. 78% success rate on first pass - lower than batch, but these were more complex.

Human effort distribution:

  • 30% on the 9% of batch modules that failed AI translation
  • 40% on interactive modules (higher complexity)
  • 30% on integration, performance optimization, and security review

Phase 4: Integration and Hardening (Months 10-18)

This is where the ‘magic button’ expectation really breaks down.

Integration challenges:

  • 67 modules had implicit dependencies on mainframe infrastructure (CICS, DB2 stored procedures)
  • Transaction boundaries needed complete redesign for cloud-native deployment
  • Performance optimization required for 40% of translated modules

Security hardening:

  • Added input validation to all API endpoints
  • Implemented proper session management (didn’t exist in batch world)
  • Replaced hardcoded credentials with secrets management
  • Added encryption for data at rest and in transit
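As one concrete instance of the hardcoded-credentials item above, here is a minimal sketch: the password lookup is injected so it can be tested, and in production it would point at the environment or a secrets manager (Vault, AWS Secrets Manager, etc.). The variable name `DB_PASSWORD` and the class are illustrative assumptions.

```java
import java.util.Map;
import java.util.function.Function;

// Sketch: credentials come from an injected lookup instead of a
// string literal in source. Failing loudly replaces the old COBOL
// habit of compiled-in defaults.
public class DbCredentials {

    static String dbPassword(Function<String, String> lookup) {
        String value = lookup.apply("DB_PASSWORD");
        if (value == null || value.isBlank()) {
            throw new IllegalStateException(
                "DB_PASSWORD not set; refusing to fall back to a default");
        }
        return value;
    }

    public static void main(String[] args) {
        // In production: dbPassword(System::getenv)
        System.out.println(dbPassword(Map.of("DB_PASSWORD", "s3cret")::get));
    }
}
```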

The Numbers

Metric                   | Planned     | Actual
Duration                 | 12 months   | 18 months
Team size                | 8 engineers | 12 engineers (peak)
AI translation success   | 95%         | 87%
Post-migration defects   | <10         | 23 (all caught in staging)
Performance vs legacy    | Equivalent  | 15% faster
Infrastructure cost      | -40%        | -52%

Key Lessons

  1. AI is a force multiplier, not a replacement. Our team delivered what would have been a 36+ month project in 18 months. That’s the realistic expectation.

  2. Invest in test generation first. The AI-generated test suite was worth the entire AI tooling investment on its own.

  3. Budget 40% of effort for ‘the last 10%.’ The modules AI couldn’t handle cleanly consumed disproportionate time.

  4. Security is not automated. For every hour AI saved on translation, we spent an hour on security review.

  5. Documentation is gold. Even imperfect AI-generated documentation accelerated every subsequent phase.

Would we do it again with AI? Absolutely. Would we promise 50% faster delivery? Never again.

Luis, your tool evaluation resonates with our experience. A few additional lessons on tool selection and integration.

Why hybrid approaches win:

No single AI tool excels at everything. We found distinct patterns:

IBM watsonx strengths:

  • Best understanding of COBOL semantics and mainframe concepts
  • Handles CICS and IMS integration patterns well
  • Strong on batch processing and file handling
  • Context-aware of DB2 and VSAM data structures

GPT-4/Claude strengths:

  • More flexible with non-standard coding patterns
  • Better at generating modern, idiomatic Java
  • Excellent for documentation and test generation
  • Easier to customize with prompt engineering

Microsoft Semantic Kernel approach:

  • Best when you need multiple specialized agents
  • Modular architecture lets you swap components
  • Good for complex multi-step workflows

Our integration architecture:

We built a pipeline that routes modules to different tools based on characteristics:

  1. Classifier - Analyzes COBOL module and categorizes by complexity and pattern
  2. Router - Sends to appropriate AI tool based on classification
  3. Translator - AI tool generates Java code
  4. Validator - Automated tests verify functional equivalence
  5. Reviewer - Human reviews flagged modules and failures

The classifier was trained on our first 200 manual translations to understand which patterns each tool handled best.
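The classify-and-route step above can be sketched as follows. The class names, `ModuleProfile` fields, and threshold heuristics are hypothetical stand-ins; the real classifier was a trained model, not hand-written rules.

```java
import java.util.List;

// Hedged sketch of steps 1-2 of the pipeline: categorize a COBOL
// module, then route it to the tool that handles its pattern best.
public class ModuleRouter {

    enum Route { BATCH_TOOL, LLM_PIPELINE, HUMAN_REVIEW }

    record ModuleProfile(String name, int linesOfCode,
                         boolean usesCics, boolean hasRedefines) {}

    static Route route(ModuleProfile m) {
        if (m.hasRedefines() && m.linesOfCode() > 5000) {
            return Route.HUMAN_REVIEW;   // known AI failure pattern
        }
        if (m.usesCics()) {
            return Route.LLM_PIPELINE;   // interactive module
        }
        return Route.BATCH_TOOL;         // default: batch tooling
    }

    public static void main(String[] args) {
        List<ModuleProfile> modules = List.of(
            new ModuleProfile("ACCT-POST", 12000, false, true),
            new ModuleProfile("TXN-SCREEN", 3000, true, false),
            new ModuleProfile("EOD-BATCH", 8000, false, false));
        for (ModuleProfile m : modules) {
            System.out.println(m.name() + " -> " + route(m));
        }
    }
}
```

The value of keeping this stage separate is that the router is the one place that encodes "which tool handles which pattern," so swapping a tool means retraining the classifier, not rebuilding the pipeline.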

Integration infrastructure investment:

  • 3 weeks to build the pipeline
  • 2 weeks to train the classifier
  • Ongoing: 0.5 FTE maintaining and improving the pipeline

The infrastructure cost that surprised us:

AI inference costs for 2.3 million lines of COBOL across multiple passes (documentation, tests, translation, review): approximately $180,000 over 18 months. Not trivial, but a small fraction of the engineering labor saved.

My advice: Don’t commit to a single tool until you’ve tested at least three approaches on your specific codebase. Every legacy system is different.

Luis, your numbers table is going directly into my next board presentation. The gap between planned and actual is the most honest data I’ve seen on AI modernization.

Executive sponsorship and budget allocation lessons:

How we structured the budget:

We split the modernization budget into four categories with different approval thresholds:

  1. AI tooling and infrastructure (15%) - Approved upfront with clear success criteria
  2. Core engineering labor (50%) - Approved in phases based on milestone completion
  3. Security and compliance (20%) - Non-negotiable, fully funded from day one
  4. Contingency (15%) - Reserved for the ‘unknown unknowns’

We used all of the contingency. In fact, we went 8% over total budget. But having the contingency meant we didn’t have to go back to the board mid-project with a crisis.

The executive sponsorship structure that worked:

  • Executive sponsor (me): Weekly status, monthly board updates, quarterly steering committee
  • Business sponsor (COO): Owned the business case and success criteria
  • Technical sponsor (Luis): Daily operational decisions, escalation path
  • Risk sponsor (CRO): Continuous risk assessment, regulatory liaison

Having the COO as business sponsor was critical. When engineers said ‘we need 4 more months,’ the COO could validate that the business case still held. It wasn’t just a technology project - it was a business transformation.

What I tell other CTOs:

  1. Never promise timeline savings - Promise risk reduction, cost reduction, and capability improvement. Timeline is the variable.

  2. Budget for 150% of AI tool vendor estimates - Vendors demo on simple cases. Your system is not simple.

  3. Security is not optional - Fund security review as a first-class line item, not an afterthought.

  4. Hire for judgment, not just translation - The engineers who succeed in AI-assisted modernization are those who know when NOT to trust the AI output.

  5. Celebrate the real wins - 18 months instead of 36+ is a massive success. Don’t let the 6-month overrun versus optimistic plan overshadow that.

Luis’s project is now my go-to example when executives ask ‘should we use AI for modernization?’ The answer is yes - with realistic expectations.

Luis, your case study is the kind of rigorous data we need more of in this space. Let me add a metrics and measurement framework perspective.

Building a measurement framework for AI modernization:

Your planned vs. actual table is excellent. Here’s how we’d extend it into a comprehensive framework:

Tier 1: Project metrics (what Luis shared)

  • Duration: planned vs actual
  • Budget: planned vs actual
  • Team size: planned vs actual (peak and average)
  • Defect rate: post-migration vs legacy baseline

Tier 2: Efficiency metrics

  • Lines of code translated per engineer-month (with and without AI)
  • AI success rate by module complexity tier
  • Human intervention rate: percentage of modules requiring manual work
  • Rework rate: modules that needed multiple translation attempts

Tier 3: Quality metrics

  • Test coverage: before vs after
  • Defect density: defects per 1,000 lines of translated code
  • Performance: response time and throughput vs legacy
  • Security findings: vulnerabilities per module in AI-generated vs human code

Tier 4: Business impact metrics

  • Infrastructure cost reduction (Luis reported -52%, better than -40% target)
  • Developer productivity post-migration
  • Time to deploy new features
  • Incident rate and MTTR
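The Tier 2 and Tier 3 calculations above reduce to simple aggregations over per-module records. A minimal sketch, with illustrative field names rather than any real tracking schema:

```java
import java.util.List;

// Sketch of Tier 2 (efficiency) and Tier 3 (quality) metrics computed
// from per-module translation records.
public class ModernizationMetrics {

    record ModuleResult(int lines, boolean aiSucceededFirstPass,
                        boolean neededHumanWork, int stagingDefects) {}

    // Tier 2: share of modules the AI translated cleanly on pass one.
    static double aiFirstPassRate(List<ModuleResult> results) {
        long ok = results.stream()
                         .filter(ModuleResult::aiSucceededFirstPass).count();
        return (double) ok / results.size();
    }

    // Tier 2: share of modules requiring any manual work.
    static double humanInterventionRate(List<ModuleResult> results) {
        long touched = results.stream()
                              .filter(ModuleResult::neededHumanWork).count();
        return (double) touched / results.size();
    }

    // Tier 3: defects per 1,000 translated lines.
    static double defectDensity(List<ModuleResult> results) {
        int defects = results.stream().mapToInt(ModuleResult::stagingDefects).sum();
        int lines = results.stream().mapToInt(ModuleResult::lines).sum();
        return defects * 1000.0 / lines;
    }

    public static void main(String[] args) {
        List<ModuleResult> batch = List.of(
            new ModuleResult(4000, true, false, 1),
            new ModuleResult(6000, false, true, 2),
            new ModuleResult(2000, true, false, 0),
            new ModuleResult(8000, true, true, 1));
        System.out.printf("AI first-pass rate: %.0f%%%n", 100 * aiFirstPassRate(batch));
        System.out.printf("Human intervention: %.0f%%%n", 100 * humanInterventionRate(batch));
        System.out.printf("Defect density: %.2f per KLOC%n", defectDensity(batch));
    }
}
```

Recomputing these weekly, per complexity tier, is what turns them into the early-warning system described below: a flat or rising intervention rate shows up long before the Tier 1 schedule slips.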

The metrics that matter most:

From a data science perspective, the most predictive metrics for project success:

  1. Early AI success rate - If your first 100 modules translate at <80% success, recalibrate expectations immediately

  2. Human intervention trend - Should decrease over time as you refine prompts and processes. If it’s increasing, something is wrong

  3. Defect escape rate - Defects found in staging vs production. AI-generated code shouldn’t have a higher escape rate than human code

What Luis’s data tells us:

  • 87% AI success vs 95% planned: realistic variance; plan for it
  • 18 months vs 12 months: a 50% timeline overrun is common; build in contingency
  • 23 defects vs <10 planned: all caught in staging, an acceptable variance
  • -52% infrastructure cost vs -40% planned: this is the real win

My recommendation: Track all four tiers from day one. The Tier 2 and 3 metrics are early warning systems. By the time Tier 1 metrics show problems, it’s often too late to course-correct.