From COBOL to Java: Our 18-Month Journey with AI-Assisted Modernization

We completed our COBOL-to-Java modernization project six months ago. Here’s the honest retrospective on what worked, what didn’t, and what we’d do differently.

The Starting Point

  • 2.3 million lines of COBOL across 2,400 modules
  • Core banking system processing $4.2 billion in daily transactions
  • Average age of codebase: 28 years
  • Documentation: essentially none
  • Number of engineers who understood the full system: 3 (average age: 57)

The Tools We Evaluated

We did a formal evaluation of five approaches:

  1. IBM watsonx Code Assistant for Z - Best for large-scale COBOL, strong mainframe integration
  2. Custom GPT-4 pipeline - Most flexible, required significant prompt engineering
  3. Microsoft Semantic Kernel approach - Best modular architecture, multiple specialized agents
  4. Traditional rule-based transpiler - Most predictable, least intelligent
  5. Manual translation - Baseline for comparison

Our choice: Hybrid of IBM watsonx for batch processing modules and custom GPT-4 for interactive components. Different tools excel at different patterns.

Phase 1: Discovery and Documentation (Months 1-4)

What AI did well:

  • Generated documentation for all 2,400 modules in 6 weeks (would have taken 8+ months manually)
  • Identified 47 instances of duplicated business logic we didn’t know existed
  • Created dependency maps showing which modules were tightly coupled
  • Extracted data flow diagrams from code

What required human judgment:

  • Validating that AI-generated documentation was accurate (about 85% was correct)
  • Understanding WHY code was written certain ways (regulatory requirements, historical bugs)
  • Identifying which ‘bugs’ were actually ‘features’ relied upon by downstream systems

Phase 2: Test Generation (Months 3-6)

The game-changer. AI-generated test suites gave us the confidence to modernize.

  • Test coverage went from 12% to 67%
  • AI generated 15,000+ test cases across the codebase
  • Human review identified 2,300 additional edge cases the AI missed
  • This test suite became our safety net for the entire project
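To make the "safety net" concrete, here is a minimal sketch of the kind of characterization (golden-master) test the AI generated: run the translated Java routine against outputs captured from the legacy COBOL system and fail on any mismatch. The routine, figures, and fixture values are hypothetical, not from Luis's project.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Map;

// Illustrative golden-master test: translated Java output must match
// output captured from the legacy COBOL system for the same inputs.
public class InterestCalcCharacterizationTest {

    // Stand-in for a translated module. COBOL used COMP-3 packed
    // decimals; BigDecimal preserves that fixed-point behavior,
    // including the legacy rounding mode.
    static BigDecimal dailyInterest(BigDecimal balance, BigDecimal annualRate) {
        return balance.multiply(annualRate)
                      .divide(new BigDecimal("365"), 2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // Golden values captured from the legacy system (hypothetical).
        Map<String, BigDecimal> golden = Map.of(
            "1000.00",   new BigDecimal("0.14"),
            "250000.00", new BigDecimal("34.25"));
        for (var e : golden.entrySet()) {
            BigDecimal actual = dailyInterest(new BigDecimal(e.getKey()),
                                              new BigDecimal("0.05"));
            if (actual.compareTo(e.getValue()) != 0) {
                throw new AssertionError(e.getKey() + ": expected "
                        + e.getValue() + ", got " + actual);
            }
        }
        System.out.println("All golden cases match");
    }
}
```

Tests of this shape are cheap to generate in bulk, which is why coverage could jump from 12% to 67% so quickly; the human-added edge cases target the inputs the capture run never exercised.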

Phase 3: Translation (Months 5-12)

Batch processing modules (1,800 modules): IBM watsonx handled these with 91% success rate on first pass. Remaining 9% required human intervention for complex state management and REDEFINES constructs.

Interactive modules (600 modules): Custom GPT-4 pipeline with extensive prompt engineering. 78% success rate on first pass - lower than batch, but these were more complex.

Human effort distribution:

  • 30% on the 9% of batch modules that failed AI translation
  • 40% on interactive modules (higher complexity)
  • 30% on integration, performance optimization, and security review

Phase 4: Integration and Hardening (Months 10-18)

This is where the ‘magic button’ expectation really breaks down.

Integration challenges:

  • 67 modules had implicit dependencies on mainframe infrastructure (CICS, DB2 stored procedures)
  • Transaction boundaries needed complete redesign for cloud-native deployment
  • Performance optimization required for 40% of translated modules

Security hardening:

  • Added input validation to all API endpoints
  • Implemented proper session management (didn’t exist in batch world)
  • Replaced hardcoded credentials with secrets management
  • Added encryption for data at rest and in transit
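As one concrete instance of the hardcoded-credentials item above, here is a minimal sketch: the password lookup is injected so it can be tested, and in production it would point at the environment or a secrets manager (Vault, AWS Secrets Manager, etc.). The variable name `DB_PASSWORD` and the class are illustrative assumptions.

```java
import java.util.Map;
import java.util.function.Function;

// Sketch: credentials come from an injected lookup instead of a
// string literal in source. Failing loudly replaces the old COBOL
// habit of compiled-in defaults.
public class DbCredentials {

    static String dbPassword(Function<String, String> lookup) {
        String value = lookup.apply("DB_PASSWORD");
        if (value == null || value.isBlank()) {
            throw new IllegalStateException(
                "DB_PASSWORD not set; refusing to fall back to a default");
        }
        return value;
    }

    public static void main(String[] args) {
        // In production: dbPassword(System::getenv)
        System.out.println(dbPassword(Map.of("DB_PASSWORD", "s3cret")::get));
    }
}
```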

The Numbers

Metric                   | Planned     | Actual
Duration                 | 12 months   | 18 months
Team size                | 8 engineers | 12 engineers (peak)
AI translation success   | 95%         | 87%
Post-migration defects   | <10         | 23 (all caught in staging)
Performance vs legacy    | Equivalent  | 15% faster
Infrastructure cost      | -40%        | -52%

Key Lessons

  1. AI is a force multiplier, not a replacement. Our team delivered what would have been a 36+ month project in 18 months. That’s the realistic expectation.

  2. Invest in test generation first. The AI-generated test suite was worth the entire AI tooling investment on its own.

  3. Budget 40% of effort for ‘the last 10%.’ The modules AI couldn’t handle cleanly consumed disproportionate time.

  4. Security is not automated. For every hour AI saved on translation, we spent an hour on security review.

  5. Documentation is gold. Even imperfect AI-generated documentation accelerated every subsequent phase.

Would we do it again with AI? Absolutely. Would we promise 50% faster delivery? Never again.

Luis, your tool evaluation resonates with our experience. A few additional lessons on tool selection and integration.

Why hybrid approaches win:

No single AI tool excels at everything. We found distinct patterns:

IBM watsonx strengths:

  • Best understanding of COBOL semantics and mainframe concepts
  • Handles CICS and IMS integration patterns well
  • Strong on batch processing and file handling
  • Context-aware of DB2 and VSAM data structures

GPT-4/Claude strengths:

  • More flexible with non-standard coding patterns
  • Better at generating modern, idiomatic Java
  • Excellent for documentation and test generation
  • Easier to customize with prompt engineering

Microsoft Semantic Kernel approach:

  • Best when you need multiple specialized agents
  • Modular architecture lets you swap components
  • Good for complex multi-step workflows

Our integration architecture:

We built a pipeline that routes modules to different tools based on characteristics:

  1. Classifier - Analyzes COBOL module and categorizes by complexity and pattern
  2. Router - Sends to appropriate AI tool based on classification
  3. Translator - AI tool generates Java code
  4. Validator - Automated tests verify functional equivalence
  5. Reviewer - Human reviews flagged modules and failures

The classifier was trained on our first 200 manual translations to understand which patterns each tool handled best.
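The classify-and-route step above can be sketched as follows. The class names, `ModuleProfile` fields, and threshold heuristics are hypothetical stand-ins; the real classifier was a trained model, not hand-written rules.

```java
import java.util.List;

// Hedged sketch of steps 1-2 of the pipeline: categorize a COBOL
// module, then route it to the tool that handles its pattern best.
public class ModuleRouter {

    enum Route { BATCH_TOOL, LLM_PIPELINE, HUMAN_REVIEW }

    record ModuleProfile(String name, int linesOfCode,
                         boolean usesCics, boolean hasRedefines) {}

    static Route route(ModuleProfile m) {
        if (m.hasRedefines() && m.linesOfCode() > 5000) {
            return Route.HUMAN_REVIEW;   // known AI failure pattern
        }
        if (m.usesCics()) {
            return Route.LLM_PIPELINE;   // interactive module
        }
        return Route.BATCH_TOOL;         // default: batch tooling
    }

    public static void main(String[] args) {
        List<ModuleProfile> modules = List.of(
            new ModuleProfile("ACCT-POST", 12000, false, true),
            new ModuleProfile("TXN-SCREEN", 3000, true, false),
            new ModuleProfile("EOD-BATCH", 8000, false, false));
        for (ModuleProfile m : modules) {
            System.out.println(m.name() + " -> " + route(m));
        }
    }
}
```

The value of keeping this stage separate is that the router is the one place that encodes "which tool handles which pattern," so swapping a tool means retraining the classifier, not rebuilding the pipeline.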

Integration infrastructure investment:

  • 3 weeks to build the pipeline
  • 2 weeks to train the classifier
  • Ongoing: 0.5 FTE maintaining and improving the pipeline

The infrastructure cost that surprised us:

AI inference costs for 2.3 million lines of COBOL across multiple passes (documentation, tests, translation, review): approximately $180,000 over 18 months. Not trivial, but a small fraction of the engineering labor saved.

My advice: Don’t commit to a single tool until you’ve tested at least three approaches on your specific codebase. Every legacy system is different.

Luis, your numbers table is going directly into my next board presentation. The gap between planned and actual is the most honest data I’ve seen on AI modernization.

Executive sponsorship and budget allocation lessons:

How we structured the budget:

We split the modernization budget into four categories with different approval thresholds:

  1. AI tooling and infrastructure (15%) - Approved upfront with clear success criteria
  2. Core engineering labor (50%) - Approved in phases based on milestone completion
  3. Security and compliance (20%) - Non-negotiable, fully funded from day one
  4. Contingency (15%) - Reserved for the ‘unknown unknowns’

We used all of the contingency. In fact, we went 8% over total budget. But having the contingency meant we didn’t have to go back to the board mid-project with a crisis.

The executive sponsorship structure that worked:

  • Executive sponsor (me): Weekly status, monthly board updates, quarterly steering committee
  • Business sponsor (COO): Owned the business case and success criteria
  • Technical sponsor (Luis): Daily operational decisions, escalation path
  • Risk sponsor (CRO): Continuous risk assessment, regulatory liaison

Having the COO as business sponsor was critical. When engineers said ‘we need 4 more months,’ the COO could validate that the business case still held. It wasn’t just a technology project - it was a business transformation.

What I tell other CTOs:

  1. Never promise timeline savings - Promise risk reduction, cost reduction, and capability improvement. Timeline is the variable.

  2. Budget for 150% of AI tool vendor estimates - Vendors demo on simple cases. Your system is not simple.

  3. Security is not optional - Fund security review as a first-class line item, not an afterthought.

  4. Hire for judgment, not just translation - The engineers who succeed in AI-assisted modernization are those who know when NOT to trust the AI output.

  5. Celebrate the real wins - 18 months instead of 36+ is a massive success. Don’t let the 6-month overrun versus optimistic plan overshadow that.

Luis’s project is now my go-to example when executives ask ‘should we use AI for modernization?’ The answer is yes - with realistic expectations.

Luis, your case study is the kind of rigorous data we need more of in this space. Let me add a metrics and measurement framework perspective.

Building a measurement framework for AI modernization:

Your planned vs. actual table is excellent. Here’s how we’d extend it into a comprehensive framework:

Tier 1: Project metrics (what Luis shared)

  • Duration: planned vs actual
  • Budget: planned vs actual
  • Team size: planned vs actual (peak and average)
  • Defect rate: post-migration vs legacy baseline

Tier 2: Efficiency metrics

  • Lines of code translated per engineer-month (with and without AI)
  • AI success rate by module complexity tier
  • Human intervention rate: percentage of modules requiring manual work
  • Rework rate: modules that needed multiple translation attempts

Tier 3: Quality metrics

  • Test coverage: before vs after
  • Defect density: defects per 1,000 lines of translated code
  • Performance: response time and throughput vs legacy
  • Security findings: vulnerabilities per module in AI-generated vs human code

Tier 4: Business impact metrics

  • Infrastructure cost reduction (Luis reported -52%, better than -40% target)
  • Developer productivity post-migration
  • Time to deploy new features
  • Incident rate and MTTR
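The Tier 2 and Tier 3 calculations above reduce to simple aggregations over per-module records. A minimal sketch, with illustrative field names rather than any real tracking schema:

```java
import java.util.List;

// Sketch of Tier 2 (efficiency) and Tier 3 (quality) metrics computed
// from per-module translation records.
public class ModernizationMetrics {

    record ModuleResult(int lines, boolean aiSucceededFirstPass,
                        boolean neededHumanWork, int stagingDefects) {}

    // Tier 2: share of modules the AI translated cleanly on pass one.
    static double aiFirstPassRate(List<ModuleResult> results) {
        long ok = results.stream()
                         .filter(ModuleResult::aiSucceededFirstPass).count();
        return (double) ok / results.size();
    }

    // Tier 2: share of modules requiring any manual work.
    static double humanInterventionRate(List<ModuleResult> results) {
        long touched = results.stream()
                              .filter(ModuleResult::neededHumanWork).count();
        return (double) touched / results.size();
    }

    // Tier 3: defects per 1,000 translated lines.
    static double defectDensity(List<ModuleResult> results) {
        int defects = results.stream().mapToInt(ModuleResult::stagingDefects).sum();
        int lines = results.stream().mapToInt(ModuleResult::lines).sum();
        return defects * 1000.0 / lines;
    }

    public static void main(String[] args) {
        List<ModuleResult> batch = List.of(
            new ModuleResult(4000, true, false, 1),
            new ModuleResult(6000, false, true, 2),
            new ModuleResult(2000, true, false, 0),
            new ModuleResult(8000, true, true, 1));
        System.out.printf("AI first-pass rate: %.0f%%%n", 100 * aiFirstPassRate(batch));
        System.out.printf("Human intervention: %.0f%%%n", 100 * humanInterventionRate(batch));
        System.out.printf("Defect density: %.2f per KLOC%n", defectDensity(batch));
    }
}
```

Recomputing these weekly, per complexity tier, is what turns them into the early-warning system described below: a flat or rising intervention rate shows up long before the Tier 1 schedule slips.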

The metrics that matter most:

From a data science perspective, the most predictive metrics for project success:

  1. Early AI success rate - If your first 100 modules translate at <80% success, recalibrate expectations immediately

  2. Human intervention trend - Should decrease over time as you refine prompts and processes. If it’s increasing, something is wrong

  3. Defect escape rate - Defects found in staging vs production. AI-generated code shouldn’t have a higher escape rate than human code

What Luis’s data tells us:

  • 87% AI success vs 95% planned: realistic variance; plan for it
  • 18 months vs 12 months: a 50% timeline overrun is common; build in contingency
  • 23 defects vs <10 planned: all caught in staging, an acceptable variance
  • -52% infrastructure cost vs -40% planned: this is the real win

My recommendation: Track all four tiers from day one. The Tier 2 and 3 metrics are early warning systems. By the time Tier 1 metrics show problems, it’s often too late to course-correct.