The CFO just rejected our AI investment proposal—did we skip the validation step?

Had a tough conversation with our CFO last week. We came in with a proposal to expand our AI coding assistant licenses across the entire engineering org—120 seats, roughly $180K annually. The engineering team was enthusiastic. We’d been piloting with 30 engineers for six months, and the feedback was positive. Developers liked the tools.

The CFO said no.

Not “maybe later” or “let’s revisit next quarter.” Just no. Her reasoning: “Show me the business impact first. I need to see this affecting our P&L before we scale it.”

That hit different. Because honestly? We couldn’t show her the impact. We had sentiment surveys showing developers were happy. We had anecdotal stories about faster code completion. But when she asked about delivery velocity, cycle time improvement, or revenue impact from features shipped faster… we had nothing concrete.

The reality check I wasn’t ready for

Turns out we’re not alone. I’ve been reading that only 14% of finance chiefs say they’ve seen clear, measurable impact from AI investments. Even more sobering: 95% of generative AI pilots fail to deliver tangible P&L results, according to MIT’s 2025 AI Report.

And now CFOs are deferring 25% of AI spending into 2027. The era of “show me you’re experimenting” is over. It’s “show me measurable impact, this year.”

Did we overhype, or did we just skip validation?

Here’s what bothers me most: I genuinely believe AI coding tools can improve productivity. The research shows engineering teams achieving 39% better R&D efficiency. The technology works.

But I’m questioning our approach. Did we:

  • Rush into deployment because everyone else was doing it?
  • Mistake developer satisfaction for business value?
  • Treat AI tools like free experiments instead of capital investments?
  • Skip the instrumentation needed to measure actual impact?

Looking back, we deployed these tools the same way companies rolled out collaboration software in 2005: install and pray. We didn’t establish baseline metrics. We didn’t define success criteria. We didn’t instrument our delivery pipeline to measure before/after.

We just… turned the tools on and assumed value would materialize.

The uncomfortable question

Was early AI adoption strategic, or was it FOMO-driven?

I keep thinking about this. If our CFO had asked about ROI before we started the pilot, would we have designed it differently? Would we have:

  • Measured baseline cycle times and velocity metrics first?
  • Defined specific hypotheses about where AI would help?
  • Created control groups to isolate the AI impact?
  • Connected engineering metrics to customer value and revenue?

Probably yes. But we didn’t do any of that. We got caught up in the narrative that “AI is the future” and “we need to move fast or get left behind.”

And now we’re stuck. We have 30 engineers using tools they like, but we can’t prove those tools justify $180K—or even the $45K we’re currently spending.

What would actually prove ROI to a CFO?

I’m genuinely asking: What metrics would convince a skeptical CFO that AI tools are worth the investment?

The engineering metrics I care about—PR velocity, code quality, developer satisfaction—don’t translate to finance language. She needs to see:

  • Revenue enabled by faster feature delivery?
  • Costs avoided through efficiency gains?
  • Customer retention improved by better product quality?
  • Margin expansion from doing more with the same headcount?

But connecting AI coding assistants to those outcomes requires instrumentation and attribution we don’t have. And building that measurement infrastructure might cost more than the tools themselves.

So where does that leave us? Do we:

  1. Shut down the pilot and admit we can’t justify it?
  2. Invest in measurement infrastructure before scaling?
  3. Accept that some innovations can’t be measured in traditional ROI terms?
  4. Find better proxies that connect engineering gains to business impact?

Right now, I’m leaning toward option 2. But I’m curious: How are other engineering leaders handling CFO scrutiny on AI investments? What validation approach actually works?

Because if 95% of AI pilots are failing, we need better playbooks. The “move fast and figure it out later” approach clearly isn’t working when finance is demanding proof.

Your CFO is doing you a favor. Seriously.

I’ve been on both sides of this conversation—as the engineering leader defending AI investments and as the CTO having to justify them to the board. The CFO rejection you experienced is actually healthy organizational discipline.

Here’s what we got wrong at my previous company: We treated AI tools like a research budget when we should’ve treated them like a capital expenditure.

The framework that actually worked

When I moved to my current role, I learned from those mistakes. Before deploying any AI coding assistants, we established:

1. Baseline productivity metrics (3 months pre-AI)

  • Average cycle time from commit to production
  • PR review time and iteration count
  • Incident resolution time
  • Feature delivery velocity (story points per sprint)

2. Hypothesis-driven deployment
We didn’t say “AI will make us faster.” We said:

  • “AI should reduce PR iteration count by catching bugs earlier”
  • “AI should reduce incident resolution time via better stack trace analysis”
  • “AI should free senior engineers from repetitive code, letting them focus on architecture”

3. Control groups
We rolled out to 50% of teams initially. The other 50% continued without AI tools. This gave us clean comparison data (there’s a rough sketch of that comparison below, after item 4).

4. Business-aligned metrics
We connected engineering gains to outcomes the CFO cares about:

  • Reduced incident count → lower customer churn
  • Faster feature delivery → revenue from new capabilities
  • Senior engineer time savings → capacity for strategic initiatives
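
To make items 1 and 3 concrete: the comparison itself doesn’t require fancy tooling. Here’s a minimal sketch in Python, with made-up field names, timestamps, and baselines; the real version just reads the commit and deploy timestamps your pipeline already records, grouped by whether the team had AI tools.

```python
from datetime import datetime
from statistics import median

# Hypothetical delivery records exported from the pipeline; the field names
# and timestamps are illustrative, not from any specific tool.
records = [
    {"team": "payments", "group": "ai",      "committed": "2025-03-01T10:00", "deployed": "2025-03-03T06:00"},
    {"team": "search",   "group": "control", "committed": "2025-03-01T09:00", "deployed": "2025-03-03T14:00"},
    # ... one row per change that reached production
]

def cycle_hours(row):
    """Commit-to-production cycle time for one change, in hours."""
    fmt = "%Y-%m-%dT%H:%M"
    delta = datetime.strptime(row["deployed"], fmt) - datetime.strptime(row["committed"], fmt)
    return delta.total_seconds() / 3600

def median_cycle_time(rows, group):
    """Median cycle time for one arm of the rollout (ai vs. control)."""
    return median(cycle_hours(r) for r in rows if r["group"] == group)

# Pre-AI baselines from the 3-month window in item 1 (placeholder numbers).
baseline_hours = {"ai": 52.0, "control": 54.0}

for group in ("ai", "control"):
    current = median_cycle_time(records, group)
    change = (current - baseline_hours[group]) / baseline_hours[group] * 100
    print(f"{group}: {current:.1f}h median cycle time ({change:+.1f}% vs. baseline)")
```

Incident resolution time and PR review cycles follow the same pattern: one metric function, grouped by treatment arm, compared against its own pre-AI baseline.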

The results

After 6 months, we could show our CFO:

  • 39% improvement in R&D efficiency (measured by features delivered per engineer)
  • 28% reduction in incident resolution time (tracked in PagerDuty)
  • 15% decrease in PR review cycles (GitHub metrics)
  • Estimated $450K in avoided hiring costs (we didn’t need the 3 additional engineers we’d planned to hire)

That last one—avoided costs—was what won the CFO over. She could put it directly on the P&L.

Why most pilots fail

The 95% failure rate you mentioned doesn’t surprise me. Most organizations:

  1. Deploy AI tools without baseline metrics
  2. Assume correlation equals causation
  3. Measure inputs (tool usage) instead of outcomes (business impact)
  4. Skip the control group, so they can’t isolate AI’s contribution

Engineering velocity isn’t enough. Every infrastructure investment claims to improve velocity. CFOs need to see customer value or cost avoidance.

My advice

You’re right to lean toward option 2 (invest in measurement infrastructure). But start smaller:

Phase 1 (Months 1-2): Establish baselines with your current 30-user pilot

  • Instrument your existing delivery pipeline
  • Capture current state metrics
  • Interview the 30 users about specific use cases where AI helped

Phase 2 (Months 3-4): Run structured experiments

  • A/B test specific workflows (e.g., AI for debugging vs. code generation)
  • Track time saved on concrete tasks
  • Document what doesn’t work (eliminates false positives)

Phase 3 (Months 5-6): Connect to business outcomes

  • Map time savings to capacity gains
  • Calculate avoided costs or revenue enabled
  • Build the P&L bridge your CFO needs
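
And the “P&L bridge” in Phase 3 can start as a back-of-the-envelope model. Here’s a rough sketch where every input (hours saved, realization rate, fully loaded cost per engineer) is an assumption you’d replace with what Phases 1-2 actually measured:

```python
# Minimal P&L-bridge sketch: engineering time savings -> avoided hiring cost.
# Every number below is a placeholder to be replaced with measured data.

engineers = 30                        # pilot size
hours_saved_per_eng_per_week = 3      # from Phase 2 task-level measurements
realization_rate = 0.5                # fraction of saved time that becomes usable capacity
working_weeks_per_year = 46
fully_loaded_cost_per_eng = 175_000   # salary + benefits + overhead, annualized (assumption)
tool_cost = 45_000                    # current annual spend for 30 seats

hours_per_eng_per_year = working_weeks_per_year * 40
capacity_gained_hours = (engineers * hours_saved_per_eng_per_week
                         * working_weeks_per_year * realization_rate)
equivalent_engineers = capacity_gained_hours / hours_per_eng_per_year
avoided_cost = equivalent_engineers * fully_loaded_cost_per_eng

print(f"Capacity gained: {capacity_gained_hours:,.0f} hours (~{equivalent_engineers:.1f} engineers)")
print(f"Avoided cost: ${avoided_cost:,.0f} vs. tool cost ${tool_cost:,.0f}")
print(f"Net impact: ${avoided_cost - tool_cost:,.0f}")
```

The realization rate is the line a CFO will push on hardest: saved minutes only count if they show up as deferred hires, avoided contractor spend, or features shipped sooner.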

If you can’t show measurable impact after this structured approach, then the honest answer is the tools aren’t worth it for your organization. And that’s valuable information too.

The alternative—scaling to 120 seats without proof—just multiplies the waste.

I’m going to push back a bit on the “CFOs need clear P&L impact” framing, because I think it misses something important about innovation timing.

Michelle’s framework is solid—baseline metrics, control groups, business-aligned outcomes. That’s textbook good practice. But there’s a question underneath this whole conversation: Are we optimizing for quarterly board slides or multi-year competitive advantage?

The cloud migration parallel

This reminds me of cloud ROI conversations in 2015. CFOs demanded proof that migrating to AWS would improve the P&L. Engineering teams couldn’t show immediate revenue impact. The migration itself was expensive and risky.

Most early cloud business cases were built on “avoided capex” and “elastic scaling.” Soft benefits. Hard to prove. Yet the companies that moved early—even without perfect ROI math—built capabilities their competitors couldn’t match.

Traditional P&L metrics lag the actual value. By the time you can prove ROI in finance terms, you’ve already missed the window where the investment mattered most.

The measurement trap

Here’s my concern with “prove it first” approaches: They optimize for measurable outcomes at the expense of exploratory learning.

AI coding tools don’t just make current workflows faster. They enable experiments that wouldn’t happen otherwise:

  • Junior engineers attempting architecture changes they’d normally escalate
  • Product teams prototyping features to test customer demand before committing engineering resources
  • Engineering teams exploring technical approaches that would’ve been too expensive pre-AI

How do you measure “experiments that wouldn’t have happened”? How do you quantify “learning that accelerated team capability”?

You can’t. Or at least, not in the 6-month window a CFO demands.

Innovation needs breathing room

The 95% AI pilot failure rate might actually be… fine?

If you run 10 pilots and 1 succeeds spectacularly, that could be worth more than 10 incremental improvements with guaranteed ROI. But you only discover which pilot succeeds by running all 10.

The problem isn’t that we’re experimenting. The problem is we’re not treating experiments like experiments.

Real experiments:

  • Have clear hypotheses (Michelle nailed this)
  • Define failure criteria upfront
  • Kill failed experiments quickly
  • Scale successful ones aggressively

Fake experiments:

  • Deploy tools organization-wide from day one
  • Assume value without measuring it
  • Continue funding indefinitely without evaluation
  • Treat “not proven” as “not working”

Where I agree

Michelle’s right that you need some instrumentation. If you can’t tell whether something’s working, you’re just hoping.

But I’d argue for a different sequencing:

Phase 1: Small-scale exploration with loose metrics

  • Give 10 engineers AI tools
  • Ask them to log “moments where AI unlocked something new”
  • Look for qualitative signals, not quantitative proof

Phase 2: Hypothesis formation from exploration

  • Identify the specific use cases that showed promise
  • Now define metrics around those cases
  • This is where you build Michelle’s framework

Phase 3: Rigorous validation

  • A/B tests, control groups, P&L bridges
  • The full CFO-grade measurement

Most organizations jump straight to Phase 3, which is why they miss the value. They’re measuring generic “productivity” instead of specific breakthroughs.

The uncomfortable truth

Sometimes the best investments can’t be justified by traditional ROI until after you’ve made them.

That doesn’t mean “ignore ROI entirely.” It means acknowledging that some strategic bets require faith backed by informed judgment, not proof.

The question isn’t “Can you prove this will work?” The question is “Can you afford to wait for proof while your competitors build capabilities you don’t have?”

I’m not saying skip validation entirely. But I am saying that over-indexing on measurability might be its own form of risk.

Both Michelle and David are making valid points, but I think we’re missing the elephant in the room: Most teams deployed AI tools without training or process changes. The ROI failure isn’t the technology—it’s how we rolled it out.

We’re literally repeating the enterprise software mistakes from 2005. “Install and pray.”

What our fintech org learned the hard way

We run financial systems where every minute of downtime is expensive. When we piloted AI coding assistants last year, we had the exact same CFO conversation Keisha described.

First pilot: unstructured rollout to 20 engineers. “Here’s your license, go use it.” Six months later? Mixed results. Some engineers loved it. Others never used it. Zero measurable productivity gain.

Second pilot: structured enablement for 25 different engineers. We invested in:

  • 2-hour onboarding workshop covering actual use cases from our codebase
  • Weekly office hours where engineers shared what worked
  • Documentation of team-specific workflows where AI helped (vs. where it didn’t)
  • Integration into existing processes (code review checklists, incident runbooks, etc.)

Same tools. Different deployment approach. Night and day difference.

The structured enablement group showed:

  • 35% faster incident resolution (we track this rigorously for compliance)
  • 22% reduction in PR review cycles
  • Significantly higher sustained usage after 3 months

The real insight: foundation investment matters more than model selection

There’s research backing this up. Organizations reporting meaningful AI returns invest far more in foundations than in models:

  1. Process integration: AI tools work when they fit into existing workflows, not when they require new ones
  2. Change management: Developers need to understand when to use AI vs. when not to
  3. Team learning infrastructure: Sharing what works compounds benefits across teams

We were so focused on “which AI tool to buy” that we ignored “how to actually deploy it effectively.”

David’s point about innovation is valid, but…

I agree with David that some innovations need breathing room. But there’s a difference between:

Strategic exploration: “We don’t know if this will work, so let’s run a disciplined experiment to find out”

vs.

Undisciplined rollout: “Everyone’s doing AI, so we should too—we’ll figure out the value later”

The first is healthy innovation. The second is FOMO.

Michelle’s framework works because it treats innovation as rigorous exploration, not faith-based investment. You can be experimental and disciplined.

What actually changes the ROI conversation

When we went back to our CFO with the second pilot results, we didn’t just show productivity metrics. We showed:

Process maturity: “Here’s the playbook for enablement. Here’s what works and what doesn’t.”

This changed her question from “Does this tool work?” to “Can we replicate these results at scale?”

That’s a much better conversation. Because it acknowledges that technology effectiveness depends on implementation quality.

My recommendation

Before you shut down your pilot or scale it, do this:

  1. Take your 30-user pilot group and split them into cohorts

    • High engagement users (using AI daily)
    • Medium engagement (occasional use)
    • Low engagement (rarely using it)
  2. Interview representatives from each cohort

    • What workflows do AI tools help with?
    • What workflows do they actively avoid AI for?
    • What training or support would make them more effective?
  3. Design structured enablement based on what you learn

    • Document the successful patterns
    • Create lightweight training
    • Integrate AI into existing team processes
  4. Run a comparison

    • Keep 15 users on self-directed approach
    • Give 15 users structured enablement
    • Measure the difference after 2 months
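
None of this needs special tooling, either. Here’s a rough sketch of steps 1 and 4 in Python, assuming you can export per-engineer usage and PR metrics; the field names and thresholds are invented for illustration:

```python
from statistics import mean

# Hypothetical per-engineer records exported from your tooling; the field
# names, values, and thresholds are illustrative only.
engineers = [
    {"name": "a", "enablement": "structured",    "ai_days_per_week": 4.5, "pr_review_cycles": 1.6},
    {"name": "b", "enablement": "self_directed", "ai_days_per_week": 0.5, "pr_review_cycles": 2.4},
    # ... one row per engineer in the 30-person pilot
]

def engagement_cohort(days_per_week):
    """Step 1: bucket engineers by how often they actually use the assistant."""
    if days_per_week >= 3:
        return "high"
    if days_per_week >= 1:
        return "medium"
    return "low"

# Step 1: cohort sizes, which tell you whom to interview in step 2.
cohorts = {}
for e in engineers:
    cohorts.setdefault(engagement_cohort(e["ai_days_per_week"]), []).append(e["name"])
print({cohort: len(members) for cohort, members in cohorts.items()})

# Step 4: compare an outcome metric between the two enablement arms.
for arm in ("structured", "self_directed"):
    rows = [e for e in engineers if e["enablement"] == arm]
    print(arm, f"avg PR review cycles: {mean(e['pr_review_cycles'] for e in rows):.2f}")
```

The interviews in steps 2 and 3 are where the real signal is; the numbers mostly tell you whom to talk to and whether enablement moved anything.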

If structured enablement doesn’t improve outcomes, then maybe the tools genuinely aren’t a fit for your organization. But I’d bet that deployment quality matters more than tool quality.

We’re blaming the technology for our implementation failures. That’s backwards.

Reading this whole thread from the design team perspective, and honestly? This conversation proves something our team suspected from the beginning.

The engineering teams hyped AI tools more than they actually delivered.

I don’t mean this as a criticism—I get why engineers were excited. But watching this play out has been… instructive.

What we saw from the design side

When our engineering team rolled out AI coding assistants, they were enthusiastic. “This will change everything!” “We’ll ship features twice as fast!” “Code quality will improve!”

Our design team was like… okay, show us.

Six months later, we haven’t seen material changes in:

  • Feature delivery timelines
  • Engineering bandwidth for design collaboration
  • Quality of technical implementation
  • Velocity of iterating on design feedback

What we have seen:

  • Engineers spending time learning AI tools
  • Conversations about “AI-generated code” in code reviews
  • Excitement about individual productivity that doesn’t seem to translate to team outcomes

It reminded me of when Figma introduced auto-layout. Individual designers got faster at certain tasks, but shipping actual design work? That stayed roughly the same because individual speed isn’t the bottleneck.

The narrative got ahead of the reality

Luis’s point about “install and pray” resonates. But here’s what bothers me: even with structured rollout, the productivity gains everyone’s citing are coding-specific metrics.

Michelle mentioned 39% R&D efficiency, 28% incident resolution improvement. Those sound impressive. But:

  • Did product velocity increase?
  • Did customer-facing features ship faster?
  • Did technical debt decrease?
  • Did cross-functional collaboration improve?

Because from where design sits, we didn’t see those changes. We saw engineers excited about their tools, but the system-level constraints didn’t budge.

Maybe we built the wrong narrative around AI tools

David’s cloud migration parallel is interesting, but I think it misses something. Cloud migration had clear structural benefits:

  • Eliminated hardware procurement bottlenecks
  • Enabled elastic scaling
  • Reduced ops overhead

Those benefits affected everyone downstream. Product could move faster. Sales could promise faster delivery. Support could rely on better uptime.

AI coding assistants… don’t have that same system-level impact. They optimize one part of the value chain (writing code) but leave everything else unchanged:

  • Requirements still need clarification
  • Design still needs iteration
  • Testing still takes time
  • Deployment still has dependencies
  • Customer feedback still drives changes

Luis is right that implementation matters. But even perfect implementation might not justify the cost if coding speed isn’t actually your constraint.

The CFO forcing honest evaluation is a gift

Your CFO did you a favor by asking for proof. Not because AI tools are bad, but because the question “what are we actually optimizing for?” is important.

If your bottleneck is writing code, AI tools might help. If your bottleneck is:

  • Understanding what to build
  • Coordinating across teams
  • Validating with customers
  • Managing technical dependencies
  • Handling production issues

…then optimizing coding speed won’t move the needle. And spending $180K/year on marginal gains is hard to justify.

What I appreciate about this conversation

All three of you—Michelle, David, Luis—are taking this seriously. You’re wrestling with measurement, implementation, innovation strategy.

But from the outside, I can’t help thinking: What if the tools just aren’t worth it for most organizations?

Not because the technology is bad. Not because rollout was poor. But because the problem they solve isn’t the problem most teams actually have.

Sometimes the best ROI is admitting a tool doesn’t work for your workflow. And that’s okay.

Not every innovation needs to succeed. Not every tool needs to fit every team. The honest acknowledgment that “this doesn’t solve our actual constraint” is more valuable than forcing justification.