
The Metrics Translation Problem: Why Technically Successful AI Projects Lose Funding

10 min read
Tian Pan
Software Engineer

Your model achieved 91% accuracy on the held-out test set. Latency is under 200ms at p95. You've cut the error rate by 40% compared to the previous rule-based system. By every technical measure, the project is a success. Six months later, leadership cancels it.

This is not a hypothetical. Eighty percent of AI projects fail to deliver intended business value, and the majority of those failures are not caused by model performance. They are caused by the gap between what engineers measure and what decision-makers understand. The technical team speaks a language that executives cannot evaluate — and in the absence of comprehensible signal, leadership defaults to skepticism.

The metrics translation problem is not a communication soft skill. It is an engineering discipline that most teams treat as optional until the funding review.

The Credibility Vacuum

When an engineering team presents "87% on the eval set" to a VP of Operations, something predictable happens: the VP has no frame of reference for whether 87% is good, acceptable, or catastrophic. They don't know what the baseline was. They don't know what 87% costs when it's wrong. They don't know whether the remaining 13% failure rate is randomly distributed or consistently hits the most expensive cases.

In that information vacuum, the VP does what any rational actor does when facing uncertainty: they wait for more evidence, reduce commitment, or cancel. The engineering team walks away confused — the model works, what more do they want? — while the VP's team quietly redirects budget to an initiative that someone can explain.

This dynamic repeats across organizations at scale. A 2025 MIT study found that 95% of corporate generative AI pilots fail to produce measurable returns. An S&P Global survey found that 42% of companies abandoned most AI initiatives in 2025, up from 17% in 2024 — a 147% year-over-year increase in abandonment. The technology is not the bottleneck. The communication layer is.

What makes this particularly damaging is that the credibility vacuum cuts both ways. Once a VP has approved funding for an AI initiative that couldn't demonstrate its value, they become more skeptical of the next proposal — regardless of its actual merits. The engineering team's track record of delivering value becomes indistinguishable from their track record of delivering confusion.

Why Engineers Default to the Wrong Metrics

The metrics engineers report are the metrics they optimize. F1 score, AUC-ROC, perplexity, p95 latency, token throughput — these are the objective functions baked into the development loop. They are precise, comparable, and meaningful to people who share the technical context.

The mistake is not that these metrics are wrong. It is that they are partial. They describe how the model performs in isolation. They say nothing about what the model enables in the context of a business process.

Consider a fraud detection model. An improvement from 0.82 to 0.91 F1 is technically significant. But the business question is: what does a false positive cost in customer experience? What does a false negative cost in fraud losses? If high-value fraud cases are concentrated in the 9% the model misses, the improvement may have made things worse along the dimension the business actually cares about. The F1 number hides all of this.

The same principle applies to latency. "p95 under 200ms" is a meaningful engineering target. "Support agents now resolve 40% more tier-1 tickets without escalating" is what gets budget renewed.

Engineers default to technical metrics because they are what the team controls and measures. Translating to business outcomes requires understanding the downstream workflow in detail — what happens after the model runs, who acts on its output, what the cost of errors is, and where the actual value is captured. That requires sustained collaboration with the business side, which engineers often deprioritize during development.

The Translation Layer: Mapping Metrics to Outcomes

The translation is not about dumbing down the results. It is about adding a second layer of meaning that connects model performance to operational impact.

A practical mapping looks like this:

Accuracy / F1 → error-reduction rate, rework cost, escalation rate. "Our model reduced manual review escalations by 18%, saving an estimated 2,100 hours per quarter."

Latency p95 → process cycle time, decision speed. "The inference pipeline returns results in under 200ms, which lets our support agents resolve tickets in one interaction instead of two — reducing average handle time by 22%."

Eval score improvement → baseline comparison with dollar translation. "We improved document classification accuracy from 81% to 94%. At our current volume, this eliminates approximately 800 misrouted tickets per month, each of which previously required 12 minutes of manual correction."

Inference cost → cost-per-outcome, not cost-per-token. "Processing cost per invoice dropped from $0.14 to $0.04, reducing AI compute costs by $180K annually at current volume."
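The arithmetic behind these translations is simple enough to keep in a shared script so the numbers in every report are computed the same way. A minimal sketch, using the misrouted-ticket example above; the loaded hourly rate is a hypothetical assumption you would replace with a finance-approved figure:

```python
# Sketch: turn an accuracy improvement into hours and dollars saved.
# Ticket volume and correction time come from the example above; the
# loaded hourly rate below is an assumed placeholder, not a real figure.

def misroute_savings(tickets_per_month: int,
                     minutes_per_correction: float,
                     loaded_hourly_rate: float) -> dict:
    """Annualize the cost of manual corrections the model eliminates."""
    hours_per_month = tickets_per_month * minutes_per_correction / 60
    annual_hours = hours_per_month * 12
    return {
        "annual_hours_saved": annual_hours,
        "annual_dollars_saved": annual_hours * loaded_hourly_rate,
    }

savings = misroute_savings(tickets_per_month=800,
                           minutes_per_correction=12,
                           loaded_hourly_rate=55.0)  # assumed rate
print(savings)  # {'annual_hours_saved': 1920.0, 'annual_dollars_saved': 105600.0}
```

Keeping this in code rather than a slide also forces the inputs (volume, correction time, rate) to be written down, which is exactly the baseline discipline discussed below.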

The key step is establishing a pre-deployment baseline before you write a single line of model code. Document what the current process looks like: how long it takes, what it costs, where errors occur, and how frequently. Without this baseline, you have no before/after story — and without that story, you cannot demonstrate that the AI changed anything.

McDonald's deployed an AI-powered drive-thru ordering system to over 100 locations. The engineering team's accuracy benchmarks — in the low-to-mid 80% range — were plausible for many applications. But no one had translated that into a business threshold before launch: at what accuracy level does AI ordering actually outperform human employees in customer experience, throughput, and labor cost? Without that definition, the project ran for years past its viable ROI window. When viral TikTok videos of bizarre AI orders accumulated, there was no quantitative framework to assess whether the system was performing acceptably or not. The project ended in June 2024 without a clear articulation of what success would have looked like.

The Three-Tier ROI Framework

One of the most common causes of AI project cancellation is premature evaluation. Projects get reviewed at month 9 for "ROI" when they were building foundational capability whose compounding returns would have materialized by month 24.

A useful framework distinguishes three time horizons:

Realized ROI (18–36 months): Hard financial outcomes that have already been captured — cost savings banked, revenue generated, headcount reduction achieved. This is the number the CFO uses.

Trending ROI (3–12 months): Early directional signals that indicate whether realized ROI is on track. Process improvements, output quality measurements, adoption rates, before/after workflow comparisons. This is the number the VP of Engineering should be presenting monthly.

Capability ROI (ongoing): The strategic option value of the infrastructure being built. Data pipelines, evaluation frameworks, model deployment systems, and team expertise that compound into future projects. This is the hardest to quantify and the easiest for leadership to dismiss — so it requires the most deliberate communication.

The practical implication is that teams need to negotiate which time horizon will be used to evaluate the project before development starts. A project funded with a Capability ROI rationale should not be evaluated at 9 months on Realized ROI criteria. This sounds obvious. It almost never happens explicitly, which is why 73% of organizations report lacking executive consensus on how AI success is defined before launch — making failure nearly structurally inevitable.
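One way to make the negotiated horizon hard to forget is to encode it alongside the project's other success criteria, so any review can be checked against the tier that was actually agreed at kickoff. A minimal sketch; the tier boundaries follow the framework above, and the fallback naming is illustrative:

```python
# Sketch: encode the agreed ROI horizons so a review at month N is
# matched to the tier negotiated at kickoff. Boundaries follow the
# three-tier framework above; outside both windows we fall back to
# capability ROI (ongoing option value).

ROI_TIERS = [
    ("trending", 3, 12),   # directional signals
    ("realized", 18, 36),  # banked financial outcomes
]

def tier_for_month(month: int) -> str:
    """Return the ROI tier a review at the given project month may fairly apply."""
    for name, start, end in ROI_TIERS:
        if start <= month <= end:
            return name
    return "capability"

print(tier_for_month(9))   # 'trending'
print(tier_for_month(24))  # 'realized'
```

A month-9 review that demands realized ROI fails this check by construction, which is the point: the disagreement surfaces as a lookup result, not a funding-meeting surprise.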

The Reporting Cadence That Keeps Projects Alive

A single quarterly report is not enough to maintain leadership confidence through the messy middle of AI development. Executives who stop receiving regular, comprehensible signals assume nothing is happening.

A tiered cadence aligned to audience:

Engineering / data science (weekly): model accuracy, drift metrics, latency percentiles, uptime, throughput. This is the team's internal dashboard.

Department heads / VPs (monthly): time saved per workflow, task completion rates, error reduction percentages, adoption curves, and a one-paragraph narrative connecting current metrics to the agreed-upon success criteria.

CFO / Finance (quarterly): ROI statement with realized savings, trending indicators, and projected timeline to next threshold. Use the dollar-denominated translations from the metric mapping above.

C-suite / Board (quarterly or semi-annually): strategic narrative — how this capability positions the organization, what optionality it creates, and how it compares to the competitive baseline.

The most effective reporting keeps technical details in appendices. Executives who are curious can ask; executives who are not should not be forced to wade through confusion before reaching the business conclusion.

One operational detail that consistently gets missed: establish the baseline before the first sprint and include it in every subsequent report. "We have saved 2,100 engineer-hours this quarter" requires knowing how many hours the process consumed before the AI existed. Teams that skip baselining end up with stories like "the model is performing well" — which communicates nothing to the people deciding whether to continue funding.

The Organizational Failure Mode

Projects with sustained CEO involvement achieve a 68% success rate compared to 11% when leadership backing lapses. Projects with defined, pre-approved success metrics show a 54% success rate versus 12% without. Projects with dedicated change management resources achieve nearly three times the success rate of those without.

These numbers point to a consistent pattern: the communication gap is not just a reporting problem. It is an alignment problem that starts before the first model is trained. Teams that secure explicit agreement on success metrics, evaluation timelines, and reporting cadence at project initiation dramatically outperform those that defer these conversations until the funding review.

The actionable implication for engineering teams is to front-load the translation work. Before development begins, draft a one-page stakeholder brief that defines:

  • The business process being improved and its current cost
  • The specific, dollar-denominated outcomes that constitute success
  • The time horizon over which each type of ROI will be measured
  • The minimum performance threshold below which the project should be redesigned or stopped

This document is not a technical spec. It is a shared contract between engineering and the business that makes the project's success criteria comprehensible to everyone involved — before anyone has spent money on GPU time.
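The brief's contents can be captured as a structured record so the success criteria stay explicit and machine-checkable rather than buried in a deck. A sketch under stated assumptions; the field names and example values are illustrative, not a prescribed schema:

```python
# Sketch: the one-page stakeholder brief as a structured record.
# Field names and the example values are hypothetical illustrations.

from dataclasses import dataclass

@dataclass
class StakeholderBrief:
    process: str                # business process being improved
    current_annual_cost: float  # baseline cost of that process
    success_outcomes: list      # dollar-denominated outcomes that count
    roi_horizon_months: int     # when realized ROI will be evaluated
    kill_threshold: str         # below this, redesign or stop

brief = StakeholderBrief(
    process="Tier-1 support ticket routing",
    current_annual_cost=480_000.0,
    success_outcomes=[
        "Cut misrouted tickets by 50%",
        "Save 1,500 agent-hours per year",
    ],
    roi_horizon_months=24,
    kill_threshold="Accuracy below the human baseline for 2 consecutive months",
)
print(brief.roi_horizon_months)  # 24
```

Because every field must be filled in before the record exists, the format itself forces the pre-development conversations the numbers above show most teams skip.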

Conclusion

The AI stakeholder communication gap is a solvable problem, but it requires treating translation as a deliverable rather than an afterthought. Engineers who wait until the funding review to explain what they built will consistently lose to teams that maintain a running narrative of business value throughout development.

The core discipline is straightforward: establish a baseline, map technical metrics to operational outcomes in dollar terms, negotiate the evaluation time horizon before development starts, and report to each stakeholder tier in their native language. None of this requires sacrificing technical rigor. It requires adding a second layer of meaning that connects what the model does to what the organization needs.

The model that gets cancelled is not always the weakest model. It is usually the model whose team never explained, in terms leadership could evaluate, why it should continue to exist.
