The Eval Budget Your CFO Cannot See on a Spreadsheet

· 8 min read
Tian Pan
Software Engineer

Open any quarterly planning spreadsheet and you can find every feature your team shipped, every contractor invoice, every cloud line item. What you will not find is a row for the outage that never happened, the hallucinated refund that was caught before it reached a customer, or the prompt regression that an eval blocked at 2 a.m. Those non-events have no SKU. They generate no ticket, no postmortem, no Slack thread. And so, when the eval budget comes up for renewal, it is competing for headcount against a feature that has a demo — and it loses, almost every time.

This is not a failure of nerve. It is a measurement problem. Eval investment behaves like a safety net and a test suite at the same time: it compounds quietly, it pays out in disasters avoided, and its entire value is counterfactual. Finance is structurally blind to counterfactuals. If you lead an AI team, your job is not to argue that evals are important — everyone already nods at that. Your job is to make a compounding, invisible return legible to people who only trust spreadsheets.

Why the Value of Evals Is Structurally Invisible

Security teams have lived this problem for two decades, and they gave it a name: the prevention paradox. The better a security program performs, the less visible its value becomes. A CISO whose organization suffers no breaches has, in the eyes of the budget committee, presided over a department that "did nothing." A $2 million investment in endpoint detection that prevents a $15 million ransomware attack produces, on the books, a $2 million cost and a $0 benefit, because the $15 million attack never happened and so remains hypothetical. Practitioners also call this the invisible protection paradox, and it produces a vicious cycle: you invest, the investment works, the absence of disaster makes the investment look unnecessary, support erodes, and the organization quietly becomes more fragile.
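To make the gap between the two views concrete, here is a minimal sketch of the arithmetic. The $2 million and $15 million figures come from the example above; the function names and the annual attack probability are illustrative assumptions, not data from any real security program.

```python
def ledger_view(cost: float) -> dict:
    """What the books record: the spend is real, the avoided loss is never booked."""
    return {"cost": cost, "benefit": 0.0, "net": -cost}


def expected_value_view(cost: float, potential_loss: float, annual_probability: float) -> dict:
    """Risk-adjusted view: credit the investment with the expected loss it removes.

    annual_probability is an ASSUMED input; in practice it is an estimate,
    and the whole argument is about how hard that estimate is to defend.
    """
    expected_loss_avoided = potential_loss * annual_probability
    return {
        "cost": cost,
        "benefit": expected_loss_avoided,
        "net": expected_loss_avoided - cost,
    }


def break_even_probability(cost: float, potential_loss: float) -> float:
    """Attack probability above which the investment pays for itself in expectation."""
    return cost / potential_loss


COST = 2_000_000             # endpoint detection spend, from the example
POTENTIAL_LOSS = 15_000_000  # ransomware loss, from the example
ASSUMED_PROBABILITY = 0.2    # hypothetical: a 1-in-5 chance per year without the control

books = ledger_view(COST)
risk_adjusted = expected_value_view(COST, POTENTIAL_LOSS, ASSUMED_PROBABILITY)
threshold = break_even_probability(COST, POTENTIAL_LOSS)

print(f"Ledger net:        {books['net']:,.0f}")
print(f"Risk-adjusted net: {risk_adjusted['net']:,.0f}")
print(f"Break-even attack probability: {threshold:.1%}")
```

Under these assumed numbers, the investment breaks even in expectation once the yearly chance of the attack exceeds roughly 13 percent, yet the ledger shows a pure loss at any probability. The same shape of calculation applies to an eval budget, with the same caveat: the probability is an estimate, not a measurement.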

Evals are the prevention layer of an AI product, and they inherit the paradox wholesale. Every eval that catches a regression before deploy converts a potential incident into a non-event. The non-event is the win. The non-event is also, by definition, unmeasurable after the fact — you cannot point to the customer who was not given wrong medical advice, because there is no customer, there is no record, there is nothing.

The asymmetry with feature work is brutal. A feature produces a demo, a launch post, a usage graph that goes up and to the right. An eval suite produces a number that, when the team is doing its job, simply stays flat. Flat does not photograph well in a board deck. So the eval budget enters every prioritization meeting already losing on narrative, regardless of how much value it is actually generating.

Evals Are Not QA Overhead — They Are a Speed Instrument

The most expensive framing mistake a leader can make is to file evals under "quality assurance." QA, in most organizations' mental model, is a tax on shipping — a gate you pass through, slower, on the way to production. If evals are a tax, then cutting them buys speed, and any pressure to move faster becomes pressure to cut evals.

That model is exactly backwards. A good eval suite is what lets you ship fast, because it is the thing standing between "move fast" and "move fast and break the customer's trust." Without evals, every change to a prompt, a model version, a retrieval index, or a tool definition is a bet placed blind. Teams without that safety net do not actually ship faster — they ship, get burned, and then slow to a crawl out of fear, hand-testing every change because they have no automated way to know what they broke.

The 2025 AI Index from Stanford HAI reported that organizations with structured evaluation workflows experience significantly fewer production incidents. Quality is now cited as the top barrier to AI deployment by roughly a third of organizations — it is the thing keeping prototypes from becoming products. Evals are not what slow you down on the way to production; the absence of evals is what keeps you stuck in the prototype graveyard. Reframe the line item accordingly. It is not "QA overhead." It is "deploy velocity insurance," and velocity is something your CFO already knows how to value.

Metrics That Connect Eval Coverage to Velocity and Incidents

Counterfactual value cannot be measured directly, but it can be made legible through proxies that finance already trusts. The goal is not a fake ROI number. It is a small set of metrics that move together and tell a coherent story over time.
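As one illustration of what such proxies could look like, the sketch below computes three numbers from a log of changes. The record shape (`Change`) and the metric names are hypothetical choices made for this example, not a prescribed set; the point is only that each number can be derived from data most teams already have.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Change:
    """One prompt, model, index, or tool change (hypothetical record shape)."""
    blocked_by_eval: bool    # an eval caught a regression before deploy
    caused_incident: bool    # the change shipped and caused a production incident
    lead_time_hours: float   # time from change ready to change live


def proxy_metrics(changes: list[Change]) -> dict[str, Optional[float]]:
    blocked = sum(c.blocked_by_eval for c in changes)
    shipped = [c for c in changes if not c.blocked_by_eval]
    incidents = sum(c.caused_incident for c in shipped)
    known_bad = blocked + incidents  # regressions we know about, caught or escaped

    return {
        # Share of known regressions that evals stopped before production.
        "pre_deploy_catch_rate": blocked / known_bad if known_bad else None,
        # Share of shipped changes that caused an incident.
        "change_failure_rate": incidents / len(shipped) if shipped else None,
        # Typical time to get a change live; the velocity side of the story.
        "median_lead_time_hours": (
            median([c.lead_time_hours for c in shipped]) if shipped else None
        ),
    }


# Made-up sample data, purely to show the shape of the output.
sample = [
    Change(blocked_by_eval=True, caused_incident=False, lead_time_hours=0.0),
    Change(blocked_by_eval=False, caused_incident=True, lead_time_hours=10.0),
    Change(blocked_by_eval=False, caused_incident=False, lead_time_hours=4.0),
    Change(blocked_by_eval=False, caused_incident=False, lead_time_hours=6.0),
]
metrics = proxy_metrics(sample)
print(metrics)
```

Tracked quarter over quarter, a rising catch rate alongside a flat or falling lead time is the kind of joint movement that tells a coherent story without pretending to a precise ROI figure.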
