
The 'We'll Add Evals Later' Trap: How Measurement Debt Compounds

9 min read
Tian Pan
Software Engineer

Every team that ships an AI feature without evals tells themselves the same story: we'll add measurement later, after we find product-market fit, after the prompt stabilizes, after the next release. Six months later, the prompt has been touched by four engineers and two product managers, the behavior is load-bearing for three customer integrations, and the team discovers that "adding evals later" means reconstructing intent from production logs they never structured for that purpose. The quarter that was supposed to be new features becomes a quarter of archaeology.

This isn't a planning mistake. It's a compounding one. The team that skipped evals to ship faster is the same team that will spend twelve weeks rebuilding eval infrastructure from incomplete traces, disagreeing about what "correct" meant in February, and quietly removing features nobody can prove still work. The cost of catching up exceeds the cost of building in — not by a little, but by a multiplier that grows with every prompt edit that shipped without a regression check.

The Drift From "Moving Fast" to "Operating Blind"

The transition is invisible in any single week. Week one, the prompt is a paragraph; the engineer who wrote it remembers every decision. Week six, a PM has pasted in three examples from a customer call; the rationale is "this felt better in the demo." Week twelve, a second engineer has added a "handle edge case X" clause because a support ticket mentioned it. Week twenty, nobody knows whether removing any line would regress anything, so nobody removes lines.

The leading indicators are social, not technical. When PMs start negotiating prompt edits in Slack threads — "can we add a sentence telling it not to be so verbose?" — the team has crossed a threshold. The prompt is now a shared surface with no specification, no versioning discipline, and no test. Every change is a gamble, and the losses are paid in customer escalations.

A second indicator: hotfix rollbacks based on three customer tickets. The team can't distinguish "real regression affecting many users" from "loud complaint from one integration," so they revert on the loudest signal. This is rational under the circumstances — without evals, customer tickets are the measurement — but it means the prompt's behavior is being steered by whoever complains most recently. Product direction becomes a function of sample bias.

The third indicator is a phrase: "we'll know it's regressed because users will tell us." This phrase should trigger an alarm. It means the team has externalized its quality signal to customers, who are neither systematic nor cheap. Every regression costs a support ticket, a trust dent, and an engineer's afternoon tracing the trigger. The cost per detected bug is orders of magnitude higher than an eval suite would charge.

The Compounding Cost of Deferred Measurement

Software tech debt accrues linearly. Evaluation debt compounds. The reason is that every prompt edit without a regression check creates a new unknown behavioral commitment — a behavior some user is now depending on, whether the team intended it or not. Six months of edits produces a surface of implicit contracts that no one has documented and no one can verify.

Recent industry surveys suggest technical debt remediation can consume up to 29% of AI implementation budgets, with projects facing 15% to 22% schedule overruns when foundational measurement and governance are deferred. Those numbers undersell the AI-specific case, because traditional tech debt has fixed cost: refactoring a tangled module takes as long as it takes. Eval debt has unbounded cost, because the act of "adding evals later" requires you to first decide what the correct output was for cases you never defined — and the people who made those decisions have moved on, changed opinions, or never agreed in the first place.

The archaeology problem is worst when the production logs weren't structured for evaluation. Teams discover that their tracing captures the final output but not the intermediate tool calls; or captures the prompt template but not the filled-in context; or captures the response but lost the user's actual question because PII redaction ran too aggressively. Rebuilding the eval dataset requires rebuilding the logging pipeline first, and the logging pipeline requires product decisions nobody wants to make retroactively.

Meanwhile, the prompt continues to accrue edits. The team is now refactoring two things simultaneously — the prompt and the measurement of the prompt — while also trying to ship whatever feature triggered the refactor. This is where the "we'll do it next quarter" becomes "we'll do it next half," and eventually becomes a dedicated tiger team with a three-month mandate.

What Early-Stage Evals Actually Cost

The argument for skipping evals at the start is that they're premature optimization. That argument holds for about a week. After that, evals are no longer premature, and the real cost of a minimal eval stack is measured in days of engineering, not weeks.

A useful baseline costs less than most teams believe: thirty to fifty curated input examples with expected behavioral properties (not exact string matches — "refuses to give medical advice," "includes a disclaimer when discussing pricing," "extracts the three required fields"); a script that runs the current prompt against all of them and diffs outputs against the last-known-good version; a CI hook that runs this on prompt-touching PRs. That's it. This baseline catches the majority of regressions that would otherwise ship, and it creates the artifact that future evals extend.
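For concreteness, here is a minimal sketch of that baseline in Python. It assumes a `generate(user_input)` function that already exists in your codebase and calls the current prompt; the case list, property phrases, and file names are illustrative, not a prescribed format.

```python
# eval_baseline.py -- a minimal regression check for one prompt.
# Assumes generate(user_input) is your existing model call; everything else
# is plain Python. Cases check behavioral properties, not exact strings.
import json
import pathlib
import sys

CASES = [
    {"input": "What dose of ibuprofen should I take?",
     "must_contain": ["talk to a doctor"], "must_not_contain": ["you should take"]},
    {"input": "How much does the enterprise plan cost?",
     "must_contain": ["pricing"], "must_not_contain": []},
    # ... thirty to fifty of these, curated from real traffic and support tickets
]

GOLDEN = pathlib.Path("last_known_good.json")  # outputs from the last approved prompt

def check(case: dict, output: str) -> list[str]:
    """Return the list of behavioral properties this output violates."""
    failures = []
    for phrase in case["must_contain"]:
        if phrase.lower() not in output.lower():
            failures.append(f"missing: {phrase!r}")
    for phrase in case["must_not_contain"]:
        if phrase.lower() in output.lower():
            failures.append(f"forbidden: {phrase!r}")
    return failures

def run(generate) -> int:
    golden = json.loads(GOLDEN.read_text()) if GOLDEN.exists() else {}
    failed = 0
    for case in CASES:
        output = generate(case["input"])
        problems = check(case, output)
        if case["input"] in golden and golden[case["input"]] != output:
            problems.append("output changed vs last-known-good; review the diff")
        if problems:
            failed += 1
            print(f"FAIL {case['input'][:50]!r}: {problems}")
    print(f"{len(CASES) - failed}/{len(CASES)} cases passed")
    return 1 if failed else 0

if __name__ == "__main__":
    from my_app import generate  # hypothetical import: your existing model call
    sys.exit(run(generate))
```

The CI hook is then a one-line job that runs this script on any PR touching the prompt files and fails on a nonzero exit code.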

The reason teams skip even this is cultural, not technical. Early-stage AI work feels like prototyping, and evals feel like production discipline. But the asymmetry is that prototype code is easy to throw away, while prompt behavior — once users depend on it — is not. A prompt in production is a contract, and a contract without a test is a liability that grows with usage.

Writing thirty examples also forces a productive conversation that teams otherwise avoid: what is the feature actually supposed to do? The process of turning "it should be helpful" into thirty concrete cases reveals disagreements that would otherwise surface as post-launch conflict. The eval set is, at minimum, a specification artifact that costs less than a design doc and binds behavior more tightly.
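To make that concrete, here is the kind of decision record that falls out of writing cases down. The specific properties are invented for illustration; the point is that each one settles an argument the team would otherwise have after launch.

```python
# Three cases that force "it should be helpful" into checkable decisions.
SPEC_CASES = [
    {"input": "Cancel my subscription",
     # settled in review: route to the cancellation flow, don't counter-offer
     "must_contain": ["cancellation"], "must_not_contain": ["discount"]},
    {"input": "Is this medication safe with alcohol?",
     # settled in review: defer to a professional, never give dosing advice
     "must_contain": ["pharmacist"], "must_not_contain": ["it is safe"]},
    {"input": "Summarize this contract clause for me",
     # settled in review: a summary is fine, legal advice is not
     "must_contain": ["not legal advice"], "must_not_contain": []},
]
```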

The Forcing Functions That Prevent The Drift

Individual discipline doesn't scale. Teams that avoid eval debt do it with org-level forcing functions that make skipping evals harder than building them.

The first is an eval-on-launch gate: no AI feature ships to production without a curated eval set and a minimum pass threshold. This is controversial because it delays launches, but the delay is the point. A launch that can't define what "working" means is a launch that will define it later under duress. Teams that enforce this gate discover that defining the eval set takes two to three days and saves quarters of remediation.
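A sketch of what that gate can look like in a deploy pipeline, reusing the baseline runner from earlier; the threshold and module names are placeholders each team sets for itself.

```python
# launch_gate.py -- block a deploy when the eval pass rate is below the agreed floor.
import sys

from eval_baseline import CASES, check   # the baseline module sketched above
from my_app import generate              # hypothetical import: your existing model call

PASS_THRESHOLD = 0.9  # agreed per feature before launch, not negotiated after

def main() -> int:
    passed = sum(1 for case in CASES if not check(case, generate(case["input"])))
    rate = passed / len(CASES)
    print(f"eval pass rate: {rate:.0%} (threshold {PASS_THRESHOLD:.0%})")
    return 0 if rate >= PASS_THRESHOLD else 1  # nonzero exit fails the pipeline

if __name__ == "__main__":
    sys.exit(main())
```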

The second is treating eval debt as a tracked engineering metric. The same dashboard that shows unfixed bugs, open incidents, and expired TLS certs should show AI features with stale or missing eval coverage. Making the debt visible to the same leadership that approves roadmaps changes the conversation from "should we add evals?" to "which of these debts do we retire this quarter?"
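A sketch of what surfacing that debt can look like; the registry format and staleness threshold here are invented, and any queryable source of "when did this feature's evals last run" works as well.

```python
# eval_debt_report.py -- list AI features with missing or stale eval coverage,
# so the debt shows up next to bugs and incidents on the same dashboard.
from datetime import date

# Illustrative registry; in practice this comes from CI metadata or a database.
FEATURES = [
    {"name": "ticket-summarizer", "last_eval_run": date(2024, 1, 8), "cases": 42},
    {"name": "pricing-assistant", "last_eval_run": None, "cases": 0},
]

STALE_AFTER_DAYS = 30

def debt(feature: dict) -> str | None:
    if not feature["cases"]:
        return "no eval set"
    age_days = (date.today() - feature["last_eval_run"]).days
    if age_days > STALE_AFTER_DAYS:
        return f"stale: {age_days} days since last run"
    return None

for feature in FEATURES:
    if (problem := debt(feature)):
        print(f"{feature['name']}: {problem}")
```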

The third is the prompt-PR discipline: a prompt change is not mergeable without an attached eval delta. The PR template includes a field — "eval impact" — that's either a link to eval runs or an explicit "N/A, this prompt is not measured" that triggers a separate conversation. This is the same pattern as requiring tests on code PRs, and it has the same effect: teams learn to write the eval alongside the change, not after. Prompt regression testing treats prompts as code; prompts without tests are untested code.
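One way to enforce that field mechanically, sketched under the assumption that CI exposes the changed file list in an environment variable and writes the PR description to a file; the variable name, file path, and prompts/ directory are all placeholders for your own setup.

```python
# check_eval_impact.py -- fail CI when prompt files change without an eval-impact note.
import os
import pathlib
import sys

changed_files = os.environ.get("CHANGED_FILES", "").split()   # assumed CI plumbing
touches_prompt = any(path.startswith("prompts/") for path in changed_files)

pr_body_file = pathlib.Path("pr_body.txt")                     # assumed CI plumbing
pr_body = pr_body_file.read_text(encoding="utf-8").lower() if pr_body_file.exists() else ""

if touches_prompt and "eval impact:" not in pr_body:
    print("Prompt files changed but the PR description has no 'eval impact' section.")
    sys.exit(1)  # treated like a missing test: the PR is not mergeable
print("eval impact check passed")
```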

The fourth is owning the upstream logging surface from day one. If the feature ships with tracing that captures the full prompt, full context window, full tool calls, and full outputs in a form that can later be filtered into an eval set, the team has bought itself the option to build evals retroactively without rebuilding logging first. This is cheap at launch and irreplaceable at month six. The team that saves every request in a queryable store for ninety days — even without running evals on them yet — has a fundamentally different recovery curve than the team that stored only aggregated metrics.
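A sketch of the kind of trace record that keeps this option open, with an append-only JSONL file standing in for whatever queryable store you actually use; the field names are illustrative.

```python
# tracing.py -- persist every request in a shape an eval set can be built from later.
import json
import time
import uuid

def log_trace(path: str, *, prompt_template: str, filled_context: str,
              tool_calls: list, user_input: str, output: str) -> None:
    """Append one fully reconstructable trace record as a line of JSON."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "prompt_template": prompt_template,  # the template, before filling
        "filled_context": filled_context,    # what the model actually saw
        "tool_calls": tool_calls,            # intermediate calls, not just the answer
        "user_input": user_input,            # keep the original question so cases can be rebuilt
        "output": output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```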

The Honest Accounting

The team that skipped evals to ship faster doesn't move faster on average. They move faster for three months and slower for twelve. The gain is front-loaded and visible; the cost is back-loaded and obscured, surfaced as "we need to stabilize the platform before we build more features" — a framing that hides the root cause.

A useful exercise: when a team is about to skip evals, write down the expected cost of adding them later. Be specific. Who builds the eval set when the original author has moved to a different team? Who owns the retroactive logging pipeline? Who arbitrates when stakeholders disagree about what the prompt should have been doing? How many weeks does the archaeology take, and what ships during those weeks? If the honest answer is "we don't know, probably a quarter or two," that's the cost of "we'll add evals later" — not zero, and not "a few days whenever we get around to it."

The trap isn't that teams don't know evals matter. Most teams nod along when asked. The trap is that the cost of skipping them is invisible in the first quarter and dominant by the third. By the time the bill comes due, the team that made the decision has moved on, the team that inherits the codebase has a different set of priorities, and the engineering organization has learned exactly one lesson that will generalize to the next AI feature: shipping first and measuring later is how senior engineers build career-ending messes. The healthiest teams encode this lesson into process before they learn it personally. The rest pay tuition in production incidents and quarterly roadmap rewrites.
