The 'We'll Add Evals Later' Trap: How Measurement Debt Compounds
Every team that ships an AI feature without evals tells itself the same story: we'll add measurement later, after we find product-market fit, after the prompt stabilizes, after the next release. Six months later, the prompt has been touched by four engineers and two product managers, the behavior is load-bearing for three customer integrations, and the team discovers that "adding evals later" means reconstructing intent from production logs it never structured for that purpose. The quarter that was supposed to be new features becomes a quarter of archaeology.
This isn't a planning mistake. It's a compounding one. The team that skipped evals to ship faster is the same team that will spend twelve weeks rebuilding eval infrastructure from incomplete traces, disagreeing about what "correct" meant in February, and quietly removing features nobody can prove still work. The cost of catching up exceeds the cost of building in — not by a little, but by a multiplier that grows with every prompt edit that shipped without a regression check.
The Drift From "Moving Fast" to "Operating Blind"
The transition is invisible in any single week. Week one, the prompt is a paragraph; the engineer who wrote it remembers every decision. Week six, a PM has pasted in three examples from a customer call; the rationale is "this felt better in the demo." Week twelve, a second engineer has added a "handle edge case X" clause because a support ticket mentioned it. Week twenty, nobody knows whether removing any line would regress anything, so nobody removes lines.
The leading indicators are social, not technical. When PMs start negotiating prompt edits in Slack threads — "can we add a sentence telling it not to be so verbose?" — the team has crossed a threshold. The prompt is now a shared surface with no specification, no versioning discipline, and no test. Every change is a gamble, and the losses are paid out in customer escalations.
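What "versioned and tested" could look like is small. Below is a minimal sketch, assuming the prompt lives in the repo and the feature already exposes some `generate(prompt, user_input)` function; the prompt text, the pinned cases, and the verbosity threshold are invented for illustration, not taken from any real system.

```python
# A pinned-case regression check for a prompt edit. Everything here is a
# stand-in: swap in your real prompt file, your real model call, and the
# behaviors your users actually depend on.

PROMPT_VERSION = "2024-06-12"  # bump on every edit, next to a one-line changelog
PROMPT = (
    "Summarize the customer's message for a support agent. "
    "Be concise and preserve any refund or cancellation request."
)

# Each case pins one behavior someone depends on; an edit must keep them green.
CASES = [
    {
        "input": "I was double charged last month and want my money back.",
        "must_include": "refund",
        "max_words": 80,
    },
]


def run_regression(generate):
    """`generate(prompt, user_input) -> str` is whatever the feature already calls."""
    failures = []
    for case in CASES:
        output = generate(PROMPT, case["input"])
        if case["must_include"] not in output.lower():
            failures.append((case["input"], "lost required content"))
        if len(output.split()) > case["max_words"]:
            failures.append((case["input"], "too verbose"))
    return failures


if __name__ == "__main__":
    # Stand-in for the real model call, so the shape of the check is visible.
    def fake_generate(prompt, user_input):
        return "Customer reports a duplicate charge and requests a refund."

    failed = run_regression(fake_generate)
    assert not failed, failed
    print(f"prompt {PROMPT_VERSION}: all pinned cases pass")
```

Run something like this in CI on every change to the prompt file. The point isn't coverage; it's that a prompt edit can no longer ship without touching a test that another person will read.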
A second indicator: hotfix rollbacks based on three customer tickets. The team can't distinguish "real regression affecting many users" from "loud complaint from one integration," so they revert on the loudest signal. This is rational under the circumstances — without evals, customer tickets are the measurement — but it means the prompt's behavior is being steered by whoever complains most recently. Product direction becomes a function of sample bias.
The third indicator is a phrase: "we'll know it's regressed because users will tell us." This phrase should trigger an alarm. It means the team has externalized its quality signal to customers, who are neither systematic nor cheap. Every regression costs a support ticket, a trust dent, and an engineer's afternoon tracing the trigger. The cost per detected bug is orders of magnitude higher than an eval suite would charge.
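To put rough numbers on "orders of magnitude," here is a back-of-envelope comparison. Every figure is an assumption to replace with your own, none of it comes from a survey, and it ignores the trust dent entirely.

```python
# Back-of-envelope cost per detected regression. All figures are assumptions.
ticket_cost = 150        # support handling + triage per escalation ($, assumed)
debug_cost = 400         # the engineer's afternoon tracing the trigger ($, assumed)
cost_per_bug_via_tickets = ticket_cost + debug_cost  # one ticket surfaces one bug

eval_suite_monthly = 300    # CI compute plus a few hours of upkeep ($, assumed)
regressions_blocked = 6     # bad prompt edits caught before release (assumed)
cost_per_bug_via_evals = eval_suite_monthly / regressions_blocked

print(f"via tickets: ~${cost_per_bug_via_tickets} per regression, after it reached users")
print(f"via evals:   ~${cost_per_bug_via_evals:.0f} per regression, before it shipped")
```

Even with these generous assumptions the gap is roughly an order of magnitude, and it widens as soon as the suite catches more than a handful of bad edits per month or a single regression costs more than one ticket.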
The Compounding Cost of Deferred Measurement
Software tech debt accrues linearly. Evaluation debt compounds. The reason is that every prompt edit without a regression check creates a new unknown behavioral commitment — a behavior some user is now depending on, whether the team intended it or not. Six months of edits produces a surface of implicit contracts that no one has documented and no one can verify.
Recent industry surveys suggest technical debt remediation can consume up to 29% of AI implementation budgets, with projects facing 15% to 22% schedule overruns when foundational measurement and governance are deferred. Those numbers undersell the AI-specific case, because traditional tech debt has a bounded cost: refactoring a tangled module takes as long as it takes, and then it's done. Eval debt has unbounded cost, because the act of "adding evals later" requires you to first decide what the correct output was for cases you never defined — and the people who made those decisions have moved on, changed opinions, or never agreed in the first place.
The archaeology problem is worst when the production logs weren't structured for evaluation. Teams discover that their tracing captures the final output but not the intermediate tool calls; or captures the prompt template but not the filled-in context; or captures the response but lost the user's actual question because PII redaction ran too aggressively. Rebuilding the eval dataset requires rebuilding the logging pipeline first, and the logging pipeline requires product decisions nobody wants to make retroactively.
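A rough sketch of a trace record that avoids those three failure modes, with field names that are illustrative rather than the schema of any particular tracing product:

```python
# What a trace needs to capture at write time if evals are to be built from it
# later. Field names are illustrative; the point is which fields exist at all.
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    trace_id: str
    prompt_template_version: str   # which prompt version produced this
    rendered_prompt: str           # the template with the filled-in context
    user_input: str                # mask PII inside it; don't drop the question
    tool_calls: list[dict] = field(default_factory=list)  # intermediate steps, not just the answer
    final_output: str = ""
    model: str = ""
    timestamp: float = 0.0
```

The hard part isn't the schema; it's committing at write time to decisions like "redaction masks fields inside `user_input` instead of deleting it," because those are exactly the decisions nobody wants to make retroactively.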
Meanwhile, the prompt continues to accrue edits. The team is now refactoring two things simultaneously — the prompt and the measurement of the prompt — while also trying to ship whatever feature triggered the refactor. This is where "we'll do it next quarter" becomes "we'll do it next half," and eventually becomes a dedicated tiger team with a three-month mandate.
What Early-Stage Evals Actually Cost
