
The Eval Debt Ratchet: How Teams Get Buried Cleaning Up What They Shipped on Vibes

· 10 min read
Tian Pan
Software Engineer

Three months after shipping a document summarization feature, a team at a mid-size company writes a prompt improvement. The new prompt scores better on the five examples they tested manually. They deploy it Friday afternoon. Monday morning, their Slack is full of user reports: summaries now cover only half of each document and present the result as complete. The feature looked fine. The change passed review. Nobody noticed because there was no evaluation — no golden test set, no regression baseline, no automated check. The ratchet had been turning silently for months.

This is eval debt in its most recognizable form. The team didn't skip evaluations because they were careless. They skipped them because writing evaluations for AI features is harder than it sounds, the feature shipped fast and looked good, and nobody wanted to slow down a team with momentum. Now they're paying the compound interest.

Eval Debt Is Not Just Technical Debt

Technical debt gets talked about loosely — code you wrote quickly and plan to clean up. Eval debt is different in kind, not just degree. In traditional software engineering, you can often observe correctness directly: the function returns the right value or it doesn't, the API responds with the right status code or it doesn't. These are deterministic, verifiable facts.

AI features produce probabilistic outputs. The summarization isn't right or wrong in a binary sense — it's more or less faithful, more or less complete, more or less useful. Defining "correct" requires human judgment, domain knowledge, or a carefully constructed rubric. That difficulty is real, and it's exactly why teams defer. When you can't write a simple assert statement, evaluation feels like a research problem rather than an engineering task.

But here's the trap: deferring evaluation doesn't make the behavior problem go away. It just means you have no way to measure, detect, or reason about it. Empirical research on ML software projects finds that they accumulate self-admitted technical debt at roughly twice the rate of non-ML projects — and the gap is concentrated in exactly the validation and testing areas that teams deprioritize under time pressure.

When you ship an AI feature without evaluations, you're not just skipping tests. You're committing to an implicit contract you can't read: the feature behaves in some distribution of ways, and you've agreed to discover what that distribution looks like via user complaints.

Why Teams Keep Skipping Evaluations

Four forces push teams toward vibes-based shipping, and understanding them matters because the remedies are different for each.

Metric ambiguity. For a classification model, accuracy is obvious. For a customer support bot, what does quality look like? Is a short response better if the user didn't follow up, or did they just give up? Is high task completion a sign of a great experience or a sign that you're not catching users who needed escalation? Teams that can't agree on what to measure often decide not to measure anything, because partial measurement can feel more misleading than no measurement.

Demo adequacy bias. A feature that works on the 10 examples you tested feels like a feature that works. This is particularly acute for generative AI because outputs look polished even when they're wrong. A hallucinated citation, a dropped clause, a confidently incorrect entity — none of these look broken to a reviewer who isn't checking the source material. The feature appears to work in all the demos that get shown to stakeholders. The cases where it doesn't work are invisible until users find them.

Evaluation infrastructure cost. Writing a useful eval isn't just collecting test cases. It requires deciding what to measure, labeling ground truth (which often requires domain experts), building a harness to run tests reproducibly, and integrating the harness into the deployment pipeline. For a team of three trying to ship in two weeks, this is a multi-week project. The cost is real, and when you're behind on features, paying it upfront feels irrational.

Velocity culture. The AI product landscape rewards fast iteration. Teams that move quickly get to market, get feedback, and improve. The implicit assumption is that user feedback in production is a form of evaluation — the market is the judge. This is true for some kinds of information. It's not true for quality degradation that users can't articulate ("it just feels worse"), for subtle factual errors that users don't catch, or for regressions that affect edge cases users rarely hit but rely on heavily when they do.

How the Ratchet Turns

The dangerous property of eval debt is not any single bad deployment. It's the compounding effect over time.

When you ship the first feature without evaluations, the cost is abstract. Nothing has gone wrong yet. The feature works on the happy path. Engineers move to the next thing.

When you ship the second feature, you now have two systems with unmeasured behavior. Any interaction between them is uncharted territory. But still, nothing visibly wrong.

The pattern continues. Features get layered. Prompts get updated. Models get upgraded. Tool schemas change. Each modification touches unmeasured systems. The blast radius of any change is unknown, so engineers start moving conservatively — making smaller changes, testing more manually, taking longer on things that should be fast.

Then the ratchet clicks.

The click is usually a regression that users notice. Not a catastrophic failure — those are easy to catch. A subtle degradation that accumulates: success rates dropping by 15%, response relevance declining, agent tasks completing with more retries. Users start saying the system "feels worse." The team starts an investigation and finds they have no baseline. They don't know when it started, which change caused it, or what good even looked like. All they have is the current behavior and angry users.

Debugging without evaluations is reactive by design: wait for complaints, reproduce manually, form hypotheses, make a change, wait to see if complaints decrease. This can take weeks per regression. And every investigation cycle consumes capacity that could have gone to features.

The teams that built evaluation infrastructure upfront face the same regression. But for them, it appears in CI before deployment. The alert is specific: test case #47 (the multi-part entity extraction scenario that tripped them up in QA six months ago) regressed. They know exactly where to look. The fix ships in hours, not weeks.

The Trust Collapse

Eval debt accumulates silently for a long time, then collapses suddenly. The collapse is a trust problem as much as a technical one.

Users who experience inconsistent AI behavior develop what might be called defensive usage patterns. They stop relying on the feature for important work. They add manual verification steps that eliminate the productivity benefit. They report the feature as unreliable in feedback surveys, which affects team roadmaps and executive confidence. All of this happens even if the underlying technical quality never degraded — the perception of unreliability is sticky and hard to reverse.

On the engineering side, trust collapses in a different way. When engineers can't measure the effect of their changes, they can't justify confidence. Reviews get longer and more contentious because nobody has data. Deployment cadence slows because the implicit approval bar rises — "I need to be really sure before I ship this" replaces "CI is green, let's go." Senior engineers start manually reviewing more changes. The team that was moving fast on vibes is now slow on fear.

Reconstructing trust requires evaluation coverage, and retrofitting evaluation coverage onto an established system is significantly harder than building it upfront. You have to reverse-engineer what the system is supposed to do, build ground truth that may not exist, instrument pipelines that were never designed to be instrumented, and do all of this while still shipping features. Teams that defer long enough often find that the eval retrofit project is longer than the original feature development was.

Paying Down Eval Debt Without Stopping Feature Work

The right response to eval debt is not to pause and build a comprehensive evaluation system before shipping anything else. That trades one kind of dysfunction for another. The goal is to introduce measurement incrementally, starting with the areas of highest risk.

Start with 20 test cases, not 200. Teams routinely overestimate how many examples they need to get signal. A set of 20 well-chosen cases — real failures from production, known edge cases, the scenarios that came up in QA — gives you regression detection and a deployment gate. You can expand coverage continuously as you encounter new failure modes. The blocking mistake is waiting for perfect coverage before running anything.
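A minimal sketch of what that gate can look like, assuming a summarize() entry point and a golden_set.jsonl file of hand-picked cases (both names are illustrative, not any particular framework):

```python
# eval_summaries.py - a minimal regression gate over a small golden set.
# summarize() and golden_set.jsonl are assumptions about the app under test.
import json

from my_app.summarizer import summarize  # hypothetical entry point


def load_golden_set(path="golden_set.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]


def run_gate():
    failures = []
    for case in load_golden_set():
        summary = summarize(case["document"])
        # Cheap, deterministic checks: does the summary keep the facts this
        # case exists to protect, and is it not implausibly short?
        missing = [fact for fact in case["must_include"] if fact not in summary]
        too_short = len(summary) < 0.05 * len(case["document"])
        if missing or too_short:
            failures.append({"id": case["id"], "missing": missing, "too_short": too_short})
    return failures


if __name__ == "__main__":
    failed = run_gate()
    for f in failed:
        print(f"REGRESSION {f['id']}: missing={f['missing']} too_short={f['too_short']}")
    raise SystemExit(1 if failed else 0)
```

Twenty records in that file, run on every prompt change, is already a deployment gate.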

Convert production failures directly to test cases. Every user complaint that gets investigated should end with a new entry in the evaluation dataset. This is the cheapest possible path to coverage — the cases are already discovered, already understood, and already painful enough that the team won't forget them. Within six months, a team doing this consistently has a test suite that reflects the actual failure surface of their system.
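One hedged sketch of that workflow, reusing the golden_set.jsonl layout assumed above; the helper and field names are hypothetical:

```python
# add_case.py - turn an investigated production failure into a golden case.
# Appends to the same golden_set.jsonl the gate reads.
import json
from datetime import date


def add_case_from_incident(document, must_include, incident_id,
                           path="golden_set.jsonl"):
    case = {
        "id": f"incident-{incident_id}",
        "added": date.today().isoformat(),
        "document": document,          # the exact input that failed in production
        "must_include": must_include,  # the ground truth the investigation settled on
    }
    with open(path, "a") as f:
        f.write(json.dumps(case) + "\n")
```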

Use a layered evaluation strategy. Different measurement approaches have different cost/reliability tradeoffs. Schema validation and format checks are fast and cheap but narrow. LLM-as-judge evaluation is flexible but expensive and introduces its own measurement uncertainty. Human review is accurate but doesn't scale. The right setup uses all three: fast automated checks for clear failures, model-based scoring for quality metrics at scale, and sampled human review to calibrate the model scores. None of these layers alone is sufficient; together they cover different parts of the quality space.
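A rough illustration of the three layers, with judge_model and review_queue standing in for whatever model client and review tooling a team actually uses:

```python
# Three layers over one output. judge_model is any callable that returns the
# model's text; review_queue is any list-like sink for human review items.
import json
import random


def layer1_format_ok(output: str) -> bool:
    """Fast and deterministic: the output parses and has the required fields."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "summary" in data and "sources" in data


def layer2_judge_score(document: str, summary: str, judge_model) -> float:
    """Model-based scoring for faithfulness: flexible, but costly and noisy."""
    prompt = (
        "On a scale of 1 to 5, how faithful is this summary to the document? "
        "Reply with only the number.\n\n"
        f"Document:\n{document}\n\nSummary:\n{summary}"
    )
    return float(judge_model(prompt).strip())


def layer3_sample_for_humans(case_id: str, review_queue, rate: float = 0.05):
    """Sampled human review, used to calibrate the judge scores over time."""
    if random.random() < rate:
        review_queue.append(case_id)
```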

Decouple AI deployments from application deployments. Teams that treat prompt changes as application code changes ship both more cautiously than necessary and less safely than they should. When a prompt update goes through the same review process as a schema migration, it slows down AI iteration. But when it bypasses review entirely, there's no gate. The solution is a separate AI deployment pipeline — versioned prompts and configurations that flow through an evaluation gate before promotion, independent of the main application release train. This is infrastructure work, but it's the structural change that makes everything else easier.
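As a sketch of the shape this pipeline can take — the registry layout is an assumption, and run_gate() is borrowed from the earlier sketch; a real pipeline would evaluate the candidate configuration rather than whatever is currently active:

```python
# promote_prompt.py - a release path for prompt changes, separate from the
# application deploy. Promotion happens only if the eval gate passes.
import json

from eval_summaries import run_gate


def promote(candidate_path: str, registry_path: str = "prompt_registry.json"):
    with open(candidate_path) as f:
        candidate = json.load(f)  # e.g. {"name": "summarize_v12", "template": "..."}

    failures = run_gate()
    if failures:
        raise SystemExit(f"blocked: {len(failures)} golden cases regressed")

    with open(registry_path) as f:
        registry = json.load(f)
    registry["versions"][candidate["name"]] = candidate
    registry["active"] = candidate["name"]

    with open(registry_path, "w") as f:
        json.dump(registry, f, indent=2)
```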

Pick a measurement north star and make it visible. Teams that track three dozen metrics track none of them effectively. Pick one metric that reflects the core user value of the feature — task success rate, answer relevance score, user acceptance rate — and make it visible on every deployment. Not as a blocking gate initially, but as a signal that engineers actually look at. The act of watching a number creates accountability that no policy document can.
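A small sketch of making that number visible on each deployment, again reusing the hypothetical golden-set helpers from above:

```python
# north_star.py - emit one number per deployment and compare it to the last
# run. It reports; it does not block.
import json

from eval_summaries import load_golden_set, run_gate


def record_north_star(history_path="north_star_history.json"):
    total = len(load_golden_set())
    success_rate = 1 - len(run_gate()) / total

    try:
        with open(history_path) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = []

    previous = history[-1]["success_rate"] if history else None
    history.append({"success_rate": success_rate})
    with open(history_path, "w") as f:
        json.dump(history, f, indent=2)

    delta = "" if previous is None else f" (was {previous:.2%})"
    print(f"task success rate: {success_rate:.2%}{delta}")
```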

The Underlying Shift

The teams that avoid the ratchet aren't teams that don't move fast. They're teams that changed what "done" means.

A feature isn't done when it works on the happy path in staging. It's done when it has at least one evaluation that would catch a regression. This isn't a high bar. It means that the team has thought about what the feature is supposed to do, written down at least one concrete example, and created a mechanism to detect when that example breaks. The discipline this requires is less about tooling and more about intent.
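In code, that bar can be as small as a single test; the module path, fixture, and asserted facts below are placeholders for whatever the feature actually promises:

```python
# test_summarize_basic.py - the one evaluation that makes the feature "done":
# a concrete example plus a mechanism that notices when it breaks.
from my_app.summarizer import summarize  # hypothetical entry point


def test_summary_keeps_facts_from_both_halves_of_the_document():
    # Placeholder fixture and facts, written down while the team still
    # remembers what the feature is supposed to do.
    with open("fixtures/quarterly_report.txt") as f:
        document = f.read()
    summary = summarize(document)
    assert "Q3 revenue" in summary      # fact from the first half
    assert "hiring freeze" in summary   # fact from the second half
```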

The failure mode isn't usually that teams are philosophically opposed to evaluations. It's that evaluation never gets prioritized because there's always a feature that seems more important. The ratchet doesn't care about intent — it just keeps turning as long as each deployment is one more unmeasured change on top of the last.

The answer is to treat "no eval coverage" as a form of technical risk that has a carrying cost — not a future problem to solve, but a present cost that makes current work harder and slower. Teams that internalize this see eval infrastructure not as overhead on top of feature work but as the thing that makes feature work sustainable at speed.
