AI Technical Debt: Four Categories That Never Show Up in Your Sprint Retro
Your sprint retro covers the usual suspects: flaky tests, that migration someone keeps punting, the API endpoint held together with duct tape. But if you're shipping AI features, the most expensive debt in your codebase is the kind nobody puts on a sticky note.
Traditional technical debt accumulates linearly. You cut a corner, you pay interest on it later, you refactor when the pain gets bad enough. AI technical debt compounds. A prompt that degrades silently produces training signals that pollute your evals, which misguide your next round of prompt changes, which further erodes the quality your users experience. By the time someone notices, three layers of assumptions have rotted underneath you.
Research analyzing 8.1 million pull requests across 4,800 engineering teams found that technical debt increases 30 to 41 percent after AI adoption; a separate controlled study found that experienced developers were actually 19 percent slower on end-to-end tasks despite feeling faster. The debt is real, it's measurable, and most of it falls into four categories that traditional engineering practices weren't designed to catch.
Prompt Rot: When Yesterday's Instructions Break Tomorrow's Model
Prompt rot is the slow decay of prompt effectiveness as models change underneath them. You wrote a carefully tuned system prompt for GPT-4. It worked beautifully. Then the model got updated, or you migrated to a different provider, and the same prompt now produces subtly different outputs — not wrong enough to trigger an alert, but wrong enough to confuse users.
This happens because prompts are implicit contracts with a specific model's behavior. When you write "respond in exactly three bullet points," you're relying on that model's interpretation of "exactly." A new model version might interpret the same instruction differently — adding a preamble before the bullets or reformatting them as numbered items.
The compounding problem: teams rarely version their prompts with the same rigor they version code. A study of LLM software projects found that the median lifespan of technical debt in LLM systems is 553 days, with a removal rate of just 49 percent. More than half of the debt introduced never gets cleaned up. Prompts are especially vulnerable because they look like configuration, not code, so they escape the review processes that catch other issues.
What prompt rot hygiene looks like:
- Treat prompts as code artifacts: version them, review changes, and pin them to specific model versions.
- Build automated regression tests that run your prompt suite against new model versions before you upgrade.
- Maintain a prompt changelog that records not just what changed but why the original wording was chosen — the reasoning decays faster than the text.
- Set up output monitoring that detects distribution shifts in your model's responses, not just errors.
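As a sketch of the first two items: a prompt pinned to the model it was tuned against, with its rationale recorded, plus a regression check you can point at a candidate model before upgrading. All names here (`PromptVersion`, `regression_check`, the fake client) are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable

@dataclass(frozen=True)
class PromptVersion:
    """A prompt treated as a versioned code artifact, not loose configuration."""
    prompt_id: str
    version: str
    text: str
    pinned_model: str  # the exact model version this prompt was tuned against
    rationale: str     # why this wording was chosen; the reasoning decays faster than the text
    created: date = field(default_factory=date.today)

def regression_check(
    prompt: PromptVersion,
    candidate_model: str,
    fixtures: list[str],
    call_model: Callable[[str, str, str], str],
    validate: Callable[[str], bool],
) -> list[str]:
    """Run the prompt against a candidate model; return the fixtures that break the contract.

    `call_model(model, system_prompt, user_input)` stands in for whatever client
    wrapper your stack uses; `validate` encodes the output contract explicitly.
    """
    failures = []
    for user_input in fixtures:
        output = call_model(candidate_model, prompt.text, user_input)
        if not validate(output):
            failures.append(user_input)
    return failures

# Demo: a fake "new model" that adds a preamble on some inputs -- exactly the
# kind of subtle reinterpretation described above.
summarize_v3 = PromptVersion(
    prompt_id="summarize",
    version="3.1.0",
    text="Respond in exactly three bullet points.",
    pinned_model="gpt-4-0613",
    rationale="Three bullets chosen because longer lists overflowed the sidebar UI.",
)

def fake_model(model: str, system: str, user: str) -> str:
    return "- a\n- b\n- c" if "ok" in user else "Sure! Here are the points:\n- a\n- b\n- c"

def exactly_three_bullets(output: str) -> bool:
    lines = output.strip().splitlines()
    return len(lines) == 3 and all(l.startswith("- ") for l in lines)

failures = regression_check(
    summarize_v3, "gpt-4.1", ["ok input", "tricky input"], fake_model, exactly_three_bullets
)
print(failures)  # ['tricky input']
```

In practice `call_model` would wrap your provider's client and the fixtures would be sampled from real traffic; the point is that the output contract ("exactly three bullets") lives in an executable validator, not only in the prompt text.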
Eval Drift: When Your Tests No Longer Represent Reality
Eval drift happens when your evaluation datasets and metrics silently diverge from what your system actually encounters in production. You built a golden test set six months ago from a sample of real queries. It was representative then. Since then, your user base grew, usage patterns shifted, and your product added new features — but the eval set stayed frozen.
This is more insidious than test rot in traditional software because the evals still pass. A classifier that scores 94 percent on your benchmark might be scoring 78 percent on the queries users are actually sending today. You won't know until someone manually audits a batch of production outputs and discovers the gap.
The problem compounds because eval results drive decisions. If your evals say the system is improving, you ship with confidence. If the evals are measuring the wrong thing, you're confidently shipping degradation. An analysis of 1,200 production LLM deployments found that successful organizations treat evaluation as infrastructure, not as a one-time setup task.
Signs your evals have drifted:
- Your eval scores are trending up but user satisfaction is flat or declining.
- Your test set hasn't been refreshed in more than 90 days.
- New product features or user segments aren't represented in your evaluation data.
- You're measuring intermediate quality metrics (BLEU scores, cosine similarity) but not end-to-end task completion.
What to do about it:
- Implement continuous eval set refresh by sampling recent production traffic, anonymizing it, and incorporating it into your test suite on a regular cadence.
- Track the distribution of your eval set against production traffic and alert when they diverge significantly.
- Use the LLM-as-judge pattern to scale evaluation beyond what manual review can cover, but calibrate your judge against human ratings at least monthly.
- Measure outcomes, not proxies. If your AI feature helps users complete a task, measure task completion rate — not embedding similarity or response fluency in isolation.
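Distribution tracking from the list above can start very simply. A minimal sketch, assuming you can tag both eval items and production queries with a coarse intent label (the labels and the 0.2 alert threshold are made up for illustration):

```python
from collections import Counter

def category_distribution(labels: list[str]) -> dict[str, float]:
    """Turn a list of intent labels into a normalized categorical distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical intent labels: the frozen eval set vs. last week's production traffic.
eval_labels = ["summarize"] * 60 + ["translate"] * 30 + ["classify"] * 10
prod_labels = ["summarize"] * 30 + ["translate"] * 20 + ["classify"] * 10 + ["extract"] * 40

drift = total_variation(category_distribution(eval_labels), category_distribution(prod_labels))
print(f"drift = {drift:.2f}")  # drift = 0.40
if drift > 0.2:  # the threshold is a judgment call; tune it against your own traffic
    print("eval set no longer matches production -- refresh before trusting scores")
```

Here the eval set has never seen the new "extract" use case at all, so a 94 percent benchmark score says nothing about 40 percent of current traffic.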
Embedding Lock-in: The Migration You Can't Afford
Embedding lock-in occurs when your vector indexes, trained on one embedding model, become prohibitively expensive to migrate. You chose an embedding model, generated vectors for your entire document corpus, built indexes, tuned your retrieval thresholds, and optimized your chunking strategy around that model's characteristics. Everything works. Then a significantly better embedding model comes out, or your provider changes pricing, or you need to support a new language.
Migrating means re-embedding your entire corpus — potentially thousands of dollars in compute and days of processing. But re-embedding is just the start:
- Your chunking strategy was optimized for the old model's context window.
- Your similarity thresholds were calibrated to the old model's distance distribution.
- Your retrieval pipeline's precision-recall tradeoffs were tuned to the old model's behavior.
Changing the embedding model means recalibrating everything downstream.
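For example, recalibrating a similarity threshold after a model swap might look like this. A toy sketch: the scores come from labeled relevant/irrelevant pairs re-scored with the new model, and the separation rule is deliberately simplistic (a real pipeline would sweep thresholds against precision and recall):

```python
def calibrate_threshold(relevant: list[float], irrelevant: list[float]) -> float:
    """Pick the midpoint between the worst relevant score and the best irrelevant score.

    Inputs are similarity scores from labeled pairs embedded with the NEW model.
    The old model's threshold is meaningless here: each embedding model has its
    own distance distribution.
    """
    return (min(relevant) + max(irrelevant)) / 2

# Toy scores: the new model compresses similarities into a narrower band, so the
# old threshold of 0.80 would silently reject every relevant pair.
relevant_scores = [0.62, 0.58, 0.65]
irrelevant_scores = [0.41, 0.37, 0.45]
new_threshold = calibrate_threshold(relevant_scores, irrelevant_scores)
print(new_threshold)  # 0.515
```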
This is why teams that built RAG systems in early 2024 with whatever embedding model was available are now stuck. They've accumulated thousands of documents, built features on top of their retrieval layer, and trained users to expect a certain quality level. The cost of migrating isn't just technical — it's the risk of regression across every feature that depends on retrieval quality.
How to reduce embedding lock-in:
- Abstract your embedding layer behind a clean interface so the rest of your system doesn't know or care which model generates the vectors.
- Store your raw documents alongside their embeddings so you can re-embed without going back to source systems.
- Build your retrieval evaluation suite first, before you need to migrate, so you can measure whether a new embedding model actually improves your specific use case.
- Consider hybrid retrieval strategies that combine dense embeddings with sparse keyword search — the keyword path gives you a fallback that's model-independent.
- Budget for periodic re-embedding. The best embedding model six months from now will be meaningfully better than today's. Plan for the migration instead of pretending it won't happen.
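The abstraction and raw-document points above can be combined into one small pattern. A sketch under assumed names (`Embedder`, `DocumentStore`, and `ToyEmbedder` are all hypothetical, not a real library's API):

```python
from typing import Protocol

class Embedder(Protocol):
    """Everything downstream depends on this interface, never on a specific model."""
    model_id: str
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class DocumentStore:
    """Keeps raw text alongside vectors so re-embedding never touches source systems."""
    def __init__(self) -> None:
        self.docs: dict[str, str] = {}           # doc_id -> raw text
        self.vectors: dict[str, list[float]] = {}
        self.embedded_with: str | None = None    # which model produced the current vectors

    def index(self, doc_id: str, text: str, embedder: Embedder) -> None:
        self.docs[doc_id] = text
        self.vectors[doc_id] = embedder.embed([text])[0]
        self.embedded_with = embedder.model_id

    def reembed_all(self, new_embedder: Embedder) -> None:
        """The migration path: rebuild every vector from stored raw text."""
        ids = list(self.docs)
        new_vectors = new_embedder.embed([self.docs[i] for i in ids])
        self.vectors = dict(zip(ids, new_vectors))
        self.embedded_with = new_embedder.model_id

# Stand-in embedder for the sketch; a real one would call your provider's API.
class ToyEmbedder:
    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t)), float(t.count(" "))] for t in texts]

store = DocumentStore()
store.index("a", "hello world", ToyEmbedder("embed-v1"))
store.reembed_all(ToyEmbedder("embed-v2"))  # one call, no source-system round trips
print(store.embedded_with)  # embed-v2
```

Because the rest of the system only sees the `Embedder` interface and the store keeps raw text, swapping models reduces to one batch job plus downstream recalibration, rather than an archaeology project.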
Shadow Coupling: The Dependencies Nobody Documented
Shadow coupling is the most subtle and dangerous form of AI technical debt. It occurs when features implicitly depend on specific model behaviors that nobody explicitly documented or tested for. Your product works not because your code is correct, but because the model happens to behave a certain way that your code relies on.
For example, your parsing logic assumes the model always puts JSON in a code block. Your UI truncation logic assumes responses stay under a certain length. Your downstream pipeline assumes the model never uses certain characters in its output. None of these assumptions are in your prompt. None of them are in your tests. They're shadow dependencies on model behavior that could break with any model update.
Shadow coupling is particularly dangerous because it creates failure modes that look like application bugs, not model issues. When the model starts occasionally returning JSON without the code block wrapper, your parser throws an exception that gets logged as a code defect. The engineer investigating it sees a parsing error, fixes the parser, and moves on — never realizing that the root cause was an undocumented assumption about model behavior that could break again in a different way tomorrow.
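The fenced-JSON example can be made concrete. Below, `brittle_parse` carries the shadow dependency while `explicit_parse` encodes the same assumption as a visible, loudly-failing contract (the function names are illustrative; the fence marker is built indirectly so the snippet doesn't collide with its own formatting):

```python
import json
import re

FENCE = chr(96) * 3  # the triple-backtick code-fence marker

def brittle_parse(response: str) -> dict:
    """Shadow coupling: silently assumes the model always fences its JSON."""
    payload = response.split(FENCE + "json")[1].split(FENCE)[0]
    return json.loads(payload)

def explicit_parse(response: str) -> dict:
    """The same contract made explicit: fenced JSON, then bare JSON, then fail loudly."""
    match = re.search(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE, response, re.DOTALL)
    candidate = match.group(1) if match else response.strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output violated JSON contract: {response!r}") from exc

fenced = f'Here you go:\n{FENCE}json\n{{"status": "ok"}}\n{FENCE}'
bare = '{"status": "ok"}'  # a model update quietly starts dropping the fence

print(explicit_parse(fenced))  # {'status': 'ok'}
print(explicit_parse(bare))    # {'status': 'ok'} -- the tolerant path still works
# brittle_parse(bare) raises IndexError, which gets triaged as a parser bug
```

The `ValueError` path is the important part: when the contract finally does break, the error names the model's output as the cause instead of masquerading as an application defect.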
Analysis of production LLM systems reveals that the first piece of technical debt introduced into a file is almost never removed — removal rates sit below 5 percent across all repository types. Shadow coupling is particularly prone to this permanence because it's invisible: you can't clean up assumptions you don't know exist.
How shadow coupling accumulates:
- A developer builds a feature, manually tests it, sees it works, and ships it — never realizing that "it works" depends on a specific model behavior.
- Another developer builds on top of that feature, adding a second layer of undocumented assumptions.
- The model gets updated. The first layer's assumptions still hold, but the second layer's break. The resulting bug is now two levels of indirection from its root cause.
How to fight it:
- Explicitly document every assumption your code makes about model output format, length, structure, and content. Put these assumptions in your code as assertions or validators, not just in comments.
- Build contract tests that verify model output conformance independently from your application logic. If the model stops putting JSON in code blocks, you want to know from your contract test — not from a production exception.
- When investigating bugs in AI features, always ask: "Is this a code bug or a changed model behavior?" Make this question part of your incident response runbook.
- Use structured output modes (JSON mode, function calling) wherever possible instead of relying on free-form text that you parse. Structured outputs turn shadow dependencies into explicit contracts.
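A contract test along these lines can run on sampled model outputs, completely outside the application code path. A minimal sketch with a hand-rolled validator; the `CONTRACT` fields are examples of the kinds of assumptions worth pinning down, not a standard schema:

```python
import json

# Explicit contract: every assumption the app makes about model output, in one place.
CONTRACT = {
    "max_chars": 2000,                    # UI truncation logic assumes this
    "required_keys": {"title", "body"},
    "forbidden_chars": {"\x00", "\x1b"},  # downstream pipeline assumes these never appear
}

def check_contract(raw_output: str) -> list[str]:
    """Return a list of contract violations (empty list = conformant output)."""
    violations = []
    if len(raw_output) > CONTRACT["max_chars"]:
        violations.append("too long")
    if any(c in raw_output for c in CONTRACT["forbidden_chars"]):
        violations.append("forbidden control character")
    try:
        data = json.loads(raw_output)
        missing = CONTRACT["required_keys"] - data.keys()
        if missing:
            violations.append(f"missing keys: {sorted(missing)}")
    except (json.JSONDecodeError, AttributeError):
        violations.append("not a JSON object")
    return violations

# Run against sampled outputs on a schedule, independent of any feature code.
good = '{"title": "t", "body": "b"}'
bad = '{"title": "t"}'
print(check_contract(good))  # []
print(check_contract(bad))   # ["missing keys: ['body']"]
```

When this test starts failing after a model update, the alert points at changed model behavior directly, instead of surfacing days later as a mysterious exception two layers downstream.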
The Compounding Effect
What makes AI technical debt uniquely dangerous is how these four categories interact. Prompt rot causes your system to produce different outputs, which makes your eval set less representative (eval drift), which means you don't notice when retrieval quality degrades (embedding lock-in), which causes subtle behavior changes that create new shadow coupling.
A team that ran agents in shadow mode before production deployment — comparing agent predictions to human actions and only going live once accuracy hit a specific threshold — found that this single practice caught issues spanning all four categories. The pattern works because it creates a continuous, production-realistic evaluation loop instead of relying on static test sets.
The teams that manage AI technical debt well share three practices:
- They track AI-touched components separately. AI features get their own quality gates, their own monitoring dashboards, and their own review checklists. The top 20 percent of teams enforce specialized quality gates that catch AI's predictable failure modes before merge.
- They measure outcomes, not outputs. Not "how many tokens did we generate" but "did the user accomplish their goal." Metrics like edit-to-accept ratio, feature bypass rate, and time-to-override tell you whether your AI feature is providing genuine value or accumulating invisible debt.
- They budget for maintenance as a first-class cost. Organizations burdened with AI technical debt spend up to 40 percent more on maintenance and ship features 50 percent slower. The teams that avoid this allocate regular time for prompt auditing, eval refresh, and dependency documentation — not as tech debt sprints, but as ongoing operational costs.
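One of those metrics, edit-to-accept ratio, is cheap to compute from suggestion telemetry. A sketch with an invented event schema (the field names are illustrative, not from any real tool):

```python
def edit_to_accept_ratio(events: list[dict]) -> float:
    """Fraction of accepted AI suggestions the user then had to edit.

    A rising ratio means users accept output but rework it -- value is leaking
    even though raw acceptance numbers look healthy.
    """
    accepted = [e for e in events if e["outcome"] == "accepted"]
    if not accepted:
        return 0.0
    edited = [e for e in accepted if e["edited_after_accept"]]
    return len(edited) / len(accepted)

# Hypothetical suggestion log:
log = [
    {"outcome": "accepted", "edited_after_accept": False},
    {"outcome": "accepted", "edited_after_accept": True},
    {"outcome": "rejected", "edited_after_accept": False},
    {"outcome": "accepted", "edited_after_accept": True},
]
print(round(edit_to_accept_ratio(log), 2))  # 0.67 -- two of three accepted suggestions needed rework
```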
Treating AI Debt as an Engineering Discipline
The uncomfortable truth is that 75 percent of technology decision-makers forecast that AI-driven complexity will push their technical debt to moderate or severe levels by the end of this year. The organizations that avoid this outcome are the ones that stopped treating AI features as "deploy and forget" and started treating them as living systems that require continuous investment in quality.
2025 was the year of AI speed. 2026 is shaping up to be the year of AI quality. The teams that built fast will now discover what they built on. Those that invested in prompt versioning, continuous evaluation, embedding migration paths, and explicit dependency documentation will keep shipping. The rest will spend the next year in expensive rewrites, wondering why their AI features that "worked fine" six months ago now produce outputs that embarrass them.
The debt is already there. The only question is whether you're paying it down incrementally or waiting for the bill to come due all at once.
Sources
- https://portkey.ai/blog/the-hidden-technical-debt-in-llm-apps/
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://arxiv.org/html/2601.06266
- https://sloanreview.mit.edu/article/how-to-manage-tech-debt-in-the-ai-era/
- https://www.infoq.com/news/2025/11/ai-code-technical-debt/
- https://byteiota.com/ai-technical-debt-30-41-increase-hits-developers/
