
The Three Hidden Debts Killing Your AI System

Tian Pan · Software Engineer · 10 min read

Your AI feature shipped on time. Users are using it. Everything looks fine — until one quarter later when a support ticket reveals the system has been confidently wrong for weeks, your evaluation suite caught nothing, and the vector index is silently returning stale results. Nothing broke. The system returned 200 OK the whole time.

This is what AI technical debt looks like. Unlike a failing unit test or a stack overflow, it degrades softly and probabilistically. You don't get a crash — you get subtle quality erosion. Three distinct liabilities drive most of this: prompt debt, eval debt, and embedding debt. Each accumulates independently. Each compounds the others. And most engineering teams are carrying all three.

Prompt Debt: When Your Instructions Become Legacy Code

Prompt debt is the accumulated mess of fragile, undocumented, un-owned instructions that quietly diverge from intended behavior over time. It forms the same way other technical debt does — through shortcuts, copied templates, and changes made faster than governance can keep up — but it fails in ways traditional code doesn't.

A function with a bug returns the wrong value. A prompt with debt returns a subtly wrong answer in a confident tone, passes all your existing tests (if you have any), and blends into normal output. The bug surface is the entire open-ended space of natural language.

Here's a failure pattern that plays out repeatedly in production: a team member copies an existing customer support prompt to spin up a new workflow. The source prompt contained outdated cancellation policy language from six months ago. For weeks, the AI system confidently waives fees that current policy would have charged. No alert fires. The drift is invisible until a revenue audit catches it. Nothing was "broken" — the system did exactly what the prompt said.

A 2025 research study analyzing LLM project repositories found that prompt debt represents 6.61% of all detected technical debt, making it the most prevalent category. More concerning: once prompt debt appears, only 49.1% of it gets removed — the lowest removal rate of any debt type. Prompts that accumulate issues tend to stay broken.

The root problem is that most teams treat prompts like configuration, not like code. They aren't version-controlled. They don't have owners. They accumulate in notebooks, Slack threads, and hardcoded strings scattered across a codebase. When the underlying model changes or a business rule updates, the prompts don't.

Addressing prompt debt without a rewrite:

The goal isn't to achieve perfect prompts — it's to stop accumulating invisible ones. Three changes matter most:

First, centralize storage. Move prompts out of application code into a system where they can be retrieved, updated, and audited without a deployment. This decouples prompt iteration velocity from engineering release cycles.

Second, establish versioning. Treat prompts as first-class assets tracked in version control. Maintain a compatibility matrix that records which prompt versions work with which model versions. When you upgrade from GPT-4o to a newer model, you need to know which prompts regressed.

Third, assign ownership. Prompt sprawl is an ownership problem as much as a technical one. Every production prompt should have a named owner responsible for keeping it current. "Ownership matters more than tooling" — without it, governance frameworks become bureaucracy around abandoned prompts.
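The three changes above can be sketched as a minimal prompt registry. This is an illustrative sketch, not an existing library: the class names, fields, and `publish`/`get` API are assumptions, but they show centralized storage, versioning, ownership, and a compatibility matrix working together.

```python
from dataclasses import dataclass

@dataclass
class PromptVersion:
    text: str
    version: str
    owner: str               # named owner responsible for keeping it current
    compatible_models: set   # compatibility matrix: models this version was validated on

class PromptRegistry:
    """Central store: prompts live outside application code and are
    retrieved by name, so updating one doesn't require a deployment."""

    def __init__(self):
        self._prompts = {}   # name -> list of PromptVersion, newest last

    def publish(self, name, text, version, owner, compatible_models):
        self._prompts.setdefault(name, []).append(
            PromptVersion(text, version, owner, set(compatible_models)))

    def get(self, name, model):
        """Return the newest version validated for `model`, consulting
        the compatibility matrix at call time."""
        for pv in reversed(self._prompts.get(name, [])):
            if model in pv.compatible_models:
                return pv
        raise LookupError(f"no version of {name!r} validated for {model}")
```

The payoff is in `get`: when a model upgrade outruns prompt validation, the lookup fails loudly instead of silently serving a prompt that regressed on the new model.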

Eval Debt: The Test Suite You Keep Meaning to Build

Eval debt is the gap between what your AI system actually does and your ability to detect when it starts doing something different. It accumulates when teams prioritize shipping over building evaluation infrastructure, and it compounds because every week without proper evals is a week where regressions go undetected.

The central problem isn't knowing which scoring technique to use. It's building the operational loop that turns production failures into reproducible test cases. Most teams have one of three pieces — a prototype they manually tested, production monitoring that fires on error rates, or an eval suite they built at launch. They rarely have all three connected.

The failure mode looks like this: your system regresses on a class of edge cases. Customers notice. You investigate, find the issue, fix it. Three months later, a similar change triggers the same regression. You investigate again. The pattern repeats because nothing captured that original failure as a durable test case that would have caught the second incident.

Research data on this is stark: 67% of organizations using AI at scale reported at least one critical quality issue linked to model behavior misalignment that went undetected for over a month. Model accuracy can degrade within days of a provider model update. Eval debt makes detection lag a structural feature, not an accident.

Addressing eval debt without a rewrite:

Build evaluation infrastructure in three connected layers.

The first layer is a golden dataset — 50 to 200 curated input-output pairs drawn from real production traffic. Not synthetic examples. Actual inputs your system receives, annotated with what a correct response looks like. Start with 25 to 50 cases from your most critical failure modes. Include edge cases that have burned you before: non-English inputs, adversarial prompts, incomplete or ambiguous queries.

The second layer is an offline evaluation suite that runs against this dataset before every deployment. The evaluation rubric should decompose into specific, measurable attributes — not a single "quality" score, but separate flags for correctness, format adherence, policy compliance, and whatever dimensions matter for your specific system. Automated judge infrastructure using a frontier model to score outputs against rubrics is practical and scales well.
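A minimal sketch of that second layer, assuming judges are injected as callables — each one can be a deterministic check (a regex for format adherence) or a wrapper around a frontier-model judge call. The function names and dataset shape here are illustrative, not a specific framework's API.

```python
def evaluate(case, output, judges):
    """Score one output against a golden case: one boolean flag per
    rubric attribute, not a single 'quality' number."""
    return {attr: judge(case, output) for attr, judge in judges.items()}

def run_suite(golden_dataset, generate, judges):
    """Offline pass over the golden dataset before deployment;
    returns the per-attribute pass rate."""
    totals = {attr: 0 for attr in judges}
    for case in golden_dataset:
        flags = evaluate(case, generate(case["input"]), judges)
        for attr, ok in flags.items():
            totals[attr] += int(ok)
    n = len(golden_dataset)
    return {attr: count / n for attr, count in totals.items()}
```

Decomposing the rubric this way means a regression shows up as "format adherence dropped from 0.99 to 0.85" rather than an opaque dip in an aggregate score.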

The third layer connects production monitoring back to the dataset. When production monitoring identifies a failure, the workflow should end with that failure converted into a new test case added to the golden dataset. Every incident that doesn't generate a regression test is an incident waiting to recur.
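The feedback loop itself can be a few lines in the incident-triage runbook. A hedged sketch — the incident fields and JSONL dataset format are assumptions about how your system records failures:

```python
import json

def capture_failure(incident, dataset_path="golden_dataset.jsonl"):
    """Close the loop: append a triaged production failure to the golden
    dataset as a regression case. `incident` carries the real input and
    the corrected expected output agreed on during triage."""
    case = {
        "input": incident["input"],
        "expected": incident["corrected_output"],
        "source": incident.get("ticket_id", "unknown"),  # provenance for audits
    }
    with open(dataset_path, "a") as f:
        f.write(json.dumps(case) + "\n")
    return case
```

The mechanism is trivial; the discipline is the point — an incident review isn't closed until this append has happened.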

Integrate offline evals into CI/CD as quality gates. A deployment that regresses the golden dataset by more than a threshold should not reach production — the same way a deployment that fails unit tests doesn't.
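The gate itself reduces to a comparison between the baseline pass rates and the candidate's, with a regression threshold — a sketch, with the 2-point default threshold as an illustrative assumption:

```python
def gate(baseline, candidate, max_regression=0.02):
    """Return the rubric attributes that regressed past the threshold.
    An empty dict means the deployment may proceed; a non-empty one
    should fail the CI job, the same way a failed unit test blocks a merge."""
    return {
        attr: (baseline[attr], candidate.get(attr, 0.0))
        for attr in baseline
        if baseline[attr] - candidate.get(attr, 0.0) > max_regression
    }
```

In CI this becomes `sys.exit(1)` when the returned dict is non-empty, with the dict printed so the failing attribute is visible in the build log.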

Embedding Debt: The Vector Store Growing Stale in the Dark

Embedding debt is what you accumulate when you treat embedding model selection as a one-time decision. Vectors computed against one model are not meaningfully comparable to vectors computed against a different one. As models evolve, as providers deprecate versions, and as the semantic meaning of your domain vocabulary shifts, your vector store silently diverges from the queries running against it.

The problem has three dimensions that compound each other.

Model drift: When you upgrade your embedding model — or when a provider updates the model underneath you — previously stored vectors become misaligned. Running the same text through two model versions can yield vectors whose mutual cosine similarity is only around 0.78, meaning your search index now returns results that are not the most relevant ones. Old embeddings fail to capture current context.

Semantic drift: Word meanings evolve. A corpus of product descriptions from 2020 that uses "reel" in a fishing context will misfire against queries about Instagram Reels in 2024. Domain-specific terms shift in meaning as business context changes. The vectors encoded your data's meaning at a point in time; that meaning may no longer match what users intend.

Storage multiplication: Organizations maintaining multiple embedding versions — across languages, tasks, and model iterations — can easily reach 60 million vectors for a system with 10 million records when accounting for three tasks and two model versions. At that scale, re-embedding becomes an $8,000 to $15,000 computation exercise, plus egress costs that at 600GB can exceed $300 before you've computed a single new vector.

This cost structure creates lock-in. Teams that understand embedding debt avoid it by staying on their original embedding model long past the point where a newer model would improve retrieval quality.

Addressing embedding debt without a rewrite:

Recent work on embedding space adaptation provides a practical path. A drift-adapter — a lightweight mapping function trained on small paired samples of old and new embedding outputs — can bridge the two spaces without re-encoding the corpus. At query time, new embeddings get transformed into the legacy space before searching an unchanged index. The approach achieves 95 to 99 percent of the quality you'd get from full re-indexing at a fraction of the cost, with less than 10 microseconds of latency overhead.

For teams that can absorb migration cost but need to avoid downtime, shadow indexing provides a cleaner path: warm up a new index in parallel while serving queries from the old one, then swap traffic atomically when the new index is ready.

The prevention pattern is monitoring. Track the semantic distribution of queries over time. Alert when the similarity distribution between incoming queries and stored vectors begins degrading — this is the signal that embedding drift is affecting retrieval quality before customers notice.
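That monitor can be as simple as tracking the best-match cosine similarity of incoming queries against the index and alerting when its mean drops below an established baseline. A sketch under those assumptions — the 0.05 tolerance is illustrative and should be tuned to your traffic:

```python
import numpy as np

def top1_similarity(queries, index):
    """Best cosine similarity of each incoming query against the stored vectors."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    v = index / np.linalg.norm(index, axis=1, keepdims=True)
    return (q @ v.T).max(axis=1)

def drift_alert(queries, index, baseline_mean, tolerance=0.05):
    """Fire when the query-to-index similarity distribution degrades
    past tolerance relative to the healthy baseline."""
    return top1_similarity(queries, index).mean() < baseline_mean - tolerance
```

Run over a sliding window of recent production queries, this catches embedding drift as a metric drop rather than as a customer complaint.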

How the Three Debts Compound Each Other

These liabilities don't accumulate in isolation. They interact in ways that amplify each other.

Prompt debt makes eval debt worse. Fragile, undocumented prompts make it harder to design evaluation rubrics that remain stable. When a prompt changes without versioning, it silently invalidates existing test cases — your eval suite now tests the old behavior, not the current one.

Eval debt lets embedding debt hide. When retrieval quality degrades because of stale embeddings, it rarely triggers an error — it just affects answer quality in ways that manual spot-checking misses. Without an evaluation pipeline that measures retrieval accuracy systematically, embedding drift accumulates undetected.

Embedding debt drives prompt debt. When retrieval quality degrades, teams respond by hacking the prompt — adding instructions to compensate for poor context, adding more examples to steer the model toward better outputs. These hacks layer technical debt on top of technical debt.

The net result is a system that degrades in all three dimensions simultaneously while appearing operational. Traditional monitoring — latency, error rate, throughput — shows green dashboards while users experience steadily declining quality.

A Refactoring Roadmap That Doesn't Require Starting Over

The good news is that all three types of debt can be addressed incrementally, in parallel with shipping features.

Weeks 1-3 — Inventory and baseline: Map existing prompts, classify them by production risk. Audit your evaluation coverage against the failures you've actually had in production. Check when your embedding model was last updated and whether your provider has deprecated it. Establish cost and quality baselines so you can measure improvement.

Weeks 4-8 — Core governance: Move prompts into centralized storage with versioning. Build an initial golden dataset from production failures. Document your embedding model versions and put a refresh schedule on the calendar.

Weeks 9-16 — Operational integration: Integrate offline evals into CI/CD as quality gates. Set up the production failure → test case feedback loop. Deploy shadow indexing or a drift-adapter if embedding model updates are overdue.

Ongoing: Assign prompt ownership formally. Establish who reviews prompt changes before production. Run evaluation coverage reviews quarterly the same way you run capacity planning.

None of this requires halting feature development. The architectural pattern is the same across all three debt types: abstract the fragile thing behind an interface that can be governed, monitored, and upgraded independently.

AI systems don't fail the way traditional software fails. They fail softly, probabilistically, and with confidence. The absence of alerts is not evidence of health — it's evidence that your monitoring wasn't built to catch quality erosion. Prompt debt, eval debt, and embedding debt are the three most common reasons AI systems quietly stop working well. Treat them with the same urgency you'd treat an unmigrated database schema or a test suite that hasn't run in six months.

The debt is already there. The question is whether you find it before your users do.
