
14 posts tagged with "technical-debt"


The 12-Month AI Feature Cliff: Why Your Production Models Decay on a Calendar Nobody Marked

· 11 min read
Tian Pan
Software Engineer

A feature ships at 92% pass rate. The launch deck celebrates it. Twelve months later the same feature is at 78% — no incident report, no failed deploy, no single change to point at, just a slow erosion that no one was assigned to watch for. The team blames "hallucinations" or "user behavior shift," picks a junior engineer to investigate, and sets a quarterly OKR to "improve quality." The OKR misses. The feature ships an apologetic dialog telling users the AI sometimes makes mistakes. Six months after that, it's deprecated and replaced with a new version that ships at 91% pass rate, and the cycle starts again.

This isn't bad luck. It's the second clock that AI features run on, the one that nobody marks on the release calendar at launch. Conventional software has feature decay too — dependency drift, codebase rot, the slow accumulation of half-applied refactors — but those decay on a clock the engineering org already understands and budgets for. AI features have all of that, plus a parallel set of decay sources that conventional amortization assumptions don't model: model deprecations, vendor weight rotations, distribution shift in user inputs, prompt patches that compound, judge calibration drift, and the quiet aging of an eval set that no longer represents what production traffic looks like.

The architectural realization that has to land — before the next AI feature ships, not after — is that AI features have a non-zero baseline maintenance cost. The feature isn't done when it launches. It's enrolled in a maintenance schedule it can't escape, and the team that didn't budget for that schedule is going to discover it the hard way.

Prompt Asset Depreciation: The Maintenance Schedule Your AI Team Doesn't Keep

· 9 min read
Tian Pan
Software Engineer

Engineering leaders are comfortable with the idea that code rots. Dependencies need updating, infrastructure has lifecycle management, certificates expire on a calendar nobody disputes. Yet the prompt repository gets treated as a write-once-read-many artifact — even though it defines how your product talks to a probabilistic engine that ships behavior changes every six weeks.

The system prompt tuned six months ago against the model that was current then is still in production. The few-shot examples chosen against a tokenizer that has since changed are still being injected on every call. The reranker prompt was tuned against an embedding endpoint the vendor deprecated last quarter. Nobody scheduled a review. Nobody is going to.

This is not a hypothetical failure mode. When one team migrated their prompt suite — meticulously stabilized against GPT-4-32k — to GPT-4.1 and GPT-4.5-preview, only 95.1% and 97.3% of their regression tests passed. A 3-5% silent quality regression is not a rounding error in production; at any non-trivial scale it is a customer-visible degradation that nobody on the team intentionally shipped. And those are the teams that even had a regression test suite. The median team's "regression test" is whatever vibes the on-call engineer formed during the last incident.

The category we are missing is prompt asset depreciation: a maintenance discipline that treats every production prompt as a depreciating asset with a known lifespan, not a constant.
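What that discipline looks like in practice is small. As a minimal sketch (the field names and the 90-day review interval are illustrative, not prescriptions from the post): every production prompt carries the model it was last validated against and the date of its last passing regression run, and a staleness check runs before every deploy.

```python
# A minimal sketch of prompt-as-depreciating-asset bookkeeping.
# Field names, the 90-day interval, and the model strings are
# illustrative assumptions, not an API from the post.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class PromptAsset:
    name: str
    tuned_against: str   # model version the prompt was last validated on
    validated_on: date   # date of the last passing regression run
    review_interval: timedelta = timedelta(days=90)

    def is_stale(self, current_model: str, today: date) -> bool:
        """A prompt depreciates on two clocks: calendar time and model drift."""
        overdue = today > self.validated_on + self.review_interval
        model_drifted = current_model != self.tuned_against
        return overdue or model_drifted

reranker = PromptAsset("reranker-v3", tuned_against="gpt-4-32k",
                       validated_on=date(2024, 1, 15))
if reranker.is_stale(current_model="gpt-4.1", today=date.today()):
    print(f"{reranker.name}: schedule a regression run before the next deploy")
```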

Why AI-Generated Comments Rot Faster Than the Code They Describe

· 11 min read
Tian Pan
Software Engineer

When an agent writes a function and a comment in the same diff, the comment is not documentation. It is a paraphrase of the code at write-time, generated by the same model from the same context, and it is silently wrong the first time the code shifts. The function gets refactored, an argument changes type, an early return gets added, the comment stays. By next quarter, the comment is encoding a specification that no longer matches the code, and the next reader trusts the comment because the comment is easier to read than the code.

This is an old failure mode — humans-edit-code-comments-stay-stale — but agents accelerate it across three dimensions at once. Comment volume goes up because agents add a doc block to every function whether it needs one or not. The comments are grammatically perfect, so reviewers don't flag them as low-quality. And the comments paraphrase the code in different terms than the code actually executes, so they look like documentation but encode a second specification that drifts independently of the first.
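Some of this drift is mechanically detectable. A rough sketch (the git-blame heuristic and the comment detection are assumptions for illustration, not a tool the post names): flag any comment line whose last edit predates the last edit to the code directly beneath it.

```python
# A rough staleness detector: flag comment lines whose last edit
# predates the last edit to the code immediately below them.
# The heuristic is an illustrative assumption, not a named tool.
import subprocess

def line_ages(path: str) -> list[int]:
    """Unix timestamp of the last commit that touched each line."""
    out = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line.split()[1]) for line in out.splitlines()
            if line.startswith("committer-time ")]

def stale_comments(path: str) -> list[int]:
    """1-indexed lines where a comment is older than the code under it."""
    ages = line_ages(path)
    with open(path) as f:
        lines = f.read().splitlines()
    return [i + 1 for i, line in enumerate(lines[:-1])
            if line.lstrip().startswith(("#", '"""'))
            and ages[i] < ages[i + 1]]
```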

The 'We'll Add Evals Later' Trap: How Measurement Debt Compounds

· 9 min read
Tian Pan
Software Engineer

Every team that ships an AI feature without evals tells themselves the same story: we'll add measurement later, after we find product-market fit, after the prompt stabilizes, after the next release. Six months later, the prompt has been touched by four engineers and two product managers, the behavior is load-bearing for three customer integrations, and the team discovers that "adding evals later" means reconstructing intent from production logs they never structured for that purpose. The quarter that was supposed to be new features becomes a quarter of archaeology.

This isn't a planning mistake. It's a compounding one. The team that skipped evals to ship faster is the same team that will spend twelve weeks rebuilding eval infrastructure from incomplete traces, disagreeing about what "correct" meant in February, and quietly removing features nobody can prove still work. The cost of catching up exceeds the cost of building in — not by a little, but by a multiplier that grows with every prompt edit that shipped without a regression check.
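The cheap version of "building in" is a pass-rate gate on a golden set, wired in before the first prompt edit ships. A minimal sketch, assuming a `call_model` hook for your provider; the golden file name, the substring match rule, and the 92% baseline are placeholders:

```python
# A minimal regression gate for prompt edits -- cheap to build on
# day one, expensive to reconstruct from unstructured logs later.
import json

def call_model(prompt: str, user_input: str) -> str:
    raise NotImplementedError("wire this to your LLM provider")

def pass_rate(prompt: str, golden_path: str) -> float:
    """Fraction of golden cases whose output contains the expected string."""
    with open(golden_path) as f:
        cases = [json.loads(line) for line in f if line.strip()]
    passed = sum(case["expected"] in call_model(prompt, case["input"])
                 for case in cases)
    return passed / len(cases)

BASELINE = 0.92  # the pass rate the feature launched at

def gate(candidate_prompt: str) -> None:
    """Run before merging any prompt edit; fails the build on regression."""
    rate = pass_rate(candidate_prompt, "golden_cases.jsonl")
    assert rate >= BASELINE, f"prompt regression: {rate:.1%} < {BASELINE:.0%}"
```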

LLM-as-Compiler Is a Metaphor Your Codebase Can't Survive

· 10 min read
Tian Pan
Software Engineer

The pitch is seductive: describe the behavior in English, the model emits the code, ship it. Prompts become the source, artifacts become the target, and the LLM sits between them like gcc with a friendlier front-end. If that framing held, the rest of software engineering — review, refactoring, architecture — would be downstream of prompt quality. It does not hold. And the codebases built on the assumption that it does start failing in a pattern that is now boring to diagnose: around month six, nobody can explain why a particular function looks the way it does, and every incremental change produces a wave of duplicates.

The compiler metaphor is the root cause, not vibe coding, not model quality, not prompt skill. It is a category error that quietly excuses teams from doing the work that keeps a codebase coherent over years. When you believe the model is a compiler, the generated code is an implementation detail, the same way assembly is an implementation detail of a C program. When you are actually running a team of non-deterministic, context-limited collaborators, the generated code is the asset — and the prompts are closer to Slack messages than to source.

Vibe Code at Scale: Managing Technical Debt When AI Writes Most of Your Codebase

· 9 min read
Tian Pan
Software Engineer

In March 2026, a major e-commerce platform lost 6.3 million orders in a single day — 99% of its U.S. order volume gone. The cause wasn't a rogue deployment or a database failure. An AI coding tool had autonomously generated and deployed code based on outdated internal documentation, corrupting delivery time estimates across every marketplace. The company had mandated that 80% of engineers use the tool weekly. Adoption metrics were green. Engineering discipline was not.

This is what vibe coding at scale actually looks like. Not the fast demos that ship in four days. The 6.3 million orders that vanish on day 365.

The Vibe Coding Productivity Plateau: Why AI Speed Gains Reverse After Month Three

· 8 min read
Tian Pan
Software Engineer

In a controlled randomized trial, developers using AI coding assistants predicted they'd be 24% faster. They were actually 19% slower. The kicker: they still believed they had gotten faster. This cognitive gap — where the feeling of productivity diverges from actual delivery — is the early warning signal of a failure mode that plays out over months, not hours.

The industry has reached near-universal AI adoption. Ninety-three percent of developers use AI coding tools. Productivity gains have stalled at around 10%. The gap between those numbers is not a tool problem. It is a compounding debt problem that most teams don't notice until it's expensive to reverse.

The Three Silent Clocks of AI Technical Debt

· 10 min read
Tian Pan
Software Engineer

Traditional technical debt announces itself. A slow build, a failing test, a lint warning that's been suppressed for six months — all of these are symptoms you can grep for, assign to a ticket, and schedule into a sprint. AI-specific debt is different. It accumulates in silence, in the gaps between deploys, and it degrades your system's behavior before anyone notices that the numbers have moved.

Three debt clocks are ticking in most production AI systems right now. The first is the prompt that made sense when a specific model version was current. The second is the evaluation set that was representative of user behavior when it was assembled, but no longer is. The third is the index of embeddings still powering your retrieval layer, generated from a model that has since been deprecated. Each clock runs independently. All three compound.
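None of these clocks requires sophisticated tooling to read. A sketch of the audit, with the registry shape, the model names, and the six-month threshold invented for illustration:

```python
# An illustrative audit of the three clocks. The registry shape and
# the deprecation list are assumptions for the sketch, not a real API.
from datetime import date

DEPRECATED_MODELS = {"text-embedding-ada-002"}  # example entry

system = {
    "prompt_model_pin": "gpt-4-32k",       # model the prompt was tuned on
    "current_model": "gpt-4.1",            # model actually serving traffic
    "eval_set_assembled": date(2023, 11, 1),
    "embedding_model": "text-embedding-ada-002",
}

def tick_report(s: dict, today: date) -> list[str]:
    alarms = []
    if s["prompt_model_pin"] != s["current_model"]:
        alarms.append("prompt clock: validated against a model no longer serving")
    if (today - s["eval_set_assembled"]).days > 180:
        alarms.append("eval clock: golden set is over six months old")
    if s["embedding_model"] in DEPRECATED_MODELS:
        alarms.append("embedding clock: index built on a deprecated model")
    return alarms

for alarm in tick_report(system, date.today()):
    print(alarm)
```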

The Prompt Debt Spiral: How One-Line Patches Kill Production Prompts

· 9 min read
Tian Pan
Software Engineer

Six months into production, your customer-facing LLM feature has a system prompt that began as eleven clean lines and has grown to over 400 tokens of conditional instructions, hedges, and exceptions. Quality is measurably worse than at launch, but every individual change seemed justified at the time. Nobody knows which clauses conflict with each other, or whether half of them are still necessary. Nobody wants to touch it.

This is the prompt debt spiral — and most teams in production are already inside it.

The AI-Generated Code Maintenance Trap: What Teams Discover Six Months Too Late

· 11 min read
Tian Pan
Software Engineer

The pattern is almost universal across teams that adopted coding agents in 2023 and 2024. In month one, velocity doubles. In month three, management holds up the productivity metrics as evidence that AI investment is paying off. By month twelve, the engineering team can't explain half the codebase to new hires, refactoring has become prohibitively expensive, and engineers spend more time debugging AI-generated code than they would have spent writing it by hand.

This isn't a story about AI code being secretly bad. It's a story about how the quality characteristics of AI-generated code systematically defeat the organizational practices teams already had in place — and how those practices need to change before the debt compounds beyond recovery.

The Three Hidden Debts Killing Your AI System

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped on time. Users are using it. Everything looks fine — until one quarter later when a support ticket reveals the system has been confidently wrong for weeks, your evaluation suite caught nothing, and the vector index is silently returning stale results. Nothing broke. The system returned 200 OK the whole time.

This is what AI technical debt looks like. Unlike a failing unit test or a stack overflow, it degrades softly and probabilistically. You don't get a crash — you get subtle quality erosion. Three distinct liabilities drive most of this: prompt debt, eval debt, and embedding debt. Each accumulates independently. Each compounds the others. And most engineering teams are carrying all three.

When the Prompt Engineer Leaves: The AI Knowledge Transfer Problem

· 9 min read
Tian Pan
Software Engineer

Six months after your best prompt engineer rotates off to a new project, a customer-facing AI feature starts misbehaving. Response quality has degraded, the output format occasionally breaks, and there's a subtle but persistent tone problem you can't quite name. You open the prompt file. It's 800 words of natural language. There's no changelog, no comments, no test cases. The person who wrote it knew exactly why every phrase was there. That knowledge is gone.

This is the prompt archaeology problem, and it's already costing teams real money. A national mortgage lender recently traced an 18% accuracy drop in document classification to a single sentence added to a prompt three weeks earlier during what someone labeled "routine workflow optimization." Two weeks of investigation, approximately $340,000 in operational losses. The author of that change had already moved on.
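The fix is not heroic documentation. It is a changelog entry per prompt edit, with the rationale and the before/after eval numbers attached, so the next investigator reads a record instead of doing archaeology. A sketch of the record (field names and values are illustrative):

```python
# A sketch of the changelog discipline that would have made a
# "routine workflow optimization" traceable. Fields are illustrative.
from dataclasses import dataclass
from datetime import date

@dataclass
class PromptChange:
    changed_on: date
    author: str
    rationale: str      # why the phrase exists: the knowledge that walks out the door
    diff: str           # the sentence added or removed
    eval_before: float  # pass rate on the golden set before the change
    eval_after: float   # pass rate after; a drop here blocks the merge

history = [
    PromptChange(
        changed_on=date(2025, 4, 2),
        author="engineer-who-left",
        rationale="suppress chatty preambles that broke downstream parsing",
        diff='+ "Respond with the classification label only."',
        eval_before=0.89,
        eval_after=0.94,
    ),
]
```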