
Your System Prompt Grows After Every Incident — and Nobody Deletes a Line

· 8 min read
Tian Pan
Software Engineer

Open the system prompt of any agent that has been in production for a year. Scroll to the bottom. You will find a sediment layer of sentences that read like apologies: "Never invent order numbers." "Do not promise refunds you cannot confirm." "If the user is in Germany, do not mention the legacy plan." Each one is a fossil. Each one marks the exact moment something went wrong in production, someone got paged, and the fastest available fix was to add a sentence.

Nobody deletes those sentences. Not because they are still earning their place, but because deleting one means proving a negative — proving the model will not regress on a bug that may have been fixed three model versions ago. No one can prove that, so the line stays. The system prompt becomes an append-only log of past incidents, and it costs you tokens on every single call, forever.

This is the quietest form of technical debt in an AI system, because it does not look like debt. It looks like diligence.

How the Prompt Becomes a Graveyard

The accretion pattern is almost identical to how legacy code rots, with one crucial difference: legacy code at least has the decency to throw an error when it breaks.

A typical lifecycle: an agent ships with a clean 400-token system prompt. Week three, it confidently fabricates a tracking number for a customer. Incident, postmortem, and the remediation is one line — "Never generate tracking numbers; only repeat numbers returned by the lookup tool." Week seven, it offers a discount that does not exist. Another line. Week twelve, a compliance review finds it discussing a deprecated product, so in goes a paragraph about which SKUs are off-limits. Eighteen months later the prompt is 4,000 tokens of defensive rules, few-shot examples, and conditional carve-outs, and changing one sentence in the second paragraph mysteriously alters the tone of the closing summary.

Practitioners have a name for the end state: the prompt graveyard — a collection of similar-but-subtly-different mega-prompts that nobody dares delete because each might contain one crucial instruction buried deep inside. The result is a maintenance nightmare where a trivial change requires hours of careful editing and a full regression sweep.

The cause is structural, not laziness. Adding a sentence is a five-minute fix that demonstrably resolves the incident in front of you. Removing a sentence is an open-ended research project with no clear success criterion. The economics of an active incident always favor accretion. Every individual decision is rational. The aggregate is a liability.

Why It Is Worse Than Legacy Code

Three properties make prompt accretion nastier than the code equivalent.

It has no test coverage. When you add a defensive instruction, you are adding a behavioral requirement with zero automated verification that it still holds — or that it ever did. Code has a compiler and a test suite that fail loudly when an assumption breaks. A prompt instruction just sits there. It might be doing nothing. It might be actively fighting another instruction added six months later. You cannot tell by reading it, and most teams never check.
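One way to picture what "checking" could mean is an ablation test: run the same cases with and without a single rule and see whether the rule changes anything. The sketch below is illustrative only. `fake_model`, `fabricates_tracking_number`, and `rule_is_load_bearing` are hypothetical names, and `fake_model` is a deterministic stand-in for a real model call; a real model is nondeterministic, so an actual check would need many samples per case.

```python
from typing import Callable, Sequence

# Hypothetical signature for a model call: (system_prompt, user_message) -> reply
Model = Callable[[str, str], str]

RULE = "Never generate tracking numbers; only repeat numbers returned by the lookup tool."


def fake_model(system_prompt: str, user_message: str) -> str:
    """Stand-in for a real model: fabricates a number unless RULE is in the prompt."""
    if RULE in system_prompt:
        return "I can't find a tracking number for that order."
    return "Your tracking number is ZZ123456."


def fabricates_tracking_number(reply: str) -> bool:
    """Toy violation detector; a real one would compare against lookup-tool output."""
    return "tracking number is" in reply.lower()


def rule_is_load_bearing(
    model: Model,
    prompt_lines: Sequence[str],
    rule_index: int,
    cases: Sequence[str],
    violates: Callable[[str], bool],
) -> bool:
    """True if some case passes with the rule present but fails once it is removed."""
    with_rule = "\n".join(prompt_lines)
    without_rule = "\n".join(
        line for i, line in enumerate(prompt_lines) if i != rule_index
    )
    for user_message in cases:
        ok_with = not violates(model(with_rule, user_message))
        ok_without = not violates(model(without_rule, user_message))
        if ok_with and not ok_without:
            return True
    return False


prompt_lines = ["You are a support agent.", RULE, "Be polite."]
cases = ["Where is my order?"]

# Against this stub, removing RULE changes behavior; removing "Be polite." does not.
tracking_rule_matters = rule_is_load_bearing(
    fake_model, prompt_lines, 1, cases, fabricates_tracking_number
)
politeness_rule_matters = rule_is_load_bearing(
    fake_model, prompt_lines, 2, cases, fabricates_tracking_number
)
```

A `False` result here only means the rule made no difference on these cases with this detector. It is evidence for a deletion review, not proof that the line is safe to remove.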

It is silent when it stops working. Business logic that lives in a prompt rather than in code is rarely versioned, hard to audit, and easy to resurface accidentally when someone copies an old prompt into a new agent. Because the model is flexible, a stale instruction keeps "kind of working" long after it should raise an alarm — until the day the model's reasoning shifts and no one can explain why behavior changed. There is no stack trace pointing at line 47 of the system prompt.

It taxes every request. This is the part that shows up on the invoice. Every turn an agent takes, it re-sends the entire system prompt — instructions, tool definitions, carve-outs — to the model. A 2,000-token system prompt over a 50-turn session is 100,000 tokens of instruction reprocessing that produces no new value. Multiply by your daily request volume and the defensive sediment has a real, recurring line-item cost. Prompt caching softens the bill, but caching a paragraph that no longer does anything is just paying a discounted rate for dead weight.
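The arithmetic above can be sketched in a few lines. The 2,000-token prompt and 50-turn session come from the text; the session volume and the 500 "dead" tokens are made-up figures for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCost:
    prompt_tokens: int      # size of the system prompt
    turns_per_session: int  # model calls per session, each re-sending the prompt
    sessions_per_day: int   # assumed traffic figure, not from the article

    def tokens_per_session(self) -> int:
        """System-prompt tokens re-sent over one session."""
        return self.prompt_tokens * self.turns_per_session

    def tokens_per_day(self) -> int:
        """System-prompt tokens re-sent across all sessions in a day."""
        return self.tokens_per_session() * self.sessions_per_day


def dead_weight_tokens_per_day(cost: PromptCost, dead_tokens: int) -> int:
    """Daily tokens spent re-sending instructions that no longer do anything."""
    return dead_tokens * cost.turns_per_session * cost.sessions_per_day


example = PromptCost(prompt_tokens=2_000, turns_per_session=50, sessions_per_day=1_000)

per_session = example.tokens_per_session()              # 2,000 * 50 = 100,000
per_day = example.tokens_per_day()                      # 100,000 * 1,000 sessions
dead_per_day = dead_weight_tokens_per_day(example, 500) # if 500 tokens are stale
```

Multiplying the token counts by your provider's input price, cached or uncached, turns them into the line item. The proportion spent on stale rules stays the same either way.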

Defensive Instructions Are Not Free Even When They Work
