The Prompt Made Sense Last Year: Institutional Knowledge Decay in AI Systems
There's a specific kind of dread that hits when you inherit an AI system from an engineer who just left. The system prompts are hundreds of lines long. There's a folder called evals/ with 340 test cases and no README. A comment in the code says # DO NOT CHANGE THIS — ask Chen, and Chen is no longer reachable.
You don't know why the customer support bot is forbidden from discussing pricing on Tuesdays. You don't know which eval cases were written to catch a regression from six months ago versus which ones are just random examples. You don't know if the guardrail blocking certain product categories was a legal requirement, a compliance experiment, or something someone added because a VP saw one bad output.
The system still works. For now. But you can't safely change anything.
This is institutional knowledge decay in AI systems — and it's structurally different from the legacy code problem engineers have complained about for decades. Understanding why it's different is the first step toward building systems that outlast the people who built them.
Why AI Systems Decay Faster Than Code
Traditional software has a saving grace: it's static. The function you wrote in 2021 still does what it did in 2021. You might not know why it was written that way, but at least the behavior is stable and inspectable. You can read the code.
AI systems don't have this property. Three things make them uniquely vulnerable to knowledge decay:
Model updates break prompts silently. A carefully crafted prompt that worked reliably in January may behave differently in March because the underlying model changed. Unlike a breaking library upgrade that fails loudly at compile time, model updates arrive invisibly. The prompt still runs. The output just drifts in subtle ways nobody notices until something important breaks.
Prompts are fragile in ways code isn't. Changing a single sentence in a system prompt — adding the word "concise," tweaking the tone instruction, reordering two paragraphs — can cascade into unexpected behavior changes. The dependency is semantic, not syntactic. There's no type system to catch regressions.
Evals go stale without anyone knowing. An evaluation dataset that covers "95% of production error cases" encodes a claim that was true on the day it was written. As models improve, as product scope expands, as edge cases shift, that claim silently becomes false. Unlike deprecated APIs, there's no deprecation warning.
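None of this decay announces itself, which is why the teams that cope best make drift mechanically visible. The sketch below is one minimal approach, not a prescription: pin the model version the prompt was last validated against, and whenever the live model differs, re-run a small case file against both and report cases that used to pass but no longer do. The `call_model` function, the file paths, and the case fields are all assumptions standing in for whatever client and eval format a team actually uses.

```python
import json
from pathlib import Path

# Assumptions for this sketch: call_model stands in for the real inference client,
# prompts/support_bot.txt holds the system prompt, and evals/smoke.jsonl holds
# cases shaped like {"id": ..., "input": ..., "must_contain": [...]}.
PINNED_MODEL = "provider/model-2024-01"  # version the prompt was last validated against
SYSTEM_PROMPT = Path("prompts/support_bot.txt").read_text()

def call_model(model: str, system: str, user: str) -> str:
    raise NotImplementedError("wire this to your actual client")

def passes(case: dict, output: str) -> bool:
    return all(needle.lower() in output.lower() for needle in case["must_contain"])

def drift_report(live_model: str, cases_path: str = "evals/smoke.jsonl") -> list[str]:
    """Cases the pinned model passes but the live model fails: likely silent drift."""
    regressions = []
    for line in Path(cases_path).read_text().splitlines():
        case = json.loads(line)
        old_ok = passes(case, call_model(PINNED_MODEL, SYSTEM_PROMPT, case["input"]))
        new_ok = passes(case, call_model(live_model, SYSTEM_PROMPT, case["input"]))
        if old_ok and not new_ok:
            regressions.append(case["id"])
    return regressions
```

Substring checks are crude; the point is only that a version pin plus a paired run turns "the output drifted" from a vague suspicion into a named list of cases.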
Research into AI-assisted engineering found that approximately 20-25% of architectural decisions have expired evidence within two months — and 86% of stale assumptions are discovered only reactively, during incidents, rather than through systematic tracking.
The Three Artifacts That Decay
Institutional knowledge in AI systems is concentrated in three categories of artifacts. Each decays differently.
System Prompts
System prompts are the most obvious artifact, but the rationale behind them is the real knowledge. Why does the instruction come at the end? (Because in long prompts, instructions buried at the top tend to carry less weight than content near the end; keeping them last puts them closest to where generation begins.) Why is the role defined as "a helpful but skeptical assistant" rather than just "a helpful assistant"? (Because in testing, the more helpful variant hallucinated citations 30% more often.) Why are there delimiters around the user input section? (Because without them, certain injection patterns succeeded in early testing.)
Strip away the engineers who made those choices and you have a prompt that appears arbitrary. Every modification becomes a gamble because you don't know what invariants you're violating.
A taxonomy of prompt defects identifies maintainability failures — hard-coded prompts, insufficient testing, poor documentation — as a distinct failure category alongside the more obvious semantic and structural defects. Maintainability failures often cause the others, because undocumented prompts get modified incorrectly.
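One low-ceremony remedy is to keep the rationale physically next to the prompt, so any modification forces the editor past the reasons. The structure below is illustrative, assuming a hypothetical `PromptSection` type; the decisions it records are the ones described above, and the origin references are invented.

```python
from dataclasses import dataclass

@dataclass
class PromptSection:
    name: str
    text: str
    rationale: str   # why this section exists and why it is shaped this way
    origin: str      # experiment, incident, or review that produced it

SUPPORT_BOT_PROMPT = [
    PromptSection(
        name="role",
        text="You are a helpful but skeptical assistant.",
        rationale="The plain 'helpful assistant' variant hallucinated citations ~30% more often in testing.",
        origin="prompt experiment, 2024-02 (hypothetical reference)",
    ),
    PromptSection(
        name="user_input",
        text="<user_input>\n{user_message}\n</user_input>",
        rationale="Delimiters blocked the injection patterns that succeeded in early testing.",
        origin="red-team review (hypothetical reference)",
    ),
    PromptSection(
        name="instructions",
        text="Answer the question above. Do not discuss pricing.",
        rationale="Kept last so the instructions sit closest to where generation begins.",
        origin="prompt experiment, 2024-02 (hypothetical reference)",
    ),
]

def render_prompt(user_message: str) -> str:
    """Assemble the prompt in section order; the rationale never leaves the repo."""
    return "\n\n".join(s.text.replace("{user_message}", user_message) for s in SUPPORT_BOT_PROMPT)
```

Checked into version control, the rationale travels with every diff that touches the prompt.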
Evaluation Datasets
Evaluation datasets are knowledge crystallized into test cases. Each test case encodes a judgment: this input should produce this kind of output, and here's why we care enough to measure it.
When the rationale is undocumented, the eval set becomes cargo-cult engineering. Teams pass tests they don't understand, fail tests they can't explain, and add new cases based on vibes rather than principled coverage analysis. Worse, when a model update changes behavior in ways that make previously failing cases now pass, no one knows whether that's a genuine improvement or a masked regression.
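A lightweight way to keep that judgment attached is to record it in the case itself rather than in someone's memory. The fields and the incident ticket below are invented for illustration; the shape is what matters.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    input: str
    must_contain: list[str]   # minimal pass criterion for this sketch
    rationale: str            # why this case exists at all
    origin: str               # incident, PR, or experiment that motivated it
    added_by: str
    added_on: str             # ISO date, so staleness is at least visible

CASES = [
    EvalCase(
        case_id="refund-policy-012",
        input="Can I get a refund after 45 days?",
        must_contain=["30-day"],
        rationale="Regression guard: the bot invented a 90-day window after a prompt edit.",
        origin="INC-2381 (invented ticket number)",
        added_by="chen",
        added_on="2024-03-18",
    ),
]
```

With rationale and origin on every case, "why does this test exist" becomes a grep rather than an archaeology project.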
The Eval Factsheets framework — a structured approach to evaluation documentation — proposes five dimensions every eval dataset should document: context (who created it and why), scope (what capabilities it measures), structure (data composition and sources), method (operational procedures), and alignment (reliability and validity claims). Few teams document even one of these systematically.
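Documenting them does not require tooling; a plain structure checked in next to the dataset covers all five. The keys below follow the framework's dimensions, while every value is a placeholder rather than a recommendation.

```python
# Kept alongside the dataset it describes, e.g. evals/support_v3/factsheet.py (path is illustrative).
EVAL_FACTSHEET = {
    "context": {
        "created_by": "support-ml team",
        "created_on": "2024-06-02",
        "purpose": "Gate releases of the customer support bot prompt.",
    },
    "scope": {
        "capabilities": ["refund policy accuracy", "tone compliance", "pricing refusals"],
        "out_of_scope": ["multilingual queries"],
    },
    "structure": {
        "n_cases": 340,
        "sources": ["sampled production transcripts", "hand-written adversarial cases"],
    },
    "method": {
        "grading": "substring checks plus human review of failures",
        "run_cadence": "every prompt change and every model version bump",
    },
    "alignment": {
        "known_gaps": "error cases collected before the latest product-scope expansion",
        "review_by": "2024-12-01",
    },
}
```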
Guardrails
Guardrails are where the knowledge decay is most dangerous. A guardrail that blocks certain outputs exists because someone decided the benefit outweighed the cost. That decision usually involves context: a legal review, a specific incident, a compliance requirement, a product principle. When the context is gone, you're left with a rule that can't be questioned.
The problem isn't just that guardrails become opaque — it's that they become immovable. Teams are reluctant to remove or modify rules they don't understand. So guardrails accumulate. Rules designed for failure modes that no longer exist stay in production. Rules that made sense for an old model version create friction with a new one. The system becomes increasingly brittle, not because the guardrails are wrong, but because no one can tell which ones are still right.
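One way to keep guardrails questionable, in the good sense, is to register each one with its rationale, its owner, and a review date, and to treat an overdue review as a failing check rather than a shrug. A sketch, with the rule, the legal reference, and the dates all invented:

```python
from datetime import date

# Hypothetical registry: every guardrail carries the context needed to revisit it later.
GUARDRAILS = [
    {
        "id": "block-pricing-discussion",
        "rule": "Refuse pricing questions and hand off to a human agent.",
        "rationale": "Legal review found quoted prices were treated as binding in two regions.",
        "owner": "legal-ops",
        "added_on": date(2023, 9, 4),
        "review_by": date(2024, 9, 4),
    },
]

def overdue_guardrails(today: date | None = None) -> list[str]:
    """Guardrails past their review date: candidates for re-validation or removal."""
    today = today or date.today()
    return [g["id"] for g in GUARDRAILS if g["review_by"] < today]
```

An expired review date doesn't mean the rule is wrong; it means someone has to re-earn the right to keep it.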
- https://arxiv.org/html/2601.21116v1
- https://arxiv.org/html/2603.14805
- https://engineering.fb.com/2026/04/06/developer-tools/how-meta-used-ai-to-map-tribal-knowledge-in-large-scale-data-pipelines/
- https://arxiv.org/html/2512.04062v1
- https://arxiv.org/html/2509.14404v1
- https://launchdarkly.com/blog/prompt-versioning-and-management/
- https://tettra.com/article/tribal-knowledge/
- https://ifor.ai/blog/the-silent-exit-how-ai-preserves-institutional-knowledge-english-37b69
- https://hoeijmakers.net/model-cards-system-cards/
