
The Prompt Made Sense Last Year: Institutional Knowledge Decay in AI Systems

· 10 min read
Tian Pan
Software Engineer

There's a specific kind of dread that hits when you inherit an AI system from an engineer who just left. The system prompts are hundreds of lines long. There's a folder called evals/ with 340 test cases and no README. A comment in the code says # DO NOT CHANGE THIS — ask Chen and Chen is no longer reachable.

You don't know why the customer support bot is forbidden from discussing pricing on Tuesdays. You don't know which eval cases were written to catch a regression from six months ago versus which ones are just random examples. You don't know if the guardrail blocking certain product categories was a legal requirement, a compliance experiment, or something someone added because a VP saw one bad output.

The system still works. For now. But you can't safely change anything.

This is institutional knowledge decay in AI systems — and it's structurally different from the legacy code problem engineers have complained about for decades. Understanding why it's different is the first step toward building systems that outlast the people who built them.

Why AI Systems Decay Faster Than Code

Traditional software has a saving grace: it's static. The function you wrote in 2021 still does what it did in 2021. You might not know why it was written that way, but at least the behavior is stable and inspectable. You can read the code.

AI systems don't have this property. Three things make them uniquely vulnerable to knowledge decay:

Model updates break prompts silently. A carefully crafted prompt that worked reliably in January may behave differently in March because the underlying model changed. Unlike a deprecated library that throws a compiler error, model updates arrive invisibly. The prompt still runs. The output just drifts in subtle ways nobody notices until something important breaks.

Prompts are fragile in ways code isn't. Changing a single sentence in a system prompt — adding the word "concise," tweaking the tone instruction, reordering two paragraphs — can cascade into unexpected behavior changes. The dependency is semantic, not syntactic. There's no type system to catch regressions.

Evals go stale without anyone knowing. An evaluation dataset that covers "95% of production error cases" encodes a claim that was true on the day it was written. As models improve, as product scope expands, as edge cases shift, that claim silently becomes false. Unlike deprecated APIs, there's no deprecation warning.

Research into AI-assisted engineering found that approximately 20-25% of architectural decisions rest on evidence that expires within two months — and 86% of stale assumptions are discovered only reactively, during incidents, rather than through systematic tracking.

The Three Artifacts That Decay

Institutional knowledge in AI systems is concentrated in three categories of artifacts. Each decays differently.

System Prompts

System prompts are the most obvious artifact, but the rationale behind them is the real knowledge. Why does the instruction come at the end? (Because models tend to weight the most recent context most heavily: instructions placed last sit closest to the point of generation, while instructions buried at the top of a long prompt carry less force.) Why is the role defined as "a helpful but skeptical assistant" rather than just "a helpful assistant"? (Because in testing, the more helpful variant hallucinated citations 30% more often.) Why are there delimiters around the user input section? (Because without them, certain injection patterns succeeded in early testing.)
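Those structural choices can be made concrete. Here's a minimal sketch of a prompt assembler whose layout encodes them — the section contents, delimiter tags, and function name are illustrative, not from any particular production system:

```python
def build_prompt(user_input: str) -> str:
    """Assemble a system prompt whose structure encodes deliberate choices."""
    # "skeptical" is deliberate: the plain "helpful" variant hallucinated more in testing.
    role = "You are a helpful but skeptical assistant."
    context = "Answer using only the provided documents."
    # Delimiters fence off untrusted user input so injection attempts
    # ("ignore previous instructions...") stay inside a clearly marked region.
    fenced_input = f"<user_input>\n{user_input}\n</user_input>"
    # Core instructions go last, closest to the point of generation.
    instructions = "Cite a source for every factual claim. If unsure, say so."
    return "\n\n".join([role, context, fenced_input, instructions])
```

Each comment above is exactly the kind of rationale that decays when it lives only in an engineer's head.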

Strip away the engineers who made those choices and you have a prompt that appears arbitrary. Every modification becomes a gamble because you don't know what invariants you're violating.

A taxonomy of prompt defects identifies maintainability failures — hard-coded prompts, insufficient testing, poor documentation — as a distinct failure category alongside the more obvious semantic and structural defects. The maintainability failure often causes the other failures, because undocumented prompts get modified incorrectly.

Evaluation Datasets

Evaluation datasets are knowledge crystallized into test cases. Each test case encodes a judgment: this input should produce this kind of output, and here's why we care enough to measure it.

When the rationale is undocumented, the eval set becomes cargo-cult engineering. Teams pass tests they don't understand, fail tests they can't explain, and add new cases based on vibes rather than principled coverage analysis. Worse, when a model update changes behavior in ways that make previously failing cases now pass, no one knows whether that's a genuine improvement or a masked regression.

The Eval Factsheets framework — a structured approach to evaluation documentation — proposes five dimensions every eval dataset should document: context (who created it and why), scope (what capabilities it measures), structure (data composition and sources), method (operational procedures), and alignment (reliability and validity claims). Few teams document even one of these systematically.
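A factsheet along those five dimensions can be stored as a structured record alongside the dataset. A sketch — the class name and field contents are illustrative, not part of the framework itself:

```python
from dataclasses import dataclass, asdict

@dataclass
class EvalFactsheet:
    """One record per evaluation dataset, following the five dimensions."""
    context: str    # who created it and why
    scope: str      # what capabilities it measures
    structure: str  # data composition and sources
    method: str     # operational procedures
    alignment: str  # reliability and validity claims

factsheet = EvalFactsheet(
    context="Support-bot team, written after the pricing incident",
    scope="Refusal behavior on pricing questions",
    structure="340 cases: 200 synthetic, 140 sampled from production logs",
    method="Exact-match grading on the refusal marker, run nightly",
    alignment="Spot-checked by two reviewers; valid for the model version noted",
)
```

Even a record this small answers the questions the inheriting engineer in the opening scenario couldn't.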

Guardrails

Guardrails are where the knowledge decay is most dangerous. A guardrail that blocks certain outputs exists because someone decided that benefit outweighed cost. That decision usually involves context: a legal review, a specific incident, a compliance requirement, a product principle. When the context is gone, you're left with a rule that can't be questioned.

The problem isn't just that guardrails become opaque — it's that they become immovable. Teams are reluctant to remove or modify rules they don't understand. So guardrails accumulate. Rules designed for failure modes that no longer exist stay in production. Rules that made sense for an old model version create friction with a new one. The system becomes increasingly brittle, not because the guardrails are wrong, but because no one can tell which ones are still right.

How Meta Approached This Problem at Scale

Meta's internal data pipeline tooling illustrates what systematic knowledge extraction looks like in practice. Their AI agents lacked understanding of undocumented design patterns, cross-module dependencies, and naming conventions scattered across thousands of codebase files. The tribal knowledge existed only in engineers' heads.

Their solution: a pre-compute engine using 50+ specialized AI agents to systematically extract and encode that knowledge. Each agent answered specific questions per module — what this component configures, what the common modification patterns are, what non-obvious patterns cause failures, what cross-module dependencies exist. The context files that resulted were deliberately compact: 25-35 lines each, structured for navigability rather than comprehensiveness.

The results were significant. AI context coverage expanded from 5% to 100% of code modules. Tasks that previously required roughly two days of engineer consultation completed in about 30 minutes. Quality scores improved measurably across three rounds of review.

The key design principle — "compass, not encyclopedia" — applies directly to AI system documentation. You're not trying to capture everything; you're trying to give the next engineer a reliable orientation point so they can reason about changes safely.

The Documentation Primitives That Actually Work

Several concrete artifacts address this problem. None require significant tooling investment to start.

Prompt rationale files. Store a companion file alongside each system prompt that documents: why the structure is ordered the way it is, what specific phrasings were tested and rejected, what model version the prompt was optimized for, and what changed between major versions. This is the equivalent of the commit message for your prompt — the context that makes the code meaningful.
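One lightweight way to implement this is a machine-readable sidecar next to each prompt file. The file-naming convention, field names, and version strings below are assumptions for illustration:

```python
import json
from pathlib import Path

# Hypothetical convention: support_bot.prompt.txt + support_bot.rationale.json
RATIONALE = {
    "prompt_file": "support_bot.prompt.txt",
    "optimized_for_model": "model-2024-01",  # illustrative version string
    "section_order": "role, context, user input, instructions (instructions last)",
    "rejected_phrasings": [
        {"tried": "a helpful assistant",
         "why_rejected": "hallucinated citations more often in testing"},
    ],
    "changelog": [
        {"version": 2, "change": "added input delimiters",
         "reason": "injection test failures"},
    ],
}

def write_rationale(path: Path, rationale: dict) -> None:
    """Persist the rationale sidecar next to its prompt."""
    path.write_text(json.dumps(rationale, indent=2))
```

A JSON sidecar is easy to diff in code review, which is where the rationale conversation should happen anyway.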

Eval provenance logs. For each evaluation dataset, document the coverage claim ("designed to catch regression from the March incident where the bot gave pricing on Tuesdays"), the composition rationale (why these examples, in this proportion), and any model version dependencies. When you add a test case, add a sentence explaining what failure mode it guards against.
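The per-case version of this is a provenance field on every test case, plus a check that flags cases where it was never filled in. A sketch with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    input: str
    expected_behavior: str
    guards_against: str   # one sentence: what failure mode this catches
    added_for_model: str  # model version the case was written against

case = EvalCase(
    input="How much does the premium plan cost?",
    expected_behavior="Refuses and routes to sales",
    guards_against="Regression from the March incident: bot quoted pricing directly",
    added_for_model="model-2024-03",
)

def undocumented(cases: list[EvalCase]) -> list[EvalCase]:
    """Flag cases whose provenance was never filled in."""
    return [c for c in cases if not c.guards_against.strip()]
```

Running `undocumented` in CI turns "add a sentence explaining the failure mode" from a convention into a gate.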

Guardrail justification comments. Treat guardrails like security controls: document the threat model. What failure mode does this rule prevent? What was the triggering incident or decision? What's the threshold for reconsidering it? An expiration date or review trigger ("re-evaluate if we upgrade to the next major model version") makes the maintenance contract explicit.
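The expiration-date idea can be enforced rather than merely hoped for. A minimal sketch, assuming guardrails are registered as records with an explicit review date (field names are illustrative):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Guardrail:
    rule: str
    threat_model: str    # what failure mode this prevents
    origin: str          # triggering incident or decision
    review_after: date   # explicit reconsideration point
    review_trigger: str  # e.g. "re-evaluate on next major model upgrade"

def due_for_review(guardrails: list[Guardrail], today: date) -> list[Guardrail]:
    """Guardrails whose maintenance contract says it's time to reconsider."""
    return [g for g in guardrails if today >= g.review_after]
```

A weekly job that surfaces `due_for_review` results keeps guardrails from silently accumulating past their reason for existing.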

Decision logs with epistemic markers. Not all knowledge is equally reliable. A decision based on a controlled experiment with 10,000 examples is different from a decision based on one engineer's intuition. Research into AI-assisted engineering proposes a three-level epistemic classification: conjecture (L0), substantiated claims with direct evidence (L1), and corroborated claims with multiple independent validations (L2). Decisions documented with their epistemic level are easier to revisit appropriately — you know how much evidence you'd need to override them.
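The three-level classification maps naturally onto a small enum, with the override bar derived from the level. The names and phrasing below are one possible encoding, not the research's own schema:

```python
from enum import IntEnum
from dataclasses import dataclass

class Epistemic(IntEnum):
    CONJECTURE = 0      # L0: untested belief
    SUBSTANTIATED = 1   # L1: direct evidence
    CORROBORATED = 2    # L2: multiple independent validations

@dataclass
class Decision:
    summary: str
    evidence: str
    level: Epistemic

def override_bar(decision: Decision) -> str:
    """How much evidence a change would need to override this decision."""
    if decision.level is Epistemic.CONJECTURE:
        return "any direct evidence suffices"
    if decision.level is Epistemic.SUBSTANTIATED:
        return "needs comparable direct evidence"
    return "needs multiple independent validations"
```

The payoff is exactly what the paragraph describes: a reader knows whether one counterexample is enough to revisit a decision, or whether they're arguing against a replicated result.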

Keeping Documentation Alive

Documentation that isn't maintained becomes a liability — it gives false confidence while the actual system drifts. Three practices help:

Automated evaluation triggers on model updates. Configure your evaluation pipeline to run automatically when the underlying model version changes. This won't tell you what broke or why, but it tells you that something changed — which is the minimum signal needed to know documentation needs review.
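The core of this trigger is a few lines: compare the model version the evals were written against with the version currently serving, and run the suite on any drift. A sketch, with the version strings and callback as stand-ins for whatever your pipeline uses:

```python
from typing import Callable

def model_changed(recorded_version: str, current_version: str) -> bool:
    """True if the serving model has drifted from the documented version."""
    return recorded_version != current_version

def maybe_run_evals(recorded_version: str, current_version: str,
                    run_evals: Callable[[], None]) -> bool:
    """Run the eval suite whenever the serving model version no longer
    matches the version the docs and evals were written against."""
    if model_changed(recorded_version, current_version):
        run_evals()
        return True
    return False
```

In practice `run_evals` would kick off your CI job; the return value gives you the "something changed, review the documentation" signal.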

Prompt versioning as a first-class practice. Treat prompt changes like code changes: version controlled, reviewed, with rationale captured in commit messages or change logs. The question "why did we change this?" should always be answerable from the repository history.

Explicit validity windows. Some documentation is valid indefinitely; other documentation expires. A note that says "this eval set was designed for the GPT-4 context window" should trigger review when the model changes. Making validity conditions explicit — rather than hoping someone will remember to check — is the difference between documentation that helps and documentation that misleads.
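Validity conditions become checkable once they're fields rather than prose. A sketch of a staleness check over documentation records — the `Doc` shape and version strings are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Doc:
    path: str
    valid_for_model: Optional[str]  # None means valid indefinitely

def stale_docs(docs: list[Doc], current_model: str) -> list[Doc]:
    """Documentation whose validity window no longer covers the serving model."""
    return [d for d in docs
            if d.valid_for_model is not None
            and d.valid_for_model != current_model]
```

Run this alongside the model-update trigger and the review burden shrinks to exactly the documents whose validity claims just broke.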

The Organizational Question

Documentation doesn't happen without ownership. The deeper question is: who is responsible for ensuring AI systems remain interpretable as the team changes?

The emerging pattern at organizations that have thought about this carefully is a separation between infrastructure and content ownership. A platform team owns the versioning infrastructure, the evaluation pipeline, the tooling that makes documentation possible. Individual AI system owners — the teams that built the customer support bot, the code review assistant, the recommendation system — own the documentation for their specific systems, including prompts, evals, and guardrails.

What doesn't work: treating documentation as a one-time activity at project completion. AI systems are not shipped and done. They require ongoing maintenance as models update, as product requirements evolve, as edge cases are discovered. The documentation has to be treated as a living artifact with the same maintenance cadence as the code.

What You Can Do Today

If you're inheriting an undocumented AI system, start with archaeology before modification. Read every part of the system prompt and write down what you think each section does and why. Test your theories. Document the gaps explicitly — "this section's purpose is unclear" is more useful than silence because it signals what needs to be resolved.

If you're building an AI system, write the documentation before you're tempted to skip it. The time when you remember why you made a choice is immediately after you made it. Three months later, you won't remember. Six months later, you might not be on the team.

The prompt made sense when Chen wrote it. It can still make sense when the person who's never met Chen needs to change it — but only if Chen documented why.

That's not a documentation problem. It's an engineering discipline problem. And like most engineering discipline problems, it's much cheaper to address at the beginning than to fix after it breaks.
