
AI Succession Planning: What Happens When the Team That Knows the Prompts Leaves

11 min read
Tian Pan
Software Engineer

The engineer who built your customer support AI leaves for another job. On their last day, you do an offboarding interview and ask them to document what they know. They write a few paragraphs explaining how the system works. Six months later, customer satisfaction scores start slipping. Someone suggests tightening the tone of the system prompt. Another engineer makes the edit, runs a few manual tests, and ships it. Three weeks later, you discover that a specific phrasing in the original system prompt was load-bearing in ways nobody knew — it was the only thing preventing the model from over-escalating tickets on Friday afternoons, a pattern the original engineer had noticed and quietly fixed with a single sentence.

No one knew that sentence existed for a reason. It looked like an implementation detail. It was actually institutional knowledge.

This is the AI succession problem. Unlike traditional software, where git blame surfaces who changed a function and a good commit message explains why, LLM-based systems carry their reasoning in natural language that looks the same whether it encodes critical intent or was written in five minutes. The documentation gap is structural, not accidental. And it compounds with every model upgrade, every prompt edit, and every engineer who touches the system without understanding what the original authors were trying to do.

Why Prompts Resist Traditional Documentation Practices

In software engineering, code is partially self-documenting. Function names, type signatures, and test cases communicate what a piece of code is supposed to do. When those signals are insufficient, engineers write comments or docstrings. Much of the intent can be inferred from the implementation.
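To make the contrast concrete, here's a minimal sketch (all names are hypothetical, invented for illustration). The function announces its intent through its name, types, and docstring; the prompt encoding the same policy is just prose:

```python
from enum import Enum

class Priority(Enum):
    ROUTINE = "routine"
    URGENT = "urgent"

def should_escalate(priority: Priority, reopened_count: int) -> bool:
    """Escalate urgent tickets, or routine ones reopened at least twice."""
    return priority is Priority.URGENT or reopened_count >= 2

# The prompt version of the same policy carries none of those signals.
# Nothing marks which sentence is load-bearing or what failure it guards.
SYSTEM_PROMPT = """You are a support assistant. Resolve tickets yourself.
Escalate only when the user has reopened the ticket at least twice."""
```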

Prompts don't have this property. A 500-token system prompt is a wall of natural language. Its internal structure is invisible. Constraints and affordances are woven together in prose. The behavioral difference between "You are a helpful assistant" and "You are a helpful assistant. When a user expresses frustration, acknowledge the feeling before attempting to resolve the issue" is invisible at a glance — until the first version produces a measurable drop in satisfaction scores and you spend two weeks A/B testing your way back to the second.

This is the core problem: the semantic intent of a prompt is not recoverable from its text alone. To understand why a prompt is phrased a specific way, you need to know:

  • What behavior the original author was trying to elicit
  • What failure modes they were guarding against
  • What evaluation results led to that particular phrasing
  • What alternatives were tried and rejected

None of that lives in the prompt file. It lives in the head of whoever wrote it.

Research on software knowledge management suggests that roughly 42% of institutional knowledge resides with individual employees rather than in documented systems. For AI teams, that number is likely worse, because the tooling for capturing prompt-level knowledge barely existed before 2024.

The Prompt Debt Spiral

When knowledge about prompts goes undocumented, you accumulate a specific kind of technical debt that doesn't look like technical debt. The system keeps working. Evals may even pass. But the substrate of understanding that makes confident changes possible has eroded.

The first symptom is prompt hoarding. Engineers are reluctant to edit system prompts they didn't write because they can't be sure what they'd break. So the prompt grows. New instructions get appended rather than integrated. Contradictions accumulate. The prompt becomes an archaeology site where every sentence has a different vintage and no one knows which ones are still load-bearing.

The second symptom is model migration paralysis. When your LLM provider deprecates a model and you need to migrate, the instructions that worked on the old model need to be re-evaluated on the new one. If you don't understand why each instruction exists, you can't make principled decisions about what to keep, what to rewrite, and what to test. Teams that have documented their prompts migrate in days. Teams that haven't migrate in months — or defer the migration until the old model is unavailable.
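As a sketch of what documentation buys you during a migration: if each instruction is linked to the evals that originally justified it, re-evaluation on a new model becomes a loop instead of guesswork. Everything below (the Eval alias, migration_report, the keep/rewrite labels) is a hypothetical shape, not a real tool:

```python
from typing import Callable

# An "eval" here is any callable that exercises the behavior one
# instruction was written to guarantee, against a named model.
Eval = Callable[[str], bool]

def migration_report(
    documented: dict[str, list[Eval]],  # instruction -> its validating evals
    candidate_model: str,
) -> dict[str, str]:
    """Classify each documented instruction for a model migration."""
    report: dict[str, str] = {}
    for instruction, evals in documented.items():
        if all(run(candidate_model) for run in evals):
            report[instruction] = "keep as-is"
        else:
            report[instruction] = "rewrite and re-test"
    return report
```

Without the instruction-to-eval mapping, the loop has nothing to iterate over, which is why undocumented teams fall back to migrating the whole prompt as one opaque unit.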

The third symptom is the diff problem. When a prompt change causes a regression, git diff shows you what changed. It does not show you why the previous version was correct. Without documentation of intent, you're debugging by intuition, which is slow and unreliable.

Prompt Archaeology: Reconstructing What You've Lost

For teams that are already in debt, the first step is archaeology: reconstructing the reasoning behind existing prompts before the people who hold it leave.

The approach is systematic. Start with the most critical prompts — the ones that touch user-facing behavior or handle high-stakes decisions. For each one, schedule a working session with the original authors while they're still available. The goal is not to document what the prompt says (the text is already there) but to document what it's for.

The key questions for each prompt:

  • What specific behavior failure prompted this instruction?
  • What alternatives were tried?
  • What evaluation or user feedback validated this phrasing?
  • Which sentences are you most uncertain about changing, and why?

These answers should live in a decision log adjacent to the prompt file. Not a wiki, not a slide deck — something that's versioned alongside the prompt itself and visible to whoever opens the file next.
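One lightweight shape for such a log, sketched as a Python dataclass whose fields mirror the questions above. This is not a standard format, and every name and value here is invented for illustration (the Friday example echoes the scenario from the opening of this post):

```python
from dataclasses import dataclass, field

@dataclass
class PromptDecision:
    """One entry in a decision log, versioned next to the prompt file."""
    instruction: str          # the sentence or clause in question
    failure_observed: str     # what behavior failure prompted it
    alternatives_tried: list[str] = field(default_factory=list)
    validation: str = ""      # eval result or feedback that settled it
    change_risk: str = ""     # why the author would hesitate to edit it

FRIDAY_ESCALATION = PromptDecision(
    instruction="De-escalate unless the user explicitly asks for a human.",
    failure_observed="Over-escalated routine tickets on Friday afternoons.",
    alternatives_tried=["Lower temperature", "Few-shot de-escalation examples"],
    validation="Weekly escalation-rate eval improved (illustrative entry).",
    change_risk="Only known guard against the Friday pattern; no eval yet.",
)
```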

Even partial archaeology is valuable. A single paragraph explaining the original intent of a system prompt's most counterintuitive instruction is worth more than a hundred lines of general documentation about the AI system's goals.

Eval-as-Documentation: Encoding Intent in Tests

The most durable form of prompt documentation is not prose — it's evaluations. A well-designed eval suite encodes what the prompt is supposed to do in a form that's executable, versioned, and regression-tested on every change.

This reframes the role of evals. They're not just quality checks; they're the specification of intent. When an eval tests that the model doesn't escalate routine tickets, that test communicates something that no comment in a prompt file could communicate as reliably: this behavior matters, we measured it, and we've committed to maintaining it.
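Here's what that looks like as a sketch, in pytest style. generate_reply is a stand-in for your real model call, the ticket fixtures are invented, and the substring assertion is deliberately the simplest possible check; a production eval might use a judge model or structured output instead:

```python
def generate_reply(ticket: str) -> str:
    """Stand-in for the real model call; wire this to your LLM client."""
    raise NotImplementedError

ROUTINE_TICKETS = [
    "How do I reset my password?",
    "Can you resend last month's invoice?",
]

def test_routine_tickets_are_not_escalated():
    """Spec: routine tickets get resolved directly, never escalated.

    Added after the Friday-afternoon over-escalation regression;
    see the FRIDAY_ESCALATION entry in the decision log.
    """
    for ticket in ROUTINE_TICKETS:
        reply = generate_reply(ticket)
        assert "escalat" not in reply.lower(), f"escalated: {ticket!r}"
```

Note how the docstring points back at the decision log: the test asserts the behavior, the log explains its history, and both travel with the prompt in version control.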

The disciplines that make eval-as-documentation work follow from everything above: write an eval for every behavior you've committed to, version it alongside the prompt it specifies, and treat a failing eval as a blocking signal, the same way a failing test blocks a code change.
