Skip to main content

The AI Onboarding Gap: Why Engineers Can't Learn What They Can't Test

· 11 min read
Tian Pan
Software Engineer

A new engineer joins an AI-heavy team. On their third day, they see a prompt with an awkward double negation in the system instructions. It looks like a bug. They clean it up — the kind of small polish any reasonable person would do. Two hours later, customer-facing classification accuracy on a critical pipeline drops from 91% to 74%. Nobody has any idea why.

This scenario plays out in some form at almost every team building on LLMs. The new engineer isn't careless. The prompt did look wrong. But that double negation was load-bearing in a way that only the person who wrote it — after weeks of experimentation — actually understood. And they never wrote that understanding down.

This is the AI onboarding gap: the chasm between what an AI codebase appears to do and what it actually does, and why that gap is invisible until someone falls into it.

Why AI Codebases Resist Traditional Debugging

When a senior engineer on a traditional codebase tells a new hire "don't touch X without running the full test suite," that advice is actionable. The test suite runs in minutes. It either passes or it doesn't. If someone's change breaks something, git bisect can locate the exact commit. Causation is traceable.

None of this works cleanly for LLM-based systems.

The fundamental problem is non-determinism. The same prompt, same model, same temperature-zero settings can produce meaningfully different outputs across runs. This isn't a bug — it's how the technology works. But it makes the basic engineering ritual of "did my change break something?" surprisingly difficult to answer. A single test run is not evidence. Even a perfect pass doesn't mean your change is safe.

This creates what you might call the unbisectable failure problem. When a production LLM system starts behaving strangely, the regression might have been introduced by:

  • A prompt change (the obvious candidate, but not always the right one)
  • A model version upgrade that changed behavior on edge cases
  • A shift in the distribution of inputs from users
  • Context length changes that caused earlier instructions to be weighted differently
  • Interaction effects between prompt components that weren't visible in isolation

A new engineer debugging a regression will typically check the obvious things — did someone change the prompt? Did someone change the code? — and miss the non-obvious ones. They don't yet have the mental model to know that the model interprets instructions differently when they appear after a long conversation versus at the start. They don't know that your RAG system's retrieval threshold was tuned specifically to compensate for a quirk in how a particular model handles uncertainty. They can't know, because that knowledge lives nowhere in the codebase.

The Tribal Knowledge Problem at Its Most Acute

Every engineering organization accumulates tribal knowledge — the implicit understanding of why things are the way they are that lives in individual heads rather than in documentation. AI engineering teams accumulate it at an exceptional rate, because the iteration loop for prompt development is fast, informal, and leaves no trail.

When a developer writes a function and it works, the test coverage tells you it works. When a developer refines a prompt and it works, the evidence is usually "I ran it a few times and it looked good." The reasoning that led to the final prompt — the failed variants, the discovered failure modes, the counterintuitive phrasings that work — rarely gets written down.

Consider what a new engineer actually needs to know to safely work on an LLM pipeline:

  • Why does the system prompt use this specific structure rather than a more natural one?
  • Which parts of the prompt are sensitive to paraphrasing, and which aren't?
  • What failure modes were discovered during development, and how did the current prompt address them?
  • Are there known edge cases where the model reliably breaks, and was that a deliberate tradeoff?
  • What model version was this prompt tuned against, and how much does behavior vary across versions?

None of this is in the code. The code shows what the prompt is. It says nothing about what it was tried against, what it replaced, or what would happen if you changed it.

This is more extreme than the tribal knowledge problem in traditional software because the "why" is often empirical, not logical. You can't reason your way to understanding why a particular prompt phrasing works — you have to observe it over many runs. If that observation never got recorded, it's gone.

The Artifacts That Make AI Systems Teachable

Teams that handle this well have figured out that AI systems need a different class of documentation than traditional systems — not README files and architecture diagrams, but artifacts that capture the behavioral evidence that normally exists only in a developer's memory.

Golden datasets are the most important of these. A golden dataset is a curated collection of inputs paired with expected outputs or evaluation criteria — the canonical test cases that represent how the system is supposed to behave. Unlike unit tests, they don't assert exact string equality; they define acceptable behavior ranges. A strong golden dataset for an AI system covers happy paths, known edge cases, adversarial inputs, and out-of-scope queries.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates