
The First 90 Days for an AI Engineer: An Onboarding Playbook That Survives the Six-Week Doc Rot

· 12 min read
Tian Pan
Software Engineer

The new hire opens the onboarding doc. It points at a service architecture diagram from eleven months ago, a Confluence page titled "Our LLM Stack" last edited in October, and a Notion table of "model providers we use." Nothing in any of these documents tells them which prompt was tuned against which failure mode, which eval cases were added after which incident, which judge was recalibrated when the model bumped from 4.5 to 4.6, or why the system prompt for the support agent has a strange three-line preamble nobody wants to touch. Two weeks in, they ship a "small prompt cleanup" PR that removes the preamble. The eval suite passes. Production accuracy drops four points within a day.

The standard new-hire onboarding playbook — read the architecture doc, set up your laptop, do your first PR by week two — was built for engineers who join services. AI engineers join a different artifact. The thing they're going to be editing isn't a 5,000-line Go service that some staff engineer wrote; it's a 30-line prompt that survived eleven incidents and seventeen eval-driven rewrites, and the meaning of those thirty lines lives in the heads of two people on the team. Your onboarding doc cannot capture that, and trying to write a longer doc is the wrong fix.

The right fix is a ramp-up sequence that puts new hires in front of the institutional knowledge in motion — at eval reviews, in prompt diffs they ship under supervision, owning a judge calibration end-to-end — so they learn the same way the senior engineers learned: by watching, by failing in a safe way, and by building a small piece of the surface themselves. Below is a 90-day playbook that treats AI engineering onboarding as a distinct discipline from service onboarding, with milestones that target the actual artifacts a working AI engineer touches.

Why the Architecture Diagram Lies (And Why You Stop Updating It)

Service onboarding works because services have a shape that survives time. A service's API contract, its data model, its deploy topology — these are slow-moving artifacts that a diagram can capture and that a doc can describe. An AI feature has none of that stability at the layer that matters. The interesting state lives in the prompt repo, the eval suite, the judge prompts, the model version locks, and the dataset of incident replays. All of these are edited at a velocity that makes a written doc stale within weeks.

Meta's engineering team ran into a version of this problem at the codebase level when they tried to give AI agents enough context to be productive in their internal monorepo. Their solution was telling: they built a pre-compute layer of context files maintained by a swarm of specialized agents, with periodic validation runs to detect stale references and auto-repair drift, because they found that context that decays is worse than no context at all. The lesson generalizes — a six-week-old onboarding doc that points at a prompt that no longer exists is more harmful than no doc, because the new hire trusts it and acts on it.

So the onboarding artifacts you maintain should be the ones that decay slowly. The shape of the eval suite is stable. The pattern of how prompts are organized in the repo is stable. The structure of judge calibration as a process is stable. What's unstable — the contents of the prompts, the specific eval cases, the current judge rubric — should be learned by reading the live artifact in the repo, not by reading a copy of it in a doc that nobody has agreed to keep updated.

This reframing matters because it tells you what the onboarding doc should contain: pointers, processes, and provenance, not snapshots. The doc tells the new hire how to find the prompts, how the eval suite is structured, and where the incident log lives. The current state of those artifacts is read from the repo and from the people who own them.
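One lightweight way to keep a pointer-based doc honest is a CI check that fails when a referenced artifact disappears from the repo. A minimal sketch, assuming the onboarding doc cites repo-relative file paths in backticks; the doc format, path conventions, and file extensions here are illustrative assumptions, not a prescribed standard:

```python
import re
from pathlib import Path

# Assumed convention: pointers appear as backticked repo-relative paths,
# e.g. `prompts/support_agent/system.txt`. Adjust the regex to your repo.
POINTER_RE = re.compile(r"`([\w./-]+\.(?:md|txt|yaml|json|py))`")

def find_stale_pointers(doc_text: str, repo_root: Path) -> list[str]:
    """Return paths referenced in the doc that no longer exist in the repo.

    Only backticked file paths matching POINTER_RE are checked; everything
    else in the doc is ignored.
    """
    referenced = POINTER_RE.findall(doc_text)
    return [p for p in referenced if not (repo_root / p).exists()]
```

Wired into CI, a non-empty return value fails the build, which is the mechanical version of Meta's stale-reference detection: the doc is only allowed to contain pointers that still resolve.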

Days 1 to 30: Shadow Before You Touch

The first thirty days should feel slower than service onboarding would suggest. A new hire on a backend team usually has their first PR merged in week one or two — typically a config tweak, a documentation fix, or a small refactor. The equivalent here is dangerous. A new hire who pushes a "small prompt cleanup" before they've internalized the failure-mode history of that prompt is doing the AI equivalent of refactoring a function whose tests don't fully cover its behavior.

The thirty-day goal is comprehension, not contribution. Concretely, that means:

  • Shadow at least three eval reviews. These are the meetings where the team looks at the latest eval run, decides which regressions matter, and assigns owners. Sitting in on these reviews teaches the new hire how the team distinguishes a noisy 1-point drop from a load-bearing regression, what "this looks like a judge problem, not a model problem" sounds like in practice, and which evals are treated as canaries versus which are treated as comprehensive.
  • Read the last six months of prompt-repo commit history for one prompt the team owns. Not all the prompts — one prompt, deeply. Each commit message should ideally point at the failure mode or eval delta that motivated it. If commit messages don't, the new hire's first written artifact should be a retro-doc that reconstructs the why of each non-trivial commit from Slack threads, eval logs, and conversations with the original author. This document is for them and only them.
  • Sit with the on-call rotation for a week. Most AI features fail in ways that don't trip the alerting that already exists — eval-on-traffic drops four points, judge calibration drifts after a model version bump, a downstream feature starts citing a fact that was hallucinated. Watching the on-call engineer triage these teaches the new hire what "broken" actually looks like for this stack, which is rarely a stack trace.
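The retro-doc exercise above can be scaffolded mechanically: dump the prompt file's commit history and emit a skeleton with a blank "why" for every commit that doesn't already cite an incident or eval. A minimal sketch — the `git log --format='%h|%ad|%s' --date=short` input shape and the `INC-`/`EVAL-` ticket prefixes are assumptions standing in for whatever conventions your team actually uses:

```python
import re
from dataclasses import dataclass

# Assumed ticket prefixes; swap in your team's incident/eval identifiers.
PROVENANCE_RE = re.compile(r"\b(INC-\d+|EVAL-\d+)\b")

@dataclass
class Commit:
    sha: str
    date: str
    subject: str

def retro_doc(git_log_output: str) -> str:
    """Turn one prompt file's `git log` output into a retro-doc skeleton.

    Expects one commit per line as 'sha|date|subject'. Commits whose message
    already cites an incident or eval ticket carry that provenance over;
    the rest get a TODO for the new hire to reconstruct from Slack threads,
    eval logs, and the original author.
    """
    entries = []
    for row in git_log_output.strip().splitlines():
        sha, date, subject = row.split("|", 2)
        c = Commit(sha, date, subject)
        refs = PROVENANCE_RE.findall(c.subject)
        why = ", ".join(refs) if refs else "TODO: reconstruct why"
        entries.append(f"- {c.date} {c.sha}: {c.subject}\n  why: {why}")
    return "\n".join(entries)
```

The output is deliberately a skeleton, not an answer: the TODO lines are the list of conversations the new hire needs to have.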

None of this produces a merged PR. That is the point. The artifact at day 30 is a new hire who, when shown a prompt diff in review, can ask the right questions — "did we re-run the regression eval on the long-context slice," "is the judge stable across this prompt change," "does this commit reference an incident we should add an eval case for." If they can ask those questions, they're ready for guided contribution. If they can't, another two weeks of shadowing is cheaper than one bad merge.
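The "eval-on-traffic drops four points" failure from the on-call week above is catchable with a rolling-window comparison rather than a stack-trace alert. A minimal sketch; the window sizes and the four-point threshold are illustrative assumptions to be tuned against your own eval noise floor:

```python
from collections import deque

class EvalDriftAlarm:
    """Compare recent eval-on-traffic accuracy against a longer baseline.

    Note the baseline window also contains the recent samples, which
    slightly dilutes the signal; a simple trade-off for a stateless-feeling
    API. Window sizes and max_drop are assumptions, not recommendations.
    """
    def __init__(self, baseline_window=500, recent_window=100, max_drop=0.04):
        self.baseline = deque(maxlen=baseline_window)
        self.recent = deque(maxlen=recent_window)
        self.max_drop = max_drop

    def record(self, passed: bool) -> bool:
        """Record one graded sample; return True if the alarm should fire."""
        self.baseline.append(passed)
        self.recent.append(passed)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent samples to judge drift
        baseline_acc = sum(self.baseline) / len(self.baseline)
        recent_acc = sum(self.recent) / len(self.recent)
        return (baseline_acc - recent_acc) > self.max_drop
```

This is exactly the kind of check a new hire can build during days 31 to 60, because it encodes what they watched the on-call engineer do by hand.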

Days 31 to 60: Guided Contribution and the Three Required Artifacts
