3 posts tagged with "team-process"

The First 90 Days for an AI Engineer: An Onboarding Playbook That Survives the Six-Week Doc Rot

· 12 min read
Tian Pan
Software Engineer

The new hire opens the onboarding doc. It points at a service architecture diagram from eleven months ago, a Confluence page titled "Our LLM Stack" last edited in October, and a Notion table of "model providers we use." Nothing in any of these documents tells them which prompt was tuned against which failure mode, which eval cases were added after which incident, which judge was recalibrated when the model bumped from 4.5 to 4.6, or why the system prompt for the support agent has a strange three-line preamble nobody wants to touch. Two weeks in, they ship a "small prompt cleanup" PR that removes the preamble. The eval suite passes. Production accuracy drops four points within a day.

The standard new-hire onboarding playbook — read the architecture doc, set up your laptop, do your first PR by week two — was built for engineers who join services. AI engineers join a different artifact. The thing they're going to be editing isn't a 5,000-line Go service that some staff engineer wrote; it's a 30-line prompt that survived eleven incidents and seventeen eval-driven rewrites, and the meaning of those thirty lines lives in the heads of two people on the team. Your onboarding doc cannot capture that, and trying to write a longer doc is the wrong fix.
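One lightweight alternative to a longer doc is to move the provenance into the artifact itself: record, next to each prompt section, which incident or eval case put it there. Here is a minimal sketch of that idea in Python; the `PromptSection` type, the PR and incident identifiers, and the eval-case path are all hypothetical, invented for illustration.

```python
# Hypothetical sketch: attach provenance to prompt sections so a new hire
# can see *why* a line exists before deleting it. All identifiers below
# (PR number, incident ID, eval path) are made up for illustration.
from dataclasses import dataclass, field

@dataclass
class PromptSection:
    text: str
    added_in: str                                  # commit or PR that introduced it
    reason: str                                    # incident or decision behind it
    guarded_by: list[str] = field(default_factory=list)  # eval case IDs

preamble = PromptSection(
    text="Always restate the user's question before answering.",
    added_in="PR #412",
    reason="INC-203: agent answered a different question than asked",
    guarded_by=["eval/support/restate_question_01"],
)

def deletion_needs_eval_review(section: PromptSection) -> bool:
    """A 'small cleanup' that removes a guarded section should not merge
    without someone also looking at the evals that guard it."""
    return bool(section.guarded_by)

assert deletion_needs_eval_review(preamble)
```

With something like this in place, the "strange three-line preamble" stops being tribal knowledge: the file itself says which incident it survived, and review tooling can flag any diff that deletes a guarded section.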

The Prompt Author Identity Problem: Three Roles Editing the Same File

· 13 min read
Tian Pan
Software Engineer

Pull up the git blame on any year-old production system prompt and you will find something the engineering team is not ready to admit: the file has three authors, none of whom share a definition of what a "change" is. The engineer who refactored the instruction blocks last month logged the commit as "no functional change, just reordering for clarity." The product manager who reads the file once a quarter would describe the same diff as "you rewrote the voice — customers will notice." The ML engineer running the regression suite would call it "you broke few-shot example three, and the eval has been red ever since."

All three are right. The prompt is simultaneously code, spec, and hyperparameter, and every team that ships an AI feature long enough discovers that the file's commit history is a slow-motion three-way authorship dispute that CODEOWNERS does not capture and the diff viewer does not surface.
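If the prompt really is code, spec, and hyperparameter at once, one way to make that concrete is to require a sign-off per lens rather than per file. The sketch below is a hypothetical review-gate helper, not an existing CODEOWNERS feature; the role names and lenses are illustrative.

```python
# Hypothetical sketch: treat one prompt diff as three diffs at once and
# require each lens to sign off. Roles and lens names are illustrative.
REQUIRED_LENSES = {
    "code": "engineer",       # structure, ordering, token budget
    "spec": "product",        # voice, promises made to the user
    "hyperparameter": "ml",   # eval deltas, few-shot example integrity
}

def missing_signoffs(approvals: set[str]) -> set[str]:
    """Return the lenses that have not yet reviewed the prompt change."""
    return {lens for lens, role in REQUIRED_LENSES.items()
            if role not in approvals}

# A PR approved only by the engineer still owes two reviews:
assert missing_signoffs({"engineer"}) == {"spec", "hyperparameter"}
```

The point is not the specific tooling but the invariant: a prompt change is not "no functional change" until all three authorship roles have agreed that it is.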

The Eval Bus Factor: When the Person Who Defined 'Correct' Walks Out the Door

· 10 min read
Tian Pan
Software Engineer

A team I worked with recently lost their senior ML engineer. Two weeks later, the eval suite was still green on every PR — 847 cases, all passing, judge agreement at 92%. Six weeks later, a customer found a regression that should have been caught by the very first eval case in the support-quality bucket. When the team went to debug, nobody could explain why that case had been written, what failure mode it was supposed to catch, or why the judge prompt graded it on a 1–4 scale instead of binary. The case was still passing. It just wasn't testing anything anyone could name.

This is the eval bus factor: the silent failure mode where the person who decided what "correct" means for your AI feature was also the person who curated the test cases, calibrated the judge, and absorbed every implicit labeling tradeoff in their head. When they leave, the suite remains green but stops generating reliable promote/reject signal — because nobody else can extend it, debug a flaky judge, or evaluate whether a new failure mode belongs in the test set.
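One way to lower the eval bus factor is to make each case carry its own rationale, so "why was this written?" survives the author. Below is a minimal sketch of such a schema in Python; the field names and the sample case are hypothetical, invented for illustration.

```python
# Hypothetical sketch: an eval case that records why it exists, so the
# suite stays debuggable after its author leaves. Fields are illustrative.
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    input: str
    failure_mode: str   # what this case is supposed to catch
    origin: str         # incident, bug report, or design decision behind it
    judge_scale: str    # e.g. "binary" or "1-4", with the reason for the choice
    rationale: str      # why this case earns its place in the suite

def orphaned(case: EvalCase) -> bool:
    """A case nobody can explain: still passing, but not testing
    anything anyone can name."""
    return not (case.failure_mode and case.origin and case.rationale)

case = EvalCase(
    case_id="support-quality/017",
    input="My refund never arrived.",
    failure_mode="",   # lost with the departed engineer
    origin="",
    judge_scale="1-4",
    rationale="",
)
assert orphaned(case)
```

A nightly check that counts orphaned cases turns the bus factor from a silent failure into a visible metric: the suite can stay green while the number of cases nobody can explain climbs.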