Skip to main content

2 posts tagged with "runbook"

View all tags

Quarterly Model Migration: Make It a Calendar Event, Not a Fire Drill

· 11 min read
Tian Pan
Software Engineer

The deprecation email arrives on a Tuesday afternoon. The model your billing pipeline has depended on for fourteen months is now on a sixty-day timer. The prompt was tuned by an engineer who left in March. The eval suite hasn't been re-baselined since launch. The customer-success team is asking why "the AI feels different" on two enterprise accounts. Nobody put this on the roadmap, and nobody will own it cleanly, because in your org's mental model this is a one-off project — even though it is the fourth one this year.

Every team running an AI feature in production runs into the same realization within eighteen months: the foundation-model provider is operating on a deprecation cadence that the team did not plan for, and the team's migration response keeps being a reactive scramble triggered by a notification email. The fix is not a better playbook for the next migration — there are already plenty of those, and your team has probably written one. The fix is to stop treating migration as a project and start treating it as a recurring operational primitive. Put it on the calendar.

The Five-Surface Triage Tree: An AI On-Call Playbook for Pages That Don't Fit Your Runbook

· 12 min read
Tian Pan
Software Engineer

The page fires at 2:47 AM. The agent is sending wrong-tone replies to customer support tickets, the latency dashboard is flat, the error rate is normal, and there is nothing to roll back because nothing was deployed in the last twelve hours. The on-call engineer opens the runbook, scrolls past "restart the worker pool" and "scale the queue," reaches the bottom, and finds nothing that maps to the page in front of them. They start reading the system prompt at 3:04 AM. They are still reading it at 3:31 AM.

This is the new failure shape, and the rotation that was designed for "high latency means restart the pod, elevated 5xx means roll back the deploy, queue depth growing means scale the worker pool" is not equipped to handle it. The first instinct — roll back the deploy — is wrong because nothing was deployed: the model upgraded silently behind a versioned alias, a third-party tool's response shape drifted, the prompt version skewed across regions, or the eval set went stale weeks ago and the regression has been compounding the whole time. The page is real. The runbook is silent. AI on-call is its own discipline now, and trying to retrofit it into the existing rotation produces playbooks whose first step is silence on the call while everyone reads the prompt for the first time.