16 posts tagged with "incident-response"

The AI Rollback Ritual: Post-Incident Recovery When the Damage Is Behavioral, Not Binary

11 min read
Tian Pan
Software Engineer

In April 2025, OpenAI deployed an update to GPT-4o. No version bump appeared in the API. No changelog entry warned developers. Within days, enterprise applications that had been running stably for months started producing outputs that were subtly, insidiously wrong — not crashing, not throwing errors, just enthusiastically agreeing with users about terrible ideas. A model that had been calibrated and tested was now validating harmful decisions with polished confidence. OpenAI rolled it back three days later. By then, some applications had already shipped those outputs to real users.

This is the failure mode that traditional SRE practice has no template for. There was no deploy to revert. There was no diff to inspect. There was no test that failed, because behavioral regressions don't fail tests — they degrade silently across distributions until someone notices the vibe is off.

AI-Assisted Incident Response: Giving Your On-Call Agent a Runbook

9 min read
Tian Pan
Software Engineer

Operational toil in engineering organizations rose to 30% in 2025 — the first increase in five years — despite record investment in AI tooling. The reason is not that AI failed. The reason is that teams deployed AI agents without the same rigor they use for human on-call: no runbooks, no escalation paths, no blast-radius constraints. The agent could reason about logs, but nobody told it what it was allowed to do.

The gap between "AI that can diagnose" and "AI that can safely mitigate" is not a model capability problem. It is a systems engineering problem. And solving it requires the same discipline that SRE teams already apply to human operators: structured runbooks, tiered permissions, and mandatory escalation points.
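The shape of that discipline can be made concrete. Below is a minimal sketch of a tiered-permission runbook for an on-call agent; the action names, tiers, and `authorize` helper are all hypothetical illustrations, not an API from the post:

```python
from enum import Enum

class Tier(Enum):
    OBSERVE = 1   # read-only: logs, metrics, traces
    MITIGATE = 2  # reversible actions with a bounded blast radius
    ESCALATE = 3  # anything irreversible must go to a human

# Hypothetical runbook: every action the agent may take is registered
# with a tier and, for MITIGATE, an explicit blast-radius limit.
RUNBOOK = {
    "fetch_error_logs":   {"tier": Tier.OBSERVE},
    "restart_pod":        {"tier": Tier.MITIGATE, "max_targets": 1},
    "scale_down_cluster": {"tier": Tier.ESCALATE},
}

def authorize(action: str, targets: int = 1) -> str:
    """Gate an agent's proposed action against the runbook."""
    entry = RUNBOOK.get(action)
    if entry is None:
        return "deny"                # unknown action: never improvise
    if entry["tier"] is Tier.ESCALATE:
        return "page_human"          # mandatory escalation point
    if entry["tier"] is Tier.MITIGATE and targets > entry["max_targets"]:
        return "page_human"          # blast radius exceeded
    return "allow"
```

The point of the structure is that the agent's reasoning ability never expands its authority: diagnosis is unrestricted, mitigation is bounded, and the runbook, not the model, decides where the human comes in.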

AI in the SRE Loop: What Works, What Breaks, and Where to Draw the Line

12 min read
Tian Pan
Software Engineer

Most production incidents don't fail because of missing tools. They fail because the person holding the pager doesn't have enough context fast enough. An engineer wakes up at 3 AM to a wall of firing alerts, spends the first 20 minutes piecing together what actually broke, another 20 minutes deciding which runbook applies, and by the time they're executing the fix, the incident has been open for nearly an hour. The raw fix might take 5 minutes.

AI can compress that context-gathering window from 40 minutes to under 2. That's the genuine value on the table. But "LLM helps your on-call" is not one product decision — it's a stack of decisions, each with its own failure mode, and some of those failure modes carry consequences that a customer-service chatbot hallucination doesn't.
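One way to picture that compression: instead of the responder querying alerting, deploy history, and logs by hand, a brief is assembled automatically when the page fires. This is a hedged sketch; the data-source functions and the alert-to-runbook mapping are stand-ins for whatever backends a real system would query:

```python
import json
from datetime import datetime, timezone

# Hypothetical stubs — in practice these would hit your alerting,
# deploy, and logging systems.
def firing_alerts():  return ["checkout-5xx-rate", "payments-latency-p99"]
def recent_deploys(): return [{"service": "payments", "sha": "a1b2c3", "age_min": 12}]
def error_samples():  return ["TimeoutError: upstream payments-db (x412)"]

def build_incident_brief() -> str:
    """Collapse the 3 AM context hunt into one structured summary
    a responder (or an LLM) can read in seconds."""
    brief = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "alerts": firing_alerts(),
        "recent_deploys": recent_deploys(),  # prime rollback suspects
        "top_errors": error_samples(),
        "suggested_runbook": "payments-upstream-timeout",  # assumed mapping
    }
    return json.dumps(brief, indent=2)
```

Nothing here mitigates anything — it only front-loads context, which is where the 40 minutes actually go.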

The On-Call Burden Shift: How AI Features Break Your Incident Response Playbook

9 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Latency is normal. Error rates are flat. And your AI feature has been hallucinating customer account numbers for the last six hours.

This is the new normal for on-call engineers at companies shipping AI features. The playbooks that worked for deterministic software — check the logs, find the stack trace, roll back the deploy — break down when "correct execution, wrong answer" is the dominant failure mode. A 2025 industry report found operational toil rose from 25% to 30% for the first time in five years, even as organizations poured millions into AI tooling. The tools got smarter, but the incidents got weirder.