23 posts tagged with "sre"


Model Rollback Velocity: The Seven-Hour Gap Between 'This Upgrade Is Wrong' and 'Old Model Fully Restored'

· 12 min read
Tian Pan
Software Engineer

The playbook for a bad code deploy is a sub-minute revert. The playbook for a bad config push is a sub-second flag flip. The playbook for a bad model upgrade is whatever the on-call invents at 09:14, and on a typical day it takes seven hours to finish. During those seven hours the regression keeps compounding — wrong answers ship to customers, support tickets pile up, and the dashboard shows a slow gradient rather than a clean cliff back to green.

The reason the gap is seven hours is not that the team is slow. It is that "rollback" for a model upgrade is not the same primitive as "rollback" for code. It is closer to a database schema migration: partial, hysteretic, and not reversible by pressing the button you wish existed. The team that wrote its incident playbook around a button does not have the controls the actual rollback requires.

This post is about what those controls look like, why they have to be paid for in advance, and what you find out about your platform the first time you try to roll back a model under load.
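As a taste of the first control, here is a minimal sketch of pinning the exact model version and prompt revision behind the alias your application calls, so that rollback becomes a config flip instead of an improvised migration. The names (`ModelPin`, `Router`) and version strings are illustrative, not the post's implementation.

```python
# Hypothetical sketch: keep the previous pin deployable so rollback is a flip.
from dataclasses import dataclass


@dataclass
class ModelPin:
    alias: str       # what application code asks for, e.g. "support-agent"
    version: str     # the exact provider model version behind the alias
    prompt_rev: str  # prompt revision that was evaluated with this version


@dataclass
class Router:
    current: ModelPin
    previous: ModelPin | None = None

    def upgrade(self, new_pin: ModelPin) -> None:
        # Keep the old pin around; it is the rollback target.
        self.previous, self.current = self.current, new_pin

    def rollback(self) -> ModelPin:
        if self.previous is None:
            raise RuntimeError("no previous pin retained; rollback is a rebuild")
        self.current, self.previous = self.previous, self.current
        return self.current


router = Router(current=ModelPin("support-agent", "provider-model-2025-06", "prompt-v42"))
router.upgrade(ModelPin("support-agent", "provider-model-2026-01", "prompt-v43"))
print(router.rollback().version)  # back to the version that was last known good
```

The point of the sketch is that both halves of the flip have to exist before the incident: the previous version must still be retained and routable, and the prompt revision it was evaluated with must travel with it.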

Your On-Call Rotation Needs an AI-Literacy Prerequisite Before It Pages Anyone at 2am

· 12 min read
Tian Pan
Software Engineer

A platform engineer with eight years of incident-response experience opens a 2am page that says "AI assistant degraded — error rate 12%." She checks the model latency dashboard: green. She checks the model API status page: green. She checks the deploy log: nothing shipped in the last 72 hours. She does what any competent on-call does next — she pages the AI team. The AI engineer wakes up, opens the trace dashboard the platform engineer didn't know existed, sees that a single retrieval tool has been timing out for the last four hours because a downstream search index lost a replica, and resolves the incident in eleven minutes. The AI engineer goes back to bed at 3:14am. The retrospective the next morning records "AI feature outage, resolved by AI team." Nobody writes down the actual lesson, which is that the on-call engineer could have triaged this in five minutes if she had ever been taught what an AI feature's failure surface looks like.

This is the rotation tax that AI features quietly impose on every engineering org I've worked with in the last two years. The shared on-call rotation that worked beautifully for a stack of stateless services and a few databases breaks down the moment one of those "services" is an LLM-backed feature. The on-call playbook your SRE team built across a decade of post-mortems is calibrated for a world where "something is broken" decomposes into CPU, memory, network, deploys, and dependency timeouts. AI features add three more axes — the model, the prompt, the retrieval pipeline — and four more shapes of failure that don't show up on the dashboards your on-call was trained to read.
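As one concrete example of that missing literacy, the check that would have closed the 2am page above is a per-tool failure breakdown rather than a per-host one. A minimal sketch, assuming a hypothetical trace schema with `tool` and `status` fields:

```python
# Hypothetical sketch of the first AI-literate triage step: break the feature's
# error rate down by tool call, not by pod or host.
from collections import Counter


def tool_failure_rates(traces: list[dict]) -> dict[str, float]:
    """Fraction of calls per tool that timed out or errored."""
    calls, failures = Counter(), Counter()
    for span in traces:
        tool = span["tool"]
        calls[tool] += 1
        if span["status"] in ("timeout", "error"):
            failures[tool] += 1
    return {tool: failures[tool] / calls[tool] for tool in calls}


traces = [
    {"tool": "search_index", "status": "timeout", "latency_ms": 5000},
    {"tool": "search_index", "status": "ok", "latency_ms": 220},
    {"tool": "crm_lookup", "status": "ok", "latency_ms": 90},
]
print(tool_failure_rates(traces))  # {'search_index': 0.5, 'crm_lookup': 0.0}
```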

The Agent Flight Recorder: Capture These Fields Before Your First Incident

· 12 min read
Tian Pan
Software Engineer

The first time an agent goes sideways in production — it deletes the wrong row, emails the wrong customer, burns $400 of inference on a single task, or tells a regulated user something legally exposed — the team opens the logs and discovers what they actually have: a CloudWatch stream of tool-call names with truncated arguments, a "user prompt" field that captured only the latest turn, and no record of which model version actually ran. The provider rolled the alias forward two weeks ago. The system prompt lives in a config service that wasn't snapshotted. Temperature wasn't logged because the framework default was 0.7 and "everyone knows that." The tool result that triggered the bad action exceeded the log line size and got truncated to "...".

You cannot reconstruct the decision. You can only guess. Six months later you have a pile of "why did it do that" reports with no answers, and the team starts treating the agent like weather — something that happens to you, not something you debug.

The flight recorder discipline is the cheapest thing you will ever ship that prevents this, and the most expensive thing you will ever ship if you wait until the first incident to start. The fields below are the bare minimum, the storage shape is non-negotiable, and the sampling and privacy boundaries have to be designed alongside them, not retrofitted.
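As a minimal sketch of the record shape, here is one record per agent step covering only the omissions called out above (resolved model version, prompt snapshot, sampling parameters, untruncated tool I/O). The names are illustrative; the post's full field list is longer.

```python
# Sketch of a per-step flight-recorder record; field names are hypothetical.
from dataclasses import dataclass, asdict
import hashlib
import json
import time
import uuid


@dataclass
class AgentStepRecord:
    trace_id: str            # joins every step of one task together
    step: int
    model_version: str       # the resolved version, not the floating alias
    system_prompt_sha: str   # hash of the exact prompt text, snapshotted elsewhere
    temperature: float       # logged even when it is "just the framework default"
    tool_name: str | None
    tool_args_json: str      # full arguments, not a truncated log line
    tool_result_ref: str     # pointer to the full result in blob storage
    ts: float


system_prompt = "You are a billing assistant."  # placeholder prompt text
record = AgentStepRecord(
    trace_id=str(uuid.uuid4()),
    step=3,
    model_version="provider-model-2026-01-15",  # placeholder, pinned not aliased
    system_prompt_sha=hashlib.sha256(system_prompt.encode()).hexdigest(),
    temperature=0.7,
    tool_name="delete_row",
    tool_args_json=json.dumps({"table": "invoices", "id": 10423}),
    tool_result_ref="blob://flight-recorder/example-ref",  # placeholder pointer
    ts=time.time(),
)
print(json.dumps(asdict(record), indent=2))
```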

Agent SLOs Without Ground Truth: An Error Budget for Outputs You Can't Grade in Real Time

· 11 min read
Tian Pan
Software Engineer

Your agent platform has met its 99.9% "response success" SLO every quarter for a year. Tickets are up 40%. Retention on the agent-touched cohort is down. The on-call rotation is bored, the product manager is panicking, and the executive review keeps asking why the dashboard says everything is fine while the support queue says everything is on fire. The dashboard isn't lying. It's just measuring the wrong thing — because the SRE who wrote the SLO defined success as "the model API returned 200," and that was the only definition of success the telemetry could express in the first place.

This is the central problem of agent reliability engineering: the success signal is not a status code. It is a judgment about whether the agent did the right thing for a specific task, and that judgment is unavailable at request time, often unavailable at session time, and sometimes only resolvable days later when the user files a ticket, edits the output, or quietly stops coming back. You cannot put a 200-vs-500 boolean on a column that doesn't exist yet.

The reflex is to wait for ground truth before declaring an SLO. This is wrong. Reliability does not pause while you build a labeling pipeline. The right move is to write an error budget against proxies you know are imperfect, name them as proxies, set the policy that governs how the team responds when they trip, and back-fill ground truth into the calculation as you produce it. This post is about how to do that without lying to yourself.
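A minimal sketch of that budget math, assuming two hypothetical proxy signals; the post defines its own. The arithmetic is ordinary SRE arithmetic; the honesty is in naming the proxies and preferring ground truth whenever a label has been back-filled.

```python
# Sketch: error budget over proxy signals, back-filled with ground truth later.
from dataclasses import dataclass


@dataclass
class TaskOutcome:
    escalated: bool                       # user handed the task to a human
    heavily_edited: bool                  # most of the agent output rewritten (proxy threshold)
    ground_truth_bad: bool | None = None  # back-filled days later, if ever


def proxy_failure(t: TaskOutcome) -> bool:
    # Proxy definition of a bad task: named as a proxy, revisited as labels accumulate.
    return t.escalated or t.heavily_edited


def error_budget_burn(tasks: list[TaskOutcome], slo: float = 0.95) -> float:
    """Fraction of the error budget consumed, preferring ground truth when present."""
    bad = sum(
        t.ground_truth_bad if t.ground_truth_bad is not None else proxy_failure(t)
        for t in tasks
    )
    allowed = len(tasks) * (1 - slo)
    return bad / allowed if allowed else 0.0


tasks = [
    TaskOutcome(False, False),
    TaskOutcome(True, False),
    TaskOutcome(False, True, ground_truth_bad=False),  # proxy fired, label cleared it
]
print(f"budget burned: {error_budget_burn(tasks):.0%}")
```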

The Five-Surface Triage Tree: An AI On-Call Playbook for Pages That Don't Fit Your Runbook

· 12 min read
Tian Pan
Software Engineer

The page fires at 2:47 AM. The agent is sending wrong-tone replies to customer support tickets, the latency dashboard is flat, the error rate is normal, and there is nothing to roll back because nothing was deployed in the last twelve hours. The on-call engineer opens the runbook, scrolls past "restart the worker pool" and "scale the queue," reaches the bottom, and finds nothing that maps to the page in front of them. They start reading the system prompt at 3:04 AM. They are still reading it at 3:31 AM.

This is the new failure shape, and the rotation that was designed for "high latency means restart the pod, elevated 5xx means roll back the deploy, queue depth growing means scale the worker pool" is not equipped to handle it. The first instinct — roll back the deploy — is wrong because nothing was deployed: the model upgraded silently behind a versioned alias, a third-party tool's response shape drifted, the prompt version skewed across regions, or the eval set went stale weeks ago and the regression has been compounding the whole time. The page is real. The runbook is silent. AI on-call is its own discipline now, and trying to retrofit it into the existing rotation produces playbooks whose first step is silence on the call while everyone reads the prompt for the first time.
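The tree itself is in the post; as a sketch of its shape only, here is a hypothetical check order that assumes the five surfaces are infra, model version, prompt version, tools, and evals/data.

```python
# Hypothetical triage order; the post defines its own five surfaces.
def triage(signals: dict) -> str:
    if signals.get("infra_alerting"):          # latency, 5xx, saturation
        return "infra: follow the existing runbook"
    if signals.get("model_version_changed"):   # alias resolved to a new version
        return "model: pin the previous version and re-run the smoke evals"
    if signals.get("prompt_version_skew"):     # regions serving different prompts
        return "prompt: converge on the last evaluated revision"
    if signals.get("tool_schema_drift"):       # third-party response shape changed
        return "tools: diff a failing tool result against a recorded good one"
    return "data/evals: the eval set may be stale; sample live traffic and grade it"


print(triage({"model_version_changed": True}))
```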

The Coding Agent Autonomy Curve: Reading Is Free, Merging Is Incident-Class

· 11 min read
Tian Pan
Software Engineer

The discourse on coding agents keeps collapsing to a binary: autonomous or supervised, YOLO mode or hand-on-the-wheel, --dangerously-skip-permissions or "approve every keystroke." That framing is a category error. A coding agent does not perform "an action." It performs a sequence of actions whose costs span at least seven orders of magnitude — from reading a file (free, undoable, no side effect) to merging to main (irreversible without a revert PR) to rolling out a binary to a fleet (six-figure incident-class). Treating that range with one autonomy switch is like setting a single speed limit for both a parking lot and a freeway.

The team that ships "the agent can do everything" without mapping each action to its blast radius is one prompt-injection-bearing GitHub comment away from a postmortem — and we already have public examples of that exact failure mode. Anthropic's Claude Code Security Review, Google's Gemini CLI Action, and GitHub Copilot Agent were all confirmed in 2026 to be hijackable through specially crafted PR titles and issue bodies, in an attack pattern the researchers named "Comment and Control." The agents weren't broken in some abstract sense. They executed a high-tier action — pushing code, opening a PR — on the basis of a low-trust input the autonomy tier had silently flattened into "all the same."

What follows is the discipline that has to land: a per-action curve, gates that scale with the tier, rollback velocity matched to blast class, and an eval program that tests for tool-composition escalation rather than single-action failure.
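A minimal sketch of a per-action curve, assuming a hypothetical four-tier scheme; the post's tiers and gates may differ. The key property is that a low-trust input can never unlock a higher tier than its source warrants.

```python
# Sketch: map each action to a blast-radius tier and gate escalation on input trust.
from enum import IntEnum


class Tier(IntEnum):
    READ = 0          # read files, run tests locally: free, no side effect
    WORKSPACE = 1     # edit a branch, commit: cheap to undo
    SHARED = 2        # push, open PR, comment: visible to others, needs review
    IRREVERSIBLE = 3  # merge to main, deploy, delete data: incident-class


ACTION_TIERS = {
    "read_file": Tier.READ,
    "run_tests": Tier.READ,
    "edit_file": Tier.WORKSPACE,
    "open_pr": Tier.SHARED,
    "merge_pr": Tier.IRREVERSIBLE,
    "deploy": Tier.IRREVERSIBLE,
}


def allowed(action: str, input_trusted: bool, max_auto_tier: Tier) -> bool:
    tier = ACTION_TIERS.get(action, Tier.IRREVERSIBLE)  # unknown actions: worst case
    if not input_trusted and tier >= Tier.SHARED:
        return False  # low-trust input (PR comment, issue body) never escalates
    return tier <= max_auto_tier


# An instruction derived from an issue body cannot push code, whatever the agent's tier.
print(allowed("open_pr", input_trusted=False, max_auto_tier=Tier.SHARED))  # False
```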

GPU Capacity Planning When Demand Is a Cliff, Not a Curve

· 10 min read
Tian Pan
Software Engineer

The first time an agent platform falls over, the postmortem usually contains a sentence that reads something like: "We had eight weeks of headroom on Friday. By Monday afternoon, we were at 140% of provisioned capacity." Nobody is lying. The capacity model was correct; it was just applied to a workload it was never designed for. Classical capacity planning assumes demand grows along a smooth curve where weekly seasonality is the dominant signal and the worst case is a Black Friday you can plan against six months out. Agent workloads break that assumption hard.

The shape of agent demand is not a curve. It is a cliff. Three things produce the cliff, and they compound. A single enterprise customer onboarding can shift the baseline by 10x overnight on a contractual notice you've already signed. An agent loop can amplify a tiny increase in user activity into a fanout-multiplied surge that hits inference 30x harder than the user-facing graph suggests. A single product change — enabling tool use, lengthening context, switching to a larger model — can move per-task token consumption by an order of magnitude with no change in user count.

If your capacity planning is in QPS and your headroom budget is "75% utilization is healthy," you are not planning. You are gambling that none of those three cliffs lands in the same week.
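To make the arithmetic concrete, here is a back-of-envelope sketch with illustrative multipliers drawn from the three cliff drivers above; the numbers are assumptions, not measurements from the post.

```python
# Sketch: size capacity in tokens per second, not user-facing QPS.
def required_tokens_per_sec(user_tasks_per_sec: float,
                            fanout: float,            # agent calls per user task
                            tokens_per_call: float) -> float:
    return user_tasks_per_sec * fanout * tokens_per_call


baseline = required_tokens_per_sec(user_tasks_per_sec=5, fanout=4, tokens_per_call=2_000)

# Monday: one enterprise onboard (3x tasks), tool use enabled (fanout 4 -> 12),
# longer context (2k -> 6k tokens per call). The user-facing QPS graph only tripled.
monday = required_tokens_per_sec(user_tasks_per_sec=15, fanout=12, tokens_per_call=6_000)

print(f"baseline: {baseline:,.0f} tok/s, monday: {monday:,.0f} tok/s, "
      f"multiplier: {monday / baseline:.0f}x")  # 27x while QPS shows 3x
```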

Your SRE Postmortem Template Is Missing Six Fields That Decide Every LLM Incident

· 11 min read
Tian Pan
Software Engineer

The first time you run an LLM incident through a classic SRE postmortem template, the template wins and the incident loses. Timeline, contributing factors, mitigation, prevention — every field is filled in, every box ticked, and at the end of the document nobody can answer the only question that matters: which variable actually moved? Not the deploy event. Not the infra fault. Not the code change. The prompt revision, the model slice the router picked, the judge configuration scoring the eval that failed to fire, the retrieval index state that was serving when the quality complaints landed, the tool schema versions the planner was composing, the traffic mix that hit during the bad window. None of those have a row.

The SRE template wasn't designed for systems where the source of truth is an observed behavior rather than a code path. The variables that move silently in an LLM stack are the ones the template never had to enumerate. Borrowing the template anyway is what produces the "we don't know what changed" postmortem that files itself under "investigating" forever.
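As a sketch of what those rows might look like, here are the six variables named above expressed as template fields; the field names are illustrative and the post's template is richer.

```python
# Hypothetical extension of a classic postmortem skeleton with LLM-specific rows.
LLM_INCIDENT_FIELDS = {
    "prompt_revision":       "exact prompt version(s) serving during the bad window",
    "model_slice":           "model/version the router actually selected, by traffic share",
    "judge_config":          "eval judge model, rubric, and threshold that failed to fire",
    "retrieval_index_state": "index snapshot and freshness serving when complaints landed",
    "tool_schema_versions":  "schema version of every tool the planner was composing",
    "traffic_mix":           "task/intent distribution during the window vs. baseline",
}


def postmortem_skeleton(classic_fields: dict) -> dict:
    # Extend, don't replace: timeline and contributing-factors rows stay.
    return {**classic_fields, **{k: None for k in LLM_INCIDENT_FIELDS}}


print(list(postmortem_skeleton({"timeline": None, "contributing_factors": None})))
```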

Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding

· 11 min read
Tian Pan
Software Engineer

A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.

Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.

This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.
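On the agent side, the minimum viable fix is a retry policy that treats a shed response as back-pressure rather than as a routing hint. A sketch with hypothetical names: bounded retries, exponential backoff with jitter, honor Retry-After, and surface the failure to the planner as "wait" instead of "work around."

```python
# Sketch: an agent tool-layer wrapper that cooperates with load shedding.
import random
import time


class Overloaded(Exception):
    def __init__(self, retry_after: float | None = None):
        self.retry_after = retry_after


def call_with_backpressure(call, max_retries: int = 2, base_delay: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Overloaded as exc:
            if attempt == max_retries:
                # Hand the failure back to the planner; do not replan around it.
                raise
            delay = exc.retry_after or base_delay * (2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter breaks synchronized storms
```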

The Agent Paged Me at 3 AM: Blast-Radius Policy for Tools That Reach Humans

· 12 min read
Tian Pan
Software Engineer

The first time an agent pages your on-call four times in an hour because it's looping on a malformed alert signal, leadership learns something the security team already knew: "tool access" and "ability to create human work" were the same permission, and you granted it without either a safety review or a product-ownership review. Nobody owned the question of who's allowed to interrupt a human at 3 AM, because nobody framed it as a question. It was framed as a Slack integration.

The 2026 agent stack has made this failure mode cheap to reach. Anthropic's MCP servers, OpenAI's Agents SDK, and the whole class of vendor-shipped action tools have collapsed the distance between "the model decided to do a thing" and "a human got woken up." Most teams ship those integrations the same way they ship a database client: scope a token, drop in the SDK, write a system prompt, ship. The blast radius of a database client is a row count. The blast radius of a PagerDuty client is a person's sleep.
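A minimal sketch of what framing it as a question could look like: an interrupt budget in front of the paging tool, with hypothetical names and a deliberately tiny default.

```python
# Sketch: "reach a human" as a scarce, rate-limited permission, not a Slack integration.
import time
from collections import deque


class HumanInterruptBudget:
    def __init__(self, max_pages_per_hour: int = 1):
        self.max = max_pages_per_hour
        self.sent: deque[float] = deque()

    def allow(self) -> bool:
        now = time.time()
        while self.sent and now - self.sent[0] > 3600:
            self.sent.popleft()
        if len(self.sent) >= self.max:
            return False  # over budget: route to a queue a human reviews in the morning
        self.sent.append(now)
        return True


budget = HumanInterruptBudget(max_pages_per_hour=1)


def page_oncall(message: str) -> str:
    if not budget.allow():
        return f"suppressed, routed to ticket queue: {message}"
    return f"paged on-call: {message}"


print(page_oncall("disk full on db-3"))
print(page_oncall("disk full on db-3"))  # the 3 AM loop stops here
```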

Your AI Product Needs an SRE Before It Needs Another Model

· 9 min read
Tian Pan
Software Engineer

The sharpest pattern I see in struggling AI teams is the gap between how sophisticated their model stack is and how primitive their operations are. A team will run three frontier models in production behind custom routing logic, a RAG pipeline with eight retrieval stages, and an agent that calls twenty tools. They will also have no on-call rotation, no SLOs, no runbooks, and a #incidents Slack channel where prompts are hotfixed live by whoever happens to be awake. The product is operating on 2026 model infrastructure and 2012 operational infrastructure, and every week the gap costs them another outage.

The instinct when this hurts is to reach for the model lever. Quality dipped? Try the new release. Latency spiked? Switch providers. Hallucinations in production? Add another guardrail prompt. None of this fixes the underlying problem, which is that nobody owns the system's reliability as a discipline. What these teams actually need — usually before they need another applied scientist — is their first SRE.

AI Incident Response Playbooks: Why Your On-Call Runbook Doesn't Work for LLMs

· 10 min read
Tian Pan
Software Engineer

Your monitoring dashboard shows elevated latency, a small error rate spike, and then nothing. Users are already complaining in Slack. A quarter of your AI feature's responses are hallucinating in ways that look completely valid to your alerting system. By the time you find the cause — a six-word change to a prompt deployed two hours ago — you've had a slow-burn incident that your runbook never anticipated.

This is the defining challenge of operating AI systems in production. The failure modes are real, damaging, and invisible to conventional tooling. An LLM that silently hallucinates looks exactly like an LLM that's working correctly from the outside.
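A small sketch, with hypothetical names, of the cheapest mitigation implied here: record prompt revisions in the same change log the on-call already checks, so "nothing shipped in the last two hours" stops being the first wrong conclusion.

```python
# Sketch: prompt pushes land in the same change feed as deploys and config pushes.
import hashlib
import json
import time


def record_prompt_change(changelog: list[dict], name: str, text: str) -> None:
    changelog.append({
        "ts": time.time(),
        "kind": "prompt_change",  # sits next to "deploy" and "config_push"
        "prompt": name,
        "sha": hashlib.sha256(text.encode()).hexdigest()[:12],
    })


changelog: list[dict] = []
record_prompt_change(changelog, "support-triage", "You are a support agent. Be terse.")
print(json.dumps(changelog, indent=2))
```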