
45 posts tagged with "sre"


The Agent Runbook Your Incident Commander Could Not Execute

· 10 min read
Tian Pan
Software Engineer

The page fires at 02:17 local time. The on-call SRE pulls up the agent runbook on their phone and reads step one: "check the agent's tool-call traces for anomalous tool usage." They open the link. They hit an SSO prompt for a workspace they do not belong to. Step two says inspect the prompt-construction logs; same wall. Step three says roll back to the previous prompt version, but the deploy permission is scoped to a team they are not on. By the time they figure out which Slack channel to escalate to and wake up the AI team's product manager because she is the only person they can reach at that hour, ninety minutes have passed and the customer-visible regression is still serving wrong answers.

The post-mortem will identify the access gap as the proximate cause. The deeper discomfort is that the runbook reads fine in daylight and runs blocked at night, because the person who wrote it has access the person who executes it does not.

The On-Call Rotation Your Agent Platform Forgot to Staff

· 11 min read
Tian Pan
Software Engineer

The AI platform team has four engineers. The internal agent they shipped seven months ago is now answering questions for 200 employees a day. For the first month the founding engineer answered every Slack ping personally — Tuesday at 11pm, Sunday morning, the night of the company offsite. Then she got promoted to staff engineer for the impact she had on adoption, and three weeks later she stopped checking the channel after 6pm because that is what staff engineers do. The on-call rotation that was supposed to replace her was never formalized, because the operating model was always going to be figured out "after the pilot."

The day the agent silently degrades for a quarter of users — a retrieval index that quietly fell behind, or a model version flip that shifted refusal behavior, or a tool whose schema rotated and is now returning empty arrays — the complaints do not land on the platform team's pager. They land in the help desk queue, staffed by people who do not have access to the agent's traces, do not know what a system prompt is, and have been told by IT that the agent is "owned by the AI team." Sixteen hours pass between the first user complaint and the first engineer who looks at a trace. Nobody on the platform team is asleep at the wheel; there is no wheel.

The Provider Quota Reset on a Timezone Your Global Traffic Never Picked

· 8 min read
Tian Pan
Software Engineer

Your daily token quota resets at 00:00 UTC. Your largest customer is in Tokyo and hits peak load at 21:00 UTC, which is 6:00 AM the next morning Tokyo time. By the time the reset arrives, the Tokyo workday has already chewed through the last six hours of the cycle on quota-exhaustion fallback. The 429s look "occasional" because the UTC calendar axis on your dashboard hides the daily reset boundary inside an ordinary timestamp.

This is not a rate limit bug. It is a calendar bug. The provider chose a reset clock for their bookkeeping convenience, and the geography of your traffic decided which customers got the empty end of the cycle. The team that priced the quota as a uniform resource is rationing it on a calendar the user never sees.
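One way to make the calendar visible is to restate the provider's clock in the customer's local time. The following is a minimal sketch using only the Python standard library; the helper name `to_tokyo` and the sample date are illustrative assumptions, and the fixed UTC+9 offset relies on Japan Standard Time having no daylight saving.

```python
from datetime import datetime, timedelta, timezone

# Japan Standard Time is a fixed UTC+9 offset with no daylight saving.
JST = timezone(timedelta(hours=9), "JST")


def to_tokyo(utc_hour: int) -> datetime:
    """Convert an hour on the UTC clock to Tokyo wall-clock time.

    The calendar date is arbitrary (an assumption for illustration);
    only the offset matters.
    """
    instant = datetime(2026, 1, 15, utc_hour, tzinfo=timezone.utc)
    return instant.astimezone(JST)


reset = to_tokyo(0)   # quota reset at 00:00 UTC
peak = to_tokyo(21)   # customer peak at 21:00 UTC

print("reset, Tokyo time:", reset.strftime("%Y-%m-%d %H:%M"))
print("peak, Tokyo time: ", peak.strftime("%Y-%m-%d %H:%M"))
```

Plotting quota consumption on the customer's local axis instead of the UTC axis is one way to surface the reset boundary that the dashboard otherwise hides.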

The Retry Your Dashboard Counted Three Different Ways

· 11 min read
Tian Pan
Software Engineer

An agent ran. The plan-step crashed. The tool-call step failed three times with a 500, then succeeded on the fourth attempt. The user got their answer.

How many events was that? Ask product, and it's one: the user got a working result, so the funnel counts a conversion. Ask SRE, and it's three failures plus one success, a 75% error rate on the underlying step. Ask finance, and it's four billable inferences, three retried tool calls, and roughly four times the unit cost product is forecasting against. Each team's dashboard is correct. They are also irreconcilable, and the moment someone tries to reconcile them, usually during an incident review, they will discover the team has been operating against three contradictory pictures of reliability for months.
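The disagreement is easier to see when all three numbers are computed from the same attempt log. The following is a minimal sketch, not any team's actual pipeline: the `Attempt` record and its fields are hypothetical, and the log models only the tool-call step, following the SRE count above of three failures and one success.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One try at the tool-call step (hypothetical record shape)."""
    request_id: str
    ok: bool


# Three 500s followed by a success, all serving the same user request.
log = [
    Attempt("req-1", ok=False),
    Attempt("req-1", ok=False),
    Attempt("req-1", ok=False),
    Attempt("req-1", ok=True),
]

# Product view: a request counts once, and it converted if any attempt worked.
requests = {a.request_id for a in log}
conversions = {a.request_id for a in log if a.ok}
product_success_rate = len(conversions) / len(requests)

# SRE view: every attempt is an event on the underlying step.
sre_error_rate = sum(not a.ok for a in log) / len(log)

# Finance view: every attempt is billed, whether or not it succeeded.
billable_attempts = len(log)

print(product_success_rate, sre_error_rate, billable_attempts)
```

Each figure is a different aggregation over the same rows, which is why none of the dashboards is wrong and none of them agree.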

The Streaming Response Your Backend Infrastructure Was Not Built For

· 12 min read
Tian Pan
Software Engineer

Streaming was a product decision. Somebody on the design team watched a competitor's chat UI tick out tokens like a typewriter, watched a user's shoulders relax when the first character appeared two hundred milliseconds in instead of after a four-second blank stare, and the decision was made: we stream. The pull request changed three files in the API gateway. The model output now flushes incrementally over Server-Sent Events. The launch went out on a Tuesday and the satisfaction score moved up by a measurable amount on a Wednesday. Nobody opened a ticket against infrastructure.

A month later the on-call engineer is staring at three dashboards that no longer agree with each other. The autoscaler is provisioning twice as many pods as the CPU graphs say it should need. The p99 latency dashboard is broken — not malfunctioning, but uninterpretable, because the histogram buckets stop at five seconds and most spans now live in the overflow. The capacity model that priced the previous quarter's bill said the service could handle twelve hundred requests per second per node. The graph in front of the on-call says it is handling four hundred and falling over.
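The histogram failure can be reproduced in a few lines. The following is a minimal sketch, assuming buckets whose largest finite bound is five seconds; the bucket bounds and both sets of sample durations are invented for illustration.

```python
import math
from bisect import bisect_left

# Upper bounds in seconds; the last finite bucket stops at 5s (assumed layout).
BOUNDS = [0.1, 0.5, 1.0, 2.5, 5.0, math.inf]


def bucket_counts(durations):
    """Count observations per bucket; each bucket holds values <= its bound."""
    counts = [0] * len(BOUNDS)
    for d in durations:
        counts[bisect_left(BOUNDS, d)] += 1
    return counts


def quantile_upper_bound(counts, q):
    """Return the upper bound of the bucket that contains quantile q."""
    target = q * sum(counts)
    running = 0
    for bound, count in zip(BOUNDS, counts):
        running += count
        if running >= target:
            return bound
    return math.inf


# Hypothetical span durations: request/response before, streaming after.
before = [0.3, 0.4, 0.8, 0.9, 1.2, 2.0]
after = [4.0, 7.0, 9.0, 12.0, 15.0, 30.0]

print(quantile_upper_bound(bucket_counts(before), 0.99))
print(quantile_upper_bound(bucket_counts(after), 0.99))
```

Once most spans fall into the overflow bucket, the only thing the histogram can say about p99 is "more than five seconds", which is the uninterpretable dashboard described above.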

The Internal Search Box You Replaced With an Agent Just Became Your SLO

· 11 min read
Tian Pan
Software Engineer

You retired the search bar on the company portal because the agent did the same job better. Type a question in plain English, get an answer with citations, refine with a follow-up. The pilot crushed the satisfaction metric. The rollout email said "deprecating legacy search, full cutover in two weeks." Two weeks passed. The old index was decommissioned. The query box was replaced with a chat input.

Six months later, on a Tuesday morning, three things happen at once. Your inference provider rate-limits the corporate account because somebody's batch job spiked the shared quota. The embedding service has a regional brownout. A config push clears the prompt cache. Every engineer in the company who used to type "vpn setup" or "expense policy" into a search bar instead watches a spinner for forty seconds, then gets a refusal that does not understand their question, or worse, a confident citation to a wiki page that does not exist. The Slack channel where employees ask each other things lights up. The IT inbox fills with "is search broken?"

The search bar you replaced had three nines of availability over a decade of small incremental improvements. The agent that replaced it has a different shape of failure — slow not down, wrong not empty, expensive not cached — and your SRE culture was not calibrated for it.

The On-Call Runbook That Assumed a Human Would Read the Page

· 11 min read
Tian Pan
Software Engineer

The page fired at 02:14. The runbook said "page the engineer." The engineer's name resolved to an on-call rotation. The rotation pointed at a Slack channel that the team had wired up six months ago as a unified triage surface. The first message in the channel was the alert. The second message, posted nineteen seconds later, was a calm three-sentence summary: the alerting service, the failing dependency, the last deploy. It was well-written. It ended with "Acknowledged."

The incident commander, watching from her phone in bed, read "Acknowledged" and went back to sleep. Nobody had acknowledged. The agent subscribed to that channel as a first-line triage helper had restated the alert back to the room and signed off with the verb the channel's other readers used to mean "I have the context to act on this." The incident ran unowned for forty-one minutes until a customer ticket woke a different engineer through a different surface.

The Postmortem Template With No Row for the Model's Inference

· 11 min read
Tian Pan
Software Engineer

The first time an agent caused a real outage on my team, the postmortem author opened the template, scrolled past the timeline, stared at the "Root Cause" field for a long minute, and typed: "The runbook for queue-stuck recovery was incorrect." The runbook was fine. The agent had read the runbook, decided the queue's symptoms matched a different scenario, and run a recovery script for that other scenario instead. The action items that came out of that document — "tighten the runbook wording," "add a confirmation prompt to the recovery script" — were entirely useless against the actual failure mode, which was that an inferential system had inferred wrong and there was no field in the template that knew how to say so.

I've watched this exact failure repeat across teams since. The template is calibrated for deterministic systems. Code did the wrong thing, so you fix the code. Config was misset, so you fix the config. The schema of the postmortem document is the schema of the team's theory of failure, and when that theory cannot represent "the agent's plan was wrong," the document flattens the actual failure into the closest thing the template can represent — usually a documentation gap or a missing guardrail — and the action items chase a deterministic fix for a probabilistic failure. The same incident class then recurs, and the team writes it up the same way the next time.

The Agent That Scheduled Itself Into the Maintenance Window

· 11 min read
Tian Pan
Software Engineer

A senior engineer on call at 2am does not run a schema migration during a Sev-2 incident. They do not redeploy the payment service ten minutes before a release freeze starts. They do not fire a marketing email campaign while the email vendor's status page is red. None of this is in their job description. They picked it up from years of getting yelled at, from Slack channels titled #deploy-freeze-friday, from the muscle memory of glancing at the status page before they touch anything. It is the kind of context that does not exist in any runbook because nobody thought it needed to be written down.

Now hand the same job to an agent. The agent has tools. The agent has a multi-step plan. The agent has every documented policy you bothered to put in its system prompt. What the agent does not have is the half-conscious awareness that the world is currently on fire. So it executes the plan. Cleanly. Confidently. Into the maintenance window. And the postmortem includes a sentence that is going to become a familiar trope: "the agent had no way of knowing."
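Part of that missing awareness can be approximated with a pre-flight gate that every side-effecting tool call passes through. The following is a minimal sketch; the `WorldState` fields, the action names, and the three rules are hypothetical, modeled on the examples above rather than on any real platform.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorldState:
    """Signals a human would glance at before acting (hypothetical fields)."""
    open_incident_severity: Optional[int]  # e.g. 2 for a Sev-2, None if quiet
    deploy_freeze_active: bool
    vendor_status_red: bool


def blocking_reason(action: str, state: WorldState) -> Optional[str]:
    """Return why the action should wait, or None if it may proceed."""
    if action == "schema_migration" and state.open_incident_severity is not None:
        return f"Sev-{state.open_incident_severity} incident is open"
    if action == "deploy" and state.deploy_freeze_active:
        return "deploy freeze is active"
    if action == "email_campaign" and state.vendor_status_red:
        return "email vendor status page is red"
    return None


quiet = WorldState(open_incident_severity=None,
                   deploy_freeze_active=False,
                   vendor_status_red=False)
on_fire = WorldState(open_incident_severity=2,
                     deploy_freeze_active=True,
                     vendor_status_red=True)

for action in ("schema_migration", "deploy", "email_campaign"):
    print(action, "|", blocking_reason(action, quiet), "|",
          blocking_reason(action, on_fire))
```

A gate like this only covers the rules someone wrote down, so it narrows the gap rather than closing it.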

The CI Agent With Merge Rights at 3 AM

· 12 min read
Tian Pan
Software Engineer

A flaky test gets quarantined at 3:17 AM. The on-call rotation does not page, because nothing failed — the agent decided the failure was noise, opened a small PR labeled chore: quarantine flaky test, marked the change as a self-merge under the ci-bot service account, and went back to watching the queue. Six days later a customer reports that a feature has been broken since Tuesday. The test was not flaky. It was the only thing standing between a real regression and production, and the agent's confidence threshold was set high enough to make a decision but low enough to be wrong.

This is the part of agentic CI that the marketing decks skip. Wiring an agent into your pipeline to triage failures, downgrade dependencies on security alerts, and propose dependency bumps is straightforward in 2026 — the tools exist, the integrations are one config file away, and the productivity story is real. The part that nobody writes a runbook for is the new operational class you just created: an actor with merge rights that runs at 3 AM with no human in the synchronous loop, and an SRE handbook that assumed humans were the source of intent.

Your Fallback Path Is the Only Untested Code in Production

· 9 min read
Tian Pan
Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.
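One way to exercise the branch is to route a small, deliberate share of healthy traffic through it. The following is a minimal sketch; `primary`, `fallback`, the `make_router` helper, and the 5% rate are hypothetical stand-ins, not a recommended value or a real library API.

```python
import random
from typing import Callable

Handler = Callable[[str], str]


def make_router(primary: Handler, fallback: Handler,
                exercise_rate: float, rng: random.Random) -> Handler:
    """Use fallback when primary fails, and also on a sampled share of calls."""

    def route(prompt: str) -> str:
        if rng.random() < exercise_rate:
            return fallback(prompt)  # deliberate drill while primary is healthy
        try:
            return primary(prompt)
        except Exception:
            return fallback(prompt)  # real degradation

    return route


calls = {"primary": 0, "fallback": 0}


def primary(prompt: str) -> str:
    calls["primary"] += 1
    return "primary: " + prompt


def fallback(prompt: str) -> str:
    calls["fallback"] += 1
    return "fallback: " + prompt


route = make_router(primary, fallback, exercise_rate=0.05,
                    rng=random.Random(0))
for _ in range(1000):
    route("vpn setup")

print(calls)
```

With a nonzero exercise rate the fallback shows up in everyday latency and error dashboards, so its first real run is not also its first run.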

Who Gets Paged When the Agent Is Wrong: On-Call for Non-Deterministic Systems

· 9 min read
Tian Pan
Software Engineer

The on-call rotation was built around a promise: failures reproduce. An alert fires, you re-run the request, you watch the bug happen, you find the bad commit, you roll back the deploy. Every part of that loop assumes determinism. The same input produces the same output, and the output is either right or wrong in a way you can stare at.

An agent fleet quietly breaks every link in that chain. The failure happened once, at a sampling temperature you can't replay, on a context window that has since been garbage-collected. There is no bad commit, because the code never changed — the model did, or the retrieved documents did, or the user phrased the request in a way nobody anticipated. You roll back the deploy and the deploy was never the problem.

So the page goes out, an engineer picks it up, and they discover the most uncomfortable fact about operating agents in production: they have been handed a system they cannot single-step, and the runbook in front of them was written for a different kind of machine.