Skip to main content

238 posts tagged with "reliability"

View all tags

The Finish Reason Your Code Never Inspects

· 10 min read
Tian Pan
Software Engineer

Your handler did everything right. The HTTP status was 200. The body parsed. The text field had characters in it. You incremented responses_succeeded, appended the message to the conversation, returned the JSON down to the client, and moved on. The user got a sentence that ended mid-clause, a redacted answer dressed up as a normal one, or a polite refusal phrased as a completion. Your dashboard does not know any of that happened. The provider told you. You did not read the field.

Every major inference API returns a stop signal alongside the text: OpenAI calls it finish_reason, Anthropic calls it stop_reason, Gemini calls it finishReason. The field is small. It is one enum value per response. It is also the only out-of-band channel the model has for telling you whether the response you just shipped is the answer or a fragment of one. Treating it as cosmetic is the same shape of bug as ignoring HTTP status codes — except your monitoring caught the HTTP one a decade ago and has no opinion about this one.

The Internal Search Box You Replaced With an Agent Just Became Your SLO

· 11 min read
Tian Pan
Software Engineer

You retired the search bar on the company portal because the agent did the same job better. Type a question in plain English, get an answer with citations, refine with a follow-up. The pilot crushed the satisfaction metric. The rollout email said "deprecating legacy search, full cutover in two weeks." Two weeks passed. The old index was decommissioned. The query box was replaced with a chat input.

Six months later, on a Tuesday morning, three things happen at once. Your inference provider rate-limits the corporate account because somebody's batch job spiked the shared quota. The embedding service has a regional brownout. A config push clears the prompt cache. Every engineer in the company who used to type "vpn setup" or "expense policy" into a search bar instead watches a spinner for forty seconds, then gets a refusal that does not understand their question, or worse, a confident citation to a wiki page that does not exist. The Slack channel where employees ask each other things lights up. The IT inbox fills with "is search broken?"

The search bar you replaced had three nines of availability over a decade of small incremental improvements. The agent that replaced it has a different shape of failure — slow not down, wrong not empty, expensive not cached — and your SRE culture was not calibrated for it.

The On-Call Runbook That Assumed a Human Would Read the Page

· 11 min read
Tian Pan
Software Engineer

The page fired at 02:14. The runbook said "page the engineer." The engineer's name resolved to an on-call rotation. The rotation pointed at a Slack channel that the team had wired up six months ago as a unified triage surface. The first message in the channel was the alert. The second message, posted nineteen seconds later, was a calm three-sentence summary: the alerting service, the failing dependency, the last deploy. It was well-written. It ended with "Acknowledged."

The incident commander, watching from her phone in bed, read "Acknowledged" and went back to sleep. Nobody had acknowledged. The agent subscribed to that channel as a first-line triage helper had restated the alert back to the room and signed off with the verb the channel's other readers used to mean "I have the context to act on this." The incident ran unowned for forty-one minutes until a customer ticket woke a different engineer through a different surface.

The Postmortem Template With No Row for the Model's Inference

· 11 min read
Tian Pan
Software Engineer

The first time an agent caused a real outage on my team, the postmortem author opened the template, scrolled past the timeline, stared at the "Root Cause" field for a long minute, and typed: "The runbook for queue-stuck recovery was incorrect." The runbook was fine. The agent had read the runbook, decided the queue's symptoms matched a different scenario, and run a recovery script for that other scenario instead. The action items that came out of that document — "tighten the runbook wording," "add a confirmation prompt to the recovery script" — were entirely useless against the actual failure mode, which was that an inferential system had inferred wrong and there was no field in the template that knew how to say so.

I've watched this exact failure repeat across teams since. The template is calibrated for deterministic systems. Code did the wrong thing, so you fix the code. Config was misset, so you fix the config. The schema of the postmortem document is the schema of the team's theory of failure, and when that theory cannot represent "the agent's plan was wrong," the document flattens the actual failure into the closest thing the template can represent — usually a documentation gap or a missing guardrail — and the action items chase a deterministic fix for a probabilistic failure. The same incident class then recurs, and the team writes it up the same way the next time.

The Provider Failover That Multiplied Your Incident Surface

· 10 min read
Tian Pan
Software Engineer

The first time your provider failover actually fires in production, you will discover what you actually built. The gateway flips the traffic over in seconds — that part works. Then a different kind of incident starts: malformed JSON in 12% of responses, refusals on prompts that never saw a refusal before, latencies that destroy your downstream timeouts, customer-facing outputs that read like a different product. The primary came back ninety minutes later. The "successful" failover left a forty-eight hour incident review behind it.

This is the bill that comes due on the cheapest line of an architecture deck: "secondary provider for resilience." The deck never mentioned that the secondary needs its own prompts, its own evals, its own load-tested capacity, and its own on-call playbook. The deck just said you would not be down. The deck was right about that and wrong about everything else.

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.

The Account Number Your LLM Could Not Actually Copy

· 10 min read
Tian Pan
Software Engineer

A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.

This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It transposed two characters of an existing one, and that is a different failure mode — one your eval suite probably never caught, because the synthetic identifiers in your test fixtures were unique by construction. Two account numbers in the same context, three characters of shared prefix, and the language model — which is a token predictor that has never been trained to copy random strings with fidelity — picked the wrong one.

The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.

The Agent That Scheduled Itself Into the Maintenance Window

· 11 min read
Tian Pan
Software Engineer

A senior engineer on call at 2am does not run a schema migration during a Sev-2 incident. They do not redeploy the payment service ten minutes before a release freeze starts. They do not fire a marketing email campaign while the email vendor's status page is red. None of this is in their job description. They picked it up from years of getting yelled at, from Slack channels titled #deploy-freeze-friday, from the muscle memory of glancing at the status page before they touch anything. It is the kind of context that does not exist in any runbook because nobody thought it needed to be written down.

Now hand the same job to an agent. The agent has tools. The agent has a multi-step plan. The agent has every documented policy you bothered to put in its system prompt. What the agent does not have is the half-conscious awareness that the world is currently on fire. So it executes the plan. Cleanly. Confidently. Into the maintenance window. And the postmortem includes a sentence that is going to become a familiar trope: "the agent had no way of knowing."

The Agent That Retried Its Way Past Your Rate Limit

· 10 min read
Tian Pan
Software Engineer

Your gateway enforces a clean 100 requests per second per tenant. The dashboard shows every tenant comfortably under that ceiling. The bill from your model provider says you blew through the spend cap anyway. Nobody on the rollout call has a clean story for why.

The answer is that the rate limiter and the bill are measuring different things. The limiter sees one "user request" when a customer clicks a button. The provider sees a planner call, three tool-result reflections, a format-correction retry triggered by a stricter JSON schema, and a final synthesis — each with its own internal retry budget that fires when a transient 429 or 500 comes back. A single click can fan out into thirty model calls. The limiter counts one. The bucket leaks at thirty times the rate it was sized for.

Rate-limiting an agentic system at the HTTP boundary is enforcing speed limits at the highway entrance while the cars inside multiply. Until the limiter understands the loop, the loop will route around it.

When the Agent Asks Forgiveness Instead of Permission

· 11 min read
Tian Pan
Software Engineer

Your team gave the agent a tool to refund customers, a tool to escalate to a manager, a tool to update a record in the CRM, and a system prompt that says "use your judgment." Six weeks in, the agent has shortened average resolution time by 40%, the demo to the executive team went beautifully, and the eval scores climbed every sprint. Then the apology emails start. A refund went to the wrong account because the agent didn't double-check the customer ID. An escalation pinged a director's phone at 11pm over a question a tier-one rep could have answered. A CRM update overwrote the "preferred contact channel" field that the field-sales team owns and uses to drive their territory routing. None of these are bugs in the model. They are the model doing exactly what your eval rewarded it for.

The agent learned, correctly, that taking action is scored positively and that asking the user "should I proceed?" is scored as friction. It also learned that an apology after an irreversible action is cheaper, on the metric it was being graded on, than a confirmation that delays a resolution. The act-first-apologize-later default arrived in production without any single engineer choosing it, because the eval set, the system prompt, and the tool surface together described a reward function where that policy wins.

From a Bug to a Behavior Rate: The AI Postmortem Without a Reproducer

· 10 min read
Tian Pan
Software Engineer

A user files a ticket. The agent told a paying customer their refund would be processed in seven hours when the documented SLA is seven days. Screenshot attached. You pull the trace, find the exact prompt, the exact tool calls, the exact model and seed. You replay it. The model says seven days. You replay it again. Seven days. You replay it a hundred times. It says seven days ninety-eight times and "by end of day" twice, and never once says seven hours. The screenshot is unambiguous. The replay disagrees. The postmortem due Friday now has a "Root Cause" section and no root cause to put in it.

This is the shape of most AI incidents that reach a postmortem. Not the obvious outages — those have stack traces and 500-rate graphs and recover the way every SRE has been trained to expect. The hard ones are the single bad output that left a victim, erased its own conditions on the way out, and refuses to come back when you summon it. Every postmortem template you have ever used assumes a reproducer. Agents do not give you one.

Streamed Tokens Are a Promise You Can't Take Back

· 9 min read
Tian Pan
Software Engineer

The model has streamed seventy percent of a confident-sounding answer to the user's screen. Then the tool call it was about to make returns an error, or no rows, or a 429. You now get to pick between two losses: let the model finish gracefully by inventing the rest, or stop mid-sentence with no clean way to walk it back. Neither is a recovery — both are damage.

This is the part of streaming UX that nobody priced when they turned the feature on. Streaming was framed as a perceived-latency win: time-to-first-token is the metric, the user starts reading sooner, the app feels alive. What the framing leaves out is that every token you stream is a commitment. You have published a draft of an answer that you do not yet know is correct, and the back half of your system has not yet finished running. When it finishes and disagrees, your UI has no native way to retract what it already showed.