Agent SLOs Without Ground Truth: An Error Budget for Outputs You Can't Grade in Real Time
Your agent platform has met its 99.9% "response success" SLO every quarter for a year. Tickets are up 40%. Retention on the agent-touched cohort is down. The on-call rotation is bored, the product manager is panicking, and the executive review keeps asking why the dashboard says everything is fine while the support queue says everything is on fire. The dashboard isn't lying. It's just measuring the wrong thing — because the SRE who wrote the SLO defined success as "the model API returned 200," and that was the only definition of success the telemetry could express in the first place.
This is the central problem of agent reliability engineering: the success signal is not a status code. It is a judgment about whether the agent did the right thing for a specific task, and that judgment is unavailable at request time, often unavailable at session time, and sometimes only resolvable days later when the user files a ticket, edits the output, or quietly stops coming back. You cannot put a 200-vs-500 boolean on a column that doesn't exist yet.
The reflex is to wait for ground truth before declaring an SLO. This is wrong. Reliability does not pause while you build a labeling pipeline. The right move is to write an error budget against proxies you know are imperfect, name them as proxies, set the policy that governs how the team responds when they trip, and back-fill ground truth into the calculation as you produce it. This post is about how to do that without lying to yourself.
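To make the back-fill idea concrete, here is a minimal sketch in Python. All names here (`TaskOutcome`, `error_budget_burn`, the `proxy_ok` heuristic) are hypothetical, not a real library: the point is only that each task carries a proxy verdict available immediately and an optional ground-truth label that arrives later, and the budget calculation prefers ground truth wherever it exists.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for one agent task: the proxy verdict is available at
# request time; the ground-truth label is back-filled days later, if ever.
@dataclass
class TaskOutcome:
    task_id: str
    proxy_ok: bool                            # imperfect proxy, e.g. "no retry, no user edit"
    ground_truth_ok: Optional[bool] = None    # back-filled later; None = not yet labeled

def error_budget_burn(outcomes: list[TaskOutcome], slo_target: float) -> dict:
    """Compute budget burn, using ground truth wherever it exists and the
    proxy verdict everywhere else."""
    failures = sum(
        1 for o in outcomes
        if not (o.ground_truth_ok if o.ground_truth_ok is not None else o.proxy_ok)
    )
    total = len(outcomes)
    observed = 1 - failures / total if total else 1.0
    budget = 1 - slo_target                   # allowed failure rate, e.g. 0.05 for a 95% SLO
    burn = (1 - observed) / budget if budget else 0.0
    labeled = sum(o.ground_truth_ok is not None for o in outcomes)
    return {
        "observed": observed,
        "burn": burn,                          # >1.0 means the budget is exhausted
        "labeled_fraction": labeled / total,   # how much of the number is still proxy-based
    }

outcomes = [
    TaskOutcome("t1", proxy_ok=True),
    TaskOutcome("t2", proxy_ok=True, ground_truth_ok=False),  # proxy missed a real failure
    TaskOutcome("t3", proxy_ok=False),
    TaskOutcome("t4", proxy_ok=True, ground_truth_ok=True),
]
print(error_budget_burn(outcomes, slo_target=0.95))
```

Reporting `labeled_fraction` alongside the burn rate is the "name them as proxies" discipline in code form: anyone reading the number can see how much of it is still an estimate rather than a graded outcome.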
