159 posts tagged with "reliability"

The Demo Was a Single Seed: Why Your AI Rollout Is a Variance Problem, Not a Polish Problem

· 11 min read
Tian Pan
Software Engineer

The exec demo went perfectly. The model answered the curated question, the agent completed the workflow, the screen recording is saved on the company drive, and the launch date is now in the calendar. Six weeks later the rollout craters and the post-mortem narrative writes itself: the model needed more polish, the prompt needed more iteration, the team underestimated the work between prototype and production.

That narrative is wrong, and it's expensive, because it sends the team back to do more of the work that already failed. The demo wasn't an under-polished version of production. It was a single sample from a distribution the team never measured. The wow moment was one realization out of thousands the model would generate against the same input, and the team shipped the best one as if it were the typical one. The gap between demo and prod isn't quality drift. It's variance the team hadn't yet seen.

This reframing matters because the fix for a variance problem looks nothing like the fix for a polish problem. Polish says "iterate the prompt, tune the model, hire a better PM." Variance says "you don't know what you have until you sample it n times across the input distribution." The two diagnoses produce different roadmaps, different budgets, and different incident patterns. The teams that ship reliably in 2026 know which problem they have.
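A minimal sketch of what "sample it n times across the input distribution" looks like in practice. Everything here is illustrative: `run_model` is a stand-in for a real model call plus a pass/fail grader, and the task list stands in for your actual input distribution.

```python
import random
import statistics

def run_model(task: str) -> bool:
    """Stand-in for one sampled generation plus a pass/fail check.
    Replace with a real model call and a real grader."""
    return random.random() < 0.72  # pretend per-task pass probability

def pass_rate(task: str, n: int = 50) -> float:
    """Estimate the per-task pass rate by sampling n times, not once."""
    return sum(run_model(task) for _ in range(n)) / n

tasks = [f"task-{i}" for i in range(20)]   # a slice of the input distribution
rates = [pass_rate(t) for t in tasks]

# The demo showed you something like max(rates) on a single draw.
# Production serves the whole list, every day.
print(f"best task: {max(rates):.0%}  worst: {min(rates):.0%}  "
      f"mean: {statistics.mean(rates):.0%}  stdev: {statistics.stdev(rates):.2f}")
```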

Cross-Team Agent SLAs Don't Compose: The 99% Math Your Org Forgot to Budget

· 11 min read
Tian Pan
Software Engineer

Team A's agent advertises a 99% success rate. Team B's agent advertises 99%. The new joint workflow that calls both lands at 98% on a good day, 96% on a bad one — and the team that owns the joint workflow is now the de facto SRE for two systems they don't own, can't reproduce locally, and didn't write the eval set for. Each upstream team is hitting its SLO. The composite product is missing its SLO. Nobody's pager is ringing on the right side of the boundary.

This is the math of independent failure rates, and it has been hiding in plain sight ever since the org started letting agents call each other. Five components at 99% reliability give you 95% end-to-end. Ten components give you 90%. A 20-step process at 95% per-step succeeds 36% of the time — more than half of operations fail before completion. By the time a workflow chains 50 components — not unusual once an enterprise agent starts calling sub-agents that call tool agents — a system where every individual piece is "99% reliable" will fail roughly four out of ten requests.
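The arithmetic in that paragraph fits in a few lines; the sketch below just reproduces it, assuming independent per-step failures.

```python
def end_to_end(per_step: float, steps: int) -> float:
    """Probability a chain succeeds when every step must succeed independently."""
    return per_step ** steps

for per_step, steps in [(0.99, 5), (0.99, 10), (0.95, 20), (0.99, 50)]:
    p = end_to_end(per_step, steps)
    print(f"{steps:>2} steps at {per_step:.0%} each -> {p:.0%} end-to-end "
          f"({1 - p:.0%} of requests fail)")
```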

Researchers analyzing five popular multi-agent frameworks across more than 150 tasks identified failure rates between 41% and 87%, with the top three failures being step repetition, reasoning–action mismatch, and unawareness of termination conditions — and unstructured multi-agent networks have been observed to amplify errors up to 17× compared to single-agent baselines. The math isn't subtle. The problem is that the org's SLO sheets, dashboards, on-call rotations, and PRDs are still scoped one agent at a time.

The Human Attention Budget Is the Constraint Your HITL System Silently Overspends

· 10 min read
Tian Pan
Software Engineer

The 50th decision your reviewer makes this morning is not the same quality as the first. The architecture diagram does not show this. The capacity model does not show this. The dashboard tracking "approvals per hour" actively hides it. And yet the entire premise of your human-in-the-loop system — that a person catches what the model gets wrong — is silently degrading from the moment the queue begins to fill.

Most HITL designs treat reviewer time as an infinite, fungible resource. The team sets a confidence threshold, routes everything below it to a human queue, and declares the system "safe." Six weeks later, the approval rate has crept up to 96%, the queue is twice as deep as the staffing model assumed, and a sample audit shows that reviewers are clicking "approve" on edge cases they would have flagged on day one. The system has not failed. It has rubber-stamped its way into looking like it is working.
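One hedged illustration of the instrumentation this implies: keep the standard confidence-threshold routing, but also bucket review decisions by how deep into the shift they were made, so the fatigue slope is measurable instead of averaged away. The event format and threshold value here are hypothetical.

```python
from collections import defaultdict

def route(confidence: float, threshold: float = 0.85) -> str:
    """The standard HITL pattern: everything below the threshold goes to a human."""
    return "auto_approve" if confidence >= threshold else "human_review"

def approval_rate_by_depth(decisions, bucket: int = 25) -> dict:
    """decisions: iterable of (position_in_shift, approved) review-queue events.
    Bucketing by position surfaces the slope that 'approvals per hour' hides."""
    buckets = defaultdict(list)
    for position, approved in decisions:
        buckets[(position - 1) // bucket].append(approved)
    return {f"decisions {b * bucket + 1}-{(b + 1) * bucket}": sum(v) / len(v)
            for b, v in sorted(buckets.items())}

# Hypothetical day of review events; in production these come from the queue's audit log.
events = [(i, i > 10 and i % 7 != 0) for i in range(1, 101)]
print(approval_rate_by_depth(events))
```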

The 70% Reliability Uncanny Valley: Where AI Features Go to Lose User Trust

· 12 min read
Tian Pan
Software Engineer

A feature that fails 70% of the time is harmless. The user learns within a week that they have to verify every output, treats the system as an unreliable assistant, and adjusts. A feature that succeeds 70% of the time is worse than that. It is right often enough that the user stops verifying, and wrong often enough that the failures are concentrated, visible, and personal. The user's mental model collapses into "I cannot tell when to trust this" — which, as a product experience, is strictly worse than "I know not to trust this."

This is the 70% uncanny valley, and it is where most AI features built in the last two years live. The team measures aggregate accuracy, watches the number cross some "good enough" threshold, and ships. The realized user experience does not improve monotonically with that number. Between roughly 60% and 85% accuracy, the product gets worse as it gets more accurate, because the cost of a wrong answer the user did not think to check exceeds the value of a right answer they no longer have to verify.
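A toy model makes the non-monotonic claim concrete. Every number below is invented, including the trust curve; the point is the shape of the argument, not the exact boundaries.

```python
def expected_value_per_answer(acc: float) -> float:
    """Toy model, all numbers assumed: a correct answer is worth 2.0, a manual
    check costs 1.0, an unchecked wrong answer costs 12.0, and the user's
    willingness to skip verification climbs from 0 at 50% accuracy to 1 at 90%."""
    p_skip = min(max((acc - 0.5) / 0.4, 0.0), 1.0)    # assumed trust curve
    value_delivered = acc * 2.0                        # correct answers deliver value
    checking_cost = (1 - p_skip) * 1.0                 # checks the user still performs
    silent_damage = p_skip * (1 - acc) * 12.0          # wrong answers nobody caught
    return value_delivered - checking_cost - silent_damage

for acc in (0.55, 0.65, 0.70, 0.80, 0.90, 0.95):
    print(f"{acc:.0%} accurate -> {expected_value_per_answer(acc):+.2f} per answer")
# The number dips from -0.45 at 55% to -0.90 around 65-70% before recovering:
# more accurate, worse experience. That dip is the valley.
```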

The team that ships at 70% without designing for the predictability problem is not shipping a worse version of a 95% product. They are shipping a different product entirely: one whose primary failure mode is silent.

Agent Memory Drift: Why Reconciliation Is the Loop You're Missing

· 11 min read
Tian Pan
Software Engineer

The most dangerous thing your long-running agent does is also the thing it does most confidently: answer from memory. The customer's address changed last Tuesday. The ticket the agent thinks is "open" was closed yesterday by a human. The product feature the agent has tidy explanatory notes about shipped in a different shape than the spec the agent read three weeks ago. None of this is hallucination in the textbook sense — the model is recalling exactly what it stored. The world simply moved while the agent was looking elsewhere.

Most teams treat memory like a write problem: what should the agent remember, how do we summarize, what's the embedding strategy, how do we keep the store from blowing up. That framing produces architectures that grow more confident as they grow more wrong. The harder problem — the one that determines whether your agent stays useful past week three — is reconciliation: the explicit, ongoing loop that compares what the agent thinks is true against what the underlying systems say is true right now.
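A minimal sketch of that reconciliation loop, assuming a flat memory store and a queryable system of record. The dict standing in for the CRM and ticket tracker is a placeholder for real API calls.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Stand-in for the real systems of record (CRM, ticket tracker, ...).
# In production this is an API call, not a dict lookup.
SYSTEM_OF_RECORD = {
    ("crm", "customer:42:address"): "12 New Elm St",
    ("tickets", "ticket:981:status"): "closed",
}

@dataclass
class MemoryEntry:
    source: str                      # which system can confirm this belief
    key: str                         # e.g. "customer:42:address"
    value: str                       # what the agent currently believes
    checked_at: datetime | None = None

def reconcile(store: list[MemoryEntry]) -> list[str]:
    """The loop the write-centric framing skips: periodically re-ground every
    belief against the system that owns the truth, and overwrite on mismatch."""
    corrections = []
    now = datetime.now(timezone.utc)
    for entry in store:
        actual = SYSTEM_OF_RECORD.get((entry.source, entry.key), entry.value)
        if actual != entry.value:
            corrections.append(f"{entry.key}: {entry.value!r} -> {actual!r}")
            entry.value = actual     # correct the belief instead of trusting recall
        entry.checked_at = now       # either way, the belief is freshly grounded
    return corrections

memory = [
    MemoryEntry("crm", "customer:42:address", "98 Old Oak Ave"),
    MemoryEntry("tickets", "ticket:981:status", "open"),
]
print(reconcile(memory))             # both beliefs were stale; both get corrected
```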

Agent SLOs Without Ground Truth: An Error Budget for Outputs You Can't Grade in Real Time

· 11 min read
Tian Pan
Software Engineer

Your agent platform has met its 99.9% "response success" SLO every quarter for a year. Tickets are up 40%. Retention on the agent-touched cohort is down. The on-call rotation is bored, the product manager is panicking, and the executive review keeps asking why the dashboard says everything is fine while the support queue says everything is on fire. The dashboard isn't lying. It's just measuring the wrong thing — because the SRE who wrote the SLO defined success as "the model API returned 200," and that was the only definition of success the telemetry could express in the first place.

This is the central problem of agent reliability engineering: the success signal is not a status code. It is a judgment about whether the agent did the right thing for a specific task, and that judgment is unavailable at request time, often unavailable at session time, and sometimes only resolvable days later when the user files a ticket, edits the output, or quietly stops coming back. You cannot put a 200-vs-500 boolean on a column that doesn't exist yet.

The reflex is to wait for ground truth before declaring an SLO. This is wrong. Reliability does not pause while you build a labeling pipeline. The right move is to write an error budget against proxies you know are imperfect, name them as proxies, set the policy that governs how the team responds when they trip, and back-fill ground truth into the calculation as you produce it. This post is about how to do that without lying to yourself.
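A sketch of what "an error budget against named proxies, back-filled with ground truth" can look like. The proxy signals and the SLO target are examples, not a recommendation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskRecord:
    # Proxy signals available at or near request time.
    user_edited_output: bool
    user_retried: bool
    filed_ticket: bool
    # Ground truth, back-filled later by labeling; None until then.
    labeled_success: Optional[bool] = None

def inferred_failure(r: TaskRecord) -> bool:
    """Ground truth wins when we have it; otherwise fall back to the named proxies."""
    if r.labeled_success is not None:
        return not r.labeled_success
    return r.user_edited_output or r.user_retried or r.filed_ticket

def budget_burn(records: list[TaskRecord], slo: float = 0.95) -> float:
    """Fraction of the error budget consumed this window (>1.0 means blown)."""
    failures = sum(inferred_failure(r) for r in records)
    allowed = (1 - slo) * len(records)
    return failures / allowed if allowed else float("inf")

records = [
    TaskRecord(user_edited_output=False, user_retried=False, filed_ticket=False),
    TaskRecord(user_edited_output=True, user_retried=False, filed_ticket=False),
    TaskRecord(user_edited_output=False, user_retried=False, filed_ticket=False,
               labeled_success=True),    # back-filled label overrides the proxies
]
print(f"error budget consumed: {budget_burn(records):.0%}")
```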

Cancel-Safe Agents: The Side Effects Your Stop Button Already Shipped

· 11 min read
Tian Pan
Software Engineer

A user clicks Stop because the agent misunderstood the request. The UI flashes "stopped." By the time the spinner disappears, the agent has already sent two emails, scheduled a Tuesday meeting on the user's calendar, opened a draft pull request against the wrong branch, and queued a Slack message that is racing the cancellation signal through the tool layer. The model has obediently stopped generating tokens. The world has not stopped reacting to the tokens it generated thirty seconds ago.

This is the failure mode nobody covered in the agent demo. Cancellation in conventional concurrent code was already a hard problem with a generation of cooperative-cancellation theory behind it: Go contexts, Python's asyncio task cancellation, structured concurrency with task groups, the whole grammar of "ask politely, escalate carefully, don't leave resources behind." Agents take that already-hard problem and add a layer on top: the planner does not know that the user revoked authorization between step 4 and step 5, and the tools it kicked off in step 4 do not get a memo when step 5 is cancelled. Stop is a UI affordance. The system underneath stop has to be designed.
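A sketch of two of the primitives that design needs: a cooperative cancellation check between plan steps, and a ledger of side effects so a late cancellation has something concrete to compensate. The class and function names are placeholders, not a framework API.

```python
import asyncio

class SideEffectLedger:
    """Record every externally visible action with a way to undo or flag it,
    so cancellation can act on the world, not just on token generation."""
    def __init__(self):
        self.compensations = []

    def record(self, description, undo):
        self.compensations.append((description, undo))

    async def compensate(self):
        # Best effort, in reverse order: recall the email, delete the draft PR, ...
        for description, undo in reversed(self.compensations):
            await undo()

async def run_plan(steps, cancel_event: asyncio.Event, ledger: SideEffectLedger):
    """Check the cancellation signal *between* steps; the model stopping is not
    the same as the world stopping. Each step records its own side effects."""
    for step in steps:
        if cancel_event.is_set():
            await ledger.compensate()
            return "cancelled"
        await step(ledger)
    return "completed"
```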

GPU Capacity Planning When Demand Is a Cliff, Not a Curve

· 10 min read
Tian Pan
Software Engineer

The first time an agent platform falls over, the postmortem usually contains a sentence that reads something like: "We had eight weeks of headroom on Friday. By Monday afternoon, we were at 140% of provisioned capacity." Nobody is lying. The capacity model was correct, applied to a workload it was never designed for. Classical capacity planning assumes demand grows along a smooth curve where weekly seasonality is the dominant signal and the worst case is a Black Friday you can plan against six months out. Agent workloads break that assumption hard.

The shape of agent demand is not a curve. It is a cliff. Three things produce the cliff and they compound. A single enterprise customer onboarding can shift baseline by 10x overnight on a contractual notice you've already signed. An agent loop can amplify a tiny increase in user activity into a fanout-multiplied surge that hits inference 30x harder than the user-facing graph suggests. A single product change — enabling tool use, lengthening context, switching to a larger model — can move per-task token consumption by an order of magnitude with no change in user count.

If your capacity planning is in QPS and your headroom budget is "75% utilization is healthy," you are not planning. You are gambling that none of those three cliffs lands on the same week.
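A few lines of arithmetic show how quickly a "healthy" utilization target disappears when the multipliers compound. The baseline and the three multipliers below are assumed numbers, deliberately milder than the 10x and 30x cases in the paragraph above.

```python
baseline_gpu_load = 0.60     # assumed: fraction of provisioned capacity in use today
headroom_target = 0.75       # the runbook's "healthy" utilization ceiling

# The three cliffs from the paragraph, each expressed as a demand multiplier.
new_customer = 1.4           # one enterprise onboarding adds 40% to baseline traffic
agent_fanout = 2.0           # a planner change doubles tool calls per user action
model_change = 3.0           # longer context or a larger model triples tokens per task

projected = baseline_gpu_load * new_customer * agent_fanout * model_change
print(f"projected load: {projected:.0%} of provisioned capacity "
      f"(headroom target was {headroom_target:.0%})")
# 0.60 * 1.4 * 2.0 * 3.0 = 5.04 -> 504% of capacity. A cliff, not a curve.
```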

JSON Mode Is a Dialect, Not a Standard: The Silent Breakage in Your Fallback Path

· 11 min read
Tian Pan
Software Engineer

The first time I watched a fallback router cause a worse incident than the outage it was trying to mitigate, the postmortem document had a header that read: "Primary degraded for 11 minutes. Fallback degraded our parser for 6 days." Nobody had written code wrong. Nobody had skipped the schema review. The integration tests against the secondary provider had been green when the fallback was wired up, eighteen months earlier. What had happened in between was that one of the two providers had quietly tightened its enum coercion policy, and the contract our downstream parsers had been written against — a contract we believed was "JSON Schema, more or less" — had drifted from a shared standard into two slightly incompatible dialects.

This is the failure mode I keep seeing, and it keeps surprising teams that should know better. "JSON mode" sounds like a feature you turn on. It is not. It is a contract you maintain — separately, against every provider you might route to — and the contract drifts every quarter as vendors evolve their structured-output stacks. The "drop-in replacement" your provider docs gestured at when you signed the contract is, in production, a maintained translation layer whose absence converts your fallback path into a paper compliance artifact: present in the architecture diagram, broken on the day you needed it.
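A minimal sketch of what "maintained translation layer" can mean: a per-provider shim that undoes that provider's dialect, followed by validation against the contract your downstream parsers actually assume, so drift fails loudly at routing time. The enum, the wrapping behavior, and the provider quirks are all hypothetical.

```python
import json

# The contract the downstream parsers were actually written against.
ALLOWED_STATUS = {"open", "closed", "pending"}

def validate_contract(payload: dict) -> dict:
    """Validate against the downstream contract, not against 'it parsed as JSON'."""
    status = payload.get("status")
    if status not in ALLOWED_STATUS:
        raise ValueError(f"status {status!r} violates the downstream contract")
    return payload

def normalize_provider_b(raw: str) -> dict:
    """Hypothetical per-provider shim: pretend provider B's structured-output mode
    upper-cases enums and wraps the object, so undo its dialect before validating."""
    payload = json.loads(raw)
    obj = payload.get("result", payload)
    if isinstance(obj.get("status"), str):
        obj["status"] = obj["status"].lower()
    return validate_contract(obj)

# The fallback path runs the shim and the validator on every request, so when a
# provider quietly tightens its coercion policy the failure is loud, not six days late.
print(normalize_provider_b('{"result": {"status": "OPEN", "id": 7}}'))
```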

Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding

· 12 min read
Tian Pan
Software Engineer

A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.

Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.

This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.
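As a taste of the tool-layer half of that argument, here is a sketch of a client wrapper that treats a 503 as cooperative back-pressure: honor the server's hint, add jitter so a fleet does not retry in lockstep, spend from a bounded budget, and surface the overload to the planner instead of letting it replan around the failure. `send_request` is a placeholder returning a status code, an optional Retry-After value, and a body.

```python
import random
import time

class RetryBudgetExceeded(Exception):
    """Surfaced to the planner: the dependency is shedding load, stop pushing."""

def call_with_backpressure(send_request, max_attempts: int = 3):
    """Tool-layer wrapper: a 503 is a back-pressure signal, not a routing hint."""
    for attempt in range(max_attempts):
        status, retry_after, body = send_request()
        if status != 503:
            return body
        # Honor the server's hint; jitter keeps correlated agents from retrying together.
        delay = (retry_after or 2 ** attempt) * (1 + random.random())
        time.sleep(delay)
    raise RetryBudgetExceeded("dependency overloaded; do not replan around it")
```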

Replan, Don't Retry: Why Most Agent Errors Aren't Transient

· 10 min read
Tian Pan
Software Engineer

A calendar-write returns 409 Conflict. The framework's default error handler kicks in: backoff 200ms, retry. Same conflict. Backoff 400ms, retry. Same conflict. Backoff 800ms, retry. By the time the agent gives up and tells the user "I couldn't book the meeting," it has burned three seconds of latency budget proving something the very first response already told it: the slot is taken. The world has not changed. It will not change in 800 milliseconds. Retrying was never going to work, because nothing about this error was transient.

This is the most common error-handling bug in agent systems, and it is hiding in plain sight inside almost every framework that ships today. The retry-with-exponential-backoff pattern was imported wholesale from stateless HTTP clients — where it is exactly correct — into stateful planning loops where it is actively wrong. The right default for a tool error in an agent is not retry. It is replan.
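A minimal sketch of that default, assuming HTTP-style status codes from the tool layer. The `planner.replan` and `retry` hooks are hypothetical stand-ins for whatever your framework exposes; the point is the dispatch, not the names.

```python
TRANSIENT = {429, 500, 502, 503, 504}      # time might plausibly fix these
DETERMINISTIC = {400, 403, 404, 409, 422}  # the world is telling you something; replan

def handle_tool_error(status_code: int, error_body: str, planner, retry, max_retries=2):
    """Retry only errors that waiting can fix; everything else becomes an
    observation the planner must plan around (409: the slot is taken, pick another)."""
    if status_code in TRANSIENT:
        return retry(max_retries=max_retries)
    if status_code in DETERMINISTIC:
        # Feed the failure back as state, not as an exception to swallow.
        return planner.replan(observation=f"tool failed with {status_code}: {error_body}")
    return planner.replan(observation=f"unclassified tool error {status_code}")
```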

Structured Concurrency for Parallel Tool Fanout: Who Owns Partial Failure?

· 11 min read
Tian Pan
Software Engineer

The moment your agent fans out five parallel tool calls — search across three indexes, query two databases, hit one external API — you have crossed an invisible line. You are no longer writing prompt-and-response code. You are writing a concurrent program. Most agent frameworks pretend you are not, and the bill arrives at 2 AM.

The pretense is comfortable. The planner emits a list of tool calls, the runtime fires them off, the runtime collects whatever comes back, the planner consumes the aggregate. From a thousand feet up it looks like a fan-out / fan-in pipeline, and most teams treat it that way until production teaches them otherwise. The problem is that twenty years of concurrent-programming research — partial-failure semantics, structured cancellation, backpressure, deterministic error attribution — already solved the failure modes you are about to rediscover. Your agent framework, by default, did not import any of it.
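One way to import a piece of it, sketched with asyncio.TaskGroup (Python 3.11+): bound every branch, never orphan a task, and return per-call results and per-call errors so the planner, not the runtime, decides what partial success means. The tool names in the usage comment are hypothetical.

```python
import asyncio

async def fanout(tool_calls: dict):
    """Run named tool calls concurrently. tool_calls maps a name to an awaitable,
    e.g. {"search_idx1": search(q, "idx1"), "db_users": query_users(q)}."""
    results, errors = {}, {}

    async def run(name, coro):
        try:
            results[name] = await asyncio.wait_for(coro, timeout=10)  # bound every branch
        except Exception as exc:
            errors[name] = exc          # attribute the failure, don't swallow it

    async with asyncio.TaskGroup() as tg:   # structured: no orphaned tasks on exit
        for name, coro in tool_calls.items():
            tg.create_task(run(name, coro))

    return results, errors

# Usage (hypothetical tools):
#   results, errors = await fanout({"search_idx1": search(q, "idx1"),
#                                   "db_users": query_users(q)})
#   The planner now sees exactly which branches failed and can replan with the rest.
```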