252 posts tagged with "reliability"

The Fallback Cascade: Why Your AI Feature Needs Five Failure Modes, Not One

May 2, 2026 · 9 min read

Software Engineer

Most AI features ship with exactly two states: working and broken. The model call succeeds and the feature responds; the model call fails and the user sees an error. This is the equivalent of building a web service with no load balancing, no cache, and a single database replica — technically functional until the moment it isn't.

The difference is that engineers learned database resilience patterns in the 1990s and have internalized them deeply. AI feature resilience is still being discovered the hard way, one production outage at a time. A payment processor lost $2.3M in a four-hour AI outage. A logistics company missed delivery windows for 30,000 packages when their routing model went down. Both failures shared a root cause: when the primary model was unavailable, there was nothing to fall back to.

When Your Agents Disagree: Conflict Resolution Patterns for Parallel AI Systems

May 2, 2026 · 9 min read

Tian Pan

Software Engineer

Here is the uncomfortable fact that multi-agent system designs rarely surface in architecture reviews: when you run two agents over the same task, they will not agree on the answer somewhere between 20% and 40% of the time, depending on task type. Most systems respond to this by silently picking one answer. The logs show a final decision; the intermediate disagreement disappears. Everything looks healthy until something downstream breaks, and you spend three to five times longer debugging it than you would a single-agent failure — because you can't tell which agent was wrong, or even that they disagreed at all.

Disagreement between agents is not a fringe case to handle later. As parallel agent topologies become a standard architecture pattern, conflict resolution graduates from a footnote into a first-class reliability discipline.

Persona Drift in Long-Running Agent Sessions: Why Your Agent Forgets Who It Is

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

Most production agent failures look like model errors. The agent starts a session responding correctly to the system prompt — maintaining the right tone, respecting tool constraints, following the defined workflow. Then somewhere around turn 30 or 40, things subtly shift. The agent starts hedging where it should be direct. It calls tools it was told to avoid. It contradicts a decision it made 15 turns earlier. The system prompt hasn't changed, but the agent's behavior has.

This is persona drift: the progressive divergence between an agent's actual behavior and its original system instructions, caused by how transformers attend to increasingly buried context. Research quantifies it precisely — after 8–12 dialogue turns, persona self-consistency metrics degrade by more than 30%. Single-turn agents achieve roughly 90% task accuracy; multi-turn agents running the same tasks fall to around 65%. That 25-point drop isn't a model quality problem you can prompt your way around. It's an architectural property of how attention works over long sequences, and most teams discover it only after they've shipped a feature that degrades silently for hours before a user finally notices.

The Tail-Tolerant Retry Policy Your LLM Gateway Doesn't Have

May 2, 2026 · 12 min read

Tian Pan

Software Engineer

Pull up your gateway's retry config. Three attempts. Exponential backoff with jitter. Retry on 5xx and timeout. Maximum delay capped at a few seconds. It looks reasonable, and someone copied it from a microservices runbook two years ago. It is also the single largest reason your P99 is twice your P50, your token bill spikes during provider incidents, and a meaningful slice of your users see a thirty-second spinner before silently bouncing.

A retry policy designed for 50ms RPCs does not survive contact with an 8-second LLM call. The shape of the failure is different, the cost of every attempt is different, and the user-perceived clock is different. The default is not safe, it is just familiar. Most teams discover this the same way: a postmortem where the gateway logs a successful response and the customer screenshot shows a frozen UI.

The Brownout Pattern: When Your LLM Provider Is Slow but Not Down

May 1, 2026 · 10 min read

Tian Pan

Software Engineer

The pager that wakes you at 3 a.m. for an outage is the easy one. The provider returned 503 for forty minutes, your fallback kicked in, your runbook fired, your post-mortem writes itself. The pager that does not wake you — the one that lets your support queue fill up over six hours while every dashboard stays green — is the brownout. The provider's API still answers. The status page still says "operational." Your p99 latency has quietly drifted from 2.1 seconds to 14 seconds, your error rate from 0.1% to 4%, and the only people who noticed are the users who already left.

Provider availability is not binary. The fallback story most teams write — "if provider is down, switch to backup" — is a state machine with two states for a continuous variable, and it does not fire when the provider is sad rather than dead. Building for brownouts is a different design problem than building for outages, and almost every production agent harness I have seen ships without solving it.

Tool Call Ordering Is a Partial Order, Not a Set

May 1, 2026 · 10 min read

Tian Pan

Software Engineer

A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.

This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, others can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.

The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.

Agents as Cron Jobs: When Scheduled Triggers Beat Conversational Loops

April 30, 2026 · 10 min read

Tian Pan

Software Engineer

Most "agents" in production today are background jobs wearing a chat interface. They do not need a user typing into them. They need a trigger, a state file, and a way to resume after the inevitable timeout. The conversational loop — request, tool call, request, tool call, indefinitely — is a demo affordance that quietly became the default execution model, and it is the wrong model for the majority of agentic work that ships.

The decision is not philosophical. It shows up on the bill, in the on-call pager, and in the percentage of runs that finish at all. A conversational loop holds a model session open across many turns, accumulates context, and dies if any link in the chain fails. A scheduled trigger fires at a deterministic boundary, runs to completion or to a checkpoint, and writes its state somewhere durable before exiting. One is a phone call. The other is a job queue. Treating the two as interchangeable is how a $200/month feature becomes a $40,000/month feature without anyone changing the prompt.

The Agent Degraded-Mode Spec Is the Document You Didn't Write

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

When the search index goes stale, the vendor API throttles you, the database read replica falls behind, or a downstream microservice starts returning 503s, your agent has to decide what to do. In most production agent systems, that decision was never made. It was inherited — silently — from whatever the engineer who wrote the tool wrapper happened to type at 4 PM on a Tuesday in week three of the project.

The result is what your customers eventually write for you: a Reddit thread, a support transcript, a quote in a press article. "The assistant told me my balance was $0 when my account was actually fine — turns out their lookup service was down." That paragraph is the degraded-mode spec your team didn't write. It is now public, it is now the customer's, and it is the version your engineering org will spend the next quarter responding to.

Agent Disaster Recovery: When Working Memory Dies With the Region

April 28, 2026 · 12 min read

Tian Pan

Software Engineer

The DR runbook your team rehearses every quarter was written for a stack you no longer fully run. It says: promote the replica, repoint DNS, drain the queue. It assumes state lives in databases, queues, and object storage — places the SRE org has owned, named, and tested for a decade. Then last quarter you shipped an agent. Working memory now lives in the inference provider's session cache, scratchpad files on a worker's local disk, in-flight tool results that haven't been written back, and a partial plan-and-act trace that exists only in the prompt history of one model call. None of that is on the asset register. None of it is in the runbook.

When the region drops, the agent doesn't fail cleanly. It half-completes. The user sees a workflow that started but the failover region cannot resume, the customer's invoice gets sent twice or not at all because the idempotency key lived on the dead worker, and the on-call engineer reads a Slack thread that begins "the orchestrator is up, but..." and ends six hours later with a credit-card chargeback queue.

This is the gap nobody named: agentic features have a state model the existing DR plan doesn't describe. The team that hasn't written that state surface down is one regional outage away from learning what their runbook's silence costs.

The Demo Was a Single Seed: Why Your AI Rollout Is a Variance Problem, Not a Polish Problem

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The exec demo went perfectly. The model answered the curated question, the agent completed the workflow, the screen recording is saved on the company drive, and the launch date is now in the calendar. Six weeks later the rollout craters and the post-mortem narrative writes itself: the model needed more polish, the prompt needed more iteration, the team underestimated the work between prototype and production.

That narrative is wrong, and it's expensive, because it sends the team back to do more of the work that already failed. The demo wasn't an under-polished version of production. It was a single sample from a distribution the team never measured. The wow moment was one realization out of thousands the model would generate against the same input, and the team shipped the best one as if it were the typical one. The gap between demo and prod isn't quality drift. It's variance the team hadn't yet seen.

This reframing matters because the fix for a variance problem looks nothing like the fix for a polish problem. Polish says "iterate the prompt, tune the model, hire a better PM." Variance says "you don't know what you have until you sample it n times across the input distribution." The two diagnoses produce different roadmaps, different budgets, and different incident patterns. The teams that ship reliably in 2026 know which problem they have.

Cross-Team Agent SLAs Don't Compose: The 99% Math Your Org Forgot to Budget

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

Team A's agent advertises a 99% success rate. Team B's agent advertises 99%. The new joint workflow that calls both lands at 98% on a good day, 96% on a bad one — and the team that owns the joint workflow is now the de facto SRE for two systems they don't own, can't reproduce locally, and didn't write the eval set for. Each upstream team is hitting its SLO. The composite product is missing its SLO. Nobody's pager is ringing on the right side of the boundary.

This is the math of independent failure rates, and it has been hiding in plain sight ever since the org started letting agents call each other. Five components at 99% reliability give you 95% end-to-end. Ten components give you 90%. A 20-step process at 95% per-step succeeds 36% of the time — more than half of operations fail before completion. By the time a workflow chains 50 components — not unusual once an enterprise agent starts calling sub-agents that call tool agents — a system where every individual piece is "99% reliable" will fail roughly four out of ten requests.

Researchers analyzing five popular multi-agent frameworks across more than 150 tasks identified failure rates between 41% and 87%, with the top three failures being step repetition, reasoning–action mismatch, and unawareness of termination conditions — and unstructured multi-agent networks have been observed to amplify errors up to 17× compared to single-agent baselines. The math isn't subtle. The problem is that the org's SLO sheets, dashboards, on-call rotations, and PRDs are still scoped one agent at a time.

The Human Attention Budget Is the Constraint Your HITL System Silently Overspends

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

The 50th decision your reviewer makes this morning is not the same quality as the first. The architecture diagram does not show this. The capacity model does not show this. The dashboard tracking "approvals per hour" actively hides it. And yet the entire premise of your human-in-the-loop system — that a person catches what the model gets wrong — is silently degrading from the moment the queue begins to fill.

Most HITL designs treat reviewer time as an infinite, fungible resource. The team sets a confidence threshold, routes everything below it to a human queue, and declares the system "safe." Six weeks later, the approval rate has crept up to 96%, the queue is twice as deep as the staffing model assumed, and a sample audit shows that reviewers are clicking "approve" on edge cases they would have flagged on day one. The system has not failed. It has rubber-stamped its way into looking like it is working.

About Tian Pan