Skip to main content

182 posts tagged with "reliability"

View all tags

The Disable Switch Is the Real Product: Designing the Non-AI Fallback Path

· 10 min read
Tian Pan
Software Engineer

Every AI feature ships with a moment its team hasn't planned for: the moment it has to be turned off. A model regression lands during the morning standup. A cost spike from a marketing campaign nobody told engineering about doubles the bill in twelve hours. A privacy review flags a prompt-context leak. The provider goes down for ninety minutes. A compliance team waves a flag at noon and the feature has to disappear before the close of business.

The disable switch most teams ship for that moment is "the feature returns an error" — a spinner that never resolves, a banner that says "AI assistant unavailable, try again later." That is a strictly worse user experience than the pre-AI status quo, which is exactly what users will compare you to the moment AI degrades. The status quo had a button. Now they get an apology.

Eval-as-Code: When Your Release Gate Is a Notebook on Someone's Laptop

· 13 min read
Tian Pan
Software Engineer

The number that decides whether a model goes to production is being produced by a Jupyter notebook running on a single engineer's MacBook, against a CSV that lives in a Slack DM, scored by a judge model that nobody pinned. Two weeks later, after the engineer has touched the notebook three more times and the API provider has silently shipped a minor model update, nobody on the team can reproduce the number — including the engineer who originally generated it. And yet that number is the gate. It decided that GPT-4o-mini was good enough to replace GPT-4 in the customer support flow. It decided the new prompt template shipped. It decided the fine-tune was promoted. The team is treating it like a load-bearing artifact and storing it like a sticky note.

This is the eval gap. The industry has spent five years writing about evaluation as a methodology problem — which scoring technique, which judge model, which rubric, which dataset — and almost no time writing about evaluation as an engineering problem. But the moment your eval suite starts gating production releases, it inherits every requirement that the rest of your production stack lives by: reproducibility, version control, ownership, observability, dependency management, latency and reliability budgets, and a pipeline that survives the engineer who built it leaving the team. Most teams skip this layer entirely and discover its absence only after a major incident, usually one where the eval score said green and the customer experience said red.

The Fallback Cascade: Why Your AI Feature Needs Five Failure Modes, Not One

· 9 min read
Tian Pan
Software Engineer

Most AI features ship with exactly two states: working and broken. The model call succeeds and the feature responds; the model call fails and the user sees an error. This is the equivalent of building a web service with no load balancing, no cache, and a single database replica — technically functional until the moment it isn't.

The difference is that engineers learned database resilience patterns in the 1990s and have internalized them deeply. AI feature resilience is still being discovered the hard way, one production outage at a time. A payment processor lost $2.3M in a four-hour AI outage. A logistics company missed delivery windows for 30,000 packages when their routing model went down. Both failures shared a root cause: when the primary model was unavailable, there was nothing to fall back to.

When Your Agents Disagree: Conflict Resolution Patterns for Parallel AI Systems

· 9 min read
Tian Pan
Software Engineer

Here is the uncomfortable fact that multi-agent system designs rarely surface in architecture reviews: when you run two agents over the same task, they will not agree on the answer somewhere between 20% and 40% of the time, depending on task type. Most systems respond to this by silently picking one answer. The logs show a final decision; the intermediate disagreement disappears. Everything looks healthy until something downstream breaks, and you spend three to five times longer debugging it than you would a single-agent failure — because you can't tell which agent was wrong, or even that they disagreed at all.

Disagreement between agents is not a fringe case to handle later. As parallel agent topologies become a standard architecture pattern, conflict resolution graduates from a footnote into a first-class reliability discipline.

Persona Drift in Long-Running Agent Sessions: Why Your Agent Forgets Who It Is

· 10 min read
Tian Pan
Software Engineer

Most production agent failures look like model errors. The agent starts a session responding correctly to the system prompt — maintaining the right tone, respecting tool constraints, following the defined workflow. Then somewhere around turn 30 or 40, things subtly shift. The agent starts hedging where it should be direct. It calls tools it was told to avoid. It contradicts a decision it made 15 turns earlier. The system prompt hasn't changed, but the agent's behavior has.

This is persona drift: the progressive divergence between an agent's actual behavior and its original system instructions, caused by how transformers attend to increasingly buried context. Research quantifies it precisely — after 8–12 dialogue turns, persona self-consistency metrics degrade by more than 30%. Single-turn agents achieve roughly 90% task accuracy; multi-turn agents running the same tasks fall to around 65%. That 25-point drop isn't a model quality problem you can prompt your way around. It's an architectural property of how attention works over long sequences, and most teams discover it only after they've shipped a feature that degrades silently for hours before a user finally notices.

The Tail-Tolerant Retry Policy Your LLM Gateway Doesn't Have

· 12 min read
Tian Pan
Software Engineer

Pull up your gateway's retry config. Three attempts. Exponential backoff with jitter. Retry on 5xx and timeout. Maximum delay capped at a few seconds. It looks reasonable, and someone copied it from a microservices runbook two years ago. It is also the single largest reason your P99 is twice your P50, your token bill spikes during provider incidents, and a meaningful slice of your users see a thirty-second spinner before silently bouncing.

A retry policy designed for 50ms RPCs does not survive contact with an 8-second LLM call. The shape of the failure is different, the cost of every attempt is different, and the user-perceived clock is different. The default is not safe, it is just familiar. Most teams discover this the same way: a postmortem where the gateway logs a successful response and the customer screenshot shows a frozen UI.

The Brownout Pattern: When Your LLM Provider Is Slow but Not Down

· 10 min read
Tian Pan
Software Engineer

The pager that wakes you at 3 a.m. for an outage is the easy one. The provider returned 503 for forty minutes, your fallback kicked in, your runbook fired, your post-mortem writes itself. The pager that does not wake you — the one that lets your support queue fill up over six hours while every dashboard stays green — is the brownout. The provider's API still answers. The status page still says "operational." Your p99 latency has quietly drifted from 2.1 seconds to 14 seconds, your error rate from 0.1% to 4%, and the only people who noticed are the users who already left.

Provider availability is not binary. The fallback story most teams write — "if provider is down, switch to backup" — is a state machine with two states for a continuous variable, and it does not fire when the provider is sad rather than dead. Building for brownouts is a different design problem than building for outages, and almost every production agent harness I have seen ships without solving it.

Tool Call Ordering Is a Partial Order, Not a Set

· 10 min read
Tian Pan
Software Engineer

A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.

This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, others can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.

The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.

Agents as Cron Jobs: When Scheduled Triggers Beat Conversational Loops

· 10 min read
Tian Pan
Software Engineer

Most "agents" in production today are background jobs wearing a chat interface. They do not need a user typing into them. They need a trigger, a state file, and a way to resume after the inevitable timeout. The conversational loop — request, tool call, request, tool call, indefinitely — is a demo affordance that quietly became the default execution model, and it is the wrong model for the majority of agentic work that ships.

The decision is not philosophical. It shows up on the bill, in the on-call pager, and in the percentage of runs that finish at all. A conversational loop holds a model session open across many turns, accumulates context, and dies if any link in the chain fails. A scheduled trigger fires at a deterministic boundary, runs to completion or to a checkpoint, and writes its state somewhere durable before exiting. One is a phone call. The other is a job queue. Treating the two as interchangeable is how a $200/month feature becomes a $40,000/month feature without anyone changing the prompt.

The Agent Degraded-Mode Spec Is the Document You Didn't Write

· 11 min read
Tian Pan
Software Engineer

When the search index goes stale, the vendor API throttles you, the database read replica falls behind, or a downstream microservice starts returning 503s, your agent has to decide what to do. In most production agent systems, that decision was never made. It was inherited — silently — from whatever the engineer who wrote the tool wrapper happened to type at 4 PM on a Tuesday in week three of the project.

The result is what your customers eventually write for you: a Reddit thread, a support transcript, a quote in a press article. "The assistant told me my balance was $0 when my account was actually fine — turns out their lookup service was down." That paragraph is the degraded-mode spec your team didn't write. It is now public, it is now the customer's, and it is the version your engineering org will spend the next quarter responding to.

Agent Disaster Recovery: When Working Memory Dies With the Region

· 12 min read
Tian Pan
Software Engineer

The DR runbook your team rehearses every quarter was written for a stack you no longer fully run. It says: promote the replica, repoint DNS, drain the queue. It assumes state lives in databases, queues, and object storage — places the SRE org has owned, named, and tested for a decade. Then last quarter you shipped an agent. Working memory now lives in the inference provider's session cache, scratchpad files on a worker's local disk, in-flight tool results that haven't been written back, and a partial plan-and-act trace that exists only in the prompt history of one model call. None of that is on the asset register. None of it is in the runbook.

When the region drops, the agent doesn't fail cleanly. It half-completes. The user sees a workflow that started but the failover region cannot resume, the customer's invoice gets sent twice or not at all because the idempotency key lived on the dead worker, and the on-call engineer reads a Slack thread that begins "the orchestrator is up, but..." and ends six hours later with a credit-card chargeback queue.

This is the gap nobody named: agentic features have a state model the existing DR plan doesn't describe. The team that hasn't written that state surface down is one regional outage away from learning what their runbook's silence costs.

The Demo Was a Single Seed: Why Your AI Rollout Is a Variance Problem, Not a Polish Problem

· 11 min read
Tian Pan
Software Engineer

The exec demo went perfectly. The model answered the curated question, the agent completed the workflow, the screen recording is saved on the company drive, and the launch date is now in the calendar. Six weeks later the rollout craters and the post-mortem narrative writes itself: the model needed more polish, the prompt needed more iteration, the team underestimated the work between prototype and production.

That narrative is wrong, and it's expensive, because it sends the team back to do more of the work that already failed. The demo wasn't an under-polished version of production. It was a single sample from a distribution the team never measured. The wow moment was one realization out of thousands the model would generate against the same input, and the team shipped the best one as if it were the typical one. The gap between demo and prod isn't quality drift. It's variance the team hadn't yet seen.

This reframing matters because the fix for a variance problem looks nothing like the fix for a polish problem. Polish says "iterate the prompt, tune the model, hire a better PM." Variance says "you don't know what you have until you sample it n times across the input distribution." The two diagnoses produce different roadmaps, different budgets, and different incident patterns. The teams that ship reliably in 2026 know which problem they have.