146 posts tagged with "reliability"

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

· 11 min read
Tian Pan
Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.
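
As a rough sketch of why the backlog dynamic bites, here is the arithmetic with hypothetical numbers: once items arrive faster than reviewers clear them, both the queue depth and the age of the oldest unreviewed decision grow without bound.

```python
# Backlog arithmetic with made-up numbers: arrivals outpace reviews,
# so queue depth and item staleness both grow every hour.

arrival_rate = 40   # hypothetical: items entering review per hour
review_rate = 30    # hypothetical: items the reviewer pool clears per hour

backlog = 0
for hour in range(1, 9):
    backlog += arrival_rate - review_rate
    # Rough age of the oldest unreviewed item, assuming FIFO service.
    oldest_age_hours = backlog / review_rate
    print(f"hour {hour}: backlog={backlog}, oldest item waited ~{oldest_age_hours:.1f}h")
```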

Multi-Model Reliability Is Not 2x: The Non-Linear Cost of a Second LLM Provider

· 13 min read
Tian Pan
Software Engineer

The naive calculation goes like this. Our primary provider has 99.3% uptime. Add a second, independent provider with similar uptime, and simultaneous failure drops to roughly 0.005%. Multiply cost by two, divide risk by well over a hundred. Engineering leadership signs off on the 2x budget and the oncall rotation stops paging on provider outages. The spreadsheet says this is the best reliability investment on the roadmap.
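
The naive calculation is easy to reproduce. Assuming two providers at roughly 99.3% uptime that fail independently (the assumption the rest of this post pokes at), the arithmetic looks like this:

```python
# The reliability spreadsheet, reproduced as arithmetic.
# Assumes independent failures, which is exactly what gets questioned later.

primary_downtime = 0.007     # 99.3% uptime
secondary_downtime = 0.007   # a similar, independent provider

both_down = primary_downtime * secondary_downtime
print(f"simultaneous failure: {both_down:.4%}")                        # ~0.0049%
print(f"risk reduction vs. one provider: {primary_downtime / both_down:.0f}x")  # ~143x
```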

Six months later the spreadsheet is wrong. The eval suite takes 3x as long to run, prompt changes need two PRs, the weekly regression report has two columns that disagree with each other, and nobody can remember which provider the staging fallback is currently routing to. The 2x budget is closer to 4–5x once the team tallies the human hours spent keeping both paths calibrated. The second provider is still technically serving traffic, but half the features have been quietly pinned to one side because keeping both in sync stopped being worth it.

This is the multi-model cost trap. The reliability math is correct; the operational math is the part teams get wrong. What follows is the cost decomposition of going multi-provider, the single-provider-with-degraded-mode option most teams should try first, and the narrow set of criteria that actually justify the nonlinear complexity.

No Results Is Not Absence: Why Agents Treat Retrieval Failure as Proof

· 10 min read
Tian Pan
Software Engineer

The most dangerous sentence in an agent transcript is not a hallucination. It is five calm words: "I could not find it." The agent sounds epistemically humble. It sounds like due diligence. It sounds, to any downstream reader or caller, exactly like a fact. And yet the statement carries no information about whether the thing exists. It only carries information about what happened when a specific tool, invoked with a specific query, consulted a specific index that the agent happened to have access to at that moment.

Between those two readings, "the thing does not exist" and "this one lookup came back empty," lies a production incident waiting to happen. A support agent tells a customer "we have no record of your order" because replication lag delayed the write to the read replica by ninety seconds. A coding agent declares "there are no tests for this module" because it searched a directory that did not contain the test folder. A compliance agent replies "no prior violations on file" because the audit index had not ingested last week's report. In each case the agent's output is grammatically a negation, but epistemically it is a shrug that has been re-typed as a claim.
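
One mitigation is to make the retrieval result carry its own provenance, so an empty result can only be read as "this query, against this index, at this time, returned nothing." A minimal sketch, with hypothetical names:

```python
# Sketch: a retrieval result that records what was searched, where, and when,
# so "found nothing" never quietly becomes "does not exist."

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class RetrievalResult:
    query: str
    index: str             # which index was actually consulted
    searched_at: datetime  # when, so replication lag can be reasoned about
    hits: list

    def summary_for_agent(self) -> str:
        if self.hits:
            return f"{len(self.hits)} result(s) for {self.query!r} in {self.index!r}"
        return (
            f"No results for {self.query!r} in index {self.index!r} "
            f"as of {self.searched_at.isoformat()}; this is an empty lookup, "
            f"not proof the thing does not exist."
        )

result = RetrievalResult(
    query="order 48213",
    index="orders-read-replica",
    searched_at=datetime.now(timezone.utc),
    hits=[],
)
print(result.summary_for_agent())
```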

Your OAuth Tokens Expire Mid-Task: The Silent Failure Mode of Long-Running Agents

· 11 min read
Tian Pan
Software Engineer

The first time a production agent runs for forty minutes and hits a 401 on step 27 of 40, the incident review is almost always the same. Someone in the room asks why the token wasn't refreshed. Someone else points out that the refresh logic exists, but it lives in the HTTP client the agent's tool wrapper was never wired into. A third person notices that even if the refresh had fired, two of the agent's parallel tool calls would have tried to rotate the same refresh token at the same instant and blown up the session anyway. Everyone nods. Then the team spends the next week retrofitting credential lifecycle into an architecture that assumed requests finish in 800 milliseconds.

OAuth was designed for a world where an access token outlives the request that uses it. Long-running agents inverted that assumption. The request — really, a chain of tens or hundreds of tool calls orchestrated across minutes or hours — now outlives the token. The industry spent a decade building libraries, proxies, and refresh flows around the short-request assumption, and almost none of it transplants cleanly to agent loops.
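
A minimal sketch of the two fixes that retro usually converges on, refreshing proactively before each tool call and serializing refreshes behind a lock; the TokenStore class and do_refresh function below are hypothetical stand-ins for whatever your auth layer actually exposes.

```python
# Sketch: refresh before the token expires, and make sure parallel tool calls
# never race to rotate the same refresh token.

import threading
import time

class TokenStore:
    def __init__(self, access_token: str, expires_at: float, refresh_token: str):
        self.access_token = access_token
        self.expires_at = expires_at          # unix timestamp
        self.refresh_token = refresh_token
        self._lock = threading.Lock()

    def get_valid_token(self, skew_seconds: float = 60.0) -> str:
        # Fast path: token is still comfortably valid.
        if time.time() < self.expires_at - skew_seconds:
            return self.access_token
        with self._lock:
            # Re-check inside the lock: a parallel tool call may have refreshed already.
            if time.time() < self.expires_at - skew_seconds:
                return self.access_token
            self.access_token, self.expires_at, self.refresh_token = do_refresh(
                self.refresh_token
            )
            return self.access_token

def do_refresh(refresh_token: str):
    # Placeholder: call your identity provider's token endpoint here.
    return "new-access-token", time.time() + 3600, "new-refresh-token"
```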

Retry Amplification: How a 2% Tool Error Rate Becomes a 20% Agent Failure

· 13 min read
Tian Pan
Software Engineer

The spreadsheet on the oncall doc said the search tool had a 2% error rate. The incident review said the agent platform had a 20% failure rate during the three-hour window. Nobody disagreed with either number. The search team was not at fault. The platform team did not ship a bug. The gap between the two numbers is the whole story, and it is a story about arithmetic, not engineering incompetence.

Retry logic is one of the most borrowed and least adapted patterns in agent systems. Teams copy tenacity decorators from their REST client, stack them at the SDK, the gateway, and the agent loop, and ship. Each layer is individually reasonable. The composition is a siege weapon pointed at the flakiest dependency in the fleet, and it fires hardest at the exact moment that dependency needs the load to drop.

This post is about how that math works, why agent loops amplify it harder than request-response systems, and the retry discipline that keeps transient blips from becoming correlated outages with your own logo on them.
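
As a back-of-envelope preview of that math, with illustrative numbers rather than anything measured:

```python
# Stacked retries multiply attempts, and per-task failure compounds per call.

layers = {"sdk": 3, "gateway": 3, "agent_loop": 3}   # retries per layer (hypothetical)

worst_case_attempts = 1
for retries in layers.values():
    worst_case_attempts *= (1 + retries)
print(f"worst-case attempts per logical call: {worst_case_attempts}")   # 4 * 4 * 4 = 64

# An agent making 10 tool calls against a dependency with a 2% per-call error
# rate hits at least one error in roughly 1 - 0.98**10, about 18% of runs,
# before amplification makes the outage worse.
per_call_error = 0.02
calls_per_task = 10
print(f"tasks touched by at least one error: {1 - (1 - per_call_error) ** calls_per_task:.0%}")
```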

Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running

· 9 min read
Tian Pan
Software Engineer

Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.

A hallucinated tool call is not only when the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws ToolNotFoundError. That is the easy case. The harder case is when the fabricated call gets fuzzy-matched onto an adjacent real tool, or when the tool name is correct but the agent invents an argument your schema happily accepts because you marked it optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.
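
A sketch of the stricter check both quiet variants would fail, assuming a hypothetical catalog keyed by exact tool name with an explicit argument allow-list:

```python
# Sketch: reject unknown tool names outright and flag arguments the catalog
# never declared, instead of letting fuzzy matching or permissive schemas absorb them.

TOOL_CATALOG = {
    "delete_inactive_users": {"allowed_args": {"older_than_days", "dry_run"}},
    "list_users": {"allowed_args": {"limit", "cursor"}},
}

def validate_tool_call(name: str, args: dict) -> list[str]:
    problems = []
    if name not in TOOL_CATALOG:
        # The easy case: a fabricated tool name. No fuzzy matching allowed here.
        problems.append(f"unknown tool: {name!r}")
        return problems
    unexpected = set(args) - TOOL_CATALOG[name]["allowed_args"]
    if unexpected:
        # The quiet case: real tool, invented arguments a lenient schema would accept.
        problems.append(f"unexpected arguments for {name!r}: {sorted(unexpected)}")
    return problems

print(validate_tool_call("delete_orphaned_users", {"older_than": "30d"}))
print(validate_tool_call("list_users", {"limit": 10, "include_deleted": True}))
```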

Tool Manifest Lies: When Your Agent Trusts a Schema Your Backend No Longer Honors

· 10 min read
Tian Pan
Software Engineer

The most dangerous bug in a production agent isn't the one that throws. It's the one where a tool description says returns user_id and the backend quietly started returning account_id two sprints ago, and the model is still happily inventing user_id in downstream reasoning — because the manifest said so, and the few-shot history reinforced it, and nothing in the loop ever fetched ground truth.

This is manifest drift: the slow, silent divergence between what your tool descriptions claim and what your endpoints actually do. It rarely produces stack traces. It produces bad decisions with clean audit trails — the worst class of bug in agent systems.
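
One cheap probe is to diff what the manifest promises against a recorded or live sample response; the manifest format and field names below are illustrative.

```python
# Sketch: detect fields the manifest claims a tool returns but the backend no longer sends.

MANIFEST = {
    "lookup_user": {"returns": {"user_id", "email", "created_at"}},
}

def manifest_drift(tool_name: str, sample_response: dict) -> set[str]:
    promised = MANIFEST[tool_name]["returns"]
    actual = set(sample_response)
    return promised - actual   # promised fields the backend omitted

# The backend quietly renamed user_id to account_id two sprints ago:
sample = {"account_id": "u_42", "email": "a@example.com", "created_at": "2024-05-01"}
print(manifest_drift("lookup_user", sample))   # {'user_id'}
```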

Agent Fleet Concurrency: Coordinating Dozens of Agents Without Deadlock or the Thundering Herd

· 12 min read
Tian Pan
Software Engineer

Eleven agents started at the same second. Three died before the first tool call returned. That 27% fatality rate was not a model problem, a prompt problem, or a tool problem. It was a scheduling problem — the same kind of problem an operating system solves when fifty processes wake up at once and fight over a single CPU. The difference is that the OS has forty years of accumulated wisdom and the agent runtime has about two.

Anyone who has wired up more than a handful of concurrent LLM workers has seen some version of this. You kick off a scheduled job at 02:00, thirty agents spin up, they all hit the same provider within 200 ms of each other, and most of them fail with a mix of 429s, 502s, and connection resets. The survivors get half the rate budget they were promised because the provider's fair-share logic has already started throttling your API key. By 02:05 the surviving agents finish and your dashboard shows a completion rate that would embarrass a first-year CS student writing their first producer-consumer. Your on-call rotation debates whether to add retries, add a queue, or just run fewer of them.

None of those are the right answer by themselves. The right answer is that a fleet of agents is a small distributed system and needs to be designed like one.
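
As an illustration of the two cheapest scheduling fixes, start-time jitter and a cap on in-flight provider calls, here is a sketch assuming an asyncio runtime; the numbers and the body of run_agent are placeholders.

```python
# Sketch: jitter the fleet's start so agents don't hit the provider in the same
# 200 ms window, and bound concurrent provider calls with a semaphore.

import asyncio
import random

MAX_IN_FLIGHT = 8   # hypothetical provider concurrency budget

async def run_agent(agent_id: int, provider_slots: asyncio.Semaphore) -> str:
    await asyncio.sleep(random.uniform(0, 5))   # jitter the start, not just the retries
    async with provider_slots:                  # fair-share the provider budget
        await asyncio.sleep(0.2)                # stand-in for the actual LLM call
    return f"agent-{agent_id}: done"

async def run_fleet(n: int = 30) -> None:
    provider_slots = asyncio.Semaphore(MAX_IN_FLIGHT)
    results = await asyncio.gather(*(run_agent(i, provider_slots) for i in range(n)))
    print(f"{len(results)} / {n} agents completed")

asyncio.run(run_fleet())
```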

The CAP Theorem for AI Agents: Choosing Consistency or Availability When Your LLM Is the Bottleneck

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped a distributed system has stared at the CAP theorem and made a choice: when the network partitions, do you keep serving stale data (availability) or do you refuse to serve until you have a consistent answer (consistency)? The theorem tells you that you cannot have both.

AI agents face an identical tradeoff, and almost nobody is making it explicitly. When your LLM call times out, when a tool returns garbage, when a downstream API is unavailable — what does your agent do? In most production systems, the answer is: it guesses. Quietly. Confidently. And often wrong.

The failure mode isn't dramatic. There's no exception in the logs. The agent "answered" the user. You only find out two weeks later when someone asks why the system booked the wrong flight, extracted the wrong entity, or confidently told a customer a price that no longer exists.
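
A sketch of what making the choice explicit can look like; call_llm, read_cache, and the policy flag are illustrative stand-ins, not a prescribed API.

```python
# Sketch: when the LLM call fails, either serve a labeled stale answer
# (availability) or refuse (consistency). Never guess silently.

from typing import Optional

PREFER_AVAILABILITY = True   # the choice to make consciously, per use case

def answer(query: str, call_llm, read_cache) -> dict:
    try:
        return {"answer": call_llm(query), "stale": False}
    except TimeoutError:
        cached: Optional[str] = read_cache(query)
        if PREFER_AVAILABILITY and cached is not None:
            # Availability: last known-good answer, explicitly labeled stale.
            return {"answer": cached, "stale": True}
        # Consistency: refuse rather than fabricate.
        return {"answer": None, "error": "upstream unavailable, no fresh answer"}
```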

The Compound Accuracy Problem: Why Your 95% Accurate Agent Fails 40% of the Time

· 11 min read
Tian Pan
Software Engineer

Your agent performs beautifully in isolation. You've benchmarked each step. You've measured per-step accuracy at 95%. You demo the system to stakeholders and it looks great. Then you ship it, and users report that it fails almost half the time.

The failure isn't a bug in any individual component. It's the math.
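
The math in question is one line of arithmetic: per-step accuracy compounds multiplicatively across the chain.

```python
# Per-step accuracy to end-to-end accuracy, for a few chain lengths.

per_step_accuracy = 0.95

for steps in (5, 10, 14, 20):
    end_to_end = per_step_accuracy ** steps
    print(f"{steps} steps: {end_to_end:.0%} succeed, {1 - end_to_end:.0%} fail")
# At 10 steps roughly 40% of runs fail, the gap between the benchmark and the bug report.
```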

Contract Testing for AI Pipelines: Schema-Validated Handoffs Between AI Components

· 10 min read
Tian Pan
Software Engineer

Most AI pipeline failures aren't model failures. The model fires fine. The output looks like JSON. The downstream stage breaks silently because a field was renamed, a type changed, or a nested object gained a new required property that the next stage doesn't know how to handle. The pipeline runs to completion and reports success. Somewhere in the data warehouse, numbers are wrong.

This is the contract testing problem for AI pipelines, and it's one of the most underaddressed reliability risks in production AI systems. According to recent infrastructure benchmarks, the average enterprise AI system experiences nearly five pipeline failures per month — each taking over twelve hours to resolve. The dominant cause isn't poor model quality. It's data quality and schema contract violations: 64% of AI risk lives at the schema layer.
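
A schema-validated handoff does not need much machinery. Here is a sketch using the jsonschema library, with illustrative field names, where the contract fails loudly at the stage boundary instead of silently in the warehouse.

```python
# Sketch: validate one stage's output against an explicit schema before the
# next stage consumes it, so renamed fields and type changes surface immediately.

from jsonschema import validate, ValidationError

EXTRACTION_OUTPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "user_id": {"type": "string"},
        "amount": {"type": "number"},
    },
    "required": ["user_id", "amount"],
    "additionalProperties": False,   # a renamed or new field breaks the handoff, not the warehouse
}

def handoff(payload: dict) -> dict:
    try:
        validate(instance=payload, schema=EXTRACTION_OUTPUT_SCHEMA)
    except ValidationError as exc:
        raise RuntimeError(f"contract violation at stage boundary: {exc.message}") from exc
    return payload

handoff({"user_id": "u_42", "amount": 19.99})        # passes
# handoff({"account_id": "u_42", "amount": 19.99})   # raises: 'user_id' is a required property
```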

Graceful Tool-Call Failure: The Error Contract Your Agent UI Is Missing

· 11 min read
Tian Pan
Software Engineer

Every agent demo you've ever seen ended with a clean result. The tool call returned exactly the data the model expected, the response arrived in well under two seconds, and the final answer was crisp and correct. That's the demo. Production is something else.

In production, tools time out. APIs return 403s because a service account was rotated last Tuesday. Third-party enrichment endpoints return a 200 with a body that says {"status": "degraded", "data": null}. OAuth tokens expire at 3 AM on a Saturday. These aren't edge cases — they're the normal operating conditions of any agent that talks to the real world. The failure modes are predictable. The problem is that most agent architectures treat them as afterthoughts, and most agent UIs have no vocabulary for communicating them to users at all.
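
One way to give the architecture and the UI that vocabulary is a small error contract: every tool call returns an envelope rather than throwing into the orchestrator. A sketch, with illustrative field names:

```python
# Sketch: a structured result the agent loop and the UI can both render,
# instead of an exception that dies somewhere in the orchestrator.

from dataclasses import dataclass, asdict
from typing import Any, Optional

@dataclass
class ToolResult:
    ok: bool
    data: Optional[Any] = None
    error_kind: Optional[str] = None     # "timeout", "auth", "degraded_upstream", ...
    user_message: Optional[str] = None   # what the UI is allowed to show
    retryable: bool = False

def call_tool(fn, *args, **kwargs) -> ToolResult:
    try:
        return ToolResult(ok=True, data=fn(*args, **kwargs))
    except TimeoutError:
        return ToolResult(ok=False, error_kind="timeout",
                          user_message="The lookup is taking longer than usual.",
                          retryable=True)
    except PermissionError:
        return ToolResult(ok=False, error_kind="auth",
                          user_message="We couldn't access that system right now.",
                          retryable=False)

def flaky_lookup():
    raise TimeoutError

print(asdict(call_tool(flaky_lookup)))
```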