238 posts tagged with "reliability"

Retry Amplification: How a 2% Tool Error Rate Becomes a 20% Agent Failure

· 13 min read
Tian Pan
Software Engineer

The spreadsheet on the oncall doc said the search tool had a 2% error rate. The incident review said the agent platform had a 20% failure rate during the three-hour window. Nobody disagreed with either number. The search team was not at fault. The platform team did not ship a bug. The gap between the two numbers is the whole story, and it is a story about arithmetic, not engineering incompetence.

Retry logic is one of the most borrowed and least adapted patterns in agent systems. Teams copy tenacity decorators from their REST client, stack them at the SDK, the gateway, and the agent loop, and ship. Each layer is individually reasonable. The composition is a siege weapon pointed at the flakiest dependency in the fleet, and it fires hardest at the exact moment that dependency needs the load to drop.

This post is about how that math works, why agent loops amplify it harder than request-response systems, and the retry discipline that keeps transient blips from becoming correlated outages with your own logo on them.
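The stacking arithmetic can be sketched in a few lines. This is an illustrative model, not the post's own analysis: it assumes every attempt fails independently with the same probability, and the three layers with three attempts each are hypothetical numbers.

```python
from math import prod


def worst_case_calls(attempts_per_layer):
    """Upper bound on backend calls for one logical request when every
    layer retries independently and every single attempt fails."""
    return prod(attempts_per_layer)


def expected_calls(p_fail, attempts_per_layer):
    """Expected backend calls per logical request.

    Assumes each backend call fails independently with probability p_fail.
    Layers are listed innermost first (e.g. SDK, gateway, agent loop).
    """
    calls, fail = 1.0, p_fail
    for attempts in attempts_per_layer:
        # Expected invocations of the layer below: 1 + fail + ... + fail^(n-1)
        invocations = sum(fail ** k for k in range(attempts))
        calls *= invocations
        # This layer only fails if all of its attempts fail.
        fail = fail ** attempts
    return calls


# Hypothetical stack: SDK, gateway, agent loop, 3 attempts each.
layers = [3, 3, 3]
print(worst_case_calls(layers))       # 27
print(expected_calls(0.02, layers))   # barely above 1 when the tool is healthy
print(expected_calls(0.90, layers))   # several times the load when it is not
```

Under these assumptions the multiplier is negligible while the dependency is healthy and climbs toward the product of the attempt counts as it degrades, which is the moment it can least absorb extra load.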

Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running

· 9 min read
Tian Pan
Software Engineer

Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.

A hallucinated tool call is not only when the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws ToolNotFoundError. That is the easy case. The harder case is when the fabricated call is silently resolved to an adjacent real tool through fuzzy matching, or when the tool name is correct but the agent invents an argument your schema happily accepts because you marked it optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.
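One way to make the quieter variants countable is to classify every call against the catalog with exact name matching and an explicit list of declared arguments. The sketch below is a minimal illustration; the catalog layout, the search_users tool, and the category names are all hypothetical.

```python
# Hypothetical catalog: tool name -> (required argument names, optional argument names)
CATALOG = {
    "search_users": ({"query"}, {"limit"}),
}


def classify_tool_call(name, args, catalog=CATALOG):
    """Return a category label for one tool call so each kind can be metered."""
    if name not in catalog:
        # Exact match only: no fuzzy fallback to a similarly named tool.
        return "unknown_tool"
    required, optional = catalog[name]
    provided = set(args)
    if required - provided:
        return "missing_argument"
    if provided - required - optional:
        # The model supplied an argument the schema never declared.
        return "invented_argument"
    if provided & optional:
        # Valid per schema, but worth sampling: did the user ask for this?
        return "ok_optional_used"
    return "ok"


print(classify_tool_call("delete_orphaned_users", {"older_than": "30d"}))
print(classify_tool_call("search_users", {"query": "ann", "region": "eu"}))
```

Counting each label separately, rather than a single success flag, is what lets the fuzzy-match and optional-argument cases show up on a dashboard.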

Tool Manifest Lies: When Your Agent Trusts a Schema Your Backend No Longer Honors

· 10 min read
Tian Pan
Software Engineer

The most dangerous bug in a production agent isn't the one that throws. It's the one where a tool description says returns user_id and the backend quietly started returning account_id two sprints ago, and the model is still happily inventing user_id in downstream reasoning — because the manifest said so, and the few-shot history reinforced it, and nothing in the loop ever fetched ground truth.

This is manifest drift: the slow, silent divergence between what your tool descriptions claim and what your endpoints actually do. It rarely produces stack traces. It produces bad decisions with clean audit trails — the worst class of bug in agent systems.
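A drift check can be as small as comparing the fields a manifest declares against a freshly sampled response. The following is a sketch under assumptions: the manifest is reduced to a flat list of field names, and the sample response is a plain dictionary fetched by some probe that is not shown here.

```python
def manifest_drift(declared_fields, sample_response):
    """Compare field names the manifest promises with what the backend returned."""
    declared, actual = set(declared_fields), set(sample_response)
    return {
        "missing": sorted(declared - actual),      # promised but not returned
        "undeclared": sorted(actual - declared),   # returned but not documented
    }


# Hypothetical example mirroring the user_id / account_id rename.
report = manifest_drift(
    ["user_id", "email"],
    {"account_id": 42, "email": "a@example.com"},
)
print(report)
```

Running a check like this on a schedule against live endpoints gives the loop a source of ground truth that the manifest and few-shot history cannot provide.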

Agent Fleet Concurrency: Coordinating Dozens of Agents Without Deadlock or the Thundering Herd

· 12 min read
Tian Pan
Software Engineer

Eleven agents started at the same second. Three died before the first tool call returned. That 27% fatality rate was not a model problem, a prompt problem, or a tool problem. It was a scheduling problem — the same kind of problem an operating system solves when fifty processes wake up at once and fight over a single CPU. The difference is that the OS has forty years of accumulated wisdom and the agent runtime has about two.

Anyone who has wired up more than a handful of concurrent LLM workers has seen some version of this. You kick off a scheduled job at 02:00, thirty agents spin up, they all hit the same provider within 200 ms of each other, and most of them fail with a mix of 429s, 502s, and connection resets. The survivors get half the rate budget they were promised because the provider's fair-share logic has already started throttling your API key. By 02:05 the surviving agents finish and your dashboard shows a completion rate that would embarrass a first-year CS student writing their first producer-consumer. Your on-call rotation debates whether to add retries, add a queue, or just run fewer of them.

None of those are the right answer by themselves. The right answer is that a fleet of agents is a small distributed system and needs to be designed like one.
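Two of the standard distributed-systems tools for this situation are start-time jitter and a concurrency cap. The sketch below combines them with asyncio; the function name, the default limits, and the shape of the agent callables are illustrative assumptions, not a prescribed design.

```python
import asyncio
import random


async def run_fleet(agent_fns, max_concurrent=5, max_start_jitter_s=2.0):
    """Run agent coroutines with staggered starts and bounded concurrency.

    agent_fns: list of zero-argument async callables (hypothetical agents).
    """
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(fn):
        # Spread start times so the fleet does not hit the provider in one burst.
        await asyncio.sleep(random.uniform(0, max_start_jitter_s))
        # Cap how many agents hold a provider slot at the same time.
        async with sem:
            return await fn()

    # gather returns results in the same order as agent_fns.
    return await asyncio.gather(*(guarded(fn) for fn in agent_fns))


async def demo_agent():
    await asyncio.sleep(0.01)  # stand-in for real LLM and tool calls
    return "done"


if __name__ == "__main__":
    print(asyncio.run(run_fleet([demo_agent] * 30, max_start_jitter_s=0.1)))
```

Neither mechanism replaces provider-side rate budgeting, but together they remove the synchronized burst that triggers the 429s in the first place.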

The CAP Theorem for AI Agents: Choosing Consistency or Availability When Your LLM Is the Bottleneck

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped a distributed system has stared at the CAP theorem and made a choice: when the network partitions, do you keep serving stale data (availability) or do you refuse to serve until you have a consistent answer (consistency)? The theorem tells you that you cannot have both.

AI agents face an analogous tradeoff, and almost nobody is making it explicitly. When your LLM call times out, when a tool returns garbage, when a downstream API is unavailable, what does your agent do? In most production systems, the answer is: it guesses. Quietly. Confidently. And often wrongly.

The failure mode isn't dramatic. There's no exception in the logs. The agent "answered" the user. You only find out two weeks later when someone asks why the system booked the wrong flight, extracted the wrong entity, or confidently told a customer a price that no longer exists.

The Compound Accuracy Problem: Why Your 95% Accurate Agent Fails 40% of the Time

· 11 min read
Tian Pan
Software Engineer

Your agent performs beautifully in isolation. You've benchmarked each step. You've measured per-step accuracy at 95%. You demo the system to stakeholders and it looks great. Then you ship it, and users report that it fails almost half the time.

The failure isn't a bug in any individual component. It's the math.
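The math in question is plain multiplication. As a rough sketch, assuming every step succeeds independently with the same probability (the ten-step workflow is my assumption, chosen because it is consistent with the title's 40%):

```python
def end_to_end_success(per_step_accuracy, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


for steps in (1, 5, 10, 20):
    success = end_to_end_success(0.95, steps)
    print(f"{steps:>2} steps: {success:.1%} success, {1 - success:.1%} failure")
```

At ten steps the success rate is about 60%, so roughly four in ten runs fail even though no single step looks unreliable.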

Contract Testing for AI Pipelines: Schema-Validated Handoffs Between AI Components

· 10 min read
Tian Pan
Software Engineer

Most AI pipeline failures aren't model failures. The model runs fine. The output looks like JSON. The downstream stage breaks silently because a field was renamed, a type changed, or a nested object gained a new required property that the next stage doesn't know how to handle. The pipeline runs to completion and reports success. Somewhere in the data warehouse, numbers are wrong.

This is the contract testing problem for AI pipelines, and it's one of the most underaddressed reliability risks in production AI systems. According to recent infrastructure benchmarks, the average enterprise AI system experiences nearly five pipeline failures per month—each taking over twelve hours to resolve. The dominant cause isn't poor model quality. It's data quality and schema contract violations: 64% of AI risk lives at the schema layer.

Graceful Tool-Call Failure: The Error Contract Your Agent UI Is Missing

· 11 min read
Tian Pan
Software Engineer

Every agent demo you've ever seen ended with a clean result. The tool call returned exactly the data the model expected, the response arrived in well under two seconds, and the final answer was crisp and correct. That's the demo. Production is something else.

In production, tools time out. APIs return 403s because a service account was rotated last Tuesday. Third-party enrichment endpoints return a 200 with a body that says {"status": "degraded", "data": null}. OAuth tokens expire at 3 AM on a Saturday. These aren't edge cases — they're the normal operating conditions of any agent that talks to the real world. The failure modes are predictable. The problem is that most agent architectures treat them as afterthoughts, and most agent UIs have no vocabulary for communicating them to users at all.
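One possible shape for such an error contract is a single result type that every tool wrapper returns, so the agent and the UI see the same small vocabulary of failures. This is a hypothetical sketch: the ToolResult fields, the error categories, and the rule that a 200 with null data counts as degraded are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolResult:
    ok: bool
    data: Any = None
    error_kind: Optional[str] = None  # "timeout" | "auth" | "degraded" | "upstream"
    retryable: bool = False


def normalize(status_code, body, timed_out=False):
    """Map a raw HTTP-style outcome onto the shared ToolResult contract."""
    if timed_out:
        return ToolResult(False, error_kind="timeout", retryable=True)
    body = body or {}
    if status_code in (401, 403):
        # Rotated credentials or expired tokens: retrying will not help.
        return ToolResult(False, error_kind="auth", retryable=False)
    if status_code is None or status_code >= 500:
        return ToolResult(False, error_kind="upstream", retryable=True)
    if status_code >= 400:
        return ToolResult(False, error_kind="upstream", retryable=False)
    if body.get("status") == "degraded" or body.get("data") is None:
        # Assumption: a success status with no usable data is a soft failure.
        return ToolResult(False, error_kind="degraded", retryable=True)
    return ToolResult(True, data=body["data"])


print(normalize(200, {"status": "degraded", "data": None}))
print(normalize(403, {"error": "forbidden"}))
```

With a contract like this, the UI can choose its wording from error_kind and retryable instead of guessing from a raw status code.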

Defining Escalation Criteria That Actually Work in Human-AI Teams

· 10 min read
Tian Pan
Software Engineer

Most AI teams can tell you their containment rate — the percentage of interactions the AI handled without routing to a human. Far fewer can tell you whether that number is the right one.

Escalation criteria are the single most important design document in an AI-augmented team, and most teams don't have one. They have a threshold buried in a YAML file and an implicit assumption that the AI knows when it's stuck. That assumption is wrong in both directions: too high a threshold and humans spend their days redoing AI work; too low and users absorb AI errors without recourse. Both failures are invisible until they compound.

Idempotency Is Not Optional in LLM Pipelines

· 10 min read
Tian Pan
Software Engineer

A batch inference job finishes after six minutes. The network hiccups on the response. Your retry logic kicks in. Two minutes later the job finishes again — and your invoice doubles. This is the tamest version of what happens when you apply traditional idempotency thinking to LLM pipelines without adapting it to stochastic systems.

Most production teams discover the problem the hard way: a retry that was supposed to recover from a transient error triggers a second payment, sends a duplicate email, or writes a contradictory record to the database. The fix is not better retry logic — it is a different mental model of what idempotency even means when your core component is probabilistic.
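One common approach, offered here as a sketch rather than as the post's prescription, is to key side effects on the intent of the operation instead of on generated text, since a retried LLM call may word the same action differently. The ledger below is in-memory only and hypothetical; a real system would need durable storage and an atomic check-and-set.

```python
import hashlib
import json


def idempotency_key(operation, payload):
    """Stable key derived from the logical operation, not from model output."""
    canonical = json.dumps(
        {"op": operation, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class SideEffectLedger:
    """Remembers which side effects already ran (in-memory, illustration only)."""

    def __init__(self):
        self._done = {}

    def run_once(self, key, effect):
        # Execute the effect only the first time this key is seen.
        if key not in self._done:
            self._done[key] = effect()
        return self._done[key]


sent = []
ledger = SideEffectLedger()
key = idempotency_key("send_email", {"to": "a@example.com", "template": "welcome"})
for _ in range(2):  # simulate the original attempt plus a retry
    ledger.run_once(key, lambda: sent.append("welcome") or "sent")
print(len(sent))
```

The retry still runs the pipeline, but the email goes out once because both attempts resolve to the same key.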

The Last-Mile Reliability Problem: Why 95% Accuracy Often Means 0% Usable

· 9 min read
Tian Pan
Software Engineer

You built an AI feature. You ran evals. You saw 95% accuracy on your test set. You shipped it. Six weeks later, users hate it and your team is quietly planning to roll it back.

This is the last-mile reliability problem, and it is probably the most common cause of AI feature failure in production today. It has nothing to do with your model being bad and everything to do with how average accuracy metrics hide the distribution of failures — and how certain failures are disproportionately expensive regardless of their statistical frequency.

Your Model Is Most Wrong When It Sounds Most Sure: LLM Calibration in Production

· 9 min read
Tian Pan
Software Engineer

There's a failure mode that bites teams repeatedly after they've solved the easier problems — hallucination filtering, output parsing, retry logic. The model is giving confident-sounding wrong answers, the confidence-based routing logic is trusting those wrong answers, and the system is silently misbehaving in production while the eval dashboard looks fine.

This isn't a prompting problem. It's a calibration problem, and it's baked into how modern LLMs are trained.
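A standard way to quantify miscalibration is expected calibration error (ECE): bucket predictions by stated confidence and compare each bucket's average confidence with its observed accuracy. The sketch below is a minimal version with equal-width bins; where the confidence scores come from (logprobs, a verifier, self-report) is left as an assumption.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than out of range.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical data: the model claims 90% confidence but is right half the time.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(round(overconfident, 3))
```

A routing rule that trusts answers above a confidence threshold is only as good as this gap is small, which is why it is worth measuring before the threshold is tuned.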