Skip to main content

238 posts tagged with "reliability"

View all tags

The Agent That Wouldn't Stop: Scope Creep as a Runtime Failure Mode

· 9 min read
Tian Pan
Software Engineer

You asked the agent to fix a flaky test. At minute three, the test passes. At minute four, the agent is reading neighbouring files. At minute nine, it has "improved" a helper that the test never touched, renamed an unrelated parameter for clarity, and started a refactor of the fixture builder. The diff that lands is twelve files and four hundred lines. The original bug is fixed. So is some other code that wasn't broken.

This is not a model getting confused. This is a model doing exactly what its instructions left room for. The task said "fix the bug." It did not say "stop after the bug is fixed." Most agent loops have a defined start and a defined success criterion, and a very fuzzy answer to the third question: when are you done? In a chat session, "done" is whatever the user accepts. In an autonomous loop, "done" is whatever the stopping condition says, and if you didn't write one, the stopping condition is "the model lost interest." That isn't a failure mode you can debug. It's a failure mode you have to design out.

The OOO Auto-Reply Your Agent Did Not Read

· 8 min read
Tian Pan
Software Engineer

Your support agent pages a human at 2 a.m. The human has been out for a week. The OOO message lives in the same inbox the agent is reading. The agent pings the human anyway. The auto-reply lands. The agent thanks it politely and pings again, because the reply did not contain the resolution code it was waiting on. Twelve cycles in, somebody on a different team notices the unread thread is now sixty messages deep and goes manually wake up the on-call.

The agent did exactly what the prompt told it to do. The prompt told it to escalate to a person. The person was a string, not a role. The string did not know about PTO.

Provider Rate Limits Are a Capacity Plan You Never Wrote

· 9 min read
Tian Pan
Software Engineer

The first time your application hits a 429 from a model provider, something important happens, and almost nobody notices it. Not the error itself — the line of code that runs next. Maybe your HTTP client retries with exponential backoff. Maybe it falls back to a smaller model. Maybe it queues the request, or drops it, or surfaces a spinner that never resolves. Whatever it does, that behavior is now your capacity policy. It decides which users get served and which get degraded when demand exceeds supply.

And you almost certainly didn't write it. It was authored by whoever wrote the SDK wrapper, the retry decorator, or the three-line try/except someone copied from a tutorial. The most consequential decision in your system under load — what to do when you can't do everything — is being made by code nobody reviewed as a capacity decision.

This post is an argument for treating that code as what it actually is: a load-shedding policy and a capacity plan. Not an error handler. The 429 is not the problem. The problem is that you have outsourced the design of your system's behavior under contention to a library default.

Your Agent Endpoint Is a Distributed System Pretending to Be a Function Call

· 9 min read
Tian Pan
Software Engineer

The most dangerous line of code in a modern AI application looks completely innocent:

result = await agent.run(user_query)

It reads like a function call. It has a name, it takes an argument, it returns a value. Your IDE autocompletes it. Your type checker is satisfied. And that single await is hiding a remote, multi-hop, partially-failing distributed system behind the syntax of a local procedure. The gap between what the code looks like and what it actually does is where most production agent incidents live.

Your Agent Has No Concept of Business Hours

· 10 min read
Tian Pan
Software Engineer

A support agent at a mid-size SaaS company resolved a billing dispute correctly. It read the ticket, checked the customer's account, found the duplicate charge, issued the refund, and sent a polite confirmation email. Every step was right. The only problem was the timestamp: 3:14 a.m. in the customer's timezone. The customer woke up to a refund notification, assumed their card had been compromised, and opened a fraud case with their bank before anyone at the company was awake to explain.

Nothing in that workflow was a bug in the conventional sense. The agent didn't hallucinate, didn't pick the wrong account, didn't miscalculate the refund. It just had no idea that 3 a.m. is a bad time to tell someone money moved. The model has read more text about human sleep schedules than any person alive, and it still acted as if the recipient were a server endpoint that is awake whenever you call it.

The Bug You Can't Reproduce Because the Model Picked a Different Token

· 10 min read
Tian Pan
Software Engineer

A user files a bug. The summary your agent generated dropped a critical paragraph, or the JSON came back malformed, or the answer was confidently wrong. You open the ticket, copy the request, and replay it. It works. You replay it again. Still works. You mark the ticket "cannot reproduce" and move on.

The bug is still there. It is still happening to real users. You just closed it because your debugging toolchain assumes that a fixed input produces a fixed output — and the component you are debugging samples from a probability distribution.

The Fault You Never Inject: Feeding Your Agent a Tool That Lies

· 10 min read
Tian Pan
Software Engineer

Open the resilience suite for your agent and look at what it actually tests. You will find timeouts. You will find connection drops, 500s, rate-limit responses, malformed JSON, maybe a tool that hangs for thirty seconds before failing. All of it is fault injection in the classic mold: the tool is broken, and the question is whether your agent degrades gracefully.

Now look for the test where the tool is not broken at all. The one where the tool responds in 80 milliseconds, returns perfectly valid JSON against the schema, and the value inside is simply wrong. A balance that is stale by three days. A customer record with two fields swapped. An order quantity with two digits transposed. An empty result list for a query that should have returned forty rows.

You will not find it. Almost nobody injects that fault. And it is the one fault your agent is least equipped to survive, because every other fault announces itself and this one does not.

The GUI Agent That Clicked the Right Button on the Wrong Screen

· 10 min read
Tian Pan
Software Engineer

A computer-use agent takes a screenshot, reasons about it, decides to click the "Confirm" button at pixel (840, 612), and dispatches the click. By the time the cursor lands, a modal has appeared. The pixel that was "Confirm" three seconds ago is now "Delete." The agent did exactly what it planned. It planned against a screen that no longer exists.

This is not a grounding error. The model correctly identified the button. It is not a reasoning error. The plan was sound. It is a timing error — the most under-instrumented failure class in GUI automation — and your test suite almost certainly does not cover it, because your test environment never moves between the observation and the action.

The uncomfortable measurement: one recent study of desktop agents on real Ubuntu workloads found a mean gap of 6.51 seconds between when an agent observes the screen and when it acts on that observation. Six and a half seconds is an eternity for a UI. Notifications fire, lazy lists finish loading, animations settle, focus shifts. The agent's mental model of the screen has a shelf life, and almost no agent framework treats it that way.

Your Retry Logic Is Teaching the Agent the Wrong Lesson

· 10 min read
Tian Pan
Software Engineer

A tool call fails. Your agent framework retries it three times with exponential backoff. The third attempt goes through. The trace shows a green checkmark. Nobody gets paged, no error counter increments, the user gets their answer. By every dashboard you have, the system worked.

It didn't. The tool failed because the agent passed a malformed argument, and the only reason the third try succeeded is that the agent — sampling differently each time — happened to phrase the call correctly on attempt three. You didn't recover from a transient fault. You ran a slot machine until it paid out, then logged the payout and threw away the two pulls that told you the agent was broken.

This is the quiet way retry logic rots an agent system. Retries were designed for a world where the caller is correct and the network is flaky. Agents invert that assumption: the network is mostly fine, and the caller is the unreliable part. When you point a retry policy built for the first world at the second one, it stops being a recovery mechanism and becomes a way to launder bugs into green checkmarks.

The Agent That Memorized Your Bug: Why a Fix Is a Memory-Invalidation Event

· 9 min read
Tian Pan
Software Engineer

A few months ago, one of your downstream APIs returned a malformed timestamp — seconds where it should have been milliseconds, or a null where the schema promised a string. Your agent hit it, reasoned through the breakage, and worked out a fix: multiply by 1000, or fall back to a default, or retry with a different endpoint. It solved the immediate problem. Then it did something quietly consequential. It wrote the workaround down.

Maybe it saved a note to long-term memory: "The billing API returns timestamps in seconds; convert before use." Maybe the interaction got swept into a fine-tuning dataset, and the workaround became a learned reflex. Either way, the agent now carries a belief about the world. And last week, the API team shipped a fix. The timestamps are correct now. Nobody told the agent.

The Approval Queue Nobody Drains

· 10 min read
Tian Pan
Software Engineer

You did the responsible thing. You looked at your agent, identified the actions that could cause real damage — issuing a refund, deleting a record, sending an external email, deploying a config change — and you routed them to a human for approval. Risk-tiered gating. Textbook. The review board signed off.

Then a customer escalation came in three weeks later: an agent task had been "in progress" since the previous Tuesday. Not failed. Not errored. Just sitting in a human approval queue that, it turned out, nobody was actually watching. The agent had done its job, parked the dangerous action behind a gate, and waited. The gate had no owner. The task aged silently in a place where no dashboard pointed and no alarm fired.

The Degradation Signals Your Agent Never Receives

· 9 min read
Tian Pan
Software Engineer

When a downstream API starts to wobble, a human operator finds out a dozen ways before anything actually breaks. The status page flips to yellow. A changelog email lands in the inbox. A warning banner appears in the provider's dashboard. The on-call channel lights up with a 429 someone spotted in the logs. A teammate posts "anyone else seeing slow writes?" None of these are responses to a request. They are the ambient operational signal that surrounds the API, and a human absorbs it almost passively.

An agent calling the same API receives exactly one thing: the response to the request it just made. Status code, headers, body. That is the entire channel. It has no inbox, no dashboard, no Slack, no peripheral vision. It cannot notice that the last ten calls each took twice as long as the ten before. It cannot read the status page, because nobody handed it the URL and it has no standing instruction to look. When the dependency degrades, the agent is the last party in the system to find out — and it usually finds out by failing.

This asymmetry is not a model capability problem. A smarter model does not fix it. The agent is blind to operational signals because the plumbing never delivers them, and most agent stacks ship without anyone noticing the plumbing is missing.