252 posts tagged with "reliability"

The Approval Queue Nobody Drains

May 17, 2026 · 10 min read

Software Engineer

You did the responsible thing. You looked at your agent, identified the actions that could cause real damage — issuing a refund, deleting a record, sending an external email, deploying a config change — and you routed them to a human for approval. Risk-tiered gating. Textbook. The review board signed off.

Then a customer escalation came in three weeks later: an agent task had been "in progress" since the previous Tuesday. Not failed. Not errored. Just sitting in a human approval queue that, it turned out, nobody was actually watching. The agent had done its job, parked the dangerous action behind a gate, and waited. The gate had no owner. The task aged silently in a place where no dashboard pointed and no alarm fired.
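The unowned gate described above can be made visible with a periodic check over the queue. The following is a minimal sketch, assuming a hypothetical `PendingApproval` record and an arbitrary four-hour limit; none of these names or numbers come from the post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class PendingApproval:
    """Hypothetical shape of one item waiting in a human approval queue."""
    task_id: str
    requested_at: datetime
    owner: Optional[str] = None


def needs_escalation(
    items: List[PendingApproval],
    now: datetime,
    max_age: timedelta = timedelta(hours=4),  # assumed limit, tune per risk tier
) -> List[PendingApproval]:
    # Flag anything with no owner, or anything that has waited past the limit.
    return [
        item
        for item in items
        if item.owner is None or now - item.requested_at > max_age
    ]


now = datetime(2026, 5, 17, 12, 0, tzinfo=timezone.utc)
queue = [
    PendingApproval("refund-1", now - timedelta(days=5)),
    PendingApproval("email-2", now - timedelta(minutes=10), owner="alice"),
]
stale = needs_escalation(queue, now)
```

In this example only `refund-1` is flagged: it has no owner and has been waiting for days. Where the alert goes is a separate decision; the point is that the queue's age is measured at all.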

The Degradation Signals Your Agent Never Receives

May 17, 2026 · 9 min read

Tian Pan

Software Engineer

When a downstream API starts to wobble, a human operator finds out a dozen ways before anything actually breaks. The status page flips to yellow. A changelog email lands in the inbox. A warning banner appears in the provider's dashboard. The on-call channel lights up with a 429 someone spotted in the logs. A teammate posts "anyone else seeing slow writes?" None of these are responses to a request. They are the ambient operational signal that surrounds the API, and a human absorbs it almost passively.

An agent calling the same API receives exactly one thing: the response to the request it just made. Status code, headers, body. That is the entire channel. It has no inbox, no dashboard, no Slack, no peripheral vision. It cannot notice that the last ten calls each took twice as long as the ten before. It cannot read the status page, because nobody handed it the URL and it has no standing instruction to look. When the dependency degrades, the agent is the last party in the system to find out — and it usually finds out by failing.

This asymmetry is not a model capability problem. A smarter model does not fix it. The agent is blind to operational signals because the plumbing never delivers them, and most agent stacks ship without anyone noticing the plumbing is missing.
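One way to picture the missing plumbing is a small wrapper that records call latency and passes a summary to the agent alongside each response. This is a sketch under stated assumptions: `LatencyWindow`, the window size of ten, and the 2x threshold are illustrative choices, not something the post prescribes.

```python
from collections import deque
from statistics import mean


class LatencyWindow:
    """Compares the most recent calls against the calls before them (hypothetical helper)."""

    def __init__(self, size: int = 10, slowdown_factor: float = 2.0):
        self.size = size
        self.slowdown_factor = slowdown_factor
        # Holds the previous window and the current window back to back.
        self.samples = deque(maxlen=2 * size)

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def degraded(self) -> bool:
        # Without two full windows there is nothing to compare yet.
        if len(self.samples) < 2 * self.size:
            return False
        ordered = list(self.samples)
        previous, recent = ordered[: self.size], ordered[self.size:]
        return mean(recent) >= self.slowdown_factor * mean(previous)


window = LatencyWindow(size=10)
for _ in range(10):
    window.record(0.1)  # ten earlier calls
for _ in range(10):
    window.record(0.3)  # ten recent, slower calls

# A field like this could be attached to the tool result the agent sees.
signal = {"dependency_degraded": window.degraded()}
```

The agent still only sees responses, but each response now carries a fact about the trend, which is the kind of ambient signal a human operator gets for free.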

The Demo-to-Production Cliff: Why a 90%-Accurate Agent Ships at 0%

May 17, 2026 · 9 min read

Tian Pan

Software Engineer

There is a specific kind of meeting that happens about six weeks after an impressive agent demo. The prototype booked the trip, refactored the module, reconciled the invoices — live, on the first try, in front of stakeholders. Everyone agreed it was ready. Then someone pulled the production numbers, and the agent that "worked" was generating a support ticket every forty completed tasks, a refund every few hundred, and a quiet trail of half-finished states nobody could explain. The project did not get killed. It got stuck. It is still stuck.

This is the demo-to-production cliff, and it is the single most reliable way for an agent project to fail. The cliff is not caused by a bad model or a sloppy team. It is caused by a measurement mistake: treating a 90% success rate as 90% of the way to shipping. It is not. A 90%-accurate agent is a triumphant demo and, for most real workflows, an unshippable product. The MIT NANDA report that made headlines in 2025 — 95% of enterprise GenAI pilots delivering no measurable P&L impact — is this cliff, counted at scale.

The Eval Set Is a Lagging Indicator: Your Green Dashboard Only Knows Last Quarter's Failures

May 17, 2026 · 8 min read

Tian Pan

Software Engineer

Every mature AI team builds its eval suite the same way, and almost nobody says the quiet part out loud. A failure shows up in production. Someone writes a postmortem. An engineer distills the incident into a test case, adds it to the eval suite, and the dashboard goes green again. Repeat this loop for a year and you have a few hundred cases, a satisfying pass rate, and a deeply comforting number to put on a slide.

Here is the quiet part: that suite is a museum. Every exhibit is a failure class the team has already survived. A 98% pass rate certifies your system against the past — against the specific ways it has already broken — and says almost nothing about the novel failure mode that a model migration, a prompt edit, or a shift in user behavior is about to introduce. The eval set is a lagging indicator wearing the costume of a leading one.

The Fallback Model You Never Load-Tested

May 17, 2026 · 8 min read

Tian Pan

Software Engineer

Every resilient LLM design has a line in the config that names a secondary model. It is there because someone, during a design review, asked the right question, "what happens when the primary is down?", and someone else answered it with a "fallback:" key. Everyone nodded. The architecture diagram got a second box with a dotted arrow. The compliance doc got a sentence about graceful degradation.

And then nobody touched it again.

The fallback model is the most confidently asserted, least exercised component in most production AI systems. It is named, documented, and diagrammed, and the day it actually carries traffic is also the day it first encounters a real request. You did not build a safety net. You built a second model with an unknown breaking strain, and you will discover that strain at the worst possible moment.

Your Fallback Path Is the Only Untested Code in Production

May 17, 2026 · 9 min read

Tian Pan

Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.
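Exercising a fallback can be as simple as sending a small, deterministic slice of healthy traffic through it. The sketch below assumes hypothetical `primary` and `fallback` callables and an arbitrary 1% slice; it illustrates the idea and is not the post's implementation.

```python
import hashlib


def use_fallback(request_id: str, exercise_percent: int = 1) -> bool:
    """Pick a stable slice of requests to route through the fallback on purpose."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < exercise_percent


def handle(request_id: str, primary, fallback, exercise_percent: int = 1):
    # primary and fallback are hypothetical callables that take the request id.
    if use_fallback(request_id, exercise_percent):
        return fallback(request_id)
    try:
        return primary(request_id)
    except Exception:
        # The same branch an incident would take, now also run daily.
        return fallback(request_id)
```

Hashing the request id keeps the choice stable across retries, so a given request does not flip between paths. Whether the slice is 1% or something else depends on how costly a degraded answer is.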

The Happy Path Is the Only Path Your Agent Eval Ever Tested

May 17, 2026 · 10 min read

Tian Pan

Software Engineer

Look at where most agent eval sets come from. Someone builds the agent, demos it to the team, the demo works, and the demo script becomes the eval suite. The cases that pass review are the cases someone already watched pass. The eval set is, almost by construction, a recording of the happy path — the one tool sequence that worked the day the screenshot was taken.

So when the dashboard says the agent scores 94%, what it actually says is: it passes the cases we imagined. It says nothing about the case where the search API returns a 429 in the middle of a multi-step plan, where the user contradicts a constraint they stated two turns ago, or where retrieval comes back empty and the agent has to decide between guessing and admitting it doesn't know. Those cases aren't failing your eval. They were never in it.

This is golden-path bias, and it is the default shape of an agent eval suite unless you fight it deliberately. The fix is not more cases. It is different cases — chosen by failure mode, harvested from production, and stress-tested with deliberate faults.
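A deliberate fault can be injected by wrapping a tool so that a chosen call fails. The following is a toy sketch: `FaultyTool`, `RateLimited`, `search`, and `run_plan` are all hypothetical stand-ins, and a real harness would wrap the agent's actual tools.

```python
class RateLimited(Exception):
    """Stands in for an HTTP 429 returned by a tool (hypothetical)."""


class FaultyTool:
    """Wraps a tool and raises on the chosen call numbers (1-based)."""

    def __init__(self, tool, fail_on_calls):
        self.tool = tool
        self.fail_on_calls = set(fail_on_calls)
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls in self.fail_on_calls:
            raise RateLimited(f"injected 429 on call {self.calls}")
        return self.tool(*args, **kwargs)


def search(query: str) -> list:
    # Placeholder tool; always succeeds.
    return [f"result for {query}"]


def run_plan(tool) -> str:
    """Toy two-step plan that reports how it ended."""
    try:
        tool("step one")
        tool("step two")
    except RateLimited:
        return "stopped: rate limited mid-plan"
    return "completed"


happy = run_plan(FaultyTool(search, fail_on_calls=[]))
faulted = run_plan(FaultyTool(search, fail_on_calls=[2]))
```

The same plan yields two eval cases: the happy path, and the one where the second call is rate limited. The second is the case the demo script never recorded.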

The Idempotency Key Your Agent Never Sent

May 17, 2026 · 11 min read

Tian Pan

Software Engineer

A customer once got refunded three times for a single return. Not because the model hallucinated a policy, not because a human fat-fingered a form — because the refund tool timed out twice, the agent retried both times, and every retry carried a fresh request with no way for the payment processor to know it had seen this work before. Three clean HTTP 200s. Three real movements of money. The agent did exactly what it was told: when a call fails, try again.

The bug was not in the model. The bug was in a header that was never sent.

Retrying is the single most natural thing an agent does. A tool call returns an error, or worse, returns nothing at all, and the loop's instinct — encoded in the framework, the prompt, or the model's own training — is to try the action again. That instinct is correct for reads and catastrophic for writes. The difference between a resilient agent and one that double-charges customers is not intelligence. It is whether every state-changing tool call carries an idempotency key, and whether the system on the other end actually honors it.
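The two halves of that condition can be sketched together: a key derived from the logical action rather than the attempt, and a receiver that remembers keys it has seen. `idempotency_key` and `FakeProcessor` are hypothetical names for illustration; a real payment API defines its own header and retention rules.

```python
import hashlib
import json


def idempotency_key(task_id: str, action: str, args: dict) -> str:
    """Same task, action, and arguments always produce the same key."""
    payload = json.dumps(
        {"task": task_id, "action": action, "args": args}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class FakeProcessor:
    """Toy stand-in for a payment processor that honors idempotency keys."""

    def __init__(self):
        self.seen = {}
        self.refunds_issued = 0

    def refund(self, key: str, order_id: str, cents: int) -> dict:
        if key in self.seen:
            # Replay the stored result instead of moving money again.
            return self.seen[key]
        self.refunds_issued += 1
        result = {"order_id": order_id, "refunded_cents": cents}
        self.seen[key] = result
        return result


processor = FakeProcessor()
args = {"order_id": "order-123", "cents": 4500}
key = idempotency_key("task-42", "refund", args)

# Three attempts, as if the first two timed out on the agent's side.
results = [processor.refund(key, **args) for _ in range(3)]
```

After the three attempts, `processor.refunds_issued` is 1. The key is derived from the task, not generated per call, because a fresh random key on every retry would reproduce the original bug.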

Prompt Caching's Hidden Tax: When a Cache Hit Serves the Wrong User's Context

May 17, 2026 · 11 min read

Tian Pan

Software Engineer

Prompt caching is sold as a free win. Cache the long shared prefix — your system prompt, your tool definitions, your retrieved context — pay full price only for the short tail that changes, and watch the bill drop. The numbers are real: a cache read costs roughly a tenth of a fresh input token, so a workload with a heavy stable prefix can see its input cost fall by 80% or more. Teams adopt it for that reason, tune it for that reason, and report on it with a single metric: cache hit rate, trending up.
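The arithmetic behind that figure is short enough to write down. This sketch assumes a cache read priced at 0.1x a fresh input token, as stated above, and ignores any premium a provider may charge for cache writes; `relative_input_cost` and its parameters are illustrative names.

```python
def relative_input_cost(
    prefix_share: float, hit_rate: float, read_multiplier: float = 0.1
) -> float:
    """Input cost relative to an uncached baseline of 1.0.

    prefix_share: fraction of input tokens that sit in the cacheable prefix.
    hit_rate: fraction of requests whose prefix is served from cache.
    read_multiplier: assumed price of a cached token relative to a fresh one.
    """
    tail = 1.0 - prefix_share
    prefix = prefix_share * (hit_rate * read_multiplier + (1.0 - hit_rate))
    return tail + prefix


cost = relative_input_cost(prefix_share=0.9, hit_rate=1.0)
savings = 1.0 - cost
```

With 90% of the input in the prefix and every request hitting the cache, the input cost falls by roughly 81%. The formula rewards a larger prefix share and a higher hit rate and says nothing about what is safe to put in the prefix.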

What that framing hides is that the boundary you just drew — the line between the cached prefix and the uncached tail — is not a billing knob. It is a correctness boundary. Everything above the cache breakpoint is content the system has decided is interchangeable across requests. If you draw that line to maximize hit rate, you are letting a finance metric decide which facts in your prompt are allowed to be shared between users, between tenants, and across time. That is an isolation decision, and it deserves to be made on purpose.

The failure mode is quiet because it never throws. A cache hit that serves one user a context shaped by another user's profile returns a perfectly well-formed response. A cache hit that serves personalization that was true when the prefix was warmed and false by the time it is reused returns a confident, coherent, wrong answer. Nothing in your latency graph or your error rate moves. The only signal is a hit rate that looks great, and it looks great precisely because the key is too coarse.

The Rate Limit That Became a Product Decision

May 17, 2026 · 10 min read

Tian Pan

Software Engineer

A rate limit used to be an infrastructure detail. You hit a 429, you retried with backoff, you queued the overflow, and nobody outside the on-call channel ever knew it happened. The user saw a response that was a few hundred milliseconds slower than usual. That was the whole story.

That story no longer holds for agentic features. When an agent hits a provider's tokens-per-minute ceiling halfway through a multi-step plan, the failure does not stay inside the infrastructure. It surfaces as a half-finished answer, a tool loop that stalls before the last call, or a user watching a spinner that will never resolve. The quota stopped being a backend capacity number and became a constraint that product has to design around — the same way product designs around a checkout flow or an empty state.

The Semantic Cache That Confidently Returns the Wrong Answer

May 17, 2026 · 9 min read

Tian Pan

Software Engineer

Two support users ask your agent almost the same question within a minute of each other. The first asks, "What's our refund window for EU orders?" The second asks, "What's our refund window for US orders?" The embeddings of those two sentences sit a hair's breadth apart — same length, same structure, one two-letter token of difference. Your semantic cache, tuned to a similarity threshold that looked perfectly reasonable in the demo, scores them as a match. The second user gets the first user's answer. The EU's 14-day cooling-off period is presented to a US customer as fact, in fluent prose, with no asterisk.

Nobody gets paged for this. The cache returned a 200. Latency was great. The cost dashboard shows a hit, which is the outcome everyone wanted. The only signal that anything went wrong is a customer acting on policy that does not apply to them — and that signal arrives days later, through a refund dispute, not through your monitoring.

This is the failure mode that makes semantic caching different from every cache you have built before. An exact-match cache can be stale, but it never answers a different question than the one that was asked: the key either matches or it doesn't. A semantic cache trades that guarantee away on purpose. It is designed to return answers for keys it has never seen, and the price of that latency win is a correctness risk that most teams never put a number on.
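The EU/US collision can be reproduced with a toy similarity function, along with one possible guard. Everything here is an assumption for illustration: the bag-of-words `embed` stands in for a learned embedding model, the 0.85 threshold is arbitrary, and `REGIONS` is a hypothetical list of terms that must match exactly.

```python
import math
import re
from collections import Counter

REGIONS = {"eu", "us"}  # hypothetical terms that must match exactly


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real system would call an embedding model.
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def guard_terms(text: str) -> frozenset:
    return frozenset(embed(text)) & REGIONS


def lookup(query: str, cache: dict, threshold: float = 0.85):
    """Return a cached answer only if similarity AND guard terms both match."""
    q_vec, q_terms = embed(query), guard_terms(query)
    for cached_query, answer in cache.items():
        similar = cosine(q_vec, embed(cached_query)) >= threshold
        if similar and guard_terms(cached_query) == q_terms:
            return answer
    return None


eu_query = "What's our refund window for EU orders?"
us_query = "What's our refund window for US orders?"
cache = {eu_query: "cached EU answer"}

similarity = cosine(embed(eu_query), embed(us_query))
us_result = lookup(us_query, cache)
```

The two questions share six of seven tokens, so the similarity clears the threshold, yet `us_result` is `None` because the region terms differ. A hand-maintained term list is a blunt guard; it is shown only to make the trade-off concrete.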

The Tool That Worked Until Two Agents Called It At Once

May 17, 2026 · 9 min read

Tian Pan

Software Engineer

A tool passes its tests. You called it from one agent, watched it read a record, transform it, write it back, and return a clean result. It did exactly that, every time, for weeks. Then you scaled the agent fleet from one worker to twelve, and a customer reported that their subscription got upgraded twice in the same minute. The tool did not change. The number of things calling it did.

This is the failure mode that single-agent testing cannot catch, because single-agent testing never produces the condition that triggers it. One caller is, by construction, a serial workload. Every concurrency assumption your tool quietly relies on — that nobody else is mid-write when it reads, that a counter it increments is its own, that the draft it is editing will still be there when it saves — holds trivially when there is exactly one caller. The tool is not correct. It is untested. Those are different things, and the difference stays invisible until a second agent shows up.
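The read-transform-write race can be shown without real threads by interleaving two callers by hand, together with one common defense: a version check on write. `RecordStore` and `VersionConflict` are hypothetical names, and optimistic versioning is only one of several options.

```python
import threading


class VersionConflict(Exception):
    """Raised when a write is based on a snapshot that is no longer current."""


class RecordStore:
    """Toy single-record store with compare-and-set writes."""

    def __init__(self, value):
        self._lock = threading.Lock()
        self._value = value
        self._version = 0

    def read(self):
        with self._lock:
            return self._value, self._version

    def write(self, new_value, expected_version: int) -> None:
        with self._lock:
            if self._version != expected_version:
                raise VersionConflict(
                    f"expected v{expected_version}, store is at v{self._version}"
                )
            self._value = new_value
            self._version += 1


store = RecordStore({"plan": "basic"})

# Two agents read the same snapshot before either one writes.
_, version_a = store.read()
_, version_b = store.read()

store.write({"plan": "pro"}, version_a)  # first upgrade lands
try:
    store.write({"plan": "pro"}, version_b)  # second is based on stale state
    second_write = "applied"
except VersionConflict:
    second_write = "rejected"
```

The second write is rejected instead of silently upgrading the subscription twice. With a single caller the version check never fires, which is why a one-agent test suite cannot tell whether it exists.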
