238 posts tagged with "reliability"

The Demo-to-Production Cliff: Why a 90%-Accurate Agent Ships at 0%

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of meeting that happens about six weeks after an impressive agent demo. The prototype booked the trip, refactored the module, reconciled the invoices — live, on the first try, in front of stakeholders. Everyone agreed it was ready. Then someone pulled the production numbers, and the agent that "worked" was generating a support ticket every forty completed tasks, a refund every few hundred, and a quiet trail of half-finished states nobody could explain. The project did not get killed. It got stuck. It is still stuck.

This is the demo-to-production cliff, and it is the single most reliable way for an agent project to fail. The cliff is not caused by a bad model or a sloppy team. It is caused by a measurement mistake: treating a 90% success rate as 90% of the way to shipping. It is not. A 90%-accurate agent is a triumphant demo and, for most real workflows, an unshippable product. The MIT NANDA report that made headlines in 2025 — 95% of enterprise GenAI pilots delivering no measurable P&L impact — is this cliff, counted at scale.

The Eval Set Is a Lagging Indicator: Your Green Dashboard Only Knows Last Quarter's Failures

· 8 min read
Tian Pan
Software Engineer

Every mature AI team builds its eval suite the same way, and almost nobody says the quiet part out loud. A failure shows up in production. Someone writes a postmortem. An engineer distills the incident into a test case, adds it to the eval suite, and the dashboard goes green again. Repeat this loop for a year and you have a few hundred cases, a satisfying pass rate, and a deeply comforting number to put on a slide.

Here is the quiet part: that suite is a museum. Every exhibit is a failure class the team has already survived. A 98% pass rate certifies your system against the past — against the specific ways it has already broken — and says almost nothing about the novel failure mode that a model migration, a prompt edit, or a shift in user behavior is about to introduce. The eval set is a lagging indicator wearing the costume of a leading one.

The Fallback Model You Never Load-Tested

· 8 min read
Tian Pan
Software Engineer

Every resilient LLM design has a line in the config that names a secondary model. It is there because someone, during a design review, asked the right question — "what happens when the primary is down?" — and someone else answered it with a fallback: key. Everyone nodded. The architecture diagram got a second box with a dotted arrow. The compliance doc got a sentence about graceful degradation.

And then nobody touched it again.

The fallback model is the most confidently asserted, least exercised component in most production AI systems. It is named, documented, and diagrammed, yet the day it first carries traffic is also the day it first meets a real request. You did not build a safety net. You built a second model with an unknown breaking strain, and you will discover that strain at the worst possible moment.

Your Fallback Path Is the Only Untested Code in Production

· 9 min read
Tian Pan
Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.
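One way to keep that branch from being dead code is to send a small, deliberate slice of healthy traffic through it every day. What follows is a minimal sketch of the idea, not a production router: `make_router`, `exercise_rate`, and the two handler callables are hypothetical names, and a real system would also need metrics and a quality comparison on the exercised requests.

```python
import random
from typing import Callable, Optional


def make_router(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    exercise_rate: float = 0.01,          # assumed value; tune to your traffic
    rng: Optional[random.Random] = None,
) -> Callable[[str], str]:
    """Build a handler that routinely exercises the fallback path."""
    rng = rng or random.Random()

    def handle(request: str) -> str:
        # Deliberate exercise: a fraction of normal traffic takes the
        # degraded path even though the primary is healthy.
        if rng.random() < exercise_rate:
            return fallback(request)
        try:
            return primary(request)
        except Exception:
            # Real failure: this is the same code that already runs daily.
            return fallback(request)

    return handle
```

With `exercise_rate=0.01`, roughly one request in a hundred goes through the fallback on an ordinary day, so its latency, output shape, and error handling show up in the same dashboards as everything else.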

The Happy Path Is the Only Path Your Agent Eval Ever Tested

· 10 min read
Tian Pan
Software Engineer

Look at where most agent eval sets come from. Someone builds the agent, demos it to the team, the demo works, and the demo script becomes the eval suite. The cases that pass review are the cases someone already watched pass. The eval set is, almost by construction, a recording of the happy path — the one tool sequence that worked the day the screenshot was taken.

So when the dashboard says the agent scores 94%, what it actually says is: it passes the cases we imagined. It says nothing about the case where the search API returns a 429 in the middle of a multi-step plan, where the user contradicts a constraint they stated two turns ago, or where retrieval comes back empty and the agent has to decide between guessing and admitting it doesn't know. Those cases aren't failing your eval. They were never in it.

This is golden-path bias, and it is the default shape of an agent eval suite unless you fight it deliberately. The fix is not more cases. It is different cases — chosen by failure mode, harvested from production, and stress-tested with deliberate faults.
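A deliberate fault can be as small as a wrapper that makes one tool call fail. The sketch below is illustrative only: `RateLimited`, `with_fault`, `search`, and `toy_agent` are hypothetical stand-ins, and the toy agent simply records the failure rather than modeling any real framework's recovery logic.

```python
class RateLimited(Exception):
    """Stand-in for an HTTP 429 returned by a tool."""


def with_fault(tool, fail_on_call: int):
    """Wrap a tool so that its Nth call (1-based) raises RateLimited."""
    calls = {"n": 0}

    def wrapped(*args, **kwargs):
        calls["n"] += 1
        if calls["n"] == fail_on_call:
            raise RateLimited("429 Too Many Requests (injected)")
        return tool(*args, **kwargs)

    return wrapped


def search(query: str):
    """Hypothetical search tool that always succeeds."""
    return [f"result for {query}"]


def toy_agent(search_tool, queries):
    """Hypothetical agent loop: records a failure instead of guessing."""
    outcomes = []
    for q in queries:
        try:
            outcomes.append(("ok", search_tool(q)))
        except RateLimited:
            outcomes.append(("rate_limited", q))
    return outcomes


# An eval case that fails the second step of a three-step plan.
flaky_search = with_fault(search, fail_on_call=2)
outcomes = toy_agent(flaky_search, ["flights", "hotels", "cars"])
```

The eval assertion is then about behavior under the fault: did the agent notice the failed step, and did it avoid presenting a partial plan as complete?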

The Idempotency Key Your Agent Never Sent

· 11 min read
Tian Pan
Software Engineer

A customer once got refunded three times for a single return. Not because the model hallucinated a policy, not because a human fat-fingered a form — because the refund tool timed out twice, the agent retried both times, and every retry carried a fresh request with no way for the payment processor to know it had seen this work before. Three clean HTTP 200s. Three real movements of money. The agent did exactly what it was told: when a call fails, try again.

The bug was not in the model. The bug was in a header that was never sent.

Retrying is the single most natural thing an agent does. A tool call returns an error, or worse, returns nothing at all, and the loop's instinct — encoded in the framework, the prompt, or the model's own training — is to try the action again. That instinct is correct for reads and catastrophic for writes. The difference between a resilient agent and one that double-charges customers is not intelligence. It is whether every state-changing tool call carries an idempotency key, and whether the system on the other end actually honors it.
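The shape of the fix is small enough to sketch. The example below assumes a hypothetical processor, `FakeProcessor`, that honors idempotency keys and loses its first two responses to timeouts after doing the work. The detail that matters is that the key is generated once per logical refund, outside the retry loop.

```python
import uuid


class FakeProcessor:
    """Hypothetical payment processor that honors idempotency keys."""

    def __init__(self, timeouts_before_success: int = 2):
        self.timeouts_left = timeouts_before_success
        self.responses = {}        # idempotency_key -> stored response
        self.refunds_issued = 0    # how many times money actually moved

    def refund(self, order_id: str, amount_cents: int, idempotency_key: str):
        if idempotency_key not in self.responses:
            # Money moves only the first time this key is seen.
            self.refunds_issued += 1
            self.responses[idempotency_key] = {
                "order_id": order_id,
                "amount_cents": amount_cents,
                "status": "refunded",
            }
        if self.timeouts_left > 0:
            # Simulate a response lost after the work was done.
            self.timeouts_left -= 1
            raise TimeoutError("response lost")
        return self.responses[idempotency_key]


def refund_with_retries(processor, order_id: str, amount_cents: int,
                        max_attempts: int = 3):
    # One key per logical operation: every retry reuses it.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return processor.refund(order_id, amount_cents, idempotency_key=key)
        except TimeoutError:
            continue
    raise TimeoutError("refund did not complete")
```

Moving `uuid.uuid4()` inside the loop reproduces the triple refund: each retry would look like new work to the processor.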

Prompt Caching's Hidden Tax: When a Cache Hit Serves the Wrong User's Context

· 11 min read
Tian Pan
Software Engineer

Prompt caching is sold as a free win. Cache the long shared prefix — your system prompt, your tool definitions, your retrieved context — pay full price only for the short tail that changes, and watch the bill drop. The numbers are real: a cache read costs roughly a tenth of a fresh input token, so a workload with a heavy stable prefix can see its input cost fall by 80% or more. Teams adopt it for that reason, tune it for that reason, and report on it with a single metric: cache hit rate, trending up.

What that framing hides is that the boundary you just drew — the line between the cached prefix and the uncached tail — is not a billing knob. It is a correctness boundary. Everything above the cache breakpoint is content the system has decided is interchangeable across requests. If you draw that line to maximize hit rate, you are letting a finance metric decide which facts in your prompt are allowed to be shared between users, between tenants, and across time. That is an isolation decision, and it deserves to be made on purpose.

The failure mode is quiet because it never throws. A cache hit that serves one user's context shaped by another user's profile returns a perfectly well-formed response. A cache hit that serves personalization that was true when the prefix was warmed and false by the time it is reused returns a confident, coherent, wrong answer. Nothing in your latency graph or your error rate moves. The only signal is a hit rate that looks great — because the key is too coarse.
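For a cache you control at the application layer, one way to make the isolation decision explicit is to put the scope into the key itself. This is a sketch under assumptions: `cache_key`, `tenant_id`, `user_id`, and `profile_version` are hypothetical names, and provider-side prompt caches have their own keying rules that this does not model.

```python
import hashlib
from typing import Optional


def cache_key(
    prefix: str,
    tenant_id: str,
    user_id: Optional[str] = None,
    profile_version: Optional[int] = None,
) -> str:
    """Derive a cache key from the prefix AND the scope it may be shared in.

    Pass user_id=None only for prefixes that are safe to share across
    users of a tenant. profile_version changes whenever personalization
    changes, so stale personalization cannot be reused.
    """
    scope = f"tenant={tenant_id}|user={user_id}|profile_v={profile_version}"
    return hashlib.sha256(f"{scope}\n{prefix}".encode("utf-8")).hexdigest()


prefix = "SYSTEM PROMPT ... personalized context ..."
key_alice = cache_key(prefix, tenant_id="acme", user_id="alice", profile_version=3)
key_bob = cache_key(prefix, tenant_id="acme", user_id="bob", profile_version=3)
key_alice_new = cache_key(prefix, tenant_id="acme", user_id="alice", profile_version=4)
```

The hit rate will drop when the key gets finer. That drop is the cost of isolation, and seeing it as a number is what lets the tradeoff be made on purpose.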

The Rate Limit That Became a Product Decision

· 10 min read
Tian Pan
Software Engineer

A rate limit used to be an infrastructure detail. You hit a 429, you retried with backoff, you queued the overflow, and nobody outside the on-call channel ever knew it happened. The user saw a response that was a few hundred milliseconds slower than usual. That was the whole story.

That story no longer holds for agentic features. When an agent hits a provider's tokens-per-minute ceiling halfway through a multi-step plan, the failure does not stay inside the infrastructure. It surfaces as a half-finished answer, a tool loop that stalls before the last call, or a user watching a spinner that will never resolve. The quota stopped being a backend capacity number and became a constraint that product has to design around — the same way product designs around a checkout flow or an empty state.

The Semantic Cache That Confidently Returns the Wrong Answer

· 9 min read
Tian Pan
Software Engineer

Two support users ask your agent almost the same question within a minute of each other. The first asks, "What's our refund window for EU orders?" The second asks, "What's our refund window for US orders?" The embeddings of those two sentences sit a hair's breadth apart — same length, same structure, one two-letter token of difference. Your semantic cache, tuned to a similarity threshold that looked perfectly reasonable in the demo, scores them as a match. The second user gets the first user's answer. The EU's 14-day cooling-off period is presented to a US customer as fact, in fluent prose, with no asterisk.

Nobody gets paged for this. The cache returned a 200. Latency was great. The cost dashboard shows a hit, which is the outcome everyone wanted. The only signal that anything went wrong is a customer acting on policy that does not apply to them — and that signal arrives days later, through a refund dispute, not through your monitoring.

This is the failure mode that makes semantic caching different from every cache you have built before. An exact-match cache can be stale, but it never answers a different question than the one that was asked: the key either matches or it doesn't. A semantic cache trades that guarantee away on purpose. It is designed to return answers for keys it has never seen, and the price of that latency win is a correctness risk that most teams never put a number on.
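One common mitigation is to require an exact match on the fields that change the answer before similarity is allowed to count. The sketch below is an illustration with hand-made vectors rather than a real embedding model; `GuardedSemanticCache` and the `facets` dictionary are hypothetical names, and extracting facets such as region from a query is assumed to happen upstream.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class GuardedSemanticCache:
    """Semantic cache that only matches within identical facets."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, facets, answer)

    def put(self, embedding, facets, answer):
        self.entries.append((list(embedding), dict(facets), answer))

    def get(self, embedding, facets):
        for cached_embedding, cached_facets, answer in self.entries:
            # Exact match on answer-changing fields comes first;
            # similarity is only consulted inside that partition.
            if cached_facets != facets:
                continue
            if cosine(cached_embedding, embedding) >= self.threshold:
                return answer
        return None


cache = GuardedSemanticCache(threshold=0.95)
cache.put([1.0, 0.0, 0.10], {"region": "EU"}, "cached EU answer")

# Nearly identical vector, different region: treated as a miss.
us_lookup = cache.get([1.0, 0.0, 0.12], {"region": "US"})
# Nearly identical vector, same region: served from cache.
eu_lookup = cache.get([1.0, 0.0, 0.12], {"region": "EU"})
```

The guard does not make the cache correct. It only removes the class of wrong hits that differ on a field you thought to extract, which is why measuring the false-hit rate on sampled traffic still matters.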

The Tool That Worked Until Two Agents Called It At Once

· 9 min read
Tian Pan
Software Engineer

A tool passes its tests. You called it from one agent, watched it read a record, transform it, write it back, and return a clean result. It did exactly that, every time, for weeks. Then you scaled the agent fleet from one worker to twelve, and a customer reported that their subscription got upgraded twice in the same minute. The tool did not change. The number of things calling it did.

This is the failure mode that single-agent testing cannot catch, because single-agent testing never produces the condition that triggers it. One caller is, by construction, a serial workload. Every concurrency assumption your tool quietly relies on — that nobody else is mid-write when it reads, that a counter it increments is its own, that the draft it is editing will still be there when it saves — holds trivially when there is exactly one caller. The tool is not correct. It is untested. Those are different things, and the difference stays invisible until a second agent shows up.
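The read-transform-write pattern can be made safe against a second caller with a version check on write. The following is a minimal in-memory sketch of optimistic concurrency, assuming a hypothetical `VersionedStore`; a real tool would rely on the equivalent conditional-write feature of its database or API rather than a Python lock.

```python
import threading


class VersionConflict(Exception):
    """Raised when a write is based on a stale read."""


class VersionedStore:
    """Hypothetical key-value store with compare-and-set writes."""

    def __init__(self):
        self._data = {}                 # key -> (version, value)
        self._lock = threading.Lock()   # makes check-and-write atomic

    def read(self, key):
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key, value, expected_version: int) -> int:
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current_version}"
                )
            self._data[key] = (current_version + 1, value)
            return current_version + 1


store = VersionedStore()
store.write("subscription:42", "basic", expected_version=0)

# Two agents read the same record before either one writes.
version_a, _ = store.read("subscription:42")
version_b, _ = store.read("subscription:42")

store.write("subscription:42", "pro", expected_version=version_a)
try:
    store.write("subscription:42", "pro", expected_version=version_b)
    second_write_rejected = False
except VersionConflict:
    second_write_rejected = True
```

The interleaving is written out sequentially so the race is deterministic. A test suite for the tool needs at least one case shaped like this, because a single caller will never produce it.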

When Two Agents Share a Tool: Concurrency Bugs in Multi-Agent Systems

· 9 min read
Tian Pan
Software Engineer

The moment you typed "spin up another agent to handle that in parallel," you became a distributed systems engineer. You probably didn't notice. The framework made it a one-line change, the demo worked, and the latency dropped. But under the hood you just introduced two processes that read and write shared state with no coordination — and every race condition, lost update, and dirty read that has haunted databases for fifty years is now sitting in your agent stack, waiting.

The reason this bites so hard is that the failure doesn't look like a concurrency bug. It looks like one agent being wrong. The output is syntactically valid, the pipeline is green, no exception is thrown — and yet a customer got charged twice, or a file is missing half its expected content, or an agent confidently acted on a number that another agent had already overwritten. You go debug "the dumb agent" and find nothing wrong with its prompt, because the prompt was never the problem.

Halted Is Not a Status: Why Agents Need a Typed Terminal-Reason Protocol

· 10 min read
Tian Pan
Software Engineer

Open the dashboard for an agent fleet and you will see a clean number: completion rate, 94%. Below it, a list of runs, each tagged with one of two states: running or not running. The runs that are "not running" all look identical. Some of them finished the task perfectly. Some of them hit a step limit two actions short of done. Some of them caught a tool error and gave up. Some of them decided, correctly, that the task was impossible. And some of them simply lost the thread and stopped emitting tokens.

Your monitoring cannot tell these apart. It knows the process is no longer running. It does not know why, and "why" is the only thing that matters when you are deciding whether to page someone.
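A typed terminal reason can start as an enum with one member per way a run can end. The sketch below takes its reason names from the cases listed above; the type names, the `detail` field, and the choice of which reasons page someone are assumptions for illustration, not a prescribed protocol.

```python
from dataclasses import dataclass
from enum import Enum


class TerminalReason(Enum):
    COMPLETED = "completed"              # finished the task
    STEP_LIMIT = "step_limit"            # ran out of steps before done
    TOOL_ERROR = "tool_error"            # gave up after a tool failure
    TASK_IMPOSSIBLE = "task_impossible"  # correctly declined the task
    STALLED = "stalled"                  # stopped emitting without a decision


@dataclass(frozen=True)
class TerminalState:
    reason: TerminalReason
    detail: str = ""


# Assumed paging policy: only reasons that suggest something is broken.
PAGE_ON = frozenset({TerminalReason.TOOL_ERROR, TerminalReason.STALLED})


def should_page(state: TerminalState) -> bool:
    """Decide whether a finished run deserves a human's attention."""
    return state.reason in PAGE_ON


runs = [
    TerminalState(TerminalReason.COMPLETED),
    TerminalState(TerminalReason.STEP_LIMIT, "stopped 2 actions short"),
    TerminalState(TerminalReason.TOOL_ERROR, "tool returned an error"),
    TerminalState(TerminalReason.TASK_IMPOSSIBLE, "no matching record"),
    TerminalState(TerminalReason.STALLED),
]
to_page = [run for run in runs if should_page(run)]
```

Once every run ends with one of these values, "not running" stops being a single bucket and the completion rate can be broken down by why each run stopped.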