
137 posts tagged with "reliability"


Replan, Don't Retry: Why Most Agent Errors Aren't Transient

· 10 min read
Tian Pan
Software Engineer

A calendar-write returns 409 Conflict. The framework's default error handler kicks in: backoff 200ms, retry. Same conflict. Backoff 400ms, retry. Same conflict. Backoff 800ms, retry. By the time the agent gives up and tells the user "I couldn't book the meeting," it has burned three seconds of latency budget proving something the very first response already told it: the slot is taken. The world has not changed. It will not change in 800 milliseconds. Retrying was never going to work, because nothing about this error was transient.

This is the most common error-handling bug in agent systems, and it is hiding in plain sight inside almost every framework that ships today. The retry-with-exponential-backoff pattern was imported wholesale from stateless HTTP clients — where it is exactly correct — into stateful planning loops where it is actively wrong. The right default for a tool error in an agent is not retry. It is replan.
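A minimal sketch of what that default could look like. `call_tool` and `replan_with_observation` are hypothetical stand-ins for a framework's tool executor and planner entry point: transient status codes get the backoff loop, everything else goes straight back to the planner as an observation.

```python
# Sketch only: classify the tool error before deciding what to do with it.
# `call_tool` and `replan_with_observation` are hypothetical stand-ins for
# your framework's tool executor and planner entry point.
import time

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}   # worth retrying: the world may change
SEMANTIC_STATUSES = {400, 404, 409, 422}          # not worth retrying: the world won't

def execute_step(step, max_retries=3):
    delay = 0.2
    for attempt in range(max_retries):
        result = call_tool(step)
        if result.ok:
            return result
        if result.status in TRANSIENT_STATUSES:
            time.sleep(delay)          # backoff only helps when the failure is transient
            delay *= 2
            continue
        # Semantic failure: hand the error body back to the planner as an
        # observation so it can pick a different slot, tool, or plan.
        return replan_with_observation(step, result.body)
    return replan_with_observation(step, "transient failures exhausted retries")
```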

Structured Concurrency for Parallel Tool Fanout: Who Owns Partial Failure?

· 11 min read
Tian Pan
Software Engineer

The moment your agent fans out five parallel tool calls — search across three indexes, query two databases, hit one external API — you have crossed an invisible line. You are no longer writing prompt-and-response code. You are writing a concurrent program. Most agent frameworks pretend you are not, and the bill arrives at 2 AM.

The pretense is comfortable. The planner emits a list of tool calls, the runtime fires them off, the runtime collects whatever comes back, the planner consumes the aggregate. From a thousand feet up it looks like a fan-out / fan-in pipeline, and most teams treat it that way until production teaches them otherwise. The problem is that twenty years of concurrent-programming research — partial-failure semantics, structured cancellation, backpressure, deterministic error attribution — already solved the failure modes you are about to rediscover. Your agent framework, by default, did not import any of it.
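One way to import a slice of that work, sketched with Python 3.11+ asyncio and a hypothetical `run_tool` executor: each call gets its own deadline, every failure stays attributed to the call that produced it, and the planner, not the runtime, decides what to do with a partial result.

```python
# A minimal sketch of structured fan-out with explicit partial-failure handling.
# `run_tool` is a hypothetical async tool executor; requires Python 3.11+.
import asyncio

async def fan_out(calls, per_call_timeout=5.0):
    async def bounded(call):
        # Each call gets its own deadline so one slow backend cannot
        # hold the whole batch hostage.
        async with asyncio.timeout(per_call_timeout):
            return await run_tool(call)

    # return_exceptions=True keeps the successes and attributes each failure
    # to the call that produced it, instead of losing both to a single raise.
    results = await asyncio.gather(*(bounded(c) for c in calls),
                                   return_exceptions=True)

    ok, failed = [], []
    for call, result in zip(calls, results):
        (failed if isinstance(result, Exception) else ok).append((call, result))
    # The planner, not the runtime, decides whether 3-of-5 is good enough.
    return ok, failed
```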

Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

· 11 min read
Tian Pan
Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, four hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98% — a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug — it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.
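The back-of-the-envelope version, under the simplifying assumptions that calls fail independently and every call is essential; the 12 below is just an illustrative pick from that 8-to-20 band:

```python
# Illustrative arithmetic: per-call SLA vs. user-perceived task success,
# assuming independent failures and that every call is essential.
per_call_availability = 0.999        # what the SLA promises, per request
calls_per_task = 12                  # a mid-range agent task: plan + tools + retries

task_success = per_call_availability ** calls_per_task
print(f"{task_success:.3%}")         # ~98.8% -- roughly the number on the dashboard
```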

The 95% Reliability Illusion: Why Your 10-Step Agent Fails 40% of the Time

· 12 min read
Tian Pan
Software Engineer

There is a moment in almost every agent project review that ends the conversation. Someone draws a small chart: end-to-end task success rate on the y-axis, number of tool-using steps on the x-axis. The line slopes down hard. The room goes quiet because everyone in it had been arguing about prompts, models, and retrieval strategies — and the chart is saying that none of those debates matter as much as the simple fact that the chain has too many links.

The math is one of the oldest results in reliability engineering, ported into a domain that pretends it is new. If every step in a pipeline succeeds independently with probability p, then n steps in series succeed with probability p to the n. Plug in numbers that sound healthy on a status report: 95% per-step reliability, ten steps, end-to-end success rate of 60%. Twenty steps gets you 36%. Thirty steps gets you 21%. The agent that "works 95% of the time" is the same agent that fails on a third of real user requests, because real user requests are not single steps.
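In symbols, the numbers above are just the series-reliability law:

$$P_{\text{end-to-end}} = p^{\,n}, \qquad 0.95^{10} \approx 0.60,\quad 0.95^{20} \approx 0.36,\quad 0.95^{30} \approx 0.21.$$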

Persona Drift: When Your Agent Forgets Who It's Supposed to Be

· 11 min read
Tian Pan
Software Engineer

The system prompt says "you are a financial analyst — be conservative, never give specific buy/sell advice, always disclose uncertainty." For the first twenty turns, the agent behaves like a financial analyst. By turn fifty, it is recommending specific stocks, mirroring the user's casual tone, and hedging less than it did in turn three. Nobody changed the system prompt. Nobody injected anything malicious. The persona simply eroded under the weight of the conversation, the way a riverbank does when nothing crosses the threshold of "attack" but the water never stops moving.

This is persona drift, and it is the regression your eval suite is not catching. Capability evals measure whether the model can do the task. Identity evals — whether the model is still doing the task the way the system prompt said to do it — barely exist outside of research papers. The result is a class of production failures that look correct turn-by-turn and look wrong only when you read the transcript end to end.

The Agent Capability Cliff: Why Your Model Upgrade Made the Easy 95% Perfect and the Hard 5% Your Worst Quarter

· 11 min read
Tian Pan
Software Engineer

You shipped the new model. Aggregate eval pass rate went from 91% to 96%. Product declared it a win in the all-hands. Six weeks later, the reliability team is having their worst quarter on record — not because there are more incidents, but because every single incident is now the kind that takes three engineers and two days to resolve.

This is the agent capability cliff, and it is one of the most counterintuitive failure modes in production AI. Model upgrades do not raise all tasks uniformly. They concentrate their gains on the bulk of your traffic — the easy and medium cases where the previous model was already correct most of the time — while the long tail of genuinely hard inputs sees only marginal improvement. Your failure surface narrows, but every remaining failure is a capability-frontier case that the previous model also missed and that no cheap prompt engineering will fix.

The cliff is not a flaw in the new model. It is a mismatch between how we measure model improvement (average pass rate on a mixed-difficulty eval set) and what actually lands in on-call rotations (the residual set of the hardest traffic, now unpadded by the easier failures that used to dominate the signal).
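Illustrative arithmetic only (the traffic mix and per-slice pass rates below are assumptions), but it shows how a five-point aggregate jump can coexist with a nearly flat hard tail:

```python
# How a 91% -> 96% aggregate can hide a nearly flat hard slice.
# Traffic mix and slice pass rates are assumed for illustration.
easy_share, hard_share = 0.90, 0.10

old_aggregate = easy_share * 0.950 + hard_share * 0.55   # = 0.910
new_aggregate = easy_share * 0.995 + hard_share * 0.61   # ~= 0.9565

# The aggregate jumped five points, but the hard slice barely moved --
# and the hard slice is now the only thing left in the incident queue.
```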

Agent Idempotency Is an Orchestration Contract, Not a Tool Property

· 10 min read
Tian Pan
Software Engineer

The support ticket arrives at 9:41 a.m.: "I was charged three times." The trace looks clean. One user message, one planner turn, three calls to charge_card — each with a distinct tool-use ID, each returning 200 OK, each writing a different Stripe charge. The tool has an idempotency key. The backend has a dedup table. The payment processor honors Idempotency-Key. Every layer is idempotent. The customer still paid three times.

This is the shape of the bug that will land on your desk if you build agents long enough. It is not a bug in any tool. It is a bug in the contract between the agent loop and the tools, and that contract almost always lives only in a senior engineer's head.
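A minimal sketch of moving that contract into the orchestrator, with `stripe_charge` standing in for the real tool: the idempotency key is derived from the user's request and the logical plan step, never from the per-attempt tool-use ID the model generates, because those are distinct on every emission, which is exactly how three charges slip through every "idempotent" layer.

```python
# Sketch only: put the idempotency contract in the orchestration layer.
# `stripe_charge` is a hypothetical stand-in for the real payment tool.
import hashlib

def orchestration_key(conversation_id: str, step_name: str, amount_cents: int) -> str:
    # Key on what the user asked for, not on the model's per-attempt tool-use ID.
    raw = f"{conversation_id}:{step_name}:{amount_cents}"
    return hashlib.sha256(raw.encode()).hexdigest()

_executed: dict[str, object] = {}    # in production: a durable store, not a dict

def charge_card(conversation_id, step_name, amount_cents, **args):
    key = orchestration_key(conversation_id, step_name, amount_cents)
    if key in _executed:
        return _executed[key]        # the planner asked again; the world already changed once
    result = stripe_charge(idempotency_key=key, amount=amount_cents, **args)
    _executed[key] = result
    return result
```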

Silent Success: When Your Agent Says Done and Nothing Actually Happened

· 9 min read
Tian Pan
Software Engineer

The most dangerous line in an agent transcript is the confident one. "I've updated the record." "The invite is sent." "Permissions are applied." Every one of those sentences is a claim, not a fact, and when the tool call behind it got rate-limited, timed out, or returned a 500 that the summarization step over-compressed into something reassuring, the claim is all you have. Your telemetry logs the turn as successful because success is whatever the model typed at the top of its final message. The downstream write never committed. Nobody notices for three weeks.

This is the failure class that separates agents from every system that came before them. A traditional service fails with a status code. A traditional batch job fails with a stack trace. An agent fails by continuing to talk. It absorbs the error into its running narrative, rounds it off to make the story coherent, and hands you a paragraph that reads like completion. The user reads the paragraph. Your observability platform indexes the paragraph. The record in the database does not change.
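A minimal sketch of grounding the success signal in tool outcomes rather than the model's prose; `turn.tool_calls`, `call.is_write`, and `read_back` are hypothetical names for whatever your trace schema and verification layer actually expose:

```python
# Sketch only: the turn is successful when the writes committed,
# not when the final message says so. Names are illustrative.
def turn_succeeded(turn) -> bool:
    # 1. Every write the plan required must have actually returned success.
    for call in turn.tool_calls:
        if call.is_write and not (200 <= call.status < 300):
            return False
    # 2. Optionally, verify the effect with a read-after-write check instead of
    #    trusting the 200 alone (caches and async backends lie too).
    for call in turn.tool_calls:
        if call.is_write and not read_back(call).matches(call.intent):
            return False
    return True   # only now does "I've updated the record" count as success
```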

Your AI Product Needs an SRE Before It Needs Another Model

· 9 min read
Tian Pan
Software Engineer

The sharpest pattern I see in struggling AI teams is the gap between how sophisticated their model stack is and how primitive their operations are. A team will run three frontier models in production behind custom routing logic, a RAG pipeline with eight retrieval stages, and an agent that calls twenty tools. They will also have no on-call rotation, no SLOs, no runbooks, and a #incidents Slack channel where prompts are hotfixed live by whoever happens to be awake. The product is operating on 2026 model infrastructure and 2012 operational infrastructure, and every week the gap costs them another outage.

The instinct when this hurts is to reach for the model lever. Quality dipped? Try the new release. Latency spiked? Switch providers. Hallucinations in production? Add another guardrail prompt. None of this fixes the underlying problem, which is that nobody owns the system's reliability as a discipline. What these teams actually need — usually before they need another applied scientist — is their first SRE.

The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95

· 10 min read
Tian Pan
Software Engineer

The cost dashboard is a beautiful green. Spend per request is down 62% since the cascade router shipped. The CFO is happy. The platform team is celebrating. And meanwhile your p95 latency has crept up 40%, your hardest customer just churned because "the bot got dumber on the queries that matter," and the experimentation team has been chasing a phantom regression for two weeks that does not exist.

This is the cascade router reliability trap. It is the quiet failure mode of every "try the cheap model first, escalate if it doesn't work" architecture, and it is one of the most under-discussed second-order effects in production LLM systems. The cost wins are real, measurable, and easy to attribute. The reliability losses are diffuse, statistical, and almost impossible to attribute back to the router that caused them. So the cost wins get celebrated, the reliability losses get blamed on "the model getting worse," and the team optimizes itself into a hole.

The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans

· 11 min read
Tian Pan
Software Engineer

A planner agent emits seven steps. Each looks reasonable. The orchestrator dispatches them, the first three return values, the fourth waits on the fifth, the fifth waits on the seventh, and the seventh — buried three lines deep in the planner's prose — quietly waits on the fourth. Nothing is locked. No EDEADLK ever fires. The agent burns 40,000 tokens reasoning about why the fourth step "is taking longer than expected" and ultimately gives up with a soft, plausible apology to the user.

This is the deadlock your agent can't see. It is not the textbook deadlock from operating systems class — there are no mutexes, no resource graphs the kernel can introspect, no holders or waiters anyone in your stack would recognize. The dependencies live inside English sentences that the planner produced, the cycles form in latent semantics rather than in any data structure, and the failure mode looks indistinguishable from "the model is thinking hard." Classic deadlock detection is useless here, but the cost is identical: the workflow halts, tokens evaporate, and your trace tells you nothing.
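One cheap defense, assuming you can get the planner to emit (or you can extract) explicit depends_on edges per step: run Kahn's algorithm over the plan before dispatching anything, and treat a leftover cycle as a replanning signal rather than something to wait out.

```python
# Sketch only: detect the cycle before dispatch. Assumes the plan has been
# normalized into step_id -> set of step_ids it depends on, and that every
# dependency names a step in the plan.
from collections import deque

def find_cycle_or_order(steps: dict[str, set[str]]) -> list[str]:
    """Kahn's algorithm: if we cannot consume every step, the leftovers contain a cycle."""
    indegree = {s: len(deps) for s, deps in steps.items()}
    dependents = {s: [] for s in steps}
    for s, deps in steps.items():
        for d in deps:
            dependents[d].append(s)

    queue = deque(s for s, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in dependents[s]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)

    if len(order) < len(steps):
        stuck = [s for s in steps if s not in order]
        raise ValueError(f"cyclic dependencies among steps: {stuck}")  # replan, don't wait
    return order
```

Run against the seven-step plan above, the fourth, fifth, and seventh steps come back as the leftover set before a single token is spent waiting on them.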

Your Clock-in-Prompt Is a Correctness Boundary, Not a Log Field

· 10 min read
Tian Pan
Software Engineer

A scheduling agent booked a customer's onboarding call for Tuesday instead of Wednesday. The investigation took two days. The prompt was fine. The model was fine. The calendar tool was fine. The bug was that the system prompt carried a current_time field stamped an hour earlier, because the request had routed through a cached prefix built just before midnight UTC. By the time the agent parsed "tomorrow at 10 AM" and called the booking tool, "tomorrow" referred to a day that was already "today" for the user in Tokyo.

The agent had no way to notice. It had nothing to notice with. LLMs do not have clocks. They have whatever string you handed them in the prompt, and they treat that string as authoritative the same way they treat the user's question as authoritative — which is to say, completely, without skepticism, without a second source to cross-check against.

Most teams know this in the abstract and still treat the timestamp they inject like a log field: something nice to have, rendered into the system prompt for context, nobody's explicit responsibility, nobody's correctness boundary. That framing is wrong. The timestamp is a correctness boundary. Every agent behavior that depends on "now" — scheduling, expiration, retry windows, "recently," "tomorrow," "in five minutes," freshness checks on retrieved documents — runs on whatever your time plumbing produced, and inherits every bug that plumbing has.
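A minimal sketch of treating that boundary explicitly, with illustrative names: stamp the clock after any cached prefix is assembled, keep the machine-readable value, and re-check it at tool-call time so a stale "now" triggers a prompt rebuild instead of a booking.

```python
# Sketch only: the injected clock is a checked input, not a log field.
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

MAX_CLOCK_SKEW = timedelta(minutes=5)

def stamped_now(user_tz: str) -> tuple[str, datetime]:
    now = datetime.now(timezone.utc)
    local = now.astimezone(ZoneInfo(user_tz))
    # Render the clock after any cached prefix is assembled, in the user's zone,
    # and keep the machine-readable UTC value so tools can re-check it later.
    return local.isoformat(timespec="seconds"), now

def guard_tool_call(prompt_clock: datetime) -> None:
    drift = datetime.now(timezone.utc) - prompt_clock
    if drift > MAX_CLOCK_SKEW:
        # The "now" the model reasoned with is stale; rebuild the prompt with a
        # fresh clock instead of booking "tomorrow" relative to yesterday.
        raise RuntimeError(f"prompt clock is {drift} old; rebuild the prompt")
```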