252 posts tagged with "reliability"

The Agent Capability Cliff: Why Your Model Upgrade Made the Easy 95% Perfect and the Hard 5% Your Worst Quarter

April 23, 2026 · 11 min read

Software Engineer

You shipped the new model. Aggregate eval pass rate went from 91% to 96%. Product declared it a win in the all-hands. Six weeks later, the reliability team is having their worst quarter on record — not because there are more incidents, but because every single incident is now the kind that takes three engineers and two days to resolve.

This is the agent capability cliff, and it is one of the most counterintuitive failure modes in production AI. Model upgrades do not raise all tasks uniformly. They concentrate their gains on the bulk of your traffic — the easy and medium cases where the previous model was already correct most of the time — while the long tail of genuinely hard inputs sees only marginal improvement. Your failure surface narrows, but every remaining failure is a capability-frontier case that the previous model also missed and that no cheap prompt engineering will fix.

The cliff is not a flaw in the new model. It is a mismatch between how we measure model improvement (average pass rate on a mixed-difficulty eval set) and what actually lands in on-call rotations (the residual set of the hardest traffic, now unpadded by the easier failures that used to dominate the signal).
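
To make the measurement mismatch concrete, here is a minimal sketch that reports pass rate per difficulty stratum next to the aggregate. The 95/5 split and the per-case outcomes are invented for illustration; they are chosen only so the aggregates land on the 91% and 96% quoted above. The function names are hypothetical.

```python
from collections import defaultdict

def aggregate_pass_rate(results):
    """results: list of (stratum, passed) pairs."""
    return sum(passed for _, passed in results) / len(results)

def pass_rate_by_stratum(results):
    """Return {stratum: pass rate} so hard cases are not averaged away."""
    totals, passes = defaultdict(int), defaultdict(int)
    for stratum, passed in results:
        totals[stratum] += 1
        passes[stratum] += int(passed)
    return {s: passes[s] / totals[s] for s in totals}

# Illustrative eval set: 95 easy cases and 5 hard ones (assumed numbers).
old_model = [("easy", i < 90) for i in range(95)] + [("hard", i < 1) for i in range(5)]
new_model = [("easy", True) for _ in range(95)] + [("hard", i < 1) for i in range(5)]

for name, results in (("old", old_model), ("new", new_model)):
    print(name, aggregate_pass_rate(results), pass_rate_by_stratum(results))
```

In this toy data the aggregate moves from 0.91 to 0.96 while the hard stratum stays at 0.2 for both models, which is the pattern the aggregate number hides.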

Agent Idempotency Is an Orchestration Contract, Not a Tool Property

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

The support ticket arrives at 9:41 a.m.: "I was charged three times." The trace looks clean. One user message, one planner turn, three calls to charge_card — each with a distinct tool-use ID, each returning 200 OK, each writing a different Stripe charge. The tool has an idempotency key. The backend has a dedup table. The payment processor honors Idempotency-Key. Every layer is idempotent. The customer still paid three times.

This is the shape of the bug that will land on your desk if you build agents long enough. It is not a bug in any tool. It is a bug in the contract between the agent loop and the tools, and that contract almost always lives only in a senior engineer's head.
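
One way to read "orchestration contract" is that the idempotency key should come from the run and the intended action, not from the per-call tool-use ID. The sketch below is an assumption about how that could look, not the post's implementation; `Orchestrator`, `idempotency_key`, and the fake `charge_card` are all hypothetical.

```python
import hashlib
import json

def idempotency_key(run_id: str, tool: str, args: dict) -> str:
    """Key on what the agent is trying to do, not on which attempt this is."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{run_id}|{tool}|{canonical}".encode()).hexdigest()

class Orchestrator:
    def __init__(self, run_id, tools):
        self.run_id = run_id
        self.tools = tools
        self._results = {}  # idempotency key -> first result

    def call(self, tool, args, tool_use_id):
        # tool_use_id is kept for tracing only; it is deliberately not part of the key.
        key = idempotency_key(self.run_id, tool, args)
        if key not in self._results:
            self._results[key] = self.tools[tool](idempotency_key=key, **args)
        return self._results[key]

# Fake tool standing in for a payment call (hypothetical).
charges = []
def charge_card(amount_cents, customer_id, idempotency_key):
    charges.append((customer_id, amount_cents, idempotency_key))
    return {"status": 200, "charge_number": len(charges)}

orch = Orchestrator("run-42", {"charge_card": charge_card})
for tool_use_id in ("toolu_1", "toolu_2", "toolu_3"):
    orch.call("charge_card", {"amount_cents": 1999, "customer_id": "cus_123"}, tool_use_id)
print(len(charges))
```

The trade-off in this sketch is that two genuinely separate, identical charges inside one run would be collapsed, so a real contract would need an explicit way for the planner to say "this is a new intent".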

Silent Success: When Your Agent Says Done and Nothing Actually Happened

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

The most dangerous line in an agent transcript is the confident one. "I've updated the record." "The invite is sent." "Permissions are applied." Every one of those sentences is a claim, not a fact, and when the tool call behind it was rate-limited, timed out, or returned a 500 that the summarization step over-compressed into something reassuring, the claim is all you have. Your telemetry logs the turn as successful because success is whatever the model typed at the top of its final message. The downstream write never committed. Nobody notices for three weeks.

This is the failure class that separates agents from every system that came before them. A traditional service fails with a status code. A traditional batch job fails with a stack trace. An agent fails by continuing to talk. It absorbs the error into its running narrative, rounds it off to make the story coherent, and hands you a paragraph that reads like completion. The user reads the paragraph. Your observability platform indexes the paragraph. The record in the database does not change.
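
A minimal sketch of the alternative: classify the turn from the recorded tool results and never from the final message. The `ToolResult` shape and the outcome labels are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    tool: str
    status_code: int
    timed_out: bool = False

def turn_outcome(tool_results, final_message):
    """Derive the outcome from tool results; final_message is intentionally ignored."""
    failed = [r for r in tool_results if r.timed_out or r.status_code >= 400]
    if failed:
        return "failed"
    return "succeeded" if tool_results else "no_action"

print(turn_outcome([ToolResult("update_record", 500)], "I've updated the record."))
```

With this rule, a confident paragraph over a 500 response is logged as a failure, and a turn that made no tool calls at all is not counted as a completed write.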

Your AI Product Needs an SRE Before It Needs Another Model

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The sharpest pattern I see in struggling AI teams is the gap between how sophisticated their model stack is and how primitive their operations are. A team will run three frontier models in production behind custom routing logic, a RAG pipeline with eight retrieval stages, and an agent that calls twenty tools. They will also have no on-call rotation, no SLOs, no runbooks, and a #incidents Slack channel where prompts are hotfixed live by whoever happens to be awake. The product is operating on 2026 model infrastructure and 2012 operational infrastructure, and every week the gap costs them another outage.

The instinct when this hurts is to reach for the model lever. Quality dipped? Try the new release. Latency spiked? Switch providers. Hallucinations in production? Add another guardrail prompt. None of this fixes the underlying problem, which is that nobody owns the system's reliability as a discipline. What these teams actually need — usually before they need another applied scientist — is their first SRE.

The Cascade Router Reliability Trap: When Cost Optimization Quietly Wrecks Your p95

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

The cost dashboard is a beautiful green. Spend per request is down 62% since the cascade router shipped. The CFO is happy. The platform team is celebrating. Meanwhile, your p95 latency has crept up 40%, your hardest customer just churned because "the bot got dumber on the queries that matter," and the experimentation team has spent two weeks chasing a regression that does not exist.

This is the cascade router reliability trap. It is the quiet failure mode of every "try the cheap model first, escalate if it doesn't work" architecture, and it is one of the most under-discussed second-order effects in production LLM systems. The cost wins are real, measurable, and easy to attribute. The reliability losses are diffuse, statistical, and almost impossible to attribute back to the router that caused them. So the cost wins get celebrated, the reliability losses get blamed on "the model getting worse," and the team optimizes itself into a hole.
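
The latency side of the trap follows from simple arithmetic: an escalated request pays for the cheap attempt and then for the strong model. The sketch below uses invented latencies and an assumed 10% escalation rate, so the exact numbers are illustrative only.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

CHEAP_MS, STRONG_MS = 200, 1000  # hypothetical per-model latencies

def cascade_latencies(n_requests, n_escalated):
    """Escalated requests wait for the cheap model first, then the strong one."""
    served_cheap = [CHEAP_MS] * (n_requests - n_escalated)
    escalated = [CHEAP_MS + STRONG_MS] * n_escalated
    return served_cheap + escalated

baseline = [STRONG_MS] * 100          # every request goes to the strong model
cascade = cascade_latencies(100, 10)  # assumed 10% escalation rate

print(percentile(baseline, 95), percentile(cascade, 95), percentile(cascade, 50))
```

In this toy setup the median drops from 1000 ms to 200 ms while the p95 rises from 1000 ms to 1200 ms, because the tail is made up entirely of the escalated requests.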

The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

A planner agent emits seven steps. Each looks reasonable. The orchestrator dispatches them, the first three return values, the fourth waits on the fifth, the fifth waits on the seventh, and the seventh — buried three lines deep in the planner's prose — quietly waits on the fourth. Nothing is locked. No EDEADLK ever fires. The agent burns 40,000 tokens reasoning about why the fourth step "is taking longer than expected" and ultimately gives up with a soft, plausible apology to the user.

This is the deadlock your agent can't see. It is not the textbook deadlock from operating systems class — there are no mutexes, no resource graphs the kernel can introspect, no holders or waiters anyone in your stack would recognize. The dependencies live inside English sentences that the planner produced, the cycles form in latent semantics rather than in any data structure, and the failure mode looks indistinguishable from "the model is thinking hard." Classic deadlock detection is useless here, but the cost is identical: the workflow halts, tokens evaporate, and your trace tells you nothing.
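
If the planner can be made to emit dependencies as data rather than prose, the cycle becomes checkable before dispatch. The sketch below assumes such a structured plan exists (step number mapped to the steps it waits on) and runs a standard depth-first search; extracting that structure from English is the hard part and is not shown.

```python
def find_cycle(deps):
    """deps maps each step to the steps it waits on. Returns one cycle as a list, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in deps.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:  # dep is already on the current path: cycle found
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None

# The seven-step plan from the example: 4 waits on 5, 5 on 7, 7 on 4.
plan = {1: [], 2: [], 3: [], 4: [5], 5: [7], 6: [], 7: [4]}
print(find_cycle(plan))
```

Running the check at plan-acceptance time turns a silent stall into an immediate, attributable rejection of the plan.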

Your Clock-in-Prompt Is a Correctness Boundary, Not a Log Field

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

A scheduling agent booked a customer's onboarding call for Tuesday instead of Wednesday. The investigation took two days. The prompt was fine. The model was fine. The calendar tool was fine. The bug was that the system prompt carried a current_time field stamped an hour earlier: the request had been routed through a cached prefix built just before midnight UTC. By the time the agent parsed "tomorrow at 10 AM" and called the booking tool, "tomorrow" referred to a day that was already "today" for the user in Tokyo.

The agent had no way to notice. It had nothing to notice with. LLMs do not have clocks. They have whatever string you handed them in the prompt, and they treat that string as authoritative the same way they treat the user's question as authoritative — which is to say, completely, without skepticism, without a second source to cross-check against.

Most teams know this in the abstract and still treat the timestamp they inject like a log field: something nice to have, rendered into the system prompt for context, nobody's explicit responsibility, nobody's correctness boundary. That framing is wrong. The timestamp is a correctness boundary. Every agent behavior that depends on "now" — scheduling, expiration, retry windows, "recently," "tomorrow," "in five minutes," freshness checks on retrieved documents — runs on whatever your time plumbing produced, and inherits every bug that plumbing has.
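
Treating the timestamp as a boundary could mean two checks at request time: reject a prompt clock that has drifted from the real one, and resolve relative dates in the user's time zone. The sketch below is one assumed shape for that; the tolerance, the function names, and the dates are illustrative, and Tokyo is modeled as a fixed UTC+9 offset.

```python
from datetime import datetime, timedelta, timezone

MAX_CLOCK_SKEW = timedelta(seconds=30)  # hypothetical tolerance

def check_prompt_clock(prompt_time, now):
    """Refuse to run with a prompt timestamp that is too far from the real clock."""
    skew = abs(now - prompt_time)
    if skew > MAX_CLOCK_SKEW:
        raise ValueError(f"prompt clock is off by {skew}; rebuild the prompt")
    return prompt_time

def user_local_tomorrow(now_utc, user_tz):
    """Resolve 'tomorrow' on the user's calendar, not the server's."""
    return (now_utc.astimezone(user_tz) + timedelta(days=1)).date()

JST = timezone(timedelta(hours=9))  # Tokyo, fixed offset (no DST)
stamped = datetime(2026, 4, 20, 23, 30, tzinfo=timezone.utc)  # baked into a cached prefix
actual = datetime(2026, 4, 21, 0, 30, tzinfo=timezone.utc)    # real request time

stale_utc_tomorrow = (stamped + timedelta(days=1)).date()
print(stale_utc_tomorrow, user_local_tomorrow(actual, JST))
```

Here the stale UTC stamp resolves "tomorrow" to 2026-04-21, which is already the current date in Tokyo, while the request-time calculation gives 2026-04-22; `check_prompt_clock(stamped, actual)` would have raised before the booking tool was called.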

"Done!" Is Not a Return Code: Why Agent Completion Needs a Structured Signal

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

An agent ends its turn with "All done — let me know if you want any changes!" and your orchestrator has to decide whether to mark the ticket resolved, kick off the next handoff, or retry. That sentence is not a return code. It is a polite closing line trained to sound reassuring at the end of a chat, and every line of automation downstream of it inherits the ambiguity. The teams that treat this as a parsing problem write regexes that catch \b(done|complete|finished)\b and call it a day. The teams whose agents run in production eventually learn that completion is an event, not a mood.

The failure mode is bimodal and boring. Either the agent announces done when it isn't — premature termination — and the orchestrator happily advances the workflow on a half-finished artifact. Or the agent is actually done, but phrases it in a way that doesn't match the detector ("I went ahead and landed the change, though the test for the edge case is still flaky"), and the orchestrator spins up a retry that re-does the work, duplicates the side effect, and sometimes contradicts the successful first pass. Both modes degrade silently. Neither shows up in a dashboard until someone reads a trace and notices that the agent said "I think that covers it" and the billing system treated that as a commit.

The fix is not smarter parsing. It is giving the agent a structured way to terminate — a done-tool with an enumerated status, a reason code, and a handle your pipeline can route on — and changing the orchestrator to wait for that event instead of listening to the chat stream.
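
A minimal sketch of what such a done-tool payload and its routing might look like. The status values, field names, and routing targets are assumptions for illustration, not a schema taken from the post.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DoneStatus(Enum):
    SUCCEEDED = "succeeded"
    PARTIAL = "partial"
    BLOCKED = "blocked"
    FAILED = "failed"

@dataclass(frozen=True)
class DoneSignal:
    status: DoneStatus
    reason_code: str
    artifact_handle: Optional[str] = None  # e.g. a ticket or commit reference

def route(signal: Optional[DoneSignal]) -> str:
    """Decide the next step from the structured event only; chat text is never parsed."""
    if signal is None:
        return "keep_waiting"  # closing pleasantries are not a completion event
    if signal.status is DoneStatus.SUCCEEDED:
        return "advance_workflow"
    if signal.status is DoneStatus.PARTIAL:
        return "human_review"
    if signal.status is DoneStatus.BLOCKED:
        return "escalate"
    return "retry"

print(route(None))
print(route(DoneSignal(DoneStatus.PARTIAL, "flaky_edge_case_test", "change-123")))
```

The "landed the change, but the test is still flaky" case from above maps to `PARTIAL` with a reason code, so the orchestrator neither advances blindly nor retries work that already happened.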

Durable Agents: Why Async Queues Break for Long-Running AI Workflows

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

An agent that works 95% of the time per step is not a 95% reliable agent. Chain twenty steps together and the end-to-end completion rate drops to 36%. This is the arithmetic most teams discover only after their agent hits production, and it is the reason so many "working" prototypes stall the moment real traffic arrives. The fix is not better prompts or bigger models. It is a boring piece of distributed systems infrastructure most AI teams try to avoid until the third outage forces their hand.
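
The arithmetic is easy to reproduce, assuming every step succeeds independently with the same probability. The second function inverts it to show how reliable each step would need to be for a given end-to-end target; both names are hypothetical.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps

def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to reach an end-to-end target."""
    return target ** (1 / steps)

print(round(end_to_end_success(0.95, 20), 4))  # 0.3585, the 36% above
print(round(required_per_step(0.95, 20), 4))   # 0.9974
```

Read the other way, a twenty-step agent needs roughly 99.7% per-step reliability to finish 95% of the time, which is why recovering from a failed step matters more than shaving its failure rate.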

The infrastructure is durable execution: the discipline of making a multi-step workflow survive crashes, restarts, and partial failures without losing its place. It is not a new idea. Temporal, Restate, DBOS, Inngest, and Azure Durable Task have been selling it for years. What is new in 2026 is that every serious agent framework has quietly admitted durable execution is table stakes: LangGraph now ships with a PostgresSaver checkpointer, the OpenAI Agents SDK exposes a resume primitive, and Anthropic's Managed Agents runs on an internal durable substrate. If your agent architecture still rests on a Celery queue and optimism, you are solving in 2026 a problem the rest of the industry stopped pretending to ignore in 2024.

This post is about the architectural seam between a stateless LLM and the stateful workflow engine that has to wrap it. The seam is where reliability lives, and it is where most teams are currently writing bugs.

The Hallucinated Success Problem: When Your Agent Says Done and Means Nothing

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

The most dangerous failure in agent systems is not the loud one. It is the agent that confidently declares "Task complete" and returns a polished summary of work it never did. The file was never written. The webhook never fired. The database row is still the way it was an hour ago. But the trace is green, the completion counter ticks up, and the dashboard tells leadership the new feature is working.

This is the hallucinated success problem, and it is the single hardest bug class to catch in production because it evades every cheap signal you have. The agent did not crash. It did not time out. It did not return an error. It narrated a plausible, coherent, and completely fabricated account of a successful execution. Your observability stack was built to catch noisy failures. Silent success looks identical to real success until a user notices the output is wrong.
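
One signal that does not depend on the agent's narration is an independent read-back of the state it claims to have changed. The sketch below assumes the orchestrator can supply such a probe per task; `verified_complete` and the in-memory `db` are hypothetical stand-ins.

```python
def verified_complete(claimed_done, postcondition):
    """Accept a completion claim only if an independent read-back confirms it."""
    if not claimed_done:
        return False
    try:
        return bool(postcondition())
    except Exception:
        return False  # a probe that cannot run is not evidence of success

# Stand-in for real state: the row the agent says it updated.
db = {"order-7": "pending"}

# The agent claims success, but the write never happened.
print(verified_complete(True, lambda: db["order-7"] == "shipped"))

db["order-7"] = "shipped"  # now the write really lands
print(verified_complete(True, lambda: db["order-7"] == "shipped"))
```

The probe costs one extra read per task, and it is the only part of this check that the model's output cannot influence.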

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.
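
The backlog dynamic can be sketched with a few lines of arithmetic. The arrival and review rates below are invented, and the model is deliberately crude (constant rates, first-in first-out, no fatigue), so treat it as an illustration rather than a staffing tool.

```python
def simulate_backlog(arrivals_per_hour, reviews_per_hour, hours):
    """Track pending approvals hour by hour under constant rates."""
    backlog = 0.0
    history = []
    for _ in range(hours):
        backlog = max(0.0, backlog + arrivals_per_hour - reviews_per_hour)
        history.append(backlog)
    return history

def wait_hours(backlog, reviews_per_hour):
    """How long a newly queued item waits behind the current backlog."""
    return backlog / reviews_per_hour

# Assumed rates: 50 approval requests per hour, reviewers clear 40 per hour.
history = simulate_backlog(arrivals_per_hour=50, reviews_per_hour=40, hours=8)
print(history[-1], wait_hours(history[-1], 40))
```

With these assumed rates, an eight-hour shift ends with 80 items pending and a two-hour wait for anything new, so a decision that was urgent when queued may be stale by the time it is approved.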

Multi-Model Reliability Is Not 2x: The Non-Linear Cost of a Second LLM Provider

April 23, 2026 · 13 min read

Tian Pan

Software Engineer

The naive calculation goes like this. Our primary provider has 99.3% uptime. Add a second, independent provider with similar uptime, and the chance of simultaneous failure drops to roughly 0.005%. Multiply cost by two, divide risk by roughly 140. Engineering leadership signs off on the 2x budget and the on-call rotation stops paging on provider outages. The spreadsheet says this is the best reliability investment on the roadmap.

Six months later the spreadsheet is wrong. The eval suite takes 3x as long to run, prompt changes need two PRs, the weekly regression report has two columns that disagree with each other, and nobody can remember which provider the staging fallback is currently routing to. The 2x budget is closer to 4–5x once the team tallies the human hours spent keeping both paths calibrated. The second provider is still technically serving traffic, but half the features have been quietly pinned to one side because keeping both in sync stopped being worth it.

This is the multi-model cost trap. The reliability math is correct; the operational math is the part teams get wrong. What follows is the cost decomposition of going multi-provider, the single-provider-with-degraded-mode option most teams should try first, and the narrow set of criteria that actually justify the nonlinear complexity.
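
For reference, the reliability side of the spreadsheet works out as follows, assuming the two providers fail independently and have the same 99.3% uptime. Correlated outages (shared cloud regions, shared upstream dependencies) would make the real figure worse.

```python
single_down = 1 - 0.993                 # about 0.007, i.e. 0.7% downtime
both_down = single_down ** 2            # about 0.000049, i.e. roughly 0.005%
improvement = single_down / both_down   # about 143x lower outage probability

print(f"{both_down:.4%}", round(improvement))
```

The operational multiplier discussed above has no equivalent closed form, which is part of why it gets left out of the spreadsheet.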
