6 posts tagged with "agent-reliability"

The Demo Worked Because You Were Watching: Session Length Is the Eval Dimension Your Suite Forgot

May 22, 2026 · 10 min read

Tian Pan

Software Engineer

The reliability number in your launch deck came from sessions that looked nothing like the ones your users actually run. The demo was five turns: open, ask, observe a tidy answer, refine once, conclude on a high note. The session your power user ran yesterday was thirty-one turns long, included two tool failures the agent papered over with optimism, and ended when the user gave up and opened a support ticket. Both sessions came out of the same model. The first one shipped a press release. The second one was filed under "edge case."

Session length is a dimension of evaluation, and demo culture systematically underweights it. We measure per-turn accuracy because per-turn accuracy is what fits on a slide, and then we are surprised when per-session success falls off a cliff that we never put on any chart. The cliff is not random and it is not a tail event — it is the predictable consequence of compounding error, attention drift, and committed assumptions that the model will not revisit. The question every team should be asking is not "how good is the model" but "how good is the model at turn twenty-eight, given everything we said at turns one through twenty-seven."
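To make the compounding concrete, here is a minimal sketch of how per-turn accuracy turns into per-session success. It assumes every turn succeeds independently with the same probability, which real sessions will not satisfy exactly, and the 0.97 figure is purely illustrative rather than a measurement from any system.

```python
def session_success_rate(per_turn_accuracy: float, turns: int) -> float:
    """Probability that every turn in a session succeeds.

    Assumes turns are independent and equally reliable, which is a
    simplification: real errors tend to cluster and feed on each other.
    """
    if not 0.0 <= per_turn_accuracy <= 1.0:
        raise ValueError("per_turn_accuracy must be between 0 and 1")
    if turns < 0:
        raise ValueError("turns must be non-negative")
    return per_turn_accuracy ** turns


if __name__ == "__main__":
    # Illustrative per-turn accuracy; compare a demo-length session
    # with a power-user-length session.
    for turns in (5, 31):
        print(turns, round(session_success_rate(0.97, turns), 3))
```

Under these assumptions the same per-turn number gives a comfortable result at five turns and a much lower one at thirty-one, which is why a chart of per-turn accuracy alone hides the cliff.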

Stale Tool Descriptions Are Your Agent's Biggest Silent Failure

May 4, 2026 · 9 min read

Tian Pan

Software Engineer

You ship a tool that lets your agent fetch user profiles. The description reads: "Retrieves user information by user ID." Six weeks later, the backend team renames user_id to customer_uuid and adds a required tenant_id field. Nobody updates the tool schema. Your agent keeps calling the old signature, gets back a 400, treats the error response as "no user found," and helpfully creates a duplicate record.

No error in the logs. No alert fired. The agent was confident the whole time.

This is the tool documentation problem: schema drift that turns stale descriptions into silent failure vectors. It is probably the most underappreciated reliability hazard in production AI systems today, and it gets worse the longer your agent lives.
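One way to catch this kind of drift early is to diff the parameters a tool description declares against the parameters the backend currently expects, and to refuse to read a client error as an empty result. The sketch below is hypothetical: `find_schema_drift`, `interpret_lookup`, and `ToolCallError` are illustrative names, and it assumes you can obtain the backend's required and accepted parameter names from somewhere such as an API spec.

```python
from typing import Dict, Optional, Set


class ToolCallError(RuntimeError):
    """Raised when a tool call fails in a way the agent must not ignore."""


def find_schema_drift(
    tool_params: Set[str],
    backend_required: Set[str],
    backend_accepted: Set[str],
) -> Dict[str, Set[str]]:
    """Compare the tool's declared parameters with the backend's."""
    return {
        # Required by the backend but absent from the tool description.
        "missing_required": backend_required - tool_params,
        # Declared by the tool but no longer accepted by the backend.
        "unknown_to_backend": tool_params - backend_accepted,
    }


def interpret_lookup(status_code: int, body: Optional[dict]) -> Optional[dict]:
    """Only a 404 means 'no user found'; any other 4xx/5xx is an error."""
    if status_code == 404:
        return None
    if status_code >= 400:
        raise ToolCallError(f"lookup failed with HTTP {status_code}")
    return body


if __name__ == "__main__":
    drift = find_schema_drift(
        tool_params={"user_id"},
        backend_required={"customer_uuid", "tenant_id"},
        backend_accepted={"customer_uuid", "tenant_id"},
    )
    print(drift)
```

Run as a CI check, a non-empty drift report would fail the build before the agent ever sees the stale description. The status-code guard covers the case where the check is missed anyway.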

MTBF Is Dead When Your Agent Self-Heals

May 2, 2026 · 10 min read

Tian Pan

Software Engineer

A team I talked to last quarter had every dashboard green. Tool error rate flat at 0.3%. End-to-end success at 98%. SLO budget barely touched. They were also burning four times their projected token spend, and nobody could explain it. When they finally instrumented retry depth per trace, the picture inverted: the median successful request was making 2.7 tool calls instead of the 1.0 the architecture diagram promised. The agent was not failing. It was failing and recovering, over and over, inside the same span, and the success rate metric had no way to tell them.

This is the part of agentic reliability that the old reliability vocabulary cannot reach. MTBF (mean time between failures) assumes failures are discrete, observable events with measurable gaps between them. You measure the gap, you compute the mean, you alert when the gap shrinks. It worked for hard drives, networks, and deterministic services. It does not work for systems that retry, reroute, fall back, and recover silently inside a single user-visible operation.
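Instrumenting retry depth can be as simple as counting tool-call spans per trace and watching the median. The sketch below assumes spans arrive as `(trace_id, kind)` pairs with a `"tool_call"` kind. That shape is a stand-in for illustration, not the schema of any particular tracing system.

```python
from collections import Counter
from statistics import median
from typing import Dict, Iterable, Tuple


def tool_calls_per_trace(spans: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count tool-call spans for each trace.

    Traces that made no tool calls at all do not appear in the result.
    """
    counts = Counter(
        trace_id for trace_id, kind in spans if kind == "tool_call"
    )
    return dict(counts)


def median_tool_calls(spans: Iterable[Tuple[str, str]]) -> float:
    """Median tool calls per trace, a rough proxy for hidden retry depth."""
    counts = tool_calls_per_trace(spans)
    if not counts:
        raise ValueError("no tool-call spans found")
    return median(counts.values())


if __name__ == "__main__":
    # Hypothetical spans: three traces, one of which retried twice.
    spans = [
        ("t1", "tool_call"), ("t1", "tool_call"), ("t1", "tool_call"),
        ("t2", "tool_call"),
        ("t3", "tool_call"), ("t3", "tool_call"), ("t3", "llm"),
    ]
    print(tool_calls_per_trace(spans))
    print(median_tool_calls(spans))
```

Comparing this median against the number of calls the architecture is supposed to need gives a signal that stays visible even when every request ultimately succeeds.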

Production AI Incident Response: When Your Agent Goes Wrong at 3am

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

A runaway loop in a multi-agent cost-tracking system at a fintech startup ran for eleven days before anyone noticed. The cause: Agent A asked Agent B for clarification. Agent B asked Agent A for help interpreting the response. Neither had logic to break the loop. The $127 weekly bill became $47,000 before a human looked at the invoice.

No errors were thrown. No alarms fired. Latency was normal. The system was running exactly as designed—just running forever.

This is what AI incidents actually look like. They're not stack traces and 500 errors. They're silent behavioral failures, runaway loops, and plausible wrong answers delivered at production scale with full confidence. Your existing incident runbook almost certainly doesn't cover any of them.
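A loop like the one between Agent A and Agent B can be bounded with a hop budget on each conversation. This is a minimal sketch under assumed names (`Conversation`, `record_handoff`, `HopBudgetExceeded`), not a description of how the system in the story was built or fixed.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


class HopBudgetExceeded(RuntimeError):
    """Raised when agents hand off to each other more often than allowed."""


@dataclass
class Conversation:
    max_hops: int
    handoffs: List[Tuple[str, str]] = field(default_factory=list)

    def record_handoff(self, sender: str, receiver: str) -> None:
        """Record one agent-to-agent handoff and enforce the budget."""
        self.handoffs.append((sender, receiver))
        if len(self.handoffs) > self.max_hops:
            raise HopBudgetExceeded(
                f"{len(self.handoffs)} handoffs exceeds limit of {self.max_hops}"
            )


if __name__ == "__main__":
    convo = Conversation(max_hops=3)
    try:
        # Two agents bouncing a question back and forth.
        while True:
            convo.record_handoff("agent_a", "agent_b")
            convo.record_handoff("agent_b", "agent_a")
    except HopBudgetExceeded as exc:
        print(f"escalating to a human: {exc}")
```

The budget turns a silent, unbounded loop into an explicit exception, which is something an existing alerting pipeline already knows how to page on.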

The Retry Storm Problem in Agentic Systems: Why Every Failed Tool Call Burns Your Token Budget

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

Every backend engineer knows that retries are essential. Every distributed systems engineer knows that retries are dangerous. When you put an LLM agent in charge of retrying tool calls, you get both problems at once — plus a new one: every retry burns tokens. A single flaky API endpoint can turn a $0.01 agent task into a $2 meltdown in under a minute.

The retry storm problem isn't new. Distributed systems have dealt with thundering herds and cascading failures for decades. But agentic systems amplify the problem in ways that microservice patterns don't fully address, because the retry logic lives inside a probabilistic reasoning engine that doesn't understand backpressure.
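Since the model will not apply backpressure on its own, one option is to enforce it outside the model, with a wrapper that caps both the attempt count and the tokens spent on retries. The sketch below is hypothetical: `call_with_token_budget` and the flat `tokens_per_attempt` estimate are assumptions for illustration, and a production version would likely add backoff with jitter between attempts.

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


class RetryBudgetExhausted(RuntimeError):
    """Raised when retries stop because attempts or tokens ran out."""


def call_with_token_budget(
    call: Callable[[], T],
    tokens_per_attempt: int,
    token_budget: int,
    max_attempts: int = 3,
) -> Tuple[T, int]:
    """Run `call`, retrying on failure until attempts or tokens run out.

    Returns the result together with the tokens charged so far.
    """
    spent = 0
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        # Stop before an attempt that would overrun the token budget.
        if spent + tokens_per_attempt > token_budget:
            break
        spent += tokens_per_attempt
        try:
            return call(), spent
        except Exception as exc:  # sketch only: real code should narrow this
            last_error = exc
    raise RetryBudgetExhausted(
        f"gave up after spending {spent} tokens"
    ) from last_error
```

The key design choice is that the budget lives in deterministic code around the tool call, so a flaky endpoint costs a bounded amount no matter how the agent reasons about it.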

The Stale World Model Problem in Long-Running Agents

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

An AI agent reads a file at turn 3, reasons about its contents through turns 4 through 30, and then — at turn 31 — writes a modified version back to disk. The file was edited by another process at turn 17. The agent overwrites the newer version with a stale one, silently. No exception is raised. No alert fires. From the outside, the agent completed its task successfully.

This is the stale world model problem, and it's one of the most under-discussed failure modes in production agentic systems. Unlike context window overflows or tool call failures — which surface as errors — world model staleness produces agents that look operational while making decisions on outdated information. The failures are quiet, often irreversible, and they compound over the length of a task.
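The file overwrite in the opening example can be guarded with an optimistic concurrency check: remember a content hash at read time and refuse to write if the file no longer matches it. This is a minimal sketch with hypothetical helper names. The check and the write are not atomic, so a small race window remains that a real system would need to close with file locking or an atomic rename.

```python
import hashlib
from pathlib import Path
from typing import Tuple


class StaleReadError(RuntimeError):
    """Raised when a file changed between the agent's read and its write."""


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def read_with_version(path: Path) -> Tuple[str, str]:
    """Return the file's text plus a content hash to use as its version."""
    data = path.read_bytes()
    return data.decode("utf-8"), _digest(data)


def write_if_unchanged(path: Path, new_text: str, expected_version: str) -> None:
    """Write only if the file still matches the version read earlier."""
    current_version = _digest(path.read_bytes())
    if current_version != expected_version:
        raise StaleReadError(f"{path} changed since it was read")
    # Not atomic with the check above: see the caveat in the lead-in.
    path.write_text(new_text, encoding="utf-8")
```

On a `StaleReadError` the agent can re-read the file and redo its reasoning against the current contents instead of silently clobbering the newer version.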
