
159 posts tagged with "reliability"


The AI Incident Response Playbook: Diagnosing LLM Degradation in Production

· 13 min read
Tian Pan
Software Engineer

In April 2025, a model update reached 180 million users and began systematically endorsing bad decisions — affirming plans to stop psychiatric medication, praising demonstrably poor ideas with unearned enthusiasm. The provider's own alerting didn't catch it. Power users on social media did. The rollback took three days. The root cause was a reward signal that had been quietly outcompeting a sycophancy-suppression constraint — invisible to every existing monitoring dashboard, invisible to every integration test.

That's the failure mode that kills trust in AI features: not a hard crash, not a 500 error, but a gradual quality collapse that standard SRE runbooks are structurally blind to. Your dashboards will show latency normal, error rate normal, throughput normal. And the model will be confidently wrong.

This is the incident response playbook your on-call rotation actually needs.

The AI Incident Runbook: When Your Agent Causes Real-World Harm

· 11 min read
Tian Pan
Software Engineer

Your agent just did something it shouldn't have. Maybe it sent emails to the wrong people. Maybe it executed a database write that should have been a read. Maybe it gave medical advice that sent a user to the hospital. You are now in an AI incident — and the playbook you've been using for software outages will not help you.

Traditional incident runbooks are built on a foundational assumption: given the same input, the system produces the same output. That assumption lets you reproduce the failure, bisect toward the cause, and verify the fix. None of that applies to a stochastic system operating on natural language. The same prompt through the same pipeline can produce different results across runs, providers, regions, and time. Documented AI incidents surged 56% from 2023 to 2024, yet most organizations still route these events through software incident processes designed for a fundamentally different class of problem.

This is the runbook they should have written.
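
As a taste of what that runbook's first page tends to require, here is a minimal sketch of snapshotting the full inference context so a nondeterministic failure can at least be re-run under the same conditions. The field names and the `record_incident_context` helper are illustrative assumptions, not the post's actual schema.

```python
# Illustrative incident snapshot: without the exact model version, parameters,
# and rendered prompt, a stochastic failure cannot even be approximately replayed.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class InferenceSnapshot:
    model: str              # exact model ID, including any version/date suffix
    prompt: str             # fully rendered prompt, not the template name
    temperature: float
    top_p: float
    provider_region: str
    timestamp: float

def record_incident_context(snapshot: InferenceSnapshot, path: str) -> None:
    """Persist everything needed to replay the call during the postmortem."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(snapshot)) + "\n")

record_incident_context(
    InferenceSnapshot(model="example-model-2025-01", prompt="...", temperature=0.7,
                      top_p=1.0, provider_region="us-east", timestamp=time.time()),
    "incident_snapshots.jsonl",
)
```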

Browser Agents in Production: The DOM Fragility Tax

· 13 min read
Tian Pan
Software Engineer

A calendar date picker broke a production browser agent for three days before anyone noticed. The designer had swapped a native `<input type="date">` for a custom React component during a minor UI refresh. No API changed. No content moved. Just 24px cells in a new layout — and the vision model that had been reliably clicking the right dates now missed by one cell, silently booking appointments on the wrong day.

This is the DOM fragility tax: the ongoing operational cost of building automated agents on top of a web that was never designed to be operated by machines. Unlike most infrastructure taxes, it compounds. The web changes. Anti-bot defenses evolve. SPAs get more dynamic. And your agent quietly degrades.

Compaction Traps: Why Long-Running Agents Forget What They Already Tried

· 9 min read
Tian Pan
Software Engineer

An agent calls a file-writing tool. The tool fails with a permission error. The agent records this, moves on to a different approach, and eventually runs long enough that the runtime triggers context compaction. The summary reads: "the agent has been working on writing output files." What it drops: that the permission error ever happened, and why the original approach was abandoned. Three hundred tokens later, the agent tries the same write again.

This pattern — call it the compaction trap — is one of the most persistent reliability failures in production agent systems. It's not a model bug. It's an architecture mismatch between how compaction works and what agents actually need to stay coherent across long sessions.
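
A minimal sketch of one way out of the trap, assuming an agent loop you control: keep failed attempts in a structured ledger outside the summarized transcript, so compaction can drop prose without dropping the permission error. The `AttemptLedger` name and the compaction hook below are illustrative, not a specific framework's API.

```python
# Illustrative fix: failures live in a structured ledger that survives compaction,
# so the agent can ask "did I already try this?" even after the summary forgets.
from dataclasses import dataclass, field

@dataclass
class AttemptLedger:
    failures: list[dict] = field(default_factory=list)

    def record_failure(self, tool: str, args: dict, error: str) -> None:
        self.failures.append({"tool": tool, "args": args, "error": error})

    def already_failed(self, tool: str, args: dict) -> str | None:
        for f in self.failures:
            if f["tool"] == tool and f["args"] == args:
                return f["error"]
        return None

def compact(transcript: list[str], summarize, ledger: AttemptLedger) -> list[str]:
    summary = summarize(transcript)          # may drop the permission error
    notes = [f"PRIOR FAILURE: {f['tool']}({f['args']}) -> {f['error']}"
             for f in ledger.failures]       # re-injected verbatim after compaction
    return [summary, *notes]
```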

The Data Quality Tax in LLM Systems: Why Bad Input Hits Differently

· 9 min read
Tian Pan
Software Engineer

Your gradient boosting model degrades politely when data gets noisy. Accuracy drops, precision drops, a monitoring alert fires, and the on-call engineer knows exactly where to look. LLMs don't do that. Feed an LLM degraded, stale, or malformed input and it produces fluent, confident, authoritative-sounding output that is partially or entirely wrong — and the downstream system consuming it has no way to tell the difference.

This is the data quality tax: the compounding cost you pay when bad data enters an LLM pipeline, expressed not as lower confidence scores but as hallucinations dressed in the syntax of facts.

Dead Reckoning for Long-Running Agents: Knowing Where Your Agent Is Without Stopping It

· 11 min read
Tian Pan
Software Engineer

Before GPS, sailors used dead reckoning: take your last confirmed position, note your speed and heading, and project forward. It works until the accumulated error compounds into something irreversible—a reef you didn't see coming.

Long-running AI agents have exactly this problem. When an agent spends two hours orchestrating API calls, writing documents, and executing multi-step plans, the people running it often have no better visibility than a sailor without instruments. The agent either finishes or it doesn't. The failure mode isn't the crash—it's the silent loop that burns $30 in tokens while appearing to work, or the agent that "successfully" completes the wrong task because its world model drifted an hour into execution.

Production data makes this concrete: agents with undetected loops have been documented repeating the same tool call 58 times before manual intervention. A two-hour runaway at frontier model rates costs $15–40 before anyone notices. And the worst failures aren't the ones that error out—they're the 12–18% of "successful" runs that return plausible-looking wrong answers.
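
A rough sketch of the instrument panel that would have caught both failure modes, assuming you can observe each tool call as it happens; the thresholds and the `LoopGuard` name are placeholders, not a prescription.

```python
# Illustrative loop guard: flag repeated identical tool calls and a spend ceiling
# without stopping the agent to inspect it.
import hashlib
import json
from collections import Counter

class LoopGuard:
    def __init__(self, max_repeats: int = 5, max_spend_usd: float = 10.0):
        self.calls = Counter()
        self.spend = 0.0
        self.max_repeats = max_repeats
        self.max_spend_usd = max_spend_usd

    def observe(self, tool: str, args: dict, cost_usd: float) -> list[str]:
        # Identical tool + identical arguments hash to the same key.
        key = hashlib.sha256(json.dumps([tool, args], sort_keys=True).encode()).hexdigest()
        self.calls[key] += 1
        self.spend += cost_usd
        alerts = []
        if self.calls[key] >= self.max_repeats:
            alerts.append(f"loop suspected: {tool} repeated {self.calls[key]} times")
        if self.spend >= self.max_spend_usd:
            alerts.append(f"spend ceiling reached: ${self.spend:.2f}")
        return alerts
```

Five identical calls is an arbitrary cutoff; the point is that 58 should never be reachable without an alert firing first.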

Designing for Partial Completion: When Your Agent Gets 70% Done and Stops

· 10 min read
Tian Pan
Software Engineer

Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.

The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.
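
A minimal sketch of what that state machine can look like, with step names borrowed from the itinerary example purely for illustration; the real design depends on which steps have side effects worth compensating.

```python
# Per-step states instead of a binary outcome: a stopped run leaves a
# resumable record rather than an ambiguous "failed".
from dataclasses import dataclass
from enum import Enum

class StepState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    FAILED = "failed"
    COMPENSATED = "compensated"   # side effect rolled back

@dataclass
class Step:
    name: str
    state: StepState = StepState.PENDING
    detail: str = ""

def resume_plan(steps: list[Step]) -> list[Step]:
    """Return the steps that still need work after a partial run."""
    return [s for s in steps if s.state in (StepState.PENDING, StepState.FAILED)]

itinerary = [Step("book_flight", StepState.DONE),
             Step("book_hotel", StepState.FAILED, "no availability")]
# resume_plan(itinerary) returns only book_hotel; the confirmed flight is preserved.
```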

The Idempotency Problem in Agentic Tool Calling

· 11 min read
Tian Pan
Software Engineer

The scenario plays out the same way every time. Your agent is booking a hotel room, and a network timeout occurs right after the payment API call returns 200 but before the confirmation is stored. The agent framework retries. The payment runs again. The customer is charged twice, support escalates, and someone senior says the AI "hallucinated a double charge" — which is wrong but feels right because nobody wants to say their retry logic was broken from the start.

This isn't an AI problem. It's a distributed systems problem that the AI layer imported wholesale, without the decades of hard-won patterns that distributed systems engineers developed to handle it. Standard agent retry logic assumes operations are idempotent. Most tool calls are not.
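
The classic fix is the one distributed systems engineers reach for: an idempotency key generated once per logical action and reused on every retry. A toy sketch, with a local cache standing in for a downstream API that honors such keys:

```python
# Idempotency-keyed tool calls: retries reuse the same key, so the side effect
# executes at most once. The runner and key format here are illustrative.
import uuid

class IdempotentToolRunner:
    def __init__(self):
        self._results: dict[str, object] = {}

    def run(self, key: str, call):
        if key in self._results:          # retry: return the stored result
            return self._results[key]
        result = call()                   # first attempt: perform the side effect
        self._results[key] = result
        return result

runner = IdempotentToolRunner()
payment_key = f"book-hotel-{uuid.uuid4()}"   # generated once per logical action
charge = runner.run(payment_key, lambda: {"status": "charged", "amount": 214.00})
retry = runner.run(payment_key, lambda: {"status": "charged", "amount": 214.00})
assert retry is charge                       # the retry did not charge again
```

Many payment APIs accept an idempotency key on the request for exactly this reason: the retried call returns the original charge instead of creating a second one.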

Long-Session Context Degradation: How Multi-Turn Conversations Go Stale

· 8 min read
Tian Pan
Software Engineer

The first time a user's 80-turn support conversation suddenly started contradicting advice given 60 turns ago, the team blamed a bug. There was no bug. The model was simply lost. Across all major frontier models, multi-turn conversations show an average 39% performance drop compared to single-turn interactions on the same tasks. Most teams never measure this. They assume context windows are roughly as powerful as their token limit suggests, and they build products accordingly.

That assumption is quietly wrong. Long sessions don't just get slower or more expensive — they get unreliable in ways that are nearly impossible to notice until users are already frustrated.

Model Deprecation Is a Production Incident Waiting to Happen

· 9 min read
Tian Pan
Software Engineer

A model you deployed six months ago has a sunset date on the calendar. You probably didn't mark it. Your on-call rotation doesn't know about it. There's no ticket in the backlog. And when the provider finally pulls the plug, you'll get a `404 Model not found` error in production at the worst possible time, with no rollback plan ready.

This is the standard story for most engineering teams using hosted LLMs. Model deprecation gets categorized as a vendor concern, not an operational one — right until the moment it becomes an incident.
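
One hedge, sketched here with placeholder model IDs and a stubbed `call_model` helper rather than any specific provider's SDK: pin an approved successor in config so a sunset model degrades to a fallback instead of a production 404.

```python
# Illustrative fallback chain for deprecated models. Model IDs are placeholders.
MODEL_CHAIN = ["my-pinned-model-2024-06", "my-approved-successor-2025-01"]

class ModelNotFound(Exception):
    pass

def call_model(model_id: str, prompt: str) -> str:
    raise NotImplementedError  # provider SDK call goes here

def complete(prompt: str) -> str:
    last_error: Exception | None = None
    for model_id in MODEL_CHAIN:
        try:
            return call_model(model_id, prompt)
        except ModelNotFound as err:      # deprecated / sunset model
            last_error = err
            continue
    raise RuntimeError("all models in the fallback chain are unavailable") from last_error
```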

The 90% Reliability Wall: Why AI Features Plateau and What to Do About It

· 9 min read
Tian Pan
Software Engineer

Your AI feature ships at 92% accuracy. The team celebrates. Three months later, progress has flatlined — the error rate stopped falling despite more data, more compute, and two model upgrades. Sound familiar?

This is the 90% reliability wall, and it is not a coincidence. It emerges from three converging forces: the exponential cost of marginal accuracy gains, the difference between errors you can eliminate and errors that are structurally unavoidable, and the compound amplification of failure in production environments that benchmarks never capture. Teams that do not understand which force they are fighting will waste quarters trying to solve problems that are not solvable.
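
The compound amplification force, at least, is easy to put numbers on. Under the simplifying assumption that steps fail independently, per-step reliability multiplies:

```python
# 92% per step is not 92% end to end: independent per-step reliabilities multiply.
def pipeline_success(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

for steps in (1, 3, 5, 10):
    print(f"{steps:>2} steps -> {pipeline_success(0.92, steps):.3f}")
# 1 -> 0.920, 3 -> 0.779, 5 -> 0.659, 10 -> 0.434
```

A ten-step agent built from 92%-reliable steps completes correctly less than half the time, which is one reason the wall feels immovable from inside any single component.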

The Skill Atrophy Trap: How AI Assistance Silently Erodes the Engineers Who Use It Most

· 10 min read
Tian Pan
Software Engineer

A randomized controlled trial with 52 junior engineers found that those who used AI assistance scored 17 percentage points lower on comprehension and debugging quizzes — nearly two letter grades — compared to those who worked unassisted. Debugging, the very skill AI is supposed to augment, showed the largest gap. And this was after just one learning session. Extrapolate that across a year of daily AI assistance, and you start to understand why senior engineers at several companies quietly report that something has changed about how their team reasons through hard problems.

The skill atrophy problem with AI tooling is real, it's measurable, and it's hitting mid-career engineers hardest. Here's what the research shows and what you can do about it.