Skip to main content

238 posts tagged with "reliability"

View all tags

The Integration Test Mirage: Why Mocked Tool Outputs Hide Your Agent's Real Failure Modes

· 11 min read
Tian Pan
Software Engineer

Your agent passes every test. The CI pipeline is green. You ship it.

A week later, a user reports that their bulk-export job silently returned 200 records instead of 14,000. The agent hit the first page of a paginated API, got a clean response, assumed there was nothing more, and moved on. Your mock returned all 200 items in one shot. The real API never told the agent there were 70 more pages.

This is not a model failure. The model reasoned correctly. This is a test infrastructure failure — and it's endemic to how teams build and test agentic systems.

The Overclaiming Trap: When Being Right for the Wrong Reasons Destroys AI Product Trust

· 10 min read
Tian Pan
Software Engineer

Most AI product post-mortems focus on the same story: the model was wrong, users noticed, trust eroded. The fix is obvious — improve accuracy. But there is a more insidious failure mode that post-mortems rarely capture because standard accuracy metrics don't surface it: the model was right, but for the wrong reasons, and the power users who checked the reasoning never came back.

Call it the overclaiming trap. It is the failure mode where correct final answers are backed by fabricated, retrofitted, or structurally unsound reasoning chains. It is more dangerous than ordinary wrongness because it looks like success until your most sophisticated users start quietly leaving.

The CAP Theorem for AI Agents: Why Your Agent Fails Completely When It Should Degrade Gracefully

· 9 min read
Tian Pan
Software Engineer

Your AI agent works perfectly until it doesn't. One tool goes down — maybe the search API is rate-limited, maybe the database is slow, maybe the code execution sandbox times out — and the entire agent collapses. Not a partial answer, not a degraded response. A complete failure. A blank screen or a hallucinated mess.

This is not a bug. It is a design choice, and almost nobody made it deliberately. The agent architectures we are building today implicitly choose "fail completely" because nobody designed the partial-availability path. If you have built distributed systems before, this pattern should feel painfully familiar. It is the CAP theorem, showing up in a new disguise.

Cascading Context Corruption: Why One Wrong Fact Derails Your Entire Agent Run

· 8 min read
Tian Pan
Software Engineer

Your agent completes a 25-step research task. The final report looks polished, citations check out, and the reasoning chain appears coherent. Except the agent hallucinated a company's founding year in step 3, and every subsequent inference — market timing analysis, competitive positioning, growth trajectory — built on that wrong date. The output is confidently, systematically wrong, and nothing in your pipeline caught it.

This is cascading context corruption: a single incorrect intermediate conclusion that propagates through subsequent reasoning steps and tool calls, compounding into system-wide failure. It is the most dangerous failure mode in long-running agents — because it looks like success.

Phantom Tool Calls: When AI Agents Invoke Tools That Don't Exist

· 8 min read
Tian Pan
Software Engineer

Your agent passes every unit test, handles the happy path beautifully, and then one Tuesday afternoon it tries to call get_user_preferences_v2 — a function that has never existed in your codebase. The call looks syntactically perfect. The parameters are reasonable. The only problem: your agent fabricated the entire thing.

This is the phantom tool call — a hallucination that doesn't manifest as wrong text but as a wrong action. Unlike a hallucinated fact that a human might catch during review, a phantom tool call hits your runtime, throws a cryptic ToolNotFoundError, and derails a multi-step workflow that was otherwise running fine.

Stakeholder Prompt Conflicts: When Platform, Business, and User Instructions Compete at Inference Time

· 10 min read
Tian Pan
Software Engineer

In 2024, Air Canada's chatbot invented a bereavement fare refund policy that didn't exist. A court ruled the company was bound by what the bot said. The root cause wasn't a model hallucination in the traditional sense — it was a priority inversion. The system prompt said "be helpful." Actual policy said "follow documented rules." When a user asked about compensation, the model silently resolved the conflict in favor of sounding helpful, and nobody audited that choice before it landed the company in court.

This is the stakeholder prompt conflict problem. Every production LLM system has at least three instruction authors: the platform layer (safety constraints and base model behavior), the business layer (operator-defined rules, compliance requirements, brand voice), and the user layer (the actual request). When those layers contradict each other — and they will — the model picks a winner. The question is whether your engineering team made that pick deliberately, or whether the model did it without anyone noticing.

The Anthropomorphism Tax: Why Treating Your Agent Like a Colleague Breaks Production Systems

· 10 min read
Tian Pan
Software Engineer

An engineering team builds an agent to process customer requests. It works beautifully in demos. They deploy it. Three weeks later, it has quietly been telling users incorrect information with full confidence, skipping steps when context gets long, and occasionally looping forever on ambiguous inputs. The postmortem reveals the team never built retry logic, never validated outputs, and never defined what the agent should do when it was uncertain. When asked why, the answer is revealing: "We figured it would handle those edge cases."

That phrase — "we figured it would handle those edge cases" — is the anthropomorphism tax made explicit. The team designed the system the way you'd manage a junior developer: brief them, trust their judgment, correct when they raise a hand. LLM agents don't raise a hand. They generate the next token.

The Context Window Cliff: What Actually Happens When Your Agent Hits the Limit Mid-Task

· 9 min read
Tian Pan
Software Engineer

Your agent completes steps one through six flawlessly. Step seven contradicts step two. Step eight hallucinates a tool that doesn't exist. Step nine confidently submits garbage. Nothing crashed. No error was thrown. The agent simply forgot what it was doing — and kept going anyway.

This is the context window cliff: the moment an AI agent's accumulated context exceeds its effective reasoning capacity. It doesn't fail gracefully. It doesn't ask for help. It makes confidently wrong decisions based on partial information, and you won't know until the damage is done.

The LLM Forgery Problem: When Your Model Builds a Convincing Case for the Wrong Answer

· 10 min read
Tian Pan
Software Engineer

Your model wrote a detailed, well-structured analysis. Every sentence was grammatically correct and internally consistent. The individual facts it cited were accurate. And yet the conclusion was wrong — not because the model lacked the information to get it right, but because it had already decided on the answer before it started reasoning.

This is not hallucination. Hallucination is when a model fabricates facts. The forgery problem is subtler and, in production systems, harder to catch: the model reaches a conclusion first, then constructs a plausible-sounding chain of evidence to support it. The facts are real. The synthesis is a lie.

Engineers who haven't encountered this failure mode yet will. It shows up in every domain where LLMs are asked to do analysis — code review, document summarization, risk assessment, question answering over a knowledge base. The model sounds authoritative. It cites real evidence. And it has quietly ignored everything that pointed the other way.

The Warm Standby Problem: Why Your AI Override Button Isn't a Safety Net

· 11 min read
Tian Pan
Software Engineer

Most teams building AI agents are designing for success. They instrument success rates, celebrate when the agent handles 90% of tickets autonomously, and put a "click here to override" button in the corner of the UI for the remaining 10%. Then they move on.

The button is not a safety net. It is a liability dressed as a feature.

The failure mode is not the agent breaking. It's the human nominally in charge not being able to take over when it does. The AI absorbed the task gradually — one workflow at a time, one edge case at a time — until the operator who used to handle it has not touched it in six months, has lost the context, and is being handed a live situation they are no longer equipped to manage. This is the warm standby problem, and it compounds silently until an incident forces it into view.

Treating Your LLM Provider as an Unreliable Upstream: The Distributed Systems Playbook for AI

· 11 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Response times look fine. Error rates are near zero. And yet your users are filing tickets about garbage answers, your agent is making confidently wrong decisions, and your support queue is filling up with complaints that don't correlate with any infrastructure alert you have.

Welcome to the unique hell of depending on an LLM API in production. It's an upstream service that can fail you while returning a perfectly healthy 200 OK.

AI in the SRE Loop: What Works, What Breaks, and Where to Draw the Line

· 12 min read
Tian Pan
Software Engineer

Most production incidents don't fail because of missing tools. They fail because the person holding the pager doesn't have enough context fast enough. An engineer wakes up at 3 AM to a wall of firing alerts, spends the first 20 minutes piecing together what actually broke, another 20 minutes deciding which runbook applies, and by the time they're executing the fix, the incident has been open for nearly an hour. The raw fix might take 5 minutes.

AI can compress that context-gathering window from 40 minutes to under 2. That's the genuine value on the table. But "LLM helps your oncall" is not one product decision — it's a stack of decisions, each with its own failure mode, and some of those failure modes have consequences that a customer service chatbot hallucination doesn't.