182 posts tagged with "reliability"

Schema Entropy: Why Your Tool Definitions Are Rotting in Production

· 10 min read
Tian Pan
Software Engineer

Your agent was working fine in January. By March, it started failing on 15% of tool calls. By May, it was silently producing wrong outputs on another 20%. Nothing in your deployment logs changed. No one touched the agent code. The tool definitions look exactly like they did six months ago — and that's the problem.

Tool schemas don't have to be edited to become wrong. The services they describe change underneath them. Enum values get added. Required fields become optional in a backend refactor. A parameter that used to accept strings now expects an ISO 8601 timestamp. The schema document stays frozen while the underlying API keeps moving, and your agent keeps calling it confidently, with no idea the contract has shifted.

This is schema entropy: the gradual divergence between the tool definitions your agent was trained to use and the tool behavior your production services actually exhibit. It is one of the most underappreciated reliability problems in production AI systems, and research suggests tool versioning issues account for roughly 60% of production agent failures.
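One way to catch this drift is to diff the frozen tool definition against whatever schema the service publishes today. The sketch below assumes both are available as JSON-Schema-style dictionaries. `find_schema_drift` and the sample `declared` and `live` schemas are hypothetical, invented for illustration, and a real check would also need to walk nested objects.

```python
def find_schema_drift(declared: dict, live: dict) -> list[str]:
    """Compare a frozen tool definition against the service's current schema."""
    problems = []
    declared_props = declared.get("properties", {})
    live_props = live.get("properties", {})

    for name in sorted(set(declared_props) | set(live_props)):
        if name not in live_props:
            problems.append(f"{name}: removed from live schema")
            continue
        if name not in declared_props:
            problems.append(f"{name}: missing from tool definition")
            continue
        old, new = declared_props[name], live_props[name]
        # Only the top-level keywords the excerpt mentions are compared here.
        for key in ("type", "format", "enum"):
            if old.get(key) != new.get(key):
                problems.append(
                    f"{name}: {key} changed from {old.get(key)!r} to {new.get(key)!r}"
                )

    if set(declared.get("required", [])) != set(live.get("required", [])):
        problems.append("required fields changed")
    return problems


# Hypothetical schemas: the definition the agent uses vs. what the service accepts now.
declared = {
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed"]},
        "since": {"type": "string"},
    },
    "required": ["status", "since"],
}
live = {
    "properties": {
        "status": {"type": "string", "enum": ["open", "closed", "archived"]},
        "since": {"type": "string", "format": "date-time"},
    },
    "required": ["status"],
}
drift = find_schema_drift(declared, live)
```

Run in CI or on a schedule, a non-empty `drift` list is a prompt to review the tool definition before the agent discovers the mismatch on its own.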

The Semantic Validation Layer: Why JSON Schema Isn't Enough for Production LLM Outputs

· 10 min read
Tian Pan
Software Engineer

By 2025, every major LLM provider had shipped constrained decoding for structured outputs. OpenAI, Anthropic, Google, and Mistral all let you hand the model a JSON schema and guarantee it comes back structurally intact. Teams adopted this and breathed a collective sigh of relief. Parsing errors disappeared. Retry loops shrank. Dashboards turned green.

Then the subtle failures started.

A sentiment classifier locked in at 0.99 confidence on every input — gibberish included — for two weeks before anyone noticed. A credit risk agent returned valid JSON approving a loan application that should have been declined, with a risk score fifty points too high. A financial pipeline coerced "$500,000" (a string, technically schema-valid) down to zero in an integer field, corrupting six weeks of risk calculations. Every one of these failures passed schema validation cleanly.

The lesson: structural validity is necessary, not sufficient. You need a semantic validation layer, and most teams don't have one.
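A semantic layer can start small: a few checks that run after schema validation and ask whether the values make sense. The following is a minimal sketch, not a reference implementation. The function names, the field names `amount` and `confidence`, and the thresholds are all assumptions chosen to mirror the failures described above.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_amount(raw) -> Decimal:
    """Parse a monetary amount, raising instead of silently coercing to zero."""
    # Floats are rejected on purpose in this sketch to avoid rounding surprises.
    if isinstance(raw, bool) or not isinstance(raw, (int, str, Decimal)):
        raise ValueError(f"unsupported amount type: {type(raw).__name__}")
    cleaned = re.sub(r"[$,\s]", "", raw) if isinstance(raw, str) else raw
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"unparseable amount: {raw!r}") from None
    if not value.is_finite():
        raise ValueError(f"non-finite amount: {raw!r}")
    return value


def semantic_errors(output: dict) -> list[str]:
    """Checks a schema cannot express: plausibility, not structure."""
    errors = []
    try:
        if parse_amount(output.get("amount")) <= 0:
            errors.append("amount must be positive")
    except ValueError as exc:
        errors.append(str(exc))
    confidence = output.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        errors.append("confidence outside [0, 1]")
    return errors


def looks_stuck(confidences: list[float], min_spread: float = 0.01,
                min_samples: int = 20) -> bool:
    """Flag a classifier whose confidence barely moves across many inputs."""
    if len(confidences) < min_samples:
        return False
    return max(confidences) - min(confidences) < min_spread
```

The per-record checks catch the "$500,000" case, while `looks_stuck` is a batch-level check aimed at the classifier pinned at 0.99, which no single-record validation would notice.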

Silent Async Agent Failures: Why Your AI Jobs Die Without Anyone Noticing

· 9 min read
Tian Pan
Software Engineer

Async AI jobs have a problem that traditional background workers don't: they fail silently and confidently. A document processing agent returns HTTP 200, logs a well-formatted result, and moves on — while the actual output is subtly wrong, partially complete, or based on a hallucinated fact three steps back. Your dashboards stay green. Your on-call engineer sleeps through it. Your customers eventually notice.

This is not an edge case. It's the default behavior of async AI systems that haven't been deliberately designed for observability. The tools that keep background job queues reliable in conventional distributed systems — dead letter queues, idempotency keys, saga logs — also work for AI agents. But the failure modes are different enough that they require some translation.
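As a rough illustration of that translation, here is an in-memory sketch that combines an idempotency key with a dead letter list. The difference from a conventional worker is that a job also counts as failed when its output does not pass validation, not only when it raises. `JobRunner`, `summarize`, and `check` are hypothetical names, and a production system would back this with a real queue and durable storage.

```python
import hashlib
import json


class JobRunner:
    def __init__(self, handler, validate, max_attempts=3):
        self.handler = handler          # runs the agent step
        self.validate = validate        # returns None if the output is acceptable
        self.max_attempts = max_attempts
        self.results = {}               # idempotency key -> stored result
        self.dead_letters = []          # jobs that never produced a valid result

    @staticmethod
    def idempotency_key(payload: dict) -> str:
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def run(self, payload: dict):
        key = self.idempotency_key(payload)
        if key in self.results:
            return self.results[key]    # duplicate delivery: reuse, do not redo
        last_error = "no attempts made"
        for _ in range(self.max_attempts):
            try:
                result = self.handler(payload)
            except Exception as exc:
                last_error = f"handler raised: {exc}"
                continue
            problem = self.validate(result)
            if problem is None:
                self.results[key] = result
                return result
            last_error = f"invalid output: {problem}"
        self.dead_letters.append({"key": key, "payload": payload, "error": last_error})
        return None


calls = []

def summarize(payload):
    # Stand-in for an agent call that "succeeds" but may return an empty summary.
    calls.append(payload["doc_id"])
    return {"summary": "" if payload["doc_id"] == "bad" else "ok"}

def check(result):
    return None if result["summary"] else "empty summary"

runner = JobRunner(summarize, check)
```

A job that returns a well-formed but empty summary ends up in `dead_letters` after three attempts, where someone can see it, rather than being logged as a success.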

The AI Rollback Ritual: Post-Incident Recovery When the Damage Is Behavioral, Not Binary

· 11 min read
Tian Pan
Software Engineer

In April 2025, OpenAI deployed an update to GPT-4o. No version bump appeared in the API. No changelog entry warned developers. Within days, enterprise applications that had been running stably for months started producing outputs that were subtly, insidiously wrong — not crashing, not throwing errors, just enthusiastically agreeing with users about terrible ideas. A model that had been calibrated and tested was now validating harmful decisions with polished confidence. OpenAI rolled it back three days later. By then, some applications had already shipped those outputs to real users.

This is the failure mode that traditional SRE practice has no template for. There was no deploy to revert. There was no diff to inspect. There was no test that failed, because behavioral regressions don't fail tests — they degrade silently across distributions until someone notices the vibe is off.

The Integration Test Mirage: Why Mocked Tool Outputs Hide Your Agent's Real Failure Modes

· 11 min read
Tian Pan
Software Engineer

Your agent passes every test. The CI pipeline is green. You ship it.

A week later, a user reports that their bulk-export job silently returned 200 records instead of 14,000. The agent hit the first page of a paginated API, got a clean response, assumed there was nothing more, and moved on. Your mock returned all 200 items in one shot. The real API never told the agent there were 69 more pages.

This is not a model failure. The model reasoned correctly. This is a test infrastructure failure — and it's endemic to how teams build and test agentic systems.
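A test double that paginates the way the real service does would have exposed the bug. The sketch below assumes a cursor-style API that returns `items` plus a `next_cursor` field. That response shape, `FakePaginatedApi`, and `export_all` are hypothetical, since the excerpt does not describe the real API's contract.

```python
class FakePaginatedApi:
    """Test double that serves records one page at a time."""

    def __init__(self, total_records: int, page_size: int):
        self.records = [{"id": i} for i in range(total_records)]
        self.page_size = page_size
        self.calls = 0

    def list_records(self, cursor: int = 0) -> dict:
        self.calls += 1
        page = self.records[cursor:cursor + self.page_size]
        next_cursor = cursor + self.page_size
        return {
            "items": page,
            # None signals the last page, as many cursor-based APIs do.
            "next_cursor": next_cursor if next_cursor < len(self.records) else None,
        }


def export_all(api) -> list[dict]:
    """Follow the cursor until the API says there is nothing left."""
    items, cursor = [], 0
    while cursor is not None:
        response = api.list_records(cursor)
        items.extend(response["items"])
        cursor = response["next_cursor"]
    return items


api = FakePaginatedApi(total_records=14_000, page_size=200)
exported = export_all(api)
```

With 14,000 records at 200 per page, the fake forces 70 requests. An agent that stops after the first one fails the test instead of failing in production.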

The Overclaiming Trap: When Being Right for the Wrong Reasons Destroys AI Product Trust

· 10 min read
Tian Pan
Software Engineer

Most AI product post-mortems focus on the same story: the model was wrong, users noticed, trust eroded. The fix is obvious — improve accuracy. But there is a more insidious failure mode that post-mortems rarely capture because standard accuracy metrics don't surface it: the model was right, but for the wrong reasons, and the power users who checked the reasoning never came back.

Call it the overclaiming trap. It is the failure mode where correct final answers are backed by fabricated, retrofitted, or structurally unsound reasoning chains. It is more dangerous than ordinary wrongness because it looks like success until your most sophisticated users start quietly leaving.

The CAP Theorem for AI Agents: Why Your Agent Fails Completely When It Should Degrade Gracefully

· 9 min read
Tian Pan
Software Engineer

Your AI agent works perfectly until it doesn't. One tool goes down — maybe the search API is rate-limited, maybe the database is slow, maybe the code execution sandbox times out — and the entire agent collapses. Not a partial answer, not a degraded response. A complete failure. A blank screen or a hallucinated mess.

This is not a bug. It is a design choice, and almost nobody made it deliberately. The agent architectures we are building today implicitly choose "fail completely" because nobody designed the partial-availability path. If you have built distributed systems before, this pattern should feel painfully familiar. It is the CAP theorem, showing up in a new disguise.
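Designing the partial-availability path mostly means deciding, per tool, whether its failure is fatal. A minimal sketch of that decision follows. `AgentResult`, `gather`, and the split between required and optional tools are assumptions for illustration, not a pattern taken from the post.

```python
from dataclasses import dataclass, field


@dataclass
class AgentResult:
    data: dict = field(default_factory=dict)
    unavailable: list = field(default_factory=list)

    @property
    def degraded(self) -> bool:
        return bool(self.unavailable)


def gather(tools: dict, required: set) -> AgentResult:
    """Call every tool; only a failure in a required tool aborts the run."""
    result = AgentResult()
    for name, call in tools.items():
        try:
            result.data[name] = call()
        except Exception as exc:
            if name in required:
                raise RuntimeError(f"required tool {name!r} failed") from exc
            # Optional tool: record the gap so the answer can disclose it.
            result.unavailable.append(name)
    return result


def search():
    raise TimeoutError("search API rate-limited")

def database():
    return ["row-1", "row-2"]

tools = {"search": search, "database": database}
partial = gather(tools, required={"database"})
```

The `unavailable` list is the important part: it lets the agent say which sources it could not reach instead of filling the gap with a guess.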

Cascading Context Corruption: Why One Wrong Fact Derails Your Entire Agent Run

· 8 min read
Tian Pan
Software Engineer

Your agent completes a 25-step research task. The final report looks polished, citations check out, and the reasoning chain appears coherent. Except the agent hallucinated a company's founding year in step 3, and every subsequent inference — market timing analysis, competitive positioning, growth trajectory — built on that wrong date. The output is confidently, systematically wrong, and nothing in your pipeline caught it.

This is cascading context corruption: a single incorrect intermediate conclusion that propagates through subsequent reasoning steps and tool calls, compounding into system-wide failure. It is the most dangerous failure mode in long-running agents — because it looks like success.

Phantom Tool Calls: When AI Agents Invoke Tools That Don't Exist

· 8 min read
Tian Pan
Software Engineer

Your agent passes every unit test, handles the happy path beautifully, and then one Tuesday afternoon it tries to call get_user_preferences_v2 — a function that has never existed in your codebase. The call looks syntactically perfect. The parameters are reasonable. The only problem: your agent fabricated the entire thing.

This is the phantom tool call — a hallucination that doesn't manifest as wrong text but as a wrong action. Unlike a hallucinated fact that a human might catch during review, a phantom tool call hits your runtime, throws a cryptic ToolNotFoundError, and derails a multi-step workflow that was otherwise running fine.
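A guard at the dispatch boundary can turn the cryptic error into something the agent can recover from. The sketch below checks the requested name against a registry and suggests near matches using the standard library's `difflib.get_close_matches`. `UnknownToolError`, `dispatch`, and the registry contents are hypothetical.

```python
import difflib


class UnknownToolError(Exception):
    def __init__(self, name: str, suggestions: list[str]):
        self.name = name
        self.suggestions = suggestions
        super().__init__(
            f"Unknown tool {name!r}. Did you mean one of {suggestions}?"
        )


def dispatch(registry: dict, name: str, arguments: dict):
    """Refuse to execute any tool name that is not registered."""
    if name not in registry:
        suggestions = difflib.get_close_matches(name, list(registry), n=3)
        raise UnknownToolError(name, suggestions)
    return registry[name](**arguments)


# Hypothetical registry for illustration.
registry = {
    "get_user_preferences": lambda user_id: {"user_id": user_id, "theme": "dark"},
    "send_email": lambda to, body: f"sent to {to}",
}

try:
    dispatch(registry, "get_user_preferences_v2", {"user_id": 7})
except UnknownToolError as exc:
    phantom = exc
```

Feeding the error message back to the model as a tool result gives it a chance to retry with a real tool name rather than derailing the whole workflow.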

Stakeholder Prompt Conflicts: When Platform, Business, and User Instructions Compete at Inference Time

· 10 min read
Tian Pan
Software Engineer

Air Canada's chatbot invented a bereavement fare refund policy that didn't exist. In 2024, a Canadian tribunal ruled the company was bound by what the bot said. The root cause wasn't a model hallucination in the traditional sense; it was a priority inversion. The system prompt said "be helpful." Actual policy said "follow documented rules." When a user asked about compensation, the model silently resolved the conflict in favor of sounding helpful, and nobody audited that choice before it landed the company in front of a tribunal.

This is the stakeholder prompt conflict problem. Every production LLM system has at least three instruction authors: the platform layer (safety constraints and base model behavior), the business layer (operator-defined rules, compliance requirements, brand voice), and the user layer (the actual request). When those layers contradict each other — and they will — the model picks a winner. The question is whether your engineering team made that pick deliberately, or whether the model did it without anyone noticing.
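Making that pick deliberately can be as simple as writing the precedence down in code and logging every conflict it resolves. The sketch below assumes a fixed ordering in which platform outranks business and business outranks user. That ordering, the `Layer` enum, and `resolve` are illustrative assumptions, and real systems may need per-topic rules.

```python
from enum import IntEnum


class Layer(IntEnum):
    # Higher value wins. This ordering is an assumption, not a standard.
    USER = 1
    BUSINESS = 2
    PLATFORM = 3


def resolve(instructions):
    """instructions: iterable of (Layer, topic, directive) tuples.

    Returns the winning directive per topic and the topics where layers disagreed.
    """
    winners, conflicts = {}, []
    for layer, topic, directive in instructions:
        current = winners.get(topic)
        if current is None:
            winners[topic] = (layer, directive)
            continue
        if current[1] != directive and topic not in conflicts:
            conflicts.append(topic)     # surface the disagreement for audit
        if layer > current[0]:
            winners[topic] = (layer, directive)
    return winners, conflicts


instructions = [
    (Layer.USER, "refunds", "promise a refund"),
    (Layer.BUSINESS, "refunds", "quote documented policy only"),
    (Layer.PLATFORM, "safety", "refuse harmful requests"),
]
winners, conflicts = resolve(instructions)
```

The `conflicts` list matters as much as the winners: it is the audit trail showing which contradictions existed and how they were settled.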

The Anthropomorphism Tax: Why Treating Your Agent Like a Colleague Breaks Production Systems

· 10 min read
Tian Pan
Software Engineer

An engineering team builds an agent to process customer requests. It works beautifully in demos. They deploy it. Three weeks later, it has quietly been telling users incorrect information with full confidence, skipping steps when context gets long, and occasionally looping forever on ambiguous inputs. The postmortem reveals the team never built retry logic, never validated outputs, and never defined what the agent should do when it was uncertain. When asked why, the answer is revealing: "We figured it would handle those edge cases."

That phrase — "we figured it would handle those edge cases" — is the anthropomorphism tax made explicit. The team designed the system the way you'd manage a junior developer: brief them, trust their judgment, correct when they raise a hand. LLM agents don't raise a hand. They generate the next token.

The Context Window Cliff: What Actually Happens When Your Agent Hits the Limit Mid-Task

· 9 min read
Tian Pan
Software Engineer

Your agent completes steps one through six flawlessly. Step seven contradicts step two. Step eight hallucinates a tool that doesn't exist. Step nine confidently submits garbage. Nothing crashed. No error was thrown. The agent simply forgot what it was doing — and kept going anyway.

This is the context window cliff: the moment an AI agent's accumulated context exceeds its effective reasoning capacity. It doesn't fail gracefully. It doesn't ask for help. It makes confidently wrong decisions based on partial information, and you won't know until the damage is done.
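One possible mitigation is to enforce an explicit budget before each step and to report what was dropped, so the loss of context is never silent. The sketch below is an illustration under loud assumptions: `estimate_tokens` uses a crude four-characters-per-token guess instead of a real tokenizer, and `trim_history` is a hypothetical helper that keeps the original task plus the newest messages that fit.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def trim_history(messages: list[str], budget: int) -> tuple[list[str], int]:
    """Keep the first message (the task) plus the newest messages that fit.

    Returns the trimmed history and the number of messages dropped.
    """
    if not messages:
        return [], 0
    task, rest = messages[0], messages[1:]
    remaining = budget - estimate_tokens(task)
    kept = []
    for message in reversed(rest):      # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()                      # restore chronological order
    return [task] + kept, len(rest) - len(kept)


history = ["task: export all records"] + ["x" * 40] * 10
trimmed, dropped = trim_history(history, budget=30)
```

A non-zero `dropped` count is a signal worth acting on: summarize the discarded steps, checkpoint, or stop and ask, rather than letting the agent continue as if nothing was lost.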