238 posts tagged with "reliability"

API Documentation Is Reliability Infrastructure: How Your Docs Determine Agent Success Rates

· 10 min read
Tian Pan
Software Engineer

Most engineering teams think of API documentation as a developer experience concern — something you improve to reduce support tickets and onboarding time. That framing made sense when your primary consumer was a human reading docs in a browser. It is no longer adequate.

When an AI agent calls your API via tool use, your documentation stops being a guide and becomes runtime behavior. A vague parameter description isn't a UX inconvenience — it is a direct instruction to the model that produces hallucinated values. A missing error code isn't a gap in your reference docs — it is an ambiguous signal that can send an agent into a retry loop with no exit condition. The documentation you wrote three years ago for a human audience is now being parsed by a stateless language model that will execute confidently regardless of whether it understood correctly.
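
One way to picture the "retry loop with no exit condition" is to look at its opposite: a caller that retries only errors documented as retryable, and only a fixed number of times. The sketch below is illustrative, not taken from the post; `ToolError`, the `RETRYABLE` codes, and `call_with_budget` are hypothetical names chosen for the example.

```python
import time
from typing import Callable

# Hypothetical set of error codes that the API docs mark as safe to retry.
RETRYABLE = {"rate_limited", "timeout"}


class ToolError(Exception):
    """Hypothetical tool error carrying a documented, machine-readable code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def call_with_budget(tool: Callable[[], str], max_attempts: int = 3,
                     base_delay: float = 0.5) -> str:
    """Call `tool`, retrying documented retryable errors at most `max_attempts` times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except ToolError as err:
            # Unknown or non-retryable codes end the loop immediately,
            # as does running out of attempts: the loop always has an exit.
            if err.code not in RETRYABLE or attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    raise AssertionError("unreachable")
```

The design point is that the exit condition depends on the error code being documented: if the docs never say which failures are transient, neither a human nor a model can fill in `RETRYABLE` with confidence.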

The Cross-User Consistency Problem: When Your AI Gives Different Answers to the Same Question

· 9 min read
Tian Pan
Software Engineer

Two analysts at the same company both ask your AI assistant: "What was our Q3 churn rate?" One gets 4.2%. The other gets 4.8%. Neither is wrong — they just queried at different times, in different session contexts, against a retrieval index that ranked slightly different chunks. The AI answered both confidently, without hedging, without flagging the discrepancy. The analysts go into the same meeting with different numbers and your tool has just become a liability.

This is the cross-user consistency problem, and it's one of the most common reasons enterprise AI deployments quietly lose trust. The failure isn't a hallucination in the classic sense — no facts were invented. The failure is that your system is non-deterministic at scale, and that non-determinism is invisible until two users compare notes.

Human Override as a First-Class Feature: Designing AI Systems That Fail Gracefully to Human Control

· 10 min read
Tian Pan
Software Engineer

When an AI-powered customer support agent can't resolve an issue and escalates to a human, what happens next? In most systems: the customer is transferred cold, with no context, and must re-explain everything from the beginning. The human agent has no idea what the AI attempted, what information was collected, or why the handoff occurred.

This is the most common form of human override failure — not a dramatic AI meltdown, but a quiet UX collapse at the seam between automated and human handling. It happens because engineers built the AI path carefully and treated human takeover as an afterthought, a fallback for when things go wrong. The result is that override feels like a system error rather than a designed operational mode.

The engineering teams that get this right treat human override as a first-class feature from day one. Here's what that looks like in practice.

Your Load Tests Are Lying: LLM Provider Capacity Contention in Production

· 11 min read
Tian Pan
Software Engineer

You ran a load test. Your p95 latency was 450ms. You felt good about it, shipped the feature, and then your on-call rotation lit up two weeks later because users were seeing 25-second response times at 9 AM on a Tuesday.

Nothing changed in your code. No deployment, no config change. The provider's status page said "operational." And yet your app was unusable for 20 minutes during peak business hours.

This is the LLM capacity contention problem, and it's one of the most common failure modes engineers don't see coming until they've already been burned.

The SLA Illusion: Why 99.9% Uptime Means Nothing for AI-Powered Features

· 9 min read
Tian Pan
Software Engineer

Your dashboards are green. Latency is nominal. Error rate is 0.2%. Uptime is 99.97% for the month. And your AI assistant is confidently telling users the wrong thing, in the wrong format, at twice the expected length — and has been doing so for eleven days.

This is the SLA illusion: the infrastructure contract that covers the pipe, not the water flowing through it. For AI-powered features, the gap between "is it responding?" and "is it responding well?" is the gap where product quality quietly dies.

The Automation Cliff Edge: When Partial AI Automation Is Worse Than None

· 11 min read
Tian Pan
Software Engineer

The first time a team automates 70% of a manual process and ships worse outcomes than before, the diagnosis almost always starts in the wrong place. Engineers look at the automated portion: maybe the model accuracy is off, maybe the pipeline has a bug. What they rarely examine is whether the automation itself—by existing—made the remaining 30% of human work structurally impossible to do well.

This is the automation cliff edge. Not a failure of the automated component, but a failure of the seam between automated and manual.

When AI Sounds Right but Isn't: LLM Confabulation in Technical and Scientific Domains

· 9 min read
Tian Pan
Software Engineer

The insidious thing about LLM confabulation in technical domains isn't that the model produces obviously wrong answers. It's that the model produces beautifully structured, confidently stated, technically plausible answers that are subtly wrong in ways that only domain experts catch — and often only after the damage is done.

A Monte Carlo physics simulation that initializes correctly but resamples particle positions from scratch at each step rather than making incremental updates. A chemical formula that follows the right naming conventions but has an incorrect oxidation state. An engineering specification that cites the right standard, references the right units, and has exactly the wrong load coefficient. Each output looks right. Each sounds authoritative. Each is wrong in ways that won't surface until someone runs the experiment, stress-tests the component, or critically reads the derivation.

Agent Memory Contamination: How One Bad Tool Response Poisons a Whole Session

· 10 min read
Tian Pan
Software Engineer

Your agent completes 80% of a multi-step research task correctly, then confidently delivers a conclusion that's completely wrong. You trace back through the logs and find the culprit at step three: a tool call returned stale data, the agent integrated that data as fact, and every subsequent reasoning step built on that poisoned premise. By the end of the session, the agent was correct about everything except the thing that mattered.

This is agent memory contamination — and it's one of the most insidious reliability failures in production agentic systems. Unlike a crash or timeout, it produces a confident wrong answer. Observability tooling records a successful run. The user walks away with bad information.

Agentic Systems Are Distributed Systems: Apply Microservices Lessons Before You Learn Them the Hard Way

· 12 min read
Tian Pan
Software Engineer

The failure rates for multi-agent AI systems in production are embarrassing. A landmark study analyzing over 1,600 execution traces across seven popular frameworks found failure rates ranging from 41% to 87%. Carnegie Mellon researchers put leading agent systems at 30–35% task completion on multi-step benchmarks. Gartner predicts that more than 40% of agentic AI projects will be cancelled by the end of 2027.

Here is the uncomfortable truth: these aren't AI problems. They're distributed systems problems that engineers already solved between 2010 and 2018, documented exhaustively in blog posts, conference talks, and eventually in Martin Kleppmann's Designing Data-Intensive Applications. The teams that are shipping reliable agent systems today aren't doing anything magical — they're applying circuit breakers, bulkheads, event sourcing, and idempotency keys. The teams that are failing are treating agents as a new paradigm when they're a new deployment target for old patterns.
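
As a rough illustration of the first pattern named above, here is a minimal circuit breaker sketch wrapped around an arbitrary call, such as a tool or model invocation. It is a simplified, single-threaded version written for this example; the class name, the thresholds, and the injectable `clock` parameter are assumptions, not something the post specifies.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    """Minimal illustrative breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock              # injectable so tests can fake time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping each downstream dependency in its own breaker instance keeps one failing tool from consuming an agent's whole step budget, which is the same reasoning that applies between microservices.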

AI Model APIs Are Software Dependencies You Can't See, Pin, or Track

· 9 min read
Tian Pan
Software Engineer

When OpenAI silently pulled a GPT-4o update in April 2025 after engineers discovered the model had become wildly sycophantic — validating bad ideas, agreeing with factually wrong claims, and generally becoming useless for any task requiring honest feedback — most affected teams found out through Reddit and Hacker News. Their package.json showed nothing changed. Their lockfile was identical. Their deployment pipeline flagged zero dependency updates. From every standard software-supply-chain perspective, nothing happened.

That's the dependency you can't see: the foundation model behind your application.

Building Trust Recovery Flows: What Happens After Your AI Makes a Visible Mistake

· 9 min read
Tian Pan
Software Engineer

When Google's AI Overview told users to add glue to pizza sauce and eat rocks for digestive health, it didn't just embarrass a product team — it exposed a systemic gap in how we think about AI reliability. The failure wasn't just that the model was wrong. The failure was that the model was confidently wrong, in a high-visibility context, with no recovery path for the users it misled.

Trust in AI systems doesn't erode gradually. Research shows it follows a cliff-like collapse pattern: a single noticeable error can produce a disproportionately large drop in trust. Only 29% of developers say they trust AI tools, an 11-point drop from the previous year, even as adoption climbs to 84%. We're building systems that people use but don't trust. That gap matters when your product ships agentic features that act on behalf of users.

This post is about what engineers and product builders should do after the mistake happens — not just how to prevent it.

The Compound Hallucination Problem: How Multi-Stage AI Pipelines Amplify Errors

· 10 min read
Tian Pan
Software Engineer

Most hallucination research focuses on what comes out of a single model call. That framing misses the scarier problem: what happens in a four-stage pipeline where each stage unconditionally trusts the previous output. A single hallucinated fact in Stage 1 doesn't just persist—it becomes the load-bearing premise for every subsequent inference. By Stage 4, the pipeline delivers a confident, internally coherent answer that happens to be entirely wrong.

This isn't a capability problem that better models will solve. It's a systems architecture problem, and it requires a systems-level fix.
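
A back-of-the-envelope calculation shows why stacking stages hurts. The sketch below assumes each stage is correct independently with a fixed probability and that nothing downstream catches an upstream error; both are simplifying assumptions, and the 95% figure is illustrative rather than a measurement from the post.

```python
def end_to_end_accuracy(stage_accuracies: list[float]) -> float:
    """Probability that every stage is correct, assuming independent stages."""
    result = 1.0
    for acc in stage_accuracies:
        if not 0.0 <= acc <= 1.0:
            raise ValueError("each accuracy must be between 0 and 1")
        result *= acc
    return result


# Four stages at an assumed 95% each leave roughly 81% end to end.
four_stage = end_to_end_accuracy([0.95] * 4)
print(f"{four_stage:.3f}")  # prints 0.815
```

Under these assumptions, improving any single stage helps less than adding a check between stages, because an unchecked pipeline multiplies every stage's error rate into the final answer.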