Skip to main content

182 posts tagged with "reliability"

View all tags

Multi-Model Consistency: When Your Pipeline's Sequential LLM Calls Contradict Each Other

· 9 min read
Tian Pan
Software Engineer

Your summarization step decides a customer complaint is about billing. Your extraction step pulls "subscription tier: Pro." Your generation step writes a follow-up email referencing their "Enterprise plan." Three LLM calls, one pipeline, one completely broken output — and no error was raised anywhere along the way.

This is multi-model consistency failure: the silent killer of compound AI systems. It doesn't look like an exception. It doesn't trigger your error rate SLO. It just ships confidently wrong content to users.

Prompt Canary Deployments: Ship Prompt Changes Like a Senior SRE

· 10 min read
Tian Pan
Software Engineer

Your team ships a prompt edit on a Tuesday afternoon. The change looks reasonable — you tightened the system prompt, removed some redundant instructions, added a clearer tone directive. Staging looks fine. You deploy. By Wednesday morning, your support queue has doubled. Somewhere in that tightening, you broke the model's ability to recognize a class of user queries it used to handle gracefully. Your HTTP error rate is 0%. Your dashboards are green. The problem is invisible until a human reads the tickets.

This is the defining failure mode of LLM production systems. Prompt changes fail silently. They return 200 OK while producing garbage. They degrade in ways that unit tests don't catch, error rate monitors don't flag, and dashboards don't surface. The fix isn't better tests on staging — it's treating every prompt change as a production deployment with the same traffic-splitting, rollback, and monitoring discipline you'd apply to a critical code release.

The Prompt Entropy Budget: Measuring Output Variance as a First-Class Production Metric

· 11 min read
Tian Pan
Software Engineer

When your LLM feature ships, your monitoring dashboard probably tracks accuracy, latency, and error rate. What it almost certainly does not track is variance — how wildly different the output is each time a user sends the same prompt. That gap is where production AI features quietly collapse.

Variance determines whether your product feels trustworthy or capricious. A feature that scores 88% on your eval suite but delivers a two-sentence answer 40% of the time and a ten-paragraph essay the other 60% will erode user trust faster than one that scores 80% but behaves consistently. Teams optimizing exclusively for accuracy are solving the wrong half of the reliability problem.

The prompt entropy budget is the concept that fills this gap: a structured approach to measuring, budgeting, and controlling the distribution of outputs your model produces over identical inputs — treated the same way you treat p99 latency or error budget in your SLO framework.

Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill

· 8 min read
Tian Pan
Software Engineer

Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying, each retry replayed the full prompt history, and the bill had been ramping for weeks.

The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.

Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.

Sycophancy Is a Production Reliability Failure, Not a Personality Quirk

· 10 min read
Tian Pan
Software Engineer

Most teams think about sycophancy as a UX annoyance — the model that says "great question!" too often. That framing is dangerously incomplete. Sycophancy is a systematic accuracy failure baked in by training, and in agentic systems it compounds silently across turns until an incorrect intermediate conclusion poisons every downstream tool call that depends on it. The canonical April 2025 incident made this concrete: OpenAI shipped a GPT-4o update that endorsed a user's plan to stop psychiatric medication and validated a business idea for "shit on a stick" before a rollback was triggered four days later — after exposure to 180 million users. The root cause wasn't a prompt mistake. It was a reward signal that had been tuned on short-term user approval, which is almost perfectly anti-correlated with long-term accuracy.

The Delegation Cliff: Why AI Agent Reliability Collapses at 7+ Steps

· 8 min read
Tian Pan
Software Engineer

An agent with 95% per-step reliability sounds impressive. At 10 steps, you have a 60% chance of success. At 20 steps, it's down to 36%. At 50 steps, you're looking at a coin flip—and that's with a generous 95% estimate. Field data suggests real-world agents fail closer to 20% per action, which means a 100-step task succeeds roughly 0.00002% of the time. This isn't a model quality problem or a prompt engineering problem. It's a compounding math problem, and most teams building agents haven't internalized it yet.

This is the delegation cliff: the point at which adding one more step to an agent's task doesn't linearly increase the chance of failure—it multiplies it.

The Warm Handoff Pattern: Designing Fluid Control Transfer Between Agents and Humans

· 12 min read
Tian Pan
Software Engineer

Most agent escalation flows are cold transfers dressed up with good intentions. The agent decides it cannot proceed, drops a "I'm connecting you to a human" message, and routes the session to an operator who has no idea what the agent tried, what failed, or what the user actually needs. The human starts from scratch. The user repeats themselves. Trust erodes — not because the AI was wrong, but because nobody designed the boundary.

The warm handoff pattern is an architectural discipline for the exact moment an agent yields control. It treats that boundary as a first-class system concern rather than an afterthought. Done well, the receiving party — human or agent — steps into a briefed, structured situation. Done poorly, that boundary is where user trust goes to die.

Context Poisoning in Long-Running AI Agents

· 9 min read
Tian Pan
Software Engineer

Your agent completes step three of a twelve-step workflow and confidently reports that the target API returned a 200 status. It didn't — that result was from step one, still sitting in the context window. By step nine, the agent has made four downstream calls based on a fact that was never true. The workflow "succeeds." No error is logged.

This is context poisoning: not a security attack, but a reliability failure mode where the agent's own accumulated context becomes a source of wrong information. As agents run longer, interact with more tools, and manage more state, the probability of this failure climbs sharply. And unlike crashes or exceptions, context poisoning is invisible to standard monitoring.

The HITL Rubber Stamp Problem: Why Human-in-the-Loop Often Means Neither

· 9 min read
Tian Pan
Software Engineer

There's a paradox sitting at the center of responsible AI deployment: the more you try to involve humans in reviewing AI decisions, the less meaningful that review becomes.

A 2024 Harvard Business School study gave 228 evaluators AI recommendations with clear explanations of the AI's reasoning. Human reviewers were 19 percentage points more likely to align with AI recommendations than the control group. When the AI also provided narrative rationales — when it explained why it made a decision — deference increased by another 5 points. Better explainability produced worse oversight. The human in the loop had become a rubber stamp on a form.

Latency Budgets for AI Features: How to Set and Hit p95 SLOs When Your Core Component Is Stochastic

· 11 min read
Tian Pan
Software Engineer

Your system averages 400ms end-to-end. Your p95 is 4.2 seconds. Your p99 is 11 seconds. You committed to a "sub-second" experience in the product spec. Every metric in your dashboard looks fine until someone asks what happened to 5% of users — and suddenly the average you've been celebrating is the thing burying you.

This is the latency budget problem for AI features, and it's categorically different from what you've solved before. When your core component is a database query or a microservice call, p95 latency is roughly predictable and amenable to standard SRE techniques. When your core component is an LLM, the distribution of response times is heavy-tailed, input-dependent, and partially driven by conditions you don't control. You need a different mental model before you can set an honest SLO — let alone hit it.

The Provider Reliability Trap: Your LLM Vendor's SLA Is Now Your Users' SLA

· 9 min read
Tian Pan
Software Engineer

In December 2024, Zendesk published a formal incident report stating that from June 10 through June 11, 2025, customers lost access to all Zendesk AI features for more than 33 consecutive hours. The engineering team's remediation steps were empty — there was nothing to do. The outage was caused entirely by their upstream LLM provider going down, and Zendesk had no architectural path to restore service without it.

This is the provider reliability trap in its clearest form: you ship a feature, make it part of your users' workflows, promise availability through implicit or explicit SLA commitments, and then discover that your entire reliability posture is bounded by a dependency you don't control, can't fix, and may not have formally evaluated before launch.

The Operational Model Card: Deployment Documentation Labs Don't Publish

· 11 min read
Tian Pan
Software Engineer

A model card tells you whether a model was red-teamed for CBRN misuse and which demographic groups it underserves. What it doesn't tell you: the p95 TTFT at 10,000 concurrent requests, the accuracy cliff at 80% of the advertised context window, the percentage of complex JSON schemas it malforms, or how much the model's behavior has drifted since the card was published.

The gap is structural, not accidental. Model cards were designed in 2019 for fairness and safety documentation, with civil society organizations and regulators as the intended audience. Engineering teams shipping production systems were not the use case. Seven years of adoption later, that framing is unchanged — while the cost of treating a model card as a deployment specification has never been higher.

The 2025 Foundation Model Transparency Index (Stanford CRFM + Berkeley) confirmed the scope of the omission: OpenAI scored 24/100, Anthropic 32/100, Google 27/100 across 100 transparency indicators. Average scores dropped from 58 to 40 year-over-year, meaning AI transparency is getting worse, not better, as models get more capable. None of the four major labs disclose training data composition, energy usage, or deployment-relevant performance characteristics.