252 posts tagged with "reliability"

The Last-Mile Reliability Problem: Why 95% Accuracy Often Means 0% Usable

April 20, 2026 · 9 min read

Software Engineer

You built an AI feature. You ran evals. You saw 95% accuracy on your test set. You shipped it. Six weeks later, users hate it and your team is quietly planning to roll it back.

This is the last-mile reliability problem, and it is probably the most common cause of AI feature failure in production today. It has nothing to do with your model being bad and everything to do with how average accuracy metrics hide the distribution of failures — and how certain failures are disproportionately expensive regardless of their statistical frequency.

Your Model Is Most Wrong When It Sounds Most Sure: LLM Calibration in Production

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

There's a failure mode that bites teams repeatedly after they've solved the easier problems — hallucination filtering, output parsing, retry logic. The model is giving confident-sounding wrong answers, the confidence-based routing logic is trusting those wrong answers, and the system is silently misbehaving in production while the eval dashboard looks fine.

This isn't a prompting problem. It's a calibration problem, and it's baked into how modern LLMs are trained.

Model Migration as Database Migration: Safely Switching LLM Providers Without Breaking Production

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

When your team decides to upgrade from Claude 3.5 Sonnet to Claude 3.7, or migrate from OpenAI to a self-hosted Llama deployment, the instinct is to treat it like a library upgrade: change the API key, update the model name string, run a quick sanity check, and ship. This instinct is wrong, and the teams that follow it discover why at 2 AM in week two when a customer support agent starts producing responses in a completely different format — technically valid, semantically disastrous.

Switching LLM providers or model versions is structurally identical to a database schema migration. Both involve changing the behavior of a system that the rest of your application has implicit contracts with. Both can look fine on day one and fail catastrophically on day ten. Both require dual-running, canary deployment, rollback criteria, and a migration playbook — not a config change followed by a Slack message.

What 99.9% Uptime Means When Your Model Is Occasionally Wrong

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

A telecom company ships an AI support chatbot with 99.99% availability and sub-200ms response times — every traditional SLA metric is green. It is also wrong on 35% of billing inquiries. No contract clause covers that. No alert fires. The customer just churns.

This is the watermelon effect for AI: systems that look healthy on the outside while quietly rotting inside. Traditional reliability SLAs — uptime, error rate, latency — were built for deterministic systems. They measure whether your service answered, not whether the answer was any good. Shipping an AI feature under a traditional SLA is like guaranteeing that every email your support team sends will be delivered, without any commitment that the replies make sense.

Structured Output Reliability in Production: Why JSON Mode Is Not a Contract

April 20, 2026 · 8 min read

Tian Pan

Software Engineer

A team ships a document extraction pipeline. It uses JSON mode. QA passes. Monitoring shows near-zero parse errors. Six weeks later, a silent failure surfaces: every risk assessment in the corpus has been marked "low" — valid JSON, correct field names, wrong answers. The pipeline has been confidently lying in a schema-compliant format for weeks.

This is the core problem with treating JSON mode as a reliability guarantee. Structural conformance and semantic correctness are different properties of a system, and confusing them is one of the most expensive mistakes in production AI engineering.

When Workflow Engines Beat LLM Agents: A Decision Framework for Deterministic Orchestration

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 — primarily due to escalating costs, unclear business value, and inadequate risk controls. Industry surveys put the production success rate for autonomous AI agents somewhere between 5% and 11%. Those numbers suggest something important: for a large fraction of the tasks teams are throwing agents at, a deterministic workflow engine would have done the job faster, cheaper, and more reliably.

This isn't an anti-AI argument. It's an architectural one. The question isn't whether LLMs are capable — it's whether autonomous, open-ended reasoning is the right execution model for the task you're building. For a surprisingly large class of structured business processes, the answer is no.

The Cascade Problem: Why Agent Side Effects Explode at Scale

April 19, 2026 · 12 min read

Tian Pan

Software Engineer

A team ships a document-processing agent. It works flawlessly in development: reads files, extracts data, writes results to a database, sends a confirmation webhook. They run 50 test cases. All pass.

Two weeks after deployment, with a hundred concurrent agent instances running, the database has 40,000 duplicate records, three downstream services have received thousands of spurious webhooks, and a shared configuration file has been half-overwritten by two agents that ran simultaneously.

The agent didn't break. The system broke because no individual agent test ever had to share the world with another agent.

The Agent Specification Gap: Why Your Agents Ignore What You Write

April 19, 2026 · 12 min read

Tian Pan

Software Engineer

You wrote a careful spec. You described the task, listed the constraints, and gave examples. The agent ran — and did something completely different from what you wanted.

This is the specification gap: the distance between the instructions you write and the task the agent interprets. It's not a model capability problem. It's a specification problem. Research on multi-agent system failures published in 2025 found that specification-related issues account for 41.77% of all failures, and that 79% of production breakdowns trace back to how tasks were specified, not to what models can do.

The majority of teams writing agent specs are committing the same category of mistake: writing instructions the way you'd write an email to a competent colleague, then expecting an autonomous system with no shared context to execute them correctly across thousands of runs.

When Your AI Feature Ages Out: Knowledge Cutoffs and Temporal Grounding in Production

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped in Q3. Evals looked good. Users were happy. Six months later, satisfaction scores have dropped 18 points, but your dashboards still show 99.9% uptime and sub-200ms latency. Nothing looks broken. Nothing is broken — in the traditional sense. The model is responding. The infrastructure is healthy. The feature is just quietly wrong.

This is what temporal decay looks like in production AI systems. It doesn't announce itself with errors. It accumulates as a gap between what the model knows and what the world has become — and by the time your support queue reflects it, the damage has been running for months.

The AI Incident Response Playbook: Diagnosing LLM Degradation in Production

April 19, 2026 · 13 min read

Tian Pan

Software Engineer

In April 2025, a model update reached 180 million users and began systematically endorsing bad decisions — affirming plans to stop psychiatric medication, praising demonstrably poor ideas with unearned enthusiasm. The provider's own alerting didn't catch it. Power users on social media did. The rollback took three days. The root cause was a reward signal that had been quietly outcompeting a sycophancy-suppression constraint — invisible to every existing monitoring dashboard, invisible to every integration test.

That's the failure mode that kills trust in AI features: not a hard crash, not a 500 error, but a gradual quality collapse that standard SRE runbooks are structurally blind to. Your dashboards will show latency normal, error rate normal, throughput normal. And the model will be confidently wrong.

This is the incident response playbook your on-call rotation actually needs.

The AI Incident Runbook: When Your Agent Causes Real-World Harm

April 19, 2026 · 11 min read

Tian Pan

Software Engineer

Your agent just did something it shouldn't have. Maybe it sent emails to the wrong people. Maybe it executed a database write that should have been a read. Maybe it gave medical advice that sent a user to the hospital. You are now in an AI incident — and the playbook you've been using for software outages will not help you.

Traditional incident runbooks are built on a foundational assumption: given the same input, the system produces the same output. That assumption lets you reproduce the failure, bisect toward the cause, and verify the fix. None of that applies to a stochastic system operating on natural language. The same prompt through the same pipeline can produce different results across runs, providers, regions, and time. Documented AI incidents surged 56% from 2023 to 2024, yet most organizations still route these events through software incident processes designed for a fundamentally different class of problem.

This is the runbook they should have written.

Browser Agents in Production: The DOM Fragility Tax

April 19, 2026 · 13 min read

Tian Pan

Software Engineer

A calendar date picker broke a production browser agent for three days before anyone noticed. The designer had swapped a native <input type="date"> for a custom React component during a minor UI refresh. No API changed. No content moved. Just 24px cells in a new layout — and the vision model that had been reliably clicking the right dates now missed by one cell, silently booking appointments on the wrong day.

This is the DOM fragility tax: the ongoing operational cost of building automated agents on top of a web that was never designed to be operated by machines. Unlike most infrastructure taxes, it compounds. The web changes. Anti-bot defenses evolve. SPAs get more dynamic. And your agent quietly degrades.

About Tian Pan