Skip to main content

238 posts tagged with "reliability"

View all tags

Model Migration as Database Migration: Safely Switching LLM Providers Without Breaking Production

· 10 min read
Tian Pan
Software Engineer

When your team decides to upgrade from Claude 3.5 Sonnet to Claude 3.7, or migrate from OpenAI to a self-hosted Llama deployment, the instinct is to treat it like a library upgrade: change the API key, update the model name string, run a quick sanity check, and ship. This instinct is wrong, and the teams that follow it discover why at 2 AM in week two when a customer support agent starts producing responses in a completely different format — technically valid, semantically disastrous.

Switching LLM providers or model versions is structurally identical to a database schema migration. Both involve changing the behavior of a system that the rest of your application has implicit contracts with. Both can look fine on day one and fail catastrophically on day ten. Both require dual-running, canary deployment, rollback criteria, and a migration playbook — not a config change followed by a Slack message.

What 99.9% Uptime Means When Your Model Is Occasionally Wrong

· 10 min read
Tian Pan
Software Engineer

A telecom company ships an AI support chatbot with 99.99% availability and sub-200ms response times — every traditional SLA metric is green. It is also wrong on 35% of billing inquiries. No contract clause covers that. No alert fires. The customer just churns.

This is the watermelon effect for AI: systems that look healthy on the outside while quietly rotting inside. Traditional reliability SLAs — uptime, error rate, latency — were built for deterministic systems. They measure whether your service answered, not whether the answer was any good. Shipping an AI feature under a traditional SLA is like guaranteeing that every email your support team sends will be delivered, without any commitment that the replies make sense.

Structured Output Reliability in Production: Why JSON Mode Is Not a Contract

· 8 min read
Tian Pan
Software Engineer

A team ships a document extraction pipeline. It uses JSON mode. QA passes. Monitoring shows near-zero parse errors. Six weeks later, a silent failure surfaces: every risk assessment in the corpus has been marked "low" — valid JSON, correct field names, wrong answers. The pipeline has been confidently lying in a schema-compliant format for weeks.

This is the core problem with treating JSON mode as a reliability guarantee. Structural conformance and semantic correctness are different properties of a system, and confusing them is one of the most expensive mistakes in production AI engineering.

When Workflow Engines Beat LLM Agents: A Decision Framework for Deterministic Orchestration

· 9 min read
Tian Pan
Software Engineer

Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027 — primarily due to escalating costs, unclear business value, and inadequate risk controls. Industry surveys put the production success rate for autonomous AI agents somewhere between 5% and 11%. Those numbers suggest something important: for a large fraction of the tasks teams are throwing agents at, a deterministic workflow engine would have done the job faster, cheaper, and more reliably.

This isn't an anti-AI argument. It's an architectural one. The question isn't whether LLMs are capable — it's whether autonomous, open-ended reasoning is the right execution model for the task you're building. For a surprisingly large class of structured business processes, the answer is no.

The Cascade Problem: Why Agent Side Effects Explode at Scale

· 12 min read
Tian Pan
Software Engineer

A team ships a document-processing agent. It works flawlessly in development: reads files, extracts data, writes results to a database, sends a confirmation webhook. They run 50 test cases. All pass.

Two weeks after deployment, with a hundred concurrent agent instances running, the database has 40,000 duplicate records, three downstream services have received thousands of spurious webhooks, and a shared configuration file has been half-overwritten by two agents that ran simultaneously.

The agent didn't break. The system broke because no individual agent test ever had to share the world with another agent.

The Agent Specification Gap: Why Your Agents Ignore What You Write

· 12 min read
Tian Pan
Software Engineer

You wrote a careful spec. You described the task, listed the constraints, and gave examples. The agent ran — and did something completely different from what you wanted.

This is the specification gap: the distance between the instructions you write and the task the agent interprets. It's not a model capability problem. It's a specification problem. Research on multi-agent system failures published in 2025 found that specification-related issues account for 41.77% of all failures, and that 79% of production breakdowns trace back to how tasks were specified, not to what models can do.

The majority of teams writing agent specs are committing the same category of mistake: writing instructions the way you'd write an email to a competent colleague, then expecting an autonomous system with no shared context to execute them correctly across thousands of runs.

When Your AI Feature Ages Out: Knowledge Cutoffs and Temporal Grounding in Production

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped in Q3. Evals looked good. Users were happy. Six months later, satisfaction scores have dropped 18 points, but your dashboards still show 99.9% uptime and sub-200ms latency. Nothing looks broken. Nothing is broken — in the traditional sense. The model is responding. The infrastructure is healthy. The feature is just quietly wrong.

This is what temporal decay looks like in production AI systems. It doesn't announce itself with errors. It accumulates as a gap between what the model knows and what the world has become — and by the time your support queue reflects it, the damage has been running for months.

The AI Incident Response Playbook: Diagnosing LLM Degradation in Production

· 13 min read
Tian Pan
Software Engineer

In April 2025, a model update reached 180 million users and began systematically endorsing bad decisions — affirming plans to stop psychiatric medication, praising demonstrably poor ideas with unearned enthusiasm. The provider's own alerting didn't catch it. Power users on social media did. The rollback took three days. The root cause was a reward signal that had been quietly outcompeting a sycophancy-suppression constraint — invisible to every existing monitoring dashboard, invisible to every integration test.

That's the failure mode that kills trust in AI features: not a hard crash, not a 500 error, but a gradual quality collapse that standard SRE runbooks are structurally blind to. Your dashboards will show latency normal, error rate normal, throughput normal. And the model will be confidently wrong.

This is the incident response playbook your on-call rotation actually needs.

The AI Incident Runbook: When Your Agent Causes Real-World Harm

· 11 min read
Tian Pan
Software Engineer

Your agent just did something it shouldn't have. Maybe it sent emails to the wrong people. Maybe it executed a database write that should have been a read. Maybe it gave medical advice that sent a user to the hospital. You are now in an AI incident — and the playbook you've been using for software outages will not help you.

Traditional incident runbooks are built on a foundational assumption: given the same input, the system produces the same output. That assumption lets you reproduce the failure, bisect toward the cause, and verify the fix. None of that applies to a stochastic system operating on natural language. The same prompt through the same pipeline can produce different results across runs, providers, regions, and time. Documented AI incidents surged 56% from 2023 to 2024, yet most organizations still route these events through software incident processes designed for a fundamentally different class of problem.

This is the runbook they should have written.

Browser Agents in Production: The DOM Fragility Tax

· 13 min read
Tian Pan
Software Engineer

A calendar date picker broke a production browser agent for three days before anyone noticed. The designer had swapped a native <input type="date"> for a custom React component during a minor UI refresh. No API changed. No content moved. Just 24px cells in a new layout — and the vision model that had been reliably clicking the right dates now missed by one cell, silently booking appointments on the wrong day.

This is the DOM fragility tax: the ongoing operational cost of building automated agents on top of a web that was never designed to be operated by machines. Unlike most infrastructure taxes, it compounds. The web changes. Anti-bot defenses evolve. SPAs get more dynamic. And your agent quietly degrades.

Compaction Traps: Why Long-Running Agents Forget What They Already Tried

· 9 min read
Tian Pan
Software Engineer

An agent calls a file-writing tool. The tool fails with a permission error. The agent records this, moves on to a different approach, and eventually runs long enough that the runtime triggers context compaction. The summary reads: "the agent has been working on writing output files." What it drops: that the permission error ever happened, and why the original approach was abandoned. Three hundred tokens later, the agent tries the same write again.

This pattern — call it the compaction trap — is one of the most persistent reliability failures in production agent systems. It's not a model bug. It's an architecture mismatch between how compaction works and what agents actually need to stay coherent across long sessions.

The Data Quality Tax in LLM Systems: Why Bad Input Hits Differently

· 9 min read
Tian Pan
Software Engineer

Your gradient boosting model degrades politely when data gets noisy. Accuracy drops, precision drops, a monitoring alert fires, and the on-call engineer knows exactly where to look. LLMs don't do that. Feed an LLM degraded, stale, or malformed input and it produces fluent, confident, authoritative-sounding output that is partially or entirely wrong — and the downstream system consuming it has no way to tell the difference.

This is the data quality tax: the compounding cost you pay when bad data enters an LLM pipeline, expressed not as lower confidence scores but as hallucinations dressed in the syntax of facts.