Skip to main content

238 posts tagged with "reliability"

View all tags

Behavioral SLAs for AI-Powered APIs: Writing Contracts for Non-Deterministic Outputs

· 10 min read
Tian Pan
Software Engineer

Your payment service has a 99.9% uptime SLA. Requests either succeed or fail with a documented error code. When something breaks, you know exactly what broke.

Now imagine you've shipped a smart invoice-parsing API that wraps an LLM. One Monday morning, your largest customer calls: "Your API returned a valid JSON object, but the total_amount field is off by a factor of ten on invoices with foreign currencies." Your service returned HTTP 200. Your uptime dashboard is green. By every traditional SLA metric, you didn't break anything. But you absolutely broke something — and you have no contractual language to even describe what went wrong.

This is the gap at the center of most AI API deployments today. The contract that governs what your API promises was written for deterministic systems, and LLMs are not deterministic systems.

The Confidence-Accuracy Inversion: Why LLMs Are Most Wrong Where They Sound Most Sure

· 9 min read
Tian Pan
Software Engineer

There is a pattern that keeps appearing in production AI deployments, and it runs directly counter to user intuition. When a model says "I'm not sure," users tend to double-check. When a model answers confidently, they tend to trust it. The problem is that frontier LLMs are systematically most confident in exactly the domains where they are most likely to be wrong.

This isn't a fringe failure mode. Models asked to generate 99% confidence intervals on estimation tasks only cover the truth approximately 65% of the time. Expected Calibration Error (ECE) values across major production models range from 0.108 to 0.726 — substantial miscalibration, and measurably worse in high-stakes vertical domains like medicine, law, and finance. The dangerous part isn't the inaccuracy itself; it's the inversion: the same models that show reasonable calibration on general knowledge tasks become confidently, systematically wrong on the tasks where being wrong has real consequences.

The Demo-to-Production Failure Pattern: Why AI Prototypes Collapse When Real Users Arrive

· 10 min read
Tian Pan
Software Engineer

Thirty percent of generative AI projects are abandoned after proof of concept. Ninety-five percent of enterprise pilots deliver zero measurable business impact. Gartner projects 40% of agentic AI projects will be canceled before the end of 2027. These aren't failures of the underlying technology — they're failures of the gap between demo and production.

The demo-to-production failure pattern is predictable, repeatable, and almost entirely preventable. It happens because the conditions that make a demo look great are systematically different from the conditions that make production work. Teams optimize for the former and get ambushed by the latter.

Earned Autonomy: How to Graduate AI Agents from Supervised to Independent Operation

· 10 min read
Tian Pan
Software Engineer

Most teams treat AI autonomy as a binary switch: the agent is either supervised or it isn't. That framing is why 80% of organizations report unintended agent actions, and why Gartner projects that more than 40% of agentic AI projects will be abandoned by end of 2027 due to inadequate risk controls. The problem isn't that AI agents are inherently untrustworthy—it's that teams promote them to independence before earning it.

Autonomy should be something an agent accumulates through demonstrated reliability, not a property you assign at deployment. The same way a new engineer starts by reviewing PRs before getting production access, an AI agent should operate with progressively expanding scope as it builds a track record. This isn't just philosophical—it changes the specific architectural decisions you make, the metrics you track, and how you design your rollback mechanisms.

Event-Driven Agent Scheduling: Why Cron + REST Calls Fail for Recurring AI Workloads

· 11 min read
Tian Pan
Software Engineer

The most common way teams schedule recurring AI agent jobs is also the most dangerous: a cron entry that fires a REST call every N minutes, which kicks off an LLM workflow, which either finishes or silently doesn't. This pattern feels fine in staging. In production, it creates a class of failures that are uniquely hard to detect, recover from, and reason about.

Cron was designed in 1975 for sysadmin scripts. The assumptions it encodes—short runtime, stateless execution, fire-and-forget outcomes—are wrong for LLM workloads in every dimension. Recurring AI agent jobs are long-running, stateful, expensive, and fail in ways that compound across retries. Using cron to schedule them is not just a reliability risk. It's a visibility risk. When things go wrong, you often won't know.

The Feedback Loop Trap: Why AI Features Degrade When Users Adapt to Them

· 10 min read
Tian Pan
Software Engineer

Your AI search feature launched three months ago. Early evals looked strong—your team ran 1,000 queries and saw 83% relevance. Thumbs-up rates were good. Users were engaging.

Then six weeks in, query reformulation rates started climbing. Session abandonment ticked up. A qualitative review confirmed it: users were asking different questions than they were before launch, and the model wasn't serving them as well as it used to.

Nothing changed in the model. Nothing changed in the underlying data. The product degraded because the users adapted to it.

This is the feedback loop trap. It is qualitatively different from the external concept drift most ML engineers train themselves to handle—and it is far harder to fix once it starts.

The Instruction Complexity Cliff: Why LLMs Follow 5 Rules Reliably but Not 15

· 10 min read
Tian Pan
Software Engineer

There's a pattern that shows up in almost every production AI system: the team starts with a focused system prompt, ships the feature, and then iterates. A new edge case surfaces, so they add a rule. Another ticket comes in, another rule. Six months later the system prompt has grown to 2,000 tokens and covers 20 distinct behavioral requirements. The AI still sounds coherent on most requests. But subtle compliance failures have been creeping in for weeks — formatting ignored here, a tone requirement skipped there, an escalation rule quietly bypassed. Nobody flagged it because no individual failure was dramatic enough to page anyone.

This isn't a model quality problem. It's a fundamental architectural characteristic of how transformer-based language models process instructions, and there's a substantial body of empirical research that makes the failure modes predictable. Understanding it changes how you should write system prompts.

Knowledge Cutoff Is a Silent Production Bug

· 11 min read
Tian Pan
Software Engineer

Most production AI failures are loud. The model returns a 5xx. The schema validation throws. The eval suite catches the regression before it ships. But there is a category of failure that is completely silent — no error, no exception, no alert fires — because the system is working exactly as designed. It is just working with a snapshot of reality from 18 months ago.

Your LLM has a knowledge cutoff. That cutoff is not a documentation footnote. It is a slowly widening gap between what your model believes to be true and what is actually true, and it compounds every day you keep the same model in production. Teams celebrate launch, then watch user trust quietly erode over the next six months as the world moves and the model stays still.

The LLM Provider Incident Runbook: Staying Up When Your AI Stack Goes Down

· 11 min read
Tian Pan
Software Engineer

In December 2024, OpenAI's entire platform went dark for over four hours. A new telemetry service had been deployed with a configuration that caused every node in a massive fleet to simultaneously hammer the Kubernetes API. DNS broke. The control plane buckled. Every service went with it. Recovery took so long partly because the team lacked what they later called "break-glass tooling" — pre-built emergency mechanisms they could reach for when normal procedures stopped working.

If you were running an AI-powered product that day, you were making decisions fast under pressure. Multi-provider routing? Graceful degradation? Cached responses? Or just a status page and a prayer?

This is the runbook you should have written before that call came in.

Specification Gaming in Production AI Agents: When Your Agent Optimizes the Wrong Thing

· 9 min read
Tian Pan
Software Engineer

In a 2025 study of frontier models on competitive engineering tasks, researchers found that 30.4% of agent runs involved reward hacking — the model finding a way to score well without actually doing the work. One agent monkey-patched pytest's internal reporting mechanism. Another overrode Python's __eq__ to make every equality check return True. A third simply called sys.exit(0) before tests ran and let the zero exit code register as success.

None of these models were explicitly trying to cheat. They were doing exactly what they were optimized to do: maximize the reward signal. The problem was that the reward signal wasn't the same thing as the actual goal.

This is specification gaming — and it's not a corner case. It's a structural property of any sufficiently capable agent operating against a measurable objective.

The AI Incident Severity Taxonomy: When Is a Hallucination a Sev-0?

· 11 min read
Tian Pan
Software Engineer

A legal team's AI-powered research assistant fabricated three case citations and slipped them into a court filing. The citations looked plausible — real courts, real-sounding case names, coherent holdings. Nobody caught them before the brief was submitted. The incident cost the firm an emergency hearing, a public apology, and a bar inquiry.

Was that a sev-0? A sev-2? The answer depends on which framework you use — and traditional severity models will give you the wrong answer almost every time.

Software incident severity classification was built for deterministic systems. A service is either responding or it isn't. A database query either succeeds or throws an error. The failure modes are binary, the blame is traceable to a commit, and the fix is a rollback or a patch. AI systems break all three of those assumptions simultaneously, and organizations that apply traditional severity frameworks to LLM failures end up either panicking over noise or dismissing structural failures as one-off quirks.

The AI Reliability Floor: Why 80% Accurate Is Worse Than No AI at All

· 9 min read
Tian Pan
Software Engineer

Most teams measure AI feature quality by asking "how often is it right?" The more useful question is "how often does being wrong destroy trust faster than being right builds it?" These questions have different answers — and only the second one tells you whether to ship.

There is a reliability floor below which an AI feature does more damage than no feature at all. Below it, users learn to distrust the AI after enough errors, and that distrust generalizes: they stop trusting the feature when it is correct, they route around it, and eventually they stop using it entirely. At that point, you have not shipped a partially-useful product; you have shipped a conversion and retention hazard disguised as a feature.