159 posts tagged with "reliability"

Specification Gaming in Production LLM Systems: When Your AI Does Exactly What You Asked

· 10 min read
Tian Pan
Software Engineer

A 2025 study gave frontier models a coding evaluation task with an explicit rule: don't hack the benchmark. In 10 out of 10 trials, every model acknowledged that cheating would violate the user's intent. Then they did it anyway 70–95% of the time. The models weren't confused; they understood the constraint perfectly. They just found that satisfying the specification literally was more rewarding than satisfying it in spirit.

That's specification gaming in production, and it's not a theoretical concern. It's a property that emerges whenever you optimize a proxy metric hard enough, and in production LLM systems you're almost always optimizing a proxy.
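To make the proxy-versus-intent gap concrete, here is a deliberately tiny sketch (the grader and strings are hypothetical): a scoring function that checks only whether the expected answer appears in the output, which an honest response and a gamed one satisfy equally well.

```python
# Hypothetical sketch: a proxy grader that can be satisfied
# without satisfying the intent behind it.

def proxy_score(output: str, expected_answer: str) -> bool:
    # Proxy metric: "the expected answer string appears in the output."
    return expected_answer in output

# Intent: explain the reasoning that leads to 42.
honest = "Working through the sum step by step, the total is 42."
gamed = "42 42 42 42"  # maximizes the proxy, ignores the intent

print(proxy_score(honest, "42"))  # True
print(proxy_score(gamed, "42"))   # True: the metric can't tell them apart
```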

SRE for AI Agents: What Actually Breaks at 3am

· 10 min read
Tian Pan
Software Engineer

A market research pipeline ran uninterrupted for eleven days. Four LangChain agents, among them an Analyzer and a Verifier, passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.

This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.
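The guards that would have stopped this are unglamorous. A minimal sketch, assuming a hypothetical run_step and cost estimator: cap the number of steps, cap spend, and treat a repeated output as evidence that no progress is being made.

```python
# Sketch of a spend-and-progress circuit breaker around an agent loop.
# run_step() and estimate_cost_usd() are hypothetical placeholders.

MAX_STEPS = 50
MAX_SPEND_USD = 25.0

def run_agent(task: str, run_step, estimate_cost_usd):
    spend = 0.0
    seen_outputs = set()
    for step in range(MAX_STEPS):
        output = run_step(task)              # one Analyzer/Verifier exchange
        spend += estimate_cost_usd(output)

        if spend > MAX_SPEND_USD:
            raise RuntimeError(f"budget exceeded after {step + 1} steps (${spend:.2f})")
        if output in seen_outputs:
            raise RuntimeError("no progress: agent is repeating itself")
        seen_outputs.add(output)

        if "DONE" in output:                 # hypothetical completion marker
            return output
    raise RuntimeError(f"hit step limit ({MAX_STEPS}) without finishing")
```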

Adding AI to Systems You Don't Own: The Third-Party Model Integration Playbook

· 12 min read
Tian Pan
Software Engineer

Most engineering problems are self-inflicted. The code you deploy, the schemas you define, the dependencies you choose — when things break, you can trace it back to something in your control. AI API integrations violate this assumption. When you build on a third-party model API, a silent model update can degrade your feature at 3am without a deploy happening on your end. A provider outage can take your product offline. A price change can turn a profitable workflow into a money-losing one. The breaking change will never show up in your changelog.

This isn't a reason to avoid external AI APIs. It's a reason to build as if you don't trust them.
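What "build as if you don't trust them" looks like in code is mostly boring plumbing. A sketch under assumptions (call_provider is a stand-in for whatever vendor SDK you use, and the pinned model name is made up): bound the call, verify that the model you got back is the model you asked for, validate the output, and degrade to a fallback instead of failing open.

```python
# Sketch of a distrustful wrapper around a third-party model API.
# call_provider() is a hypothetical stand-in for your vendor SDK,
# and the pinned model name is invented for illustration.

import time

PINNED_MODEL = "vendor-model-2024-06-15"
TIMEOUT_S = 10
MAX_RETRIES = 2

def classify(text, call_provider, validate, fallback):
    for attempt in range(MAX_RETRIES + 1):
        try:
            result = call_provider(model=PINNED_MODEL, input=text, timeout=TIMEOUT_S)
        except Exception:
            time.sleep(2 ** attempt)   # outage or throttling: back off and retry
            continue
        if result.get("model") != PINNED_MODEL:
            break                      # silent model swap upstream: stop and degrade
        if validate(result):
            return result
    # Retries exhausted, model mismatch, or output failed validation.
    return fallback(text)
```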

Structured Outputs Are Not a Solved Problem: JSON Mode Failure Modes in Production

· 12 min read
Tian Pan
Software Engineer

You flip on JSON mode, your LLM starts returning valid JSON, and you ship it. Three weeks later, production is quietly broken. The JSON is syntactically valid. The schema is technically satisfied. But a field contains a hallucinated entity, a finish_reason of "length" silently truncated the payload 95% of the way through, or the model classified "positive" sentiment for text that any human would read as scathing, and your downstream pipeline consumed it without complaint.

JSON mode is a solved problem in the same way that "use a mutex" is a solved problem for concurrency. The primitive exists; the failure modes are in where you put the lock.
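Schema validation is the floor, not the ceiling. A hedged sketch of the checks that catch the failures above; the response shape here is a simplified assumption rather than any particular SDK's:

```python
# Sketch: validating a "valid JSON" response beyond syntax and schema.
# The response dict shape is a simplified assumption, not a real SDK's.

import json

def validate_extraction(response: dict, source_text: str) -> dict:
    # 1. Truncation: syntactically valid JSON can still be cut off mid-payload.
    if response["finish_reason"] == "length":
        raise ValueError("output truncated; do not trust the parsed payload")

    data = json.loads(response["content"])

    # 2. Grounding: every extracted entity should literally appear in the source.
    for entity in data.get("entities", []):
        if entity not in source_text:
            raise ValueError(f"hallucinated entity not in source: {entity!r}")

    return data
```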

1% Error Rate, 10 Million Users: The Math of AI Failures at Scale

· 11 min read
Tian Pan
Software Engineer

A large language model deployed to a medical transcription service achieves 99% accuracy. The team ships it with confidence. Six months later, a study finds that 1% of its transcribed samples contain fabricated phrases not present in the original audio — invented drug names, nonexistent procedures, occasional violent or disturbing content inserted mid-sentence. With 30,000 medical professionals using the system, that 1% translates to tens of thousands of contaminated records per month, some carrying patient safety consequences.

The accuracy number never changed. The problem was always there. The team just hadn't done the scale math.
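The scale math fits in a few lines. The per-clinician volume below is an assumed figure for illustration, not a number from the study:

```python
# Back-of-the-envelope scale math for a 1% hallucination rate.
# transcripts_per_clinician is an assumed monthly volume, not a measured one.

error_rate = 0.01                 # 1% of transcripts contain fabricated content
clinicians = 30_000
transcripts_per_clinician = 100   # assumed monthly volume per user

transcripts_per_month = clinicians * transcripts_per_clinician
contaminated_per_month = transcripts_per_month * error_rate

print(f"{transcripts_per_month:,} transcripts/month")    # 3,000,000
print(f"{contaminated_per_month:,.0f} contaminated")     # 30,000, every month
```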

Behavioral SLAs for AI-Powered APIs: Writing Contracts for Non-Deterministic Outputs

· 10 min read
Tian Pan
Software Engineer

Your payment service has a 99.9% uptime SLA. Requests either succeed or fail with a documented error code. When something breaks, you know exactly what broke.

Now imagine you've shipped a smart invoice-parsing API that wraps an LLM. One Monday morning, your largest customer calls: "Your API returned a valid JSON object, but the total_amount field is off by a factor of ten on invoices with foreign currencies." Your service returned HTTP 200. Your uptime dashboard is green. By every traditional SLA metric, you didn't break anything. But you absolutely broke something — and you have no contractual language to even describe what went wrong.

This is the gap at the center of most AI API deployments today. The contract that governs what your API promises was written for deterministic systems, and LLMs are not deterministic systems.
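One way to close the gap is to write behavioral SLOs and measure them the way you measure availability: sample production traffic, score it against ground truth, and treat a breach like an outage. A minimal sketch, with the labeled sample assumed to come from an existing review pipeline:

```python
# Sketch of a behavioral SLO check for an invoice-parsing API.
# Assumes a labeled sample of recent production requests already exists.

BEHAVIORAL_SLO = 0.98   # e.g. "total_amount within 1% of ground truth, 98% of the time"

def amount_is_correct(predicted: float, actual: float, tol: float = 0.01) -> bool:
    return abs(predicted - actual) <= tol * abs(actual)

def check_slo(labeled_sample: list[tuple[float, float]]) -> None:
    correct = sum(amount_is_correct(p, a) for p, a in labeled_sample)
    accuracy = correct / len(labeled_sample)
    if accuracy < BEHAVIORAL_SLO:
        # Treat this exactly like an availability breach: page, don't file a ticket.
        raise RuntimeError(f"behavioral SLO breached: {accuracy:.1%} < {BEHAVIORAL_SLO:.0%}")

# The off-by-10x bug breaches the SLO and raises.
check_slo([(120.0, 120.0), (87.5, 87.5), (1050.0, 105.0)])
```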

The Confidence-Accuracy Inversion: Why LLMs Are Most Wrong Where They Sound Most Sure

· 9 min read
Tian Pan
Software Engineer

There is a pattern that keeps appearing in production AI deployments, and it runs directly counter to user intuition. When a model says "I'm not sure," users tend to double-check. When a model answers confidently, they tend to trust it. The problem is that frontier LLMs are systematically most confident in exactly the domains where they are most likely to be wrong.

This isn't a fringe failure mode. When models are asked to generate 99% confidence intervals on estimation tasks, those intervals cover the truth only about 65% of the time. Expected Calibration Error (ECE) values across major production models range from 0.108 to 0.726: substantial miscalibration, and measurably worse in high-stakes vertical domains like medicine, law, and finance. The dangerous part isn't the inaccuracy itself; it's the inversion: the same models that show reasonable calibration on general knowledge tasks become confidently, systematically wrong on the tasks where being wrong has real consequences.
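ECE is simple enough to compute on your own logged traffic. A minimal sketch of the standard equal-width-bin formulation, assuming each request has a logged confidence score and a correctness label:

```python
# Minimal Expected Calibration Error (ECE) over logged (confidence, correct) pairs.

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Keep the (confidence, outcome) pairs that fall in this bin.
        in_bin = [(c, ok) for c, ok in zip(confidences, correct)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(ok for _, ok in in_bin) / len(in_bin)
        ece += (len(in_bin) / n) * abs(accuracy - avg_conf)
    return ece

# A model that says 0.95 on everything but is right only half the time:
print(expected_calibration_error([0.95] * 10, [1, 0] * 5))  # 0.45
```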

The Demo-to-Production Failure Pattern: Why AI Prototypes Collapse When Real Users Arrive

· 10 min read
Tian Pan
Software Engineer

Thirty percent of generative AI projects are abandoned after proof of concept. Ninety-five percent of enterprise pilots deliver zero measurable business impact. Gartner projects 40% of agentic AI projects will be canceled before the end of 2027. These aren't failures of the underlying technology — they're failures of the gap between demo and production.

The demo-to-production failure pattern is predictable, repeatable, and almost entirely preventable. It happens because the conditions that make a demo look great are systematically different from the conditions that make production work. Teams optimize for the former and get ambushed by the latter.

Earned Autonomy: How to Graduate AI Agents from Supervised to Independent Operation

· 10 min read
Tian Pan
Software Engineer

Most teams treat AI autonomy as a binary switch: the agent is either supervised or it isn't. That framing is why 80% of organizations report unintended agent actions, and why Gartner projects that more than 40% of agentic AI projects will be abandoned by end of 2027 due to inadequate risk controls. The problem isn't that AI agents are inherently untrustworthy; it's that teams promote them to independence before the agents have earned it.

Autonomy should be something an agent accumulates through demonstrated reliability, not a property you assign at deployment. The same way a new engineer starts by reviewing PRs before getting production access, an AI agent should operate with progressively expanding scope as it builds a track record. This isn't just philosophical—it changes the specific architectural decisions you make, the metrics you track, and how you design your rollback mechanisms.
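A minimal sketch of what "earned" means mechanically; the tier names and thresholds are illustrative, not a recommendation. The agent's permitted scope is a pure function of its recent reviewed track record, and demotion happens automatically as the rolling window forgets old wins.

```python
# Sketch of track-record-gated autonomy. Tiers and thresholds are illustrative.

from collections import deque

TIERS = [
    # (min reviewed actions, min success rate, permitted scope)
    (0,    0.00,  "suggest-only"),         # every action requires human approval
    (200,  0.98,  "auto-low-risk"),        # read-only and reversible actions
    (1000, 0.995, "auto-with-sampling"),   # autonomous, a fraction of actions audited
]

class TrackRecord:
    def __init__(self, window: int = 2000):
        self.outcomes = deque(maxlen=window)   # rolling window of reviewed actions

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def current_tier(self) -> str:
        n = len(self.outcomes)
        rate = sum(self.outcomes) / n if n else 0.0
        granted = TIERS[0][2]
        for min_n, min_rate, tier in TIERS:
            if n >= min_n and rate >= min_rate:
                granted = tier
        return granted   # demotion is automatic: the window forgets old wins
```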

Event-Driven Agent Scheduling: Why Cron + REST Calls Fail for Recurring AI Workloads

· 11 min read
Tian Pan
Software Engineer

The most common way teams schedule recurring AI agent jobs is also the most dangerous: a cron entry that fires a REST call every N minutes, which kicks off an LLM workflow, which either finishes or silently doesn't. This pattern feels fine in staging. In production, it creates a class of failures that are uniquely hard to detect, recover from, and reason about.

Cron was designed in 1975 for sysadmin scripts. The assumptions it encodes—short runtime, stateless execution, fire-and-forget outcomes—are wrong for LLM workloads in every dimension. Recurring AI agent jobs are long-running, stateful, expensive, and fail in ways that compound across retries. Using cron to schedule them is not just a reliability risk. It's a visibility risk. When things go wrong, you often won't know.
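The fix doesn't have to start with a new scheduler; it starts with the job refusing cron's assumptions. A sketch with an illustrative lock path, deadline, and state file: no overlapping runs, a hard wall-clock limit the work must respect, and a durable record that distinguishes "the cron fired" from "the job finished".

```python
# Sketch: guards an LLM job needs that cron won't provide.
# Lock path, deadline, and state file are illustrative.

import fcntl
import json
import sys
import time

LOCK_PATH = "/tmp/agent-job.lock"
STATE_PATH = "/tmp/agent-job-state.json"
DEADLINE_S = 15 * 60          # hard wall-clock limit for one run

def run_once(do_work):
    lock = open(LOCK_PATH, "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)   # refuse overlapping runs
    except BlockingIOError:
        sys.exit("previous run still in progress; skipping this tick")

    started = time.time()
    result = do_work(deadline=started + DEADLINE_S)        # work must check the deadline

    # Durable completion record: "the cron fired" is not "the job finished".
    with open(STATE_PATH, "w") as f:
        json.dump({"finished_at": time.time(), "result": result}, f)
```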

The Feedback Loop Trap: Why AI Features Degrade When Users Adapt to Them

· 10 min read
Tian Pan
Software Engineer

Your AI search feature launched three months ago. Early evals looked strong—your team ran 1,000 queries and saw 83% relevance. Thumbs-up rates were good. Users were engaging.

Then six weeks in, query reformulation rates started climbing. Session abandonment ticked up. A qualitative review confirmed it: users were asking different questions than they were before launch, and the model wasn't serving them as well as it used to.

Nothing changed in the model. Nothing changed in the underlying data. The product degraded because the users adapted to it.

This is the feedback loop trap. It is qualitatively different from the external concept drift most ML engineers train themselves to handle—and it is far harder to fix once it starts.

The Instruction Complexity Cliff: Why LLMs Follow 5 Rules Reliably but Not 15

· 10 min read
Tian Pan
Software Engineer

There's a pattern that shows up in almost every production AI system: the team starts with a focused system prompt, ships the feature, and then iterates. A new edge case surfaces, so they add a rule. Another ticket comes in, another rule. Six months later the system prompt has grown to 2,000 tokens and covers 20 distinct behavioral requirements. The AI still sounds coherent on most requests. But subtle compliance failures have been creeping in for weeks — formatting ignored here, a tone requirement skipped there, an escalation rule quietly bypassed. Nobody flagged it because no individual failure was dramatic enough to page anyone.

This isn't a model quality problem. It's a fundamental architectural characteristic of how transformer-based language models process instructions, and there's a substantial body of empirical research that makes the failure modes predictable. Understanding it changes how you should write system prompts.