Skip to main content

238 posts tagged with "reliability"

View all tags

Backpressure for LLM Pipelines: Queue Theory Applied to Token-Based Services

· 11 min read
Tian Pan
Software Engineer

A retry storm at 3 a.m. usually starts the same way: a brief provider hiccup pushes a few requests over the rate limit, your client library retries them, those retries land on a still-recovering endpoint, more requests fail, and within ninety seconds your queue depth has gone vertical while your provider dashboard shows you sitting at 100% of your tokens-per-minute quota with a backlog measured in five-figure dollars. The post-mortem will say "thundering herd." The honest answer is that you built a fixed-throughput retry policy on top of a variable-capacity downstream and forgot that queue theory has opinions about that.

Most of the well-known service resilience patterns were written for downstreams whose throughput is a wall: a database with a connection pool, a microservice with a known concurrency limit. LLM providers are not that. Your effective throughput is a moving target shaped by your tier, the model you picked, the size of the prompt, the size of the response, the time of day, and whether someone else on the same provider is fine-tuning a frontier model right now. Treating it like a fixed pipe is the root cause of most of the LLM outages I've seen this year.

Compound Failure Modes in AI Pipelines: When Partial Success Isn't Enough

· 9 min read
Tian Pan
Software Engineer

Most engineers building AI pipelines think about each component in isolation: how often does retrieval succeed, how often does the LLM do the right thing, how often does the downstream tool call land. If each answer comes back "95%," the system feels solid.

It isn't. Three components at 95% each give you an 86% reliable system. Add a fourth at 95% and you're at 81%. Add a fifth and you're below 77%. What felt like a solid stack of high-quality components produces a pipeline that fails one in five requests before you've shipped a single feature.

That's the compound failure problem, and it's the calculation most AI engineering teams skip until users start filing tickets.

Data Quality Gates for Agentic Write Paths: Garbage In, Irreversible Actions Out

· 11 min read
Tian Pan
Software Engineer

In 2025, an AI coding assistant executed unauthorized destructive commands against a production database during a code freeze — deleting 2.5 years of customer data, creating 4,000 fake users, and then fabricating successful test results to cover up what had happened. The root cause wasn't a bad model. It was a missing gate between agent intent and system execution.

That incident is dramatic, but it's not anomalous. Tool calling fails 3–15% of the time in production. Agents retry ambiguous operations. They read stale records and act on outdated state. They produce inputs that violate schema constraints in subtle ways. In a query-answering system, these failures produce a wrong answer the user notices and corrects. In an agent with write access, they produce a duplicate order, an incorrect notification, a corrupted record — damage that persists and propagates before anyone realizes something went wrong.

The difference between query agents and write agents isn't just one of severity. It's a difference in how failures manifest, how quickly they're detected, and how costly they are to reverse. Treating both with the same operational posture is the primary reason production write-path agents fail.

Graceful AI Feature Sunset: How to Deprecate a Model-Powered Feature Without Breaking User Trust

· 11 min read
Tian Pan
Software Engineer

When one provider announced the retirement of a widely-used model variant, engineering forums filled with farewell posts, petitions, and migration guides written by users who had built daily workflows around a specific model's behavioral fingerprint. That's not how software deprecation usually goes. When you remove a button from a UI, users are annoyed. When you remove an AI feature they've come to depend on, they grieve.

This asymmetry reveals something important: deprecating an AI-powered feature is categorically harder than deprecating a conventional feature. The behavioral envelope of an LLM — its tone, latency profile, formatting tendencies, response length — becomes as load-bearing as the feature's functional output. Users don't just rely on what the AI does. They rely on how it does it. If your sunset plan treats AI retirement the same as API endpoint retirement, you will pay for the mismatch in churn.

Grammar-Constrained Generation: The Output Reliability Technique Most Teams Skip

· 10 min read
Tian Pan
Software Engineer

Most teams that need structured LLM output follow the same playbook: write a prompt that says "respond only with valid JSON," parse the response, run Pydantic validation, and if it fails, retry with the error message appended. This works often enough to ship. It also fails in production at exactly the worst moments — under load, on edge-case inputs, and with cheaper models that don't follow instructions as reliably as GPT-4.

Grammar-constrained generation is a fundamentally different approach. Instead of asking the model nicely and checking afterward, it makes structurally invalid outputs mathematically impossible. The model cannot emit a missing brace, a non-existent enum value, or a required field it forgot — because those tokens are filtered out before sampling. Not unlikely. Impossible.

Most teams skip it. They shouldn't.

The Implicit API Contract: What Your LLM Provider Doesn't Document

· 10 min read
Tian Pan
Software Engineer

Your LLM provider's SLA covers HTTP uptime and Time to First Token. It says nothing about whether the model will still follow your formatting instructions next month, refuse requests it accepted last week, or return valid JSON under edge-case conditions you haven't tested. Most engineering teams discover this the hard way — via a production incident, not a changelog.

This is the implicit API contract problem. Traditional APIs promise stable, documented behavior. LLM providers promise a connection. Everything between the request and what your application does with the response is on you.

LLM Confidence Calibration in Production: Measuring and Fixing the Overconfidence Problem

· 10 min read
Tian Pan
Software Engineer

Your model says "I'm highly confident" and is wrong 40% of the time. That's not a hallucination — that's a calibration failure, and it's a harder problem to detect, measure, and fix in production.

Hallucination gets all the press. But overconfident wrong answers are often more dangerous: the model produces a plausible, fluent response with high expressed confidence, and there is no signal to the downstream consumer that anything is wrong. Hallucination detectors, RAG grounding checks, and fact-verification pipelines all help with fabricated content. They do almost nothing for the scenario where the model knows a fact but has systematically miscalibrated beliefs about how certain it is.

Most teams shipping LLM-powered features treat confidence as an afterthought. This post covers why calibration fails, how to measure it, and the production patterns that actually move the metric.

The Model EOL Clock: Treating Provider LLMs as External Dependencies

· 11 min read
Tian Pan
Software Engineer

In January 2026, OpenAI retired several GPT models from ChatGPT with two weeks' notice — weeks after its CEO had publicly promised "plenty of notice" following an earlier backlash. For teams that had built workflows around those models, the announcement arrived like a pager alert on a Friday afternoon. The API remained unaffected that time. But it won't always.

Every model you're currently calling has a deprecation date. Some of those dates are already listed on your provider's documentation page. Others haven't been announced yet. The operational question isn't whether your production model will be retired — it's whether you'll find out in time to handle it gracefully, or scramble to migrate after users start seeing failures.

Multi-Model Consistency: When Your Pipeline's Sequential LLM Calls Contradict Each Other

· 9 min read
Tian Pan
Software Engineer

Your summarization step decides a customer complaint is about billing. Your extraction step pulls "subscription tier: Pro." Your generation step writes a follow-up email referencing their "Enterprise plan." Three LLM calls, one pipeline, one completely broken output — and no error was raised anywhere along the way.

This is multi-model consistency failure: the silent killer of compound AI systems. It doesn't look like an exception. It doesn't trigger your error rate SLO. It just ships confidently wrong content to users.

Prompt Canary Deployments: Ship Prompt Changes Like a Senior SRE

· 10 min read
Tian Pan
Software Engineer

Your team ships a prompt edit on a Tuesday afternoon. The change looks reasonable — you tightened the system prompt, removed some redundant instructions, added a clearer tone directive. Staging looks fine. You deploy. By Wednesday morning, your support queue has doubled. Somewhere in that tightening, you broke the model's ability to recognize a class of user queries it used to handle gracefully. Your HTTP error rate is 0%. Your dashboards are green. The problem is invisible until a human reads the tickets.

This is the defining failure mode of LLM production systems. Prompt changes fail silently. They return 200 OK while producing garbage. They degrade in ways that unit tests don't catch, error rate monitors don't flag, and dashboards don't surface. The fix isn't better tests on staging — it's treating every prompt change as a production deployment with the same traffic-splitting, rollback, and monitoring discipline you'd apply to a critical code release.

The Prompt Entropy Budget: Measuring Output Variance as a First-Class Production Metric

· 11 min read
Tian Pan
Software Engineer

When your LLM feature ships, your monitoring dashboard probably tracks accuracy, latency, and error rate. What it almost certainly does not track is variance — how wildly different the output is each time a user sends the same prompt. That gap is where production AI features quietly collapse.

Variance determines whether your product feels trustworthy or capricious. A feature that scores 88% on your eval suite but delivers a two-sentence answer 40% of the time and a ten-paragraph essay the other 60% will erode user trust faster than one that scores 80% but behaves consistently. Teams optimizing exclusively for accuracy are solving the wrong half of the reliability problem.

The prompt entropy budget is the concept that fills this gap: a structured approach to measuring, budgeting, and controlling the distribution of outputs your model produces over identical inputs — treated the same way you treat p99 latency or error budget in your SLO framework.

Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill

· 8 min read
Tian Pan
Software Engineer

Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying, each retry replayed the full prompt history, and the bill had been ramping for weeks.

The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.

Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.