Skip to main content

238 posts tagged with "reliability"

View all tags

Dead Letters for Agents: What to Do When No Agent Can Complete a Task

· 10 min read
Tian Pan
Software Engineer

A team building a multi-agent research tool discovered, on day eleven of a runaway job, that two of their agents had been cross-referencing each other's outputs in a loop the entire time. The bill: $47,000. No human had seen the results. No alarm had fired. The system simply kept running, confident it was making progress, because nothing in the architecture asked the question: what happens when a task genuinely cannot be completed?

Message queues solved this problem decades ago with the dead-letter queue (DLQ). A message that exceeds its delivery retry limit gets routed to a holding area where operators can inspect it, fix the root cause, and replay it when the system is ready. The pattern is simple, battle-tested, and almost entirely missing from production agent systems today.

Epistemic Trust in Agent Chains: How Uncertainty Compounds Through Multi-Step Delegation

· 10 min read
Tian Pan
Software Engineer

Most teams building multi-agent systems spend a lot of time thinking about authorization trust: what is Agent B allowed to do, which tools can it call, what data can it access. That's an important problem. But there's a second trust problem that doesn't get nearly enough attention, and it's the one that actually kills production systems.

The problem is epistemic: when Agent A delegates a task to Agent B and gets back an answer, how much should A believe what B returned?

This isn't a question of whether B was authorized to answer. It's a question of whether B actually could.

Feature Interaction Failures in AI Systems: When Two Working Pieces Break Together

· 10 min read
Tian Pan
Software Engineer

Your streaming works. Your retry logic works. Your safety filter works. Your personalization works. Deploy them together, and something strange happens: a rate-limit error mid-stream leaves the user staring at a truncated response that the system records as a success. The retry mechanism fires, but the stream is already gone. The personalization layer serves a customized response that the safety filter would have blocked — except the filter saw a sanitized version of the prompt, not the one the personalization layer acted on.

Each feature passed every test you wrote. The system failed the user anyway.

This is the feature interaction failure, and it is the most underdiagnosed class of production bug in AI systems today.

Ghost Context: How Contradictory Beliefs Break Long-Running Agent Memory

· 11 min read
Tian Pan
Software Engineer

Your agent has talked to the same user 400 times. Six months ago she said she preferred Python. Three months ago her team migrated to Go. Last week she mentioned a new TypeScript project. All three facts are sitting in your vector store right now — semantically similar, chronologically unordered, equally weighted. The next time she asks for code help, your agent retrieves all three, hands a contradictory mess to the model, and confidently generates Python with Go idioms for a TypeScript context.

This is ghost context: stale beliefs that never die, retrieved alongside their replacements, silently corrupting agent reasoning.

The problem is underappreciated because it doesn't produce visible errors. The agent doesn't crash. It doesn't refuse to respond. It produces fluent, confident output that's just subtly, expensively wrong.

Multi-Model Consensus: When One LLM Isn't Enough to Sign Off

· 11 min read
Tian Pan
Software Engineer

Your AI feature ships with 85% accuracy. Leadership is thrilled. Then a compliance audit finds that the 15% wrong answers cluster around a specific regulatory interpretation — one that every model in your provider's family gets wrong in the same way. You called one model. It failed. And because you never compared it to anything else, you had no signal that the failure was systematic.

Multi-model consensus architecture is the structural answer to this problem. Instead of trusting a single LLM, you fan out to multiple models from different provider families, aggregate their responses, and route based on agreement. The disagreement pattern itself becomes a first-class signal in your system, not just a debugging artifact.

This approach costs 2–4× more per inference. For most use cases, that's obviously not worth it. But for a specific class of outputs — legal summaries, medical triage routing, financial risk flags, security assessments — the cost of a wrong answer so far exceeds the cost of extra inference that the math inverts almost immediately.

Timeout-Aware Agent Design: How to Deliver Partial Results Instead of Silent Failure

· 10 min read
Tian Pan
Software Engineer

An agent successfully creates a GitHub issue, opens a Jira ticket, and updates a shared spreadsheet. Then it times out before sending the Slack announcement. The framework records the run as delivered. The user never gets notified. The side effects exist in three systems; the result that matters to the human doesn't.

This is the most common timeout failure mode in production agent systems, and it's almost never the one teams prepare for. Most agent implementations treat a timeout like any other exception: catch it, log it, return an error. The user gets nothing, even though the agent completed 90% of the work. The question isn't whether to set timeouts — every production system needs them. The question is what an agent does when the clock runs out.

AI Fallback Design Is an Architecture Problem, Not an Afterthought

· 9 min read
Tian Pan
Software Engineer

When McDonald's pulled the plug on its AI drive-thru after three years of operation, the failure wasn't that the model was bad at understanding orders. The failure was architectural: there was no clear escalation path to a human cashier, no confidence threshold that would trigger a retry, and no defined behavior for the system when it was confused. The AI just kept trying. Customers kept getting frustrated. The happy path was well-designed. Everything else wasn't.

That pattern repeats across almost every failed AI deployment. The model works in demos. It fails in production. And the post-mortem reveals the same root cause: fallback design was never part of the architecture. It was something someone planned to add later.

The Invisible Handoff: Why Production AI Failures Cluster at Component Boundaries

· 9 min read
Tian Pan
Software Engineer

When your AI feature ships a wrong answer, the first question is always: "Was it the model?" Most engineers reach for model evaluation, run a few test prompts, and conclude the model looks fine. They're usually right. The model is fine. The breakage happened somewhere else—at one of the invisible seams where your components talk to each other.

The evidence for this is consistent. Analysis of production RAG deployments shows 73% of failures are retrieval failures, not generation failures. In multi-agent systems, the most common failure modes are message ordering violations, state synchronization gaps, and schema mismatches—none of which show up in any per-component health check. GPT-4 produces invalid responses on complex extraction tasks nearly 12% of the time, not because the model is broken, but because the output format contract between the model and the downstream parser was never enforced.

The model gets blamed. The boundary is the culprit.

The Disable Switch Is the Real Product: Designing the Non-AI Fallback Path

· 10 min read
Tian Pan
Software Engineer

Every AI feature ships with a moment its team hasn't planned for: the moment it has to be turned off. A model regression lands during the morning standup. A cost spike from a marketing campaign nobody told engineering about doubles the bill in twelve hours. A privacy review flags a prompt-context leak. The provider goes down for ninety minutes. A compliance team waves a flag at noon and the feature has to disappear before the close of business.

The disable switch most teams ship for that moment is "the feature returns an error" — a spinner that never resolves, a banner that says "AI assistant unavailable, try again later." That is a strictly worse user experience than the pre-AI status quo, which is exactly what users will compare you to the moment AI degrades. The status quo had a button. Now they get an apology.

Eval-as-Code: When Your Release Gate Is a Notebook on Someone's Laptop

· 13 min read
Tian Pan
Software Engineer

The number that decides whether a model goes to production is being produced by a Jupyter notebook running on a single engineer's MacBook, against a CSV that lives in a Slack DM, scored by a judge model that nobody pinned. Two weeks later, after the engineer has touched the notebook three more times and the API provider has silently shipped a minor model update, nobody on the team can reproduce the number — including the engineer who originally generated it. And yet that number is the gate. It decided that GPT-4o-mini was good enough to replace GPT-4 in the customer support flow. It decided the new prompt template shipped. It decided the fine-tune was promoted. The team is treating it like a load-bearing artifact and storing it like a sticky note.

This is the eval gap. The industry has spent five years writing about evaluation as a methodology problem — which scoring technique, which judge model, which rubric, which dataset — and almost no time writing about evaluation as an engineering problem. But the moment your eval suite starts gating production releases, it inherits every requirement that the rest of your production stack lives by: reproducibility, version control, ownership, observability, dependency management, latency and reliability budgets, and a pipeline that survives the engineer who built it leaving the team. Most teams skip this layer entirely and discover its absence only after a major incident, usually one where the eval score said green and the customer experience said red.

The Fallback Cascade: Why Your AI Feature Needs Five Failure Modes, Not One

· 9 min read
Tian Pan
Software Engineer

Most AI features ship with exactly two states: working and broken. The model call succeeds and the feature responds; the model call fails and the user sees an error. This is the equivalent of building a web service with no load balancing, no cache, and a single database replica — technically functional until the moment it isn't.

The difference is that engineers learned database resilience patterns in the 1990s and have internalized them deeply. AI feature resilience is still being discovered the hard way, one production outage at a time. A payment processor lost $2.3M in a four-hour AI outage. A logistics company missed delivery windows for 30,000 packages when their routing model went down. Both failures shared a root cause: when the primary model was unavailable, there was nothing to fall back to.

When Your Agents Disagree: Conflict Resolution Patterns for Parallel AI Systems

· 9 min read
Tian Pan
Software Engineer

Here is the uncomfortable fact that multi-agent system designs rarely surface in architecture reviews: when you run two agents over the same task, they will not agree on the answer somewhere between 20% and 40% of the time, depending on task type. Most systems respond to this by silently picking one answer. The logs show a final decision; the intermediate disagreement disappears. Everything looks healthy until something downstream breaks, and you spend three to five times longer debugging it than you would a single-agent failure — because you can't tell which agent was wrong, or even that they disagreed at all.

Disagreement between agents is not a fringe case to handle later. As parallel agent topologies become a standard architecture pattern, conflict resolution graduates from a footnote into a first-class reliability discipline.