6 posts tagged with "ai-reliability"

The Agent That Refuses to Fail Loud: How Over-Eager Fallbacks Hide Production Regressions

· 11 min read
Tian Pan
Software Engineer

Your status page is green. Your error rate is zero. Your p95 latency looks slightly better than last week. And quietly, eval-on-traffic dropped four points last Tuesday, and it took nine days for anyone to work out why, because by the time the regression crossed the alerting threshold, four interleaved root causes were layered on top of each other and the team couldn't tell which one started the slide.

This is the dominant failure mode of mature agentic systems in 2026, and it's not a bug in any single component. It's the cumulative effect of a defensive stack the team built deliberately, one well-intentioned safety net at a time. The primary model returns garbage; the retry succeeds. The retry fails; the cheaper fallback model answers. The fallback's output is malformed; the wrapper rewrites it into a plausible shape. The wrapper logs a soft warning. Nobody alerts on the soft warning. The user receives an answer that's correct-looking, smoothly delivered, and quietly worse than the system was designed to produce.

The robustness layer worked. The quality story collapsed. And the alerting was built for the world before the robustness layer existed.
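The chain described above (primary, retry, cheaper fallback, repairing wrapper) can be sketched so that each answer records which layer produced it. This is a minimal illustration under stated assumptions, not the author's implementation: `answer_with_fallbacks`, `degraded_rate`, the tier names, and the callables passed in are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Answer:
    text: str
    tier: str  # "primary", "retry", "fallback", or "repaired" (hypothetical labels)


def answer_with_fallbacks(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    is_valid: Callable[[str], bool],
    repair: Callable[[str], str],
    tier_counts: Optional[Counter] = None,
) -> Answer:
    """Run the defensive stack, but record which layer actually answered."""
    counts = tier_counts if tier_counts is not None else Counter()

    # First attempt on the primary model, then one retry of the same model.
    for tier in ("primary", "retry"):
        text = primary(prompt)
        if is_valid(text):
            counts[tier] += 1
            return Answer(text, tier)

    # Cheaper fallback model.
    text = fallback(prompt)
    if is_valid(text):
        counts["fallback"] += 1
        return Answer(text, "fallback")

    # Last resort: the wrapper rewrites malformed output into a plausible shape.
    counts["repaired"] += 1
    return Answer(repair(text), "repaired")


def degraded_rate(tier_counts: Counter) -> float:
    """Fraction of answers that did not come from the first primary attempt."""
    total = sum(tier_counts.values())
    return 0.0 if total == 0 else 1.0 - tier_counts["primary"] / total
```

With a counter like this, an alert can be set on `degraded_rate` itself rather than on the error rate, which stays at zero for as long as some layer of the stack keeps producing an answer.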

Stale Docs, Confident Answers: The Hidden Failure Mode in AI Help Centers

· 10 min read
Tian Pan
Software Engineer

Here is an uncomfortable finding from Google Research: when a RAG system retrieves insufficient or outdated context, the hallucination rate doesn't stay flat. It jumps from 10.2% to 66.1%. Adding a stale knowledge base doesn't make your AI help center neutral. It makes it more than six times as likely to give a confident wrong answer than if you had shipped nothing at all.

Most teams building AI-powered search and help centers focus on retrieval quality, embedding models, and chunk size. Almost none of them have a process for tracking whether the documents in the corpus are still accurate. That gap — documentation debt — is now showing up as a production reliability problem, not just a content problem.

The Overcorrection Trap: Why Removing Your AI Feature After a Public Failure Makes Recovery Slower

· 9 min read
Tian Pan
Software Engineer

When Google's image generation tool started producing historically inaccurate results in early 2024, the response was swift: pause all people-image generation entirely. That pause lasted months. Users who wanted the feature for legitimate cases had no option. And when it came back, adoption was slow: it was available only to a small tier of subscribers, heavily restricted, and carrying reputational baggage that hadn't fully cleared. The overcorrection became its own problem.

This is the trap most teams fall into after a public AI failure. The intuition is correct — if something is causing harm, stop it — but the implementation is wrong. Removing the feature entirely, or adding wall-to-wall guardrails that render it useless, doesn't rebuild trust. It signals that you don't know how to operate AI responsibly, and that you can't distinguish between the 0.1% of outputs that were wrong and the 99.9% that weren't.

The LLM-as-Compiler Pattern: Separating Plan Generation from Execution

· 10 min read
Tian Pan
Software Engineer

When a PlanCompiler-style agent is benchmarked against direct LLM-to-code generation on 300 stratified multi-step tasks, the structured approach achieves 92.67% success at $0.00128 per task. The direct approach, where the LLM decides actions step-by-step in a free-form loop, achieves 62% success at $0.0106 per task. That is a success rate roughly 50% higher in relative terms (about 31 percentage points) at about one-eighth the cost.

The difference isn't model capability. Both approaches use the same model. The difference is architecture: one separates plan generation from plan execution; the other conflates them.
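To make the separation concrete, here is a minimal sketch in which a plan is plain data that gets validated in full before a deterministic executor runs it. It is a generic illustration, not PlanCompiler's actual design: `Step`, `TOOLS`, `validate_plan`, `execute_plan`, and the `"$k"` reference syntax are all hypothetical, and the plan that a model would normally generate is hard-coded.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Step:
    tool: str             # name of a registered tool
    args: Dict[str, Any]  # literal values, or "$k" meaning "output of step k"


# Hypothetical tool registry; a real system would register API calls here.
TOOLS: Dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}


def _is_ref(value: Any) -> bool:
    return isinstance(value, str) and value.startswith("$") and value[1:].isdigit()


def validate_plan(plan: List[Step]) -> None:
    """Reject the whole plan up front, before any tool runs."""
    for i, step in enumerate(plan):
        if step.tool not in TOOLS:
            raise ValueError(f"step {i}: unknown tool {step.tool!r}")
        for name, value in step.args.items():
            if _is_ref(value) and int(value[1:]) >= i:
                raise ValueError(f"step {i}: {name}={value} must refer to an earlier step")


def execute_plan(plan: List[Step]) -> List[Any]:
    """Deterministic executor: no model calls happen during this phase."""
    validate_plan(plan)
    results: List[Any] = []
    for step in plan:
        resolved = {
            name: results[int(value[1:])] if _is_ref(value) else value
            for name, value in step.args.items()
        }
        results.append(TOOLS[step.tool](**resolved))
    return results


# In a real system the plan would come from one LLM call; here it is hard-coded.
plan = [Step("add", {"a": 2, "b": 3}), Step("mul", {"a": "$0", "b": 4})]
print(execute_plan(plan))  # [5, 20]
```

Because the plan is checked as a whole, a reference to an unknown tool or to a step that has not run yet fails before any side effect occurs, which a step-by-step free-form loop cannot guarantee.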

Your AI Feature Is Only As Reliable As The ETL Pipeline Nobody Owns

· 10 min read
Tian Pan
Software Engineer

The AI feature has the dashboard. The prompt has the version control. The eval suite has the on-call rotation. And then there is the upstream cron job, written in 2022, owned by a team that rotated out of analytics two reorgs ago, that produces the CSV your retrieval index is built from. That cron job has no SLA. That CSV has no schema contract. The team that owns it does not know it feeds an AI feature. When it changes — and it will change — the AI team will spend three weeks debugging a prompt that did nothing wrong.

The AI quality regression you are about to chase is almost never an AI problem. It is an ETL problem wearing an AI costume. The discipline that has to land is the seam between the two — the contract, the lineage, the freshness signal, the paired on-call — and the team that does not formalize it ships an AI feature whose reliability is bounded by the least-loved cron job in the company.
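Two pieces of that seam, the contract and the freshness signal, can be sketched as a gate that runs before the retrieval index is rebuilt. Everything specific here is an assumption for illustration: the column names in `EXPECTED_COLUMNS`, the 24-hour `MAX_AGE`, and the `check_feed` function itself.

```python
import csv
import io
from datetime import datetime, timedelta, timezone
from typing import List

# Hypothetical contract for the upstream CSV; a real one would be agreed with its owners.
EXPECTED_COLUMNS = ["doc_id", "title", "body", "updated_at"]
MAX_AGE = timedelta(hours=24)  # hypothetical freshness budget


def check_feed(csv_text: str, produced_at: datetime, now: datetime) -> List[str]:
    """Return a list of contract violations; an empty list means the feed is usable."""
    problems: List[str] = []
    reader = csv.DictReader(io.StringIO(csv_text))

    # Schema contract: exact column names in the exact order.
    header = list(reader.fieldnames or [])
    if header != EXPECTED_COLUMNS:
        problems.append(f"schema drift: expected {EXPECTED_COLUMNS}, got {header}")

    # Freshness signal: how long ago the upstream job produced this file.
    age = now - produced_at
    if age > MAX_AGE:
        problems.append(f"stale feed: produced {age} ago, limit is {MAX_AGE}")

    # A feed with a header but no rows would silently empty the index.
    if sum(1 for _ in reader) == 0:
        problems.append("empty feed: no data rows")

    return problems


now = datetime(2026, 1, 2, tzinfo=timezone.utc)
renamed = "id,title,body,updated_at\n1,Reset password,Steps,2026-01-01\n"
print(check_feed(renamed, produced_at=now - timedelta(hours=1), now=now))
```

Failing the index rebuild when this list is non-empty turns a silent upstream change into a loud one, and it points at the feed rather than at the prompt.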

The Semantic Failure Mode: When Your AI Runs Perfectly and Does the Wrong Thing

· 9 min read
Tian Pan
Software Engineer

Your AI agent completes the task. No errors in the logs. Latency looks normal. The output is well-formatted JSON, grammatically perfect prose, or a valid SQL query that executes without complaint. Every dashboard is green.

And the user stares at the result, sighs, and starts over from scratch.

This is the semantic failure mode — the class of production AI failures where the system runs correctly, the model responds confidently, and the output is delivered on time, but the agent didn't do what the user actually needed. Traditional error monitoring is completely blind to these failures because there is no error. The HTTP status is 200. The model didn't refuse. The output conforms to the schema. By every technical metric, the system succeeded.