990 posts tagged with "insider"

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

April 18, 2026 · 12 min read

Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

The AI Ops Dashboard Nobody Builds Until It's Too Late

April 18, 2026 · 11 min read

Tian Pan

Software Engineer

The most dangerous indicator on your AI system's health dashboard is a green status light next to a 99.9% uptime number. If your first signal of a failing model is a support ticket, you don't have observability — you have vibes.

Traditional APM tools were built for a world where failure is binary: the request succeeded or it didn't. For LLM-powered features, that model breaks down completely. A request can complete in 300ms, return HTTP 200, consume tokens, and produce an answer that is confidently wrong, unhelpful, or quietly degraded from what it produced six weeks ago. None of those failure states trigger your existing alerts.

Research consistently shows that latency and error rate together cover less than 20% of the failure space for LLM-powered features. The other 80% hides in five failure modes that most teams discover only after users have already noticed.

Chatbot, Copilot, or Agent: The Taxonomy That Changes Your Architecture

April 18, 2026 · 10 min read

Tian Pan

Software Engineer

The most expensive architectural mistake in AI engineering is not picking the wrong model. It's picking the wrong interaction paradigm. Teams that should be building an agent spend six months refining a chatbot, then wonder why users can't get anything done. Teams that should be building a copilot wire up full agentic autonomy and spend the next quarter firefighting unauthorized actions and runaway costs.

The taxonomy matters before you write a single line of code, because chatbots, copilots, and agents have fundamentally different trust models, context-window strategies, and error-recovery requirements. Getting this wrong doesn't just produce a worse product — it produces a product that cannot be fixed by tuning prompts or swapping models.

The Cold Start Problem in AI Personalization: Being Useful Before You Have Data

April 18, 2026 · 11 min read

Tian Pan

Software Engineer

Most personalization systems are built around a flywheel: users interact, you learn their preferences, you show better recommendations, they interact more. The flywheel spins faster as data accumulates. The problem is the flywheel needs velocity to generate lift — and a new user has none.

This is the cold start problem. And it's more dangerous than most teams recognize when they first ship personalization. A new user arrives with no history, no signal, and often a skeptical prior: "AI doesn't know me." You have roughly 5–15 minutes to prove otherwise before they form an opinion that determines whether they'll stay long enough to generate the data that would let you actually help them. Up to 75% of new users abandon products in the first week if that window goes badly.

The cold start problem isn't a data problem. It's an initialization problem. The engineering question is: what do you inject in place of history?

Why '92% Accurate' Is Almost Always a Lie

April 18, 2026 · 8 min read

Tian Pan

Software Engineer

You launch an AI feature. The model gets 92% accuracy on your holdout set. You present this to the VP of Product, the legal team, and the head of customer success. Everyone nods. The feature ships.

Three months later, a customer segment you didn't specifically test is experiencing a 40% error rate. Legal is asking questions. Customer success is fielding escalations. The VP of Product wants to know why no one flagged this.

The 92% figure was technically correct. It was also nearly useless as a decision-making input — because headline accuracy collapses exactly the information that matters most.

The Data Flywheel Is Not Free: Engineering Feedback Loops That Actually Improve Your AI Product

April 18, 2026 · 11 min read

Tian Pan

Software Engineer

There is a pattern that plays out in nearly every AI product team: the team ships an initial model, users start interacting with it, and someone adds a thumbs-up/thumbs-down widget at the bottom of responses. They call it their feedback loop. Three months later, the model has not improved. The team wonders why the flywheel isn't spinning.

The problem isn't execution. It's that explicit ratings are not a feedback loop — they're a survey. Less than 1% of production interactions yield explicit user feedback. The 99% who never clicked anything are sending you far richer signals; you're just not collecting them. Building a real feedback loop means instrumenting your system to capture behavioral traces, label them efficiently at scale, and route them back into training and evaluation in a way that compounds over time.

The Implicit Feedback Trap: Why Engagement Metrics Lie About AI Quality

April 18, 2026 · 8 min read

Tian Pan

Software Engineer

A Canadian airline's support chatbot invented a bereavement fare policy that didn't exist. The chatbot was confident, well-formatted, and polite. Passengers believed it. A court later held the airline liable for the fabricated policy. Meanwhile, the chatbot's satisfaction scores were probably fine.

This is the implicit feedback trap. The signals most teams use to measure AI quality — thumbs-up ratings, click-through rates, satisfaction scores — are not just noisy. They are systematically biased toward measuring the wrong thing. And optimizing for them makes your AI worse.

Knowledge Graph vs. Vector Store: Choosing Your Retrieval Primitive

April 18, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams stumble into vector stores because they're easy to start with, then discover a category of queries that simply won't work no matter how well they tune chunk size or embedding model. That's not a tuning problem — it's an architectural mismatch. Vector similarity and graph traversal are fundamentally different retrieval mechanisms, and the gap matters more as your queries get harder.

This is not a "use both" post. There are real trade-offs, and getting the choice wrong costs months of engineering time. Here's what the decision actually looks like in practice.

The LLM Local Development Loop: Fast Iteration Without Burning Your API Budget

April 18, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building LLM applications discover the same problem around week three: every time someone runs the test suite, it fires live API calls, costs real money, takes 30+ seconds, and returns different results on each run. The "just hit the API" approach that felt fine during the prototype phase becomes a serious tax on iteration speed — and a meaningful line item on the bill. One engineering team audited their monthly API spend and found $1,240 out of $2,847 (43%) was pure waste from development and test traffic hitting live endpoints unnecessarily.

The solution is not to stop testing. It is to build the right kind of development loop from the start — one where the fast path is cheap and deterministic, and the slow path (real API calls) is reserved for when it actually matters.

Model Deprecation Readiness: Auditing Your Behavioral Dependency Before the 90-Day Countdown

April 18, 2026 · 8 min read

Tian Pan

Software Engineer

When Anthropic deprecated a Claude model last year, a company noticed — but only because a downstream parser started throwing errors in production. The culprit? The new model occasionally wrapped its JSON responses in markdown code blocks. The old model never did. Nobody had documented that assumption. Nobody had tested for it. The fix took an afternoon; the diagnosis took three days.

That pattern — silent behavioral dependency breaking loudly in production — is the defining failure mode of model migrations. You update a model ID, run a quick sanity check, and ship. Six weeks later, something subtle is wrong. Your JSON parsing is 0.6% more likely to fail. Your refusal rate on edge cases doubled. Your structured extraction misses a field it used to reliably populate. The diff isn't in the code — it's in the model's behavior, and you never wrote a contract for it.

With major providers now running on 60–180 day deprecation windows, and the pace of model releases accelerating, this is no longer a theoretical concern. It's a recurring operational challenge. Here's how to get ahead of it.

Model Routing in Production: When the Router Costs More Than It Saves

April 18, 2026 · 10 min read

Tian Pan

Software Engineer

A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.

The router itself was cheap — a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not 30%. The 40% it handled locally had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing — it just wasn't routing well.

This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.

Prompt Regression Tests That Actually Block PRs

April 18, 2026 · 10 min read

Tian Pan

Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build — they're running in under three minutes at under a dollar using a few design decisions that make the problem tractable.

About Tian Pan